πŸš€ DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval

Kun Wang1  Yupeng Hu1βœ‰  Hao Liu1  Jiang Shao1  Liqiang Nie2

1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
βœ‰Corresponding author

This repository provides the official implementation, pre-trained model weights, and configuration files for DRONE, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.

πŸ”— Paper: Accepted by ACM TOIS 2026
πŸ”— GitHub Repository: iLearn-Lab/DRONE


πŸ“Œ Model Information

1. Model Name

DRONE (Cross-modal Representation Shift Refinement)

2. Task Type & Applicable Tasks

  • Task Type: Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
  • Applicable Tasks: Localizing temporal segments in untrimmed videos that match natural language queries, utilizing only point-level supervision to reduce annotation costs while actively addressing cross-modal representation shifts.

3. Project Introduction

Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query using only single-frame annotations. DRONE addresses the cross-modal representation shift issue inherent in this setting, progressively improving temporal alignment and semantic consistency between video and text representations.

πŸ’‘ Method Highlight: DRONE introduces Pseudo-Frame Temporal Alignment (PTA) and Curriculum-Guided Semantic Refinement (CSR). Together, these modules systematically mitigate representation shifts, allowing the model to bridge the semantic gap between visual frames and textual queries effectively.

4. Training Data Source

The model supports and is evaluated on three standard VMR datasets:

  • ActivityNet Captions
  • Charades-STA
  • TACoS (Follows splits and feature preparation from ViGA)

πŸš€ Usage & Basic Inference

Step 1: Prepare the Environment

Clone the GitHub repository and set up the virtual environment:

git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
python -m venv .venv
source .venv/bin/activate   # Linux / Mac
# .venv\Scripts\activate    # Windows
pip install numpy scipy pyyaml tqdm
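After installation, you can sanity-check that the core dependencies resolved (a minimal, optional check; not part of the repository — note that the `pyyaml` package is imported as `yaml`):

```python
import importlib.util

# Core packages installed above; pyyaml's importable module is named "yaml"
packages = ["numpy", "scipy", "yaml", "tqdm"]

missing = [p for p in packages if importlib.util.find_spec(p) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core dependencies found.")
```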

Step 2: Download Model Weights & Data

  1. Pre-trained Checkpoints: Download the model checkpoints (including Act_ckpt/, Cha_ckpt/, and TACoS_ckpt/).
  2. Datasets & Features: Follow ViGA's dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
  3. Configuration: Before running, ensure you replace the local dataset root and feature paths in src/config.yaml and src/utils/utils.py with your actual local paths.
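The exact keys depend on the repository, but the path entries to replace in src/config.yaml typically look like the following sketch (the key names and paths here are illustrative assumptions, not taken from the repo):

```
# src/config.yaml -- illustrative sketch; actual key names may differ
dataset_root: /data/vmr/activitynet      # replace with your local dataset root
feature_path: /data/vmr/features/c3d     # replace with your local feature directory
```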

Step 3: Run Training & Evaluation

Training from Scratch: Depending on the dataset you want to train on, run one of the following commands:

For ActivityNet Captions

python -m src.experiment.train --task activitynetcaptions

For Charades-STA

python -m src.experiment.train --task charadessta

For TACoS

python -m src.experiment.train --task tacos

Evaluation: To evaluate a trained experiment folder (which should contain config.yaml and model_best.pt), run:

python -m src.experiment.eval --exp path/to/your/experiment_folder
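Assuming the downloaded checkpoint directories from Step 2 each contain config.yaml and model_best.pt, all three can be evaluated in one pass. The loop below previews the commands; drop the `echo` to actually run them (paths are assumptions — adjust to where you placed the checkpoints):

```shell
# Preview the evaluation command for each pre-trained checkpoint folder
for exp in Act_ckpt Cha_ckpt TACoS_ckpt; do
    echo "python -m src.experiment.eval --exp $exp"
done
```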


⚠️ Limitations & Notes

Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.

  • The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
  • While designed to mitigate cross-modal representation shifts, performance relies on the quality of the point-level annotations and the inherent capacities of the selected visual backbones (C3D, I3D, VGG).

🀝 Acknowledgements & Contact

  • Acknowledgement: This implementation and data organization are inspired by the ViGA open-source community. Thanks to all collaborators and contributors of this project.
  • Contact: If you have any questions, feel free to contact me at khylon.kun.wang@gmail.com.

πŸ“β­οΈ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

@article{wang2026cross,
  title={Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal={ACM Transactions on Information Systems},
  volume={44},
  number={3},
  pages={1--30},
  year={2026},
  publisher={ACM New York, NY}
}
