# DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval

Kun Wang¹, Yupeng Hu¹†, Hao Liu¹, Jiang Shao¹, Liqiang Nie²

¹School of Software, Shandong University, Jinan, China
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China

†Corresponding author
This repository contains the official implementation, pre-trained model weights, and configuration files for DRONE, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.
**Paper:** Accepted by ACM TOIS 2026 · **GitHub Repository:** [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)
## Model Information
1. Model Name
DRONE (Cross-modal Representation Shift Refinement)
2. Task Type & Applicable Tasks
- Task Type: Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
- Applicable Tasks: Localizing the temporal segment of an untrimmed video that matches a natural language query, using only point-level (single-frame) supervision to reduce annotation cost while actively mitigating cross-modal representation shift.
3. Project Introduction
Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query using only single-frame annotations. DRONE addresses the cross-modal representation shift problem inherent in this setting, progressively improving temporal alignment and semantic consistency between video and text representations.
**Method Highlight:** DRONE introduces Pseudo-Frame Temporal Alignment (PTA) and Curriculum-Guided Semantic Refinement (CSR). Together, these modules systematically mitigate representation shifts, allowing the model to bridge the semantic gap between visual frames and textual queries effectively.
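To make the setting concrete: point supervision replaces a full `(start, end)` annotation with a single annotated frame inside the target moment. A common generic way to turn such a point into a soft temporal prior is a Gaussian centered on the annotated frame. The sketch below is purely illustrative of point supervision in general; it is not DRONE's implementation (PTA/CSR are described in the paper), and the function name is our own:

```python
import math

def point_to_prior(point_frame, num_frames, sigma=5.0):
    """Illustrative only: turn a single annotated frame into per-frame
    weights that peak at the annotated frame and decay with temporal
    distance (a Gaussian prior), then normalize them to sum to 1."""
    weights = [math.exp(-((t - point_frame) ** 2) / (2 * sigma ** 2))
               for t in range(num_frames)]
    total = sum(weights)
    return [w / total for w in weights]

# A 100-frame video with the annotator's click at frame 40:
prior = point_to_prior(point_frame=40, num_frames=100, sigma=5.0)
assert max(prior) == prior[40]  # mass peaks at the annotated frame
```

Such a prior lets a model weight frames near the annotation more heavily without ever seeing segment boundaries, which is what makes the supervision so cheap to collect.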
4. Training Data Source
The model supports and is evaluated on three standard VMR datasets:
- ActivityNet Captions
- Charades-STA
- TACoS (follows the splits and feature preparation from ViGA)
## Usage & Basic Inference
Step 1: Prepare the Environment
Clone the GitHub repository and set up the virtual environment:
```shell
git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
python -m venv .venv
source .venv/bin/activate   # Linux / macOS
# .venv\Scripts\activate    # Windows
pip install numpy scipy pyyaml tqdm
```
Step 2: Download Model Weights & Data
- Pre-trained Checkpoints: Download the model checkpoints (includes `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`).
- Datasets & Features: Follow ViGA's dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
- Configuration: Before running, replace the dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your actual local paths.
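The exact schema of `src/config.yaml` is defined by the repository; as a purely illustrative sketch, the path entries you need to localize typically look something like the fragment below. All key names here are hypothetical, so check them against the shipped file:

```yaml
# Hypothetical key names for illustration only; consult the actual
# src/config.yaml in the repository for the real schema.
dataset_root: /data/activitynet_captions       # local dataset root
feature_path: /data/activitynet_captions/c3d   # pre-extracted visual features
checkpoint_dir: ./Act_ckpt                     # downloaded checkpoint folder
```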
Step 3: Run Training & Evaluation
Training from scratch: run the command matching the dataset you want to train on:

```shell
# For ActivityNet Captions
python -m src.experiment.train --task activitynetcaptions

# For Charades-STA
python -m src.experiment.train --task charadessta

# For TACoS
python -m src.experiment.train --task tacos
```
Evaluation: to evaluate a trained experiment folder (which should contain `config.yaml` and `model_best.pt`), run:

```shell
python -m src.experiment.eval --exp path/to/your/experiment_folder
```
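VMR systems are conventionally scored with Recall@1 at temporal IoU thresholds (e.g. R@1, IoU=0.5). The repository's exact evaluation lives in `src.experiment.eval`; the sketch below only illustrates the standard metric itself, with function names of our own choosing:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-1 predicted segment reaches the
    IoU threshold against the ground-truth segment."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)

preds = [(5.0, 12.0), (30.0, 42.0)]   # top-1 predictions per query
gts   = [(6.0, 13.0), (10.0, 20.0)]   # ground-truth moments
print(recall_at_1(preds, gts, threshold=0.5))  # 0.5 (1 of 2 hits)
```

The first pair overlaps with IoU 6/8 = 0.75 (a hit at the 0.5 threshold), while the second pair does not overlap at all, giving R@1 = 0.5.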
## Limitations & Notes
Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.
- The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
- While designed to mitigate cross-modal representation shifts, performance relies on the quality of the point-level annotations and the inherent capacities of the selected visual backbones (C3D, I3D, VGG).
## Acknowledgements & Contact
- Acknowledgement: This implementation and data organization are inspired by the ViGA open-source community. Thanks to all collaborators and contributors of this project.
- Contact: If you have any questions, feel free to contact me at khylon.kun.wang@gmail.com.
## Citation
If you find our work or this repository useful in your research, please consider citing our paper:
```bibtex
@article{wang2026cross,
  title     = {Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author    = {Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal   = {ACM Transactions on Information Systems},
  volume    = {44},
  number    = {3},
  pages     = {1--30},
  year      = {2026},
  publisher = {ACM New York, NY}
}
```