# DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval

Kun Wang¹, Yupeng Hu¹†, Hao Liu¹, Jiang Shao¹, Liqiang Nie²

¹School of Software, Shandong University, Jinan, China
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China

†Corresponding author
This repository contains the official implementation, pre-trained model weights, and configuration files for DRONE, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.
**Paper:** Accepted by ACM TOIS 2026 · **GitHub Repository:** [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)
## Model Information
1. Model Name
DRONE (Cross-modal Representation Shift Refinement)
2. Task Type & Applicable Tasks
- Task Type: Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
- Applicable Tasks: Localizing the temporal segment of an untrimmed video that matches a natural language query, using only point-level (single-frame) supervision to reduce annotation cost while actively mitigating cross-modal representation shift.
3. Project Introduction
Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query using only single-frame annotations. DRONE addresses the cross-modal representation shift problem inherent in this setting, progressively improving temporal alignment and semantic consistency between video and text representations.
**Method Highlight:** DRONE introduces Pseudo-Frame Temporal Alignment (PTA) and Curriculum-Guided Semantic Refinement (CSR). Together, these modules systematically mitigate representation shifts, allowing the model to bridge the semantic gap between visual frames and textual queries effectively.
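To make the setting concrete: point supervision replaces a full `(start, end)` annotation with a single annotated frame inside the target moment. A common generic way to turn such a point into a soft temporal prior is a Gaussian centered on the annotated frame. The sketch below is purely illustrative of point supervision in general; it is not DRONE's implementation (PTA/CSR are described in the paper), and the function name is our own:

```python
import math

def point_to_prior(point_frame, num_frames, sigma=5.0):
    """Illustrative only: turn a single annotated frame into per-frame
    weights that peak at the annotated frame and decay with temporal
    distance (a Gaussian prior), then normalize them to sum to 1."""
    weights = [math.exp(-((t - point_frame) ** 2) / (2 * sigma ** 2))
               for t in range(num_frames)]
    total = sum(weights)
    return [w / total for w in weights]

# A 100-frame video with the annotator's click at frame 40:
prior = point_to_prior(point_frame=40, num_frames=100, sigma=5.0)
assert max(prior) == prior[40]  # mass peaks at the annotated frame
```

Such a prior lets a model weight frames near the annotation more heavily without ever seeing segment boundaries, which is what makes the supervision so cheap to collect.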
4. Training Data Source
The model supports and is evaluated on three standard VMR datasets:
- ActivityNet Captions
- Charades-STA
- TACoS (follows the splits and feature preparation from ViGA)
## Usage & Basic Inference
Step 1: Prepare the Environment
Clone the GitHub repository and set up the virtual environment:
```shell
git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
python -m venv .venv
source .venv/bin/activate   # Linux / macOS
# .venv\Scripts\activate    # Windows
pip install numpy scipy pyyaml tqdm
```
Step 2: Download Model Weights & Data
- Pre-trained Checkpoints: Download the model checkpoints (includes `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`).
- Datasets & Features: Follow ViGA's dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
- Configuration: Before running, replace the dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your actual local paths.
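The exact schema of `src/config.yaml` is defined by the repository; as a purely illustrative sketch, the path entries you need to localize typically look something like the fragment below. All key names here are hypothetical, so check them against the shipped file:

```yaml
# Hypothetical key names for illustration only; consult the actual
# src/config.yaml in the repository for the real schema.
dataset_root: /data/activitynet_captions       # local dataset root
feature_path: /data/activitynet_captions/c3d   # pre-extracted visual features
checkpoint_dir: ./Act_ckpt                     # downloaded checkpoint folder
```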
Step 3: Run Training & Evaluation
Training from scratch: run the command matching the dataset you want to train on:

```shell
# For ActivityNet Captions
python -m src.experiment.train --task activitynetcaptions

# For Charades-STA
python -m src.experiment.train --task charadessta

# For TACoS
python -m src.experiment.train --task tacos
```
Evaluation: to evaluate a trained experiment folder (which should contain `config.yaml` and `model_best.pt`), run:

```shell
python -m src.experiment.eval --exp path/to/your/experiment_folder
```
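VMR systems are conventionally scored with Recall@1 at temporal IoU thresholds (e.g. R@1, IoU=0.5). The repository's exact evaluation lives in `src.experiment.eval`; the sketch below only illustrates the standard metric itself, with function names of our own choosing:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-1 predicted segment reaches the
    IoU threshold against the ground-truth segment."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)

preds = [(5.0, 12.0), (30.0, 42.0)]   # top-1 predictions per query
gts   = [(6.0, 13.0), (10.0, 20.0)]   # ground-truth moments
print(recall_at_1(preds, gts, threshold=0.5))  # 0.5 (1 of 2 hits)
```

The first pair overlaps with IoU 6/8 = 0.75 (a hit at the 0.5 threshold), while the second pair does not overlap at all, giving R@1 = 0.5.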
## Limitations & Notes
Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.
- The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
- While designed to mitigate cross-modal representation shifts, performance relies on the quality of the point-level annotations and the inherent capacities of the selected visual backbones (C3D, I3D, VGG).
## Acknowledgements & Contact
- Acknowledgement: This implementation and data organization are inspired by the ViGA open-source community. Thanks to all collaborators and contributors of this project.
- Contact: If you have any questions, feel free to contact me at khylon.kun.wang@gmail.com.
## Citation
If you find our work or this repository useful in your research, please consider citing our paper:
```bibtex
@article{wang2026cross,
  title     = {Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author    = {Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal   = {ACM Transactions on Information Systems},
  volume    = {44},
  number    = {3},
  pages     = {1--30},
  year      = {2026},
  publisher = {ACM New York, NY}
}
```