SG-SCI: Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization
Kun Wang1 Hao Liu1 Lirong Jie1 Zixu Li1 Yupeng Hu1 Liqiang Nie2
1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
This repository provides the official implementation, pre-trained model weights, and configuration files for SG-SCI, a novel framework designed to address explicit granularity alignment and implicit scale perception in point-supervised Video Moment Localization (VML).
- Paper: Accepted by ACM MM 2024
- GitHub Repository: iLearn-Lab/SG-SCI
Model Information
1. Model Name
SG-SCI (Semantic Granularity and Scale Correspondence Integration)
2. Task Type & Applicable Tasks
- Task Type: Point-supervised Video Moment Localization (VML) / Vision-Language / Multimodal Learning
- Applicable Tasks: Localizing specific moments in untrimmed videos based on textual queries, utilizing only single-frame (point-level) annotations during training to reduce annotation costs.
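For intuition, a point-level annotation pairs a query with a single annotated timestamp rather than a full (start, end) interval. The sketch below is illustrative only: the field names are hypothetical, and the actual split files follow the ViGA format.

```python
# Hypothetical point-level annotation for one video-query pair.
# Field names are illustrative, NOT the actual ViGA split schema.
point_annotation = {
    "video_id": "AO8RW",           # made-up video identifier
    "query": "person opens the door.",
    "point": 3.8,                  # single annotated timestamp (seconds)
}

# Fully-supervised counterpart for comparison: a complete temporal interval.
full_annotation = {**point_annotation, "moment": (2.1, 6.5)}

def supervision(ann):
    """Return the training signal available under each annotation regime."""
    return ann["moment"] if "moment" in ann else ann["point"]

print(supervision(point_annotation))  # point supervision: one timestamp
print(supervision(full_annotation))   # full supervision: (start, end)
```

The contrast makes the cost saving concrete: a point annotator marks one frame inside the moment, while a full annotator must decide both boundaries.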
3. Project Introduction
Existing point-supervised Video Moment Localization (VML) methods often struggle with explicit granularity alignment and implicit scale perception. SG-SCI introduces a Semantic Granularity and Scale Correspondence Integration framework that models the semantic alignment between video and text, enabling the model to effectively enhance and exploit cross-modal feature representations of varying granularities and scales.
Method Highlight: SG-SCI explicitly models semantic relations across feature granularities (via the GCA module) and adaptively mines implicit semantic scales (via the SCL strategy). It fully supports end-to-end training and multi-modal interaction using only single-frame annotations.
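As a rough intuition for what cross-granularity alignment measures, the toy sketch below scores a video-text pair at two granularities: fine (clip-to-word) and coarse (pooled video-to-sentence). This is NOT the official GCA/SCL implementation; the shapes, pooling, and scoring rule are all assumptions made for illustration.

```python
import numpy as np

def granularity_scores(clip_feats, word_feats):
    """Toy cross-granularity alignment scores (illustrative only).

    clip_feats: (num_clips, d) video features
    word_feats: (num_words, d) text features
    Returns (fine, coarse) cosine-similarity scores in [-1, 1].
    """
    v = clip_feats / np.linalg.norm(clip_feats, axis=-1, keepdims=True)
    t = word_feats / np.linalg.norm(word_feats, axis=-1, keepdims=True)
    # Fine granularity: for each clip, take its best-matching word.
    fine = float((v @ t.T).max(axis=1).mean())
    # Coarse granularity: compare mean-pooled global features.
    vg, tg = v.mean(axis=0), t.mean(axis=0)
    coarse = float(vg @ tg / (np.linalg.norm(vg) * np.linalg.norm(tg)))
    return fine, coarse

rng = np.random.default_rng(0)
fine, coarse = granularity_scores(rng.normal(size=(8, 16)),
                                  rng.normal(size=(5, 16)))
```

A framework like SG-SCI goes well beyond this: the point is only that "granularity" refers to which level of the two modalities (clip/word vs. video/sentence) is being aligned.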
4. Training Data Source
The model is evaluated on standard VML datasets using point-level supervision:
- Charades-STA
- TACoS (Splits produced by ViGA)
Usage & Basic Inference
Step 1: Prepare the Environment
Clone the GitHub repository and set up the Conda environment (evaluated on Python 3.7 and PyTorch 1.10.0):
```shell
git clone https://github.com/iLearn-Lab/SG-SCI.git
cd SG-SCI
conda create --name sg-sci python=3.7 -y
conda activate sg-sci
conda install pytorch=1.10.0 cudatoolkit=11.3.1 -c pytorch -y
pip install numpy scipy pyyaml tqdm
```
Step 2: Download Model Weights & Data
- Pre-trained Checkpoints: Download the model checkpoints and place them in your designated
LOGGER_PATH. - Datasets: Ensure the Charades-STA and TACoS datasets are properly structured in the
src/dataset/directory according to the splits provided by ViGA.
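One plausible layout is sketched below. It is illustrative only; the exact directory and file names come from the ViGA splits and the repository's configuration files, so check those before arranging data this way.

```text
src/dataset/
├── charadessta/   # Charades-STA features and ViGA point-level splits
└── tacos/         # TACoS features and ViGA point-level splits
```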
Step 3: Run Training & Evaluation
Training from Scratch: Depending on the dataset you want to train on, run the following commands:
```shell
# For TACoS
python -m experiment.train --task tacos

# For Charades-STA
python -m experiment.train --task charadessta
```
Evaluation:
Put the downloaded checkpoints in your `LOGGER_PATH`, then run:
```shell
python -m src.experiment.eval --exp $LOGGER_PATH
```
Limitations & Notes
Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.
- The model requires access to the original source datasets (Charades-STA, TACoS) for full evaluation.
- As a point-supervised method, while it significantly reduces annotation costs compared to fully-supervised methods, localization boundary precision may still be inherently limited by the single-frame nature of the ground-truth signals.
Acknowledgements & Contact
- Acknowledgement: Thanks to the ViGA open-source community for strong baselines and tooling. Thanks to all collaborators and contributors of this project.
- Contact: If you have any questions, feel free to contact me at khylon.kun.wang@gmail.com.
Citation
If you find our work or this repository useful in your research, please consider citing our paper:
```bibtex
@inproceedings{wang2024explicit,
  title={Explicit granularity and implicit scale correspondence learning for point-supervised video moment localization},
  author={Wang, Kun and Liu, Hao and Jie, Lirong and Li, Zixu and Hu, Yupeng and Nie, Liqiang},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={9214--9223},
  year={2024}
}
```