πŸš€ SG-SCI: Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization

Kun Wang1  Hao Liu1  Lirong Jie1  Zixu Li1  Yupeng Hu1  Liqiang Nie2

1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

This repository provides the official implementation, pre-trained model weights, and configuration files for SG-SCI, a novel framework designed to address explicit granularity alignment and implicit scale perception in point-supervised Video Moment Localization (VML).

πŸ”— Paper: Accepted by ACM MM 2024
πŸ”— GitHub Repository: iLearn-Lab/SG-SCI


πŸ“Œ Model Information

1. Model Name

SG-SCI (Semantic Granularity and Scale Correspondence Integration)

2. Task Type & Applicable Tasks

  • Task Type: Point-supervised Video Moment Localization (VML) / Vision-Language / Multimodal Learning
  • Applicable Tasks: Localizing specific moments in untrimmed videos based on textual queries, utilizing only single-frame (point-level) annotations during training to reduce annotation costs.
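To make the supervision difference concrete, the sketch below contrasts a fully-supervised span annotation with a point-level one. The field names and values are illustrative assumptions, not the repository's actual data schema:

```python
# Hypothetical annotation records; keys are assumptions, not the repo's schema.
full_annotation = {
    "video": "video_0001",
    "query": "a person opens the fridge",
    "span": (12.4, 18.9),   # full supervision: start/end timestamps (seconds)
}
point_annotation = {
    "video": "video_0001",
    "query": "a person opens the fridge",
    "point": 15.2,          # point supervision: a single frame inside the moment
}

# The single annotated frame must lie inside the (unknown) target span.
assert full_annotation["span"][0] <= point_annotation["point"] <= full_annotation["span"][1]
```

Annotating one frame per moment is far cheaper than marking exact boundaries, which is the core motivation for point supervision.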

3. Project Introduction

Existing point-supervised Video Moment Localization (VML) methods often struggle with explicit granularity alignment and implicit scale perception. SG-SCI introduces a Semantic Granularity and Scale Correspondence Integration framework to model the semantic alignment between video and text. This approach helps the model effectively enhance and utilize modal feature representations of varying granularities and scales.

πŸ’‘ Method Highlight: SG-SCI explicitly models semantic relations of different feature granularities (via the GCA module) and adaptively mines implicit semantic scales (via the SCL strategy). It fully supports end-to-end model training and multi-modal interaction using single-frame annotations.
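As a rough intuition for the contrastive side of SCL, the toy function below computes an InfoNCE-style loss over similarity scores. This is a generic illustration of contrastive learning, assuming cosine-similarity inputs; SCL's actual construction of positives and negatives over implicit temporal scales follows the paper, not this sketch:

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.07):
    """InfoNCE-style contrastive loss over a positive similarity and a
    list of negative similarities (generic illustration, not the repo's SCL)."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

# A positive pair that outscores its negatives yields a smaller loss.
loss_good = info_nce(0.9, [0.1, 0.0, -0.2])
loss_bad = info_nce(0.1, [0.9, 0.8, 0.7])
assert loss_good < loss_bad
```

Minimizing such a loss pulls matched video-text pairs together while pushing mismatched pairs apart, which is the general mechanism contrastive strategies like SCL build on.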

4. Training Data Source

The model is evaluated on standard VML datasets using point-level supervision:

  • Charades-STA
  • TACoS (Splits produced by ViGA)

πŸš€ Usage & Basic Inference

Step 1: Prepare the Environment

Clone the GitHub repository and set up the Conda environment (evaluated with Python 3.7 and PyTorch 1.10.0):

git clone https://github.com/iLearn-Lab/SG-SCI.git
cd SG-SCI
conda create --name sg-sci python=3.7 -y
conda activate sg-sci
conda install pytorch=1.10.0 cudatoolkit=11.3.1 -c pytorch -y
pip install numpy scipy pyyaml tqdm
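After installation, a quick sanity check confirms the interpreter and PyTorch are set up (the versions named are those the repo reports being evaluated with; newer ones may also work):

```python
# Minimal environment sanity check; run inside the sg-sci environment.
import sys

assert sys.version_info[:2] >= (3, 7), "Python 3.7+ expected"

try:
    import torch
    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed; rerun the conda install step.")
```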

Step 2: Download Model Weights & Data

  1. Pre-trained Checkpoints: Download the model checkpoints and place them in your designated LOGGER_PATH.
  2. Datasets: Ensure the Charades-STA and TACoS datasets are properly structured in the src/dataset/ directory according to the splits provided by ViGA.
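A possible layout under src/dataset/ is sketched below. The exact folder and file names depend on the ViGA splits you download, so treat this as a hypothetical illustration rather than the repository's required structure:

```
src/dataset/
β”œβ”€β”€ charadessta/
β”‚   β”œβ”€β”€ features/      # pre-extracted video features
β”‚   └── annotations/   # point-level splits from ViGA
└── tacos/
    β”œβ”€β”€ features/
    └── annotations/
```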

Step 3: Run Training & Evaluation

Training from Scratch: Depending on the dataset you want to train on, run the corresponding command:

For TACoS

python -m experiment.train --task tacos

For Charades-STA

python -m experiment.train --task charadessta

Evaluation: Put the downloaded checkpoints in your LOGGER_PATH, then run:

python -m src.experiment.eval --exp $LOGGER_PATH


⚠️ Limitations & Notes

Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.

  • The model requires access to the original source datasets (Charades-STA, TACoS) for full evaluation.
  • As a point-supervised method, SG-SCI significantly reduces annotation costs compared to fully-supervised approaches; however, localization boundary precision may still be inherently limited by the single-frame nature of the ground-truth signal.
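To make the boundary-precision caveat concrete, VML predictions are typically scored with temporal IoU (the standard R@1, IoU=m metric counts a prediction correct when its IoU with the ground-truth span reaches a threshold m, e.g. 0.3, 0.5, or 0.7). A minimal reference implementation:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction anchored on a single frame can drift at the boundaries,
# yielding partial overlap with the true span (values here are made up).
iou = temporal_iou((13.0, 17.0), (12.4, 18.9))
assert 0.0 < iou < 1.0
```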

🀝 Acknowledgements & Contact

  • Acknowledgement: Thanks to the ViGA open-source community for strong baselines and tooling. Thanks to all collaborators and contributors of this project.
  • Contact: If you have any questions, feel free to contact me at khylon.kun.wang@gmail.com.

πŸ“β­οΈ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

@inproceedings{wang2024explicit,
  title={Explicit granularity and implicit scale correspondence learning for point-supervised video moment localization},
  author={Wang, Kun and Liu, Hao and Jie, Lirong and Li, Zixu and Hu, Yupeng and Nie, Liqiang},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={9214--9223},
  year={2024}
}
