MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval
Kun Wang1 Yupeng Hu1 Hao Liu1 Lirong Jie1 Liqiang Nie2
1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
This repository provides the official implementation, pre-trained model weights, and configuration files for MEET, a framework explicitly designed to address semantic and relationship redundancy in Image-Text Retrieval (ITR).
Paper: Accepted by TCSVT 2025
GitHub Repository: iLearn-Lab/TCSVT25-MEET
Model Information
1. Model Name
MEET (iMage-text retrieval rEdundancy miTigation)
2. Task Type & Applicable Tasks
- Task Type: Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
- Applicable Tasks: Accurate and efficient cross-modal retrieval. It specifically addresses redundancy by mitigating semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.
3. Project Introduction
Existing Image-Text Retrieval methods often suffer from a fundamental yet overlooked challenge: redundancy. MEET introduces an iMage-text retrieval rEdundancy miTigation framework to explicitly analyze and address the ITR problem from a redundancy perspective. This approach helps the model effectively produce compact yet highly discriminative representations for accurate and efficient retrieval.
Method Highlight: MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end model training, diverse feature encoders, and unified optimization.
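The semantic-redundancy side of this idea, compressing continuous embeddings into compact codes so that retrieval stays cheap, can be sketched as follows. This is an illustrative sign-hashing toy in NumPy, not the paper's learned objective; every name and value below is invented for the example.

```python
import numpy as np

def binarize_features(feats):
    """Hash continuous embeddings into compact +1/-1 codes.

    Illustrative sign-based hashing (an assumption for this sketch),
    standing in for MEET's learned hashing/quantization.
    """
    return np.sign(feats - feats.mean(axis=0, keepdims=True))

def code_similarity(img_codes, txt_codes):
    # The inner product of +/-1 codes is a linear function of Hamming
    # distance, so retrieval over compact codes is a single matmul.
    return img_codes @ txt_codes.T

rng = np.random.default_rng(0)
img = rng.standard_normal((4, 64))              # 4 toy image embeddings
txt = img + 0.1 * rng.standard_normal((4, 64))  # paired texts, light noise

sims = code_similarity(binarize_features(img), binarize_features(txt))
print(sims.argmax(axis=1))  # each image should rank its paired text first
```

Binary codes shrink each dimension from a 32-bit float to a single bit, which is what makes retrieval over compact representations efficient at scale.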
4. Training Data Source
The model is evaluated using features from Bi-GRU and BERT on standard ITR datasets:
- MSCOCO (1K and 5K splits)
- Flickr30K (Splits produced by HREM)
Usage & Basic Inference
Step 1: Prepare the Environment
Clone the GitHub repository and ensure you have the required dependencies (evaluated on Python >= 3.8 and PyTorch >= 1.7.0):
git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd MEET
pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard
Step 2: Download Model Weights & Data
- Pre-trained Checkpoints: Download the model checkpoints and place them in your designated LOGGER_PATH.
- Language Models & Features:
- Datasets: Structure the MSCOCO and Flickr30K datasets as outlined in the data tree structure (e.g., coco_precomp, f30k_precomp, vocab, VSE).
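For reference, one plausible layout matching the directory names above (the authoritative tree is in the repository; treat this sketch as an assumption):

```
data/
├── coco_precomp/   # precomputed MSCOCO features
├── f30k_precomp/   # precomputed Flickr30K features
├── vocab/          # vocabulary files for the text encoder
└── VSE/            # VSE weights / auxiliary files
```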
Step 3: Run Training & Evaluation
Evaluation (Eval):
Depending on the text features you are using, open the corresponding script (at/lib/test.py for Bi-GRU, at_bert/lib/test.py for BERT) and modify RUN_PATH.
For MSCOCO 1K 5-fold splits, first generate the folds:
python scripts/make_coco_1k_folds.py
Run testing (ensure MODEL_PATH is set to the correct VSE weights)
PYTHONPATH=. python -m lib.test
Training from Scratch:
Make sure to specify the dataset name (coco_precomp or f30k_precomp) after the --data_name flag, e.g.:
PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name coco_precomp
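To make the relationship-redundancy idea from the method highlight concrete, here is a hedged NumPy sketch of a triplet-style loss that filters suspiciously similar negatives (likely false negatives) and reweights the remaining hard ones. The function name, threshold, and weighting scheme are illustrative assumptions, not MEET's actual objective.

```python
import numpy as np

def filtered_weighted_triplet(sim, margin=0.2, filter_thresh=0.9):
    """Illustrative contrastive loss over an image-text similarity matrix.

    Assumed setup: matched pairs sit on the diagonal of `sim`. Negatives
    whose similarity is suspiciously close to the positive's are dropped
    as probable false negatives; remaining margin violations are
    reweighted so harder negatives contribute more.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                    # matched-pair similarities
    mask = ~np.eye(n, dtype=bool)         # off-diagonal entries = negatives
    losses = []
    for i in range(n):
        neg = sim[i][mask[i]]
        keep = neg < filter_thresh * pos[i]          # filter misleading negatives
        viol = np.clip(margin + neg[keep] - pos[i], 0, None)
        if viol.size and viol.sum() > 0:
            w = viol / viol.sum()                    # adaptive reweighting
            losses.append(float((w * viol).sum()))
        else:
            losses.append(0.0)
    return float(np.mean(losses))
```

With a strongly diagonal similarity matrix the loss is zero; introducing a hard (but not filtered) negative makes it positive, which is the behavior a redundancy-aware alignment objective needs.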
Limitations & Notes
Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.
- The model requires access to the original source datasets (MSCOCO, Flickr30K) for full evaluation.
- While designed for redundancy mitigation, performance may still degrade under extreme domain shifts not covered by the training distribution.
Acknowledgements & Contact
- Acknowledgement: Thanks to the HREM open-source community for strong baselines and tooling. Thanks to all collaborators and contributors of this project.
- Contact: If you have any questions, feel free to contact me at khylon.kun.wang@gmail.com.
Citation
If you find our work or this repository useful in your research, please consider citing our paper:
@article{wang2025redundancy,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025},
  publisher={IEEE}
}