🚀 MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval

Kun Wang1  Yupeng Hu1  Hao Liu1  Lirong Jie1  Liqiang Nie2

1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

This repository provides the official implementation, pre-trained model weights, and configuration files for MEET, a novel framework explicitly designed to address semantic and relationship redundancy in Image-Text Retrieval (ITR).

🔗 Paper: Accepted by TCSVT 2025  🔗 GitHub Repository: iLearn-Lab/TCSVT25-MEET


📌 Model Information

1. Model Name

MEET (iMage-text retrieval rEdundancy miTigation)

2. Task Type & Applicable Tasks

  • Task Type: Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
  • Applicable Tasks: Accurate and efficient cross-modal retrieval. It specifically addresses redundancy by mitigating semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.

3. Project Introduction

Existing Image-Text Retrieval methods often suffer from a fundamental yet overlooked challenge: redundancy. MEET introduces an iMage-text retrieval rEdundancy miTigation framework to explicitly analyze and address the ITR problem from a redundancy perspective. This approach helps the model effectively produce compact yet highly discriminative representations for accurate and efficient retrieval.

💡 Method Highlight: MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end model training, diverse feature encoders, and unified optimization.
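To make the quantization idea concrete, here is a toy product-quantization encoder in plain Python: an embedding is split into M subvectors, and each subvector is replaced by the index of its nearest codeword among K per-subspace entries, yielding a compact M-symbol code instead of D floats. This is an illustrative sketch of the general technique only, not MEET's implementation; all names and values are invented.

```python
def pq_encode(vec, codebooks):
    """Encode a vector as M codeword indices.

    codebooks: M lists of K codewords; each codeword has len(vec) // M floats.
    Returns a list of M indices (the compact code).
    """
    M = len(codebooks)
    d = len(vec) // M
    code = []
    for m in range(M):
        sub = vec[m * d:(m + 1) * d]
        # pick the nearest codeword by squared Euclidean distance
        best = min(range(len(codebooks[m])),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, codebooks[m][k])))
        code.append(best)
    return code

# toy example: D=4 embedding, M=2 subspaces, K=2 codewords per subspace
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
print(pq_encode([0.9, 1.1, 0.1, 0.9], codebooks))  # [1, 0]
```

With M = 8 and K = 8 (as in the training command below), each embedding compresses to 8 indices of 3 bits each, which is the kind of compact code a redundancy-mitigation scheme can exploit.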

4. Training Data Source

The model is evaluated using features from Bi-GRU and BERT on standard ITR datasets:

  • MSCOCO (1K and 5K splits)
  • Flickr30K (splits produced by HREM)

🚀 Usage & Basic Inference

Step 1: Prepare the Environment

Clone the GitHub repository and ensure you have the required dependencies (evaluated on Python >= 3.8 and PyTorch >= 1.7.0):

git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd TCSVT25-MEET

pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard

Step 2: Download Model Weights & Data

  1. Pre-trained Checkpoints: Download the model checkpoints and place them in your designated LOGGER_PATH.
  2. Language Models & Features:
    • Obtain pretrained files for BERT-base.
    • Obtain pretrained VSE model checkpoints (e.g., ESA).
  3. Datasets: Structure the MSCOCO and Flickr30K datasets as outlined in the data tree structure (e.g., coco_precomp, f30k_precomp, vocab, VSE).
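For orientation, a plausible layout matching the directory names above might look like the following; consult the repository for the authoritative tree:

```
data/
├── coco_precomp/
├── f30k_precomp/
├── vocab/
└── VSE/
```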

Step 3: Run Training & Evaluation

Evaluation (Eval): Depending on the text features you are using, open the corresponding script (at/lib/test.py for Bi-GRU, at_bert/lib/test.py for BERT) and set RUN_PATH to your checkpoint.

For MSCOCO 1K 5-fold splits, first generate the folds:

python scripts/make_coco_1k_folds.py

Then run testing (ensure MODEL_PATH points to the correct VSE weights):

PYTHONPATH=. python -m lib.test
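ITR results are conventionally reported as Recall@K, the fraction of queries whose ground-truth item appears among the top-K ranked gallery items. A minimal illustrative computation of that metric (not the repository's evaluation code; names are assumed):

```python
def recall_at_k(sims, gt, k):
    """sims: per-query similarity scores over the gallery;
    gt: ground-truth gallery index for each query."""
    hits = 0
    for scores, g in zip(sims, gt):
        # rank gallery items by descending similarity
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
        hits += g in ranked[:k]
    return hits / len(sims)

# toy example: 2 queries over a 3-item gallery
sims = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4]]
print(recall_at_k(sims, gt=[0, 1], k=1))  # 1.0
```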

Training from Scratch: Make sure to specify the dataset name (coco_precomp or f30k_precomp) after the --data_name flag:

PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name coco_precomp


⚠️ Limitations & Notes

Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.

  • The model requires access to the original source datasets (MSCOCO, Flickr30K) for full evaluation.
  • While designed for redundancy mitigation, performance may still degrade under severe domain shifts not covered by the training distribution.

🙏 Acknowledgements & Contact

  • Acknowledgement: Thanks to the HREM open-source community for strong baselines and tooling. Thanks to all collaborators and contributors of this project.
  • Contact: If you have any questions, feel free to contact me at khylon.kun.wang@gmail.com.

📝 Citation

If you find our work or this repository useful in your research, please consider citing our paper:

@article{wang2025redundancy,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025},
  publisher={IEEE}
}
