CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation
Mingzhu Xu¹ Tianxiang Xiao¹ Yutong Liu¹ Haoyu Tang¹ Yupeng Hu¹✉ Liqiang Nie¹
¹Affiliation (Please update if needed)
This repository provides the official implementation and pre-trained models for CMIRNet, a Cross-Modal Interactive Reasoning Network for Referring Image Segmentation (RIS).
📌 Paper: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
📌 Task: Referring Image Segmentation (RIS)
📌 Framework: PyTorch
📌 Model Information
1. Model Name
CMIRNet (Cross-Modal Interactive Reasoning Network)
2. Task Type & Applicable Tasks
- Task Type: Vision-Language / Multimodal Learning
- Core Task: Referring Image Segmentation (RIS)
- Applicable Scenarios:
- Language-guided object segmentation
- Cross-modal reasoning
- Vision-language alignment
- Scene understanding with textual queries
3. Project Introduction
Referring Image Segmentation (RIS) aims to segment target objects in an image based on natural language descriptions. The key challenge lies in fine-grained cross-modal alignment and complex reasoning between visual and linguistic modalities.
CMIRNet addresses this with a cross-modal interactive reasoning framework that:
- Introduces interactive reasoning mechanisms between visual and textual features
- Enhances semantic alignment via multi-stage cross-modal fusion
- Incorporates graph-based reasoning to capture complex relationships
- Improves robustness under ambiguous or complex referring expressions
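The interactive-fusion idea above can be illustrated with a minimal single-head cross-attention step, in which each visual location attends to every word of the expression. This is an illustrative NumPy sketch of the general mechanism, not the paper's exact module; all shapes and names are assumptions.

```python
import numpy as np

def cross_modal_attention(visual, textual):
    """One interactive-fusion step: visual features attend to word features.

    visual:  (N_v, d) flattened visual feature map
    textual: (N_t, d) word-level language features
    """
    d = visual.shape[1]
    scores = visual @ textual.T / np.sqrt(d)      # (N_v, N_t) affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over words
    return visual + weights @ textual             # residual fusion

rng = np.random.default_rng(0)
vis = rng.standard_normal((196, 64))  # e.g. a 14x14 feature map, flattened
txt = rng.standard_normal((12, 64))   # e.g. 12 word embeddings
fused = cross_modal_attention(vis, txt)
print(fused.shape)  # (196, 64)
```

In the full model this kind of exchange runs in both directions and across multiple stages; the sketch only shows the language-to-vision direction.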
4. Training Data Source
The model is trained and evaluated on:
- RefCOCO
- RefCOCO+
- RefCOCOg
- RefCLEF
Image data is based on:
- MS COCO 2014 Train Set (83K images)
📌 Usage & Basic Inference
Step 1: Prepare Pre-trained Weights
Download backbone weights:
- ResNet-50
- ResNet-101
- Swin-Transformer-Base
- Swin-Transformer-Large
Step 2: Dataset Preparation
- Download COCO 2014 training images
- Extract to `./data/images/`
- Download referring datasets: https://github.com/lichengunc/refer
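The steps above can be sketched as a small setup script. The directory names below are hypothetical, inferred from the `--dataset` flags and the `./data/images/` path; adjust them if the repo's dataloader expects a different layout.

```python
from pathlib import Path

# Hypothetical layout; adjust to match the repo's dataloader.
for d in ["images", "refcoco", "refcoco+", "refcocog"]:
    Path("data", d).mkdir(parents=True, exist_ok=True)

# train2014.zip (~13 GB) from http://images.cocodataset.org/zips/train2014.zip
# unzips into data/images/; the referring annotations come from
# https://github.com/lichengunc/refer
print(sorted(p.name for p in Path("data").iterdir()))
```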
Step 3: Training
ResNet-based Training

```shell
# RefCOCO (default)
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0
# RefCOCO+
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+
# RefCOCOg (UMD split)
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```
Swin-Transformer-based Training

```shell
# RefCOCO (default)
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0
# RefCOCO+
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+
# RefCOCOg (UMD split)
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```
Step 4: Testing / Inference
ResNet-based Testing

```shell
# RefCOCO (default)
python test_resnet.py --device cuda:0 --resume path/to/weights
# RefCOCO+
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+
# RefCOCOg (UMD split)
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```
Swin-Transformer-based Testing

```shell
# RefCOCO (default)
python test_swin.py --device cuda:0 --resume path/to/weights --window12
# RefCOCO+
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12
# RefCOCOg (UMD split)
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```
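Evaluation for RIS is typically reported with IoU-based metrics (e.g. mIoU/oIoU) between the predicted and ground-truth masks. As a hedged sketch of the per-sample quantity behind those metrics (not the repo's evaluation code), the IoU of two binary masks can be computed as:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # two empty masks count as a match

# Toy 4x4 masks: top half vs. left half overlap in a 2x2 corner.
pred = np.zeros((4, 4), dtype=np.uint8); pred[:2, :] = 1
gt   = np.zeros((4, 4), dtype=np.uint8); gt[:, :2] = 1
print(mask_iou(pred, gt))  # 4 / 12 = 0.333...
```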
⚠️ Limitations & Notes
- For academic research use only
- Performance depends on dataset quality and referring expression clarity
- May degrade under:
- ambiguous language
- complex scenes
- domain shift
- Requires substantial GPU resources for training
📝 Citation

```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```
🙏 Acknowledgement
This work builds upon advances in:
- Vision-language modeling
- Transformer architectures
- Graph neural networks
💬 Contact
For questions or collaboration, please contact the corresponding author.