SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
Abstract
SpatialBoost enhances vision encoders' 3D spatial awareness by integrating linguistic 3D spatial knowledge through a multi-turn Chain-of-Thought reasoning process with Large Language Models.
Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject this spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 from 55.9 to 59.7 mIoU on ADE20K, a 3.8-point gain over the pre-trained DINOv3 and a new state of the art.
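The abstract's core pipeline — converting dense 3D cues from an image into linguistic expressions and feeding them through a progressively denser multi-turn CoT dialogue — can be sketched as below. This is a minimal illustration, not the paper's implementation: all names (`build_cot_turns`, the example scene, the three-level hierarchy) are hypothetical, and the per-object depths stand in for whatever 3D estimates the framework actually extracts.

```python
# Hypothetical sketch: turning per-object depth estimates into a
# multi-turn Chain-of-Thought prompt sequence that moves from coarse
# to dense spatial knowledge, as the abstract describes.

def describe_pair(a, b):
    """Coarse linguistic 3D relation between two (name, depth_m) tuples."""
    name_a, da = a
    name_b, db = b
    rel = "in front of" if da < db else "behind"
    return f"The {name_a} is {rel} the {name_b}."

def build_cot_turns(objects):
    """One CoT turn per level of the (assumed) spatial hierarchy.

    Turn 1: object inventory (2D-level knowledge).
    Turn 2: coarse pairwise ordering along the depth axis.
    Turn 3: metric depth per object, the densest description.
    """
    inventory = "Objects in the scene: " + ", ".join(n for n, _ in objects) + "."
    ordering = " ".join(
        describe_pair(objects[i], objects[j])
        for i in range(len(objects))
        for j in range(i + 1, len(objects))
    )
    metric = " ".join(
        f"The {n} is about {d:.1f} m from the camera." for n, d in objects
    )
    return [inventory, ordering, metric]

# Example scene with made-up monocular depth estimates (meters).
scene = [("chair", 1.5), ("table", 2.0), ("window", 4.5)]
for t, text in enumerate(build_cot_turns(scene), 1):
    print(f"Turn {t}: {text}")
```

In a full system, each turn's text would be consumed by the LLM alongside the vision encoder's features, so later turns refine the spatial picture established by earlier ones.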
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model (2026)
- StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation (2026)
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning (2026)
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026)
- Perception-Aware Multimodal Spatial Reasoning from Monocular Images (2026)
- iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding (2026)
- 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing (2026)