MWM: Mobile World Models for Action-Conditioned Consistent Prediction
Abstract
A mobile world model for planning-based image-goal navigation that improves action-conditioned rollout consistency through structure pretraining and inference-consistent distillation.
World models enable planning in imagined, predicted future states, offering a promising framework for embodied navigation. However, existing navigation world models often lack action-conditioned consistency, so visually plausible predictions can still drift under multi-step rollout and degrade planning. Moreover, efficient deployment requires few-step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training-inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning-based image-goal navigation. Specifically, we introduce a two-stage training framework that combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency. We further introduce Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real-world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: https://github.com/AIGeeksGroup/MWM. Website: https://aigeeksgroup.github.io/MWM.
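To make the rollout-consistency idea concrete, below is a minimal PyTorch sketch of one plausible form of an action-conditioned consistency objective: instead of supervising only single-step predictions, the model is unrolled autoregressively over an action sequence and penalized for drift across the whole horizon. All names here (`WorldModel`, `rollout`, `acc_loss`) are hypothetical illustrations; the paper's actual ACC post-training and ICSD objectives may differ.

```python
# Hypothetical sketch of an action-conditioned rollout-consistency objective.
# This is NOT the paper's implementation; names and shapes are illustrative.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Toy latent world model: predicts the next state from (state, action)."""
    def __init__(self, state_dim=64, action_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def rollout(model, state, actions):
    """Autoregressively unroll the model through an action sequence."""
    states = []
    for t in range(actions.shape[1]):
        state = model(state, actions[:, t])  # feed predictions back in
        states.append(state)
    return torch.stack(states, dim=1)  # (B, T, state_dim)

def acc_loss(model, s0, actions, gt_states):
    """Penalize drift over the full multi-step rollout, not just one step,
    so predictions stay consistent with the conditioning actions."""
    pred = rollout(model, s0, actions)
    return ((pred - gt_states) ** 2).mean()

# Usage with random tensors standing in for encoded observations/actions.
B, T, S, A = 8, 6, 64, 4
model = WorldModel(S, A)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = acc_loss(model, torch.randn(B, S), torch.randn(B, T, A),
                torch.randn(B, T, S))
loss.backward()
opt.step()
```

The key design choice this sketch highlights is that gradients flow through the entire autoregressive rollout, which directly targets the multi-step drift described in the abstract, whereas a purely single-step loss would not.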
Community
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL (2026)
- Causal World Modeling for Robot Control (2026)
- Generative Scenario Rollouts for End-to-End Autonomous Driving (2026)
- LIVE: Long-horizon Interactive Video World Modeling (2026)
- Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation (2026)
- An Efficient and Multi-Modal Navigation System with One-Step World Model (2026)
- TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments (2026)