MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
Abstract
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
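The patch-and-compose interface described above can be sketched as a lift/reproject/z-buffer pipeline. This is a minimal illustration, not the paper's implementation: the function names, the per-pixel z-buffer composition, and the use of raw pixel coordinates (rather than latent patch tokens) are assumptions for clarity.

```python
import numpy as np

def lift_patches_to_3d(depth, K, patch_uv):
    """Unproject 2D patch centers into 3D camera coordinates via a depth map.

    depth: (H, W) depth map; K: (3, 3) intrinsics; patch_uv: (N, 2) pixel coords.
    """
    u, v = patch_uv[:, 0], patch_uv[:, 1]
    z = depth[v.astype(int), u.astype(int)]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)  # (N, 3) points in camera space

def reproject(points, K, R, t):
    """Project 3D memory points into a queried view with extrinsics [R | t]."""
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]  # pixel coords, view depths

def compose_memory(H, W, uv, depths, feats):
    """Z-buffer composition: the nearest patch wins each cell; empty cells
    stay unmasked so the diffusion model can inpaint what should evolve."""
    canvas = np.zeros((H, W, feats.shape[1]))
    zbuf = np.full((H, W), np.inf)
    mask = np.zeros((H, W), dtype=bool)
    for (u, v), z, f in zip(uv, depths, feats):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < W and 0 <= vi < H and 0 < z < zbuf[vi, ui]:
            zbuf[vi, ui] = z
            canvas[vi, ui] = f
            mask[vi, ui] = True
    return canvas, mask
```

The composed canvas plus mask would then condition generation: masked cells preserve persistent structure, while unmasked cells are free for the model to synthesize.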
Community
TL;DR: MosaicMem is a hybrid spatial memory for video world models that bridges explicit 3D memory and implicit latent frames. It retrieves spatially aligned 3D patches to preserve persistent scene structure, improving camera consistency while supporting dynamic scene modeling, long-horizon navigation, and memory-based editing.
Excellent work! Could you please open an issue in our unified framework (https://github.com/OpenDCAI/OpenWorldLib) so our team can record your impressive work?
Lowkey the most interesting bit here is the patch-to-3D memory plus the two alignment tricks, Warped RoPE and Warped Latent, which try to fuse explicit geometry with implicit generation in a clean way. Lifting 2D patches into 3D and reprojecting them for view-aligned conditioning is clever, but it rests on the quality of the depth estimator, so I'd want to see how it holds up under occlusion and non-rigid motion. The claim that this runs on Wan 2.2 without fine-tuning is nice, but I'd like a small sensitivity study of how different diffusion priors or camera priors affect memory coherence. Btw the arxivlens breakdown helped me parse the method details, especially the memory alignment sections; it's a solid walkthrough that covers this well (https://arxivlens.com/PaperView/Details/mosaicmem-hybrid-spatial-memory-for-controllable-video-world-models-4632-cc6a7fd6). My one worry is long-horizon durability under fast dynamics — a targeted ablation removing one alignment method at a time would show which piece actually drives the improvement.
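For readers unfamiliar with the rotary-embedding side of this discussion: the page does not spell out how Warped RoPE works, but a plausible reading is that rotary embeddings are evaluated at the warped (reprojected, possibly fractional) coordinates of memory tokens instead of their integer grid indices, so retrieved patches carry the geometry of the queried view. The sketch below shows only the generic ingredient — RoPE at arbitrary positions — and is an assumption, not the paper's method.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding evaluated at arbitrary (fractional) positions.

    x: (N, D) features with even D; positions: (N,) 1D coordinates.
    Pairs (x[:, :D/2], x[:, D/2:]) are rotated by position-dependent angles.
    """
    d = x.shape[1] // 2
    freqs = base ** (-np.arange(d) / d)           # (d,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]  # (N, d) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :d], x[:, d:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)
```

Under this reading, "warping" would mean feeding `positions` computed from the reprojected patch locations, so attention between memory tokens and the current frame sees geometrically consistent relative offsets.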