FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching
Abstract
FlowInOne is a vision-centric multimodal generation framework that unifies diverse input modalities into a single visual representation, enabling coherent image generation and editing through a unified flow matching model.
Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.
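The abstract's "single flow matching model" refers to the standard flow matching objective, in which a network regresses the velocity of a path from noise to data. As a rough illustration only (the variable names, shapes, and linear-interpolant form below are our assumptions, not details from the paper), the training target can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_target(x0, x1, t):
    """Linear-interpolant (rectified-flow-style) flow matching target.

    x_t = (1 - t) * x0 + t * x1 lies on the straight path from noise x0
    to data x1, and the regression target for the velocity field is the
    constant x1 - x0 along that path.
    """
    t = np.asarray(t).reshape(-1, 1)     # broadcast time over feature dim
    x_t = (1.0 - t) * x0 + t * x1        # point on the interpolant path
    v_target = x1 - x0                   # ground-truth velocity
    return x_t, v_target

batch, dim = 4, 64
x0 = rng.standard_normal((batch, dim))   # noise sample
x1 = rng.standard_normal((batch, dim))   # data (target image) sample
t = rng.uniform(size=batch)              # per-sample time in [0, 1]

x_t, v = flow_matching_target(x0, x1, t)

# A velocity model v_theta(x_t, t, visual_prompt) would be trained with
# MSE against v; in FlowInOne the conditioning (text, layout, editing
# instructions) is rendered into the same visual space as the image.
assert np.allclose(flow_matching_target(x0, x1, np.zeros(batch))[0], x0)
assert np.allclose(flow_matching_target(x0, x1, np.ones(batch))[0], x1)
```

At sampling time, such a model is integrated from t=0 to t=1 (e.g. with Euler steps), which is what lets the paper drop diffusion-style noise scheduling entirely.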
Community
Code, data, and model weights are available. Homepage: https://csu-jpg.github.io/FlowInOne.github.io/
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance (2026)
- GEditBench v2: A Human-Aligned Benchmark for General Image Editing (2026)
- GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing (2026)
- OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning (2026)
- UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing (2026)
- CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation (2026)
- Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing (2026)