arxiv:2603.18524

3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Published on Mar 19 · Submitted by Lani Ko on Mar 20
#3 Paper of the day

Abstract

A novel 3D-aware video customization framework is presented that decouples spatial geometry from temporal motion using a 1-frame optimization approach and incorporates a visual conditioning module for enhanced texture generation.

AI-generated summary

Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
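As a rough illustration of the 1-frame idea described above, the sketch below wraps each multi-view still as a clip with a single frame, so the clip contains no consecutive-frame pairs and therefore no motion signal for the optimization to overfit. This is a toy under stated assumptions, not the paper's code: the `Frame` and `Clip` types and the helpers `to_one_frame_clips` and `adjacent_frame_pairs` are hypothetical names introduced here.

```python
from typing import List, Sequence

# Hypothetical toy types: a frame is an H x W grayscale grid, a clip is T frames.
Frame = List[List[float]]
Clip = List[Frame]


def to_one_frame_clips(views: Sequence[Frame]) -> List[Clip]:
    """Wrap each multi-view still as a video clip with exactly one frame (T = 1)."""
    return [[view] for view in views]


def adjacent_frame_pairs(clip: Clip) -> int:
    """Count consecutive-frame pairs, a crude proxy for how much motion a clip carries."""
    return max(len(clip) - 1, 0)


# Three toy 2 x 2 "views" of the same subject (placeholder pixel values).
views: List[Frame] = [[[0.0, 0.1], [0.2, 0.3]] for _ in range(3)]
clips = to_one_frame_clips(views)

for clip in clips:
    # Each clip has one frame, so there is nothing temporal to fit.
    print(len(clip), adjacent_frame_pairs(clip))
```

In this framing, whatever the model learns from such clips can only come from appearance and geometry across views, which is one way to read the claim that updates are restricted to spatial representations.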

Community

Paper submitter

Key idea:

  • 3DreamBooth treats subjects as 3D entities rather than 2D ones, which enables multi-view-consistent video generation by baking in a spatial prior through 1-frame optimization.

Highlights:

  • 3DreamBooth decouples 3D geometry from temporal motion via 1-frame spatial optimization
  • 3Dapter accelerates convergence with multi-view joint attention using shared weights
  • Outperforms single-view baselines (VACE, Phantom) on identity preservation and 3D geometric fidelity
  • Model-agnostic: works on both HunyuanVideo and WanVideo 2.1
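To give a feel for how attention over a small reference set can behave like a selective router, here is a toy scaled dot-product attention in plain Python. It is only a sketch under assumptions: the 2D "view direction" keys, the value vectors, and the function names `softmax` and `route` are made up for illustration and do not reflect the actual 3Dapter architecture.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    query: Sequence[float],
    ref_keys: Sequence[Sequence[float]],
    ref_values: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention of one query over the reference views."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in ref_keys]
    weights = softmax(scores)
    dim = len(ref_values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, ref_values)) for i in range(dim)]
    return weights, mixed


# Hypothetical reference set: front, side, and back views encoded as directions.
ref_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Placeholder per-view feature vectors (stand-ins for geometric hints).
ref_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A target view that is close to the side view.
weights, mixed = route([0.2, 0.9], ref_keys, ref_values)
print(weights)
print(mixed)
```

The weights always sum to one, so a target angle far from every reference still gets a blend of the available views rather than new information, which is why viewpoint coverage of the reference set matters in a scheme like this.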

the 3dapter as a dynamic selective router that queries view-specific geometric hints from a small reference set is the standout idea here. the asymmetrical conditioning, with a single-view pretraining stage followed by multi-view joint optimization, feels like a clean split between learning 3d identity and motion priors. my concern is how sensitive this routing is to the reference set's coverage of viewpoints; would the method degrade gracefully if a novel angle is underrepresented? the arXivLens breakdown helped me parse the method details, and i found a solid walkthrough here: https://arxivlens.com/PaperView/Details/3dreambooth-high-fidelity-3d-subject-driven-video-generation-model-5322-5abc5bbb. btw a small ablation showing what happens if 3dapter is frozen during joint fine-tuning would help isolate its contribution to geometry fidelity vs the diffusion backbone.

