arxiv:2604.04934

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Published on Apr 6
Submitted by Hyunsoo Cha on Apr 8

Abstract

Vanast is a unified framework that generates garment-transferred human animation videos by combining image-based virtual try-on and pose-driven animation in a single process, addressing issues such as identity drift and garment distortion through synthetic triplet supervision and a Dual Module architecture.

AI-generated summary

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.

Community

Paper author · Paper submitter

Given a human image and one or more garment images, our method generates virtual try-on with human image animation conditioned on a pose video while preserving identity.

The Dual Module fusion in Vanast (splitting garment transfer from pose-guided animation inside a video diffusion transformer) feels like a crisp way to keep pretrained generative quality while aligning garments to motion. I'd love to see an ablation that removes the garment-transfer stream to quantify how much identity and garment fidelity actually come from the motion path versus the garment conditioning. The synthetic triplet supervision is bold, but I wonder how the approach handles tricky garments with non-rigid drape, or accessories that weren't well represented in the triplets. The ArxivLens breakdown helped me parse the method details, especially the multi-level conditioning, and it's a nice reference if you're planning a reproduction pass: https://arxivlens.com/PaperView/Details/vanast-virtual-try-on-with-human-image-animation-via-synthetic-triplet-supervision-942-3b31657a


Vanast is a unified single-step framework for garment-transferred human animation that replaces the conventional two-stage pipeline of virtual try-on followed by animation. By constructing large-scale triplet supervision data and introducing a Dual Module architecture for video diffusion transformers, Vanast preserves identity and garment accuracy while supporting zero-shot garment interpolation.

Key Idea

Existing approaches treat virtual try-on and human animation as separate stages, leading to error accumulation and inconsistent results. Vanast unifies both tasks into a single forward pass through a video diffusion transformer, directly generating an animated video of a person wearing a target garment in a target pose sequence — without any intermediate try-on image.
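The contrast between the two designs can be sketched as function composition. This is a hypothetical sketch; the function and model names are illustrative, not from Vanast's released code.

```python
# Two-stage vs. unified inference, expressed as plain Python composition.

def two_stage(person_img, garment_img, pose_video, tryon_model, anim_model):
    """Conventional pipeline: errors in the intermediate try-on image
    propagate into the animation stage."""
    tryon_img = tryon_model(person_img, garment_img)  # stage 1: image try-on
    return anim_model(tryon_img, pose_video)          # stage 2: pose-driven animation

def unified(person_img, garment_imgs, pose_video, model):
    """Vanast-style single forward pass: all conditions enter one model,
    with no intermediate try-on image to accumulate errors."""
    return model(person_img, garment_imgs, pose_video)
```

The key structural difference is that `unified` never materializes a try-on image, so there is no intermediate artifact whose defects the second stage must inherit.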

[Figure: two-stage vs. unified pipeline]

Method / Approach

The method relies on two key components. First, a large-scale synthetic triplet dataset is constructed, where each sample contains a reference person image, a target garment, and a target pose sequence. This provides the dense supervision needed for end-to-end training. Second, a Dual Module architecture is integrated into the video diffusion transformer — one branch encodes identity and appearance, the other encodes garment detail — enabling the model to disentangle and faithfully reconstruct both in the output video.
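One way to picture a supervision sample is as a record pairing the reference person, the garment references, and the driving pose with a ground-truth output. The field names below are assumptions about the data layout, not the paper's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TryOnTriplet:
    """Illustrative schema for one synthetic supervision triplet."""
    person_image: str          # identity-preserving human image in an alternative outfit
    garment_images: List[str]  # upper and/or lower garment references
    pose_video: str            # driving pose sequence
    target_video: str          # ground-truth garment-transferred animation
```

Capturing both upper and lower garments in one sample is what lets training move beyond the single-garment-posed-video pairs that limit earlier datasets.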

[Figure: synthetic triplet data construction]

[Figure: Dual Module architecture]
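A minimal sketch of the dual-branch idea, assuming the two conditioning streams are fused additively into the main video token stream; the paper's actual transformer block may fuse them differently.

```python
def dual_module_block(video_tokens, identity_feats, garment_feats,
                      attend_identity, attend_garment):
    """One branch injects identity/appearance, the other garment detail;
    both outputs are added back into the video token stream."""
    id_out = attend_identity(video_tokens, identity_feats)  # identity branch
    gm_out = attend_garment(video_tokens, garment_feats)    # garment branch
    return [v + i + g for v, i, g in zip(video_tokens, id_out, gm_out)]
```

Keeping the two streams separate is what allows the model to disentangle who the person is from what they are wearing, as the summary above describes.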

Results

Vanast achieves state-of-the-art results on garment-transferred human animation benchmarks, outperforming two-stage baselines in both garment fidelity and motion quality. The unified design also enables zero-shot interpolation between poses and garments not seen during training.
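Zero-shot garment interpolation can be pictured as linear blending of garment-condition embeddings. The exact interpolation mechanism used by Vanast is not detailed here, so treat this as the generic idea only.

```python
def blend_garment_embeddings(emb_a, emb_b, alpha):
    """Linearly blend two garment embeddings: alpha=0 reproduces garment A,
    alpha=1 reproduces garment B, and intermediate values mix the two."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]
```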

