WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
Abstract
Waypoint Diffusion Transformers address trajectory conflicts in pixel-space flow matching by using semantic waypoints from pre-trained vision models to disentangle generation paths and accelerate training convergence.
While recent flow matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the pixel manifold's lack of semantic continuity entangles optimal transport paths. This induces severe trajectory conflicts near path intersections, yielding sub-optimal solutions. Rather than bypassing the issue with information-lossy latent representations, we untangle the pixel-space trajectories directly by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models, disentangling the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state; they then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution toward the next state and ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT outperforms strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
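One way to read the prior-to-waypoint / waypoint-to-pixel factorization is as a piecewise-linear transport path. The sketch below illustrates that reading in NumPy; the function name `segmented_flow_target`, the fixed split time `s`, and the use of a waypoint tensor with pixel shape are all illustrative assumptions — in WiT the waypoints are semantic features that condition the transformer, not literal intermediate images.

```python
import numpy as np

def segmented_flow_target(noise, waypoint, pixels, t, s=0.5):
    """Piecewise-linear path: noise -> waypoint on [0, s), waypoint -> pixels on [s, 1].

    Returns the interpolated state x_t and the velocity target v_t that a
    flow-matching model would regress at time t. Illustrative only.
    """
    if t < s:
        u = t / s                              # progress within segment 1
        x_t = (1.0 - u) * noise + u * waypoint
        v_t = (waypoint - noise) / s           # constant velocity on segment 1
    else:
        u = (t - s) / (1.0 - s)                # progress within segment 2
        x_t = (1.0 - u) * waypoint + u * pixels
        v_t = (pixels - waypoint) / (1.0 - s)  # constant velocity on segment 2
    return x_t, v_t

# Toy check: Euler-integrating the velocity field from noise should land on the data.
rng = np.random.default_rng(0)
noise, waypoint, pixels = rng.normal(size=(3, 4, 4))
x = noise.copy()
n_steps = 1000
for i in range(n_steps):
    _, v = segmented_flow_target(noise, waypoint, pixels, i / n_steps)
    x += v / n_steps
print(np.allclose(x, pixels, atol=1e-2))  # -> True
```

Because the velocity is piecewise constant, the integrated path passes exactly through the waypoint at time `s` before reaching the pixels, which is the intuition behind splitting one hard transport into two easier segments.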
Community
Recent flow matching models avoid VAE reconstruction bottlenecks by operating directly in pixel space, but the pixel manifold lacks semantic continuity. Optimal transport paths for different semantic endpoints therefore overlap and intersect, causing severe trajectory conflicts and slow convergence.
WiT introduces discriminative semantic waypoints projected from pretrained vision models, then factors the transport into two easier mappings: prior-to-waypoint and waypoint-to-pixel. A lightweight waypoint generator predicts these semantic anchors from the current noisy state, and the primary diffusion transformer consumes them via Just-Pixel AdaLN for dense spatial modulation.
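Dense spatial modulation in the AdaLN style can be sketched as follows. The helper name `just_pixel_adaln` and the plain linear projections are assumptions for illustration, not the paper's exact implementation: a per-position scale and shift are predicted from the waypoint features and applied to layer-normalized pixel tokens.

```python
import numpy as np

def just_pixel_adaln(x, waypoint, W_scale, W_shift, eps=1e-6):
    """AdaLN-style modulation sketch: normalize each token over channels,
    then apply a scale and shift predicted densely from waypoint features.

    x:        (N, C) pixel tokens
    waypoint: (N, D) per-position semantic waypoint features
    W_scale, W_shift: (D, C) hypothetical projection matrices
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)  # LayerNorm without learned affine
    scale = waypoint @ W_scale              # per-token, per-channel scale
    shift = waypoint @ W_shift              # per-token, per-channel shift
    return x_norm * (1.0 + scale) + shift

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))          # 16 tokens, 8 channels
wp = rng.normal(size=(16, 4))         # waypoint features, 4 dims
Ws, Wb = rng.normal(size=(2, 4, 8)) * 0.1
out = just_pixel_adaln(x, wp, Ws, Wb)
print(out.shape)  # -> (16, 8)
```

The key design point this illustrates is that the modulation varies per spatial position (it depends on each token's waypoint feature), unlike classic class-conditional AdaLN where one global scale/shift is broadcast over all tokens.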
On ImageNet 256x256, WiT outperforms strong pixel-space baselines, matches JiT-L/16 at 600 epochs after only 265 epochs, and pushes pure pixel-space generation closer to or beyond heavyweight latent-space diffusion models.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders (2026)
- Geometric Autoencoder for Diffusion Models (2026)
- DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation (2026)
- SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training (2026)
- Representation Alignment for Just Image Transformers is not Easier than You Think (2026)
- PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss (2026)
- Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers (2026)