WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
Abstract
Waypoint Diffusion Transformers address trajectory conflicts in pixel-space flow matching by using semantic waypoints from pre-trained vision models to disentangle generation paths and accelerate training convergence.
While recent flow matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the pixel manifold's lack of semantic continuity entangles optimal transport paths. This induces severe trajectory conflicts near path intersections, yielding sub-optimal solutions. Rather than bypassing the issue with information-lossy latent representations, we untangle the pixel-space trajectories directly by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models, disentangling the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state; they then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution toward the next state and ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT outperforms strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
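One way to read the prior-to-waypoint / waypoint-to-pixel factorization is as a piecewise-linear transport path. The sketch below illustrates that reading in NumPy; the function name `segmented_flow_target`, the fixed split time `s`, and the use of a waypoint tensor with pixel shape are all illustrative assumptions — in WiT the waypoints are semantic features that condition the transformer, not literal intermediate images.

```python
import numpy as np

def segmented_flow_target(noise, waypoint, pixels, t, s=0.5):
    """Piecewise-linear path: noise -> waypoint on [0, s), waypoint -> pixels on [s, 1].

    Returns the interpolated state x_t and the velocity target v_t that a
    flow-matching model would regress at time t. Illustrative only.
    """
    if t < s:
        u = t / s                              # progress within segment 1
        x_t = (1.0 - u) * noise + u * waypoint
        v_t = (waypoint - noise) / s           # constant velocity on segment 1
    else:
        u = (t - s) / (1.0 - s)                # progress within segment 2
        x_t = (1.0 - u) * waypoint + u * pixels
        v_t = (pixels - waypoint) / (1.0 - s)  # constant velocity on segment 2
    return x_t, v_t

# Toy check: Euler-integrating the velocity field from noise should land on the data.
rng = np.random.default_rng(0)
noise, waypoint, pixels = rng.normal(size=(3, 4, 4))
x = noise.copy()
n_steps = 1000
for i in range(n_steps):
    _, v = segmented_flow_target(noise, waypoint, pixels, i / n_steps)
    x += v / n_steps
print(np.allclose(x, pixels, atol=1e-2))  # -> True
```

Because the velocity is piecewise constant, the integrated path passes exactly through the waypoint at time `s` before reaching the pixels, which is the intuition behind splitting one hard transport into two easier segments.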
Community
Recent flow matching models avoid VAE reconstruction bottlenecks by operating directly in pixel space, but the pixel manifold lacks semantic continuity. Optimal transport paths for different semantic endpoints therefore overlap and intersect, causing severe trajectory conflicts and slow convergence.
WiT introduces discriminative semantic waypoints projected from pretrained vision models, then factors the transport into two easier mappings: prior-to-waypoint and waypoint-to-pixel. A lightweight waypoint generator predicts these semantic anchors from the current noisy state, and the primary diffusion transformer consumes them via Just-Pixel AdaLN for dense spatial modulation.
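Dense spatial modulation in the AdaLN style can be sketched as follows. The helper name `just_pixel_adaln` and the plain linear projections are assumptions for illustration, not the paper's exact implementation: a per-position scale and shift are predicted from the waypoint features and applied to layer-normalized pixel tokens.

```python
import numpy as np

def just_pixel_adaln(x, waypoint, W_scale, W_shift, eps=1e-6):
    """AdaLN-style modulation sketch: normalize each token over channels,
    then apply a scale and shift predicted densely from waypoint features.

    x:        (N, C) pixel tokens
    waypoint: (N, D) per-position semantic waypoint features
    W_scale, W_shift: (D, C) hypothetical projection matrices
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)  # LayerNorm without learned affine
    scale = waypoint @ W_scale              # per-token, per-channel scale
    shift = waypoint @ W_shift              # per-token, per-channel shift
    return x_norm * (1.0 + scale) + shift

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))          # 16 tokens, 8 channels
wp = rng.normal(size=(16, 4))         # waypoint features, 4 dims
Ws, Wb = rng.normal(size=(2, 4, 8)) * 0.1
out = just_pixel_adaln(x, wp, Ws, Wb)
print(out.shape)  # -> (16, 8)
```

The key design point this illustrates is that the modulation varies per spatial position (it depends on each token's waypoint feature), unlike classic class-conditional AdaLN where one global scale/shift is broadcast over all tokens.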
On ImageNet 256x256, WiT outperforms strong pixel-space baselines, matches JiT-L/16 at 600 epochs after only 265 epochs, and pushes pure pixel-space generation closer to or beyond heavyweight latent-space diffusion models.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders (2026)
- Geometric Autoencoder for Diffusion Models (2026)
- DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation (2026)
- SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training (2026)
- Representation Alignment for Just Image Transformers is not Easier than You Think (2026)
- PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss (2026)
- Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers (2026)