LTX-2.3
Identity
| Property | Value |
| --- | --- |
| ID | ltx-2.3 |
| Parameters | 22B (dev/distilled checkpoints) |
| HuggingFace | affectively-ai/ltx-2.3 (base: Lightricks/LTX-2.3) |
| Quantization | SafeTensors checkpoints (BF16/FP16) |
| License | LTX-2 Community License Agreement (HF metadata: other) |
Axis 1: Architecture
| Property | Value |
| --- | --- |
| Family | Diffusion Transformer (DiT), joint audio-video generation |
| Checkpoint set | ltx-2.3-22b-dev, ltx-2.3-22b-distilled, LoRA + temporal/spatial upscalers |
| Primary task design | Text/image-conditioned video synthesis, optional audio generation |
| Attention/latent internals | Not fully specified in published card metadata |
| Training/release orientation | Open-weight foundation + distilled inference checkpoint |
Architecture assessment: LTX-2.3 is a specialist diffusion family for generative video. It is not a chat transformer and should be routed through a diffusion-native runtime path for full quality.
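The routing rule in the assessment above can be sketched as a small dispatcher. This is a minimal illustration, not a real Aether API: the lane names, task labels, and request shape are all assumptions.

```python
# Route video/diffusion requests to a diffusion-native lane and keep them off
# the chat-transformer lane. Lane and task names are illustrative assumptions.

DIFFUSION_TASKS = {"text-to-video", "image-to-video", "audio-video"}
CHAT_TASKS = {"chat", "vlm-reasoning", "document-ocr"}

def route_request(task: str) -> str:
    """Return the runtime lane for a task label."""
    if task in DIFFUSION_TASKS:
        return "diffusion-native"   # full-quality denoising path for LTX-2.3
    if task in CHAT_TASKS:
        return "chat-transformer"   # wrong lane for LTX-2.3; handled elsewhere
    raise ValueError(f"unknown task: {task}")

print(route_request("image-to-video"))  # diffusion-native
```

The point is that the lane decision is made once, at the router, rather than inside each model integration.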
Axis 2: Runtime
| Runtime | Viable | Notes |
| --- | --- | --- |
| WASM (browser) | No | 22B checkpoint family is beyond practical browser constraints |
| ONNX/WebGPU | No | No maintained ONNX/WebGPU path in this deployment |
| Native (device) | Conditional | Possible on high-end local GPU setups |
| Edge Worker | No | Worker memory/runtime ceilings are too small |
| Cloud Run (distributed CPU lane) | Yes | Current Aether route for API readiness and compatibility |
| Cloud GPU | Conditional | Best fit for full-quality denoising pipelines |
Primary runtime: the Cloud Run distributed coordinator/layer topology handles routing and compatibility today; dedicated diffusion-native execution remains the quality path.
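The coordinator/layer topology can be sketched as a chain: one coordinator passes activations through the layer services in order. This is a pure-Python stand-in under assumed semantics; the service names and `forward` shape are illustrative, not the actual Cloud Run wire protocol.

```python
# Minimal sketch of a coordinator passing a request through a chain of layer
# services. Each "service" here is just a function that records its hop.

from typing import Callable, List

def make_layer(name: str) -> Callable[[list], list]:
    # Stand-in for one Cloud Run layer service holding a model shard.
    def forward(activations: list) -> list:
        return activations + [name]          # record the hop
    return forward

def coordinator(layers: List[Callable[[list], list]], request: list) -> list:
    # The coordinator owns ordering; each layer only sees its neighbor's output.
    out = request
    for layer in layers:
        out = layer(out)
    return out

layers = [make_layer(f"layer-{i}") for i in range(4)]   # 4 layer services
trace = coordinator(layers, ["request"])
print(trace)  # ['request', 'layer-0', 'layer-1', 'layer-2', 'layer-3']
```

The design choice this illustrates: ordering and retries live in the coordinator, so layer services stay stateless and can scale from zero independently.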
Axis 3: Modality
| Property | Value |
| --- | --- |
| Input | Text and/or image conditioning (plus optional audio workflows upstream) |
| Output | Video (with optional synchronized audio in upstream LTX workflows) |
| Category | Image-to-video / text-to-video |
Axis 4: Task Fitness
| Task | Fitness | Notes |
| --- | --- | --- |
| Prompted short-form video generation | Very good | Core capability of the model family |
| Image-conditioned video generation | Very good | First-class upstream task |
| Audio-synchronized AV generation | Good | Supported in upstream LTX stack; runtime integration maturity varies |
| Document OCR / VLM reasoning | Poor | Wrong model class for extraction/reasoning tasks |
Role in the zoo: Primary modern video-generation specialist. Route video requests here instead of overloading vision-language chat stacks.
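The zoo-routing rule above can be expressed as a fitness lookup: pick the model whose Axis 4 rating is highest for the requested task. The LTX-2.3 entries mirror the table; the contrasting VLM entry (`some-vlm`) is a hypothetical placeholder, not a real model in the zoo.

```python
# Pick a model by task fitness so video requests land on ltx-2.3 and
# extraction/reasoning tasks go to a VLM. "some-vlm" is a hypothetical entry.

FITNESS = {
    "ltx-2.3": {
        "text-to-video": "very good",
        "image-to-video": "very good",
        "av-generation": "good",
        "document-ocr": "poor",
    },
    "some-vlm": {"document-ocr": "good", "text-to-video": "poor"},
}

RANK = {"poor": 0, "good": 1, "very good": 2}

def pick_model(task: str) -> str:
    """Return the model with the highest fitness rating for a task."""
    return max(FITNESS, key=lambda m: RANK.get(FITNESS[m].get(task, "poor"), 0))

print(pick_model("image-to-video"))  # ltx-2.3
print(pick_model("document-ocr"))    # some-vlm
```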
Axis 5: Operational Cost
| Property | Value |
| --- | --- |
| Checkpoint footprint | ~44 GB for the core 22B checkpoints (plus upscalers) |
| Cloud Run topology | 1 coordinator + 4 layer services (current distributed lane) |
| Cloud Run resources | 2 vCPU / 4 GiB per service (current baseline config) |
| Timeout profile | 600s request budget for long diffusion-style operations |
| Idle cost | ~$0/month when min-instances remain 0 |
| Cold start profile | Noticeable on scale-from-zero; acceptable for non-realtime video jobs |
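The timeout and cold-start rows above imply a simple dispatch policy: a job must fit inside the 600s request budget, including a scale-from-zero penalty, or it should be handed to an async job queue instead. The 600s budget comes from the table; the 60s cold-start estimate and the async fallback are assumptions for illustration.

```python
# Decide sync vs. async dispatch against the 600 s Cloud Run request budget.
# COLD_START_S is an assumed worst-case scale-from-zero penalty.

REQUEST_BUDGET_S = 600        # Cloud Run timeout profile (Axis 5)
COLD_START_S = 60             # assumption: worst-case cold-start penalty

def dispatch_mode(estimated_job_s: float, cold: bool = True) -> str:
    """Return 'sync' if the job fits in one request, else 'async-queue'."""
    total = estimated_job_s + (COLD_START_S if cold else 0)
    return "sync" if total <= REQUEST_BUDGET_S else "async-queue"

print(dispatch_mode(300))   # sync
print(dispatch_mode(580))   # async-queue (580 + 60 > 600)
```

Since video jobs here are explicitly non-realtime, defaulting long requests to the async path keeps min-instances at 0 and idle cost near zero.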
Verdict
LTX-2.3 is the right specialist for video synthesis workloads in this model zoo. Keep it on an explicit video/diffusion routing lane, and avoid treating it like a chat model. For best quality, prioritize a dedicated diffusion-native runtime over compatibility inference shims.