Amber-Image
Efficient Compression of Large-Scale Diffusion Transformers
Amber-Image is a family of efficient text-to-image (T2I) generation models built through a dedicated compression pipeline that integrates structured pruning, architectural evolution, and knowledge distillation. Rather than training from scratch, Amber-Image transforms the 60-layer, 20B-parameter dual-stream MMDiT backbone of Qwen-Image into lightweight variants, Amber-Image-10B and Amber-Image-6B, reducing parameters by up to 70% while maintaining competitive generation quality.
The compression pipeline operates in two stages:
Amber-Image-10B: Derived via timestep-sensitive depth pruning, removing 30 of the 60 MMDiT layers identified as least critical. Retained layers are reinitialized through local weight averaging and recovered via layer-wise distillation from the original Qwen-Image, followed by full-parameter fine-tuning.
Amber-Image-6B: Introduces a hybrid-stream architecture where the first 10 layers retain dual-stream processing for modality-specific feature extraction, while the deeper 20 layers are converted to a single stream initialized from the image branch. Knowledge is transferred from Amber-Image-10B via progressive distillation and lightweight fine-tuning.
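The "local weight averaging" used to warm-start the retained layers can be sketched as follows. This is a minimal illustration under an assumed grouping scheme (each retained layer is averaged with the pruned layers that follow it, up to the next retained layer); the function name `average_init` and this exact grouping are not from the source.

```python
import numpy as np

def average_init(layer_weights, keep_ids):
    """Reinitialize each retained layer by averaging its own weights
    with those of the pruned layers between it and the next retained
    layer (hypothetical grouping; a sketch, not the released code).

    layer_weights: one weight tensor (np.ndarray) per original layer.
    keep_ids: sorted indices of the layers that survive pruning.
    """
    bounds = list(keep_ids) + [len(layer_weights)]
    new_weights = []
    for j, k in enumerate(keep_ids):
        # Group = the retained layer plus the pruned layers after it.
        group = layer_weights[k:bounds[j + 1]]
        new_weights.append(np.mean(group, axis=0))
    return new_weights
```

For six layers with constant weights 0 through 5 and `keep_ids=[0, 2, 4]`, each retained layer is initialized to the mean of its two-layer group (0.5, 2.5, 4.5), giving a warm start that partially preserves the function of the removed blocks.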
Key Features
- No Training from Scratch: Operates entirely through strategic compression and refinement of an existing foundation model, dramatically reducing both computational budget and data requirements.
- Structured Depth Pruning with Fidelity-Aware Initialization: A layer importance estimation method accounts for global fidelity impact and timestep sensitivity, enabling safe removal of half the layers. Retained layers are initialized via arithmetic averaging of pruned neighboring blocks for a high-quality warm start.
- Hybrid-Stream Architecture: Early layers retain dual-stream processing for modality-specific feature extraction, while deeper layers are converted to a single stream, further reducing parameters by 40% with minimal quality loss.
- Two-Stage Knowledge Transfer: Layer-wise distillation from the full model recovers pruning-induced degradation, followed by distillation from the intermediate pruned model to align the single-stream layers. Both stages require only limited fine-tuning on a small, high-quality dataset.
- Competitive Benchmarks: Amber-Image achieves state-of-the-art results on DPG-Bench and GenEval, surpassing all compared models including closed-source systems and the 20B teacher. On text rendering benchmarks (LongText-Bench, CVTG-2K), Amber-Image-10B outperforms several closed-source baselines while maintaining competitive fidelity.
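One way to make the layer importance estimate both fidelity-aware and timestep-sensitive is to ablate each layer in turn and measure how much the final output deviates, averaged over sampled diffusion timesteps. The sketch below is a toy stand-in: the paper's exact scoring rule is not reproduced here, and `layer_importance` and the callable blocks are illustrative assumptions.

```python
import numpy as np

def layer_importance(blocks, x, timesteps):
    """Score each block by the output deviation caused by skipping it,
    averaged over diffusion timesteps (toy ablation-based estimate).

    blocks: callables block(x, t) -> np.ndarray, stand-ins for MMDiT layers.
    """
    def forward(skip=None):
        outs = []
        for t in timesteps:
            h = x
            for i, blk in enumerate(blocks):
                if i != skip:
                    h = blk(h, t)
            outs.append(h)
        return np.stack(outs)

    full = forward()
    scores = []
    for i in range(len(blocks)):
        ablated = forward(skip=i)
        # Mean deviation across timesteps = timestep-aware importance.
        scores.append(float(np.mean(np.abs(full - ablated))))
    return scores
```

Layers whose removal barely perturbs the output at any timestep score near zero and become pruning candidates; layers whose contribution is concentrated at particular timesteps are still caught because the deviation is averaged over the timestep samples.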
Amber-Image-10B vs Amber-Image-6B
| Aspect | Amber-Image-10B | Amber-Image-6B |
|---|---|---|
| Parameters | ~10B | ~6B |
| Backbone Layers | 30 | 30 (10 dual-stream + 20 single-stream) |
| Architecture | Dual-Stream MMDiT | Hybrid-Stream (Dual + Single) |
| Compression Ratio | 50% depth reduction | 70% parameter reduction |
| Base Model | Qwen-Image (20B) | Amber-Image-10B |
| Text Encoder | Qwen2.5-VL-7B | Qwen2.5-VL-7B |
| VAE | Qwen-Image VAE | Qwen-Image VAE |
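The hybrid-stream control flow of Amber-Image-6B (10 dual-stream blocks, then 20 single-stream blocks initialized from the image branch) can be illustrated with a toy forward pass. Joint attention and all real layer internals are elided; `hybrid_stream_forward` and the fusion-by-concatenation step are assumptions for illustration.

```python
import numpy as np

def hybrid_stream_forward(img, txt, dual_blocks, single_blocks):
    """Toy control flow for a hybrid-stream backbone: early blocks keep
    separate image/text paths with modality-specific parameters, then
    the token sequences are fused and processed as a single stream."""
    # Stage 1: dual-stream blocks (e.g. 10 in Amber-Image-6B).
    for blk in dual_blocks:
        img, txt = blk(img, txt)
    # Stage 2: fuse the token sequences and continue single-stream
    # (e.g. 20 blocks, initialized from the image branch).
    h = np.concatenate([txt, img], axis=0)
    for blk in single_blocks:
        h = blk(h)
    return h
```

With 4 image tokens and 2 text tokens of width 8, the fused stream has shape (6, 8) and every subsequent block sees both modalities in one sequence, which is where the parameter savings over a fully dual-stream stack come from.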
Benchmark Results
General Text-to-Image Generation
DPG-Bench: Dense prompt following with 1,065 semantically rich prompts. Both Amber-Image variants achieve the highest overall scores among all compared models, surpassing closed-source Seedream 3.0 and GPT Image 1, the 20B teacher Qwen-Image, and all 7B-class open-source competitors.
| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Seedream 3.0 | 94.31 | 92.65 | 91.36 | 92.78 | 88.24 | 88.27 |
| GPT Image 1 | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| Qwen-Image | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| Z-Image | 93.39 | 91.22 | 93.16 | 92.22 | 91.52 | 88.14 |
| LongCat-Image | 89.10 | 92.54 | 92.00 | 93.28 | 87.50 | 86.80 |
| Ovis-Image | 82.37 | 92.38 | 90.42 | 93.98 | 91.20 | 86.59 |
| PPCL-OPPO-10B | 85.0 | 86.8 | 85.6 | 90.5 | 87.3 | 81.7 |
| Amber-Image-10B | 83.28 | 92.54 | 90.16 | 94.47 | 87.60 | 89.61 |
| Amber-Image-6B | 79.73 | 90.45 | 91.64 | 93.87 | 89.11 | 88.96 |
GenEval: Semantic reasoning and object-centric grounding. Both Amber-Image variants achieve the best overall scores, outperforming the teacher Qwen-Image, closed-source systems, and all 7B-class open-source competitors. Notably strong in the "Position" and "Attribute" dimensions.
| Model | Single | Two | Counting | Colors | Position | Attribute | Overall |
|---|---|---|---|---|---|---|---|
| Seedream 3.0 | 0.990 | 0.960 | 0.910 | 0.930 | 0.470 | 0.800 | 0.840 |
| GPT Image 1 | 0.990 | 0.920 | 0.850 | 0.920 | 0.750 | 0.610 | 0.840 |
| Qwen-Image | 0.990 | 0.920 | 0.890 | 0.880 | 0.760 | 0.770 | 0.870 |
| Z-Image | 1.000 | 0.940 | 0.780 | 0.930 | 0.620 | 0.770 | 0.840 |
| LongCat-Image | 0.990 | 0.980 | 0.860 | 0.860 | 0.750 | 0.730 | 0.870 |
| Ovis-Image | 1.000 | 0.970 | 0.760 | 0.860 | 0.670 | 0.800 | 0.840 |
| PPCL-OPPO-10B | 0.968 | 0.885 | 0.822 | 0.840 | 0.521 | 0.670 | 0.784 |
| Amber-Image-10B | 0.963 | 0.849 | 0.900 | 0.862 | 0.850 | 0.860 | 0.881 |
| Amber-Image-6B | 0.963 | 0.879 | 0.875 | 0.894 | 0.880 | 0.810 | 0.883 |
OneIG-Bench: Multi-faceted instruction following (English / Chinese). Amber-Image maintains competitive "Text" rendering scores approaching the teacher Qwen-Image, while a gap remains in the "Style" and "Diversity" dimensions, attributed to the limited diversity of fine-tuning data and aesthetic priors lost during compression.
| Model | EN Overall | ZH Overall |
|---|---|---|
| Seedream 3.0 | 0.530 | 0.528 |
| GPT Image 1 | 0.533 | 0.474 |
| Qwen-Image | 0.539 | 0.548 |
| Z-Image | 0.546 | 0.535 |
| Ovis-Image | 0.530 | 0.521 |
| PPCL-OPPO-10B | 0.486 | 0.501 |
| Amber-Image-10B | 0.504 | 0.502 |
| Amber-Image-6B | 0.491 | 0.486 |
Text Rendering
LongText-Bench: Extended bilingual text rendering. Amber-Image-10B outperforms the closed-source Seedream 3.0 on both English and Chinese splits, and significantly surpasses GPT Image 1 on Chinese text rendering. The 6B variant still exceeds many larger baselines such as OmniGen2 and FLUX.1[Dev].
| Model | EN | ZH |
|---|---|---|
| Seedream 3.0 | 0.896 | 0.878 |
| GPT Image 1 | 0.956 | 0.619 |
| Qwen-Image | 0.943 | 0.946 |
| Z-Image | 0.935 | 0.936 |
| Ovis-Image | 0.922 | 0.964 |
| PPCL-OPPO-10B | 0.871 | 0.885 |
| Amber-Image-10B | 0.911 | 0.915 |
| Amber-Image-6B | 0.870 | 0.876 |
CVTG-2K: Complex visual text generation. Amber-Image-10B achieves a higher CLIPScore than the teacher Qwen-Image and GPT Image 1, indicating strong semantic alignment, though word accuracy declines as the number of text regions increases.
| Model | NED | CLIPScore | Word Acc (2 regions) | Word Acc (3 regions) | Word Acc (4 regions) | Word Acc (5 regions) | Word Acc (avg) |
|---|---|---|---|---|---|---|---|
| GPT Image 1 | 0.9478 | 0.7982 | 0.8779 | 0.8659 | 0.8731 | 0.8218 | 0.8569 |
| Qwen-Image | 0.9116 | 0.8017 | 0.8370 | 0.8364 | 0.8313 | 0.8158 | 0.8288 |
| Z-Image | 0.9367 | 0.7969 | 0.9006 | 0.8722 | 0.8652 | 0.8512 | 0.8671 |
| Ovis-Image | 0.9695 | 0.8368 | 0.9248 | 0.9239 | 0.9180 | 0.9166 | 0.9200 |
| LongCat-Image | 0.9361 | 0.7859 | 0.9129 | 0.8737 | 0.8557 | 0.8310 | 0.8658 |
| Amber-Image-10B | 0.8938 | 0.8116 | 0.8791 | 0.8339 | 0.7959 | 0.6952 | 0.8011 |
| Amber-Image-6B | 0.8523 | 0.8047 | 0.8669 | 0.7994 | 0.7200 | 0.6428 | 0.7573 |
Citation
If you find our work useful in your research, please consider citing:
```bibtex
@article{hellogroup2026amberimage,
  title={Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers},
  author={{Computational Intelligence Dept, HelloGroup Inc.}},
  year={2026}
}
```