Amber-Image
Efficient Compression of Large-Scale Diffusion Transformers
Amber-Image is a family of efficient text-to-image (T2I) generation models built through a dedicated compression pipeline that integrates structured pruning, architectural evolution, and knowledge distillation. Rather than training from scratch, Amber-Image transforms the 60-layer, 20B-parameter dual-stream MMDiT backbone of Qwen-Image into lightweight variants, Amber-Image-10B and Amber-Image-6B, reducing parameters by up to 70% while maintaining competitive generation quality.
The compression pipeline operates in two stages:
Amber-Image-10B: Derived via timestep-sensitive depth pruning, removing 30 of the 60 MMDiT layers identified as least critical. Retained layers are reinitialized through local weight averaging and recovered via layer-wise distillation from the original Qwen-Image, followed by full-parameter fine-tuning.
Amber-Image-6B: Introduces a hybrid-stream architecture where the first 10 layers retain dual-stream processing for modality-specific feature extraction, while the deeper 20 layers are converted to a single stream initialized from the image branch. Knowledge is transferred from Amber-Image-10B via progressive distillation and lightweight fine-tuning.
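The "local weight averaging" used to warm-start the retained layers can be sketched as follows. This is a minimal illustration under an assumed grouping scheme (each retained layer is averaged with the pruned layers that follow it, up to the next retained layer); the function name `average_init` and this exact grouping are not from the source.

```python
import numpy as np

def average_init(layer_weights, keep_ids):
    """Reinitialize each retained layer by averaging its own weights
    with those of the pruned layers between it and the next retained
    layer (hypothetical grouping; a sketch, not the released code).

    layer_weights: one weight tensor (np.ndarray) per original layer.
    keep_ids: sorted indices of the layers that survive pruning.
    """
    bounds = list(keep_ids) + [len(layer_weights)]
    new_weights = []
    for j, k in enumerate(keep_ids):
        # Group = the retained layer plus the pruned layers after it.
        group = layer_weights[k:bounds[j + 1]]
        new_weights.append(np.mean(group, axis=0))
    return new_weights
```

For six layers with constant weights 0 through 5 and `keep_ids=[0, 2, 4]`, each retained layer is initialized to the mean of its two-layer group (0.5, 2.5, 4.5), giving a warm start that partially preserves the function of the removed blocks.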
Key Features
- No Training from Scratch: Operates entirely through strategic compression and refinement of an existing foundation model, dramatically reducing both computational budget and data requirements.
- Structured Depth Pruning with Fidelity-Aware Initialization: A layer importance estimation method accounts for global fidelity impact and timestep sensitivity, enabling safe removal of half the layers. Retained layers are initialized via arithmetic averaging of pruned neighboring blocks for a high-quality warm start.
- Hybrid-Stream Architecture: Early layers retain dual-stream processing for modality-specific feature extraction, while deeper layers are converted to a single stream, further reducing parameters by 40% with minimal quality loss.
- Two-Stage Knowledge Transfer: Layer-wise distillation from the full model recovers pruning-induced degradation, followed by distillation from the intermediate pruned model to align the single-stream layers. Both stages require only limited fine-tuning on a small, high-quality dataset.
- Competitive Benchmarks: Amber-Image achieves state-of-the-art results on DPG-Bench and GenEval, surpassing all compared models including closed-source systems and the 20B teacher. On text rendering benchmarks (LongText-Bench, CVTG-2K), Amber-Image-10B outperforms several closed-source baselines while maintaining competitive fidelity.
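One way to make the layer importance estimate both fidelity-aware and timestep-sensitive is to ablate each layer in turn and measure how much the final output deviates, averaged over sampled diffusion timesteps. The sketch below is a toy stand-in: the paper's exact scoring rule is not reproduced here, and `layer_importance` and the callable blocks are illustrative assumptions.

```python
import numpy as np

def layer_importance(blocks, x, timesteps):
    """Score each block by the output deviation caused by skipping it,
    averaged over diffusion timesteps (toy ablation-based estimate).

    blocks: callables block(x, t) -> np.ndarray, stand-ins for MMDiT layers.
    """
    def forward(skip=None):
        outs = []
        for t in timesteps:
            h = x
            for i, blk in enumerate(blocks):
                if i != skip:
                    h = blk(h, t)
            outs.append(h)
        return np.stack(outs)

    full = forward()
    scores = []
    for i in range(len(blocks)):
        ablated = forward(skip=i)
        # Mean deviation across timesteps = timestep-aware importance.
        scores.append(float(np.mean(np.abs(full - ablated))))
    return scores
```

Layers whose removal barely perturbs the output at any timestep score near zero and become pruning candidates; layers whose contribution is concentrated at particular timesteps are still caught because the deviation is averaged over the timestep samples.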
Amber-Image-10B vs Amber-Image-6B
| Aspect | Amber-Image-10B | Amber-Image-6B |
|---|---|---|
| Parameters | ~10B | ~6B |
| Backbone Layers | 30 | 30 (10 dual-stream + 20 single-stream) |
| Architecture | Dual-Stream MMDiT | Hybrid-Stream (Dual + Single) |
| Compression Ratio | 50% depth reduction | 70% parameter reduction |
| Base Model | Qwen-Image (20B) | Amber-Image-10B |
| Text Encoder | Qwen2.5-VL-7B | Qwen2.5-VL-7B |
| VAE | Qwen-Image VAE | Qwen-Image VAE |
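The hybrid-stream control flow of Amber-Image-6B (10 dual-stream blocks, then 20 single-stream blocks initialized from the image branch) can be illustrated with a toy forward pass. Joint attention and all real layer internals are elided; `hybrid_stream_forward` and the fusion-by-concatenation step are assumptions for illustration.

```python
import numpy as np

def hybrid_stream_forward(img, txt, dual_blocks, single_blocks):
    """Toy control flow for a hybrid-stream backbone: early blocks keep
    separate image/text paths with modality-specific parameters, then
    the token sequences are fused and processed as a single stream."""
    # Stage 1: dual-stream blocks (e.g. 10 in Amber-Image-6B).
    for blk in dual_blocks:
        img, txt = blk(img, txt)
    # Stage 2: fuse the token sequences and continue single-stream
    # (e.g. 20 blocks, initialized from the image branch).
    h = np.concatenate([txt, img], axis=0)
    for blk in single_blocks:
        h = blk(h)
    return h
```

With 4 image tokens and 2 text tokens of width 8, the fused stream has shape (6, 8) and every subsequent block sees both modalities in one sequence, which is where the parameter savings over a fully dual-stream stack come from.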
Benchmark Results
General Text-to-Image Generation
DPG-Bench: Dense prompt following with 1,065 semantically rich prompts. Both Amber-Image variants achieve the highest overall scores among all compared models, surpassing closed-source Seedream 3.0 and GPT Image 1, the 20B teacher Qwen-Image, and all 7B-class open-source competitors.
| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Seedream 3.0 | 94.31 | 92.65 | 91.36 | 92.78 | 88.24 | 88.27 |
| GPT Image 1 | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| Qwen-Image | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| Z-Image | 93.39 | 91.22 | 93.16 | 92.22 | 91.52 | 88.14 |
| LongCat-Image | 89.10 | 92.54 | 92.00 | 93.28 | 87.50 | 86.80 |
| Ovis-Image | 82.37 | 92.38 | 90.42 | 93.98 | 91.20 | 86.59 |
| PPCL-OPPO-10B | 85.0 | 86.8 | 85.6 | 90.5 | 87.3 | 81.7 |
| Amber-Image-10B | 83.28 | 92.54 | 90.16 | 94.47 | 87.60 | 89.61 |
| Amber-Image-6B | 79.73 | 90.45 | 91.64 | 93.87 | 89.11 | 88.96 |
GenEval: Semantic reasoning and object-centric grounding. Both Amber-Image variants achieve the best overall scores, outperforming the teacher Qwen-Image, closed-source systems, and all 7B-class open-source competitors. Notably strong in the "Position" and "Attribute" dimensions.
| Model | Single | Two | Counting | Colors | Position | Attribute | Overall |
|---|---|---|---|---|---|---|---|
| Seedream 3.0 | 0.990 | 0.960 | 0.910 | 0.930 | 0.470 | 0.800 | 0.840 |
| GPT Image 1 | 0.990 | 0.920 | 0.850 | 0.920 | 0.750 | 0.610 | 0.840 |
| Qwen-Image | 0.990 | 0.920 | 0.890 | 0.880 | 0.760 | 0.770 | 0.870 |
| Z-Image | 1.000 | 0.940 | 0.780 | 0.930 | 0.620 | 0.770 | 0.840 |
| LongCat-Image | 0.990 | 0.980 | 0.860 | 0.860 | 0.750 | 0.730 | 0.870 |
| Ovis-Image | 1.000 | 0.970 | 0.760 | 0.860 | 0.670 | 0.800 | 0.840 |
| PPCL-OPPO-10B | 0.968 | 0.885 | 0.822 | 0.840 | 0.521 | 0.670 | 0.784 |
| Amber-Image-10B | 0.963 | 0.849 | 0.900 | 0.862 | 0.850 | 0.860 | 0.881 |
| Amber-Image-6B | 0.963 | 0.879 | 0.875 | 0.894 | 0.880 | 0.810 | 0.883 |
OneIG-Bench: Multi-faceted instruction following (English / Chinese). Amber-Image maintains competitive "Text" rendering scores approaching the teacher Qwen-Image, while a gap remains in the "Style" and "Diversity" dimensions, attributed to the limited diversity of fine-tuning data and aesthetic priors lost during compression.
| Model | EN Overall | ZH Overall |
|---|---|---|
| Seedream 3.0 | 0.530 | 0.528 |
| GPT Image 1 | 0.533 | 0.474 |
| Qwen-Image | 0.539 | 0.548 |
| Z-Image | 0.546 | 0.535 |
| Ovis-Image | 0.530 | 0.521 |
| PPCL-OPPO-10B | 0.486 | 0.501 |
| Amber-Image-10B | 0.504 | 0.502 |
| Amber-Image-6B | 0.491 | 0.486 |
Text Rendering
LongText-Bench: Extended bilingual text rendering. Amber-Image-10B outperforms the closed-source Seedream 3.0 on both English and Chinese splits, and significantly surpasses GPT Image 1 on Chinese text rendering. The 6B variant still exceeds many larger baselines such as OmniGen2 and FLUX.1[Dev].
| Model | EN | ZH |
|---|---|---|
| Seedream 3.0 | 0.896 | 0.878 |
| GPT Image 1 | 0.956 | 0.619 |
| Qwen-Image | 0.943 | 0.946 |
| Z-Image | 0.935 | 0.936 |
| Ovis-Image | 0.922 | 0.964 |
| PPCL-OPPO-10B | 0.871 | 0.885 |
| Amber-Image-10B | 0.911 | 0.915 |
| Amber-Image-6B | 0.870 | 0.876 |
CVTG-2K: Complex visual text generation. Amber-Image-10B achieves a higher CLIPScore than the teacher Qwen-Image and GPT Image 1, indicating strong semantic alignment, though word accuracy declines as the number of text regions increases.
| Model | NED | CLIPScore | Word Acc (2 regions) | Word Acc (3 regions) | Word Acc (4 regions) | Word Acc (5 regions) | Word Acc (avg) |
|---|---|---|---|---|---|---|---|
| GPT Image 1 | 0.9478 | 0.7982 | 0.8779 | 0.8659 | 0.8731 | 0.8218 | 0.8569 |
| Qwen-Image | 0.9116 | 0.8017 | 0.8370 | 0.8364 | 0.8313 | 0.8158 | 0.8288 |
| Z-Image | 0.9367 | 0.7969 | 0.9006 | 0.8722 | 0.8652 | 0.8512 | 0.8671 |
| Ovis-Image | 0.9695 | 0.8368 | 0.9248 | 0.9239 | 0.9180 | 0.9166 | 0.9200 |
| LongCat-Image | 0.9361 | 0.7859 | 0.9129 | 0.8737 | 0.8557 | 0.8310 | 0.8658 |
| Amber-Image-10B | 0.8938 | 0.8116 | 0.8791 | 0.8339 | 0.7959 | 0.6952 | 0.8011 |
| Amber-Image-6B | 0.8523 | 0.8047 | 0.8669 | 0.7994 | 0.7200 | 0.6428 | 0.7573 |
Citation
If you find our work useful in your research, please consider citing:
```bibtex
@article{hellogroup2026amberimage,
  title={Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers},
  author={{Computational Intelligence Dept, HelloGroup Inc.}},
  year={2026}
}
```