Amber-Image

Efficient Compression of Large-Scale Diffusion Transformers


Representative samples generated by Amber-Image.

🎨 Amber-Image

Amber-Image is a family of efficient text-to-image (T2I) generation models built through a dedicated compression pipeline that integrates structured pruning, architectural evolution, and knowledge distillation. Rather than training from scratch, Amber-Image transforms the 60-layer, 20B-parameter dual-stream MMDiT backbone of Qwen-Image into lightweight variants (Amber-Image-10B and Amber-Image-6B), reducing parameters by up to 70% while maintaining competitive generation quality.

The compression pipeline operates in two stages:

  1. Amber-Image-10B: Derived via timestep-sensitive depth pruning, removing 30 of the 60 MMDiT layers identified as least critical. Retained layers are reinitialized through local weight averaging and recovered via layer-wise distillation from the original Qwen-Image, followed by full-parameter fine-tuning.

  2. Amber-Image-6B: Introduces a hybrid-stream architecture where the first 10 layers retain dual-stream processing for modality-specific feature extraction, while the deeper 20 layers are converted to a single stream initialized from the image branch. Knowledge is transferred from Amber-Image-10B via progressive distillation and lightweight fine-tuning.

Overview of the Amber-Image compression pipeline.

🌟 Key Features

  • No Training from Scratch: Operates entirely through strategic compression and refinement of an existing foundation model, dramatically reducing both computational budget and data requirements.
  • Structured Depth Pruning with Fidelity-Aware Initialization: A layer importance estimation method accounts for global fidelity impact and timestep sensitivity, enabling safe removal of half the layers. Retained layers are initialized via arithmetic averaging of pruned neighboring blocks for a high-quality warm start.
  • Hybrid-Stream Architecture: Early layers retain dual-stream processing for modality-specific feature extraction, while deeper layers are converted to a single stream, further reducing parameters by 40% with minimal quality loss.
  • Two-Stage Knowledge Transfer: Layer-wise distillation from the full model recovers pruning-induced degradation, followed by distillation from the intermediate pruned model to align the single-stream layers. Both stages require only limited fine-tuning on a small, high-quality dataset.
  • Competitive Benchmarks: Amber-Image achieves state-of-the-art overall scores on DPG-Bench and GenEval, surpassing all compared models, including closed-source systems and the 20B teacher. On text rendering benchmarks (LongText-Bench, CVTG-2K), Amber-Image-10B outperforms several closed-source baselines while maintaining competitive fidelity.
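The timestep-sensitive layer importance estimation above could look roughly like the sketch below: each layer is scored by how much the model output shifts when that layer is skipped, averaged across diffusion timesteps. Here `run_model(x, t, skip=i)` is a hypothetical ablation hook, not an API from the release; the actual scoring used by Amber-Image is not specified in this card.

```python
import numpy as np

def layer_importance(run_model, x, timesteps, n_layers):
    """Score each layer by the mean output deviation caused by skipping
    it, averaged over diffusion timesteps (timestep-sensitive sketch).
    A low score suggests the layer is a safer candidate for pruning."""
    scores = np.zeros(n_layers)
    for t in timesteps:
        base = run_model(x, t, skip=None)      # unablated output at timestep t
        for i in range(n_layers):
            ablated = run_model(x, t, skip=i)  # output with layer i skipped
            scores[i] += np.mean(np.abs(base - ablated))
    return scores / len(timesteps)
```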

🆚 Amber-Image-10B vs Amber-Image-6B

| Aspect | Amber-Image-10B | Amber-Image-6B |
|---|---|---|
| Parameters | ~10B | ~6B |
| Backbone Layers | 30 | 30 (10 dual-stream + 20 single-stream) |
| Architecture | Dual-Stream MMDiT | Hybrid-Stream (Dual + Single) |
| Compression Ratio | 50% depth reduction | 70% parameter reduction |
| Base Model | Qwen-Image (20B) | Amber-Image-10B |
| Text Encoder | Qwen2.5-VL-7B | Qwen2.5-VL-7B |
| VAE | Qwen-Image VAE | Qwen-Image VAE |

📊 Benchmark Results

General Text-to-Image Generation

DPG-Bench: Dense prompt following with 1,065 semantically rich prompts. Both Amber-Image variants achieve the highest overall scores among all compared models, surpassing closed-source Seedream 3.0 and GPT Image 1, the 20B teacher Qwen-Image, and all 7B-class open-source competitors.

| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Seedream 3.0 | 94.31 | 92.65 | 91.36 | 92.78 | 88.24 | 88.27 |
| GPT Image 1 | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| Qwen-Image | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| Z-Image | 93.39 | 91.22 | 93.16 | 92.22 | 91.52 | 88.14 |
| LongCat-Image | 89.10 | 92.54 | 92.00 | 93.28 | 87.50 | 86.80 |
| Ovis-Image | 82.37 | 92.38 | 90.42 | 93.98 | 91.20 | 86.59 |
| PPCL-OPPO-10B | 85.00 | 86.80 | 85.60 | 90.50 | 87.30 | 81.70 |
| Amber-Image-10B | 83.28 | 92.54 | 90.16 | 94.47 | 87.60 | 89.61 |
| Amber-Image-6B | 79.73 | 90.45 | 91.64 | 93.87 | 89.11 | 88.96 |

GenEval: Semantic reasoning and object-centric grounding. Both Amber-Image variants achieve the best overall scores, outperforming the teacher Qwen-Image, closed-source systems, and all 7B-class open-source competitors. Notably strong in "Position" and "Attribute" dimensions.

| Model | Single | Two | Counting | Colors | Position | Attribute | Overall |
|---|---|---|---|---|---|---|---|
| Seedream 3.0 | 0.990 | 0.960 | 0.910 | 0.930 | 0.470 | 0.800 | 0.840 |
| GPT Image 1 | 0.990 | 0.920 | 0.850 | 0.920 | 0.750 | 0.610 | 0.840 |
| Qwen-Image | 0.990 | 0.920 | 0.890 | 0.880 | 0.760 | 0.770 | 0.870 |
| Z-Image | 1.000 | 0.940 | 0.780 | 0.930 | 0.620 | 0.770 | 0.840 |
| LongCat-Image | 0.990 | 0.980 | 0.860 | 0.860 | 0.750 | 0.730 | 0.870 |
| Ovis-Image | 1.000 | 0.970 | 0.760 | 0.860 | 0.670 | 0.800 | 0.840 |
| PPCL-OPPO-10B | 0.968 | 0.885 | 0.822 | 0.840 | 0.521 | 0.670 | 0.784 |
| Amber-Image-10B | 0.963 | 0.849 | 0.900 | 0.862 | 0.850 | 0.860 | 0.881 |
| Amber-Image-6B | 0.963 | 0.879 | 0.875 | 0.894 | 0.880 | 0.810 | 0.883 |

OneIG-Bench: Multi-faceted instruction following (English / Chinese). Amber-Image maintains competitive "Text" rendering scores approaching the teacher Qwen-Image, while a gap remains in "Style" and "Diversity" dimensions, attributed to the limited diversity of fine-tuning data and aesthetic priors lost during compression.

| Model | EN Overall | ZH Overall |
|---|---|---|
| Seedream 3.0 | 0.530 | 0.528 |
| GPT Image 1 | 0.533 | 0.474 |
| Qwen-Image | 0.539 | 0.548 |
| Z-Image | 0.546 | 0.535 |
| Ovis-Image | 0.530 | 0.521 |
| PPCL-OPPO-10B | 0.486 | 0.501 |
| Amber-Image-10B | 0.504 | 0.502 |
| Amber-Image-6B | 0.491 | 0.486 |

Text Rendering

LongText-Bench: Extended bilingual text rendering. Amber-Image-10B outperforms the closed-source Seedream 3.0 on both English and Chinese splits, and significantly surpasses GPT Image 1 on Chinese text rendering. The 6B variant still exceeds many larger baselines such as OmniGen2 and FLUX.1[Dev].

| Model | EN | ZH |
|---|---|---|
| Seedream 3.0 | 0.896 | 0.878 |
| GPT Image 1 | 0.956 | 0.619 |
| Qwen-Image | 0.943 | 0.946 |
| Z-Image | 0.935 | 0.936 |
| Ovis-Image | 0.922 | 0.964 |
| PPCL-OPPO-10B | 0.871 | 0.885 |
| Amber-Image-10B | 0.911 | 0.915 |
| Amber-Image-6B | 0.870 | 0.876 |

CVTG-2K: Complex visual text generation. Amber-Image-10B achieves the highest CLIPScore among all compared models, indicating strong semantic alignment, although word accuracy (reported per region count below) declines as the number of text regions increases.

| Model | NED | CLIPScore | 2 Regions | 3 Regions | 4 Regions | 5 Regions | Average |
|---|---|---|---|---|---|---|---|
| GPT Image 1 | 0.9478 | 0.7982 | 0.8779 | 0.8659 | 0.8731 | 0.8218 | 0.8569 |
| Qwen-Image | 0.9116 | 0.8017 | 0.8370 | 0.8364 | 0.8313 | 0.8158 | 0.8288 |
| Z-Image | 0.9367 | 0.7969 | 0.9006 | 0.8722 | 0.8652 | 0.8512 | 0.8671 |
| Ovis-Image | 0.9695 | 0.8368 | 0.9248 | 0.9239 | 0.9180 | 0.9166 | 0.9200 |
| LongCat-Image | 0.9361 | 0.7859 | 0.9129 | 0.8737 | 0.8557 | 0.8310 | 0.8658 |
| Amber-Image-10B | 0.8938 | 0.8116 | 0.8791 | 0.8339 | 0.7959 | 0.6952 | 0.8011 |
| Amber-Image-6B | 0.8523 | 0.8047 | 0.8669 | 0.7994 | 0.7200 | 0.6428 | 0.7573 |

📜 Citation

If you find our work useful in your research, please consider citing:

@article{hellogroup2026amberimage,
  title={Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers},
  author={Computational Intelligence Dept, HelloGroup Inc.},
  year={2026}
}