GitHub Repo | Technical Report | Join Us

👋 Contact us on Discord and WeChat

🏆 The 2026 Sparse Operator Acceleration Race (SOAR) is Now Live!

"The MiniCPM-SALA architecture is just the beginning. Realizing its full potential requires deep system-level synergy and cross-layer compilation optimization."

In collaboration with SGLang and NVIDIA, OpenBMB invites global geeks to push the boundaries of 9B-scale, 1M-token inference on NVIDIA 6000D.

💰 Prize Pool: >$100,000 USD (🥇 Top Prize: $89,000) | 🚀 Challenge: Single & Multi-batch Optimization

👉 Click Here to Join the Race @ soar.openbmb.cn

What's New

  • [2026.02.11] MiniCPM-SALA is released! This is the first large-scale hybrid model effectively integrating sparse and linear attention for million-token context modeling. You can find the technical report here. 🔥🔥🔥

Highlights

MiniCPM-SALA (Sparse Attention and Linear Attention) is the first large-scale hybrid model effectively integrating sparse and linear attention for million-token context modeling.

✅ Innovative Hybrid Architecture: Synergizes 25% Sparse Attention (InfLLM-V2) for high-fidelity long-context modeling with 75% Linear Attention (Lightning Attention) for global efficiency.

✅ Shattering Efficiency Walls: Breaks the "Compute Wall" and the "Memory Wall," achieving up to 3.5× the inference speed and significantly lower KV-cache overhead compared to dense baselines (see the sketch below).

✅ Million-Token Context: Empowered by HyPE (Hybrid Positional Embedding), it scales to 1M+ tokens while maintaining strong length generalization.

✅ HALO Adaptation: Utilizes Hybrid Attention via Layer Optimization (HALO), a novel distillation recipe that effectively transfers dense attention capabilities to the hybrid architecture, avoiding the severe performance degradation typical of pure linear models.
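
As a rough back-of-envelope illustration of the "Memory Wall" point above, the sketch below estimates KV-cache growth at a 1M-token context. The layer count, KV-head count, and head dimension are assumed round numbers for illustration, not the released MiniCPM-SALA configuration.

# Back-of-envelope KV-cache estimate with ASSUMED dimensions; the real
# configuration is defined by the released model files.
num_layers = 32          # hypothetical decoder layer count
num_kv_heads = 8         # hypothetical KV heads per layer
head_dim = 128           # hypothetical head dimension
seq_len = 1_000_000      # 1M-token context
bytes_per_value = 2      # BF16

# Keys and values for every token in every layer (dense full attention).
dense_kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# In the hybrid layout only ~25% of layers hold a token-length KV cache;
# linear-attention layers keep a constant-size recurrent state instead.
hybrid_kv_bytes = 0.25 * dense_kv_bytes

print(f"dense  KV cache: {dense_kv_bytes / 1e9:.0f} GB")   # ~131 GB
print(f"hybrid KV cache: {hybrid_kv_bytes / 1e9:.0f} GB")  # ~33 GB, before InfLLM-V2 sparsity savings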

Introduction

MiniCPM-SALA is an efficient hybrid model in which 25% of the layers adopt InfLLM-V2 and the remaining 75% utilize Lightning Attention. This architecture enables inference over one-million-token contexts on consumer GPUs such as the NVIDIA RTX 5090.
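
As a rough sketch of that 25%/75% layer split, the snippet below assigns an attention type to each decoder layer by index; the interleaving pattern, layer count, and names used here are illustrative assumptions rather than the released configuration.

# Illustrative only: the real layer schedule comes from the model config.
NUM_LAYERS = 32        # hypothetical layer count
SPARSE_EVERY = 4       # one sparse (InfLLM-V2) layer per block of four -> 25%

def attention_type(layer_idx: int) -> str:
    """Pick the attention variant for a decoder layer under a 25%/75% split."""
    if layer_idx % SPARSE_EVERY == SPARSE_EVERY - 1:
        return "infllm_v2_sparse"   # high-fidelity sparse attention
    return "lightning_linear"       # efficient linear attention

schedule = [attention_type(i) for i in range(NUM_LAYERS)]
assert schedule.count("infllm_v2_sparse") == NUM_LAYERS // SPARSE_EVERY  # 25% of layers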

  • SALA Hybrid Attention Mechanism

    • Integrates 25% InfLLM-V2 and 75% Lightning Attention, effectively leveraging the granular focus of sparse attention for local details and the high efficiency of linear attention for broad context.
  • Transformer-to-Hybrid Continued Training

    • Circumvents the inefficiencies of cold-start training by performing an architectural transformation on the pre-trained weights, reducing the total training budget to approximately 25% of that required to train a comparable model from scratch.
  • HyPE (Hybrid Positional Encoding)

    • Harmonizes performance across short and long contexts: the model maintains general capabilities (e.g., knowledge, mathematics, and coding) comparable to modern full-attention models such as Qwen3-8B while achieving substantial advantages on multiple long-context benchmarks.
  • Efficient Inference on Long Sequences

    • Achieves up to 3.5x the inference speed of Qwen3-8B at a sequence length of 256K tokens on the NVIDIA A6000D, and supports inference at context lengths of up to 1M tokens on both the NVIDIA A6000D and RTX 5090 GPUs, a length at which Qwen3-8B fails with out-of-memory (OOM) errors.

Inference

To achieve optimal performance, we recommend sampling with temperature=0.9.

HuggingFace

Our model is readily compatible with 🤗 Hugging Face Transformers. You can perform inference as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "openbmb/MiniCPM-SALA"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype="auto", device_map="auto")
model.eval()

# Batched prompts of different lengths need a pad token and left padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["My name is", "The capital of China is"]
with torch.no_grad():
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.9,  # recommended sampling temperature
    )
output_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output_texts)

SGLang

Requirements

  • CUDA 12.x or higher
  • gcc / g++ compiler
  • uv package manager (script will check)

Installation

# Clone repository
git clone -b minicpm_sala https://github.com/OpenBMB/sglang.git
cd sglang

# One-click installation (creates venv and compiles all dependencies)
bash install_minicpm_sala.sh

# Or specify PyPI mirror
bash install_minicpm_sala.sh https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

The installation script performs the following steps:

  1. Creates sglang_minicpm_sala_env virtual environment (Python 3.12)
  2. Clones dependencies to 3rdparty/ (infllmv2) and initializes submodules (sparse_kernel)
  3. Installs MiniCPM-SALA (current repo)
  4. Compiles and installs infllmv2_cuda_impl
  5. Compiles and installs sparse_kernel
  6. Installs tilelang & flash-linear-attention
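
After the script completes, a quick import check run inside the new environment can confirm that the compiled extensions are visible. The module names below are assumptions based on the package names listed above (fla is the usual import name of flash-linear-attention) and may differ in your build.

# Sanity check after installation; run inside sglang_minicpm_sala_env.
# Module names are assumptions based on the packages listed above.
import importlib

for name in ("sglang", "infllmv2_cuda_impl", "sparse_kernel", "tilelang", "fla"):
    try:
        importlib.import_module(name)
        print(f"[ok]   {name}")
    except ImportError as exc:
        print(f"[fail] {name}: {exc}")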

Usage

# Activate environment
source sglang_minicpm_sala_env/bin/activate

# Launch Inference Server (Replace MODEL_PATH with actual path)
MODEL_PATH=/path/to/your/MiniCPM-SALA

python3 -m sglang.launch_server \
    --model ${MODEL_PATH} \
    --trust-remote-code \
    --disable-radix-cache \
    --attention-backend minicpm_flashinfer \
    --chunked-prefill-size 8192 \
    --max-running-requests 32 \
    --skip-server-warmup \
    --port 31111 \
    --dense-as-sparse

Parameter Description

  • --trust-remote-code: Allow custom code in the model
  • --disable-radix-cache: Disable the RadixAttention prefix cache
  • --attention-backend minicpm_flashinfer: Use the MiniCPM FlashInfer backend
  • --chunked-prefill-size 8192: Chunked prefill size
  • --max-running-requests 32: Max concurrent requests
  • --skip-server-warmup: Skip server warmup
  • --port 31111: Server port
  • --dense-as-sparse: Use dense-as-sparse mode
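
Once the server is running, it can be queried through SGLang's OpenAI-compatible HTTP API. The sketch below uses plain requests against the port chosen above and discovers the served model name via /v1/models; the prompt and sampling values are only examples.

# Minimal client sketch for the server launched above (port 31111).
import requests

base_url = "http://localhost:31111"

# Ask the server which model it is serving instead of hard-coding the path.
model_name = requests.get(f"{base_url}/v1/models").json()["data"][0]["id"]

response = requests.post(
    f"{base_url}/v1/completions",
    json={
        "model": model_name,
        "prompt": "The capital of China is",
        "max_tokens": 64,
        "temperature": 0.9,  # recommended sampling temperature for MiniCPM-SALA
    },
)
print(response.json()["choices"][0]["text"])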

Manual Installation

If the script doesn't work for you, follow these steps:

# 0. Ensure uv is installed
pip install uv

# 1. Create venv
uv venv --python 3.12 sglang_minicpm_sala_env
source sglang_minicpm_sala_env/bin/activate

# 2. Install SGLang
uv pip install --upgrade pip setuptools wheel
uv pip install -e ./python[all]

# 3. Compile CUDA Extensions
# (Ensure dependencies are cloned to 3rdparty/)
cd 3rdparty/infllmv2_cuda_impl && python setup.py install && cd ../..
cd 3rdparty/sparse_kernel && python setup.py install && cd ../..

# 4. Install extra deps
uv pip install tilelang flash-linear-attention

Q&A

Q: CUDA extension compilation failed?

  • Ensure CUDA 12+ is installed (nvcc --version).
  • Ensure gcc / g++ are available.
  • If CXX is set to clang++ -pthread, manually export CXX=g++.

Evaluation Results

Efficiency Evaluation

[Figure: inference speed on NVIDIA A6000D]

[Figure: inference speed on NVIDIA RTX 5090]

Long-Context Evaluation

[Figure: long-context evaluation results]

Ultra-long Context Evaluation

[Figure: ultra-long-context evaluation results]

Standard Evaluation

[Figure: standard benchmark results]

Statement

  • As a language model, MiniCPM-SALA generates content by learning from a vast amount of text.
  • However, it does not possess the ability to comprehend or express personal opinions or value judgments.
  • Any content generated by MiniCPM-SALA does not represent the viewpoints or positions of the model developers.
  • Therefore, when using content generated by MiniCPM-SALA, users should take full responsibility for evaluating and verifying it on their own.

LICENSE

  • This repository and MiniCPM models are released under the Apache-2.0 License.

Citation

  • Please cite our paper if you find our work valuable.
@article{minicpmsala,
  title={{MiniCPM-SALA}: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling},
  author={MiniCPM Team},
  year={2026}
}