🎯 Brief Introduction

Youtu-LLM is a new, small, yet powerful LLM: it contains only 1.96B parameters, supports a 128K context window, and has native agentic capabilities. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in Commonsense, STEM, Coding, and Long Context capabilities; on agent-related tests, it surpasses larger leading models and can genuinely complete multiple end-to-end agent tasks.

Youtu-LLM has the following features:

  • Type: Autoregressive Causal Language Models with Dense MLA
  • Release versions: Base and Instruct
  • Number of Parameters: 1.96B
  • Number of Layers: 32
  • Number of Attention Heads (MLA): 16 for Q/K/V
  • MLA Rank: 1,536 for Q, 512 for K/V
  • MLA Dim: 128 for QK Nope, 64 for QK Rope, and 128 for V
  • Context Length: 131,072
  • Vocabulary Size: 128,256
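The MLA dimensions above can be sanity-checked with a short sketch. The variable names below are illustrative (they are not claimed to match the actual config keys), and the KV-cache comparison assumes the standard MLA caching scheme, where only the compressed KV latent and the shared RoPE key slice are stored per token:

```python
# Illustrative bookkeeping for the MLA dimensions listed above.
# Variable names are for exposition only, not the real config keys.
n_heads = 16             # attention heads for Q/K/V
q_lora_rank = 1536       # MLA rank for Q
kv_lora_rank = 512       # MLA rank for K/V
qk_nope_head_dim = 128   # non-RoPE slice of each Q/K head
qk_rope_head_dim = 64    # RoPE slice of each Q/K head
v_head_dim = 128

# Each Q/K head concatenates its non-RoPE and RoPE slices.
qk_head_dim = qk_nope_head_dim + qk_rope_head_dim  # 192

# Per-token KV-cache width under MLA (assumed scheme): the compressed
# KV latent plus one shared RoPE key slice, independent of head count.
mla_cache_per_token = kv_lora_rank + qk_rope_head_dim  # 576

# For comparison, standard multi-head attention would cache full K and V
# vectors for every head.
mha_cache_per_token = n_heads * (qk_head_dim + v_head_dim)  # 5120

print(qk_head_dim, mla_cache_per_token, mha_cache_per_token)
```

Under these assumptions, MLA stores 576 values per token instead of 5,120, which is a large part of why a 1.96B model can afford a 131,072-token context.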

πŸ“Š Performance Comparisons

Base Model

Comparison between Youtu-LLM-2B-Base and baselines

General Benchmarks

| Type | Benchmark (Metric) | # Shots | Qwen3-1.7B-Base | SmolLM3-3B-Base | Gemma3-4B-Base | Qwen3-4B-Base | Llama3.1-8B | Youtu-LLM-2B-Base |
|---|---|---|---|---|---|---|---|---|
| Commonsense | MMLU-Pro (EM) | 5 | 34.9% | 35.3% | 29.4% | 46.1% | 36.2% | 48.4% |
| | MLQA-Zh (EM) | 3 | 38.1% | 38.0% | 40.3% | 47.2% | 43.0% | 43.5% |
| | MMLU-ProX-Zh (EM) | 5 | 32.5% | 26.7% | 24.2% | 45.2% | 25.4% | 40.7% |
| STEM | GSM8K (EM) | 8 | 68.2% | 67.3% | 38.5% | 80.8% | 47.8% | 77.6% |
| | MGSM-Zh (EM) | 8 | 57.1% | 40.7% | 33.0% | 69.7% | 35.9% | 68.9% |
| | MATH (EM) | 4 | 28.1% | 40.8% | 24.4% | 44.8% | 21.5% | 44.4% |
| | BBH (EM) | 3 | 53.0% | 59.8% | 51.6% | 70.8% | 62.9% | 59.8% |
| | GPQA-MC (Acc. Norm) | 5 | 30.4% | 26.6% | 28.6% | 37.8% | 30.1% | 33.3% |
| | HLE-MC (Acc. Norm) | 3 | 10.7% | 3.1% | 8.0% | 15.0% | 11.5% | 17.4% |
| Coding | MBPP (Pass@1) | 3 | 55.6% | 51.0% | 45.8% | 67.5% | 49.4% | 66.6% |
| | MBPP+ (Pass@1) | 3 | 71.0% | 66.1% | 61.9% | 80.8% | 62.7% | 81.8% |
| | HumanEval (Pass@1) | 0 | 49.9% | 34.8% | 36.6% | 57.6% | 36.0% | 64.6% |
| | HumanEval+ (Pass@1) | 0 | 41.3% | 28.1% | 28.1% | 49.9% | 28.1% | 57.3% |
| | LiveCodeBench v6 (Pass@1) | 3 | 5.1% | 2.9% | 2.9% | 6.9% | 3.4% | 9.7% |
| | CRUXEval (Pass@1) | 1 | 40.6% | 42.1% | 39.7% | 54.8% | 42.3% | 55.9% |
| | RepoBench (EM) | 3 | 21.0% | 21.8% | 23.0% | 25.3% | 25.2% | 22.7% |
| Long Context | LongBench v2 (Acc.) | 3 | 28.0% | 28.8% | 26.6% | 25.8% | 27.8% | 27.2% |
| | NIAH (Acc.) | / | 79.8% | 75.0% | 99.5% | 83.0% | 99.8% | 98.8% |

Agentic Benchmarks

We use APTBench to evaluate the agentic capabilities of the base model.

| Category | Qwen3-1.7B-Base | SmolLM3-3B-Base | Gemma3-4B-Base | Qwen3-4B-Base | Llama3.1-8B | Youtu-LLM-2B-Base |
|---|---|---|---|---|---|---|
| Code | 25.1% | 24.3% | 32.8% | 41.9% | 23.6% | 37.9% |
| Deep Research | 28.5% | 27.2% | 36.4% | 40.5% | 30.0% | 38.6% |
| Math | 59.9% | 60.7% | 59.8% | 70.5% | 60.1% | 68.0% |
| Tool | 56.7% | 59.1% | 61.7% | 65.8% | 64.1% | 64.2% |