License • Code • Technical Report • Benchmarks
Youtu-LLM is a new, small, yet powerful LLM: it contains only 1.96B parameters, supports a 128k-token long context, and has native agentic capabilities. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in Commonsense, STEM, Coding, and Long Context capabilities; on agent-related benchmarks, it surpasses larger leading models and can complete multiple end-to-end agent tasks.
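A minimal inference sketch with Hugging Face Transformers is shown below. The repository id `tencent/Youtu-LLM-2B-Base` is only an assumed placeholder for illustration; substitute the actual model id from this collection.

```python
# Minimal generation sketch for the base model, assuming a standard causal-LM
# checkpoint on the Hugging Face Hub. The repo id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Youtu-LLM-2B-Base"  # hypothetical id; use the one in this collection

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2B parameters fit comfortably on a single GPU in bf16
    device_map="auto",
    trust_remote_code=True,      # in case the repo ships custom modeling code
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since this is a base (non-instruct) model, plain continuation prompts like the one above are more appropriate than chat-style instructions.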
The table below compares Youtu-LLM-2B-Base against open base models of similar and larger size on general benchmarks:
| Type | Benchmark (Metric) | # Shots | Qwen3-1.7B-Base | SmolLM3-3B-Base | Gemma3-4B-Base | Qwen3-4B-Base | Llama3.1-8B | Youtu-LLM-2B-Base |
|---|---|---|---|---|---|---|---|---|
| Commonsense | MMLU-Pro (EM) | 5 | 34.9% | 35.3% | 29.4% | 46.1% | 36.2% | 48.4% |
| | MLQA-Zh (EM) | 3 | 38.1% | 38.0% | 40.3% | 47.2% | 43.0% | 43.5% |
| | MMLU-ProX-Zh (EM) | 5 | 32.5% | 26.7% | 24.2% | 45.2% | 25.4% | 40.7% |
| STEM | GSM8K (EM) | 8 | 68.2% | 67.3% | 38.5% | 80.8% | 47.8% | 77.6% |
| | MGSM-Zh (EM) | 8 | 57.1% | 40.7% | 33.0% | 69.7% | 35.9% | 68.9% |
| | MATH (EM) | 4 | 28.1% | 40.8% | 24.4% | 44.8% | 21.5% | 44.4% |
| | BBH (EM) | 3 | 53.0% | 59.8% | 51.6% | 70.8% | 62.9% | 59.8% |
| | GPQA-MC (Acc. Norm) | 5 | 30.4% | 26.6% | 28.6% | 37.8% | 30.1% | 33.3% |
| | HLE-MC (Acc. Norm) | 3 | 10.7% | 3.1% | 8.0% | 15.0% | 11.5% | 17.4% |
| Coding | MBPP (Pass@1) | 3 | 55.6% | 51.0% | 45.8% | 67.5% | 49.4% | 66.6% |
| | MBPP+ (Pass@1) | 3 | 71.0% | 66.1% | 61.9% | 80.8% | 62.7% | 81.8% |
| | HumanEval (Pass@1) | 0 | 49.9% | 34.8% | 36.6% | 57.6% | 36.0% | 64.6% |
| | HumanEval+ (Pass@1) | 0 | 41.3% | 28.1% | 28.1% | 49.9% | 28.1% | 57.3% |
| | LiveCodeBench v6 (Pass@1) | 3 | 5.1% | 2.9% | 2.9% | 6.9% | 3.4% | 9.7% |
| | CRUXEval (Pass@1) | 1 | 40.6% | 42.1% | 39.7% | 54.8% | 42.3% | 55.9% |
| | RepoBench (EM) | 3 | 21.0% | 21.8% | 23.0% | 25.3% | 25.2% | 22.7% |
| Long Context | LongBench v2 (Acc.) | 3 | 28.0% | 28.8% | 26.6% | 25.8% | 27.8% | 27.2% |
| | NIAH (Acc.) | / | 79.8% | 75.0% | 99.5% | 83.0% | 99.8% | 98.8% |
We use APTBench to evaluate the agentic capabilities of the base models.
| Category | Qwen3-1.7B-Base | SmolLM3-3B-Base | Gemma3-4B-Base | Qwen3-4B-Base | Llama3.1-8B | Youtu-LLM-2B-Base |
|---|---|---|---|---|---|---|
| Code | 25.1% | 24.3% | 32.8% | 41.9% | 23.6% | 37.9% |
| Deep Research | 28.5% | 27.2% | 36.4% | 40.5% | 30.0% | 38.6% |
| Math | 59.9% | 60.7% | 59.8% | 70.5% | 60.1% | 68.0% |
| Tool | 56.7% | 59.1% | 61.7% | 65.8% | 64.1% | 64.2% |