Abstract
A synthetic reasoning dataset called CHIMERA is introduced to overcome data-centric challenges in training large language models for cross-domain reasoning, achieving performance comparable to much larger models.
Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by post-training on high-quality reasoning data via supervised fine-tuning (SFT) and reinforcement learning (RL). However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) narrow domain coverage, as most existing open-source reasoning datasets concentrate on mathematics and leave broader scientific disciplines underrepresented; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
Community
We introduce CHIMERA, a compact but high-difficulty synthetic reasoning dataset with long Chain-of-Thought trajectories and broad multi-disciplinary coverage, designed for reasoning post-training of large language models.
Dataset
CHIMERA contains 9,225 expert-level problems spanning 8 subjects (Mathematics, Computer Science, Chemistry, Physics, Literature, History, Biology, Linguistics) and 1,179 fine-grained topics, all synthesized by GPT-5. Each problem comes with:
- A concise ground-truth answer and an authoritative reference solution (both GPT-5-generated),
- A long-form model solution with thinking traces from Qwen3-235B-A22B-Thinking-2507 or Qwen3.5-397B-A17B,
- Automated correctness labels from a GPT-5 + o4-mini verification panel.
Unlike existing reasoning datasets that are heavily math-focused or limited in solution length, CHIMERA provides structured domain diversity and long-horizon reasoning traces without any human annotation. Try our dataset on Hugging Face: TianHongZXY/CHIMERA 🤗.
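The verification panel described above can be sketched as a simple agreement check: a candidate sample survives only when the judge models independently confirm both problem validity and answer correctness. The sketch below is an illustration only; the unanimity rule, the `Verdict` structure, and the hard-coded panel outputs are assumptions standing in for actual GPT-5 / o4-mini API calls, which the paper does not specify at this level of detail.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    judge: str           # e.g. "gpt-5" or "o4-mini" (judge names from the paper)
    problem_valid: bool  # does this judge consider the problem well-posed?
    answer_correct: bool # does this judge confirm the reference answer?

def cross_validate(verdicts: list[Verdict]) -> bool:
    """Keep a sample only if every judge on the panel agrees that the
    problem is valid and the reference answer is correct (assumed rule)."""
    return bool(verdicts) and all(
        v.problem_valid and v.answer_correct for v in verdicts
    )

# Hypothetical panel outputs for one candidate problem:
panel = [
    Verdict("gpt-5", True, True),
    Verdict("o4-mini", True, True),
]
print(cross_validate(panel))  # True: both judges agree, so the sample is kept
```

Requiring unanimous agreement trades recall for precision: a stricter panel discards more borderline problems, which suits a small, high-difficulty dataset like this one.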
Models
We train Qwen3-4B-Thinking-2507 on CHIMERA through SFT followed by RL, yielding gains on nearly all major reasoning benchmarks:
| Benchmark | Qwen3-4B-Thinking-2507 | CHIMERA-4B-SFT | CHIMERA-4B-RL |
|---|---|---|---|
| GPQA-Diamond | 65.8 | 68.8 | 70.1 |
| AIME 2024 | 81.6 | 86.5 | 86.9 |
| AIME 2025 | 81.0 | 79.8 | 80.7 |
| AIME 2026 | 80.8 | 80.3 | 82.7 |
| HMMT Feb 2025 | 59.2 | 63.1 | 65.7 |
| HMMT Nov 2025 | 57.3 | 66.3 | 67.0 |
| HLE | 7.3 | 9.0 | 9.0 |
Models: CHIMERA-4B-SFT | CHIMERA-4B-RL