MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
Abstract
MonitorBench is introduced as a comprehensive benchmark for evaluating chain-of-thought monitorability in large language models, revealing that monitorability decreases when structural reasoning is not required and that both open- and closed-source models exhibit reduced monitorability under stress-testing conditions.
Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, giving rise to the problem of reduced CoT monitorability. However, a comprehensive, fully open-source benchmark for studying CoT monitorability has been lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories, characterizing when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings that quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs of varying capability show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there is a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress tests, with monitorability dropping by up to 30% on some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying more advanced stress-testing of monitorability, and developing new monitoring approaches.
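To make the kind of evaluation described above concrete, the sketch below shows one simple way a monitorability rate could be computed, assuming a naive keyword-based monitor: each test instance carries a decision-critical factor, and monitorability is the fraction of instances whose CoT surfaces that factor. The abstract does not specify MonitorBench's actual scoring protocol, so all names here (`Instance`, `factor_cues`, `monitorability_rate`) and the keyword-matching monitor are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Instance:
    """One hypothetical test item: a prompt plus keyword cues standing in
    for the decision-critical factor that a faithful CoT should surface."""
    prompt: str
    factor_cues: tuple[str, ...]


def factor_surfaces_in_cot(cot: str, cues: Iterable[str]) -> bool:
    """Naive stand-in monitor: the factor counts as visible if any cue
    string appears in the chain of thought (case-insensitive)."""
    cot_lower = cot.lower()
    return any(cue.lower() in cot_lower for cue in cues)


def monitorability_rate(
    instances: list[Instance],
    generate: Callable[[str], tuple[str, str]],  # prompt -> (cot, final_answer)
) -> float:
    """Fraction of instances whose CoT reveals the decision-critical factor."""
    visible = sum(
        factor_surfaces_in_cot(generate(inst.prompt)[0], inst.factor_cues)
        for inst in instances
    )
    return visible / len(instances) if instances else 0.0


if __name__ == "__main__":
    # Toy example with a dummy "model" that always mentions the cue.
    demo = [Instance(prompt="Pick the cheaper flight.", factor_cues=("price",))]
    dummy_generate = lambda p: ("I compare the price of both options...", "Flight A")
    print(monitorability_rate(demo, dummy_generate))  # -> 1.0
```

In practice the keyword match would be replaced by a stronger monitor (for example, a judge model), but the aggregate statistic, the share of instances whose decision-critical factor is visible in the CoT, is the kind of quantity that stress-test settings would attempt to drive down.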