Scaling Open-Ended Reasoning to Predict the Future Paper • 2512.25070 • Published 1 day ago • 11
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision Paper • 2509.14234 • Published Sep 17, 2025 • 5
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Paper • 2509.09677 • Published Sep 11, 2025 • 34 • 4
answer-matching Collection Free-form datasets, human annotations, and sample-level model outputs for "Answer Matching Outperforms Multiple Choice for Language Model Evaluation" • 2 items • Updated Jul 3, 2025 • 2
Answer Matching Outperforms Multiple Choice for Language Model Evaluation Paper • 2507.02856 • Published Jul 3, 2025 • 8 • 2
Pitfalls in Evaluating Language Model Forecasters Paper • 2506.00723 • Published May 31, 2025 • 3 • 2