How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings Paper • 2604.04323 • Published 4 days ago • 29
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models Paper • 2504.13367 • Published Apr 17, 2025 • 26