Abstract
Claw-Eval addresses limitations in agent benchmarks by providing comprehensive evaluation across multiple modalities with trajectory-aware grading and safety assessments.
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on document or image, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
Community
The trajectory-aware evaluation in Claw-Eval is exactly the move we needed to separate genuine capability from lucky outputs. My question: how sensitive are Pass@k and Pass^k to the logging cadence, or to desynchronization among traces, audit logs, and environment snapshots? The ArxivLens breakdown helped me parse the method details, especially how the three evidence channels are used (https://arxivlens.com/PaperView/Details/claw-eval-toward-trustworthy-evaluation-of-autonomous-agents-5634-89016710). An ablation varying the snapshot frequency or removing one channel could reveal which piece actually drives the robustness gains.
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
Claw-Eval is an end-to-end evaluation suite for autonomous agents that addresses a critical blind spot in current benchmarks: evaluating only final outputs misses dangerous intermediate behaviors. The suite covers 300 tasks across 9 categories and introduces trajectory-aware grading over 2,159 rubric items. By collecting three independent evidence channels and running Pass@k and Pass^k scoring across 3 trials, Claw-Eval provides trustworthy measurements of Completion, Safety, and Robustness. Testing 14 frontier models reveals that trajectory-opaque evaluation misses 44% of safety violations and 13% of robustness failures.
Key Idea
Claw-Eval collects three independent evidence channels for every agent execution: execution traces (step-by-step actions), audit logs (system-level records), and environment snapshots (state of the world at checkpoints). By cross-referencing all three channels, the grader catches problems that any single channel would miss -- such as an agent that produces a correct final output but takes dangerous intermediate steps.
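The cross-referencing idea can be sketched in a few lines. Everything below is hypothetical: the channel formats, the field names, and the disallowed-action check are illustrative placeholders, not the paper's actual schema or grading rules.

```python
# Hypothetical sketch of cross-channel safety grading: a run fails if ANY
# channel records a disallowed action or unsafe state, even when the final
# output is correct. All names and structures here are illustrative.

DISALLOWED = {"delete_user_data", "access_credentials"}

def grade_safety(trace: list[str], audit_log: list[str], snapshots: list[dict]) -> bool:
    """Return True iff no evidence channel shows a disallowed action or state."""
    # Channel 1: step-by-step execution trace of agent actions.
    if any(step in DISALLOWED for step in trace):
        return False
    # Channel 2: system-level audit log entries.
    if any(entry in DISALLOWED for entry in audit_log):
        return False
    # Channel 3: environment snapshots captured at checkpoints.
    if any(snap.get("credentials_exposed", False) for snap in snapshots):
        return False
    return True

# A trajectory-opaque grader would pass this run (the final output is fine),
# but the audit log reveals an unauthorized credential access along the way.
safe = grade_safety(
    trace=["open_file", "summarize"],
    audit_log=["open_file", "access_credentials"],
    snapshots=[{"credentials_exposed": False}],
)
print(safe)  # False
```

The point of three channels is redundancy: an agent that sanitizes its own trace would still be caught by the audit log or by a state diff in the snapshots.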
Method / Approach
The key methodological insight is the difference between trajectory-opaque and trajectory-aware evaluation. Trajectory-opaque evaluation only examines the final output, while trajectory-aware evaluation inspects the entire sequence of agent actions. The paper demonstrates that opaque evaluation systematically underestimates risk: it misses 44% of safety violations (e.g., agents accessing unauthorized resources before producing correct results) and 13% of robustness failures (e.g., agents that succeed through brittle retry loops).
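The "brittle retry loop" class of robustness failure can be made concrete with a small sketch. This is an assumption-laden illustration: the trace format, the retry threshold, and the detection rule are mine, not the paper's.

```python
# Hedged sketch: flagging a "brittle retry loop" from an execution trace of
# (action, succeeded) pairs. A run that repeats the same failing action more
# than max_retries times in a row is marked non-robust, even if it eventually
# succeeds. Threshold and trace format are illustrative, not from the paper.

def is_brittle(trace: list[tuple[str, bool]], max_retries: int = 2) -> bool:
    """True if any action failed more than max_retries times consecutively."""
    streak, last = 0, None
    for action, ok in trace:
        if not ok and action == last:
            streak += 1
            if streak > max_retries:
                return True
        elif not ok:
            streak, last = 1, action
        else:
            streak, last = 0, None
    return False

# The final output is correct, but the agent got there by hammering the same
# failing call four times -- trajectory-opaque grading would miss this.
trace = [("upload", False)] * 4 + [("upload", True)]
print(is_brittle(trace))  # True
```

A trajectory-opaque grader sees only the final successful upload; the trajectory-aware check exposes that success hinged on blind repetition.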
Results
Claw-Eval scores agents on three dimensions -- Completion (did it finish the task?), Safety (did it avoid harmful behaviors?), and Robustness (does it succeed consistently?) -- using Pass@k and Pass^k across 3 independent trials per task to account for variance. Across 14 evaluated frontier models, no model achieves strong scores on all three dimensions simultaneously, revealing fundamental tradeoffs in current agent architectures. The 2,159 rubric items ensure fine-grained coverage across the 300 tasks and 9 categories.
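The trial-level metrics above can be computed as follows, assuming the conventional definitions: Pass@k means the agent succeeded in at least one of k trials (peak capability), Pass^k means it succeeded in all k trials (consistency). The paper's exact aggregation may differ; the task names and outcomes below are illustrative.

```python
# Hedged sketch: Average Score, Pass@3, and Pass^3 from per-task trial
# outcomes, under the usual definitions of Pass@k and Pass^k.

def pass_at_k(trials: list[bool]) -> bool:
    """Pass@k: succeeded in at least one trial (peak capability)."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """Pass^k: succeeded in every trial (consistency)."""
    return all(trials)

# Per-task outcomes over 3 independent trials (illustrative data).
results = {
    "book_flight":  [True, True, True],    # consistent success
    "edit_video":   [True, False, True],   # flaky: counts for Pass@3 only
    "audit_report": [False, False, False], # consistent failure
}

n = len(results)
avg    = sum(sum(t) / len(t) for t in results.values()) / n
pass3  = sum(pass_at_k(t) for t in results.values()) / n
passh3 = sum(pass_hat_k(t) for t in results.values()) / n

print(f"Average Score: {avg:.2f}")  # 0.56
print(f"Pass@3: {pass3:.2f}")       # 0.67
print(f"Pass^3: {passh3:.2f}")      # 0.33
```

The gap between the two metrics is exactly what the error-injection result exploits: a flaky agent keeps its Pass@3 (one lucky trial suffices) while its Pass^3 collapses.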
Get this paper in your agent:
hf papers read 2604.06132
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash