Abstract
Claw-Eval addresses limitations in agent benchmarks by providing comprehensive evaluation across multiple modalities with trajectory-aware grading and safety assessments.
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on document or image, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
Community
The trajectory-aware evaluation in Claw-Eval is exactly the move we needed to separate genuine capability from lucky outputs. My question: how sensitive are Pass@k and Pass^k to the logging cadence, or to desynchronization among traces, audit logs, and environment snapshots? The ArxivLens breakdown helped me parse the method details, especially how the three evidence channels are used (https://arxivlens.com/PaperView/Details/claw-eval-toward-trustworthy-evaluation-of-autonomous-agents-5634-89016710). An ablation varying the snapshot frequency or removing one channel could reveal which piece actually drives the robustness gains.
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
Claw-Eval is an end-to-end evaluation suite for autonomous agents that addresses a critical blind spot in current benchmarks: evaluating only final outputs misses dangerous intermediate behaviors. The suite covers 300 tasks across 9 categories and introduces trajectory-aware grading over 2,159 rubric items. By collecting three independent evidence channels and running Pass@k and Pass^k scoring across 3 trials, Claw-Eval provides trustworthy measurements of Completion, Safety, and Robustness. Testing 14 frontier models reveals that trajectory-opaque evaluation misses 44% of safety violations and 13% of robustness failures.
Key Idea
Claw-Eval collects three independent evidence channels for every agent execution: execution traces (step-by-step actions), audit logs (system-level records), and environment snapshots (state of the world at checkpoints). By cross-referencing all three channels, the grader catches problems that any single channel would miss -- such as an agent that produces a correct final output but takes dangerous intermediate steps.
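The cross-referencing idea can be sketched in a few lines. Everything below is hypothetical: the channel formats, the field names, and the disallowed-action check are illustrative placeholders, not the paper's actual schema or grading rules.

```python
# Hypothetical sketch of cross-channel safety grading: a run fails if ANY
# channel records a disallowed action or unsafe state, even when the final
# output is correct. All names and structures here are illustrative.

DISALLOWED = {"delete_user_data", "access_credentials"}

def grade_safety(trace: list[str], audit_log: list[str], snapshots: list[dict]) -> bool:
    """Return True iff no evidence channel shows a disallowed action or state."""
    # Channel 1: step-by-step execution trace of agent actions.
    if any(step in DISALLOWED for step in trace):
        return False
    # Channel 2: system-level audit log entries.
    if any(entry in DISALLOWED for entry in audit_log):
        return False
    # Channel 3: environment snapshots captured at checkpoints.
    if any(snap.get("credentials_exposed", False) for snap in snapshots):
        return False
    return True

# A trajectory-opaque grader would pass this run (the final output is fine),
# but the audit log reveals an unauthorized credential access along the way.
safe = grade_safety(
    trace=["open_file", "summarize"],
    audit_log=["open_file", "access_credentials"],
    snapshots=[{"credentials_exposed": False}],
)
print(safe)  # False
```

The point of three channels is redundancy: an agent that sanitizes its own trace would still be caught by the audit log or by a state diff in the snapshots.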
Method / Approach
The key methodological insight is the difference between trajectory-opaque and trajectory-aware evaluation. Trajectory-opaque evaluation only examines the final output, while trajectory-aware evaluation inspects the entire sequence of agent actions. The paper demonstrates that opaque evaluation systematically underestimates risk: it misses 44% of safety violations (e.g., agents accessing unauthorized resources before producing correct results) and 13% of robustness failures (e.g., agents that succeed through brittle retry loops).
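The "brittle retry loop" class of robustness failure can be made concrete with a small sketch. This is an assumption-laden illustration: the trace format, the retry threshold, and the detection rule are mine, not the paper's.

```python
# Hedged sketch: flagging a "brittle retry loop" from an execution trace of
# (action, succeeded) pairs. A run that repeats the same failing action more
# than max_retries times in a row is marked non-robust, even if it eventually
# succeeds. Threshold and trace format are illustrative, not from the paper.

def is_brittle(trace: list[tuple[str, bool]], max_retries: int = 2) -> bool:
    """True if any action failed more than max_retries times consecutively."""
    streak, last = 0, None
    for action, ok in trace:
        if not ok and action == last:
            streak += 1
            if streak > max_retries:
                return True
        elif not ok:
            streak, last = 1, action
        else:
            streak, last = 0, None
    return False

# The final output is correct, but the agent got there by hammering the same
# failing call four times -- trajectory-opaque grading would miss this.
trace = [("upload", False)] * 4 + [("upload", True)]
print(is_brittle(trace))  # True
```

A trajectory-opaque grader sees only the final successful upload; the trajectory-aware check exposes that success hinged on blind repetition.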
Results
Claw-Eval scores agents on three dimensions -- Completion (did it finish the task?), Safety (did it avoid harmful behaviors?), and Robustness (does it succeed consistently?) -- using Pass@k and Pass^k across 3 independent trials per task to account for variance. Across 14 evaluated frontier models, no model achieves strong scores on all three dimensions simultaneously, revealing fundamental tradeoffs in current agent architectures. The 2,159 rubric items ensure fine-grained coverage across the 300 tasks and 9 categories.
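The trial-level metrics above can be computed as follows, assuming the conventional definitions: Pass@k means the agent succeeded in at least one of k trials (peak capability), Pass^k means it succeeded in all k trials (consistency). The paper's exact aggregation may differ; the task names and outcomes below are illustrative.

```python
# Hedged sketch: Average Score, Pass@3, and Pass^3 from per-task trial
# outcomes, under the usual definitions of Pass@k and Pass^k.

def pass_at_k(trials: list[bool]) -> bool:
    """Pass@k: succeeded in at least one trial (peak capability)."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """Pass^k: succeeded in every trial (consistency)."""
    return all(trials)

# Per-task outcomes over 3 independent trials (illustrative data).
results = {
    "book_flight":  [True, True, True],    # consistent success
    "edit_video":   [True, False, True],   # flaky: counts for Pass@3 only
    "audit_report": [False, False, False], # consistent failure
}

n = len(results)
avg    = sum(sum(t) / len(t) for t in results.values()) / n
pass3  = sum(pass_at_k(t) for t in results.values()) / n
passh3 = sum(pass_hat_k(t) for t in results.values()) / n

print(f"Average Score: {avg:.2f}")  # 0.56
print(f"Pass@3: {pass3:.2f}")       # 0.67
print(f"Pass^3: {passh3:.2f}")      # 0.33
```

The gap between the two metrics is exactly what the error-injection result exploits: a flaky agent keeps its Pass@3 (one lucky trial suffices) while its Pass^3 collapses.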
Get this paper in your agent:
hf papers read 2604.06132
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash