Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents Paper • 2603.20576 • Published 6 days ago • 1
Introducing HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Detecting Hallucinations in Real-World Scenarios Article • Published May 2, 2025 • 19
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing Paper • 2410.12189 • Published Oct 16, 2024 • 1
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences Paper • 2404.12272 • Published Apr 18, 2024 • 1