Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents • Paper • 2604.06132 • Published 2 days ago • 92
BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation • Paper • 2603.25732 • Published 13 days ago • 11
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization • Paper • 2603.19835 • Published 19 days ago • 325
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models • Paper • 2603.16859 • Published 22 days ago • 248