GATE OpenING: A Comprehensive Benchmark for Judging Open-ended
Interleaved Image-Text Generation
Paper
• 2411.18499
• Published • 18
VLSBench: Unveiling Visual Leakage in Multimodal Safety
Paper
• 2411.19939
• Published • 10
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
Audio-Visual Information?
Paper
• 2412.02611
• Published • 25
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills
in LLMs
Paper
• 2412.03205
• Published • 19
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Paper
• 2412.06559
• Published • 86
OmniDocBench: Benchmarking Diverse PDF Document Parsing with
Comprehensive Annotations
Paper
• 2412.07626
• Published • 29
VisionArena: 230K Real World User-VLM Conversations with Preference
Labels
Paper
• 2412.08687
• Published • 13
SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Paper
• 2412.10319
• Published • 11
Are Your LLMs Capable of Stable Reasoning?
Paper
• 2412.13147
• Published • 93
Multi-Dimensional Insights: Benchmarking Real-World Personalization in
Large Multimodal Models
Paper
• 2412.12606
• Published • 41
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World
Tasks
Paper
• 2412.14161
• Published • 51
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented
Generation for Preference Alignment
Paper
• 2412.13746
• Published • 9
CodeElo: Benchmarking Competition-level Code Generation of LLMs with
Human-comparable Elo Ratings
Paper
• 2501.01257
• Published • 51
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
• 2501.02955
• Published • 44
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published • 44
WebWalker: Benchmarking LLMs in Web Traversal
Paper
• 2501.07572
• Published • 23
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Paper
• 2501.08292
• Published • 17
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published • 30
ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling
under Long-Context Scenario
Paper
• 2501.10132
• Published • 22
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
• 2501.12380
• Published • 84
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Paper
• 2501.11858
• Published • 7
Humanity's Last Exam
Paper
• 2501.14249
• Published • 77
Redundancy Principles for MLLMs Benchmarks
Paper
• 2501.13953
• Published • 29
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal
Models
Paper
• 2502.00698
• Published • 24
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large
Language Models
Paper
• 2502.07346
• Published • 53
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image
Interpretation
Paper
• 2502.08168
• Published • 12
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for
Reasoning Quality, Robustness, and Efficiency
Paper
• 2502.09621
• Published • 28
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published • 43
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance
Software Engineering?
Paper
• 2502.12115
• Published • 46
The Mirage of Model Editing: Revisiting Evaluation in the Wild
Paper
• 2502.11177
• Published • 10
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Paper
• 2502.14499
• Published • 195
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction
Following
Paper
• 2502.14494
• Published • 15
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for
Text-to-Vision Content
Paper
• 2503.02357
• Published • 7
Benchmarking Large Language Models for Multi-Language Software
Vulnerability Detection
Paper
• 2503.01449
• Published • 4
Benchmarking AI Models in Software Engineering: A Review, Search Tool,
and Enhancement Protocol
Paper
• 2503.05860
• Published • 11
Paper
• 2503.14378
• Published • 61
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Paper
• 2503.14478
• Published • 48
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published • 32
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process
Errors Identification
Paper
• 2503.12505
• Published • 11
PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for
Multimodal Large Language Models
Paper
• 2503.12545
• Published • 7
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
• 2503.19990
• Published • 35
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic
Faithfulness
Paper
• 2503.21755
• Published • 33
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual
Editing
Paper
• 2504.02826
• Published • 68
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
• 2504.07956
• Published • 46
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability
of Large Reasoning Models
Paper
• 2504.10368
• Published • 22
LLM-SRBench: A New Benchmark for Scientific Equation Discovery with
Large Language Models
Paper
• 2504.10415
• Published • 9
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published • 48
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating
Overthinking in Reasoning Models
Paper
• 2504.13367
• Published • 26
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published • 157
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures,
Languages, and Domains in Video Comprehension
Paper
• 2504.17821
• Published • 24
WorldGenBench: A World-Knowledge-Integrated Benchmark for
Reasoning-Driven Text-to-Image Generation
Paper
• 2505.01490
• Published • 5
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language
Models
Paper
• 2505.02735
• Published • 33
On Path to Multimodal Generalist: General-Level and General-Bench
Paper
• 2505.04620
• Published • 83
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and
Methodology
Paper
• 2507.07999
• Published • 51
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems
at Once
Paper
• 2507.10541
• Published • 30
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video
Reasoning and Understanding
Paper
• 2507.15028
• Published • 21
Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and
Regional Languages Around the Globe
Paper
• 2508.01691
• Published • 10
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
Paper
• 2508.11987
• Published • 72
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage,
but Not Direct the Play?
Paper
• 2509.03516
• Published • 12
GenExam: A Multidisciplinary Text-to-Image Exam
Paper
• 2509.14232
• Published • 21
AuditoryBench++: Can Language Models Understand Auditory Knowledge
without Hearing?
Paper
• 2509.17641
• Published • 4
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image
Generation
Paper
• 2510.18701
• Published • 68
IF-VidCap: Can Video Caption Models Follow Instructions?
Paper
• 2510.18726
• Published • 26
PICABench: How Far Are We from Physically Realistic Image Editing?
Paper
• 2510.17681
• Published • 65
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for
Visual Chain-of-Thought
Paper
• 2511.02779
• Published • 60
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
Paper
• 2512.21675
• Published • 25