DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs Paper • 2601.03559 • Published 2 days ago • 2
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use Paper • 2504.07981 • Published Apr 4, 2025 • 2
AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness Paper • 2507.01702 • Published Jul 2, 2025 • 3
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models Paper • 2502.17924 • Published Feb 25, 2025
AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries? Paper • 2508.09631 • Published Aug 13, 2025
EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty Paper • 2510.00732 • Published Oct 1, 2025 • 5
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20, 2025 • 43
Mercury: Ultra-Fast Language Models Based on Diffusion Paper • 2506.17298 • Published Jun 17, 2025 • 7
ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges Paper • 2411.18932 • Published Nov 28, 2024 • 1
GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse Paper • 2401.01523 • Published Jan 3, 2024 • 1
Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models Paper • 2401.13298 • Published Jan 24, 2024
CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models Paper • 2405.00390 • Published May 1, 2024
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? Paper • 2408.10718 • Published Aug 20, 2024