Beta version: Information might not be fully accurate. Please report any discrepancies.
Explore 206 benchmarks across 8 capability domains.
Reasoning, scientific understanding, and complex problem-solving abilities
Contamination-free, continuously updated reasoning benchmark.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Humanity's Last Exam full evaluation without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Complex Research using Integrated Thinking - Physics Test. Research-level physics reasoning.
Open-domain factuality benchmark focusing on short, verifiable answers.
Medical knowledge and diagnostic reasoning evaluation.
Extremely difficult expert-level science questions.
Graduate-Level Google-Proof Q&A Benchmark.
Physics reasoning and problem solving benchmark.
Abstraction and Reasoning Corpus - Level 1.
Abstraction and Reasoning Corpus - Level 2 (Extreme difficulty).
Expert-level chemistry knowledge and reasoning.
Korean reasoning and language understanding benchmark.
Scientific Olympiad level problems.
Scientific programming benchmark for code synthesis and correctness.
Open-ended scientific research benchmark with expert-level questions.
Biology and life-science benchmark requiring deep domain reasoning.
Hard scientific reasoning benchmark inspired by olympiad-level tasks.
High-level coding outcome quality benchmark for agent-driven development.
Natural language to repository-wide code edits benchmark.
Pass@1 metric for repository-scale code modification tasks (a pass@k estimator is sketched after this domain's entries).
Complex language benchmark covering difficult enterprise workflows.
Task-oriented benchmark for complex instruction execution.
Reference-heavy task-oriented benchmark requiring retrieval fidelity.
Hard-split medical reasoning benchmark.
Diamond subset for difficult planning and valuation tasks.
Expert-level evaluation benchmark across specialist domains.
Task-oriented benchmark for K12 educational tasks.
Compositional instruction-following benchmark with chained constraints.
Classification-focused track of task-oriented benchmark suite.
Extraction-focused benchmark for structured information tasks.
Vision-language travel-planning and grounded reasoning benchmark.
Text-only travel-planning and itinerary reasoning benchmark.
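Several entries in this group report pass@1 or pass@k. The exact scoring protocol is not documented here; as a hedged illustration only, the commonly used unbiased pass@k estimator over n sampled attempts with c passing can be computed as in the sketch below (all names illustrative):

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: probability that at least one of k
        samples drawn from n attempts (c of them correct) passes."""
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # Example: 3 of 10 attempts on a task pass its tests.
    print(pass_at_k(10, 3, 1))  # 0.3, i.e. expected pass@1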
World knowledge, multilingual capabilities, and real-world understanding
Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.
Chatbot Arena Elo score. Crowd-sourced human preference ranking (an Elo-style update is sketched after this domain's entries).
Artificial Analysis aggregate intelligence index.
Verified subset of SimpleQA for parametric knowledge evaluation.
Massive Multilingual Language Understanding.
Physical Interaction QA across multiple languages and cultures.
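For the Elo-based entries: Chatbot Arena's published methodology fits a Bradley-Terry-style model over many pairwise votes, so the sketch below is only the classic online Elo update, shown to illustrate the underlying pairwise-rating idea (ratings and K-factor illustrative):

    def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
        """One online Elo update after a head-to-head comparison.
        score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return r_a_new, r_b_new

    # Example: a 1200-rated model beats a 1300-rated model.
    print(elo_update(1200.0, 1300.0, 1.0))  # A rises to ~1220.5, B falls to ~1279.5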
Code generation, software engineering, and programming tasks
Functional correctness of synthesized programs from docstrings (a minimal test-harness sketch follows this domain's entries).
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Next-generation HumanEval with more diverse library calls and more complex tasks.
Competitive programming rating based on problem solving.
Contamination-free coding benchmark using recent problems.
Competitive programming problems from Codeforces, ICPC, and IOI with Elo rating.
Artificial Analysis aggregate coding capability index.
Research-grade coding and software development tasks.
Cybersecurity-flavored coding benchmark in simulated environments.
Online-judge competitive coding benchmark focused on C++ tasks.
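For the docstring-to-code entries above, functional correctness is typically judged by executing the model's completion against unit tests. The sketch below is a hedged illustration, not any benchmark's actual harness: the task, test format, and names are made up, and real harnesses sandbox execution rather than calling exec on untrusted code directly.

    def check_candidate(candidate_src: str, entry_point: str, tests) -> bool:
        """Run a model-generated completion against unit tests.
        `tests` is a list of (args, expected) pairs; names are illustrative."""
        namespace: dict = {}
        try:
            exec(candidate_src, namespace)  # define the candidate function (unsandboxed here)
            fn = namespace[entry_point]
            return all(fn(*args) == expected for args, expected in tests)
        except Exception:
            return False

    # Example: the model was given the signature/docstring of add(a, b) and completed the body.
    candidate = "def add(a, b):\n    return a + b\n"
    print(check_candidate(candidate, "add", [((2, 3), 5), ((-1, 1), 0)]))  # True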
Mathematical reasoning, competition math, and quantitative problem-solving
Challenging competition mathematics problems (AIME/IMO level).
Grade school math word problems requiring multi-step reasoning.
American Invitational Mathematics Examination. Competition-level math.
Future prediction of AIME performance levels.
Artificial Analysis aggregate math capability index.
Harvard-MIT Mathematics Tournament - High difficulty competition math.
500-problem math benchmark for broad quantitative reasoning.
International Mathematical Olympiad style answer-only benchmark.
Competitive math arena for top-tier reasoning models.
American Invitational Mathematics Examination 2025 problems.
William Lowell Putnam Mathematical Competition problems - top 200 level difficulty.
Tool use, agentic workflows, and instruction following
Comprehensive framework to evaluate LLMs as agents across diverse environments.
Instruction Following Evaluation for Large Language Models. Measures ability to follow strict formatting and constraint requirements.
Reverse instruction following evaluation.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
Advanced instruction-following benchmark with verified grading.
Complex, multi-constraint instruction following tasks.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Hard split of Terminal-Bench focused on tougher terminal workflows.
Verified desktop computer-use benchmark for end-to-end task completion.
Browser-based autonomous task execution benchmark.
Software engineering task completion in multi-step coding workflows.
Multi-repository software engineering benchmark.
Higher-difficulty SWE-bench subset for frontier coding agents.
Software engineering performance across multilingual codebases.
Evolutionary coding benchmark focused on long-horizon bug fixing.
Multi-language coding agent benchmark with editor-in-the-loop tasks.
Agent ability to produce complete, runnable software artifacts.
Short-form coding QA with executable correctness checks.
Verified spreadsheet manipulation and reasoning tasks.
Web browsing + synthesis benchmark for research agents.
BrowseComp variant with explicit context-window management.
Multi-agent swarm variant of BrowseComp.
Chinese-language browsing and synthesis benchmark.
Text-only variant of Humanity's Last Exam under agentic settings.
Verified subset of Humanity's Last Exam for reproducible evaluation.
Broad retrieval and synthesis benchmark across many sources.
Multi-agent swarm variant of WideSearch.
Finance-focused search and evidence-grounded answering benchmark.
Tier 2 and Tier 3 slices of FinSearchComp.
Long-horizon business simulation benchmark (scored by final account balance).
Factuality benchmark across grounding, parametric, search, and multimodal.
Multi-step workflows using Model Context Protocol.
Long horizon real-world software tasks.
Deep multi-hop search QA for long-horizon agents.
Strategic environment-agent loop benchmark.
Artificial Analysis GDPVal benchmark for knowledge-work quality.
Tool-use and API orchestration benchmark for assistants.
Retail-domain tool-use and workflow benchmark from τ²-bench.
Telecom-domain tool-use and workflow benchmark.
Model Context Protocol interoperability benchmark.
Function calling reliability benchmark (v4). A schema-validation sketch follows this domain's entries.
Virtual task assistant benchmark across practical workflows.
Consulting-style multi-step reasoning and recommendation benchmark.
Long-horizon research task benchmark with citation requirements.
Rubric-based evaluation of research quality and rigor.
Verified embodied-agent benchmark in Minecraft-style tasks.
Multimodal browse + synthesize benchmark for web agents.
Vision-language variant of Humanity's Last Exam under agentic settings.
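For the function-calling and tool-use entries above, reliability is commonly scored by checking whether an emitted call names a declared tool and supplies required, correctly typed arguments. The declaration format below is illustrative only, loosely modeled on JSON-Schema-style tool specs, and is not any benchmark's actual specification:

    import json

    # Illustrative tool declaration (hypothetical names, not a real API).
    WEATHER_TOOL = {
        "name": "get_weather",
        "required": ["city"],
        "types": {"city": str, "unit": str},
    }

    def is_valid_call(raw_call: str, tool: dict) -> bool:
        """Check that a model-emitted call names the declared tool and
        supplies required, correctly typed arguments."""
        try:
            call = json.loads(raw_call)
        except json.JSONDecodeError:
            return False
        if call.get("name") != tool["name"]:
            return False
        args = call.get("arguments", {})
        if any(param not in args for param in tool["required"]):
            return False
        return all(isinstance(v, tool["types"].get(k, object)) for k, v in args.items())

    # A well-formed call passes, a call missing a required argument does not.
    print(is_valid_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}', WEATHER_TOOL))  # True
    print(is_valid_call('{"name": "get_weather", "arguments": {}}', WEATHER_TOOL))                # False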
Image understanding, video analysis, and multimodal capabilities
Multi-discipline Multimodal Understanding and Reasoning.
Mathematical reasoning in visual contexts.
Compact MathVista split for faster multimodal reasoning checks.
Comprehensive mathematical vision benchmark.
Massive Multi-discipline Multimodal Understanding and Reasoning.
Logical reasoning in visual puzzles and diagrams.
Spatial and perception benchmark for multimodal models.
Multimodal visual perception benchmark.
Expert-level chart understanding and question answering.
Document visual question answering on scanned and digital documents.
Next-gen optical character recognition and document understanding.
Optical character recognition and document understanding benchmark.
Dynamic mathematical reasoning in visual contexts.
Mathematical competition problems with visual elements.
Multi-step mathematical reasoning on a canvas.
Professional level MMMU expansion.
Validation split of MMMU for multimodal understanding.
Expert-level Multimodal Mathematics Analysis.
Scientific Figure Evaluation.
High-level Physics Olympiad (Vision).
Cross-domain Logical Reasoning and Spatial benchmark.
Physics reasoning with open-ended visual questions.
Visual Perception and Coding Tasks.
Zero-shot visual reasoning benchmark.
Zero-shot visual reasoning sub-tasks.
Aggregate ZeroBench score across the full task set.
ZeroBench score when tool use is allowed.
ARC-AGI Level 1 tasks in image format.
ARC-AGI Level 2 tasks in image format.
Visual logic and sequence reasoning.
Evaluating bias in Vision-Language Models.
Evaluating perception failures in VLMs.
Visual factor identification and reasoning.
Real-world visual question answering.
Early-stage visual development benchmark.
Visual hallucination and factuality benchmark.
Multimodal Evaluation (Cognitive Capacity).
Elite multimodal model evaluation.
Multimodal Understanding and Interaction Benchmark.
Multilingual Text-centric Visual QA.
Global visual knowledge and reasoning.
Subjective and intuitive visual quality evaluation.
Visual Verification and reasoning.
Visual object counting and identification.
Few-shot counting benchmark (lower raw scores are better; this is accounted for during score normalization).
Visual pointing and spatial grounding.
Multimodal Spatial Interaction Benchmark.
Hierarchical visual reasoning tasks.
Referential spatial reasoning evaluation.
Document Analysis and reasoning (2k).
Multi-perspective visual understanding.
Environment Reasoning and Question Answering.
Universal document understanding benchmark.
OCR benchmark measuring edit distance (lower is better; an edit-distance sketch follows this domain's entries).
Screen understanding benchmark for GUI interaction.
Information-seeking visual question answering on the test split.
Chart-based reasoning from arXiv papers (Data QA).
Chart-based reasoning from arXiv papers (Reasoning QA).
Information synthesis from complex charts.
Document Understanding and Dialogue Evaluation.
Multimodal Long context benchmark.
Long document understanding with URLs.
Multimodal Long context document evaluation.
Multimodal Video Understanding.
Verifiable question answering for short video clips.
Complex reasoning tasks in video content.
Sequence reasoning and motion understanding.
Deep diagnostic video understanding.
Long-form video reasoning and knowledge retrieval.
Continuous Physics reasoning in video.
Temporal orientation and perception in video.
First-person perspective temporal reasoning.
Comprehensive motion perception evaluation.
Temporal Object-centric Multimodal Analysis.
Contextual Grounding in long videos.
Understanding extremely long-form video content.
Professional level video quality and content evaluation.
Large-scale Video Benchmark.
Cross-video temporal and relational reasoning.
Live sports broadcast understanding.
Object-Video-Object relational reasoning.
Open-Domain Video understanding.
Video-to-speech and dialogue reasoning.
Short-form visual question answering with verifiable responses.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.
Television/video narrative understanding benchmark.
Open-world video understanding benchmark.
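For the OCR entries above, edit distance usually means Levenshtein distance between predicted and reference text, often normalized by length so that lower is better. A minimal sketch follows, assuming character-level Levenshtein and length normalization; the benchmarks' exact normalization is not documented here.

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions, and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def normalized_edit_distance(pred: str, ref: str) -> float:
        """Edit distance scaled to [0, 1] by the longer string; lower is better."""
        if not pred and not ref:
            return 0.0
        return levenshtein(pred, ref) / max(len(pred), len(ref))

    print(normalized_edit_distance("kitten", "sitting"))  # 3 / 7 ≈ 0.43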
Performance on extended documents and long-context reasoning
Multi-Round Context Retrieval - 8-needle test.
Comprehensive long-context understanding (128k).
Artificial Analysis Long Context Reasoning benchmark. Evaluates reasoning over long contexts.
Traversal-based long context reasoning using BFS (128k).
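The exact task format of the BFS traversal entry is not documented here. As an illustration of the breadth-first traversal such tasks ask models to perform over relations embedded in a long prompt, here is a generic shortest-path BFS over an adjacency list (graph and names illustrative):

    from collections import deque

    def bfs_path(graph: dict, start: str, goal: str):
        """Breadth-first search: return the shortest hop path from start to goal
        in an adjacency-list graph, or None if the goal is unreachable."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == goal:
                return path
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(path + [neighbor])
        return None

    # Example graph like one a long-context traversal task might embed in its prompt.
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
    print(bfs_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']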
Accuracy, hallucination resistance, and factual reliability
Factuality in long-form conceptual generations.
Evaluates model omniscience and factual reliability across diverse domains.
Precision of fine-grained facts in long-form biographies.
Factuality in long-form generations about objects.
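Long-form factuality scores of this kind are often computed FActScore-style: decompose the generation into atomic claims and report the fraction supported by a reference source. The sketch below is a hedged illustration with a made-up claim set, not any of these benchmarks' actual pipelines.

    def fact_precision(claims, supported) -> float:
        """Fraction of atomic claims in a long-form answer that are supported
        by the reference source (FActScore-style precision; names illustrative)."""
        if not claims:
            return 0.0
        return sum(1 for c in claims if c in supported) / len(claims)

    # Example: a generated biography decomposed into 4 atomic claims,
    # 3 of which are verified against the reference source.
    claims = ["born in 1867", "won two Nobel Prizes", "discovered radium", "born in Paris"]
    supported = {"born in 1867", "won two Nobel Prizes", "discovered radium"}
    print(fact_precision(claims, supported))  # 0.75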