Beta version: Information might not be fully accurate. Please report any discrepancies.
Explore 206 benchmarks across 8 capability domains.
Reasoning, scientific understanding, and complex problem-solving abilities
Contamination-free, continuously updated reasoning benchmark.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Humanity's Last Exam full evaluation without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Complex Research using Integrated Thinking - Physics Test. Research-level physics reasoning.
Open-domain factuality benchmark focusing on short, verifiable answers.
Medical knowledge and diagnostic reasoning evaluation.
Extremely difficult expert-level science questions.
Graduate-Level Google-Proof Q&A Benchmark.
Physics reasoning and problem solving benchmark.
Abstraction and Reasoning Corpus - Level 1.
Abstraction and Reasoning Corpus - Level 2 (Extreme difficulty).
Expert-level chemistry knowledge and reasoning.
Korean reasoning and language understanding benchmark.
Scientific Olympiad level problems.
Scientific programming benchmark for code synthesis and correctness.
Open-ended scientific research benchmark with expert-level questions.
Biology and life-science benchmark requiring deep domain reasoning.
Hard scientific reasoning benchmark inspired by olympiad-level tasks.
High-level coding outcome quality benchmark for agent-driven development.
Natural language to repository-wide code edits benchmark.
Pass@1 metric for repository-scale code modification tasks (a pass@k estimator is sketched after this domain's entries).
Complex language benchmark covering difficult enterprise workflows.
Task-oriented benchmark for complex instruction execution.
Reference-heavy task-oriented benchmark requiring retrieval fidelity.
Hard-split medical reasoning benchmark.
Diamond subset for difficult planning and valuation tasks.
Expert-level evaluation benchmark across specialist domains.
Task-oriented benchmark for K12 educational tasks.
Compositional instruction-following benchmark with chained constraints.
Classification-focused track of task-oriented benchmark suite.
Extraction-focused benchmark for structured information tasks.
Vision-language travel-planning and grounded reasoning benchmark.
Text-only travel-planning and itinerary reasoning benchmark.
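Several entries in this group report pass@1 or pass@k. The exact scoring protocol is not documented here; as a hedged illustration only, the commonly used unbiased pass@k estimator over n sampled attempts with c passing can be computed as in the sketch below (all names illustrative):

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: probability that at least one of k
        samples drawn from n attempts (c of them correct) passes."""
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # Example: 3 of 10 attempts on a task pass its tests.
    print(pass_at_k(10, 3, 1))  # 0.3, i.e. expected pass@1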
World knowledge, multilingual capabilities, and real-world understanding
Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.
Chatbot Arena Elo score. Crowd-sourced human preference ranking (an Elo-style update is sketched after this domain's entries).
Artificial Analysis aggregate intelligence index.
Verified subset of SimpleQA for parametric knowledge evaluation.
Massive Multilingual Language Understanding.
Physical Interaction QA across multiple languages and cultures.
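For the Elo-based entries: Chatbot Arena's published methodology fits a Bradley-Terry-style model over many pairwise votes, so the sketch below is only the classic online Elo update, shown to illustrate the underlying pairwise-rating idea (ratings and K-factor illustrative):

    def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
        """One online Elo update after a head-to-head comparison.
        score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return r_a_new, r_b_new

    # Example: a 1200-rated model beats a 1300-rated model.
    print(elo_update(1200.0, 1300.0, 1.0))  # A rises to ~1220.5, B falls to ~1279.5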
Code generation, software engineering, and programming tasks
Functional correctness of synthesized programs from docstrings (a minimal test-harness sketch follows this domain's entries).
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Next-generation HumanEval with more diverse library calls and more complex tasks.
Competitive programming rating based on problem solving.
Contamination-free coding benchmark using recent problems.
Competitive programming problems from Codeforces, ICPC, and IOI with Elo rating.
Artificial Analysis aggregate coding capability index.
Research-grade coding and software development tasks.
Cybersecurity-flavored coding benchmark in simulated environments.
Online-judge competitive coding benchmark focused on C++ tasks.
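For the docstring-to-code entries above, functional correctness is typically judged by executing the model's completion against unit tests. The sketch below is a hedged illustration, not any benchmark's actual harness: the task, test format, and names are made up, and real harnesses sandbox execution rather than calling exec on untrusted code directly.

    def check_candidate(candidate_src: str, entry_point: str, tests) -> bool:
        """Run a model-generated completion against unit tests.
        `tests` is a list of (args, expected) pairs; names are illustrative."""
        namespace: dict = {}
        try:
            exec(candidate_src, namespace)  # define the candidate function (unsandboxed here)
            fn = namespace[entry_point]
            return all(fn(*args) == expected for args, expected in tests)
        except Exception:
            return False

    # Example: the model was given the signature/docstring of add(a, b) and completed the body.
    candidate = "def add(a, b):\n    return a + b\n"
    print(check_candidate(candidate, "add", [((2, 3), 5), ((-1, 1), 0)]))  # True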
Mathematical reasoning, competition math, and quantitative problem-solving
Challenging competition mathematics problems (AIME/IMO level).
Grade school math word problems requiring multi-step reasoning.
American Invitational Mathematics Examination. Competition-level math.
Future prediction of AIME performance levels.
Artificial Analysis aggregate math capability index.
Harvard-MIT Mathematics Tournament - High difficulty competition math.
500-problem math benchmark for broad quantitative reasoning.
International Mathematical Olympiad style answer-only benchmark.
Competitive math arena for top-tier reasoning models.
American Invitational Mathematics Examination 2025 problems.
William Lowell Putnam Mathematical Competition problems - top 200 level difficulty.
Tool use, agentic workflows, and instruction following
Comprehensive framework to evaluate LLMs as agents across diverse environments.
Instruction Following Evaluation for Large Language Models. Measures ability to follow strict formatting and constraint requirements.
Reverse instruction following evaluation.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
Advanced instruction-following benchmark with verified grading.
Complex, multi-constraint instruction following tasks.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Hard split of Terminal-Bench focused on tougher terminal workflows.
Verified desktop computer-use benchmark for end-to-end task completion.
Browser-based autonomous task execution benchmark.
Software engineering task completion in multi-step coding workflows.
Multi-repository software engineering benchmark.
Higher-difficulty SWE-bench subset for frontier coding agents.
Software engineering performance across multilingual codebases.
Evolutionary coding benchmark focused on long-horizon bug fixing.
Multi-language coding agent benchmark with editor-in-the-loop tasks.
Agent ability to produce complete, runnable software artifacts.
Short-form coding QA with executable correctness checks.
Verified spreadsheet manipulation and reasoning tasks.
Web browsing + synthesis benchmark for research agents.
BrowseComp variant with explicit context-window management.
Multi-agent swarm variant of BrowseComp.
Chinese-language browsing and synthesis benchmark.
Text-only variant of Humanity's Last Exam under agentic settings.
Verified subset of Humanity's Last Exam for reproducible evaluation.
Broad retrieval and synthesis benchmark across many sources.
Multi-agent swarm variant of WideSearch.
Finance-focused search and evidence-grounded answering benchmark.
Tier 2 and Tier 3 slices of FinSearchComp.
Long-horizon business simulation benchmark (scored by final account balance).
Factuality benchmark across grounding, parametric, search, and multimodal.
Multi-step workflows using Model Context Protocol.
Long horizon real-world software tasks.
Deep multi-hop search QA for long-horizon agents.
Strategic environment-agent loop benchmark.
Artificial Analysis GDPVal benchmark for knowledge-work quality.
Tool-use and API orchestration benchmark for assistants.
Retail-domain tool-use and workflow benchmark from τ²-bench.
Telecom-domain tool-use and workflow benchmark.
Model Context Protocol interoperability benchmark.
Function calling reliability benchmark (v4). A schema-validation sketch follows this domain's entries.
Virtual task assistant benchmark across practical workflows.
Consulting-style multi-step reasoning and recommendation benchmark.
Long-horizon research task benchmark with citation requirements.
Rubric-based evaluation of research quality and rigor.
Verified embodied-agent benchmark in Minecraft-style tasks.
Multimodal browse + synthesize benchmark for web agents.
Vision-language variant of Humanity's Last Exam under agentic settings.
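For the function-calling and tool-use entries above, reliability is commonly scored by checking whether an emitted call names a declared tool and supplies required, correctly typed arguments. The declaration format below is illustrative only, loosely modeled on JSON-Schema-style tool specs, and is not any benchmark's actual specification:

    import json

    # Illustrative tool declaration (hypothetical names, not a real API).
    WEATHER_TOOL = {
        "name": "get_weather",
        "required": ["city"],
        "types": {"city": str, "unit": str},
    }

    def is_valid_call(raw_call: str, tool: dict) -> bool:
        """Check that a model-emitted call names the declared tool and
        supplies required, correctly typed arguments."""
        try:
            call = json.loads(raw_call)
        except json.JSONDecodeError:
            return False
        if call.get("name") != tool["name"]:
            return False
        args = call.get("arguments", {})
        if any(param not in args for param in tool["required"]):
            return False
        return all(isinstance(v, tool["types"].get(k, object)) for k, v in args.items())

    # A well-formed call passes, a call missing a required argument does not.
    print(is_valid_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}', WEATHER_TOOL))  # True
    print(is_valid_call('{"name": "get_weather", "arguments": {}}', WEATHER_TOOL))                # False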
Image understanding, video analysis, and multimodal capabilities
Multi-discipline Multimodal Understanding and Reasoning.
Mathematical reasoning in visual contexts.
Compact MathVista split for faster multimodal reasoning checks.
Comprehensive mathematical vision benchmark.
Massive Multi-discipline Multimodal Understanding and Reasoning.
Logical reasoning in visual puzzles and diagrams.
Spatial and perception benchmark for multimodal models.
Multimodal visual perception benchmark.
Expert-level chart understanding and question answering.
Document visual question answering on scanned and digital documents.
Next-gen optical character recognition and document understanding.
Optical character recognition and document understanding benchmark.
Dynamic mathematical reasoning in visual contexts.
Mathematical competition problems with visual elements.
Multi-step mathematical reasoning on a canvas.
Professional level MMMU expansion.
Validation split of MMMU for multimodal understanding.
Expert-level Multimodal Mathematics Analysis.
Scientific Figure Evaluation.
High-level Physics Olympiad (Vision).
Cross-domain Logical Reasoning and Spatial benchmark.
Physics reasoning with open-ended visual questions.
Visual Perception and Coding Tasks.
Zero-shot visual reasoning benchmark.
Zero-shot visual reasoning sub-tasks.
Aggregate ZeroBench score across the full task set.
ZeroBench score when tool use is allowed.
ARC-AGI Level 1 tasks in image format.
ARC-AGI Level 2 tasks in image format.
Visual logic and sequence reasoning.
Evaluating bias in Vision-Language Models.
Evaluating perception failures in VLMs.
Visual factor identification and reasoning.
Real-world visual question answering.
Early-stage visual development benchmark.
Visual hallucination and factuality benchmark.
Multimodal Evaluation (Cognitive Capacity).
Elite multimodal model evaluation.
Multimodal Understanding and Interaction Benchmark.
Multilingual Text-centric Visual QA.
Global visual knowledge and reasoning.
Subjective and intuitive visual quality evaluation.
Visual Verification and reasoning.
Visual object counting and identification.
Few-shot counting benchmark (lower raw scores are better; this is accounted for during score normalization).
Visual pointing and spatial grounding.
Multimodal Spatial Interaction Benchmark.
Hierarchical visual reasoning tasks.
Referential spatial reasoning evaluation.
Document Analysis and reasoning (2k).
Multi-perspective visual understanding.
Environment Reasoning and Question Answering.
Universal document understanding benchmark.
OCR benchmark measuring edit distance (lower is better; an edit-distance sketch follows this domain's entries).
Screen understanding benchmark for GUI interaction.
Information-seeking visual question answering on the test split.
Chart-based reasoning from arXiv papers (Data QA).
Chart-based reasoning from arXiv papers (Reasoning QA).
Information synthesis from complex charts.
Document Understanding and Dialogue Evaluation.
Multimodal Long context benchmark.
Long document understanding with URLs.
Multimodal Long context document evaluation.
Multimodal Video Understanding.
Verifiable question answering for short video clips.
Complex reasoning tasks in video content.
Sequence reasoning and motion understanding.
Deep diagnostic video understanding.
Long-form video reasoning and knowledge retrieval.
Continuous Physics reasoning in video.
Temporal orientation and perception in video.
First-person perspective temporal reasoning.
Comprehensive motion perception evaluation.
Temporal Object-centric Multimodal Analysis.
Contextual Grounding in long videos.
Understanding extremely long-form video content.
Professional level video quality and content evaluation.
Large-scale Video Benchmark.
Cross-video temporal and relational reasoning.
Live sports broadcast understanding.
Object-Video-Object relational reasoning.
Open-Domain Video understanding.
Video-to-speech and dialogue reasoning.
Short-form visual question answering with verifiable responses.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.
Television/video narrative understanding benchmark.
Open-world video understanding benchmark.
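For the OCR entries above, edit distance usually means Levenshtein distance between predicted and reference text, often normalized by length so that lower is better. A minimal sketch follows, assuming character-level Levenshtein and length normalization; the benchmarks' exact normalization is not documented here.

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions, and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def normalized_edit_distance(pred: str, ref: str) -> float:
        """Edit distance scaled to [0, 1] by the longer string; lower is better."""
        if not pred and not ref:
            return 0.0
        return levenshtein(pred, ref) / max(len(pred), len(ref))

    print(normalized_edit_distance("kitten", "sitting"))  # 3 / 7 ≈ 0.43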
Performance on extended documents and long-context reasoning
Multi-Round Context Retrieval - 8-needle test.
Comprehensive long-context understanding (128k).
Artificial Analysis Long Context Reasoning benchmark. Evaluates reasoning over long contexts.
Traversal-based long context reasoning using BFS (128k).
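The exact task format of the BFS traversal entry is not documented here. As an illustration of the breadth-first traversal such tasks ask models to perform over relations embedded in a long prompt, here is a generic shortest-path BFS over an adjacency list (graph and names illustrative):

    from collections import deque

    def bfs_path(graph: dict, start: str, goal: str):
        """Breadth-first search: return the shortest hop path from start to goal
        in an adjacency-list graph, or None if the goal is unreachable."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == goal:
                return path
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(path + [neighbor])
        return None

    # Example graph like one a long-context traversal task might embed in its prompt.
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
    print(bfs_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']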
Accuracy, hallucination resistance, and factual reliability
Factuality in long-form conceptual generations.
Evaluates model omniscience and factual reliability across diverse domains.
Precision of fine-grained facts in long-form biographies.
Factuality in long-form generations about objects.
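Long-form factuality scores of this kind are often computed FActScore-style: decompose the generation into atomic claims and report the fraction supported by a reference source. The sketch below is a hedged illustration with a made-up claim set, not any of these benchmarks' actual pipelines.

    def fact_precision(claims, supported) -> float:
        """Fraction of atomic claims in a long-form answer that are supported
        by the reference source (FActScore-style precision; names illustrative)."""
        if not claims:
            return 0.0
        return sum(1 for c in claims if c in supported) / len(claims)

    # Example: a generated biography decomposed into 4 atomic claims,
    # 3 of which are verified against the reference source.
    claims = ["born in 1867", "won two Nobel Prizes", "discovered radium", "born in Paris"]
    supported = {"born in 1867", "won two Nobel Prizes", "discovered radium"}
    print(fact_precision(claims, supported))  # 0.75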