Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-20
Context Window: 10.0M tokens
Input Cost: $1.00 per 1M tokens
Output Cost: $3.00 per 1M tokens
Parameters: Proprietary (model footprint not disclosed)
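
To make the per-token rates above concrete, here is a minimal Python sketch that estimates the dollar cost of a single request at the listed prices. The function and variable names are illustrative assumptions, not part of any published API.

    def estimate_request_cost(input_tokens: int, output_tokens: int) -> float:
        """Estimate the USD cost of one request at the listed per-token rates."""
        INPUT_RATE = 1.00 / 1_000_000   # $1.00 per 1M input tokens
        OUTPUT_RATE = 3.00 / 1_000_000  # $3.00 per 1M output tokens
        return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

    # Example: a 400,000-token prompt (well within the 10.0M context window)
    # plus a 2,000-token completion costs 0.40 + 0.006 = $0.406.
    print(f"${estimate_request_cost(400_000, 2_000):.3f}")
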
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues; the Verified subset filters to tasks confirmed to be solvable.
Contamination-free, continuously updated reasoning benchmark.
Artificial Analysis aggregate intelligence index.
A harder, more robust version of MMLU, focused on complex reasoning and STEM subjects.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Open-domain factuality benchmark focusing on short, verifiable answers.
Harvard-MIT Mathematics Tournament - High-difficulty competition math.
International Mathematical Olympiad style answer-only benchmark.
Contamination-free coding benchmark using recent problems.
Artificial Analysis aggregate coding capability index.
Research-grade coding and software development tasks.
Cybersecurity-flavored coding benchmark in simulated environments.
Online-judge competitive coding benchmark focused on C++ tasks.
Graduate-Level Google-Proof Q&A Benchmark.
Comprehensive long-context understanding (128k).
Artificial Analysis Long Context Reasoning benchmark; tests reasoning across very long inputs.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
Advanced instruction-following benchmark with verified grading.
American Invitational Mathematics Examination 2025 problems.
Compact MathVista split for faster multimodal reasoning checks.
Comprehensive mathematical vision benchmark.
Spatial and perception benchmark for multimodal models.
Multimodal visual perception benchmark.
Optical character recognition and document understanding benchmark.
Professional-level MMMU expansion.
Validation split of MMMU for multimodal understanding.
Aggregate ZeroBench score across the full task set.
ZeroBench score when tool use is allowed.
Early-stage visual development benchmark.
Global visual knowledge and reasoning.
Universal document understanding benchmark.
Information-seeking visual question answering on the test split.
Chart-based reasoning from arXiv papers (Reasoning QA).
Multimodal Video Understanding.
Comprehensive motion perception evaluation.
Understanding extremely long-form video content.
Large-scale Video Benchmark.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Hard split of Terminal-Bench focused on tougher terminal workflows.
Verified desktop computer-use benchmark for end-to-end task completion.
Browser-based autonomous task execution benchmark.
Higher-difficulty SWE-bench subset for frontier coding agents.
Software engineering performance across multilingual codebases.
Web browsing + synthesis benchmark for research agents.
BrowseComp variant with explicit context-window management.
Multi-agent swarm variant of BrowseComp.
Broad retrieval and synthesis benchmark across many sources.
Multi-agent swarm variant of WideSearch.
Tier 2 and Tier 3 slices of FinSearchComp.
Deep multi-hop search QA for long-horizon agents.
Strategic environment-agent loop benchmark.
Artificial Analysis GDPVal benchmark for knowledge-work quality.
Telecom-domain tool-use and workflow benchmark.
Scientific programming benchmark for code synthesis and correctness.
Short-form visual question answering with verifiable responses.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.