Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-05-21
Context Window: 1.1M tokens
Input Cost: $5.00 per 1M tokens
Output Cost: $30.00 per 1M tokens
Parameters: Unknown (model footprint)
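The per-token prices listed above translate into a per-request cost by simple proportion. The following is a minimal sketch of that arithmetic, assuming flat pricing with no caching, batch, or long-context surcharges (the listing does not say whether any apply). The function name `estimate_cost_usd` and the token counts in the example are hypothetical.

```python
from decimal import Decimal

# Prices copied from the listing above (USD per 1M tokens).
INPUT_USD_PER_MTOK = Decimal("5.00")
OUTPUT_USD_PER_MTOK = Decimal("30.00")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request under flat per-token pricing (an assumption)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MTOK / TOKENS_PER_MTOK
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MTOK / TOKENS_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 200,000 input tokens and 4,000 output tokens.
    print(estimate_cost_usd(200_000, 4_000))
```

For the hypothetical request shown, the input side contributes $1.00 and the output side $0.12, for a total of $1.12. Output tokens cost six times as much as input tokens at these rates, so long generations dominate the bill faster than long prompts do.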
Performance Analysis // Verified Benchmarks
Chatbot Arena Elo score. Crowd-sourced human preference ranking.
WebDev Arena Elo score. Human preference ranking for web development tasks.
Vision Arena Elo score. Human preference ranking for multimodal vision tasks.
Document Arena Elo score. Human preference ranking for document understanding.
Humanity's Last Exam full evaluation without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Cybersecurity-flavored coding benchmark in simulated environments.
Graduate-Level Google-Proof Q&A Benchmark.
Abstraction and Reasoning Corpus - Level 1.
Abstraction and Reasoning Corpus - Level 2 (Extreme difficulty).
Traversal-based long context reasoning using BFS (128k).
Professional-level MMMU expansion.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Verified desktop computer-use benchmark for end-to-end task completion.
Higher-difficulty SWE-bench subset for frontier coding agents.
Web browsing + synthesis benchmark for research agents.
Multi-step workflows using Model Context Protocol.
Long-horizon real-world software tasks.
Telecom-domain tool-use and workflow benchmark.
Advanced mathematics benchmark with tiered difficulty.
Long-horizon software engineering tasks requiring expert-level reasoning.
Genetics and quantitative biology benchmark.
Bioinformatics and data analysis benchmark.
Financial analysis and reasoning benchmark for agentic workflows.
Advanced document reasoning and office task completion benchmark.
Hardest tier of FrontierMath advanced mathematics benchmark.
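Several entries above (Chatbot, WebDev, Vision, and Document Arena) are Elo-style ratings, which are easiest to read as pairwise win probabilities. The sketch below uses the conventional Elo logistic curve with a 400-point scale. This is an assumption about how to interpret the numbers, since the page does not describe how the leaderboards actually fit their ratings, and the ratings in the example are hypothetical rather than taken from this page.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo logistic model.

    The result lies strictly between 0 and 1; 0.5 means the two are evenly matched.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Hypothetical ratings, chosen only to illustrate the curve.
    print(elo_expected_score(1500.0, 1500.0))
    print(elo_expected_score(1550.0, 1500.0))
```

Under this model only the rating difference matters: equal ratings give an expected score of 0.5, and a 400-point lead corresponds to roughly a 10-to-1 preference ratio.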