Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-05-21
Context Window: 1.0M tokens
Input Cost: $5.00 per 1M tokens
Output Cost: $25.00 per 1M tokens
Parameters: Unknown (model footprint)
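The per-token prices listed above translate into a per-request cost with simple arithmetic. The following is a minimal sketch, assuming flat pricing at the listed rates with no caching discounts or tiered long-context surcharges; the function name `estimate_cost_usd` and the token counts in the usage example are hypothetical.

```python
# Rates copied from the listing above (USD per 1M tokens).
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00
TOKENS_PER_MTOK = 1_000_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Assumption: flat per-token pricing; real billing may differ.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MTOK
             + output_tokens * OUTPUT_USD_PER_MTOK)
    return total / TOKENS_PER_MTOK


if __name__ == "__main__":
    # Hypothetical request: 200k input tokens, 4k output tokens.
    print(f"${estimate_cost_usd(200_000, 4_000):.2f}")
```

Under these assumptions, output tokens cost five times as much as input tokens, so long generations dominate the bill even when the prompt is much larger.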
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Chatbot Arena Elo score. Crowd-sourced human preference ranking.
WebDev Arena Elo score. Human preference ranking for web development tasks.
Vision Arena Elo score. Human preference ranking for multimodal vision tasks.
Search Arena Elo score. Human preference ranking for search-augmented generation.
Document Arena Elo score. Human preference ranking for document understanding.
Humanity's Last Exam full evaluation without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Cybersecurity-flavored coding benchmark in simulated environments.
Graduate-Level Google-Proof Q&A Benchmark.
Massive Multilingual Language Understanding.
Traversal-based long-context reasoning using BFS (128k).
Screen understanding benchmark for GUI interaction.
Information synthesis from complex charts.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Verified desktop computer-use benchmark for end-to-end task completion.
Higher-difficulty SWE-bench subset for frontier coding agents.
Software engineering performance across multilingual codebases.
Web browsing + synthesis benchmark for research agents.
Multi-step workflows using Model Context Protocol.
Financial analysis and reasoning benchmark for agentic workflows.
Advanced document reasoning and office task completion benchmark.
Protein structure and molecular biology reasoning benchmark.
Software engineering benchmark with multimodal inputs.
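
The arena entries in this list report Elo scores, which are easiest to interpret as pairwise win probabilities. The following is a minimal sketch, assuming the standard Elo logistic formula with a 400-point scale; the leaderboards themselves may fit ratings differently (for example with a Bradley-Terry model), and the ratings used in the usage example are hypothetical, not taken from this page.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B.

    Assumption: classic Elo expectation, E_A = 1 / (1 + 10^((R_B - R_A) / 400)).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical ratings, for illustration only.
    print(round(expected_win_rate(1450, 1400), 3))
```

Under this formula, equal ratings give a 50% expected win rate, and a 400-point gap corresponds to roughly a 10:1 preference ratio.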