Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data: Unknown
Context Window: 262k tokens
Input Cost: $0.60 per 1M tokens
Output Cost: $3.60 per 1M tokens
Parameters: 397B (17B active) model footprint
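The input and output rates above can be combined into a simple per-request cost estimate. A minimal sketch, assuming per-token billing at the listed rates (the request sizes in the example are hypothetical):

```python
# Rates from the listing above, in USD per 1M tokens.
INPUT_COST_PER_M = 0.60
OUTPUT_COST_PER_M = 3.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one request at the listed rates."""
    return (input_tokens * INPUT_COST_PER_M
            + output_tokens * OUTPUT_COST_PER_M) / 1_000_000

# Example: a 10k-token prompt with a 2k-token completion.
cost = request_cost(10_000, 2_000)  # 0.006 + 0.0072 = 0.0132 USD
```

Output tokens dominate the bill here: at a 6x rate difference, a completion one fifth the length of the prompt already costs more than the prompt itself.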
Performance Analysis // Verified Benchmarks
Challenging competition mathematics problems (AIME/IMO level).
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Multi-discipline Multimodal Understanding and Reasoning.
Grade school math word problems requiring multi-step reasoning.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Extremely difficult expert-level science questions.
Contamination-free coding benchmark using recent problems.
Artificial Analysis Long Context Reasoning benchmark. Evaluates reasoning over long contexts.
Instruction Following Evaluation for Large Language Models. Measures ability to follow strict formatting and constraint requirements.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
Complex, multi-constraint instruction following tasks.
Compact MathVista split for faster multimodal reasoning checks.
Comprehensive mathematical vision benchmark.
Optical character recognition and document understanding benchmark.
Professional-level MMMU expansion.
Real-world visual question answering.
Visual hallucination and factuality benchmark.
Elite multimodal model evaluation.
Visual object counting and identification.
Environment Reasoning and Question Answering.
Universal document understanding benchmark.
Chart-based reasoning from arXiv papers (Reasoning QA).
Multimodal Video Understanding.
Large-scale Video Benchmark.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Verified desktop computer-use benchmark for end-to-end task completion.
Software engineering performance across multilingual codebases.
Virtual task assistant benchmark across practical workflows.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.