Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data: Unknown
Context Window: 128k tokens
Input Cost: $0.47 per 1M tokens
Output Cost: $2.37 per 1M tokens
Parameters: Unknown (model footprint)
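The per-1M-token rates above translate directly into a per-request cost estimate. A minimal sketch, assuming the pricing from the table; the token counts in the example are hypothetical:

```python
# Estimate the cost of one request from the listed per-1M-token rates.
INPUT_COST_PER_M = 0.47   # USD per 1M input tokens (from the pricing table)
OUTPUT_COST_PER_M = 2.37  # USD per 1M output tokens (from the pricing table)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a single request."""
    return (input_tokens * INPUT_COST_PER_M
            + output_tokens * OUTPUT_COST_PER_M) / 1_000_000

# Example: a 4,000-token prompt with a 1,000-token completion.
print(round(request_cost(4_000, 1_000), 6))  # → 0.00425
```

Output tokens dominate the bill at roughly 5x the input rate, so long completions cost far more than long prompts of the same size.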
Performance Analysis // Verified Benchmarks
Challenging competition mathematics problems (AIME/IMO level).
Functional correctness of synthesized programs from docstrings.
Multi-discipline Multimodal Understanding and Reasoning.
Chatbot Arena Elo score. Crowd-sourced human preference ranking.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Projected performance on future AIME problem sets.
Competitive programming rating based on problem solving.
Graduate-Level Google-Proof Q&A Benchmark.
Abstraction and Reasoning Corpus - Level 1.
American Invitational Mathematics Examination 2025 problems.
William Lowell Putnam Mathematical Competition problems - top 200 level difficulty.
Mathematical reasoning in visual contexts.
Comprehensive mathematical vision benchmark.
Massive Multi-discipline Multimodal Understanding and Reasoning.
Logical reasoning in visual puzzles and diagrams.
Spatial and perception benchmark for multimodal models.
Expert-level chart understanding and question answering.
Next-gen optical character recognition and document understanding.
Dynamic mathematical reasoning in visual contexts.
Mathematical competition problems with visual elements.
Multi-step mathematical reasoning on a canvas.
Professional level MMMU expansion.
Expert-level Multimodal Mathematics Analysis.
Scientific Figure Evaluation.
High-level Physics Olympiad (Vision).
Cross-domain Logical Reasoning and Spatial benchmark.
Physics reasoning with open-ended visual questions.
Visual Perception and Coding Tasks.
Zero-shot visual reasoning benchmark.
Zero-shot visual reasoning sub-tasks.
ARC-AGI Level 1 tasks in image format.
ARC-AGI Level 2 tasks in image format.
Visual logic and sequence reasoning.
Evaluating bias in Vision-Language Models.
Evaluating perception failures in VLMs.
Visual factor identification and reasoning.
Real-world visual question answering.
Early-stage visual development benchmark.
Visual hallucination and factuality benchmark.
Multimodal Evaluation (Cognitive Capacity).
Elite multimodal model evaluation.
Multimodal Understanding and Interaction Benchmark.
Multilingual Text-centric Visual QA.
Global visual knowledge and reasoning.
Subjective and intuitive visual quality evaluation.
Visual Verification and reasoning.
Visual object counting and identification.
Few-shot counting benchmark (lower is better; accounted for during score normalization).
Visual pointing and spatial grounding.
Multimodal Spatial Interaction Benchmark.
Hierarchical visual reasoning tasks.
Referential spatial reasoning evaluation.
Document Analysis and reasoning (2k).
Multi-perspective visual understanding.
Environment Reasoning and Question Answering.
Universal document understanding benchmark.
Chart-based reasoning from arXiv papers (Data QA).
Chart-based reasoning from arXiv papers (Reasoning QA).
Document Understanding and Dialogue Evaluation.
Multimodal Long context benchmark.
Long document understanding with URLs.
Multimodal Long context document evaluation.
Multimodal Video Understanding.
Verifiable question answering for short video clips.
Complex reasoning tasks in video content.
Sequence reasoning and motion understanding.
Deep diagnostic video understanding.
Long-form video reasoning and knowledge retrieval.
Continuous Physics reasoning in video.
Temporal orientation and perception in video.
First-person perspective temporal reasoning.
Comprehensive motion perception evaluation.
Temporal Object-centric Multimodal Analysis.
Contextual Grounding in long videos.
Understanding extremely long-form video content.
Professional level video quality and content evaluation.
Large-scale Video Benchmark.
Cross-video temporal and relational reasoning.
Live sports broadcast understanding.
Object-Video-Object relational reasoning.
Open-Domain Video understanding.
Video-to-speech and dialogue reasoning.
Scientific Olympiad level problems.
Short-form visual question answering with verifiable responses.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.
Television/video narrative understanding benchmark.
Open-world video understanding benchmark.
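Most benchmarks above are higher-is-better, but a few (such as the few-shot counting metric) are lower-is-better and must be inverted before being combined on one scale. A minimal min-max normalization sketch; the function name and score values are hypothetical, not the site's actual pipeline:

```python
def normalize(scores, lower_is_better=False):
    """Min-max normalize raw benchmark scores to [0, 1].

    Lower-is-better metrics (e.g. error counts) are inverted so that
    1.0 always denotes the best-performing model on the benchmark.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)  # all models tied on this benchmark
    normed = [(s - lo) / (hi - lo) for s in scores]
    return [1.0 - n for n in normed] if lower_is_better else normed

# Example: counting error rates for three models (lower is better).
print(normalize([0.10, 0.25, 0.40], lower_is_better=True))
```

After this step, every benchmark contributes a comparable 0-to-1 score regardless of its native direction or units.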