Beta version: Information might not be fully accurate. Please report any discrepancies.

OpenAI

Verified

7 benchmarks

o1

Released 2024-12-05

Reasoning Model Architecture

Training: 2023-09

Verified Model Card

Latest Data

Unknown

Context Window

200k tokens

Input Cost

$15.00 per 1M tokens

Output Cost

$60.00 per 1M tokens

Cache Cost

$7.50 (read) / Free (write) per 1M tokens
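The rates listed above can be turned into a rough per-request estimate. The following is a minimal sketch, assuming the rates scale linearly with token count, that cached input tokens are billed at the cache-read rate instead of the input rate, and that cache writes cost nothing. The function `estimate_cost_usd` and its parameter names are hypothetical and are not part of any OpenAI API.

```python
# Per-1M-token rates in USD, copied from the card above.
INPUT_RATE = 15.00
OUTPUT_RATE = 60.00
CACHE_READ_RATE = 7.50  # cache writes are listed as free


def estimate_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the cost of one request (hypothetical helper).

    Assumption: `cached_tokens` is the portion of `input_tokens` served
    from cache and is billed at the cache-read rate instead of the input rate.
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")

    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * INPUT_RATE
        + cached_tokens * CACHE_READ_RATE
        + output_tokens * OUTPUT_RATE
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 10k input tokens (8k of them cached) and 2k output tokens.
    print(f"${estimate_cost_usd(10_000, 2_000, cached_tokens=8_000):.2f}")
```

Under these assumptions, output tokens dominate the bill for most requests, since they cost four times as much as uncached input tokens.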

Parameters

Reasoning Model

model footprint

Benchmark Provenance

Performance Analysis // Verified Benchmarks

MMLU (5-shot)

Knowledge

91.8 / 100

Verified

Last Verified: Unknown Date

OpenAI Blog

Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.

HumanEval

Coding

88.1 / 100

Verified

Last Verified: Unknown Date

OpenAI Blog

Functional correctness of synthesized programs from docstrings.

BigCodeBench

Coding

35.5 / 100

Verified

Last Verified: Unknown Date

OpenAI Blog

Next-generation HumanEval with more diverse library calls and complex tasks.

LMArena Elo

Real-world

1360 / 1700

Verified

Last Verified: Unknown Date

OpenAI Blog

Chatbot Arena Elo score. Crowd-sourced human preference ranking.
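To give a sense of what a rating gap means, the sketch below converts two Elo-style ratings into an expected head-to-head preference rate. It assumes the classic Elo logistic curve with a 400-point scale; the exact rating methodology behind the leaderboard is not stated here, so treat the numbers as illustrative. The function name `expected_win_rate` and the 1260 opponent rating are hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes won by A over B under the classic Elo formula.

    Assumption: standard logistic curve, base 10, with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # The 1360 rating from the card against a hypothetical 1260-rated model.
    print(round(expected_win_rate(1360, 1260), 2))  # 0.64
```

Under this assumption, a 100-point lead corresponds to being preferred in roughly 64% of pairwise comparisons, and equal ratings give exactly 50%.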

AgentBench

Agent

87.6 / 100

Verified

Last Verified: Unknown Date

OpenAI Blog

Comprehensive framework to evaluate LLMs as agents across diverse environments.

MATH-500

Math

96.4 / 100

Verified

Last Verified: Unknown Date

OpenAI Blog

500-problem math benchmark for broad quantitative reasoning.

GPQA Diamond

STEM

78 / 100

Verified

Last Verified: Unknown Date

OpenAI Blog

Graduate-Level Google-Proof Q&A Benchmark.