Beta version: Information might not be fully accurate. Please report any discrepancies.

OpenAI | Verified | 9 benchmarks

GPT-4o

Released 2024-05-13 | 200B (Estimated) Architecture

Verified Model Card

Latest Data: Unknown
Context Window: 128k tokens
Input Cost: $2.50 per 1M tokens
Output Cost: $10.00 per 1M tokens
Cache Cost: $1.25 (read) / Free (write) per 1M tokens
Parameters: 200B (Estimated) model footprint
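
The listed prices translate into a per-request cost with simple arithmetic. The sketch below is illustrative only: the function name `estimate_cost` and the `cached_tokens` parameter are hypothetical, and it assumes that cached input tokens are billed at the cache-read rate in place of the regular input rate, which the card does not state explicitly.

```python
# Prices copied from the card above, in USD per 1M tokens.
INPUT_PER_M = 2.50
OUTPUT_PER_M = 10.00
CACHED_READ_PER_M = 1.25  # cache writes are listed as free


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption (not stated on the card): cached_tokens is the portion of
    input_tokens served from cache, billed at the cache-read rate instead
    of the regular input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * INPUT_PER_M
        + cached_tokens * CACHED_READ_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 100k input tokens (half of them cached) and 20k output tokens.
    print(f"${estimate_cost(100_000, 20_000, cached_tokens=50_000):.4f}")
```

At these rates output tokens cost four times as much as uncached input tokens, so long completions tend to dominate the bill.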

Benchmark Provenance

Performance Analysis // Verified Benchmarks

MMLU (5-shot) | Knowledge

88.7 / 100

Verified

Last Verified: Unknown Date | OpenAI Blog

Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.

MATH (CoT) | Math

76.6 / 100

Verified

Last Verified: Unknown Date | OpenAI Blog

Challenging competition mathematics problems, drawn from contests such as AMC 10, AMC 12, and AIME.

HumanEval | Coding

90.2 / 100

Verified

Last Verified: Unknown Date | OpenAI Blog

Functional correctness of synthesized programs from docstrings.
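
To make "functional correctness" concrete, here is a minimal sketch of how a generated program can be judged by running unit tests against it. The helper `passes_tests` and the toy `add` problem are hypothetical, written in the spirit of a `check(candidate)` test function; this is not the official evaluation harness. It also calls `exec` on generated code, which is only safe inside a sandbox.

```python
def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate defines entry_point and every assert passes.

    WARNING: exec runs arbitrary code; real harnesses isolate this step
    in a sandboxed process with time and memory limits.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # defines the generated function
        exec(test_src, namespace)       # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, missing names, and failed asserts all count as a failure.
        return False
    return True


# Toy problem (hypothetical): the model is given the docstring and writes the body.
GOOD = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
BAD = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a - b\n'
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

if __name__ == "__main__":
    print(passes_tests(GOOD, TESTS, "add"))
    print(passes_tests(BAD, TESTS, "add"))
```

A score such as the one above is then the percentage of problems for which a generated solution passes all of its tests.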

MMMU (Multimodal) | Multimodal

69.1 / 100

Verified

Last Verified: Unknown Date | OpenAI Blog

Multi-discipline Multimodal Understanding and Reasoning.

BigCodeBench | Coding

31.1 / 100

Verified

Last Verified: Unknown Date | OpenAI Blog

Next-generation HumanEval with more diverse library calls and complex tasks.

LMArena Elo | Real-world

1388 / 1700

Verified

Last Verified: Unknown Date | OpenAI Blog

Chatbot Arena Elo score: a crowd-sourced human preference ranking.
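
An Elo-style rating is easiest to read as a difference between two models rather than as a fraction of a maximum. The sketch below uses the standard Elo logistic formula with a scale of 400; that the arena score follows exactly this convention is an assumption, and the function name `expected_score` and the opponent rating of 1288 are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability of A over B under the standard Elo model.

    Assumption: the leaderboard uses the usual base-10 logistic curve with a
    400-point scale; the site's actual fitting procedure may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Equal ratings give an even matchup.
    print(expected_score(1388, 1388))  # 0.5
    # Against a hypothetical model rated 100 points lower.
    print(round(expected_score(1388, 1288), 3))
```

Under this model a 100-point gap corresponds to the higher-rated model being preferred in roughly 64% of head-to-head votes.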

AgentBench | Agent

90 / 100

Verified

Last Verified: Unknown Date | OpenAI Blog

Comprehensive framework to evaluate LLMs as agents across diverse environments.

GPQA Diamond | STEM

53.6 / 100

Verified

Last Verified: Unknown Date | OpenAI Blog

Graduate-Level Google-Proof Q&A Benchmark.

Superchem | STEM

40 / 100

Unverified

Last Verified: Unknown Date | OpenAI Blog

Expert-level chemistry knowledge and reasoning.