Beta version: Information might not be fully accurate. Please report any discrepancies.
Contamination-free, continuously updated reasoning benchmark.
Score Distribution
MMLU-Pro (mmlu-pro)
HLE (hle)
HLE-Full (hle-full)
HLE-Full (w/ tools) (hle-full-tools)
CritPt (critpt)
SimpleQA (simpleqa)