Beta version: Information might not be fully accurate. Please report any discrepancies.
Verified desktop computer-use benchmark for end-to-end task completion.
[Figure: Score Distribution]
AgentBench (agentbench)
IFEval (ifeval)
Inverse IFEval (ifeval-inverse)
IFBench (ifbench)
Verified AdvancedIF (verified-advancedif)
MultiChallenge (multichallenge)