Beta version: Information might not be fully accurate. Please report any discrepancies.

Agents & Tools

Measures a model's ability to use external tools, follow complex instructions, operate autonomously in multi-step workflows, and function as an effective AI agent. Includes BFCL, API-based tasks, and instruction-following benchmarks.

Top Models

1. Gemini 2.0 Flash

Domain Info

Benchmarks: 54
Models Evaluated: 64
Categories: Agent, Agentic, Instruction Following

Benchmarks

AgentBench, IFEval, Inverse IFEval, IFBench Verified, AdvancedIF, MultiChallenge, Terminal-Bench 2.0, Terminal-Bench Hard, Claw-Eval, APEX-Agents