AI Benchmarks

Compare leading AI models across standardized benchmarks. Last updated 2026-06-11.


Compare specific models, side-by-side

Pick any 2 to 5 models to put head-to-head across benchmarks, pricing, and context windows. Popular pairs: Claude Opus vs GPT-5.5, Gemini 3 vs Llama 4, open-source vs frontier.

How do you know if Claude is smarter than GPT-4? How does the new Llama 4 stack up against Gemini 2.5? Benchmarks offer a partial answer. These standardized tests measure specific AI capabilities across diverse domains and give us a common basis for comparing models. They're imperfect (benchmarks are often gamed), but they're the only shared language we have for understanding AI progress.

MMLU measures broad knowledge with multiple-choice questions spanning chemistry, history, law, and 50+ other domains. A score of 92 percent means the model answers, on average, 92 out of every 100 questions correctly across all topics. MMLU is the closest thing we have to a general intelligence test for AI. HumanEval tests code generation: the model writes functions to solve hand-written programming problems, and unit tests check the result. GPQA (Graduate-Level Google-Proof Q&A) is deliberately hard, asking expert-written questions in biology, physics, and chemistry that require deep expertise and resist a quick web search. MATH measures mathematical reasoning on competition-style problems. SWE-bench tests software engineering tasks: given a GitHub issue and the project's codebase, can the model write a patch that makes the failing tests pass?
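As a rough illustration of how a multiple-choice score like MMLU's is computed, the sketch below grades a handful of answers and reports the percentage correct. The question IDs, the answer key, and the `accuracy_percent` helper are hypothetical and are not part of any official evaluation harness.

```python
def accuracy_percent(predictions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return the percentage of questions answered correctly.

    Questions missing from `predictions` count as wrong.
    """
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(
        1 for qid, gold in answer_key.items() if predictions.get(qid) == gold
    )
    return 100.0 * correct / len(answer_key)


# Hypothetical four-question quiz; real benchmarks contain thousands of items.
answer_key = {"chem-1": "B", "hist-1": "D", "law-1": "A", "math-1": "C"}
predictions = {"chem-1": "B", "hist-1": "D", "law-1": "C", "math-1": "C"}

score = accuracy_percent(predictions, answer_key)
print(f"{score:.1f}%")  # 3 of 4 correct -> 75.0%
```

The same idea carries over to the other benchmarks, with "correct" redefined: for HumanEval and SWE-bench, an item counts as correct when the relevant tests pass.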

No single benchmark captures everything. A model that excels at MMLU might struggle with code. Benchmark questions have leaked into training data, so some scores reflect memorization rather than ability. And real-world performance depends on your specific task, how you prompt, and how you integrate the model into your system. Use this data to narrow the field of candidates. Then test the finalists on your actual workloads. We've also collected this data in our model comparison tool for side-by-side analysis.

SWE-bench: Real-world software engineering tasks from GitHub issues (SWE-bench Verified). Max score: 100.

Rank	Model	Provider	Score	Released
#1	Claude Fable 5	Anthropic	95.0 / 100	2026-06
#2	Claude Opus 4.8	Anthropic	88.6 / 100	2026-05
#3	Claude Opus 4.7	Anthropic	87.6 / 100	2026-04
#4	GPT-5.5	OpenAI	82.6 / 100	2026-04
#5	Claude Opus 4.6	Anthropic	80.8 / 100	2026-03
#6	DeepSeek V4 Pro	DeepSeek	80.6 / 100	2026-04
#7	Claude Sonnet 4.6	Anthropic	79.6 / 100	2026-02
#8	DeepSeek V4 Flash	DeepSeek	79.0 / 100	2026-04
#9	Claude Haiku 4.5	Anthropic	73.3 / 100	2026-01
#10	Gemini 2.5 Pro	Google	63.8 / 100	2026-01
#11	o3-mini	OpenAI	49.3 / 100	2025-11
#12	o1	OpenAI	48.9 / 100	2025-09
#13	Mistral Large	Mistral	47.2 / 100	2025-11
#14	DeepSeek V3	DeepSeek	42.0 / 100	2025-12
#15	GPT-4.5	OpenAI	38.0 / 100	2025-12
#16	GPT-4o	OpenAI	33.2 / 100	2025-05
#17	Llama 4 Maverick	Meta	24.0 / 100	2026-03
Not reported on SWE-bench
-	Gemini 2.0 Flash	Google	not reported	2025-10
-	Llama 4 Scout	Meta	not reported	2026-02
-	Mistral Small	Mistral	not reported	2025-09
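
One way to use a table like this to narrow the field before testing on your own workloads is sketched below. It keeps the best-scoring model from each provider above a cutoff. The `Row` type, the `shortlist` helper, the 80-point cutoff, and the one-per-provider rule are all illustrative assumptions, not a recommended policy; the scores are a few rows copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    provider: str
    score: float  # SWE-bench Verified score out of 100


# A few rows copied from the leaderboard table.
ROWS = [
    Row("Claude Opus 4.8", "Anthropic", 88.6),
    Row("Claude Opus 4.7", "Anthropic", 87.6),
    Row("GPT-5.5", "OpenAI", 82.6),
    Row("DeepSeek V4 Pro", "DeepSeek", 80.6),
    Row("Gemini 2.5 Pro", "Google", 63.8),
    Row("Llama 4 Maverick", "Meta", 24.0),
]


def shortlist(rows: list[Row], min_score: float, per_provider: int = 1) -> list[Row]:
    """Keep the top `per_provider` models per provider scoring at least `min_score`."""
    kept: list[Row] = []
    counts: dict[str, int] = {}
    # Visit rows from highest to lowest score so the best model per provider wins.
    for row in sorted(rows, key=lambda r: r.score, reverse=True):
        if row.score < min_score:
            continue
        if counts.get(row.provider, 0) >= per_provider:
            continue
        counts[row.provider] = counts.get(row.provider, 0) + 1
        kept.append(row)
    return kept


# Hypothetical cutoff; choose one that suits your own requirements.
finalists = shortlist(ROWS, min_score=80.0)
for row in finalists:
    print(f"{row.model} ({row.provider}): {row.score}")
```

The finalists from a filter like this are only candidates; the next step is still to run them on your actual tasks.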