
LLM Benchmark Comparison

Compare Claude, GPT-4o, Gemini, Llama, and other models on MMLU, HumanEval, MATH, and GPQA scores, context window, and API pricing.

Model	Provider	MMLU	HumanEval	MATH	GPQA	Context	Input $/1M tokens	Output $/1M tokens
Gemini 2.5 Pro	Google	89.4%	88.6%	79.1%	75%	1M	$1.25	$10
GPT-4o	OpenAI	88.7%	90.2%	76.6%	53.6%	128K	$2.5	$10
Claude Opus 4.8	Anthropic	88.2%	92%	71.5%	74.9%	200K	$15	$75
Grok 2	xAI	87.5%	89.1%	76.1%	56%	128K	$2	$10
Claude Sonnet 4.6	Anthropic	87.1%	90.5%	68.9%	71.2%	200K	$3	$15
Llama 3.3 70B	Meta (open)	86%	88.4%	77%	50.5%	128K	open	open
Gemini 2.5 Flash	Google	85.1%	85.4%	74%	62.1%	1M	$0.15	$0.6
Mistral Large 2	Mistral	84%	92.1%	68%	59.6%	128K	$2	$6
GPT-4o mini	OpenAI	82%	87.2%	70.2%	40.2%	128K	$0.15	$0.6
Claude Haiku 4.5	Anthropic	77.6%	80.2%	54.3%	58.1%	200K	$0.8	$4

Rows are sorted by MMLU score, highest first. Prices are in USD per 1M tokens; "open" marks an open-weight model with no listed API price. Last updated: 2026-06.
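
Because prices are quoted per 1M tokens, the cost of a single request comes from scaling each token count by its rate. The following is a minimal Python sketch, not part of the original tool: the `PRICES` dictionary and the `request_cost` helper are illustrative names, the rates are copied from a few rows of the table above, and the workload of 2,000 input and 500 output tokens is a hypothetical example.

```python
# Per-1M-token prices in USD, copied from the table above: (input, output).
# Only a few rows are included; this is an illustrative subset.
PRICES = {
    "Gemini 2.5 Pro": (1.25, 10.0),
    "GPT-4o": (2.5, 10.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Gemini 2.5 Flash": (0.15, 0.6),
    "GPT-4o mini": (0.15, 0.6),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the table's listed prices."""
    in_price, out_price = PRICES[model]
    # Prices are per 1,000,000 tokens, so divide the weighted sum by 1e6.
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2,000 input tokens and 500 output tokens per request.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 2_000, 500):.6f} per request")
```

Output tokens cost several times more than input tokens for every priced model in the table, so workloads with long generated responses are more sensitive to the output rate than to the input rate.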
