LLM Benchmark Comparison
Compare Claude, GPT-4o, Gemini, Llama and more on MMLU, HumanEval, MATH, GPQA and pricing.
| Model | Provider | MMLU ↓ | HumanEval | MATH | GPQA | Context | Input/1M | Output/1M |
|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | Google | 89.4% | 88.6% | 79.1% | 75% | 1M | $1.25 | $10 |
| GPT-4o | OpenAI | 88.7% | 90.2% | 76.6% | 53.6% | 128K | $2.5 | $10 |
| Claude Opus 4.8 | Anthropic | 88.2% | 92% | 71.5% | 74.9% | 200K | $15 | $75 |
| Grok 2 | xAI | 87.5% | 89.1% | 76.1% | 56% | 128K | $2 | $10 |
| Claude Sonnet 4.6 | Anthropic | 87.1% | 90.5% | 68.9% | 71.2% | 200K | $3 | $15 |
| Llama 3.3 70B | Meta (open) | 86% | 88.4% | 77% | 50.5% | 128K | open | open |
| Gemini 2.5 Flash | Google | 85.1% | 85.4% | 74% | 62.1% | 1M | $0.15 | $0.6 |
| Mistral Large 2 | Mistral | 84% | 92.1% | 68% | 59.6% | 128K | $2 | $6 |
| GPT-4o mini | OpenAI | 82% | 87.2% | 70.2% | 40.2% | 128K | $0.15 | $0.6 |
| Claude Haiku 4.5 | Anthropic | 77.6% | 80.2% | 54.3% | 58.1% | 200K | $0.8 | $4 |
Prices in USD per 1M tokens. Last updated: 2026-06.
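Because input and output tokens are priced separately, the cheapest model depends on the shape of the workload. The sketch below shows the arithmetic for three rows of the table above. The helper names `estimate_cost` and `rank_by_cost` and the example workload (2M input tokens, 500K output tokens) are illustrative assumptions, not part of the original tool.

```python
# Prices copied from the table above, in USD per 1M tokens: (input, output).
PRICES = {
    "GPT-4o": (2.5, 10.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Gemini 2.5 Flash": (0.15, 0.6),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a workload (hypothetical helper)."""
    in_price, out_price = PRICES[model]
    cost = (input_tokens / 1_000_000) * in_price
    cost += (output_tokens / 1_000_000) * out_price
    return round(cost, 4)


def rank_by_cost(input_tokens: int, output_tokens: int) -> list[tuple[str, float]]:
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = {m: estimate_cost(m, input_tokens, output_tokens) for m in PRICES}
    return sorted(costs.items(), key=lambda pair: pair[1])


if __name__ == "__main__":
    # Assumed workload: 2M input tokens and 500K output tokens.
    for model, cost in rank_by_cost(2_000_000, 500_000):
        print(f"{model}: ${cost:.2f}")
    # Gemini 2.5 Flash: $0.60
    # GPT-4o: $10.00
    # Claude Sonnet 4.6: $13.50
```

For an output-heavy workload the gap widens, since output prices in the table are several times the input prices for every listed model.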
LLM Comparison 2026: How to Choose the Right AI Model
Compare MMLU, HumanEval, MATH and GPQA scores. Learn which LLM benchmark matters for your use case and how to pick between Claude, GPT-4o, Gemini, and Llama.