
LLM Benchmark Comparison

Compare Claude, GPT-4o, Gemini, Llama, and other models on MMLU, HumanEval, MATH, GPQA, context window, and pricing.

Model              Provider     MMLU ↓  HumanEval  MATH   GPQA   Context  Input/1M  Output/1M
Gemini 2.5 Pro     Google       89.4%   88.6%      79.1%  75%    1M       $1.25     $10
GPT-4o             OpenAI       88.7%   90.2%      76.6%  53.6%  128K     $2.5      $10
Claude Opus 4.8    Anthropic    88.2%   92%        71.5%  74.9%  200K     $15       $75
Grok 2             xAI          87.5%   89.1%      76.1%  56%    128K     $2        $10
Claude Sonnet 4.6  Anthropic    87.1%   90.5%      68.9%  71.2%  200K     $3        $15
Llama 3.3 70B      Meta (open)  86%     88.4%      77%    50.5%  128K     open      open
Gemini 2.5 Flash   Google       85.1%   85.4%      74%    62.1%  1M       $0.15     $0.6
Mistral Large 2    Mistral      84%     92.1%      68%    59.6%  128K     $2        $6
GPT-4o mini        OpenAI       82%     87.2%      70.2%  40.2%  128K     $0.15     $0.6
Claude Haiku 4.5   Anthropic    77.6%   80.2%      54.3%  58.1%  200K     $0.8      $4

Rows are sorted by MMLU score, highest first. Prices are in USD per 1M tokens. Last updated: 2026-06.
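Because prices are quoted per 1M tokens, the cost of a workload is (input tokens × input price + output tokens × output price) / 1,000,000. The following is a minimal sketch of that arithmetic, not part of the tool itself. The prices are copied from the table; the function names `estimate_cost` and `rank_by_cost` and the example workload are hypothetical. Llama 3.3 70B is left out because the table lists it as "open" rather than with a per-token price.

```python
# Per-1M-token prices in USD, copied from the comparison table: (input, output).
# Llama 3.3 70B is omitted because the table lists "open" instead of a price.
PRICES = {
    "Gemini 2.5 Pro": (1.25, 10.0),
    "GPT-4o": (2.5, 10.0),
    "Claude Opus 4.8": (15.0, 75.0),
    "Grok 2": (2.0, 10.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Gemini 2.5 Flash": (0.15, 0.6),
    "Mistral Large 2": (2.0, 6.0),
    "GPT-4o mini": (0.15, 0.6),
    "Claude Haiku 4.5": (0.8, 4.0),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for one workload on one model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def rank_by_cost(input_tokens, output_tokens):
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = {m: estimate_cost(m, input_tokens, output_tokens) for m in PRICES}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 500K output tokens.
    for model, cost in rank_by_cost(2_000_000, 500_000):
        print(f"{model}: ${cost:.2f}")
```

For this assumed workload, GPT-4o comes to 2 × $2.5 + 0.5 × $10 = $10.00, while Claude Opus 4.8 comes to 2 × $15 + 0.5 × $75 = $67.50. Ranking by cost alone ignores the benchmark columns, so treat it as one input to the choice rather than the answer.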

Related guide

LLM Comparison 2026: How to Choose the Right AI Model

Compare MMLU, HumanEval, MATH and GPQA scores. Learn which LLM benchmark matters for your use case and how to pick between Claude, GPT-4o, Gemini, and Llama.
