Highest Benchmark
Models ranked by MMLU (Massive Multitask Language Understanding) score. MMLU tests knowledge and reasoning with multiple-choice questions across 57 subjects. Scores above about 90% exceed the estimated average human expert performance.
Methodology: Models are sorted by benchmark_mmlu in descending order. The human expert average is approximately 89.8%. Not all models have published MMLU scores.
| # | Model | Provider | MMLU score |
|---|---|---|---|
| 1 | o1 | OpenAI | 92.3% MMLU |
| 2 | DeepSeek R1 | DeepSeek | 90.8% MMLU |
| 3 | Qwen3 235B A22B | Alibaba Cloud (Qwen) | 88.9% MMLU |
| 4 | Claude 3.5 Sonnet | Anthropic | 88.7% MMLU |
| 5 | GPT-4o | OpenAI | 88.7% MMLU |
| 6 | Llama 3.1 405B | Meta | 88.6% MMLU |
| 7 | DeepSeek V3 | DeepSeek | 88.5% MMLU |
| 8 | DeepSeek V3.2 | DeepSeek | 88.5% MMLU |
| 9 | Claude Sonnet 4.6 | Anthropic | 88.3% MMLU |
| 10 | Gemini 2.0 Flash | Google DeepMind | 87.9% MMLU |
| 11 | Grok 2 | xAI | 87.5% MMLU |
| 12 | Claude 3 Opus | Anthropic | 86.8% MMLU |
| 13 | GPT-4 Turbo | OpenAI | 86.4% MMLU |
| 14 | GPT-4 | OpenAI | 86.4% MMLU |
| 15 | Qwen 2.5 72B | Alibaba Cloud (Qwen) | 86.1% MMLU |
| 16 | Llama 3.3 70B | Meta | 86.0% MMLU |
| 17 | Gemini 1.5 Pro | Google DeepMind | 85.9% MMLU |
| 18 | Kimi K2.5 | — | 85.1% MMLU |
| 19 | Mistral Large 2 | Mistral AI | 84.0% MMLU |
| 20 | GPT-4o mini | OpenAI | 82.0% MMLU |
| 21 | Llama 3 70B | Meta | 82.0% MMLU |
| 22 | ERNIE 4.0 | — | 81.5% MMLU |
| 23 | Gemini 1.0 Pro | Google DeepMind | 79.1% MMLU |
| 24 | Claude 3 Sonnet | Anthropic | 79.0% MMLU |
| 25 | Gemini 1.5 Flash | Google DeepMind | 78.9% MMLU |
| 26 | Command R+ | Cohere | 75.7% MMLU |
| 27 | Claude 3 Haiku | Anthropic | 75.2% MMLU |
| 28 | Claude 3.5 Haiku | Anthropic | 75.2% MMLU |
| 29 | Mixtral 8x7B | Mistral AI | 70.6% MMLU |
| 30 | GPT-3.5 Turbo | OpenAI | 70.0% MMLU |
| 31 | Llama 2 70B | Meta | 68.9% MMLU |
| 32 | Mistral 7B | Mistral AI | 62.5% MMLU |
| 33 | GPT-3 (davinci-002) | OpenAI | 43.9% MMLU |
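
The ranking rule described in the methodology can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation. Only the `benchmark_mmlu` name comes from the methodology note. The record layout, the `rank_by_mmlu` helper, the `HUMAN_EXPERT_MMLU` constant, and the "Unscored model" entry are hypothetical. The sketch also assumes that models without a published score are left out of the ranking.

```python
from typing import Optional, TypedDict

# Approximate human expert average quoted in the methodology note.
HUMAN_EXPERT_MMLU = 89.8


class Model(TypedDict):
    # Hypothetical record layout; only `benchmark_mmlu` is named in the source.
    name: str
    provider: Optional[str]
    benchmark_mmlu: Optional[float]  # None when no MMLU score is published


def rank_by_mmlu(models: list[Model]) -> list[Model]:
    """Return models that have an MMLU score, highest score first."""
    scored = [m for m in models if m["benchmark_mmlu"] is not None]
    # sorted() is stable, so tied models keep their input order.
    return sorted(scored, key=lambda m: m["benchmark_mmlu"], reverse=True)


models: list[Model] = [
    {"name": "Claude 3.5 Sonnet", "provider": "Anthropic", "benchmark_mmlu": 88.7},
    {"name": "o1", "provider": "OpenAI", "benchmark_mmlu": 92.3},
    # Hypothetical entry standing in for a model with no published score.
    {"name": "Unscored model", "provider": None, "benchmark_mmlu": None},
    {"name": "DeepSeek R1", "provider": "DeepSeek", "benchmark_mmlu": 90.8},
    {"name": "GPT-4o", "provider": "OpenAI", "benchmark_mmlu": 88.7},
]

ranked = rank_by_mmlu(models)

for position, m in enumerate(ranked, start=1):
    note = " (above human expert average)" if m["benchmark_mmlu"] > HUMAN_EXPERT_MMLU else ""
    print(f"{position}. {m['name']}: {m['benchmark_mmlu']:.1f}% MMLU{note}")
```

Because the sort is stable, models with identical scores (such as Claude 3.5 Sonnet and GPT-4o at 88.7%) keep whatever order they had in the input, so their relative positions in a table like this one carry no meaning.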