Highest Benchmark

Models ranked by MMLU (Massive Multitask Language Understanding) score. MMLU is a multiple-choice benchmark that tests knowledge and reasoning across 57 academic and professional subjects; scores above about 90% exceed the estimated average human expert performance.

Methodology: Sorted by benchmark_mmlu descending. Human expert average is approximately 89.8%. Not all models have published MMLU scores.
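
The methodology above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: only the field name benchmark_mmlu and the 89.8% figure come from the text, while the Model type, the rank_by_mmlu function, the sample rows, and the choice to skip unscored models are assumptions made for the example.

```python
from typing import Optional, TypedDict

HUMAN_EXPERT_MMLU = 89.8  # approximate human expert average cited in the methodology


class Model(TypedDict):
    # Hypothetical record shape; only the benchmark_mmlu field name comes from the text.
    name: str
    benchmark_mmlu: Optional[float]  # None when no MMLU score is published


def rank_by_mmlu(models: list[Model]) -> list[tuple[int, str, float, bool]]:
    """Return (rank, name, score, exceeds_expert_average), sorted by score descending."""
    # Assumption: models without a published score are left out of the ranking.
    scored = [m for m in models if m["benchmark_mmlu"] is not None]
    # sorted() is stable, so models with tied scores keep their input order.
    ordered = sorted(scored, key=lambda m: m["benchmark_mmlu"], reverse=True)
    return [
        (i, m["name"], m["benchmark_mmlu"], m["benchmark_mmlu"] > HUMAN_EXPERT_MMLU)
        for i, m in enumerate(ordered, start=1)
    ]


sample: list[Model] = [
    {"name": "GPT-4o", "benchmark_mmlu": 88.7},
    {"name": "o1", "benchmark_mmlu": 92.3},
    {"name": "Unscored model", "benchmark_mmlu": None},  # hypothetical entry
    {"name": "DeepSeek R1", "benchmark_mmlu": 90.8},
]

for rank, name, score, above in rank_by_mmlu(sample):
    label = "above" if above else "below"
    print(f"{rank}\t{name}\t{score:.1f}% MMLU\t{label} expert average")
```

Because the sort is stable, tied scores (for example the two models at 88.7%) would appear in whatever order the input lists them; the page does not state a tie-breaking rule, so none is assumed here.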

#	Model	Provider	Metric
1	o1	OpenAI	92.3% MMLU
2	DeepSeek R1	DeepSeek	90.8% MMLU
3	Qwen3 235B A22B	Alibaba Cloud (Qwen)	88.9% MMLU
4	Claude 3.5 Sonnet	Anthropic	88.7% MMLU
5	GPT-4o	OpenAI	88.7% MMLU
6	Llama 3.1 405B	Meta	88.6% MMLU
7	DeepSeek V3	DeepSeek	88.5% MMLU
8	DeepSeek V3.2	DeepSeek	88.5% MMLU
9	Claude Sonnet 4.6	Anthropic	88.3% MMLU
10	Gemini 2.0 Flash	Google DeepMind	87.9% MMLU
11	Grok 2	xAI	87.5% MMLU
12	Claude 3 Opus	Anthropic	86.8% MMLU
13	GPT-4 Turbo	OpenAI	86.4% MMLU
14	GPT-4	OpenAI	86.4% MMLU
15	Qwen 2.5 72B	Alibaba Cloud (Qwen)	86.1% MMLU
16	Llama 3.3 70B	Meta	86.0% MMLU
17	Gemini 1.5 Pro	Google DeepMind	85.9% MMLU
18	Kimi K2.5	—	85.1% MMLU
19	Mistral Large 2	Mistral AI	84.0% MMLU
20	GPT-4o mini	OpenAI	82.0% MMLU
21	Llama 3 70B	Meta	82.0% MMLU
22	ERNIE 4.0	—	81.5% MMLU
23	Gemini 1.0 Pro	Google DeepMind	79.1% MMLU
24	Claude 3 Sonnet	Anthropic	79.0% MMLU
25	Gemini 1.5 Flash	Google DeepMind	78.9% MMLU
26	Command R+	Cohere	75.7% MMLU
27	Claude 3 Haiku	Anthropic	75.2% MMLU
28	Claude 3.5 Haiku	Anthropic	75.2% MMLU
29	Mixtral 8x7B	Mistral AI	70.6% MMLU
30	GPT-3.5 Turbo	OpenAI	70.0% MMLU
31	Llama 2 70B	Meta	68.9% MMLU
32	Mistral 7B	Mistral AI	62.5% MMLU
33	GPT-3 (davinci-002)	OpenAI	43.9% MMLU
