sourc.dev
Claude 3.5 Sonnet input $3.00/1M ↓ -50%
GPT-4o input $2.50/1M
Gemini 1.5 Pro input $1.25/1M
Mistral Large input $2.00/1M ↓ -33%
DeepSeek V3 input $0.27/1M
synced 2026-04-05

Highest Benchmark

Models ranked by MMLU (Massive Multitask Language Understanding) score. MMLU tests knowledge and reasoning across 57 subjects; scores above roughly 89.8% exceed the estimated average human expert performance.

Methodology: Models are sorted by benchmark_mmlu in descending order. The human expert average is approximately 89.8%. Not all models have published MMLU scores, so some models do not appear in the ranking.
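
The ranking rule described in the methodology can be sketched in a few lines of Python. This is an illustration only, not the site's actual implementation: the record layout, the rank_by_mmlu and above_human_expert helpers, and the "example-unscored-model" entry are hypothetical, and dropping models without a published score is an assumption. Only the benchmark_mmlu field name, the 89.8% reference value, and the four sample scores come from this page.

```python
from typing import Optional, TypedDict


class Model(TypedDict):
    name: str
    benchmark_mmlu: Optional[float]  # None when no MMLU score is published


# Approximate human expert average quoted in the methodology.
HUMAN_EXPERT_MMLU = 89.8

# A few rows from the table, plus one hypothetical model with no score.
SAMPLE: list[Model] = [
    {"name": "Claude 3.5 Sonnet", "benchmark_mmlu": 88.7},
    {"name": "example-unscored-model", "benchmark_mmlu": None},  # hypothetical
    {"name": "o1", "benchmark_mmlu": 92.3},
    {"name": "GPT-4o", "benchmark_mmlu": 88.7},
    {"name": "DeepSeek R1", "benchmark_mmlu": 90.8},
]


def rank_by_mmlu(models: list[Model]) -> list[Model]:
    """Drop models without a score (assumption), then sort descending.

    sorted() is stable, so models with equal scores keep their input order.
    """
    scored = [m for m in models if m["benchmark_mmlu"] is not None]
    return sorted(scored, key=lambda m: m["benchmark_mmlu"], reverse=True)


def above_human_expert(models: list[Model]) -> list[str]:
    """Names of ranked models whose score exceeds the human expert average."""
    return [
        m["name"]
        for m in rank_by_mmlu(models)
        if m["benchmark_mmlu"] > HUMAN_EXPERT_MMLU
    ]


if __name__ == "__main__":
    for rank, model in enumerate(rank_by_mmlu(SAMPLE), start=1):
        print(f"{rank} {model['name']} {model['benchmark_mmlu']:.1f}% MMLU")
```

Because the sort is stable, tied models such as Claude 3.5 Sonnet and GPT-4o (both 88.7%) stay in their input order; a real ranking would likely need an explicit tie-break rule, which this page does not state.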

#   Model                 MMLU score
1   o1                    92.3%
2   DeepSeek R1           90.8%
3   Qwen3 235B A22B       88.9%
4   Claude 3.5 Sonnet     88.7%
5   GPT-4o                88.7%
6   Llama 3.1 405B        88.6%
7   DeepSeek V3           88.5%
8   DeepSeek V3.2         88.5%
9   Claude Sonnet 4.6     88.3%
10  Gemini 2.0 Flash      87.9%
11  Grok 2                87.5%
12  Claude 3 Opus         86.8%
13  GPT-4 Turbo           86.4%
14  GPT-4                 86.4%
15  Qwen 2.5 72B          86.1%
16  Llama 3.3 70B         86.0%
17  Gemini 1.5 Pro        85.9%
18  Kimi K2.5             85.1%
19  Mistral Large 2       84.0%
20  GPT-4o mini           82.0%
21  Llama 3 70B           82.0%
22  ERNIE 4.0             81.5%
23  Gemini 1.0 Pro        79.1%
24  Claude 3 Sonnet       79.0%
25  Gemini 1.5 Flash      78.9%
26  Command R+            75.7%
27  Claude 3 Haiku        75.2%
28  Claude 3.5 Haiku      75.2%
29  Mixtral 8x7B          70.6%
30  GPT-3.5 Turbo         70.0%
31  Llama 2 70B           68.9%
32  Mistral 7B            62.5%
33  GPT-3 (davinci-002)   43.9%