Best Value
Models ranked by benchmark performance per dollar of input cost. Higher scores mean more capability for less money.
Methodology: Each score is computed as benchmark_mmlu / input_price_per_1m, that is, the model's MMLU score divided by its input price in dollars per 1M tokens. Only models with both values are included.
| # | Model | Provider | Value (MMLU score per $1 of input price) |
|---|---|---|---|
| 1 | Gemini 1.5 Flash | Google DeepMind | 1052.0 score/$1 |
| 2 | Gemini 2.0 Flash | Google DeepMind | 879.0 score/$1 |
| 3 | Llama 3.3 70B | Meta | 860.0 score/$1 |
| 4 | Qwen 2.5 72B | Alibaba Cloud (Qwen) | 717.5 score/$1 |
| 5 | GPT-4o mini | OpenAI | 546.7 score/$1 |
| 6 | DeepSeek V3.2 | DeepSeek | 340.4 score/$1 |
| 7 | DeepSeek V3 | DeepSeek | 327.8 score/$1 |
| 8 | Claude 3 Haiku | Anthropic | 300.8 score/$1 |
| 9 | Mistral 7B | Mistral AI | 250.0 score/$1 |
| 10 | Kimi K2.5 | — | 212.7 score/$1 |
| 11 | Llama 3 70B | Meta | 160.8 score/$1 |
| 12 | Gemini 1.0 Pro | Google DeepMind | 158.2 score/$1 |
| 13 | Mixtral 8x7B | Mistral AI | 130.7 score/$1 |
| 14 | DeepSeek R1 | DeepSeek | 129.7 score/$1 |
| 15 | ERNIE 4.0 | — | 97.0 score/$1 |
| 16 | Claude 3.5 Haiku | Anthropic | 94.0 score/$1 |
| 17 | Llama 2 70B | Meta | 76.6 score/$1 |
| 18 | Gemini 1.5 Pro | Google DeepMind | 68.7 score/$1 |
| 19 | GPT-3.5 Turbo | OpenAI | 46.7 score/$1 |
| 20 | Grok 2 | xAI | 43.8 score/$1 |
| 21 | Mistral Large 2 | Mistral AI | 42.0 score/$1 |
| 22 | GPT-4o | OpenAI | 35.5 score/$1 |
| 23 | Command R+ | Cohere | 30.3 score/$1 |
| 24 | Claude 3.5 Sonnet | Anthropic | 29.6 score/$1 |
| 25 | Claude Sonnet 4.6 | Anthropic | 29.4 score/$1 |
| 26 | Claude 3 Sonnet | Anthropic | 26.3 score/$1 |
| 27 | Llama 3.1 405B | Meta | 17.7 score/$1 |
| 28 | GPT-4 Turbo | OpenAI | 8.6 score/$1 |
| 29 | o1 | OpenAI | 6.2 score/$1 |
| 30 | Claude 3 Opus | Anthropic | 5.8 score/$1 |
| 31 | GPT-4 | OpenAI | 2.9 score/$1 |
| 32 | GPT-3 (davinci-002) | OpenAI | 0.7 score/$1 |
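
The ranking above can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the site's actual implementation: the field names `benchmark_mmlu` and `input_price_per_1m` come from the methodology line, while the `Model` class, the helper functions, and the placeholder entries (`model-a` through `model-d`, with made-up numbers) are hypothetical. Skipping non-positive prices is also an assumption, added to avoid division by zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    name: str
    benchmark_mmlu: Optional[float]      # MMLU score, in points
    input_price_per_1m: Optional[float]  # dollars per 1M input tokens


def value_score(model: Model) -> Optional[float]:
    """Return MMLU points per dollar, or None if the model cannot be scored."""
    if model.benchmark_mmlu is None or model.input_price_per_1m is None:
        return None  # only models with both values are included
    if model.input_price_per_1m <= 0:
        return None  # assumption: skip free or invalid prices
    return model.benchmark_mmlu / model.input_price_per_1m


def rank_by_value(models: list[Model]) -> list[tuple[str, float]]:
    """Score every eligible model and sort from best to worst value."""
    scored = [
        (m.name, score)
        for m in models
        if (score := value_score(m)) is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical placeholder data, not taken from the table above.
models = [
    Model("model-a", 80.0, 0.25),
    Model("model-b", 90.0, 3.00),
    Model("model-c", None, 1.00),   # no benchmark score: excluded
    Model("model-d", 70.0, None),   # no price: excluded
]

for rank, (name, score) in enumerate(rank_by_value(models), start=1):
    print(f"{rank} | {name} | {score:.1f} score/$1")
```

Because the price sits in the denominator, a cheap model with a moderate score (`model-a`) outranks a stronger but more expensive one (`model-b`), which is the same effect that puts small, low-cost models at the top of the table.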