MMLU
Everyone cites this score — understanding what it actually measures helps you decide how much to trust it
What is MMLU
MMLU stands for Massive Multitask Language Understanding. It is a benchmark — a standardised test for language models — covering 57 subjects across science, humanities, social science, and professional fields. Law, medicine, mathematics, history, computer science, philosophy. Roughly 16,000 multiple-choice questions, each with four options.
A model's MMLU score is the percentage of questions it answers correctly. Claude 3.5 Sonnet scores 88.7%, GPT-4o reports essentially the same figure, and Gemini 1.5 Pro sits in a similar range. Figures verified March 2026.
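The scoring itself is as simple as it sounds: one point per correct letter, divided by the total. A minimal sketch, using made-up questions and a stubbed `model_answer()` in place of a real model call (both are illustrative assumptions, not part of the benchmark):

```python
# Toy MMLU-style items: each has a question, four choices, and a gold letter.
questions = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "H2O is commonly known as?",
     "choices": ["salt", "water", "gold", "air"], "answer": "B"},
]

def model_answer(item):
    # Stand-in for a real model call; a real harness would prompt the model
    # with the question and choices and parse out a letter A-D.
    return "B"

correct = sum(model_answer(q) == q["answer"] for q in questions)
score = 100 * correct / len(questions)
print(f"MMLU-style score: {score:.1f}%")  # 100.0% on this toy set
```

Real harnesses add few-shot prompting and answer-extraction logic, but the final number is still just accuracy over multiple-choice items.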
Where it came from
Before MMLU, benchmarks mostly tested narrow capabilities. Could the model answer factual questions? Could it summarise text? Each test measured one thing. MMLU, published in 2020 by Dan Hendrycks and colleagues at UC Berkeley, was among the first benchmarks to test breadth — how well a model performs across a wide range of academic and professional domains simultaneously.
It became the standard citation because it was comprehensive and public. Everyone used it. Everyone reported it. It became the number.
What it tells you — and what it does not
MMLU scores tell you how well a model performs on multiple-choice academic questions across a broad range of subjects. That is genuinely useful information. A model with a high MMLU score has strong general knowledge and reasoning ability.
What it does not tell you: how the model performs on your specific task. How it handles ambiguous instructions. How it responds when the answer is genuinely uncertain. How consistent it is across repeated calls. How it performs in languages other than English.
A model that scores 88% on MMLU and one that scores 85% are not necessarily meaningfully different for your use case. The gap that matters is between 60% and 88% — the difference between a model with broad knowledge and one without.
Use MMLU as a baseline check, not a purchasing decision.
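One of the gaps listed above — consistency across repeated calls — is cheap to check yourself. A minimal sketch, where `call_model()` is a hypothetical stand-in for a real API call (here simulated with a seeded random generator purely for illustration):

```python
import random

def call_model(prompt, seed):
    # Assumption: stand-in for a real, slightly nondeterministic model call.
    # Simulated here with a seeded RNG that answers "yes" ~90% of the time.
    rng = random.Random(seed)
    return "yes" if rng.random() < 0.9 else "no"

def consistency(prompt, runs=10):
    # Fraction of repeated calls that agree with the most common answer.
    answers = [call_model(prompt, seed=i) for i in range(runs)]
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

print(f"consistency: {consistency('Is this invoice overdue?'):.0%}")
```

A score like this, measured on your own prompts, tells you something no leaderboard number does.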
Verified March 2026 · Source: Hendrycks et al., 2020 — "Measuring Massive Multitask Language Understanding"