MMLU benchmark
Massive Multitask Language Understanding — 57-subject knowledge test, 0-100%
What is the MMLU benchmark?
MMLU (Massive Multitask Language Understanding) tests knowledge and reasoning across 57 academic subjects, including maths, history, law, medicine, and computer science. Every question is multiple-choice with four options, and the score is the percentage answered correctly. The average human expert scores approximately 89.8%. On sourc.dev, MMLU scores range from 43.9% (GPT-3 Davinci) to 90%+ (Claude 3.5 Sonnet, GPT-4o). A 10-point MMLU difference between two models is meaningful; a difference of 2 points or less is within measurement noise.
Why it matters
MMLU is the most commonly cited benchmark, which makes it useful for comparison even though no single benchmark tells the complete story. When a new model claims to be "state of the art", check its MMLU score against the field on sourc.dev. A model at 86% MMLU is meaningfully more capable than one at 70% for general reasoning tasks, but a 1-point difference between 88% and 89% is within measurement noise. sourc.dev uses MMLU as an input to the Value Density Score, which measures benchmark points per dollar of input cost.
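As a rough illustration of "benchmark points per dollar of input cost", the ratio can be sketched in a few lines of Python. The function name `value_density`, the choice of USD per million input tokens as the cost unit, and the example numbers are all assumptions made for this sketch; the exact formula sourc.dev uses is not spelled out on this page.

```python
def value_density(mmlu_percent: float, input_cost_usd_per_mtok: float) -> float:
    """Return benchmark points per dollar of input cost.

    Assumption: cost is expressed in USD per one million input tokens.
    This is an illustrative ratio, not sourc.dev's confirmed formula.
    """
    if not 0.0 <= mmlu_percent <= 100.0:
        raise ValueError("MMLU score must be between 0 and 100")
    if input_cost_usd_per_mtok <= 0:
        raise ValueError("input cost must be positive")
    return mmlu_percent / input_cost_usd_per_mtok


# Made-up scores and prices for illustration only, not real model data.
expensive = value_density(86.0, 4.0)   # higher score, higher price
cheap = value_density(70.0, 0.5)       # lower score, much lower price

print(f"expensive model: {expensive:.1f} points per dollar")
print(f"cheap model:     {cheap:.1f} points per dollar")
```

Under these made-up numbers the cheaper model has the higher ratio despite its lower score, which is why a per-dollar figure is best read next to the raw MMLU score and not in place of it.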
Where models stand
Data available for 33 of 271 tracked entities.
How sourc.dev tracks this
sourc.dev verifies MMLU scores manually from official provider documentation, API responses, and published specifications. Every data point includes a source URL and verification date. When a value changes, the old value is preserved in the history table and the new value is recorded alongside it. Nothing is overwritten, so the full timeline is always available.
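The append-only behaviour described above can be pictured with a small in-memory sketch. The class and field names (`TrackedMetric`, `DataPoint`, `source_url`, `verified_on`) and the example.com URLs are hypothetical; this is not sourc.dev's actual schema or code, only one way to model "nothing is overwritten".

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataPoint:
    """One verified value; frozen so a stored record cannot be mutated."""
    value: float
    source_url: str
    verified_on: date


class TrackedMetric:
    """Append-only history: new values are added, old ones are never replaced."""

    def __init__(self) -> None:
        self._history: list[DataPoint] = []

    def record(self, value: float, source_url: str, verified_on: date) -> None:
        # Appending keeps every earlier data point intact.
        self._history.append(DataPoint(value, source_url, verified_on))

    @property
    def current(self) -> DataPoint:
        if not self._history:
            raise LookupError("no data points recorded yet")
        return self._history[-1]

    @property
    def history(self) -> tuple[DataPoint, ...]:
        # Return an immutable copy so callers cannot edit the timeline.
        return tuple(self._history)


# Placeholder values and URLs, for illustration only.
mmlu = TrackedMetric()
mmlu.record(85.0, "https://example.com/model-card-v1", date(2024, 1, 10))
mmlu.record(86.5, "https://example.com/model-card-v2", date(2024, 6, 2))

print(mmlu.current.value)   # latest value
print(len(mmlu.history))    # both entries are still available
```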
MMLU covers 57 subjects spanning STEM, humanities, social sciences, and professional domains, including abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, computer security, econometrics, jurisprudence, and virology, among others.
Human expert performance averages approximately 89.8%. Leading models now score above 85%, with some exceeding 90%. A score above 70% indicates strong general knowledge. A score below 50% indicates weak general knowledge, and about 25% is what random guessing yields on four-choice questions.
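For reference, the chance baseline on a four-option multiple-choice test can be worked out directly: uniform guessing is right one time in four. The sketch below computes that figure and checks it with a small seeded simulation. The function names and the 10,000-question sample size are arbitrary choices for illustration, not properties of MMLU itself.

```python
import random


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy when guessing uniformly among num_choices options."""
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    return 1.0 / num_choices


def simulate_guessing(num_questions: int, num_choices: int = 4, seed: int = 0) -> float:
    """Estimate guessing accuracy empirically with a seeded random generator."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_questions):
        answer = rng.randrange(num_choices)  # index of the correct option
        guess = rng.randrange(num_choices)   # uniform random guess
        if guess == answer:
            correct += 1
    return correct / num_questions


print(f"{chance_accuracy(4):.0%}")          # 25%
print(f"{simulate_guessing(10_000):.1%}")   # close to 25%, varies with the seed
```

A reported score is most informative when read against this floor: the distance above 25% is what reflects knowledge, since the first 25 points are available to any guesser.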
MMLU remains the most widely cited benchmark for general reasoning. However, as top models approach and exceed human expert scores, its discriminative power is decreasing. Newer benchmarks like GPQA and MATH target harder problems.