HumanEval
Before you trust a model with your code, know what this test actually contains
What is HumanEval
HumanEval is a coding benchmark: 164 hand-written Python problems, each with a function signature, a docstring describing what the function should do, and a set of test cases the completed function must pass.
The model writes the code. The test cases run. Pass or fail.
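The pass/fail loop can be sketched in a few lines. This is not the official evaluation harness (which sandboxes execution and measures pass@k over many samples); the problem data and names below are illustrative:

```python
# Minimal sketch of a HumanEval-style check: prompt + model completion
# + test cases are run as one program. Pass iff no assertion fails.
# Illustrative data only -- not an actual benchmark problem.
problem = {
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
}

completion = "    return a + b\n"  # the model's continuation of the prompt

def check(problem: dict, completion: str) -> bool:
    """Execute the completed function plus its tests."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace: dict = {}
    try:
        exec(program, namespace)
        return True
    except Exception:
        return False

print(check(problem, completion))  # True: the completion passes its tests
```

The real harness runs each program in an isolated subprocess with a timeout, since executing model-written code directly is unsafe.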
Claude 3.5 Sonnet scores 92.0% on HumanEval. That means it writes code that passes the test suite on 151 of 164 problems. GPT-4o scores similarly. Verified March 2026.
What the test actually looks like
A typical HumanEval problem looks like this:
```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer
    to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```
The model completes the function. The test cases verify it. Simple problems. Common patterns. Standard Python.
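For a sense of what "completing the function" means, here is one solution that satisfies the docstring examples above. It is a plausible model output, not code from the benchmark itself:

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two numbers differ by less than threshold."""
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:  # compare each pair once
            if abs(a - b) < threshold:
                return True
    return False


# The docstring examples hold:
print(has_close_elements([1.0, 2.0, 3.0], 0.5))                  # False
print(has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3))   # True
```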
Why this matters to you
A 92% HumanEval score tells you the model handles standard algorithmic problems reliably. That is useful context for choosing a model for coding assistance.
What it does not tell you: how the model performs on your codebase, your language, your framework, your edge cases. HumanEval problems are self-contained. Real codebases are not. Real bugs involve context spread across ten files, tribal knowledge about architectural decisions, and requirements that changed six months ago.
HumanEval is a floor, not a ceiling. A model that struggles on HumanEval will struggle with your code. A model that excels at HumanEval may still struggle with your code. The score tells you the model is capable. Your own testing tells you whether it is capable for your task.
Verified March 2026 · Source: Chen et al., 2021 — "Evaluating Large Language Models Trained on Code"