HumanEval benchmark

OpenAI code generation benchmark — pass@1 on 164 Python problems, 0-100%

What is the HumanEval benchmark?

HumanEval tests code generation with 164 Python programming problems written by OpenAI researchers. A model passes a problem if its generated code passes all of that problem's unit tests. If you are building a coding assistant, HumanEval is the most relevant single benchmark. Scores above 80% indicate strong code generation capability. sourc.dev tracks HumanEval as a verified attribute on every model that has been evaluated.
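The pass rule above can be sketched in a few lines. This is a minimal illustration, not the official harness: the problem below is a hypothetical, hand-written example laid out like a HumanEval record (a `prompt` with signature and docstring, an `entry_point` name, and a `test` that defines `check(candidate)`), and the helper name `passes` is an assumption made for this sketch.

```python
# Hypothetical problem in a HumanEval-style layout. It is NOT one of the
# 164 real problems; it only illustrates the pass/fail rule.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}


def passes(problem: dict, completion: str) -> bool:
    """Return True only if prompt + completion passes every assertion."""
    program = problem["prompt"] + completion + "\n\n" + problem["test"]
    namespace: dict = {}
    try:
        # WARNING: exec runs the generated code directly. A real harness
        # would sandbox this step and enforce a timeout.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        # Any failed assertion, syntax error, or runtime error is a fail.
        return False
    return True


good = "    return a + b\n"  # hypothetical model output
bad = "    return a - b\n"   # hypothetical model output

print(passes(problem, good))  # True
print(passes(problem, bad))   # False
```

A single failed assertion fails the whole problem, which is why a completion that is almost right scores the same as one that is completely wrong.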

Why it matters

If you are building anything that generates code, such as a coding assistant, a test generator, or a migration tool, HumanEval scores give you the best available baseline for comparing models. Scores above 80% indicate the model can handle most standard programming tasks. Below 60%, expect frequent errors that need human correction. sourc.dev tracks HumanEval as a verified attribute and uses it in capability comparisons.

Where models stand

92.4 %

Claude 3.5 Sonnet: 92%

DeepSeek R1: 91.6%

GPT-4o: 90.2%

Llama 3.1 405B: 89%

Data available for 15 of 271 tracked entities.

How sourc.dev tracks this

sourc.dev verifies HumanEval benchmark scores manually from official provider documentation, API responses, and published specifications. Every data point includes a source URL and verification date. When a value changes, the old value is preserved in the history table and the new value is recorded alongside it. Nothing is overwritten; the full timeline is always available.
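One way to picture the "nothing is overwritten" rule is an append-only record. The sketch below is an assumption-heavy illustration, not sourc.dev's actual schema: the class names, field names, scores, and example.com URLs are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Observation:
    """One verified data point (hypothetical field names)."""
    value: float
    source_url: str
    verified_on: date


class AttributeHistory:
    """Append-only timeline: new values are added, old ones never change."""

    def __init__(self) -> None:
        self._rows = []

    def record(self, observation: Observation) -> None:
        # Only appends; there is deliberately no update or delete method.
        self._rows.append(observation)

    @property
    def current(self) -> Optional[Observation]:
        return self._rows[-1] if self._rows else None

    @property
    def timeline(self) -> Tuple[Observation, ...]:
        # Return an immutable copy so callers cannot edit the history.
        return tuple(self._rows)


# Hypothetical values and placeholder URLs, for illustration only.
history = AttributeHistory()
history.record(Observation(90.0, "https://example.com/report-v1", date(2024, 1, 15)))
history.record(Observation(91.0, "https://example.com/report-v2", date(2024, 6, 1)))

print(history.current.value)   # 91.0
print(len(history.timeline))   # 2
```

Because each observation is frozen and the timeline is returned as a tuple, the earlier value stays readable after a newer one is recorded.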

Frequently asked questions

What programming language does HumanEval use?

Python. All 164 problems are Python functions with docstrings, and the model must generate correct Python code that passes the provided unit tests. Extended versions exist: HumanEval+ adds many more test cases to the same Python problems, and MultiPL-E translates the problems into other programming languages.

What does pass@1 mean?

pass@1 is the probability that a single generated solution passes all tests. It is the strictest measure: one attempt, pass or fail. Some benchmarks also report pass@10 or pass@100, the probability that at least one of 10 or 100 generated solutions passes all tests.
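In practice, pass@k is usually estimated by sampling n solutions per problem, counting the c that pass, and computing 1 - C(n-c, k) / C(n, k), the unbiased estimator described in the paper that introduced HumanEval. The sketch below assumes that estimator; the function names and the sample counts are illustrative, not real results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples contains
    # no passing solution is C(n - c, k) / C(n, k).
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts: 10 samples per problem, with 3, 10, and 0 passing.
results = [(10, 3), (10, 10), (10, 0)]

print(round(pass_at_k(10, 3, 1), 3))         # 0.3
print(round(benchmark_score(results, 1), 3))  # 0.433
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass, which is why pass@1 reads as a plain percentage.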

Is HumanEval representative of real coding tasks?

HumanEval covers algorithmic and utility functions. It does not test full-application development, debugging, refactoring, or working with large codebases. It is a useful signal for code generation ability but not a complete picture of coding capability.