DevQualityEval LLM benchmark
This page provides more information on the DevQualityEval benchmark for LLMs.
What is DevQualityEval?
DevQualityEval is a standardized evaluation benchmark and framework for comparing and improving LLMs for software development. It assesses how well LLMs handle real-world software engineering tasks by combining a range of task types that challenge models across various software development use cases, and it provides metrics and comparisons to grade and rank model performance.
Tip: For up-to-date results, check out the latest DevQualityEval deep dive, and see the DevQualityEval leaderboard for detailed results.
Since all deep dives build upon each other, it is worth taking a look at the previous deep dives:
- Anthropic's Claude 3.7 Sonnet is the new king 👑 of code generation (but only with help), and DeepSeek R1 disappoints (Deep dives from the DevQualityEval v1.0)
- OpenAI's o1-preview is the king 👑 of code generation but is super slow and expensive (Deep dives from the DevQualityEval v0.6)
- DeepSeek v2 Coder and Claude 3.5 Sonnet are more cost-effective at code generation than GPT-4o! (Deep dives from the DevQualityEval v0.5.0)
- Is Llama-3 better than GPT-4 for generating tests? And other deep dives of the DevQualityEval v0.4.0
- Can LLMs test a Go function that does nothing?