DevQualityEval benchmark

DevQualityEval is a standardized evaluation benchmark and framework for comparing and improving code generation with LLMs. It helps assess how well LLMs handle real-world software development tasks.

The DevQualityEval benchmark:

  • Currently contains Java and Go tasks and corresponding test cases
  • Covers three distinct task groups: "write test" (the plain repository contains 1 example, the light repository 23 examples for Java and Go each), "code repair" (5 examples each), and "transpile" (5 examples); an illustrative "write test" pair follows this list
  • Provides metrics and comparisons for evaluating LLMs for the specified development tasks
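
To make the task groups concrete, here is an illustrative "write test" pair: a small Go function of the kind a task repository might contain, and the unit test a model would be expected to generate. The function and package names are hypothetical and not taken from the benchmark repositories.

```go
// plain.go: a hypothetical function a "write test" task could present.
package plain

// Add returns the sum of two integers.
func Add(a, b int) int {
	return a + b
}
```

```go
// plain_test.go: the kind of test a model is expected to produce;
// it must compile and reach sufficient coverage of the function.
package plain

import "testing"

func TestAdd(t *testing.T) {
	if got := Add(1, 2); got != 3 {
		t.Errorf("Add(1, 2) = %d, want 3", got)
	}
}
```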

DevQualityEval grades models in a point-based system that rewards:

  • Providing compiling code
  • Reaching sufficient coverage with the generated test code

The benchmark also measures the processing time needed to generate each response and counts its characters, to highlight models that answer concisely and efficiently.
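
The following is a minimal sketch of how such a point-based grading pass could look. The categories and point values are assumptions chosen for illustration, not DevQualityEval's actual scoring weights.

```go
// Illustrative sketch of a point-based grading pass.
// The categories and point values are assumptions, not
// DevQualityEval's actual scoring rules.
package main

import "fmt"

// Assessment captures the outcome of one model response to one task.
type Assessment struct {
	ResponseNoError bool    // the response could be processed at all
	CodeCompiles    bool    // the extracted code compiles
	Coverage        float64 // coverage reached by generated tests, 0..1
	ResponseChars   int     // response length, recorded for efficiency comparisons
	ProcessingTime  float64 // seconds taken to produce the response
}

// Score awards points for compiling code and for coverage.
// Response length and processing time are recorded but not scored here.
func Score(a Assessment) int {
	points := 0
	if a.ResponseNoError {
		points++
	}
	if a.CodeCompiles {
		points++
	}
	// Hypothetical rule: one point per 10% of coverage reached.
	points += int(a.Coverage * 10)
	return points
}

func main() {
	a := Assessment{ResponseNoError: true, CodeCompiles: true, Coverage: 0.8, ResponseChars: 512, ProcessingTime: 3.2}
	fmt.Printf("points: %d (chars=%d, time=%.1fs)\n", Score(a), a.ResponseChars, a.ProcessingTime)
}
```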

DevQualityEval supports any inference endpoint that implements the OpenAI inference API and runs on Linux, macOS, and Windows.
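
In practice, implementing the OpenAI inference API means the endpoint accepts standard chat-completions requests. The Go sketch below sends one such request; the base URL, model name, and API_KEY environment variable are placeholders, not values required by DevQualityEval.

```go
// Minimal sketch of a chat-completions request against any
// OpenAI-compatible endpoint. The base URL, model name, and
// API_KEY variable are placeholders for illustration only.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	body, err := json.Marshal(map[string]any{
		"model": "example-model",
		"messages": []map[string]string{
			{"role": "user", "content": "Write a Go unit test for func Add(a, b int) int."},
		},
	})
	if err != nil {
		panic(err)
	}

	req, err := http.NewRequest(http.MethodPost, "http://localhost:8080/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	raw, _ := io.ReadAll(resp.Body)
	fmt.Println(string(raw))
}
```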

Comparing the capabilities and costs of top models with DevQualityEval

Read the DevQualityEval documentation and see the results and insights from our latest deep dive.