DevQualityEval benchmark
DevQualityEval is a standardized evaluation benchmark and framework for comparing and improving code generation with LLMs. The benchmark helps assess how applicable LLMs are to real-world software development tasks.
The DevQualityEval benchmark:
- Currently contains Java and Go tasks and corresponding test cases
- Covers three distinct task groups: "write test" (the "plain" repository contains 1 example and the "light" repository contains 23 examples for Java and Go each; see the sketch after this list), "code repair" (5 examples each), and "transpile" (5 examples)
- Provides metrics and comparisons for evaluating LLMs on the specified development tasks
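As an illustration of the "write test" task group, here is a minimal sketch assuming the shape of the trivial "plain" example: the model is given a source file and must respond with a test file that compiles and covers the code under test.

```go
// plain.go: the source file handed to the model (illustrative, not the
// benchmark's exact repository content).
package plain

func plain() {}
```

A valid model response then looks roughly like the following test file; it only scores if it compiles and actually executes the function under test.

```go
// plain_test.go: a hypothetical model response that reaches full coverage.
package plain

import "testing"

func TestPlain(t *testing.T) {
	plain()
}
```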
DevQualityEval grades models with a point-based system (sketched below) that rewards:
- Providing compiling code
- Reaching sufficient coverage with the generated test code

The benchmark also measures the processing time it takes to generate responses and counts the number of characters in model responses to highlight models that provide short and efficient responses.
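The sketch below shows how such point-based grading could be aggregated in Go; the type, field names, and weights are hypothetical and only illustrate the idea, not DevQualityEval's actual implementation.

```go
package scoring

import "time"

// Assessment is a hypothetical container for the signals described above.
type Assessment struct {
	Compiles          bool          // Whether the generated code compiles.
	CoveredStatements uint          // Statements covered by the generated tests.
	ResponseTime      time.Duration // Time taken to generate the response.
	ResponseChars     uint          // Length of the model response.
}

// Score awards points for code that compiles and for the coverage the
// generated tests reach. Processing time and response length are tracked
// alongside the score to highlight short, efficient responses.
func (a Assessment) Score() (points uint) {
	if a.Compiles {
		points++
	}
	points += a.CoveredStatements

	return points
}
```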
DevQualityEval supports any inference endpoint that implements the OpenAI inference API and runs on Linux, macOS, and Windows.
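Since the OpenAI inference API centers on the chat completions endpoint, any server exposing that request shape can be benchmarked. A minimal sketch of such a request in Go follows; the endpoint URL, model name, API-key environment variable, and prompt are placeholders, not DevQualityEval's own client code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Build an OpenAI-style chat completion request. The model name and
	// prompt are placeholders.
	payload, err := json.Marshal(map[string]any{
		"model": "example-model",
		"messages": []map[string]string{
			{"role": "user", "content": "Write a Go test file for the given source file."},
		},
	})
	if err != nil {
		panic(err)
	}

	// Any OpenAI-compatible base URL works; "http://localhost:8080" is a placeholder.
	request, err := http.NewRequest(http.MethodPost, "http://localhost:8080/v1/chat/completions", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	request.Header.Set("Content-Type", "application/json")
	request.Header.Set("Authorization", "Bearer "+os.Getenv("PROVIDER_API_KEY"))

	response, err := http.DefaultClient.Do(request)
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	body, err := io.ReadAll(response.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```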
Read the DevQualityEval documentation and see the results and insights from our latest deep dive.