Tutorial: Finding reproducers with Symflower
This example project helps you understand how you can use Symflower to efficiently find reproducers when debugging an application.
The project
Our example project is about exploring the possibility of growing artificial strawberries. The challenge is to computationally analyze variations in the strawberry's genome.
Specification: Write a program that measures the similarity between two linear sequences of base pairs in the strawberry's genome.
The genome consists of DNA molecules which are sequences of base pairs. There are four different bases:
- adenine (A)
- cytosine (C)
- guanine (G)
- thymine (T)
These DNA sequences represent the genetic code.
Solution: Implement the Levenshtein algorithm to measure the minimal edit distance between two sequences. It computes the minimal number of characters to insert, delete, or change to transform a given string into another. You can encode a DNA sequence as a string, with one character per base pair, like "ACGT" for a DNA sequence of four base pairs.
For example, to transform the string "ACG" into "ATTG", we need to change one character and insert one more.
The solution involves dividing the input problem into smaller sub-problems and combining their solutions. First, create an alignment matrix where every cell represents the minimal edit distance between two substrings of the input strings. This matrix shows how that would work for the "ACG -> ATTG" example input:
Observe that the cell values are the edit distances for corresponding substrings:
- Row A column A is 0, which is the edit distance between "A" and "A"
- Row A column C is 1, which is the edit distance between "AC" and "A"
- Row A column T₁ is 1, which is the edit distance between "A" and "AT"
Based on the three values above, we can compute the edit distance between "AC" and "AT".
There are three options that correspond to the values above.
- The distance between "A" and "A" (0) + the distance between "C" and "T" (1 because they mismatch) = 1
- The distance between "AC" and "A" (1) + the distance added by inserting "T" (1) = 2
- The distance between "A" and "AT" (1) + the distance added by deleting "C" (1) = 2
Since we are computing the minimal distance, we are using the minimum out of the three for the cell in row C column T₁.
The same process is applied to the remaining cells. Finally, the rightmost column in the last row is 2, which is the minimal edit distance between "ACG" and "ATTG".
This is your implementation:
class Levenshtein {
static int editDistance(String x, String y) {
if (x == null || y == null)
throw new IllegalArgumentException();
// Handle the easy cases: only insertions or deletions.
if (x.isEmpty()) return y.length();
if (y.isEmpty()) return x.length();
// Define the alignment matrix.
int[][] M = new int[y.length()][x.length()];
for (int i = 0; i < x.length(); i++) M[0][i] = i;
for (int j = 0; j < y.length(); j++) M[j][0] = j;
int j = 1;
int i = 1;
for (; j < y.length(); j++) {
for (i = 1; i < x.length(); i++) {
int match = M[j-1][i-1];
if (x.charAt(i-1) != y.charAt(j-1)) match++;
int left = M[j][i-1] + 1;
int above = M[j-1][i] + 1;
M[j][i] = Math.min(match, Math.min(left, above));
}
}
int distance = M[j-1][i-1];
assert (distance == 0) == x.equals(y);
return distance;
}
}
You add some tests to demonstrate that your implementation is correct for the three cases: insertion, deletion and modification of characters.
static void testEditDistance() {
assert editDistance("same", "same") == 0;
assert editDistance("insert", "inssert") == 1;
assert editDistance("delete", "deete") == 1;
assert editDistance("modify", "mödify") == 1;
assert editDistance("cake", "cookie") == 3;
}
However, your client for this project soon reports that the program crashes! Here's the stack trace:
Exception in thread "crack_strawberry_genome" java.lang.AssertionError
at Levenshtein.editDistance(Levenshtein.java:27)
Debugging the solution
We know that this assertion is violated:
assert (distance == 0) == x.equals(y);
There are two things that may have gone wrong: The computed distance is zero but the two input strings are different. The computed distance is greater than zero, but the two input strings are the same.
For smart and fast debugging, you can run Symflower's test suite generation feature. This will create a LevenshteinSymflowerTest.java
file with test cases to cover the entire function.
Here's the failing test that reveals the problem:
@Test
public void editDistance6() throws AssertionError {
String x = "A";
String y = "B";
// assertThrows(java.lang.AssertionError.class, () -> {
Plain.editDistance(x, y);
// });
}
The two input strings are different, but the nested loop that computes the alignment costs is never executed. This is a simple off-by-one error that can be corrected by adding 1 to the dimensions of the alignment matrix.
That's how Symflower's generated tests help you efficiently debug your applications in real time.