In this post I'll write a JavaScript implementation of the Levenshtein Word Distance algorithm which measures the "cost" of transforming one word into another by totalling the number of letters which need to be inserted, deleted or substituted.
The Levenshtein Word Distance has a fairly obvious use in helping spell checkers decided which words to suggest as alternatives to mis-spelled words: if the distance is low between a mis-spelled word and an actual word then it is likely that word is what the user intended to type. However, it can be used in any situation where strings of characters need to be compared, such as DNA matching.
The Levenshtein Word Distance
I usually leave the program's output until near the end of each post but this time I'll put it up front as it is the easiest way of showing how the algorithm works. I'll actually show two program outputs, the first being after the required data structure has been initialized but before the algorithm has been run, and the second after the algorithm has been run.
Here we have a hypothetical situation where somebody has typed "banama" and we want to see how close it is to "banana". The source word "banama" is shown down the left and the target word "banana" is along the top. We also have a grid of integers with one more row than the number of letters in the source, and one more column than the number of letters in the target.
Most values are initially set to 0, but the first row of numbers represent the cumulative number of letters which need to be inserted to form the target, and the first column shows the cumulative number of deletions to remove the source. In other words, to transform "banama" into "banana" you could delete six letters and insert six more. Obviously that's not the most efficient way though as only one operation needs to be done, the substitution of the "m" with an "n". Let's now look at the program output after the algorithm has been run.
All the other numbers have been filled in, and the number in the bottom right is the total cost of transforming "banama" to "banana" using the minimum number of operations, in this case 1.
The values for each of the 36 numbers initialized to 0 are calculated as follows:
- Find the value diagonally top-left and if the corresponding letters are different add 1 (substitution cost)
- Find the value above and add 1 (deletion cost)
- Find the value to the left and add 1 (insertion cost)
- Set the current value to the lowest of the above three values
The overall plan of the implemention of Levenshtein's Word Distance should now be clear - given two words we just need to create a suitably sized 2D array, initialize the numbers and then iterate the rows and columns, setting the elements to their final values.
The Code
The project consists of a set of files including an HTML file for displaying output in a browser, CSS and graphics files, and two JavaScript files, levenshtein.js and levenshteinpage.js. The files can be downloaded as a zip from the Downloads page, or cloned/downloaded from Github.
Source Code Links
This is the levenshtein.js file.
levenshtein.js
function levenshteinCreate(source_word, target_word) { let lev = {}; lev.grid = []; lev.source_length = source_word.length; lev.target_length = target_word.length; lev.minimum_cost = 0; lev.source_word = source_word; lev.target_word = target_word; for(let r = 0; r <= lev.source_length; r++) { lev.grid[r] = []; for(let c = 0; c <= lev.target_length; c++) { if(r == 0) { lev.grid[r][c] = c; } else if(c == 0) { lev.grid[r][c] = r; } else { lev.grid[r][c] = 0; } } } return lev; } function levenshteinCalculate(lev) { const INSERT_COST = 1; const DELETE_COST = 1; const SUBSTITUTE_COST = 1; let total_substitution_cost; let total_deletion_cost; let total_insertion_cost; for(let sourceletter = 0; sourceletter < lev.source_length; sourceletter++) { for(let targetletter = 0; targetletter < lev.target_length; targetletter++) { if(lev.target_word[targetletter] != lev.source_word[sourceletter]) { total_substitution_cost = lev.grid[sourceletter][targetletter] + SUBSTITUTE_COST; } else { total_substitution_cost = lev.grid[sourceletter][targetletter]; } total_deletion_cost = lev.grid[sourceletter][targetletter+1] + DELETE_COST; total_insertion_cost = lev.grid[sourceletter+1][targetletter] + INSERT_COST; lev.grid[sourceletter+1][targetletter+1] = min(total_substitution_cost, total_deletion_cost, total_insertion_cost); } } } function min(a, b, c) { let m; (a <= b && a <= c) ? (m = a) : (b <= a && b <= c) ? (m = b) : (m = c); return m; } function levenshteinPrintGrid(lev) { clearConsole("console"); for(let c = 0; c <= lev.target_length; c++) { if(c > 0) { writeToConsole(`<span class="letter">${lev.target_word[c-1].padEnd(5, " ").replace(/ /g, " ")}</span>`, "console"); } else { writeToConsole(" ".repeat(10), "console"); } } writeToConsole("<br/><br/>", "console"); for(let r = 0; r <= lev.source_length; r++) { if(r > 0) { writeToConsole(`<span class="letter">${lev.source_word[r-1]}</span>${" ".repeat(4)}`, "console"); } else { writeToConsole(" ".repeat(5), "console"); } for(let c = 0; c <= lev.target_length; c++) { writeToConsole(`${String(lev.grid[r][c]).padEnd(5, " ").replace(/ /g, " ")}`, "console"); } writeToConsole("<br/><br/>", "console"); } } function levenshteinPrintCost(lev) { writeToConsole(`<span class="letter">Minimum cost of transforming "${lev.source_word}" to "${lev.target_word}" = ${lev.grid[lev.source_length][lev.target_length]}</span>`, "console"); }
levenshteinCreate
The first function starts by creating a new object and adding a few properties to it. The first of these is an empty array which we'll use to hold the table of values. We also need the two words and their lengths (which are needed several times so this makes accessing them easier) and also the cost which is initialized to 0.
We then initialize the table using nested for loops. Note that these iterate from 0 to the lengths of the words, as we need an an extra row and an extra column at the beginning. Most of the array elements are set to 0 except the first row and first column which are set to 0, 1, 2 etc, ie. their indexes. After the loop we return the object, which now contains a table looking like the one in the first output above.
levenshteinCalculate
We firstly declare the three costs as consts, and then calculate the values by iterating the rows (letters of the source) and the columns (letters of the target). The three costs are calculated as described above. Note we only add in the substitution cost if the corresponding letters are different.
Each value is set to the lowest of the three costs using the min function which returns the lowest of three integers. Note that as we have an extra row and column at the beginning we need to add 1 to the word indexes to set the grid values.
levensgteinPrintGrid
This function takes an object as created and populated by the previous functions. It firstly prints out the target and then iterates the rows, printing out each source letter and then the values, all nicely spaced out for clarity.
levenshteinPrintCost
Here we simply print out the bottom-right value with a suitable bit of text.
Using the Code
That's the main coding done so we just need to demonstrate it in a web page.
levenshteinpage.js
window.onload = function() { document.getElementById('go').onclick = ()=>go(); }; function go() { const firstword = document.getElementById('firstword').value; const secondword = document.getElementById('secondword').value; const lev = levenshteinCreate(firstword, secondword); levenshteinCalculate(lev); levenshteinPrintGrid(lev); levenshteinPrintCost(lev); }
window.onload
The HTML page has two text inputs and a button to allow the user to enter the two strings to run the algorithm on. Here we just need to set the button's onclick event handler.
go
Firstly we pick up the two values entered by the user. For a production application we would validate the input but as this is just a demo I have not done so.
Then we set a variable to the return values of levenshteinCreate. We then call levenshteinCalculate and the two print functions.
Seeing the Result
The code is now finished so open levenshteinworddistance.htm in your browser, uncommenting one of the three lines in onload at a time.
I have already included the banama/banana output above so won't repeat it here. The word distance there was 1, so banana easily qualifies as a sensible suggestion if somebody mis-types banama. However, if you typed banama you wouldn't expect your word processor to suggest elephant, so let's try those two words.
The word distance here is 7 as banama needs to be almost completely re-typed to get to elephant. Clearly not a valid alternative.
Finally levinstein/levenshtein. One substitution and one insertion give us a word distance of 2, also a sensible suggestion. You would probably consider word distances of 2 or perhaps 3 to be reasonable for alternative suggestions, but no more.