Soundex is a phonetic algorithm which assigns values to words or names so that they can be compared for similarity of pronunciation. For this post I will write an implementation in JavaScript.
It doesn't take much thought to realise that the whole area of phonetic algorithms is a minefield, and Soundex itself is rather restricted in its usefulness. In fact, after writing this implementation I came to the conclusion that it is rather mediocre, but coding it up does at least give a better understanding of how it works, and therefore of its usefulness and limitations.
Wikipedia has a surprisingly brief article on Soundex which you might like to read.
The Algorithm
The purpose of the algorithm is to create for a given word a four-character string. The first character is the first character of the input string. The subsequent three characters are any of the numbers 1 to 6, padded to the right with zeros if necessary. The idea is that words that sound the same but are spelled differently will have the same Soundex encoding.
The steps involved are:
- Copy the first character of the input string to the first character of the output string
- For subsequent characters in the input string, add digits to the output string according to the table below, up to a maximum of three digits (i.e. a total output string length of 4). Note that a number of input letters are ignored, including all vowels. Also, a letter is skipped if its digit is the same as the digit most recently added to the output.
- If we reach the end of the input string before the output string reaches 4 characters, pad it to the right with zeros.
Letter Encodings
This table lists the digits assigned to the letters A-Z. I have assigned 0 to letters which are ignored, and note that uppercase and lowercase letters are treated the same.
Input letter | Encoding |
---|---|
A | 0 |
B | 1 |
C | 2 |
D | 3 |
E | 0 |
F | 1 |
G | 2 |
H | 0 |
I | 0 |
J | 2 |
K | 2 |
L | 4 |
M | 5 |
N | 5 |
O | 0 |
P | 1 |
Q | 2 |
R | 6 |
S | 2 |
T | 3 |
U | 0 |
V | 1 |
W | 0 |
X | 2 |
Y | 0 |
Z | 2 |
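The table can be packed into a single 26-character string with one digit per letter, which is the form the code further down uses for its lookups. As a small sketch of how the two relate, the following builds that string from the letter groups in the table; `groups` and `buildMappings` are names I have made up for illustration and are not part of the project files.

```javascript
// Letters grouped by the digit they share; any letter not listed is ignored (0).
const groups = { "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R" };

// Hypothetical helper: turn the groups into a string indexed by letter position (A = 0).
function buildMappings(groups) {
    const digits = new Array(26).fill("0");
    for (const [digit, letters] of Object.entries(groups)) {
        for (const letter of letters) {
            digits[letter.charCodeAt(0) - 65] = digit; // "A" has character code 65
        }
    }
    return digits.join("");
}

console.log(buildMappings(groups)); // 01230120022455012623010202
```

Reading the result from left to right gives the encoding for A, then B, and so on through to Z.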
The Code
The project consists of an HTML page, a small JavaScript file containing a function to output text to the page, a graphic, a CSS file and the following JavaScript files.
- soundex.js
- soundexpage.js
The files can be downloaded as a zip file from the Downloads page, or you can clone or download the GitHub repo.
Firstly let's look at the soundex.js file.
soundex.js
```javascript
function soundex(name) {
    let s = [];
    let si = 1;
    let c;

    //              ABCDEFGHIJKLMNOPQRSTUVWXYZ
    let mappings = "01230120022455012623010202";

    s[0] = name[0].toUpperCase();

    for(let i = 1, l = name.length; i < l; i++) {
        c = (name[i].toUpperCase()).charCodeAt(0) - 65;

        if(c >= 0 && c <= 25) {
            if(mappings[c] != '0') {
                if(mappings[c] != s[si-1]) {
                    s[si] = mappings[c];
                    si++;
                }

                if(si > 3) {
                    break;
                }
            }
        }
    }

    if(si <= 3) {
        while(si <= 3) {
            s[si] = '0';
            si++;
        }
    }

    return s.join("");
}
```
The first line creates an array which will hold the individual characters of the encoding, and the variable si is the current index into that array. The variable c holds the current letter of the input string, modified as we will see in a moment.
Next we create a mappings string. This represents the output values for each letter of the alphabet as per the above table. We then set the first letter of the output string to the first letter of the input, converted to upper case.
Next we enter a for loop through the input string; note that the loop starts at 1 as we have already dealt with the first character. Within the loop we take the current input letter, again converted to upper case, get its character code and subtract 65 (the code for A), so that c corresponds to an index into the mappings string.
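To make the subtraction a little more concrete, here is a tiny sketch of the same calculation pulled out into a function of its own. The name `letterIndex` is mine and does not appear in soundex.js.

```javascript
// Hypothetical helper: position of a letter in the alphabet, with A = 0 and Z = 25.
function letterIndex(ch) {
    return ch.toUpperCase().charCodeAt(0) - 65; // "A".charCodeAt(0) is 65
}

console.log(letterIndex("a")); // 0
console.log(letterIndex("Z")); // 25
console.log(letterIndex("'")); // -26, outside 0 to 25
console.log(letterIndex("é")); // 136, also outside 0 to 25
```

Anything which is not one of the 26 unaccented letters ends up outside the range 0 to 25, so apostrophes, hyphens and accented characters in names contribute nothing to the encoding.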
Next we check the value is within the range 0 to 25, i.e. an uppercase letter. If not it is ignored, but if so we check that its corresponding digit is not 0. We then check that the digit is not the same as the last one added to the output, which implements the rule that repeated values are skipped, and if it is different we set the next character of the output to that digit and increment si. We then check whether si is more than 3; if so we break out of the loop.
After the loop we need to check whether we have filled up the encoded string s, which might not be the case if there are not enough encodable letters in the input string. If it isn't full we simply pad it out with 0s in a while loop.
Finally we return the contents of the array converted to a string with the join function.
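One thing worth noting about the function above is that it compares each digit with the last digit written, and never looks at the code of the first letter. The rules as they are usually stated for American Soundex (for example in the Wikipedia article) differ slightly: the first letter's code also counts when skipping repeats, letters with the same code separated by a vowel are both encoded, and H and W do not break up a run. Below is a rough sketch of that variant. The name `soundexStandard` is my own, and the rules are my reading of the usual description rather than anything taken from the project files.

```javascript
// Sketch of the commonly described American Soundex rules (an assumption, see above).
function soundexStandard(name) {
    //                ABCDEFGHIJKLMNOPQRSTUVWXYZ
    const mappings = "01230120022455012623010202";

    // Digit for a letter, or null if the character is not A-Z.
    const codeOf = (ch) => {
        const i = ch.toUpperCase().charCodeAt(0) - 65;
        return i >= 0 && i <= 25 ? mappings[i] : null;
    };

    if (name.length === 0) {
        return "";
    }

    let out = name[0].toUpperCase();
    let prev = codeOf(name[0]); // the first letter's code counts for repeats

    for (let i = 1; i < name.length && out.length < 4; i++) {
        const ch = name[i].toUpperCase();
        const code = codeOf(ch);

        if (code === null || ch === "H" || ch === "W") {
            continue; // non-letters, H and W leave prev untouched
        }
        if (code !== "0" && code !== prev) {
            out += code;
        }
        prev = code; // a vowel sets prev to "0", so the same digit may follow it
    }

    return out.padEnd(4, "0");
}

console.log(soundexStandard("Robert"));  // R163
console.log(soundexStandard("Pfister")); // P236
console.log(soundexStandard("Tymczak")); // T522
```

For most names the two versions agree, but not for all: the function in soundex.js gives P123 for Pfister and T520 for Tymczak.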
Now let's move on to soundexpage.js where we call the above function.
soundexpage.js
```javascript
window.onload = function() {
    writeToConsole("Soundex Algorithm<br/><br/>", "console");

    let names1 = ["Johnson", "Adams", "Davis", "Simons", "Richards", "Taylor",
                  "Carter", "Stevenson", "Taylor", "Smith", "McDonald", "Harris",
                  "Sim", "Williams", "Baker", "Wells", "Fraser", "Jones",
                  "Wilks", "Hunt", "Sanders", "Parsons", "Robson", "Harker"];
    let names2 = ["Jonson", "Addams", "Davies", "Simmons", "Richardson", "Tailor",
                  "Chater", "Stephenson", "Naylor", "Smythe", "MacDonald", "Harrys",
                  "Sym", "Wilson", "Barker", "Wills", "Frazer", "Johns",
                  "Wilkinson", "Hunter", "Saunders", "Pearson", "Robertson", "Parker"];

    let namecount = names1.length;
    let s1;
    let s2;

    for(let i = 0; i < namecount; i++) {
        s1 = soundex(names1[i]);
        s2 = soundex(names2[i]);

        // Pad each name to 16 characters, using &nbsp; so the padding survives in HTML
        writeToConsole(`${names1[i].padEnd(16, " ").replace(/ /g, "&nbsp;")} ${s1} ${names2[i].padEnd(16, " ").replace(/ /g, "&nbsp;")} ${s2}<br/>`, "console");
    }
};
```
In the onload function we first create a couple of arrays of strings, each pair of names being similar to some degree. To avoid hard-coding the array size the next line picks it up using length.
We then create a couple of variables for the encoded values and then loop through the name pairs, calling the soundex function for each, and finally printing out the names and their Soundex encodings.
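The writeToConsole function lives in the small helper file mentioned earlier and isn't listed in this post. If you want to try the code without the download, something along the following lines should do as a stand-in. It is only a guess at the helper's shape: I am assuming the second argument is the id of an element on the page, such as a div with the id console.

```javascript
// Hypothetical stand-in for the helper used above; the real file is not shown in this post.
// Assumes elementId is the id of an existing element, e.g. <div id="console"></div>.
function writeToConsole(text, elementId) {
    const target = document.getElementById(elementId);

    if (target === null) {
        return; // no such element, so there is nothing to write to
    }

    // The text contains markup such as <br/>, so append it as HTML.
    target.innerHTML += text;
}
```

Using a monospaced font for the target element keeps the padded columns lined up.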
Open soundexalgorithm.htm in your browser, which will show the following output.
As you can see, the algorithm is not perfect. Even with this small selection of names a few problems are apparent. Ignoring repeating values means Simons and Simmons are given the same encoding, and using only the first few letters means Richards and Richardson are also encoded the same. Ignoring vowels means that Wells and Wills, Sanders and Saunders, Parsons and Pearson are all given the same encoding despite not actually being homophones.
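The pairs mentioned above can also be checked in a few lines without opening the page. The sketch below restates the encoder compactly so that it runs on its own; it is meant to behave the same as soundex.js for ordinary names, but it is a rewrite rather than the original file, and `sameCode` is a helper name I have made up.

```javascript
// Compact restatement of the encoder from this post, so the snippet is self-contained.
function soundex(name) {
    const mappings = "01230120022455012623010202";
    let out = name[0].toUpperCase();

    for (let i = 1; i < name.length && out.length < 4; i++) {
        const c = name[i].toUpperCase().charCodeAt(0) - 65;
        if (c < 0 || c > 25) {
            continue; // not a letter A-Z
        }
        const digit = mappings[c];
        // Skip ignored letters and repeats of the last character written.
        if (digit !== "0" && digit !== out[out.length - 1]) {
            out += digit;
        }
    }

    return out.padEnd(4, "0");
}

// Hypothetical helper: true when both names are given the same encoding.
function sameCode(a, b) {
    return soundex(a) === soundex(b);
}

const pairs = [["Simons", "Simmons"], ["Richards", "Richardson"], ["Wells", "Wills"],
               ["Sanders", "Saunders"], ["Parsons", "Pearson"], ["Taylor", "Naylor"]];

for (const [a, b] of pairs) {
    console.log(`${a} ${soundex(a)}  ${b} ${soundex(b)}  ${sameCode(a, b) ? "same" : "different"}`);
}
```

The first five pairs come out the same (Wells and Wills, for instance, are both W420), while Taylor and Naylor differ only because the first letter is copied across unchanged.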