The Drawbacks of Levenshtein Distance Algorithms for Matching

Interest in edit distance algorithms has greatly increased with the popularity of word ladder games. These games provide a starting and ending word and ask the user to connect the two words by changing one letter at a time. The goal is to get from the starting word to the ending word with the fewest possible changes. Many questions have been asked as to whether these algorithms can be used for solving business problems, such as name matching.

Example of word ladder game

Edit distance algorithms have been used for name matching for over 70 years. The most popular choice for this purpose is Levenshtein distance, named for the Soviet mathematician Vladimir Levenshtein who created the algorithm in 1965. The Levenshtein distance algorithm counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one term to another. In the case of our game above, the Levenshtein distance would be 4. (It should be noted that word ladder games are more restrictive than Levenshtein, as they require each change to result in a real word and only allow substitutions, not additions or deletions).

Other such edit distance algorithms exist as well:

At a very basic level, these algorithms can provide a loose approximation of term similarity. Two identical strings would have a distance of zero. Two completely different strings would have a distance of the length of the longer string. So, the Levenshtein distance divided by the maximum string length yields a ratio of how different the strings are.

If Levenshtein distance algorithms provide a good way to identify string similarity, why shouldn’t they be used for name matching?

Name matching is complex

As with other simple “solutions” to name matching, such as SoundEx, edit distance algorithms ignore much of the complexities of names:

The names John and Joan have a Levenshtein distance of 1, yet are separate names and different genders.
Joe T Garcia and Joseph Thomas Garcia have a Levenshtein distance of 8 yet could very well be the same person.
Levenshtein cannot handle any name across languages or scripts, even for names that are the same, such as Sun Yat-sen and 孫中山.
Nicknames are a key part of name matching, from the simple (Bill and Billy) to the complex (Alexandra and Sandy). While simple nicknames can be matched using multiple methods, including Levenshtein, complex nicknames require more advanced techniques.
Out of order names, often caused by data import errors, will return the maximum Levenshtein distance, meaning a complete mismatch.

Taken together, these deficiencies make any edit distance algorithm insufficient for proper name matching. As described above, these algorithms can fail in many cases in both identifying true positives and true negatives.

Complete name matching

Advanced, cross-language, cross-script name matching requires a complex combination of specifically trained AI models, names databases, language and culture-specific rules engines, and other technologies. These systems work together to complete the advanced work of understanding and matching names.

Babel Street Match (formerly Rosette Name Indexer) uses a combination approach to name matching that is:

AI-powered, dynamically and simultaneously considering all the ways that names can vary, not just the ways found in a list
Hybrid, applying the strength of one approach to overcome the weakness of another
Two-pass, using the common key method to quickly eliminate obvious non-matches and statistical similarity to intelligently score each remaining pair of names

With its understanding of the many ways names vary and configurability, Match quickly returns explainably-scored results that align with an organization’s data and risk profile.