From the webinar: Why “Vanilla” Search Can’t Handle the Nuances of Name Screening
Babel Street’s Carlos Azeglio (Business Development Lead), who is a financial industry veteran, and Patrick Deeb (Senior Solutions Engineer) held a short conversational webinar about why what we think of as “search” performs poorly when searching for names, and what name search needs to be accurate and successful. A summary of their conversation and the Q&A that followed is below.
The fundamental assumptions of text search don’t work for name search
Full-text search is based on a fundamental assumption that doesn’t hold for names: that the words that appear the least frequently are the most important and those that appear the most frequently are the least important (the idea behind term frequency/inverse document frequency, or TF-IDF). For names, however, “William” appearing 1,000 times is just as important as “Wilbur” appearing only once.
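To make the frequency assumption concrete, here is a minimal sketch of the inverse document frequency part of that weighting, computed over a toy set of name records. The records and the function name are made up for illustration, and real search engines use more elaborate scoring formulas; the point is only that the rare name gets a much higher weight than the common one.

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """Return log(N / df) per term, where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, ignoring case.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical toy index: "William" is common, "Wilbur" is rare.
records = ["William Smith", "William Jones", "William Brown", "Wilbur Wright"]
idf = inverse_document_frequency(records)

print(round(idf["william"], 3))  # log(4/3), about 0.288
print(round(idf["wilbur"], 3))   # log(4/1), about 1.386
```

Under this weighting a hit on “Wilbur” counts several times more than a hit on “William,” which is reasonable for ordinary words in documents but not for names in a screening list.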
It’s impossible to fix “typos” in names
A common name such as Cindy can also be spelled Cyndi, Cindi, Cyndy, Syndy, Syndi, Sindi and Sindy. If you assume that a rare spelling such as “Lourie” is a typo for “Lori” and “correct” it, you may get incorrect matches when the name really is spelled with a “u.” A typical fuzzy search won’t match “Lori” to “Lourie.” For names from Arabic, Chinese or Korean, there isn’t even agreement on where the spaces between name components go when the name is written in English.
On the other hand, real typos have an outsized impact on the quality of name search results, because whereas one misspelled word in a text document impacts only 1/500th of a 500-word document, a misspelled name affects 50% or 100% of what you are looking for when you are performing name searches.
Standard fuzzy text matching algorithms aren’t enough for names
These algorithms are sometimes called “fuzzy logic.” The fuzziness of the search is based on looking for letters that have been inserted, deleted, or substituted compared to the word being searched on. Thus a search for “John” could fuzzy match against “Johnn” (insertion), “Jon” (deletion) or “Jahn” (substitution). But many phenomena specific to names are not covered by these three operations. Inserting, deleting, or substituting one or two characters cannot turn a name into its nicknames, initials, shortened forms, versions with missing components, gender variants and more. It’s possible to set up rules for these, but it gets really ugly fast.
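The limitation can be illustrated with a minimal sketch of the classic Levenshtein edit distance, which counts exactly those three operations. This is not how any particular search engine implements fuzziness, and the threshold of 2 is an assumption that mirrors the “one or two characters” mentioned above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

MAX_EDITS = 2  # assumed fuzziness threshold, for illustration only

def fuzzy_match(a: str, b: str) -> bool:
    return edit_distance(a.lower(), b.lower()) <= MAX_EDITS

for query, candidate in [("John", "Johnn"), ("John", "Jon"), ("John", "Jahn"),
                         ("William", "Bill"), ("Robert", "Bob")]:
    print(query, candidate, edit_distance(query.lower(), candidate.lower()),
          fuzzy_match(query, candidate))
```

The three spelling variants of “John” are each one edit away and match, while the nickname pairs “William”/“Bill” and “Robert”/“Bob” are four edits apart and do not. Raising the threshold far enough to catch them would also let in many unrelated names.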
What strategies are seeing the most success?
Rules-based systems are very complex, top-heavy and very difficult to maintain. In a past position in financial compliance, Carlos recalls there being a rules-based name matching system and only one person on staff who knew how to untangle all the rules. Twenty to 30 years ago, the go-to was creating huge static lists of name variations for every name on a watch list. There might be hundreds of variations for one name, and when you multiply that by the 1 million names on the list, performance becomes a challenge for real-time screening. And you still might not make the match.
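A rough sketch shows how quickly static variant lists grow. The given-name spellings come from the Cindy example earlier; the middle-name and family-name variants are hypothetical and chosen only to illustrate the multiplication.

```python
from itertools import product

# Given-name spellings from the Cindy example; the other lists are made up.
given = ["Cindy", "Cyndi", "Cindi", "Cyndy", "Syndy", "Syndi", "Sindi", "Sindy"]
middle = ["", "A.", "Ann", "Anne"]
family = ["Johnson", "Johnsen", "Jonson", "Johnston", "Jonsson"]

def expand(given, middle, family):
    """Enumerate every combination in 'given family' and 'family, given' order."""
    variants = set()
    for g, m, f in product(given, middle, family):
        full_given = f"{g} {m}".strip()
        variants.add(f"{full_given} {f}")
        variants.add(f"{f}, {full_given}")
    return variants

per_name = len(expand(given, middle, family))  # 8 * 4 * 5 * 2 = 320
watch_list_size = 1_000_000                    # figure quoted in the text
total_entries = per_name * watch_list_size

print(per_name, total_entries)  # 320 320000000
```

Even this small, incomplete set of variations produces hundreds of entries for a single name and hundreds of millions across the list, and it still contains no nicknames, transliterations, or genuine typos.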
Today, AI is heavily used to handle the complexity that name matching demands.
How does AI actually help in this area?
The new method is a hybrid intelligence that includes AI, which can dynamically and simultaneously consider all the key ways that names vary. It aligns data in a way that is more effective than rules, which are challenging to maintain. Not only can you replace the need for rules and static lists but you also benefit from trained models per language. The dynamic quality of AI enables identification of name variations in real time instead of iterating over static lists – which also need to be kept up to date. One example of this new hybrid intelligence/AI approach to name matching is Rosette Name Indexer.
Rosette plugs into your search engine or system just to do the name search portion. It understands names like a person does while minimizing false positives and false negatives, and it’s fast and lightweight, so it adds little load to your existing systems. (Watch a 3-minute video on how Rosette works.)
Can you just get a group of clever engineers to build name search based on open source?
Sure they could, but the AI portion makes it really complex to come up with the fuzzy logic for names in 6-12 months. The biggest factor, however, is time. Built into Rosette is 25+ years of R&D investment incorporating client feedback based on real-world scenarios. That can’t be replaced by several engineers in a room. Rosette handles all kinds of name complexity that most engines ignore, and it supports multiple languages. It is also transparent, which helps with compliance and regulation because you can explain why certain names matched. The speed and accuracy of Rosette and its quick deployment bring an immediate return on investment.
Q: Doesn’t AI tend to be a “black box”? How do you know what is happening “under the hood”?
A: That is true, but Rosette provides transparency and explainability. We can get extremely granular about a match and why we’ve assigned a score. For a typical user the general ideas could be exposed. But for data scientists and folks who really need to know, we fully explain how a specific score was calculated.
Q: Is Rosette available for Elastic in the Cloud?
A: Yes, with caveats. It is available right now. Currently, if you are on their platinum support plan you can use our plug-in. You need platinum because you need their support to get the Rosette plug-in installed. I believe they are working to make it easier and more cost effective, but yes, you definitely can use Rosette in Elastic in the Cloud now. (Update: The Rosette integration for Amazon OpenSearch is scheduled for 2024.)
Q: What type of AI is it?
A: Under the hood it’s a trifecta of different things: lists, sets of rules, and models trained per language. The trained models are the AI; they allow the algorithm to decide in real time whether names match or not.
Q: So if I have names in Chinese or other script characters but the name in the database was in Latin characters, how would Rosette handle that?
A: It would handle it beautifully. The cross-lingual capabilities of Rosette let us stand out from the competition. It would make that match. Rosette currently supports more than 20 languages and 10 scripts.
Q: How exactly can name matching be combined with other kinds of matching like addresses, fuzzy addresses?
A: Name matching is just one component of Rosette. It also does address, date, and location matching. The more attributes you have in your search, the better the match you will get. Address and date matching also have a fuzzy AI element to get relevant results even if the dates or addresses are not exact. There are many cases where all three (name, address, and date) are used together in the same use case for even more relevant match results.
Q: Can you explain technology details? What if we don’t have test samples?
A: If I interpret the question as saying a person doesn’t have test samples but wants to see what the matching looks like, we have several public data sets we like to suggest for test scenarios: the OFAC list and voter registration data from different states. We suggest taking a public data set and ingesting it into Elasticsearch and running our plugin against that. Try to understand as much about the data set as you can, so you know what you are searching against and then try different name variations to see what that looks like and then tweak the parameters to see how that affects the score.
Q: Is Rosette a name matching solution or a name variant generation solution?
A: Name variant generation is an approach from 20-30 years ago that has become outdated. Rosette is built using a more modern approach that has proven to be more effective.
Interested in a demo of Rosette? Contact us.