Semantic Search Is a Remedy for Keyword Inaccuracy in E-discovery

Semantic search is the aspirin for the headaches of keyword search in e-discovery. It is based on text embeddings, a natural language processing technology that has been around since the 1950s but became viable for business use only in the last few years. Semantic search looks for meaning, not exact words, and enables teams to use English to search in languages they don’t know.

Why change the decades-old keyword search method? Keyword search catches too much irrelevant data and too little of what’s really important. Under-generation of search results means important evidence is missed, possibly exposing a client to litigation risk. Over-generation of search results buries important documents under false positive hits and squanders attorneys’ review time, significantly increasing the cost of legal services.

The problems with keyword search

Keywords catch too little due to misspellings, alternative word forms, and the many ways of saying the same thing. A 1985 study by David C. Blair and M. E. Maron evaluated the effectiveness of keyword search in a case concerning a subway accident. The event was called “the unfortunate incident” by one side and a “disaster” by the other side. Documents called it an “event,” “incident,” “situation,” “problem,” or “difficulty.” Malfunctioning mechanisms were called “sick,” “dead,” or “fried,” and a critical issue was deemed the “smoking gun.”
Keywords return too much irrelevant information because they lack context. “Confidential” can be an important keyword that signals sensitive information, but because it appears so frequently in email footers, it is of little use as a search term. The word “interest” could mean fascination, a repayment premium for a loan, or having property rights. However, searching for a phrase or sentence that provides context will rarely find a match in keyword search.
The “good keywords” aren’t known up front. At the start of a case, whether for early case assessment or discovery, the important keywords can’t all be known until the attorney begins to manually review documents.
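The first two problems can be reproduced in a few lines. The sketch below is only an illustration: the four-document corpus and the `keyword_hits` helper are invented for this example and do not come from the Blair and Maron study or from any particular e-discovery tool.

```python
import re

def keyword_hits(documents, keyword):
    """Return the indices of documents that contain the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [i for i, text in enumerate(documents) if pattern.search(text)]

# Hypothetical mini-corpus, invented for illustration only.
documents = [
    "The brake controller was fried before the incident.",
    "Lunch menu attached. This email is confidential.",
    "After the disaster, the signal relay was found dead.",
    "Confidential: settlement terms for the accident claim.",
]

# Under-generation: documents 0 and 2 describe the same event,
# but neither one uses the word "accident".
accident_hits = keyword_hits(documents, "accident")

# Over-generation: the footer in document 1 matches "confidential"
# even though the email is about lunch.
confidential_hits = keyword_hits(documents, "confidential")

print(accident_hits, confidential_hits)
```

Here the search for “accident” finds only the last document, while “confidential” also pulls in the lunch email.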

What is semantic search?

Semantic search is an intelligent fuzzy keyword search that finds matches based on the meaning of words. It addresses issues of context and variety that plague standard keyword search, which is locked into finding a particular set of words spelled a particular way. In contrast, a semantic search for “alcoholic drink” could yield “cocktail,” but also “whisky on the rocks,” “margarita,” or “pina colada,” and so on.

The technology that enables semantic search is called word or text embeddings. An embedding model is trained on a body of text and assigns a vector to each word, phrase, sentence, or entire document based on the context in which it appears. With those vectors, the “distance” in meaning between two pieces of text can be calculated mathematically, which also makes embeddings effective at finding near-duplicate documents. Once vectors have been calculated for words in one language, the same can be done in other languages. Then the various languages can be aligned so that words with similar meanings in one language are close in value to their counterparts in another language.
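One common way to measure that “distance” is cosine similarity between vectors. The sketch below assumes cosine similarity as the measure (the article does not name one) and uses tiny hand-made three-dimensional vectors; real embeddings come from a trained model and typically have hundreds of dimensions. The names `toy_embeddings` and `rank_by_meaning` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT the output of a real embedding model.
toy_embeddings = {
    "alcoholic drink": [0.9, 0.8, 0.1],
    "cocktail": [0.85, 0.9, 0.15],
    "margarita": [0.8, 0.95, 0.2],
    "subway accident": [0.1, 0.05, 0.95],
}

def rank_by_meaning(query, embeddings):
    """Rank every other entry by similarity to the query, most similar first."""
    query_vec = embeddings[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in embeddings.items()
        if term != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_by_meaning("alcoholic drink", toy_embeddings)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

With these toy values, “cocktail” and “margarita” rank well above “subway accident,” even though none of them shares a word with the query.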

Consequently, semantic search is possible across languages, so that “unmanned aerial vehicle” can find “drone,” “UAV,” “Predator XP,” “無人航空機,” or “Unbemanntes Luftfahrzeug.”

Benefits of semantic search to e-discovery

Attorneys review the data returned by keyword search and set aside any documents that are not relevant. John H. Beisner estimated that this human review accounts for 75-90% of the cost of producing the documents. Furthermore, the Blair and Maron study referenced earlier examined the case’s discovery database of 40,000 documents and 350,000 pages. Although the legal team felt confident they had found 75% of the relevant documents through keyword search, Blair and Maron discovered they had found only 20%.
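The share of relevant documents found is usually called recall, and the share of retrieved documents that are actually relevant is called precision. The sketch below shows the arithmetic only; the counts are hypothetical and are not figures from the Blair and Maron study.

```python
def recall(relevant_found, relevant_total):
    """Fraction of all relevant documents that the search retrieved."""
    if relevant_total <= 0:
        raise ValueError("relevant_total must be positive")
    return relevant_found / relevant_total

def precision(relevant_found, retrieved_total):
    """Fraction of retrieved documents that are actually relevant."""
    if retrieved_total <= 0:
        raise ValueError("retrieved_total must be positive")
    return relevant_found / retrieved_total

# Hypothetical counts chosen for illustration: 1,000 relevant documents exist,
# and the search returned 800 documents, of which 200 are relevant.
example_recall = recall(relevant_found=200, relevant_total=1000)
example_precision = precision(relevant_found=200, retrieved_total=800)

print(f"recall={example_recall:.0%}, precision={example_precision:.0%}")
```

Low recall corresponds to the under-generation problem (missed evidence), and low precision to the over-generation problem (wasted review time).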

How semantic search solves keyword search issues

More NLP to help e-discovery

Semantic search is just one of many technologies in the arsenal of natural language processing tools that can reduce risk and manual labor for e-discovery. Other technologies include:

Language identification — Reveals ahead of time what languages need to be handled in the discovery.
Entity extraction — Can be custom trained to find particular entities. In the Blair and Maron study, the legal team sought documents about “steel quantity.” Efforts were stymied by relevant documents that only mentioned the number of steel things, such as “girders,” “beams,” “frames,” and “bracings.” Entity extraction could find documents that mentioned quantities and “steel things.”
Event extraction — Can be quickly custom trained to find events specific to a use case, and requires a small amount of training data to reach useful accuracy.
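To make the “steel quantity” example concrete, here is a minimal rule-based stand-in. It is not a trained entity extractor: it uses a plain regular expression, and the `STEEL_THINGS` list, the `find_steel_quantities` helper, and the sample sentence are assumptions made for illustration.

```python
import re

# Item names taken from the article's example; a trained model would learn these.
STEEL_THINGS = ("girder", "beam", "frame", "bracing")

# A number, optionally followed by the word "steel", then one of the item names.
QUANTITY_OF_STEEL = re.compile(
    r"\b(\d[\d,]*)\s+(?:steel\s+)?(" + "|".join(STEEL_THINGS) + r")s?\b",
    re.IGNORECASE,
)

def find_steel_quantities(text):
    """Return (count, item) pairs for each quantity of a steel item mentioned."""
    return [
        (int(number.replace(",", "")), item.lower())
        for number, item in QUANTITY_OF_STEEL.findall(text)
    ]

# Hypothetical sentence, invented for illustration.
sample = "Ship 1,200 girders and 48 steel beams to the site; frames to follow."
quantities = find_steel_quantities(sample)
print(quantities)
```

The sample never contains the phrase “steel quantity,” yet both quantities are picked up; the bare mention of “frames” is skipped because no number precedes it.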

Semantic search reduces labor and costs for multilingual e-discovery

In today’s global economy, it’s not unusual for discovery to include documents in multiple languages. Traditionally, attorneys would turn to cross-lingual keyword search, which meant machine translating either the keywords into another language or the documents being searched. Translating keywords suffers from the same weakness as regular keyword search: lack of context. Should the word “interest” be translated to Japanese as 利子 (as in interest on a loan), 興味 (attraction), or one of the four other definitions of “interest”?

Machine translation of the documents to be searched is the other approach, but errors mean potentially relevant results may be “lost in translation.” For example, the Chinese phrase 吃醋 means “to be jealous,” but literally it’s “to eat vinegar.” Google Translate correctly translates 吃醋 to “jealous,” but when the phrase is used in a sentence, 你还会吃前女友的醋吗？ (“Are you still jealous of your ex-girlfriend?”), the translation becomes “Would you still eat your ex-girlfriend’s vinegar?”

Semantic search across languages minimizes errors from translation by searching the text as written, and while there may be some fuzziness, the meaning will be true. More significantly, because semantic search can help discover key concepts and case-relevant phrases in the different languages, those phrases can be used to bootstrap a starter glossary for machine translation, minimizing labor by the human contract attorney. The same key phrases also enable software to automatically categorize files that express similar ideas.