
Duplicate Document Detection and Cross-lingual Search

Automate mundane tasks and find relevant text using text embedding

Numbers are great because they are easy to compare, tabulate, and examine. Text? Not so much. But text embeddings let users manipulate and compare the meaning behind words and text the way they would numbers.

Basically, text embeddings convert a word, phrase, or even a whole document into a mathematical vector representing its meaning. Vectors that are numerically closer are closer in meaning. (For the long explanation of how text embeddings work, read our blog posts “Using Deep Learning to Power Multilingual Text Embeddings for Global Analysis” Part I and Part II.) A given word compared to itself will score a 1.0 in similarity, but outside of that case, 0.8 is about as high a match as you will ever see.
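
To make that comparison concrete, here is a toy sketch in Python of the kind of arithmetic involved. Embedding vectors are commonly compared with cosine similarity, which scores roughly 1.0 for identical vectors and lower values for less related ones; the three-dimensional vectors and the words attached to them below are invented for illustration and are not Text Analytics output.

import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ~1.0 means nearly identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy three-dimensional "embeddings"; real text embeddings have hundreds of dimensions.
spy = [0.82, 0.10, 0.55]
agent = [0.78, 0.15, 0.60]     # close in meaning to "spy"
banana = [0.05, 0.91, 0.12]    # unrelated to "spy"

print(cosine_similarity(spy, spy))      # ~1.0: a term compared to itself
print(cosine_similarity(spy, agent))    # high score: related meanings
print(cosine_similarity(spy, banana))   # low score: unrelated meanings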

Cross-lingual query expansion (i.e., taking your English search term and generating its equivalent in a number of languages) and duplicate document detection can both be built using text embeddings. The only difference is that cross-lingual search looks for an equivalent phrase in a different language, while duplicate document detection is often done within a single language.

Let’s see how this works.

Cross-lingual query expansion

Before we had access to text embeddings, monolingual English speakers would drop a search term into Google Translate and then copy the result into the search box. It’s laborious, and you may not even end up with the right term. That has all changed with the semantic term similarity available in Babel Street Text Analytics (formerly Rosette) version 1.12.1, which supports Arabic, English, Chinese, German, Japanese, North and South Korean, Russian, and Spanish for this function.

Similar words or phrases can be discovered within a language or across languages. Given the word “spy”, Text Analytics returns these similar terms in Spanish, German, and Japanese.

Input: Spy
Spanish
{"term":"espía","similarity":0.61295485},
{"term":"cia","similarity":0.46201307},
{"term":"desertor","similarity":0.42849663},
{"term":"cómplice","similarity":0.36646274},
{"term":"subrepticiamente","similarity":0.36629659}
German
{"term":"Deckname","similarity":0.51391315},
{"term":"GRU","similarity":0.50809389},
{"term":"Spion","similarity":0.50051737},
{"term":"KGB","similarity":0.49981388},
{"term":"Informant","similarity":0.48774603},
Japanese
{"term":"スパイ","similarity":0.5544399},
{"term":"諜報","similarity":0.46903181},
{"term":"MI6","similarity":0.46344957},
{"term":"殺し屋","similarity":0.41098994},
{"term":"正体","similarity":0.40109193},

The Text Analytics /semantic/similar endpoint returns similar terms from a term database compiled from Wikipedia and Gigaword.
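
A call to the endpoint can be scripted in a few lines. The sketch below uses Python’s requests library; the base URL, API-key header name, request body fields, and language codes are assumptions for illustration rather than the documented Text Analytics API contract, so check the product documentation for the exact parameters.

import requests

API_URL = "https://analytics.example.com/rest/v1/semantics/similar"   # placeholder URL
API_KEY = "your-api-key"                                               # placeholder credential

def similar_terms(term, language, result_languages):
    """Ask the service for terms similar to `term` in the requested languages."""
    response = requests.post(
        API_URL,
        headers={"X-BabelStreetAPI-Key": API_KEY},             # header name is an assumption
        json={
            "content": term,
            "language": language,
            "options": {"resultLanguages": result_languages},  # option name is an assumption
        },
    )
    response.raise_for_status()
    # Expected shape, per the sample output above: lists of
    # {"term": ..., "similarity": ...} objects, one list per requested language.
    return response.json()

print(similar_terms("spy", "eng", ["spa", "deu", "jpn"]))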

Duplicate document detection

Text embeddings are also useful in areas such as e-discovery, where being able to detect near-duplicate documents can save person-weeks of labor during discovery. Babel Street Text Analytics accepts an entire document as input to its semantics/vector endpoint and calculates its vector (a location in semantic space, represented as a vector of floating-point numbers). The resulting vectors can then be compared across documents, as sketched below.
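
As a rough illustration, this sketch requests a document-level vector for each of two documents and compares them with cosine similarity. The endpoint URL, API-key header name, response field name, and the 0.9 near-duplicate threshold are all assumptions made for the example, not values taken from the product documentation.

import math
import requests

API_URL = "https://analytics.example.com/rest/v1/semantics/vector"   # placeholder URL
API_KEY = "your-api-key"                                              # placeholder credential

def document_vector(text):
    """Request the embedding vector for an entire document."""
    response = requests.post(
        API_URL,
        headers={"X-BabelStreetAPI-Key": API_KEY},   # header name is an assumption
        json={"content": text},
    )
    response.raise_for_status()
    return response.json()["documentEmbedding"]      # response field name is an assumption

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def is_near_duplicate(doc_a, doc_b, threshold=0.9):
    """Flag two documents whose vectors are close enough; 0.9 is an illustrative threshold."""
    return cosine_similarity(document_vector(doc_a), document_vector(doc_b)) >= threshold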

For instance, a press release may be published on 100+ websites. Using semantic vectors, you can programmatically identify all of those copies as versions of the same article.
