Entity extraction is becoming a mission-critical tool for finding mentions of people, places, organizations, and products in massive quantities of text. In patent searches, law enforcement, voice-of-the-customer analysis, ad targeting, content recommendation, e-discovery, and anti-fraud, entity extraction enables swift analysis of gigabytes of data.

Among named entity recognition systems, those that rely on machine learning to find entities, such as the entity extraction function of Babel Street Text Analytics (formerly Rosette), have the advantage. They can find previously unknown entities. Furthermore, because statistical entity extractors are context sensitive, they can disambiguate between places like Paris and people named Paris.

Why entity extraction needs to be flexible

When it comes to entity extraction, not all content is created equal. While most entity extractors are quite accurate out of the box on well-formed text such as news articles, the high degree of content variation in blogs, restaurant reviews, financial documents, electronic medical records, legal contracts, and patent filings can limit the algorithms’ accuracy.

Text Analytics has an advantage in these cases. Its statistical model has been tuned to a wide range of content beyond published news. And for users with particularly quirky data (whether in format, style, or vocabulary), and for those who need every last bit of accuracy, Text Analytics includes robust field training capabilities with multiple mechanisms for adapting to your data’s idiosyncrasies, maximizing the accuracy of entity extraction on your data.

Using field training to improve accuracy

Level 1: Just add data

The easiest level of adaptation, called “unsupervised field training,” can be almost completely user driven. Text Analytics provides access to a state-of-the-art clustering tool chain. You add any quantity of your own data, with no need for annotation: any documents you have on hand that are representative of the data you need to extract from will do. Text Analytics then builds a new model adapted to the idiosyncrasies of your data, dramatically increasing entity extraction accuracy.

This unsupervised process allows Text Analytics to more accurately locate entities in the genre, style, and vocabulary used by your data. It is based on the idea of word clusters, i.e., “similar words tend to appear in similar contexts.” Thus it might learn that the word “outturn” is used in financial documents the same way “outcome” is used in news articles, or that the words “Waltham”, “Atiak”, “Loveland”, “Svetogorsk”, “Yeisk” and “Descoberto” are all likely names of LOCATIONs, even though none were mentioned in the original collection of annotated training data. Consequently, Text Analytics will better understand the context surrounding unfamiliar words and, as a result, assign them to existing, well-defined clusters.
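The word-cluster idea can be illustrated with a toy example. The sketch below is not the Text Analytics clustering tool chain; it is a minimal stand-in that assumes whitespace tokenization, a two-word context window, and a four-sentence corpus invented for illustration. It counts the neighbors of each word and compares words by the cosine similarity of those counts, so words used in the same contexts score high.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbors = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbors)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented, unannotated "field" data: no labels are needed for this step.
corpus = [
    "the factory in waltham opened last year",
    "the factory in yeisk opened last month",
    "the budget outturn was lower than forecast",
    "the election outcome was lower than forecast",
]

vecs = context_vectors(corpus)
sim_places = cosine(vecs["waltham"], vecs["yeisk"])     # same contexts
sim_terms = cosine(vecs["outturn"], vecs["outcome"])    # mostly shared contexts
sim_mixed = cosine(vecs["waltham"], vecs["outturn"])    # no shared contexts
print(sim_places, sim_terms, sim_mixed)
```

In this toy corpus, “waltham” and “yeisk” share all of their neighbors, and “outturn” and “outcome” share most of theirs, while “waltham” and “outturn” share none. A production system would use far more data and a real clustering algorithm, but the signal it exploits is the same.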

Level 2: A little annotation goes a long way

For even greater accuracy, you can annotate a small quantity of your data and actively teach Text Analytics the unique contexts for entities that are common to your documents. As few as a few hundred annotated documents can produce dramatic improvements in accuracy. Adaptation Studio makes adding annotated documents to boost the existing entity extraction model in Text Analytics much faster and more efficient than traditional annotation methods.

It used to be that annotators had no choice but to work “blind.” That is, they could not tell when they had annotated enough documents to reach the desired level of accuracy. Adaptation Studio, a user-friendly web application, coordinates the work of multiple annotators and creates training data far faster than traditional methods.

How? By:

Leveraging interim models: The training is bootstrapped by tagging a tiny number of documents to build an interim model
Efficient annotation: Active learning technology prioritizes the untagged documents that the interim model shows least confidence in; therefore, a greater variety of entities is tagged sooner
Computer-assisted tagging: The interim model pre-tags unannotated documents so that human annotators only correct errors, which is faster than hand-tagging every entity
Iterative model evaluation: The system continuously measures the model’s accuracy, allowing annotation to stop as soon as the desired accuracy is reached
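The four steps above can be sketched as a generic active learning loop. This is not Adaptation Studio’s API: every function name here (`annotation_loop`, `least_confident`, `train`, `confidence`, `evaluate`, `correct`) is hypothetical, and the “model” is a deliberately trivial stand-in (a set of known words) so the example runs on its own.

```python
def least_confident(pool, confidence, batch_size):
    """Return the `batch_size` documents the interim model is least sure about."""
    return sorted(pool, key=confidence)[:batch_size]


def annotation_loop(pool, train, confidence, evaluate, correct, target, batch_size=1):
    """Hypothetical loop: select, pre-tag and correct, retrain, stop at the target score."""
    annotated = []
    model = train(annotated)          # interim model (empty at the start)
    rounds = 0
    while pool and evaluate(model) < target:
        batch = least_confident(pool, lambda doc: confidence(model, doc), batch_size)
        for doc in batch:
            annotated.append(correct(model, doc))  # annotator fixes the model's pre-tags
            pool.remove(doc)
        model = train(annotated)      # rebuild the interim model
        rounds += 1
    return model, annotated, rounds


# --- Toy stand-ins, assumptions for illustration only ---
def train(annotated):
    """The toy 'model' is just the vocabulary of the annotated documents."""
    return {w for doc in annotated for w in doc.split()}


def confidence(model, doc):
    """Fraction of a document's words the toy model has already seen."""
    words = doc.split()
    return sum(w in model for w in words) / len(words)


HELD_OUT = "acme opened a plant in yeisk".split()


def evaluate(model):
    """Score the toy model on a small held-out sample."""
    return sum(w in model for w in HELD_OUT) / len(HELD_OUT)


def correct(model, doc):
    """Stands in for a human reviewing pre-tagged output; returns the doc unchanged."""
    return doc


pool = [
    "acme opened a plant",
    "a plant in yeisk",
    "acme opened a plant today",
    "rain fell in yeisk",
]
model, annotated, rounds = annotation_loop(
    list(pool), train, confidence, evaluate, correct, target=1.0
)
print(rounds, annotated)
```

After the first document is annotated, the loop skips the two documents that mostly repeat known vocabulary and picks the one the interim model knows least about, which is why annotation can stop with half the pool still untouched.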

Customers who have conducted entity extraction field training report a drop in both false positives (increased precision) and false negatives (increased recall) from Text Analytics and a noticeable improvement in their overall analytics system.
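For readers who want to measure this on their own data, the sketch below shows how precision and recall follow from false positives and false negatives. It assumes entities are represented as `(start, end, type)` tuples and counted as correct only on an exact match; the sample spans are invented and are not customer results.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of (start, end, type) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision falls as false positives rise; recall falls as false negatives rise.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented gold annotations and two hypothetical system outputs.
gold = {
    (0, 5, "PERSON"),
    (10, 15, "LOCATION"),
    (20, 27, "ORGANIZATION"),
    (30, 36, "LOCATION"),
}
before = {(0, 5, "LOCATION"), (10, 15, "LOCATION"), (40, 44, "PERSON")}
after = {(0, 5, "PERSON"), (10, 15, "LOCATION"), (20, 27, "ORGANIZATION")}

p_before, r_before = precision_recall(before, gold)
p_after, r_after = precision_recall(after, gold)
print(p_before, r_before, p_after, r_after)
```

Running the same comparison on a held-out, annotated sample before and after field training is one simple way to confirm that both measures moved in the right direction.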

Professional support

Given that most of our customers welcome guidance in selecting data, building a new model, and evaluating the results, Babel Street offers professional services to assist with field training. Whether you are just adding raw data to the entity extraction model to improve accuracy or annotating your own data, we are here to assist in your target languages.

Adapt Babel Street Entity Extraction to Your Content for Increased Accuracy
