By Tina Lieu
Entity extraction is becoming a mission-critical tool for finding mentions of people, places, organizations, and products in massive quantities of text. In patent searches, law enforcement, voice-of-the-customer analysis, ad targeting, content recommendation, e-discovery, and anti-fraud, entity extraction enables swift analysis of gigabytes of data.
Among named entity recognition systems, those that rely on machine learning to find entities, such as Rosette’s entity extraction function, have the advantage: they can find previously unknown entities. Furthermore, because statistical entity extractors are context sensitive, they can disambiguate between places like Paris and people named Paris.
Why entity extraction needs to be flexible
When it comes to entity extraction, not all content is created equal. While most entity extractors are quite accurate out of the box on well-formed text such as news articles, the high degree of content variation in blogs, restaurant reviews, financial documents, electronic medical records, legal contracts, and patent filings can limit the algorithms’ accuracy.
Rosette Entity Extraction has an advantage in these cases. Rosette’s statistical model has been tuned to a wide range of content beyond simply published news. And, for users with particularly quirky data—whether in format, style, or vocabulary—and for those who need every last bit of accuracy, Rosette includes robust field training capabilities with multiple mechanisms for adapting to your data’s idiosyncrasies, thus maximizing the accuracy of named entity extraction on your data.
Using field training to improve accuracy
Level 1: Just add data
The easiest level of adaptation, called “unsupervised field training,” can be almost completely user driven. Rosette provides access to a state-of-the-art clustering tool chain. You add any quantity of your own data, with no annotation needed: any documents you have on hand that are representative of the data you need to extract from. Rosette then builds a new model adapted to the idiosyncrasies of your data, dramatically increasing the entity extraction accuracy.
This unsupervised process allows Rosette to more accurately locate entities in the genre, style, and vocabulary of your data. It is based on the idea of word clusters, i.e., “similar words tend to appear in similar contexts.” Thus it might learn that the word “outturn” is used in financial documents the same way “outcome” is used in news articles, or that the words “Waltham”, “Atiak”, “Loveland”, “Svetogorsk”, “Yeisk” and “Descoberto” are all likely names of LOCATIONs, even though none were mentioned in the original collection of annotated training data. Consequently, Rosette will better understand the context surrounding unfamiliar words and, as a result, assign them to existing, well-defined clusters.
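The idea that “similar words tend to appear in similar contexts” can be illustrated with a minimal sketch. This is not Rosette’s clustering tool chain; it is a toy example that assumes a made-up six-sentence corpus, a context window of one word, and cosine similarity over neighbor counts. The names `CORPUS`, `context_vectors`, and `similarity` are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

# Made-up sentences for illustration; a real run would use many unannotated documents.
CORPUS = [
    "the fiscal outturn was better than forecast",
    "the final outcome was better than expected",
    "the budget outturn was worse than forecast",
    "the election outcome was worse than expected",
    "she flew to waltham on monday",
    "she flew to loveland on friday",
]

def context_vectors(sentences, window=1):
    """Count the words that appear within `window` positions of each word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

VECTORS = context_vectors(CORPUS)

def similarity(word1, word2):
    """Similarity of two words based only on the contexts they occur in."""
    return cosine(VECTORS[word1], VECTORS[word2])

print(similarity("outturn", "outcome"))   # shared neighbor: "was"
print(similarity("waltham", "loveland"))  # identical neighbors: "to", "on"
print(similarity("outturn", "waltham"))   # no shared neighbors
```

In this toy corpus, “outturn” and “outcome” score well above zero because both are followed by “was”, while “waltham” and “loveland” score close to 1 because they share all of their neighbors. A clustering step over such vectors would group words that behave alike, even when none of them appeared in annotated training data.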
Level 2: A little annotation goes a long way
For even greater accuracy, you can annotate a small quantity of your data and actively teach Rosette the unique contexts for entities that are common in your documents. Just a few hundred annotated documents can produce dramatic improvements in accuracy. Rosette Adaptation Studio, part of the Model Training Suite framework, makes adding annotated documents to boost the existing entity extraction model in Rosette much faster and more efficient than traditional annotation methods.
It used to be that annotators had no choice but to work “blind”: they could not tell when they had annotated enough documents to reach the desired level of accuracy. Rosette Adaptation Studio, a user-friendly web application, coordinates the work of multiple annotators and creates training data far faster than traditional methods through four techniques:
- Leveraging interim models: The training is bootstrapped by tagging a tiny number of documents to build an interim model
- Efficient annotation: Active learning technology prioritizes the untagged documents that the interim model shows least confidence in; therefore, a greater variety of entities is tagged sooner
- Computer-assisted tagging: The interim model pre-tags unannotated documents so that human annotators only correct errors, which is faster than hand-tagging every entity
- Iterative model evaluation: The system continuously measures the model’s accuracy, allowing annotation to stop as soon as the target accuracy is reached
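The annotation loop described in the list above can be sketched roughly as follows. This is not how Rosette Adaptation Studio is implemented; it is a minimal illustration of active learning with a stopping criterion. Everything here is an assumption made for the example: the document ids in `DOCS`, the stand-in `toy_confidence` and `toy_accuracy` functions (which replace a real interim model and a real evaluation set), and the function names.

```python
# Hypothetical document ids mapped to a genre; purely illustrative data.
DOCS = {
    "news-1": "news",
    "news-2": "news",
    "legal-1": "legal",
    "med-1": "medical",
    "fin-1": "finance",
}

def toy_confidence(doc_id, labeled):
    """Stand-in for an interim model: confident only on genres it has seen."""
    seen = {DOCS[d] for d in labeled}
    return 0.9 if DOCS[doc_id] in seen else 0.3

def toy_accuracy(labeled):
    """Stand-in for evaluation: share of genres covered by annotated documents."""
    return len({DOCS[d] for d in labeled}) / len(set(DOCS.values()))

def select_least_confident(pool, labeled, score_fn, batch_size):
    """Active learning step: pick the documents with the lowest confidence."""
    return sorted(pool, key=lambda d: score_fn(d, labeled))[:batch_size]

def annotate_until_target(seed, pool, score_fn, evaluate_fn, target, batch_size=1):
    """Annotate in rounds and stop as soon as the measured accuracy hits target."""
    labeled = list(seed)                      # bootstrap: a few hand-tagged documents
    pool = [d for d in pool if d not in labeled]
    history = []
    while pool:
        batch = select_least_confident(pool, labeled, score_fn, batch_size)
        labeled.extend(batch)                 # stands in for humans correcting pre-tags
        pool = [d for d in pool if d not in batch]
        history.append(evaluate_fn(labeled))  # re-measure after every round
        if history[-1] >= target:
            break
    return labeled, history

labeled, history = annotate_until_target(
    seed=["news-1"], pool=list(DOCS), score_fn=toy_confidence,
    evaluate_fn=toy_accuracy, target=1.0)
print(labeled)
print(history)
```

In this toy run the loop stops after three rounds and never requests `news-2`, because the stand-in model is already confident about news text. That is the practical point of prioritizing low-confidence documents: annotators spend their time on material the model has not yet learned.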
Customers who have conducted entity extraction field training report that Rosette produces fewer false positives (increased precision) and fewer false negatives (increased recall), along with a noticeable improvement in their overall analytics system.
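For readers less familiar with these terms, the relationship between false positives, false negatives, precision, and recall can be shown with a short sketch. The entity spans below are invented for illustration and are not customer results; entities are assumed to be represented as hypothetical `(start, end, type)` tuples, and a prediction counts as correct only if all three fields match.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall from gold and predicted entity sets."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)    # entities found with the right span and type
    false_pos = len(predicted - gold)   # predicted entities that are not in the gold set
    false_neg = len(gold - predicted)   # gold entities the extractor missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    return precision, recall

# Invented gold annotations for one document.
gold = {(0, 5, "PERSON"), (10, 15, "LOCATION"),
        (20, 27, "ORGANIZATION"), (30, 36, "LOCATION")}

# Invented extractor output before and after adaptation.
before = {(0, 5, "LOCATION"), (10, 15, "LOCATION"), (40, 45, "PERSON")}
after = {(0, 5, "PERSON"), (10, 15, "LOCATION"), (20, 27, "ORGANIZATION")}

print(precision_recall(gold, before))
print(precision_recall(gold, after))
```

In the invented “before” output, two of three predictions are wrong and three gold entities are missed, so both precision and recall are low. In the “after” output there are no false positives and only one missed entity, so both measures rise.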
Given that most of our customers welcome guidance in selecting data, building a new model, and evaluating the results, Babel Street offers professional services to assist with field training. Whether you are just adding raw data to the entity extraction model to improve accuracy or annotating your own data, we are here to assist in your target languages.
Contact us if you have more questions about the highly adaptable Rosette Entity Extractor or our next-generation annotation system, Rosette Adaptation Studio, inside Rosette Model Training Suite.