
What is Entity Extraction?

How it’s used and how it works

Entity extraction (also known as named entity recognition, or NER) is a natural language processing technology that enables computers to analyze text as it is naturally written. Specifically, it pulls out the most important data points (entities) from unstructured text such as news articles, webpages, and free-text fields. Entities include names of people, places, organizations, and products, as well as dates, email addresses, and phone numbers. Extracted entities can populate a database record about the text. This structure enables higher-level analyses, such as finding relationships between entities, detecting events, and analyzing sentiment around entities.

What is named entity recognition used for?

Better search for e-commerce, business research

Extracted entities make keyword search more accurate. Keywords only match words, whereas entity extraction uses context to know when, for example, “Paris” refers to a city, the name of a person (“Paris Jones”), or a nonentity (“plaster of Paris”). In e-commerce, extracting price, clothing features, size, and other product attributes from descriptions lets shoppers filter searches to narrow 200 results to a browsable 20.

Brand monitoring and intelligence gathering

Want to know what people are saying about a new product launch or about their experience at your hotel? NER is an enabling technology for sentiment analysis, which can track social media buzz or uncover new rivals. Intelligence agencies that track specific people and organizations of interest in message streams can distinguish between similarly named entities (e.g., Richie Fox the astronaut versus Richie Fox the hockey referee) by using the context surrounding a mention to link it to an entry in an entity knowledge base. (Does the text refer to space or hockey?)
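The context-based linking described above can be sketched in a few lines. This is only an illustration under loud assumptions: the knowledge base, its entity ids, and the context word lists are all hypothetical, and simple word overlap stands in for whatever scoring a production linker would use.

```python
import re

# Hypothetical knowledge base: two entities share a name and differ only
# in the vocabulary that tends to surround them.
KNOWLEDGE_BASE = [
    {"id": "fox-astronaut", "name": "Richie Fox",
     "context": {"space", "astronaut", "orbit", "launch", "mission"}},
    {"id": "fox-referee", "name": "Richie Fox",
     "context": {"hockey", "referee", "penalty", "puck", "game"}},
]

def link_entity(mention, text, knowledge_base=KNOWLEDGE_BASE):
    """Pick the candidate whose context words overlap most with the text.

    Returns the entity id, or None when no candidate shares a context
    word with the text or when the two best candidates tie.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scored = sorted(
        ((len(entry["context"] & words), entry["id"])
         for entry in knowledge_base if entry["name"] == mention),
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None  # no supporting context at all
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # equal evidence for two candidates
    return scored[0][1]
```

With this toy data, a sentence such as "Richie Fox called a penalty late in the hockey game." links to the referee entry, while a sentence with no space or hockey vocabulary returns None rather than guessing.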

Knowledge graphs, event extraction, fact extraction

Several more advanced technologies are built on NER:

  • Knowledge graphs visualize the relationships between entities (who is affiliated with which organizations and locations).
  • Fact extraction answers factual questions (What kills bacteria?).
  • Event extraction finds who did what to whom, when, and where.

Especially for these advanced technologies, entity extraction must be highly accurate and must chain together different mentions of the same entity, a task known as coreference resolution.

How entity extraction works

Different techniques are used to extract different types of entities.

Machine learning trains models to extract entities such as person, location, and organization where word meaning varies depending on context (e.g., Paris). A corpus of text containing thousands of examples of each entity type is annotated by humans. Then an algorithm trains a statistical model on that data to “learn rules” for predicting which words represent which entity type.
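As a rough illustration of learning from annotated examples, the sketch below trains a tiny Naive Bayes classifier to decide what a mention of "Paris" refers to, using the words around it. Everything here is an assumption made for illustration: the article does not name an algorithm, the eight training sentences are invented, and a real corpus would contain thousands of examples per entity type.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-annotated examples: each sentence mentions "Paris" once,
# and the label says what that mention refers to.
TRAINING_DATA = [
    ("She flew to Paris last spring", "LOCATION"),
    ("The museum in Paris reopened today", "LOCATION"),
    ("They traveled from Paris to Lyon", "LOCATION"),
    ("Ms Paris Jones signed the contract", "PERSON"),
    ("Director Paris Jones said the plan works", "PERSON"),
    ("Mr Paris Lee said he agreed", "PERSON"),
    ("Mix the plaster of Paris with water", "NONE"),
    ("The cast was plaster of Paris", "NONE"),
]

def context_features(sentence, target="Paris", window=2):
    """Words within `window` positions of the target, tagged by side.

    Raises ValueError if the target word is not in the sentence.
    """
    tokens = re.findall(r"\w+", sentence)
    i = tokens.index(target)
    left = [f"L:{t.lower()}" for t in tokens[max(0, i - window):i]]
    right = [f"R:{t.lower()}" for t in tokens[i + 1:i + 1 + window]]
    return left + right

def train(examples):
    """Count how often each context feature co-occurs with each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for sentence, label in examples:
        label_counts[label] += 1
        for feat in context_features(sentence):
            feature_counts[label][feat] += 1
            vocabulary.add(feat)
    return label_counts, feature_counts, vocabulary

def predict(model, sentence):
    """Naive Bayes with add-one smoothing: return the most probable label."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for feat in context_features(sentence):
            score += math.log((feature_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING_DATA)
print(predict(model, "Mr Paris Jones said hello"))          # PERSON
print(predict(model, "She flew from Paris to Rome"))        # LOCATION
print(predict(model, "Use plaster of Paris for the mold"))  # NONE
```

The "rules" the model learns are just counts: a title to the left or "said" to the right pushes the prediction toward a person, while "plaster of" pushes it toward a nonentity.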

The accuracy of a machine learning model depends on the algorithm used and, even more so, on the quality of the training and test data. Deep learning models can be more accurate than traditional machine learning models, but are currently much slower. Optimizing the accuracy of a model means adapting the statistical model to the particular set of data it is applied to.

The exact match method matches words against a list of entities for each entity type. This method is appropriate for entity types that are finite and unambiguous, such as nationalities. However, since exact match doesn’t consider context, it cannot distinguish between the nationality “Polish” and the common word “polish.”
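The limitation described above is easy to reproduce. The following is a minimal sketch, assuming a hypothetical four-entry nationality list and a simple alphabetic tokenizer; neither comes from a real product.

```python
import re

# Hypothetical, tiny entity list; a real one would cover every nationality.
NATIONALITIES = {"Polish", "French", "Japanese", "Brazilian"}

def exact_match(text, entity_list, case_sensitive=True):
    """Return the tokens found in the entity list, ignoring context entirely."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if case_sensitive:
        return [t for t in tokens if t in entity_list]
    lowered = {entry.lower() for entry in entity_list}
    return [t for t in tokens if t.lower() in lowered]

print(exact_match("The Polish delegation met French officials.", NATIONALITIES))
# ['Polish', 'French']
print(exact_match("Polish the silver before dinner.", NATIONALITIES))
# ['Polish']  (false positive: a sentence-initial verb)
print(exact_match("She bought shoe polish.", NATIONALITIES, case_sensitive=False))
# ['polish']  (false positive: a common noun)
```

Case-sensitive matching avoids the shoe polish error but still trips on the capitalized verb at the start of a sentence, which is why context is needed for ambiguous words.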

Pattern matching is effective for finding entities that follow a particular pattern, such as email addresses, URLs, and phone numbers.
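Pattern-based extraction is commonly implemented with regular expressions. The patterns below are deliberately simplified assumptions for illustration (for instance, the phone pattern only covers one ten-digit layout); real-world email, URL, and phone formats vary far more.

```python
import re

# Illustrative patterns only; not a complete treatment of any format.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # The final character class keeps trailing sentence punctuation out.
    "URL": re.compile(r"https?://[^\s<>\"]+[^\s<>\".,;:!?)]"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def extract_patterns(text):
    """Return (entity_type, matched_text, start_offset) tuples in text order."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])

text = ("Contact jane.doe@example.com or call 555-123-4567. "
        "Details at https://example.com/help.")
for label, value, _ in extract_patterns(text):
    print(label, value)
# EMAIL jane.doe@example.com
# PHONE 555-123-4567
# URL https://example.com/help
```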

Applications that analyze big data to find insight from patterns and themes in unstructured text depend on entity extraction, and the number of such applications will only continue to grow.


Disclaimer: All names, companies, and incidents portrayed in this document are fictitious. No identification with actual persons (living or deceased), places, companies, or products is intended or should be inferred.
