Pattern-based Few-shot Entity Disambiguation Reduces Need for Large Data Sets

Babel Street hosted a gathering of the Boston NLP Group at its Somerville, MA office. The featured topics were Use Cases for Do It Yourself NLP and Pattern-Based Few-Shot Entity Disambiguation. Philip Blair, a research engineer at Babel Street, presented the progress his team has been making on projects around entity linking and how pattern exploitation techniques can be applied to entity disambiguation.

Entity linking is a natural language processing (NLP) task that identifies an entity mentioned in a document and connects it with an entry in a knowledge base. In the example “Cambridge is a city in Massachusetts,” the mentions “Cambridge” and “Massachusetts” could each correspond to multiple entries in the knowledge base, such as Cambridge, UK or the University of Massachusetts. The system therefore first performs candidate generation (similar to named entity recognition) to collect the possible entries, and then performs a disambiguation step to select the right candidate for that particular mention.

Entity disambiguation is a tricky problem because for a system to do it automatically, it must:

Look at the context in which the mention appears
Examine relevant fields in the knowledge base for each candidate entity
Make an informed decision based on what it finds
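
As a rough illustration of these three steps, here is a minimal Python sketch. It assumes a tiny hand-written knowledge base and scores candidates by simple word overlap; the `KNOWLEDGE_BASE` entries, `tokenize`, and `disambiguate` are hypothetical names made up for this example, and real disambiguation systems rely on statistical models rather than anything this simple.

```python
import re

# Hypothetical toy knowledge base: entry name -> short description.
KNOWLEDGE_BASE = {
    "Cambridge, Massachusetts": "city in Massachusetts, United States, home to Harvard and MIT",
    "Cambridge, UK": "city in England, United Kingdom, known for its university",
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate(context, candidates, kb=KNOWLEDGE_BASE):
    """Pick the candidate whose description shares the most words with the context."""
    # Step 1: look at the context in which the mention appears.
    context_words = tokenize(context)
    # Step 2: examine the description field of each candidate entry.
    scores = {
        name: len(context_words & tokenize(kb[name]))
        for name in candidates
    }
    # Step 3: decide based on the overlap scores.
    best = max(scores, key=scores.get)
    return best, scores

best, scores = disambiguate(
    "Cambridge is a city in Massachusetts",
    ["Cambridge, Massachusetts", "Cambridge, UK"],
)
print(best)    # Cambridge, Massachusetts
print(scores)  # overlap counts per candidate
```

Word overlap is only a stand-in for the "informed decision" step. It breaks down as soon as the context and the knowledge base describe the same thing in different words.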

In very simple cases, you could write rules for determining whether entities match, but in the real world, the problem is far more nuanced. Disambiguation systems are therefore based on statistics and machine learning, and one of their biggest obstacles is domain shift. Statistical models are generally trained on one particular style of text, from which they learn to glean context. News articles are often the training source, and a model trained on them may not perform well in medical, legal, or engineering contexts, for example.

Data scientists who recognize this problem might be tempted to collect data in the required domain and train a model themselves. Unfortunately, this turns out to be an expensive and time-consuming effort that requires a data set with tens of thousands of entity mentions. That’s a big investment each time a model is needed for a new domain.

The goal, instead, is to get named entity linking systems running without needing such a large quantity of data. This leads us to a relatively recent technique known as pattern exploitation. Large language models are pre-trained on large bodies of text, often by predicting hidden words in a task known as masked language modeling. In this task, tokens are randomly selected from the document and replaced with a mask token. The model must then take the masked sentence and predict which token was hidden. An example of this is the sentence “I walked my [MASK] yesterday.” It’s up to the model to determine what word the mask token is replacing, and a well-trained model will be able to say “dog” is more likely than “squirrel.”
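
To make the masking step concrete, here is a small Python sketch of how a masked training example could be built. It assumes whitespace-separated words as tokens (real models use subword tokenizers), and `mask_tokens` is a hypothetical helper written for this illustration, not part of any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, num_masks=1, seed=0):
    """Hide `num_masks` randomly chosen tokens and return (masked tokens, answers).

    `answers` maps each masked position to the original token that the
    model would be asked to predict during pre-training.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(tokens)), num_masks)
    masked = list(tokens)
    answers = {}
    for pos in positions:
        answers[pos] = masked[pos]
        masked[pos] = MASK
    return masked, answers

tokens = "I walked my dog yesterday".split()
masked, answers = mask_tokens(tokens, num_masks=1)
print(" ".join(masked))  # the sentence with one word hidden
print(answers)           # position of the hidden word and the word itself
```

During pre-training, the model sees only the masked sentence and is scored on how well it recovers the entries in `answers`.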

Another interesting example is “We are in [MASK], Massachusetts,” where a well-trained model would be able to say “Boston” is more likely than “Utah.” In this case, it can be inferred that the model has “learned” there is a relationship between Boston and Massachusetts. The question, then, is how this can be leveraged directly for entity linking.

Philip’s team was inspired by a new technique called pattern exploitation training via a system called ADAPET. ADAPET is a method for turning a language model into a text classifier. Philip provided the hypothetical example of using masked language modeling to get a language model to predict whether a customer review is positive or negative. The trick? Just rephrase the sample slightly, like this: “This is a [MASK] review: This product sucks!” Now ask the model whether the mask is more likely to be hiding the word “positive” or “negative.” This technique calls on the model’s intuition, which has already been strengthened by the training it’s received up to this point. At the risk of overgeneralizing, this technique allows us to control how the model approaches the problem. The result is a binary classifier that uses a general pre-trained language model without additional domain-specific training.
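
The review example can be sketched in a few lines of Python. This is only an outline of the idea under stated assumptions: `mask_scorer` is a hypothetical callable standing in for a masked language model, and `toy_scorer` is a keyword counter used so the example runs on its own. Neither reflects how ADAPET itself is implemented.

```python
def build_pattern(review):
    """Wrap the review in a cloze-style pattern with a single mask slot."""
    return f"This is a [MASK] review: {review}"

def classify(review, mask_scorer, labels=("positive", "negative")):
    """Return the label word the scorer considers most likely behind the mask.

    `mask_scorer(pattern, word)` is a hypothetical stand-in for a masked
    language model; it returns a score for `word` filling the mask.
    """
    pattern = build_pattern(review)
    scores = {label: mask_scorer(pattern, label) for label in labels}
    return max(scores, key=scores.get), scores

# Toy stand-in scorer (NOT a language model): counts cue words in the pattern.
CUES = {
    "positive": {"great", "love", "excellent"},
    "negative": {"sucks", "broken", "terrible"},
}

def toy_scorer(pattern, word):
    text = pattern.lower()
    return sum(cue in text for cue in CUES[word])

label, scores = classify("This product sucks!", toy_scorer)
print(label)  # negative
```

The classifier never sees a "sentiment" output layer. It only compares how plausible two ordinary words are in the masked slot, which is why a general pre-trained model can be reused for the task.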

Applying a machine learning model to a new domain without any additional training is known as “zero-shot” learning. Often, however, just a little fine-tuning for the new domain yields an even better model. This is “few-shot” learning. While few-shot learning still requires a bit of training on the new domain, it is far cheaper and faster than training a new model from scratch.

To test the performance of this ADAPET-based entity disambiguation classifier, Philip and the team implemented it within the existing DCA system as an additional input when choosing between entity candidates. For example, when extracting an entity such as “Boston,” this would help the system decide whether the word represents the city itself or, for example, a sports team. The results indicate that while ADAPET-integrated DCA and baseline DCA perform similarly, the ADAPET component reaches its peak accuracy with much less training data.

These results were obtained through testing in the domain of general news data. Further testing showed that in a more niche domain (in this case, mental health news), fine-tuning DCA with ADAPET resulted in a 10% across-the-board performance improvement. “This system is able to more readily adapt to a more varied set of domains without degrading the performance on the baseline general news domain,” Philip summarized. In the future, he hopes to further improve the system by adding more language support (it is currently limited to English), testing it on more domains, and integrating it with different knowledge bases.

This isn’t the only few-shot learning project at Babel Street. During Hackathon 2022, Team 2 pursued a few-shot event extraction project. Events are more complex than entities (hence the tendency to throw powerful, massively trained models at them); they describe interactions involving one or more entities. We can assign these entities different roles based on how they participate in the event. As an example, take the sentence “Tony bought a sandwich from Satriale’s.” In this purchase event, the entity Tony would be in the buyer role, sandwich would be in the product role, and Satriale’s would be in the seller role. Team 2 used the same language masking approach to determine the correct candidate for a given role by asking the model a series of yes or no questions.

Focusing on movement events, the team provided the following example: “We’re told the Australian Prime Minister was greeted by the King as he arrived in London today from Australia.” When identifying the origin role, their model formulates the problem like this: “Is London the origin of the movement? - False. Is Australia the origin of the movement? - True.” Incidentally, Team 2’s results showed promise for one of Philip’s team’s long-term goals: their English-primed model (using GPT-J, which works differently from ADAPET) showed some success answering questions in French.
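
The question format above can be generated mechanically from a sentence, a role, and a list of candidate entities. The Python sketch below assumes the candidates and the correct answer are already annotated; `role_questions` and its dictionary layout are hypothetical and are not Team 2’s actual code.

```python
def role_questions(sentence, candidates, role, gold):
    """Build one yes/no training example per candidate entity for a role."""
    examples = []
    for entity in candidates:
        examples.append({
            "context": sentence,
            "question": f"Is {entity} the {role} of the movement?",
            "answer": entity == gold,  # True only for the annotated entity
        })
    return examples

sentence = ("We're told the Australian Prime Minister was greeted by the King "
            "as he arrived in London today from Australia.")
examples = role_questions(sentence, ["London", "Australia"], "origin", gold="Australia")
for ex in examples:
    print(ex["question"], "-", ex["answer"])
# Is London the origin of the movement? - False
# Is Australia the origin of the movement? - True
```

Each annotated sentence yields one example per candidate for each role, so a small set of sentences expands into a larger set of training questions.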

One of Team 2’s biggest obstacles during their project was the amount of manual annotation they had to do. Since they generated multiple yes/no questions for each sentence, their 115 sentences became 350 training examples. However, this is still a drop in the bucket compared to the amount of work necessary for training a new model from scratch. It’s exciting to think about what we can do with all the time that pattern exploitation training will save.