In this blog post, I'll introduce you to named entity recognition with Spacy. We'll cover the following:
- What is Named Entity Recognition (NER) and why is it useful?
- How to do Named Entity Recognition using Spacy
- How do Spacy NER models compare across model sizes in English?
- How do Spacy NER models perform in different languages (English and French)?
- How to visualise our named entities with Displacy
Let's dive in!
What is Named Entity Recognition?
Named Entity Recognition (NER) is a natural language processing task in which we aim to find and classify named entities in unstructured text. These named entities belong to one of multiple predefined categories such as person, location, organization or date.
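Conceptually, a NER system maps raw text to a list of labelled character spans. Here's a minimal pure-Python sketch of that output shape, with hand-labelled spans rather than a real model (the sentence and labels are made up for illustration):

```python
# A NER system maps raw text to labelled character spans.
# Hand-labelled example (not a real model) to show the output shape.
sentence = "Sundar Pichai announced new products at Google in May 2023."

# Each entity: (text, start_char, end_char, label)
entities = [
    ("Sundar Pichai", 0, 13, "PERSON"),
    ("Google", 40, 46, "ORG"),
    ("May 2023", 50, 58, "DATE"),
]

# The character offsets let us recover each entity from the original text
for text, start, end, label in entities:
    assert sentence[start:end] == text
    print(f"{label:8} {text}")
```

This is exactly the shape Spacy gives us back later on, via `ent.text`, `ent.start_char`, `ent.end_char` and `ent.label_`.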
Why is it useful?
Named Entity Recognition is useful because it allows us to answer questions about unstructured texts quite easily. Some use cases that I can think of:
- Customer insights: what products, locations or people are often mentioned in (negative) reviews?
- Prioritising customer support tickets: categorizing them based on NER in the text
- Quick insights into (tech) interview candidates: where are they from, what tools do they use, where did they go to college?
- Summarizing insights for fun and profit: setting a buy (or sell) order on the coin that Elon Musk mentions in his tweet right at that second (don't do this).
Quick introduction to Spacy
For those of you still unfamiliar with Spacy: Spacy is a kick-ass Python library for Natural Language Processing.
It's completely open-source and it provides tooling for tokenization, POS-tagging, Named Entity Recognition, Dependency Parsing and much more!
Moreover, it works with language models and has industry-grade models for a bunch of languages (e.g. English, French and Spanish but also Chinese and Russian).
Spacy for Named Entity Recognition
Today, we'll be using Spacy specifically for its Named Entity Recognition functionality.
The Spacy NER system is fast and ready to be used in large enterprise applications.
It is trained on the OntoNotes 5 corpus, and the following named entity classes are present in the English language models: PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.
First of all, it's important to have Spacy installed, as well as any language models you need. I also use the `wikipedia` package to get texts from Wikipedia.
I've created this tutorial using Jupyter lab, so if you want to follow along, I'd recommend you use that too.
```
pip install spacy
pip install wikipedia
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
python -m spacy download fr_core_news_lg
```
Getting started with Spacy
First we load our Spacy language model, in this case `en_core_web_trf`. This is Spacy's most accurate English-language model, a transformer-based one.

To make use of the Spacy model's capacities, we need it to transform our sentence (or document(s)) into Spacy docs. We do this by calling `nlp(<text>)`. Here, the convention is to name our model `nlp`.
Once we've Spacified our sentence, we can iterate over all of the entities that the model found in the sentence:
```python
import spacy

nlp = spacy.load('en_core_web_trf')  # highest accuracy model

sentence = "England Lifts Coronavirus Rules as Queen Elizabeth Battles Infection."
doc = nlp(sentence)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```
This outputs the following:
You see the model correctly classified England as a country (GPE, geo-political entity) and Queen Elizabeth as a person.
💡 Pro tip: use `spacy.explain()` to explain Spacy concepts such as named entity abbreviations. For instance, `spacy.explain('NORP')` will return 'Nationalities or religious or political groups'.
Ambiguity makes named entity recognition more difficult!
For clear and unambiguous sentences, these models work well. Yet, when the words are ambiguous, like names of people that are also names of organizations, the named entity recognition model has more trouble providing accurate tags. This holds especially true for smaller language models as we will see below.
As an example, take the sentence: "Elon Musk says Tesla was 'idiotic' to stop Model X production in 2020". Any idea of what could go wrong with this sentence?
```python
sentence = "Elon Musk says Tesla was 'idiotic' to stop Model X production in 2020"
doc = nlp(sentence)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```
You can see here that the `en_core_web_trf` model was able to correctly classify its entities. Yet if we use `en_core_web_sm`, which is a much smaller English language model, it will give you a different output:
```python
nlp_sm = spacy.load('en_core_web_sm')

sentence = "Elon Musk says Tesla was 'idiotic' to stop Model X production in 2020"
doc_sm = nlp_sm(sentence)

for ent in doc_sm.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```
You see here it thinks Tesla is a person, which makes sense (Nikola Tesla anyone?), yet in this context we'd like it if it recognised it as an organization. It also didn't recognise Elon Musk as a person and failed to classify Model X as a product.
You see that named entity recognition is not always trivial. More training data often leads to better models. The transformer architecture of the `trf` model probably also leads to better performance in this case, because it takes contextual information into account.
Note that transformer-based architectures are more computationally expensive and may require a GPU to run effectively. I've been able to run these examples without a GPU, but keep in mind that these models may be harder to deploy in production.
Using Displacy to visualise your named entities
Printing out your named entities is fine for smaller texts, but with larger texts it gets pretty hard to keep the overview and look at the named entities within the context of the whole text.
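Before reaching for a visualiser, one lightweight way to get an overview of a larger text is simply counting entities. A minimal pure-Python sketch, assuming you've collected `(text, label)` pairs from `doc.ents` (the example values below are hypothetical, not real model output):

```python
from collections import Counter

# (ent.text, ent.label_) pairs as you might collect them from doc.ents;
# these example values are hypothetical
ents = [
    ("London", "GPE"), ("England", "GPE"), ("London", "GPE"),
    ("the River Thames", "LOC"), ("Romans", "NORP"), ("London", "GPE"),
]

# Count how often each (entity, label) pair occurs
counts = Counter(ents)
for (text, label), n in counts.most_common(3):
    print(f"{n}x {label:5} {text}")
```

Frequency counts like these answer the "what is this text mostly about?" question, but they still throw away context, which is where a visualiser shines.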
Luckily the cool people at Spacy created Displacy, which amongst other things, allows you to display your named entities in a beautiful way.
All you have to do is the following:
- import `displacy` from Spacy
- call `displacy.render(doc, style='ent')` if you're in a Jupyter notebook; if not, use `displacy.serve(doc, style='ent')`, where `doc` is your 'spacified' document
```python
from spacy import displacy

displacy.render(doc, style="ent")
```
Here, we use our `en_core_web_trf` model again, by the way.
Displacy will output the following visual to demonstrate the named entities in the text:
Applying NER to a Wikipedia page
Now, to test our named entity models on a larger piece of text, we'll use Wikipedia.
Python has a nice library, appropriately named `wikipedia`, that allows us to fetch summaries from Wikipedia pages.
We use both our small and our large language models here, to demonstrate how their results may vary on a large text.
```python
import wikipedia

london = wikipedia.summary(title='London')
doc = nlp(london)
doc_sm = nlp_sm(london)
```
Then we call `displacy.render()` again on both docs:
```python
displacy.render(doc, style="ent")
displacy.render(doc_sm, style="ent")
```
Below, I compare the outputs of both Spacy language models. Note that I've capped the summary texts to the first paragraph for better legibility.
- The smaller model misses some entities (e.g. River Thames, Kent)
- The smaller model misclassifies some entities (e.g. Essex as ORG and Surrey as a Person, Greater London as a Person)
- The smaller model groups 'the Romans as Londinium' into one entity
- The larger model is not flawless either: it misses Londinium
- You could argue that the second 'England' should be 'south-east England', which is wrongly tagged in both models
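Comparisons like the bullet points above can also be done programmatically, by diffing the two models' predictions as sets of `(text, label)` pairs. A rough sketch; the entity sets below are illustrative, based on the differences listed above, not the full model outputs:

```python
# (text, label) predictions from the large and small model (illustrative,
# based on the differences discussed above, not complete model output)
large = {("River Thames", "LOC"), ("Kent", "GPE"),
         ("Essex", "GPE"), ("Surrey", "GPE")}
small = {("Essex", "ORG"), ("Surrey", "PERSON")}

# Pairs only one model produced (missed entities or different labels)
print("only in large:", sorted(large - small))
print("only in small:", sorted(small - large))
```

A span the models label differently shows up in both diffs, which makes misclassifications (Essex as ORG, Surrey as PERSON) easy to spot at a glance.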
Applying NER to French texts with Spacy's French language models
As you may know, lower-resource languages (languages with fewer large corpora and other linguistic resources for NLP) often see lower performance than higher-resource languages such as English. Reasons for this range from less data being available to train these models to less focus on these languages from researchers. Now, French certainly isn't a 'super low resource language', but I was curious whether the performance gap with English would show up here, too.
So, we're going to have a look at the French equivalent of the Wikipedia article of London. Note, the texts are not identical, so no exact comparison can be done. It's just to get an idea of how well the French language models work.
First we set the language of our wikipedia class to French. Then we search for the summary of the article on Londres (London in French).
```python
wikipedia.set_lang("fr")

londres = wikipedia.summary(title='Londres')
```
Then, we load the two French language models that we downloaded previously:
```python
nlp_fr_sm = spacy.load('fr_core_news_sm')
nlp_fr_lg = spacy.load('fr_core_news_lg')
```
We create the docs:
```python
doc_fr_sm = nlp_fr_sm(londres)
doc_fr_lg = nlp_fr_lg(londres)
```
And before we visualise, be aware that Spacy's French NER models only cover the following entities: LOC, MISC, ORG and PER.
This less fine-grained entity selection is quite common in non-English entity models. Many of the multilingual and non-English models I've seen on Huggingface were only trained on these entities.
Now, as you can see most of the GPEs have actually been classified as LOC. Even though this is not as fine-grained, the model does pick the most closely related entity that it has been trained on.
Some exceptions are 'Romains', which I'd argue is semantically closer to ORG than to LOC. And you could wonder why it didn't classify more entities as MISC, for lack of a better label.
Of course, it doesn't find the DATEs (2000 ans) or QUANTITYs (e.g. 2.9 km2), because the NER model hasn't learned about those entity types.
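If you ever need to compare English (OntoNotes-style) output against the coarser French label set, one option is to collapse the fine-grained labels into the four coarse ones. A small sketch; the mapping below is my own rough judgement call, not something Spacy provides:

```python
# Rough mapping from OntoNotes-style labels to the coarse French label set;
# this mapping is a personal judgement call, not part of Spacy
COARSE = {
    "GPE": "LOC", "LOC": "LOC", "FAC": "LOC",
    "ORG": "ORG",
    "PERSON": "PER",
}

def coarsen(label: str) -> str:
    # Anything without a clear counterpart falls back to MISC
    return COARSE.get(label, "MISC")

print(coarsen("GPE"))    # GPE collapses to LOC, matching what we saw above
print(coarsen("EVENT"))  # no clear counterpart, so MISC
```

This mirrors what the French model itself seems to do: picking the most closely related coarse entity it was trained on.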
Finally, what's interesting is that both the small and the large model have the same classifications. So on this summary text, we cannot see a clear performance benefit of using the larger NER model for French.
Now if you needed more fine-grained named entities for French, you'd have to train a NER model yourself. Spacy offers a lot of tooling for this as well. I might cover the topic of training your own named entity models in a later blog post.
In this tutorial I've shown you how easy it is to do Named Entity Recognition with Spacy. Thanks to the Displacy visualiser we can also easily visualise our entity results.
We've seen how ambiguous sentences can confuse a model, and that some models are better equipped to deal with this ambiguity than others. Here, the large English transformer model really outshone its smaller counterpart.
Once again, we've seen how English language models have quite an advantage over lower resource language models, both in accuracy and in that they can offer more granularity in their entities. For French (and other, even lower resource languages), the types of entities that these models are trained on are limited. Depending on your use case, this may still suffice though.
Have you worked with Named Entities before? What problem did you solve using NER? I'm curious to hear your experience. :)
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at firstname.lastname@example.org or drop me a message on Twitter.