Complete guide to build your own Named Entity Recognizer with Python

Updates

  • 29-Apr-2018 – Added Gist for the entire code

NER, short for Named Entity Recognition, is probably the first step towards information extraction from unstructured text. It basically means extracting the real-world entities mentioned in the text (Person, Organization, Event, etc.).

Why do you need this information? You might want to map it against a knowledge base to understand what the sentence is about, or you might want to extract relationships between different named entities (like who works where, when an event takes place, etc.).

NLTK NER Chunker

NLTK has a standard NE annotator so that we can get started pretty quickly.

ne_chunk needs part-of-speech annotations to add NE labels to the sentence. The output of ne_chunk is an nltk.Tree object.
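Here's a minimal sketch of what that looks like in practice (the example sentence is my own; you may also need to download a few NLTK data packages first, as noted in the comments):

```python
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# If you haven't already, you may need:
# nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('maxent_ne_chunker'), nltk.download('words')

sentence = "Mark and John are working at Google."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
# Should print a tree roughly like:
# (S (PERSON Mark/NNP) and/CC (PERSON John/NNP)
#    are/VBP working/VBG at/IN (ORGANIZATION Google/NNP) ./.)
```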

The ne_chunk function acts as a chunker, meaning it produces 2-level trees:

  1. Nodes on Level-1: Outside any chunk
  2. Nodes on Level-2: Inside a chunk – The label of the chunk is denoted by the label of the subtree

In this example, Mark/NNP is a level-2 leaf, part of a PERSON chunk, while and/CC is a level-1 leaf, meaning it's not part of any chunk.

IOB tagging

nltk.Tree is great for processing such information in Python, but it’s not the standard way of annotating chunks. Maybe this can be an article on its own but we’ll cover this here really quickly.

The IOB Tagging system contains tags of the form:

  1. B-{CHUNK_TYPE} – for the word at the Beginning of a chunk
  2. I-{CHUNK_TYPE} – for words Inside the chunk
  3. O – Outside any chunk

A sometimes used variation of IOB tagging is to simply merge the B and I tags:

  1. {CHUNK_TYPE} – for words inside the chunk
  2. O – Outside any chunk

We usually want to work with the proper IOB format.

Here’s how to convert between the nltk.Tree and IOB format:
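```python
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags, conlltags2tree

# Reusing the example sentence from above
sentence = "Mark and John are working at Google."
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))

# nltk.Tree -> [(word, pos, iob), ...]
iob_tagged = tree2conlltags(ne_tree)
print(iob_tagged)
# e.g. [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ...]

# ... and back again: [(word, pos, iob), ...] -> nltk.Tree
ne_tree = conlltags2tree(iob_tagged)
print(ne_tree)
```

Both helpers, tree2conlltags and conlltags2tree, live in nltk.chunk, so the conversion works in either direction.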

GMB corpus

NLTK doesn't have a proper English corpus for NER. It does have the CoNLL 2002 Named Entity corpus, but that one only covers Spanish and Dutch. You can definitely try the method presented here on that corpus; in fact, doing so would be easier because NLTK provides a good corpus reader for it. We are going with the Groningen Meaning Bank (GMB) though.

GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect. It is not a gold-standard corpus, meaning that it's not completely human-annotated and it's not considered 100% correct. The corpus was created by running existing annotators over the text and then correcting the output by hand where needed.

Let’s start playing with the corpus. Download the 2.2.0 version of the corpus here: Groningen Meaning Bank Download

Essentially, GMB is composed of a lot of files, but we only care about the .tags files. Open one up and take a look.

That looks rather messy, but in fact it's pretty structured. A file contains multiple sentences, separated by two newline characters. Within a sentence, every word sits on its own line (one newline character apart), and for every word, the individual annotations are separated by tab characters.
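Here's a quick sketch that reads one .tags file and splits it according to that structure. The corpus path and the column layout (word in column 0, POS tag in column 1, NER tag in column 3) are assumptions on my part, so double-check them against your copy of the corpus:

```python
import os

corpus_root = "gmb-2.2.0"   # wherever you unpacked the archive

# Grab the first .tags file we can find
sample_path = next(
    os.path.join(root, name)
    for root, dirs, files in os.walk(corpus_root)
    for name in files
    if name.endswith('.tags')
)

with open(sample_path, 'rb') as f:
    file_content = f.read().decode('utf-8').strip()

sentences = file_content.split('\n\n')           # sentences are separated by 2 newlines
for token_row in sentences[0].split('\n'):       # words are separated by 1 newline
    annotations = token_row.split('\t')          # annotations are separated by tabs
    word, pos, ner = annotations[0], annotations[1], annotations[3]
    print(word, pos, ner)
```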

Let's interpret the tags a bit. We can observe that the tags (except for O, of course) follow the {TAG}-{SUBTAG} pattern. Here's what the top-level categories mean:

  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon

The subcategories are pretty unnecessary and pretty polluted. per-ini, for example, tags the initial of a person's name; that one kind of makes sense. On the other hand, it's unclear what the difference is between per-nam (person name), per-giv (given name), per-fam (family name) and per-mid (middle name).

I decided to just remove the subcategories and focus only on the main ones. Let’s modify the code a bit:
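```python
# Inside the token loop from the sketch above: keep only the top-level category.
word, pos, ner = annotations[0], annotations[1], annotations[3]

if ner != 'O':
    ner = ner.split('-')[0]   # e.g. 'geo-nam' -> 'geo', 'per-giv' -> 'per'

print(word, pos, ner)
```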

This looks much better. You might decide to drop the last few tags because they are not well represented in the corpus. We’ll keep them … for now.

Training your own system

In an earlier post, we trained a part-of-speech tagger. You can read it here: Training a Part-Of-Speech Tagger. We're taking a similar approach for training our NE-Chunker.

The feature extraction works almost identically to the one implemented in the Training a Part-Of-Speech Tagger post, except that we've added a history mechanism. Since the previous IOB tag is a very good indicator of what the current IOB tag is going to be, we include the previous IOB tag as a feature.
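Here's a sketch of what such a feature detector might look like. The exact feature set is my own choice; what matters is the (tokens, index, history) signature, which is what nltk.tag.ClassifierBasedTagger expects, and the prev-iob entry, which is the history mechanism:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')


def features(tokens, index, history):
    """
    tokens  -- a POS-tagged sentence: [(word, pos), (word, pos), ...]
    index   -- the index of the token we want features for
    history -- the IOB tags predicted so far in this sentence
    """
    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1] if index > 0 else ('<START>', '<START>')
    nextword, nextpos = tokens[index + 1] if index < len(tokens) - 1 else ('<END>', '<END>')
    previob = history[index - 1] if index > 0 else '<START>'

    return {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'is-capitalized': word[0].isupper(),
        'prev-word': prevword,
        'prev-pos': prevpos,
        'prev-iob': previob,   # the previous IOB tag -- the history mechanism
        'next-word': nextword,
        'next-pos': nextpos,
    }
```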

Let’s create a few utility functions to help us with the training and move the corpus reading stuff into a function, read_gmb:
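```python
import os


def to_conll_iob(annotated_sentence):
    """Turn [(word, pos, ner), ...] with bare NER tags into proper IOB tags."""
    proper_iob_tokens = []
    for idx, (word, pos, ner) in enumerate(annotated_sentence):
        if ner != 'O':
            if idx == 0 or annotated_sentence[idx - 1][2] != ner:
                ner = 'B-' + ner            # first token of a chunk
            else:
                ner = 'I-' + ner            # continuation of the chunk
        proper_iob_tokens.append((word, pos, ner))
    return proper_iob_tokens


def read_gmb(corpus_root):
    """Walk the GMB folder and yield one annotated sentence at a time.

    A sketch: the column indices below (word, POS, NER) are the same
    assumptions as earlier -- check them against your copy of the corpus.
    """
    for root, dirs, files in os.walk(corpus_root):
        for filename in files:
            if not filename.endswith('.tags'):
                continue
            with open(os.path.join(root, filename), 'rb') as file_handle:
                file_content = file_handle.read().decode('utf-8').strip()
            for annotated_sentence in file_content.split('\n\n'):
                annotated_tokens = [t for t in annotated_sentence.split('\n') if t]
                standard_form_tokens = []
                for annotated_token in annotated_tokens:
                    annotations = annotated_token.split('\t')
                    word, pos, ner = annotations[0], annotations[1], annotations[3]
                    if ner != 'O':
                        ner = ner.split('-')[0]   # drop the subcategory
                    standard_form_tokens.append((word, pos, ner))
                conll_tokens = to_conll_iob(standard_form_tokens)
                # The tagger works with ((word, pos), iob) pairs
                yield [((w, p), iob) for w, p, iob in conll_tokens]
```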

Check the output:
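```python
corpus_root = "gmb-2.2.0"   # same assumed location as before
reader = read_gmb(corpus_root)
print(next(reader))
# -> a list of ((word, pos), iob) pairs for the first sentence in the corpus
```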

We managed to read sentences from the corpus in a proper format. We can now start to actually train a system. NLTK offers a few helpful classes to accomplish the task. nltk.chunk.ChunkParserI is a base class for building chunkers/parsers. Another useful asset we're going to use is nltk.tag.ClassifierBasedTagger. Under the hood, it trains a Naive Bayes classifier by default and uses it to predict the sequence of tags.
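Putting the two together, a chunker can look roughly like this. parse() tags the sentence with the ClassifierBasedTagger (driven by the features function above), then converts the resulting IOB triplets back into an nltk.Tree with conlltags2tree:

```python
from nltk.chunk import ChunkParserI, conlltags2tree
from nltk.tag import ClassifierBasedTagger


class NamedEntityChunker(ChunkParserI):
    def __init__(self, train_sents, **kwargs):
        # ClassifierBasedTagger trains a Naive Bayes classifier by default
        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=train_sents,
            feature_detector=features,
            **kwargs)

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)

        # [((word, pos), iob), ...] -> [(word, pos, iob), ...]
        iob_triplets = [(w, t, c) for ((w, t), c) in chunks]

        # ... and back to an nltk.Tree
        return conlltags2tree(iob_triplets)
```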

Let’s build the datasets:
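```python
# A simple 90/10 split; the ratio is arbitrary
data = list(read_gmb(corpus_root))
split = int(len(data) * 0.9)
training_samples = data[:split]
test_samples = data[split:]

print("#training samples = %s" % len(training_samples))
print("#test samples = %s" % len(test_samples))
```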

We've built everything up so beautifully that the training can be expressed as simply as:
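```python
# Training on a slice of the data is just to keep the runtime reasonable;
# use the full training set if you don't mind the wait.
chunker = NamedEntityChunker(training_samples[:2000])
```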

It probably took a while. Let’s take it for a spin:
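```python
from nltk import pos_tag, word_tokenize

# An illustrative sentence of my own, containing a country and a day of the week
print(chunker.parse(pos_tag(word_tokenize("I'm going to Germany this Monday."))))
```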

The system you just trained did a great job at recognizing named entities:

  • Named Entity “Germany” – Geographical Entity
  • Named Entity “Monday” – Time Entity

Testing the system

Let's see how the system measures up. Because we followed the good patterns of NLTK, we can test our NE-Chunker as simply as this:
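```python
from nltk.chunk import conlltags2tree

# evaluate() expects gold-standard nltk.Tree objects, so we rebuild them from
# the IOB-tagged test samples. Scoring on a slice just to keep things quick.
score = chunker.evaluate([
    conlltags2tree([(w, t, iob) for (w, t), iob in iobs])
    for iobs in test_samples[:500]
])
print(score.accuracy())
```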

Conclusions

  • Chunking can be reduced to a tagging problem.
  • Named Entity Recognition is a form of chunking.
  • We explored a freely available corpus that can be used for real-world applications.
  • The NLTK classifier can be replaced with any classifier you can think of. Try replacing it with a scikit-learn classifier.

If you loved this tutorial, you should definitely check out the sequel: Training a NER system on a large dataset. It builds upon what you've already learned, uses a scikit-learn classifier and pushes the accuracy to 97%.

Notes

  • I’ve used NLTK version 3.2.1
  • You can find the entire code here: Python NER Gist