Training a NER System Using a Large Dataset

In a previous article, we studied training a NER (Named-Entity-Recognition) system from the ground up, using the Groningen Meaning Bank Corpus. This article is a continuation of that tutorial. The main goals of this extension are to:

  1. Replace the classifier with a Scikit-Learn Classifier
  2. Train a NER on a larger subset of the training data
  3. Increase accuracy
  4. Understand Out Of Core Learning

What was wrong with the initial system, you might ask? There wasn’t anything fundamentally wrong with the process. In fact, it’s a great didactic example, and we can build upon it. This is where it was lacking:

  1. If you did the training yourself, you probably realized we can’t train the system on the whole dataset (I chose to train it on the first 2000 sentences).
  2. The dataset is too large to be loaded into memory all at once.
  3. We achieved around 93% accuracy. That might sound good, but it can be deceiving. Named entities probably make up only around 10% of the tags. If we predicted the O tag for every word (remember, O stands for outside any entity), we would already reach about 90% accuracy. We can probably do better.
  4. We can come up with a better feature set that better describes the data and is more relevant to our task.
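
To make the point about accuracy concrete, here is a minimal sketch of the "predict O everywhere" baseline. The tag sequence is made up for illustration (90 O tags and 10 entity tags), not taken from the corpus:

```python
from collections import Counter


def majority_baseline_accuracy(tags):
    """Accuracy obtained by always predicting the most frequent tag."""
    counts = Counter(tags)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(tags)


# Hypothetical tag sequence: 90 tokens outside any entity, 10 inside one.
toy_tags = ["O"] * 90 + ["B-per"] * 4 + ["I-per"] * 3 + ["B-geo"] * 3

baseline = majority_baseline_accuracy(toy_tags)
print(baseline)  # 0.9
```

A tagger that never finds a single entity would still score 90% here, which is why a 93% accuracy leaves less room for celebration than it seems.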

Out-Of-Core Learning

We are used to showing all the data we have at once to our classifier. This means that we have to keep all the data in memory, which gets in our way if we want to train on a larger dataset. Training without holding the whole dataset in RAM is called Out-Of-Core Learning.

Certain types of classifiers can be trained on data presented in batches. Scikit-Learn includes a few such classifiers. Here’s the list: Scikit-Learn Incremental Classifiers. The process of learning from batches is called Incremental Learning.

The classifiers that support Incremental Learning implement the partial_fit method.
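
As a minimal sketch of what calling partial_fit looks like, the snippet below feeds a Perceptron two small batches. The data is a made-up toy set, and the batch size is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Toy, made-up data: two classes in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8],
              [0.2, 0.1], [1.1, 0.9], [0.0, 0.3], [0.8, 1.2]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = Perceptron(random_state=0)
all_classes = np.array([0, 1])  # every label the model will ever see

batch_size = 4
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Each call updates the existing weights instead of starting over.
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

predictions = clf.predict(X)
```

Unlike fit, which resets the model, each partial_fit call continues from the weights learned so far, so the full dataset never has to be in memory at once.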

Using generators

In the previous tutorial, we created a method of reading from the corpus that didn’t keep the whole dataset in memory. It made use of generators.

Unfortunately, because we had to present the whole data at once, we converted the generator into a list, thus losing the advantage of working with generators. Since we don’t need all the data at once this time, we’ll slice a batch from the generator every time we call the partial_fit method. Let’s include the corpus reading routine from the previous article here:
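
The slicing step itself can be sketched independently of the corpus. In the snippet below, fake_corpus_reader is a hypothetical stand-in for the real reading routine (it is not the routine from the previous article), and iter_batches is an illustrative helper built on itertools.islice:

```python
from itertools import islice


def fake_corpus_reader(n_sentences):
    """Hypothetical stand-in for the corpus reader: yields one sentence at a time."""
    for i in range(n_sentences):
        yield [("word%d" % i, "NN", "O")]


def iter_batches(generator, batch_size):
    """Yield lists of at most batch_size items until the generator is exhausted."""
    while True:
        batch = list(islice(generator, batch_size))
        if not batch:
            return
        yield batch


reader = fake_corpus_reader(7)
batch_sizes = [len(batch) for batch in iter_batches(reader, 3)]
print(batch_sizes)  # [3, 3, 1]
```

Only one batch is materialized as a list at any time; the rest of the corpus stays on disk until the generator is asked for more.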

Better features

The feature detector created in the previous article wasn’t at all bad. In fact, it includes the most popular features and they have been adapted to achieve better performance. We’re going to make a few adjustments. One of the most important features in the task of Named-Entity-Recognition is the shape of the word. We’re going to create a function that describes particular word forms. You should experiment with this function and see if you get better results. Here’s my function:
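
A minimal sketch of a word-shape function along these lines is given below. The category names and regular expressions are illustrative assumptions, and they are meant as a starting point for experimentation:

```python
import re


def shape(word):
    """Map a token to a coarse shape category (categories are illustrative)."""
    if re.fullmatch(r"[0-9]+([.,][0-9]+)*", word):
        return "number"
    if re.fullmatch(r"\W+", word):
        return "punct"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "capitalized"
    if re.fullmatch(r"[A-Z]+", word):
        return "uppercase"
    if re.fullmatch(r"[a-z]+", word):
        return "lowercase"
    if re.fullmatch(r"[A-Za-z]+", word):
        return "mixedcase"
    if re.fullmatch(r"[A-Za-z0-9]+-[A-Za-z0-9-]+", word):
        return "contains-hyphen"
    return "other"


print(shape("London"))  # capitalized
print(shape("NATO"))    # uppercase
print(shape("1,000"))   # number
```

The order of the checks matters: a token is assigned the first category it matches, so the more specific patterns come before the more general ones.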

Here’s the final feature extraction function (I also added one more IOB tag from history):
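
As a rough sketch of what such a feature extractor could look like, the function below describes one token using its neighbours and the two previous IOB tags. The feature names, the padding tokens and the simple_shape helper are assumptions made for this example:

```python
def simple_shape(word):
    """Very coarse stand-in for a word-shape function (an assumption of this sketch)."""
    if word.isdigit():
        return "number"
    if word.isupper():
        return "uppercase"
    if word.istitle():
        return "capitalized"
    if word.islower():
        return "lowercase"
    return "other"


def ner_features(tokens, index, history):
    """
    tokens:  list of (word, pos) pairs for one sentence
    index:   position of the token to describe
    history: IOB tags already assigned to tokens[:index]
    """
    # Pad both ends so that neighbours always exist at sentence boundaries.
    start = [("__START2__", "__START2__"), ("__START1__", "__START1__")]
    end = [("__END1__", "__END1__"), ("__END2__", "__END2__")]
    padded = start + list(tokens) + end
    padded_history = ["__START2__", "__START1__"] + list(history)

    i = index + 2  # shift the index to account for the padding
    word, pos = padded[i]
    prev_word, prev_pos = padded[i - 1]
    next_word, next_pos = padded[i + 1]

    return {
        "word": word,
        "pos": pos,
        "shape": simple_shape(word),
        "prev-word": prev_word,
        "prev-pos": prev_pos,
        "next-word": next_word,
        "next-pos": next_pos,
        "prev-iob": padded_history[i - 1],       # tag of the previous token
        "prev-prev-iob": padded_history[i - 2],  # one more tag from history
    }


sentence = [("Mark", "NNP"), ("lives", "VBZ"), ("in", "IN"), ("London", "NNP")]
feats = ner_features(sentence, 3, ["B-per", "O", "O"])
```

Using the previous two IOB tags lets the classifier learn, for example, that an I- tag is only plausible right after a B- or I- tag of the same type.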

Learning in batches

After getting the corpus reading and the feature extraction out of the way, we can focus on the cool stuff: training the NE-chunker. The code is fairly simple, but let’s first state what we want to achieve:

  1. The training method should receive a generator. It should only slice batches from the generator, not load the whole data into memory.
  2. We’re going to train a Perceptron. It trains fast and gives good results in this case.
  3. Keep in mind that we will use the partial_fit method.
  4. Because we don’t show all the data at once, we have to give a list of all the classes up front.

Let’s build our NE-chunker:

This is how we train it:
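
A minimal sketch of both steps (building a batch trainer and running it) follows. It rests on several assumptions: the feature set is deliberately tiny, the training data is a made-up toy generator instead of the corpus reader, and FeatureHasher is my own choice of vectorizer because it needs no fitting pass over the data. The function names are hypothetical.

```python
from itertools import islice

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import Perceptron


def token_features(words, index):
    """Hypothetical, deliberately tiny feature set, encoded as 'name=value' strings."""
    word = words[index]
    prev_word = words[index - 1] if index > 0 else "__START__"
    return [
        "word=" + word.lower(),
        "is-title=%s" % word.istitle(),
        "prev-word=" + prev_word.lower(),
    ]


def to_dataset(sentences):
    """Flatten tagged sentences into per-token feature lists and labels."""
    X, y = [], []
    for sentence in sentences:
        words = [word for word, _ in sentence]
        for index, (_, iob) in enumerate(sentence):
            X.append(token_features(words, index))
            y.append(iob)
    return X, y


def train_in_batches(sentence_generator, all_classes, batch_size=500):
    """Train a Perceptron by slicing batches from a generator of tagged sentences."""
    hasher = FeatureHasher(n_features=2 ** 18, input_type="string")
    clf = Perceptron(random_state=0)
    while True:
        # Only batch_size sentences are held in memory at any time.
        batch = list(islice(sentence_generator, batch_size))
        if not batch:
            break
        X, y = to_dataset(batch)
        # The full class list is required because a batch may miss some tags.
        clf.partial_fit(hasher.transform(X), y, classes=all_classes)
    return hasher, clf


def toy_sentences():
    """Made-up stand-in for the corpus reader."""
    data = [
        [("Mark", "B-per"), ("lives", "O"), ("in", "O"), ("London", "B-geo")],
        [("Anna", "B-per"), ("visited", "O"), ("Paris", "B-geo")],
    ]
    for _ in range(5):
        for sentence in data:
            yield sentence


all_classes = ["B-geo", "B-per", "O"]
hasher, clf = train_in_batches(toy_sentences(), all_classes, batch_size=4)

test_X, _ = to_dataset([[("Mark", "O"), ("visited", "O"), ("London", "O")]])
predicted = clf.predict(hasher.transform(test_X))
```

The toy run only demonstrates the mechanics of batch training; the accuracy figures discussed in this article come from training on the actual corpus.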

We’ve achieved a whopping 4 percentage point boost in accuracy (from around 93% to 97%). That’s huge at this level, where every percentage point counts. Congrats, you just trained an NE-chunker with 97% accuracy.


Conclusions

  1. The more data you use, the better.
  2. Keeping things on disk, rather than in RAM, helps us train on larger datasets.
  3. Scikit-Learn includes models that can be incrementally trained.
  4. Using a fancier classifier isn’t always the best solution.

If this was too abrupt for you, check out the Complete guide to training a NER System (Named-Entity-Recognition).