Complete guide for training your own Part-Of-Speech Tagger

Part-Of-Speech tagging (or POS tagging, for short) is one of the main components of almost any NLP analysis. The task of POS tagging simply means labelling each word with its appropriate part of speech (Noun, Verb, Adjective, Adverb, Pronoun, …).

Penn Treebank Tags

The most popular tag set is the Penn Treebank tag set. Most of the pre-trained taggers for English are trained on it. Examples of such taggers are:

  • NLTK default tagger
  • Stanford CoreNLP tagger

What is POS tagging

Here’s a simple example:
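As an illustration, the snippet below writes out one sentence with its tags by hand. It is not the output of any tagger; the sentence and variable name are made up for this sketch, and the tags follow the Penn Treebank conventions.

```python
# A hand-annotated sentence using Penn Treebank tags.
# This is illustrative data, not the output of a real tagger.
tagged_sentence = [
    ("I", "PRP"),          # personal pronoun
    ("am", "VBP"),         # verb, non-3rd-person singular present
    ("learning", "VBG"),   # verb, gerund or present participle
    ("NLP", "NNP"),        # proper noun, singular
]

for word, tag in tagged_sentence:
    print(f"{word}\t{tag}")
```

A tagger's job is to produce the second element of each pair when given only the first.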

POS tagging tools in NLTK

There are some simple tools available in NLTK for building your own POS tagger. You can read the documentation here: NLTK Documentation Chapter 5, section 4: “Automatic Tagging”. You can build simple taggers such as:

  • DefaultTagger that simply tags everything with the same tag
  • RegexpTagger that applies tags according to a set of regular expressions
  • UnigramTagger that picks the most frequent tag for a known word
  • BigramTagger and TrigramTagger, which work like the UnigramTagger but also take some of the context into consideration
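The taggers listed above can be chained through the `backoff` argument, so that each one hands unknown cases to a simpler one. Here is a minimal sketch, assuming NLTK is installed; the two training sentences and the regular expressions are toy values invented for this example, far too small for real use.

```python
from nltk.tag import BigramTagger, DefaultTagger, RegexpTagger, UnigramTagger

# Toy training data (made up for illustration): sentences of (word, tag) pairs.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# Last resort: tag everything as a noun.
default_tagger = DefaultTagger("NN")

# Guess from the word's shape before falling back to the default.
regexp_tagger = RegexpTagger(
    [(r".*ing$", "VBG"), (r".*ed$", "VBD"), (r"^\d+$", "CD")],
    backoff=default_tagger,
)

# Most frequent tag for known words, then tag bigram context on top of that.
unigram_tagger = UnigramTagger(train_sents, backoff=regexp_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

result = bigram_tagger.tag(["the", "dog", "walked"])
print(result)
```

Here "walked" never appears in the training data, so it falls through to the regular expression for the "-ed" ending.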

Picking a corpus to train the POS tagger

Resources for building POS taggers are pretty scarce, simply because annotating a huge amount of text is a very tedious task. One resource that is within our reach and that uses our preferred tag set can be found inside NLTK.
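A minimal sketch of loading such a corpus follows, assuming the resource in question is NLTK's Penn Treebank sample (`nltk.corpus.treebank`), which has to be downloaded once with `nltk.download("treebank")`. The function name, the fallback sentences and the 75/25 split are choices made for this example.

```python
def load_tagged_sentences():
    """Return a list of sentences, each a list of (word, tag) pairs."""
    try:
        from nltk.corpus import treebank
        return [list(sent) for sent in treebank.tagged_sents()]
    except (ImportError, LookupError):
        # NLTK or the corpus is not available: fall back to a tiny
        # hand-written sample so the rest of the code can still run.
        return [
            [("The", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")],
            [("A", "DT"), ("cat", "NN"), ("walked", "VBD"), ("home", "NN"), (".", ".")],
            [("She", "PRP"), ("is", "VBZ"), ("walking", "VBG"), (".", ".")],
            [("They", "PRP"), ("jumped", "VBD"), (".", ".")],
        ]


sentences = load_tagged_sentences()

# Hold out the last quarter of the sentences for evaluation.
cutoff = int(0.75 * len(sentences))
training_sentences = sentences[:cutoff]
test_sentences = sentences[cutoff:]

print(f"{len(training_sentences)} training and {len(test_sentences)} test sentences")
print(sentences[0])
```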

Training our own POS Tagger using scikit-learn

Before training a classifier, we must first decide which features to use. The most obvious choices are the word itself, the word before and the word after. That’s a good start, but we can do much better. For example, the 2-letter suffix is a great indicator of past-tense verbs ending in “-ed”, and the 3-letter suffix helps recognize present participles ending in “-ing”.
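One possible feature extractor along these lines is sketched below. It covers the word, its neighbours and the two suffixes mentioned above; the remaining features (position, capitalization, digits, hyphens) are extra assumptions added for this example, and the feature names are arbitrary.

```python
def features(sentence, index):
    """Describe the word at `index` in `sentence` (a list of word strings)."""
    word = sentence[index]
    return {
        "word": word,
        "is_first": index == 0,
        "is_last": index == len(sentence) - 1,
        "is_capitalized": word[:1].isupper(),
        "is_numeric": word.isdigit(),
        "has_hyphen": "-" in word,
        "suffix-2": word[-2:],   # catches "-ed"
        "suffix-3": word[-3:],   # catches "-ing"
        "prev_word": "" if index == 0 else sentence[index - 1],
        "next_word": "" if index == len(sentence) - 1 else sentence[index + 1],
    }


example = features(["She", "walked", "home"], 1)
print(example)
```

Returning a plain dictionary keeps the extractor independent of any particular classifier; string and boolean values can be turned into numeric vectors later.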

We also need a small helper function that strips the tags from our tagged corpus, so that we can feed the bare words to our classifier:
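Such a helper can be as short as a list comprehension. The version below is a sketch that assumes each sentence is a list of (word, tag) pairs; NLTK ships a similar helper as `nltk.tag.untag`.

```python
def untag(tagged_sentence):
    """Drop the tags and keep only the words of a tagged sentence."""
    return [word for word, _tag in tagged_sentence]


words = untag([("the", "DT"), ("dog", "NN"), ("barks", "VBZ")])
print(words)
```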

Let’s now build our training set. Our classifier should accept features for a single word, but our corpus is composed of sentences. We’ll need to do some transformations:
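The transformation can be sketched as a function that flattens the sentences into one feature dictionary and one tag per word. To keep the snippet self-contained, the feature extractor is passed in as a parameter, and `word_only` is a hypothetical stand-in for a richer one.

```python
def transform_to_dataset(tagged_sentences, feature_fn):
    """Flatten tagged sentences into parallel lists of features and tags."""
    X, y = [], []
    for tagged in tagged_sentences:
        words = [word for word, _ in tagged]
        for index, (_, tag) in enumerate(tagged):
            X.append(feature_fn(words, index))
            y.append(tag)
    return X, y


def word_only(sentence, index):
    # Hypothetical minimal feature extractor, used only for this demo.
    return {"word": sentence[index]}


corpus = [
    [("the", "DT"), ("dog", "NN")],
    [("it", "PRP"), ("barks", "VBZ")],
]
X, y = transform_to_dataset(corpus, word_only)
print(X)
print(y)
```

Each entry of `X` lines up with the entry of `y` at the same position, which is the shape scikit-learn estimators expect.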

We’re now ready to train the classifier. I’ve opted for a DecisionTreeClassifier. Feel free to play with others:
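Here is a minimal training sketch, assuming scikit-learn is installed. The `DecisionTreeClassifier` comes from the text above; wrapping it in a `Pipeline` with a `DictVectorizer` is an assumption of this example, as are the reduced feature set and the toy training sentences.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def features(sentence, index):
    # Reduced feature set, for illustration only.
    word = sentence[index]
    return {
        "word": word.lower(),
        "suffix-2": word[-2:],
        "suffix-3": word[-3:],
        "prev_word": "" if index == 0 else sentence[index - 1].lower(),
    }


# Toy corpus (made up); a real run would use thousands of sentences.
train_sents = [
    [("The", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")],
    [("A", "DT"), ("cat", "NN"), ("walked", "VBD"), ("home", "NN"), (".", ".")],
    [("She", "PRP"), ("is", "VBZ"), ("walking", "VBG"), (".", ".")],
    [("They", "PRP"), ("jumped", "VBD"), (".", ".")],
]

X_train, y_train = [], []
for tagged in train_sents:
    words = [word for word, _ in tagged]
    for index, (_, tag) in enumerate(tagged):
        X_train.append(features(words, index))
        y_train.append(tag)

clf = Pipeline([
    # Turns each feature dict into a numeric vector (one-hot for strings).
    ("vectorizer", DictVectorizer(sparse=False)),
    ("classifier", DecisionTreeClassifier(criterion="entropy", random_state=0)),
])
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)
print(f"Accuracy on the training data: {train_accuracy:.2f}")
```

Accuracy on the training data only shows that the tree has memorized its input; a meaningful figure requires scoring on held-out sentences.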

Let’s tag!

We can now use our classifier like this:
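Tagging a new sentence then comes down to extracting features for each word and asking the classifier for a prediction. In the sketch below, the `pos_tag` name, the two-feature extractor and the five-word training set are assumptions made to keep the example self-contained.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def features(sentence, index):
    # Deliberately tiny feature set for this demo.
    word = sentence[index]
    return {"word": word.lower(), "suffix-2": word[-2:]}


def pos_tag(sentence, clf, feature_fn=features):
    """Tag a list of words with a fitted classifier."""
    tags = clf.predict([feature_fn(sentence, i) for i in range(len(sentence))])
    return [(word, str(tag)) for word, tag in zip(sentence, tags)]


# Toy training data (made up): single words with their tags.
train = [("the", "DT"), ("dog", "NN"), ("cat", "NN"),
         ("barked", "VBD"), ("walked", "VBD")]
X = [features([word], 0) for word, _ in train]
y = [tag for _, tag in train]

clf = Pipeline([
    ("vectorizer", DictVectorizer(sparse=False)),
    ("classifier", DecisionTreeClassifier(criterion="entropy", random_state=0)),
]).fit(X, y)

tagged = pos_tag(["The", "dog", "barked"], clf)
print(tagged)
```

Features that were never seen during training are silently ignored by the `DictVectorizer`, so unknown words are tagged from whatever other features they share with the training data.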

Conclusions

  • Training your own POS tagger is not that hard
  • All the resources you need are right there
  • Hopefully this article sheds some light on a subject that can sometimes be considered extremely tedious and “esoteric”