Complete guide for training your own Part-Of-Speech Tagger
Part-Of-Speech tagging (or POS tagging, for short) is one of the main components of almost any NLP analysis. The task of POS-tagging simply implies labelling words with their appropriate Part-Of-Speech (Noun, Verb, Adjective, Adverb, Pronoun, …).
Penn Treebank Tags
The most popular tag set is the Penn Treebank tagset. Most pre-trained taggers for English are trained on this tag set. Examples of such taggers are:
- NLTK default tagger
- Stanford CoreNLP tagger
What is POS tagging
Here’s a simple example:
from nltk import word_tokenize, pos_tag
print(pos_tag(word_tokenize("I'm learning NLP")))
# [('I', 'PRP'), ("'m", 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP')]
POS tagging tools in NLTK
There are some simple tools available in NLTK for building your own POS-tagger. You can read the documentation here: NLTK Documentation, Chapter 5, Section 4: “Automatic Tagging”. You can build simple taggers such as:
- DefaultTagger, which simply tags everything with the same tag
- RegexpTagger, which applies tags according to a set of regular expressions
- UnigramTagger, which picks the most frequent tag for a known word
- TrigramTagger, which works similarly to the UnigramTagger but also takes some of the context into consideration
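As a quick taste of these tools, here's a minimal sketch combining a RegexpTagger with a DefaultTagger via backoff (the patterns below are illustrative, not an exhaustive rule set):

```python
from nltk.tag import DefaultTagger, RegexpTagger

# The RegexpTagger tries each pattern in order; words matching none of them
# fall back to the DefaultTagger, which tags everything as 'NN'.
patterns = [
    (r'.*ing$', 'VBG'),   # gerunds / present participles
    (r'.*ed$', 'VBD'),    # simple past
    (r'^\d+$', 'CD'),     # cardinal numbers
]
tagger = RegexpTagger(patterns, backoff=DefaultTagger('NN'))

print(tagger.tag(['running', '42', 'dog']))
# [('running', 'VBG'), ('42', 'CD'), ('dog', 'NN')]
```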
Picking a corpus to train the POS tagger
Resources for building POS taggers are pretty scarce, simply because annotating a huge amount of text is a very tedious task. One resource that is within our reach and that uses our preferred tag set can be found inside NLTK.
import nltk

tagged_sentences = nltk.corpus.treebank.tagged_sents()
print(tagged_sentences[0])
print("Tagged sentences:", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))
# [(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'), (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'), (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')]
# Tagged sentences: 3914
# Tagged words: 100676
Training our own POS Tagger using scikit-learn
Before training a classifier, we must first agree on which features to use. The most obvious choices are: the word itself, the word before and the word after. That’s a good start, but we can do much better. For example, the 2-letter suffix is a great indicator of past-tense verbs ending in “-ed”, and the 3-letter suffix helps recognize the present participle ending in “-ing”.
import pprint

def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }

pprint.pprint(features(['This', 'is', 'a', 'sentence'], 2))
A small helper function strips the tags from our tagged corpus so we can feed the raw words to our feature extractor:

def untag(tagged_sentence):
    return [w for w, t in tagged_sentence]
Let’s now build our training set. Our classifier should accept features for a single word, but our corpus is composed of sentences. We’ll need to do some transformations:
# Split the dataset for training and testing
cutoff = int(.75 * len(tagged_sentences))
training_sentences = tagged_sentences[:cutoff]
test_sentences = tagged_sentences[cutoff:]

print(len(training_sentences))  # 2935
print(len(test_sentences))      # 979

def transform_to_dataset(tagged_sentences):
    X, y = [], []
    for tagged in tagged_sentences:
        for index in range(len(tagged)):
            X.append(features(untag(tagged), index))
            y.append(tagged[index][1])
    return X, y
X, y = transform_to_dataset(training_sentences)
We’re now ready to train the classifier. I’ve opted for a DecisionTreeClassifier; feel free to play with other classifiers:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier())
])

clf.fit(X[:10000], y[:10000])  # Use only the first 10K samples if you're running it multiple times. It takes a fair bit :)

print('Training completed')
X_test, y_test = transform_to_dataset(test_sentences)
print("Accuracy:", clf.score(X_test, y_test))
# Accuracy: 0.904186083882
# not bad at all :)
We can now use our classifier like this:
def pos_tag(sentence):
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    return list(zip(sentence, tags))

print(pos_tag(word_tokenize('This is my friend, John.')))
# [('This', u'DT'), ('is', u'VBZ'), ('my', u'JJ'), ('friend', u'NN'), (',', u','), ('John', u'NNP'), ('.', u'.')]
Conclusions
- Training your own POS tagger is not that hard
- All the resources you need are right there
- Hopefully this article sheds some light on a subject that can sometimes be considered extremely tedious and “esoteric”
Sir, I wanted to know about the part where clf.fit() is called. What are the values of X and y there? X and y seem uninitialized.
Hi Suraj, Good catch. Indeed, I missed this line: “X, y = transform_to_dataset(training_sentences)”.
I’ve updated the code, thanks again 🙂
This is great! Could you also give an example where instead of using scikit, you use pystruct instead? Thank you in advance!
Great idea! I haven’t played with pystruct yet but I’m definitely curious. I plan to write an article every week this year so I’m hoping you’ll come back when it’s ready. Thanks!
Yes, I definitely will! 🙂
Can you demonstrate trigram tagger with backoffs’ being bigram and unigram? That would be helpful!
I’d probably demonstrate that in an NLTK tutorial. It’s been done nevertheless in other resources: http://www.nltk.org/book/ch05.html <- "5.4 Combining Taggers" chapter
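For reference, here's a minimal sketch of the backoff chaining that chapter describes, using toy training data (in practice you'd train on something like nltk.corpus.treebank.tagged_sents()):

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

# Toy tagged sentences, purely for illustration
train_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

# Each tagger falls back to the next one when it has no evidence
# for a word in the current context
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
t3 = TrigramTagger(train_sents, backoff=t2)

print(t3.tag(['the', 'dog', 'sleeps']))
# [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')]
```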
I am an absolute beginner in programming.
Is there any example of how to POS-tag an unknown language from scratch?
Let’s say I already have tagged texts in that language, as well as its tag set.
Absolutely, in fact, you don’t even have to look inside this English corpus we are using. You can consider there’s an unknown language inside. Knowing particularities about the language helps in terms of feature engineering. Picking features that best describes the language can get you better performance. That being said, you don’t have to know the language yourself to train a POS tagger.
[…] an earlier post, we have trained a part-of-speech tagger. You can read it here: Training a Part-Of-Speech Tagger. We’re taking a similar approach for training our […]
[…] libraries like scikit-learn or TensorFlow. Here are some examples of training your own NLP models: Training a POS Tagger with NLTK and scikit-learn and Train a NER System. NLTK also provides some interfaces to external tools like the […]
[…] the leap towards multiclass. Examples of multiclass problems we might encounter in NLP include: Part Of Speech Tagging and Named Entity Extraction. Let’s repeat the process for creating a dataset, this time with […]
Hi! thanks for the good article, it was very helpful!
I’m trying to build my own pos_tagger which only labels whether a given word is a firm’s name or not.
I tried using Stanford NER tagger since it offers ‘organization’ tags.
However, I found this tagger does not exactly fit my intention.
So, I’m trying to train my own tagger based on the fixed result from Stanford NER tagger.
My question is: “Is there any better or more efficient way to build a tagger that has only one label (firm name: yes or no) that you would like to recommend?”
Or do you have any suggestion for building such tagger?
Is this what you’re looking for: https://nlpforhackers.io/named-entity-extraction/ ?
I’m building a pos tagger for the Sinhala language which is kinda unique cause, comparison of English and Sinhala words is kinda of hard. What way do you suggest?
Sorry, I didn’t understand what’s the exact problem. Also, I’m not at all familiar with the Sinhala language. Do you have an annotated corpus? If the words can be deterministically segmented and tagged then you have a sequence tagging problem.
Can you give an example of a tagged sentence?
Is there any unsupervised method for pos tagging in other languages(ps: languages that have no any implementations done regarding nlp)
If there are, I’m not familiar with them 🙁
How to use a MaxEnt classifier within the pipeline?
MaxEnt is another way of saying LogisticRegression. Just replace the DecisionTreeClassifier with a LogisticRegression in the pipeline.
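Concretely, that swap would look like this (same pipeline as in the tutorial, with the classifier step replaced):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Identical to the tutorial's pipeline, except LogisticRegression (MaxEnt)
# takes the place of the DecisionTreeClassifier
clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', LogisticRegression())
])
# clf.fit(X, y) and clf.score(X_test, y_test) work exactly as before
```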
It is a very helpful article, what should I do if I want to make a pos tagger in some other language.
First thing would be to find a corpus for that language. Second would be to check if there’s a stemmer for that language(try NLTK) and third change the function that’s reading the corpus to accommodate the format. What language are we talking about?
I’m intended to create twitter tagger, any suggestions, tips, or pieces of advice.
There is a Twitter POS tagged corpus: https://github.com/ikekonglp/TweeboParser/tree/master/Tweebank/Raw_Data
Follow the POS tagger tutorial: https://nlpforhackers.io/training-pos-tagger/
Many thanks for this post, it’s very helpful.
Could you show me how to save the trained model to disk? The training takes a lot of time; if I can save the model to disk, it will save a lot of time when I use it next time.
You mean save the model to disk?
Yes, I tried to change
It seems to never end 😀
Will this change increase the accuracy?
Yes, I mean how to save the training model to disk.
Go through this
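A minimal sketch of saving and reloading the fitted pipeline with the standard-library pickle module (the tiny training set here is a stand-in; in the tutorial, X and y come from transform_to_dataset()):

```python
import pickle
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in training data, just to have a fitted model to serialize
X = [{'word': 'the'}, {'word': 'dog'}, {'word': 'runs'}]
y = ['DT', 'NN', 'VBZ']

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier())
])
clf.fit(X, y)

# Serialize the fitted pipeline to disk...
with open('pos_tagger.pkl', 'wb') as f:
    pickle.dump(clf, f)

# ...and load it back later, skipping the expensive retraining
with open('pos_tagger.pkl', 'rb') as f:
    loaded = pickle.load(f)

print(loaded.predict([{'word': 'dog'}]))
```

For larger scikit-learn models, joblib.dump/joblib.load is the commonly recommended alternative, but plain pickle works fine here.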
Question: why do you have the empty list tagged_sentence = [] in the pos_tag() function, when you don’t use it? Did you mean to assign the zipped sentence/tag list to it?
Thanks Earl! I think that’s precisely what happened 🙂
This is what I did, to get a list of lists from the zip object.
tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
tagged_sentence = list(map(list, zip(sentence, tags)))
Thanks so much for this article. It’s helped me get a little further along with my current project. NLP is fascinating to me. And I grateful for blog articles like this and all the work that’s gone before so it’s much easier for people like me.
Hey thank you it is very helpful.
I tried it with my own POS-tagged language data and got better results when changing sparse on DictVectorizer to True. How does that make the model predict better? What does sparse actually mean?
How significant was the performance boost?
Great tutorial. Thank you Bogdani!
I’ve prepared a corpus and tag set for Arabic tweet POST. I’m working on CRF and plan to incorporate word embedding (ara2vec ) also as feature to improve the accuracy; however, I found that CRF doesn’t accept real-valued embedding vectors. Any suggestions?
Use LSTMs, or if you’re going for something simpler, you can still average the vectors and feed that to a LogisticRegression classifier.
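A rough sketch of the averaging idea, with hypothetical random embeddings standing in for real ara2vec/word2vec vectors (the window_vector helper and the toy labels are illustrative assumptions, not from the article):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Hypothetical 50-d word embeddings; in practice, load these from ara2vec/word2vec
emb = {w: rng.randn(50) for w in ['the', 'dog', 'barks', 'cat', 'sleeps']}

def window_vector(words):
    """Average the embeddings of a context window into a single feature vector."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Toy examples: (context window, tag of the centre word)
X = np.array([window_vector(['the', 'dog']), window_vector(['the', 'cat']),
              window_vector(['dog', 'barks']), window_vector(['cat', 'sleeps'])])
y = ['NN', 'NN', 'VBZ', 'VBZ']

clf = LogisticRegression().fit(X, y)
print(clf.predict([window_vector(['the', 'dog'])]))
```

Unlike CRF features, the averaged real-valued vector plugs directly into any scikit-learn classifier.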
It is a great tutorial, But I have a question.
Currently, I am working on information extraction from receipts. For that, I have to perform sequence tagging on receipt text. I am afraid POS tagging would not be enough for my needs, because receipts contain customized words and many numbers. Can you give some advice on this problem?
What do you plan to tag the text with?