Stack Overflow is full of questions about why stemmers and lemmatizers don’t work as expected. The root cause of the confusion is that their role is often misunderstood. Here’s a comparison:
- Both stemmers and lemmatizers try to bring inflected words to the same form
- Stemmers use an algorithmic approach: they strip affixes (in practice, mostly suffixes) according to a set of rules. The result might not be an actual dictionary word.
- Lemmatizers look words up in a corpus or dictionary (NLTK’s WordNetLemmatizer uses WordNet). For words they know, the result is a dictionary word.
- Lemmatizers need extra info about the part of speech they are processing. “Calling” can be either a verb or a noun (the calling).
- Stemmers are faster than lemmatizers
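To make the part-of-speech point concrete, here is a minimal sketch of a lookup-based lemmatizer. The lexicon and the `toy_lemmatize` function are hypothetical and only hold a few words; a real lemmatizer consults a much larger resource and is more sophisticated than a single dictionary lookup.

```python
# Hypothetical toy lexicon: maps (inflected form, part of speech) to a lemma.
# A real lemmatizer relies on a far larger resource such as WordNet.
TOY_LEXICON = {
    ("calling", "v"): "call",
    ("calling", "n"): "calling",
    ("rabbits", "n"): "rabbit",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("calling", "v"))  # call
print(toy_lemmatize("calling", "n"))  # calling
print(toy_lemmatize("xyzing", "v"))   # xyzing (not in the lexicon)
```

The same surface form gives two different lemmas depending on the tag, which is why a wrong part-of-speech guess leads to a wrong lemma.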
When to use stemmers and when to use lemmatizers
There’s no definite right answer here, but here are a few guidelines:
- If speed is important, use stemmers (lemmatizers have to search through a corpus, while stemmers do simple operations on a string).
- If you just want to make sure that the system you are building is tolerant to inflections, use stemmers. (If you query for “best bar in New York”, you’d accept an article on “Best bars in New York 2016”.)
- If you need the actual dictionary word, use a lemmatizer (for example, if you are building a natural language generation system).
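The inflection-tolerance guideline can be sketched roughly like this. The `strip_plural` function is a hypothetical stand-in for a real stemmer (it only drops a trailing "s"), and `matches` is an illustrative helper, not part of any library; you could pass a real stemmer's `stem` method in its place.

```python
import re


def strip_plural(token):
    """Hypothetical stand-in for a real stemmer: drops one trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalized_tokens(text, stem=strip_plural):
    """Lowercase the text, split it into alphanumeric tokens, stem each one."""
    return {stem(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())}


def matches(query, title, stem=strip_plural):
    """True if every stemmed query token also occurs in the stemmed title."""
    return normalized_tokens(query, stem) <= normalized_tokens(title, stem)


print(matches("best bar in New York", "Best bars in New York 2016"))   # True
print(matches("best cafe in New York", "Best bars in New York 2016"))  # False
```

Because both sides go through the same normalization, it doesn't matter that the stems are not guaranteed to be dictionary words; they only have to agree with each other.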
How do stemmers work?
Stemmers are simple to use and very fast, so they are usually the preferred choice. They work by applying transformation rules to the word, one after another, until no rule applies any more.
from nltk.stem import SnowballStemmer

# Build a Snowball stemmer for English
snow = SnowballStemmer('english')
print(snow.stem('getting'))  # get
print(snow.stem('rabbits'))  # rabbit
print(snow.stem('xyzing'))   # xyze - it even works on non-words!
print(snow.stem('quickly'))  # quick
print(snow.stem('slowly'))   # slowli - not a dictionary word
You can see a stemmer in action in this article about Building an inverted index
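If you are curious what "applying rules until nothing changes" looks like, here is a minimal sketch. The rule list, the minimum stem length, and the `toy_stem` function are all made up for illustration; this is not the Snowball algorithm, which uses many more rules with extra conditions.

```python
# Hypothetical (suffix, replacement) rules, tried in order. Real stemmers such
# as Porter or Snowball use far larger rule sets with extra conditions.
RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ly", ""),
    ("s", ""),
]

MIN_STEM_LENGTH = 3  # assumption: never shorten a word below this length


def toy_stem(word):
    """Apply the first matching rule repeatedly until no rule applies."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix, replacement in RULES:
            stem = word[: len(word) - len(suffix)] + replacement
            if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
                word = stem
                changed = True
                break  # restart from the first rule on the new word
    return word


print(toy_stem("meetings"))  # meet (two passes: meetings -> meeting -> meet)
print(toy_stem("rabbits"))   # rabbit
print(toy_stem("getting"))   # gett - not a dictionary word
```

Note how "getting" ends up as "gett": the rules know nothing about the vocabulary, which is exactly why a stemmer's output is not guaranteed to be a real word.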
How do lemmatizers work?
As previously mentioned, lemmatizers need to know about the part of speech. This is a substantial disadvantage, since the task of part-of-speech tagging is prone to errors. Here’s how to properly use a lemmatizer:
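from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download('wordnet')
wnl = WordNetLemmatizer()
print(wnl.lemmatize('getting', 'v'))  # get
print(wnl.lemmatize('rabbits', 'n'))  # rabbit
# Unknown words come back unchanged. Passing '' as the part of speech raises
# a KeyError because '' is not a valid tag, not because the word is unknown.
print(wnl.lemmatize('xyzing', 'v'))   # xyzing
print(wnl.lemmatize('quickly', 'r'))  # quickly
print(wnl.lemmatize('slowly', 'r'))   # slowly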
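In practice the part of speech usually comes from a tagger, and English taggers such as `nltk.pos_tag` emit Penn Treebank tags (`VBG`, `NNS`, ...) by default, while the WordNet lemmatizer expects one of `'n'`, `'v'`, `'a'`, `'r'`. Here is a minimal sketch of a conversion helper; the name `penn_to_wordnet` is hypothetical, and falling back to noun is an assumption (it mirrors the lemmatizer's own default).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # assumption: treat everything else as a noun


print(penn_to_wordnet("VBG"))  # v
print(penn_to_wordnet("NNS"))  # n
print(penn_to_wordnet("RB"))   # r
```

You would then call something like `wnl.lemmatize(word, penn_to_wordnet(tag))` for each tagged token. Any tagging mistake flows straight into the lemma, which is the disadvantage mentioned above.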