Stemmers vs. Lemmatizers

Stack Overflow is full of questions about why stemmers and lemmatizers don’t work as expected. The confusion usually comes from misunderstanding what each tool is meant to do. Here’s a comparison:

* Both stemmers and lemmatizers try to reduce inflected forms of a word to one common form.
* Stemmers use an algorithmic approach of removing prefixes and suffixes. The result might not be an actual dictionary word.
* Lemmatizers look words up in a dictionary or corpus. The result is a dictionary word (the lemma), provided the word is known.
* Lemmatizers need to know the part of speech of the word they are processing. “Calling” can be either a verb or a noun (the calling).
* Stemmers are generally faster than lemmatizers.
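
To make the comparison concrete, here is a minimal sketch in plain Python. It is a toy, not how any real library works: the suffix list, the mini-dictionary and the function names `toy_stem` and `toy_lemmatize` are all made up for illustration.

```python
# Toy illustration only: real stemmers and lemmatizers are far more thorough.

SUFFIXES = ("ing", "ies", "es", "s", "ed")  # hypothetical rule list


def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("calling", "verb"): "call",
    ("calling", "noun"): "calling",
    ("studies", "verb"): "study",
    ("studies", "noun"): "study",
}


def toy_lemmatize(word, pos):
    """Look the word up; fall back to the lowercased input when unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


print(toy_stem("studies"))                # stud (not a dictionary word)
print(toy_lemmatize("studies", "noun"))   # study
print(toy_lemmatize("calling", "verb"))   # call
print(toy_lemmatize("calling", "noun"))   # calling
```

The stemmer only looks at the string, so it needs no extra input. The lemmatizer gives a different answer for “calling” depending on the part of speech it is told.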

When to use stemmers and when to use lemmatizers

There’s no single right answer, but here are a few guidelines:

  • If speed is important, use stemmers (lemmatizers have to search through a corpus while stemmers do simple operations on a string)
  • If you just want to make sure that the system you are building is tolerant of inflections, use stemmers (if you query for “best bar in New York”, you’d accept an article on “Best bars in New York 2016”)
  • If you need the actual dictionary word, use a lemmatizer (for example, if you are building a natural language generation system)
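
The search example above can be sketched in a few lines. This is only an illustration under simplifying assumptions: `strip_plural` is a deliberately crude stand-in for a real stemmer, and the whitespace tokenization and the `matches` helper are hypothetical.

```python
def strip_plural(word):
    """Crude stand-in for a stemmer (an assumption): drop one trailing 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stems(text):
    """Return the set of stems for the whitespace-separated tokens in text."""
    return {strip_plural(token) for token in text.split()}


def matches(query, document):
    """True when every stemmed query term also appears in the document."""
    return stems(query) <= stems(document)


title = "Best bars in New York 2016"
print(matches("best bar in New York", title))    # True: "bar" matches "bars"
print(matches("best hotel in New York", title))  # False: no "hotel" in title
```

Neither “bar” nor “bars” has to be a clean dictionary form here. All that matters is that both sides of the comparison are reduced in the same way.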

How do stemmers work

Stemmers are extremely simple to use and very fast. They are usually the preferred choice. They work by applying transformation rules to the word, one after another, until no further rule can be applied.
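
The “apply rules until nothing changes” idea can be sketched as follows. The rule table, the minimum stem length and the function names are assumptions chosen for this example. Real algorithms such as Porter’s use several staged rule sets with extra conditions.

```python
# Hypothetical (suffix, replacement) rules, tried in order.
# The ("ss", "ss") rule protects words like "caress" from losing an "s".
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ness", ""),
    ("ss", "ss"),
    ("ful", ""),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

MIN_STEM = 3  # never cut a word down to fewer than three characters


def apply_first_rule(word):
    """Apply the first rule whose suffix matches, or return the word as is."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem(word):
    """Keep applying rules until a pass leaves the word unchanged."""
    word = word.lower()
    while True:
        reduced = apply_first_rule(word)
        if reduced == word:
            return word
        word = reduced


print(stem("hopefulness"))  # hope   (hopefulness -> hopeful -> hope)
print(stem("walkings"))     # walk   (walkings -> walking -> walk)
print(stem("ponies"))       # poni   (not a dictionary word)
```

Everything here is a string operation on the word itself, which is why stemmers are fast and need no dictionary or part-of-speech information.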

You can see a stemmer in action in this article about Building an inverted index.

How do lemmatizers work

As previously mentioned, lemmatizers need to know the part of speech. This is a substantial disadvantage, since part-of-speech tagging is itself prone to errors. Here’s how to properly use a lemmatizer: