Roundup of Python NLP Libraries
This post collects the most important libraries in the Python NLP ecosystem. The list matters because Python is by far the most popular language for Natural Language Processing, and it is updated as new libraries appear. If you are looking for useful corpora, check out this NLP corpora list.
General Purpose
Name | Functionalities | Notes | URL |
---|---|---|---|
NLTK | tokenization, POS, NER, classification, sentiment analysis, access to corpora | Maybe the best known Python NLP Library. Not entirely suited for production environments but really good for getting started | GitHub |
spaCy | tokenization, POS, NER, classification, sentiment analysis, dependency parsing, word vectors | Efficient and performant NLP Library built with Cython for speed | GitHub |
Gensim | topic modelling, word vectors, access to corpora | Performant topic modelling library | GitHub |
Stanford NLP | tokenization, POS, NER, classification, word vectors | The famous Stanford CoreNLP Library | GitHub |
Flair | tokenization, POS, NER, dependency parsing | A very simple framework for state-of-the-art NLP | GitHub |
TextBlob | tokenization, POS, NER, classification, sentiment analysis, spellcheck, parsing | Pythonic library built upon NLTK and Pattern | GitHub |
Polyglot | tokenization, POS, NER, classification, sentiment analysis, spellcheck, parsing | Library focusing on multilingual NLP. Models available for most languages. | GitHub |
Pattern | tokenization, POS, NER, sentiment analysis, parsing | General purpose framework similar in purpose to NLTK | GitHub |
scikit-learn | classification | General purpose machine learning framework with text classification features | GitHub |
SkLearn CRF | sequence tagging | Sequence tagging classifiers following the scikit-learn API | GitHub |
Ambiverse NLU | NER, Concept Extraction | A Natural Language Understanding suite by Max Planck Institute for Informatics | GitHub |
Textacy | tokenization, POS, NER, sentiment analysis, parsing, corpora access, topic modelling, statistics | High level library built on top of spaCy | GitHub |
thinc | high-level deep learning models | spaCy’s deep learning infrastructure | GitHub |
NLPNet | POS, parsing, SRL | Neural models for POS tagging, dependency parsing, semantic role labelling | GitHub |
finetune | Classification, Entailment, Sequence Tagging | Scikit-learn style model finetuning for NLP | GitHub |
Spell Checking
Name | Functionalities | URL |
---|---|---|
JamSpell | spellcheck | GitHub |
PySpellchecker | spellcheck | GitHub |
PyEnchant | spellcheck | GitHub |
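
To make the "spellcheck" column concrete, here is a minimal sketch of the underlying idea: suggest the known word closest to a misspelled one. It uses only the standard library's `difflib` and does not reflect the API or the algorithms of JamSpell, PySpellchecker, or PyEnchant. The `VOCABULARY` list and the `suggest` and `correct` helpers are hypothetical names made up for this illustration.

```python
import difflib

# Hypothetical toy vocabulary; real spellcheckers ship large dictionaries.
VOCABULARY = ["language", "natural", "processing", "python", "library"]


def suggest(word, vocabulary=VOCABULARY, max_suggestions=3):
    """Return known words that closely match `word`, best match first."""
    lowered = word.lower()
    if lowered in vocabulary:
        return [lowered]
    # get_close_matches ranks candidates by similarity ratio and drops
    # anything below the cutoff.
    return difflib.get_close_matches(
        lowered, vocabulary, n=max_suggestions, cutoff=0.7
    )


def correct(text):
    """Replace each whitespace-separated token with its best suggestion."""
    corrected = []
    for token in text.split():
        candidates = suggest(token)
        # Keep the token unchanged when nothing in the vocabulary is close.
        corrected.append(candidates[0] if candidates else token)
    return " ".join(corrected)


if __name__ == "__main__":
    print(correct("natural langauge procesing"))
```

The dedicated libraries in the table are the better choice for real text, since a string-similarity lookup like this ignores word frequency and context.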
Based on PyTorch
Name | Functionalities | Notes | URL |
---|---|---|---|
PyText | built-in neural models | NLP framework built on top of PyTorch from Facebook Research | GitHub |
PyTorch-NLP | build neural models, corpora access | Simple high level framework built on top of PyTorch | GitHub |
torchtext | corpora access | Load text data for processing with PyTorch | GitHub |
AllenNLP | SRL, Question Answering, Entailment | State-of-the-art deep learning models on a wide variety of linguistic tasks | GitHub |
Visualizing Text
Name | Functionalities | Notes | URL |
---|---|---|---|
Scattertext | visualization | Perform visual exploratory text analysis | GitHub |
word_cloud | visualization | Draw word clouds | GitHub |
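
Word clouds typically size each word by how often it occurs, so the step behind the picture is a frequency count. The sketch below shows that counting step with the standard library only. It is not the word_cloud or Scattertext API, and the `STOP_WORDS` set and both helper functions are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word occurrences, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


def relative_sizes(frequencies, top_n=5):
    """Scale the top_n counts relative to the most common word."""
    most_common = frequencies.most_common(top_n)
    if not most_common:
        return {}
    highest = most_common[0][1]
    return {word: count / highest for word, count in most_common}


if __name__ == "__main__":
    counts = word_frequencies("The cat and the hat. The cat sat.")
    print(relative_sizes(counts))
```

A drawing library would then map each relative size to a font size and lay the words out on a canvas.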
Chatbots
Name | Functionalities | Notes | URL |
---|---|---|---|
SnipsNLU | NLU engine | Pretrained NLU models available | GitHub |
Rasa NLU | NLU engine | NLU engine that can use pretrained spaCy or MITIE models | GitHub |
DeepPavlov | NLU engine, dialog system | Open source conversational AI library | GitHub |
Miscellaneous
Name | Functionalities | Notes | URL |
---|---|---|---|
Splitta | sentence boundary detection | Statistical models for sentence boundary detection | GitHub |
chardet | encoding detection | Universal Character Encoding Detector | GitHub |
vocabulary | synonyms, dictionary | Dictionary as a module | GitHub |
langdetect | language detection | well … it detects the language a text is written in 🙂 | GitHub |
Nice article, as usual! You might want to add DeepPavlov (https://deeppavlov.ai/) to the chatbot section.
Also small typo, spaCy’s deep learning library is thinc (not think)
Hi Markos,
Indeed, my spellchecker couldn’t stand “thinc”. Will add DeepPavlov and also, thanks for your kind words.
Bogdan.
Thanks for the article Bogdan.
Any reason why no OpenNLP here?
OpenNLP is not part of the Python ecosystem.
This is a really resourceful article. Thanks!
Please do mention Spark NLP too in it.
https://github.com/JohnSnowLabs/spark-nlp