
Complete Guide to Word Embeddings


We talked briefly about word embeddings (also known as word vectors) in the spaCy tutorial.
spaCy ships word vectors with its models. This tutorial goes deeper: how word embeddings are computed and what they can be used for.

Bag Of Words Model
In most of our tutorials so far, we’ve been using a Bag-Of-Words model.
Take for example this article: Text Classification Recipe. Using the BOW model we only keep counts of the words from the vocabulary; we learn nothing about the words’ semantics.
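To make that concrete, here is a minimal Bag-Of-Words sketch using scikit-learn’s CountVectorizer (the toy corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus; the vectorizer builds its vocabulary from it
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix of word counts

# The learned vocabulary (columns are sorted alphabetically)
vocabulary = sorted(vectorizer.vocabulary_)
# One row of raw counts per document
rows = counts.toarray()
```

Notice that the two rows only record how often each vocabulary word appears; the model has no idea that “cat” and “dog” are semantically related, which is exactly the gap word embeddings fill.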


Part Of Speech tagging with CRF

Quick Recipe: Build a POS tagger using a Conditional Random Field

A while back I wrote a Complete guide for training your own Part-Of-Speech Tagger. If you are new to Part-Of-Speech Tagging (POS Tagging) make sure you follow that tutorial first. This article is more of an enhancement of the work done there.

What is a CRF?

A Conditional Random Field (CRF for short) is a discriminative sequence labelling model. It’s a fairly easy model to explain (compared to Hidden Markov Models). Basically, given:

  1. some feature extractors (feature extractors need to output real numbers)
  2. weights associated with the features (which are learned)
  3. previous labels

predict the current label.

You probably just realized that they seem totally appropriate for POS tagging. That’s true, and CRFs are also appropriate for other NLP tasks like Named Entity Recognition and chunking.
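The feature extractors in point 1 are the part you write yourself. Here is a hedged sketch of what word-level features for a CRF POS tagger might look like (the feature names and the example sentence are illustrative, not from any particular library):

```python
def word_features(sentence, i):
    """Extract features for the word at position i in a tokenized sentence.

    A real CRF implementation would turn these into indicator features
    and learn a weight for each one.
    """
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalized? (proper nouns)
        "word.isdigit": word.isdigit(),   # numeric token?
        "suffix3": word[-3:],             # suffixes often signal POS
        # context features: the neighbouring words
        "prev.word": sentence[i - 1].lower() if i > 0 else "<START>",
        "next.word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<END>",
    }

sent = ["John", "saw", "3", "dogs"]
feats = word_features(sent, 0)
```

During training, the CRF learns weights for combinations like “word is title-cased AND previous label is a determiner”, which is what makes it stronger than a per-word classifier.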


Complete Guide to spaCy

spaCy is a relatively new framework in the Python Natural Language Processing environment, but it is quickly gaining ground and will most likely become the de facto standard. There are some really good reasons for its popularity:

It's really FAST
Written in Cython, it was specifically designed to be as fast as possible
It's really ACCURATE
spaCy's dependency parser implementation is one of the best-performing in the world:
It Depends: Dependency Parser Comparison Using a Web-based Evaluation Tool

Batteries included
  • Index preserving tokenization (details about this later)
  • Models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
  • Supports 8 languages out of the box
  • Easy and beautiful visualizations
  • Pretrained word vectors
It plays nicely with the other tools you already know and love: Scikit-Learn, TensorFlow, gensim
Deep Learning Ready
It also has its own deep learning framework, specifically designed for NLP tasks.
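For a minimal taste of the API, assuming only that spaCy itself is installed: a blank pipeline avoids any model download, and the index-preserving tokenization mentioned above is visible through `token.idx` (the example sentence is made up):

```python
import spacy

# A blank pipeline: tokenizer only, no statistical models to download
nlp = spacy.blank("en")

doc = nlp("spaCy keeps the original character offsets.")

# Each token remembers exactly where it started in the raw string
tokens = [(token.text, token.idx) for token in doc]
```

With a full pretrained model (e.g. `en_core_web_sm`) the same `doc` object would also carry POS tags, entities, a dependency parse and word vectors.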



Complete Guide to Topic Modeling

What is Topic Modeling?

Topic modeling, in the context of Natural Language Processing, is described as a method of uncovering hidden structure in a collection of texts. Although that is indeed true, it is also a pretty useless definition. Let’s define topic modeling in more practical terms.
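In practice, “uncovering hidden structure” usually means fitting something like Latent Dirichlet Allocation. A tiny sketch with scikit-learn, on a made-up four-document corpus with two obvious themes (pets vs. politics); the number of topics is an assumption you supply:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat and the dog play in the garden",
    "dogs and cats are popular pets",
    "the parliament passed a new tax law",
    "the government debated the new tax law",
]

counts = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each document gets a probability distribution over the 2 topics
doc_topics = lda.transform(counts)
```

Each row of `doc_topics` sums to 1; inspecting `lda.components_` shows which words dominate each discovered topic.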

Quick Recipe: Building Word Clouds

What are Word Clouds?

Word Clouds are a popular way of displaying how important words are in a collection of texts. Basically, the more frequent a word is, the more space it occupies in the image. One use of Word Clouds is to help us get an intuition about what a collection of texts is about. Here are some classic examples of when Word Clouds can be useful:
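A dedicated library handles the drawing, but the underlying computation is just word frequencies. A sketch of that step in plain Python (the text and the tiny stop-word list are illustrative):

```python
from collections import Counter

text = "nlp is fun and nlp is useful and nlp is everywhere"
stop_words = {"is", "and", "the", "a"}  # a tiny illustrative stop list

# Drop stop words, then count what remains
words = [w for w in text.lower().split() if w not in stop_words]
frequencies = Counter(words)

# In a word cloud, each word's font size is proportional to its count
most_frequent = frequencies.most_common(1)
```

Feeding `frequencies` to a renderer (for example the `wordcloud` package’s `generate_from_frequencies`) produces the familiar image.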


TextRank for Text Summarization

The task of summarization is a classic one and has been studied from different perspectives. It consists of picking a subset of a text so that the information conveyed by the subset is as close to the original text as possible. The subset, named the summary, should be human-readable. The task is not about picking the most common words or entities. Think of it as a quick digest for a news article.
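TextRank does this extractively: build a graph whose nodes are sentences, weight the edges by word overlap, run PageRank on it, and keep the top-scoring sentences. A self-contained sketch (the similarity measure and damping factor are standard TextRank choices, and the three sentences are made up):

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity between two tokenized sentences."""
    w1, w2 = set(s1), set(s2)
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    # Normalize by sentence lengths so long sentences don't dominate
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, d=0.85, iterations=50):
    """Score sentences by running PageRank on the similarity graph."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokenized[i], tokenized[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) or 1.0 for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - d) / n
                  + d * sum(weights[j][i] / out_sums[j] * scores[j]
                            for j in range(n))
                  for i in range(n)]
    return scores

sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "Stock prices fell sharply today.",
]
scores = textrank(sentences)
best = sentences[max(range(len(sentences)), key=scores.__getitem__)]
```

The off-topic third sentence shares no words with the others, so it ends up with the lowest score; a summary keeps the top-k sentences in their original order.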

Language models

If you come from a statistical or machine learning background, you probably don’t need to be convinced that building language models is useful. If not, here’s what language models are and why they are useful.
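As a concrete starting point, here is a minimal bigram language model with maximum-likelihood estimates (no smoothing; the toy corpus is made up for illustration):

```python
from collections import Counter

corpus = "the cat sat . the cat ran . the dog sat .".split()

# Count adjacent pairs, and every token that starts a bigram
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def prob(word, prev):
    """P(word | prev) by maximum likelihood: count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]
```

Here “the” is followed by “cat” twice and by “dog” once, so P(cat | the) = 2/3. Real language models add smoothing so unseen bigrams don’t get probability zero.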

Natural Language Processing Corpora

One of the reasons why it’s so hard to learn, practice and experiment with Natural Language Processing is the lack of available corpora. Building a gold-standard corpus is seriously hard work. That’s why resources are either scarce or expensive. In this post, I’m going to aggregate some cool resources, some very well known, some a bit under the radar.

Introduction to NLTK

NLTK (Natural Language ToolKit) is the most popular Python framework for working with human language. There’s some controversy around whether NLTK is appropriate for production environments. Here’s my take on the matter:


Weighting words using Tf-Idf

If I ask you “Do you remember the article about electrons in the NY Times?” there’s a better chance you will remember it than if I asked “Do you remember the article about electrons in the physics books?”. Here’s why: an article about electrons is far less common in the NY Times than in a collection of physics books. You are less likely to stumble upon the concept of an electron in the NY Times than in a physics book.
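That intuition is exactly what tf-idf encodes: a term’s weight in a document grows with its frequency there and shrinks with the number of documents that contain it. A sketch with scikit-learn’s TfidfVectorizer (the toy corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the electron experiment surprised physicists",  # the unusual article
    "the economy grew and the markets rallied",
    "the election results came in late",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

vocab = vectorizer.vocabulary_
first_doc = tfidf.toarray()[0]
```

“electron” occurs in only one document while “the” occurs in all three, so in that first document “electron” ends up with the larger weight, even though both appear once there.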
