Complete Guide to Word Embeddings

Introduction

We talked briefly about word embeddings (also known as word vectors) in the spaCy tutorial.
spaCy ships word vectors with its models. This tutorial goes deeper into how to compute them and what they can be used for.

Bag Of Words Model
In most of our tutorials so far, we’ve been using a Bag-Of-Words model.
Take for example this article: Text Classification Recipe. Using the BOW model, we just keep counts of the words from the vocabulary. We don’t know anything about the words’ semantics.

Another drawback of the BOW model is that we work with very sparse vectors most of the time. We keep “slots” for words that only appeared once in the corpus (or very rarely). Also, we keep different slots for very similar words. You might say that this can be solved by using a stemmer. That’s true, we can make the feature space smaller using a stemmer but only to some extent. There are highly related words that don’t have the same stem.
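To make the sparsity point concrete, here’s a minimal sketch of a Bag-Of-Words representation in plain Python. The three-document toy corpus and the helper name `bow_vector` are made up for illustration; they are not taken from the Text Classification Recipe.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

# The vocabulary gets one "slot" per distinct word in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bow_vector(doc):
    """Count how often each vocabulary word appears in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [bow_vector(doc) for doc in docs]

# Fraction of slots that are zero across all document vectors.
total_slots = len(vectors) * len(vocabulary)
zero_slots = sum(value == 0 for vec in vectors for value in vec)
sparsity = zero_slots / total_slots

print(vocabulary)
print(vectors[0])
print(f"{sparsity:.0%} of the slots are empty")
```

Even with three short documents, most of the slots are zero. With a realistic vocabulary of tens of thousands of words the effect is much stronger.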

A simple way of computing word vectors is to apply a dimensionality reduction algorithm on the Document-Term matrix, like we did in the Topic Modeling Article. In that case, we only computed 2-dimensional vectors, but we can select a more realistic vector size.

Ok, so what are word vectors (or embeddings)? They are vectorized representations of words. In order for the representation to be useful (to mean something) some empirical conditions should be satisfied:

  • Similar words should be closer to each other in the vector space
  • Allow word analogies: "King" - "Man" + "Woman" ≈ "Queen"

In case the word analogy example looks a bit like black magic, bear with me, all will be uncovered. In fact, word analogies are so popular that they’re one of the most common ways to sanity-check whether the word embeddings have been computed correctly.
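As a first peek behind the curtain, here’s a minimal sketch of both conditions using hand-made 2-dimensional vectors. The numbers are invented so that the first dimension loosely means "royalty" and the second "gender"; real embeddings are learned from text and their dimensions are not this tidy. The helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made 2-D toy vectors: [royalty, gender]. Real embeddings are learned
# from text and have hundreds of dimensions; these values are invented.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Similar words are close: king vs. prince scores higher than king vs. woman.
print(cosine(vectors["king"], vectors["prince"]))
print(cosine(vectors["king"], vectors["woman"]))

# king - man + woman lands on queen in this toy space.
print(analogy(positive=["king", "woman"], negative=["man"]))
```

With learned vectors the result of the arithmetic is only approximately equal to the target word, which is why the nearest neighbour is looked up instead of testing for equality.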

If you are a bit familiar with Recurrent Neural Networks (RNNs), I must mention that word embeddings can also be derived via an embedding layer that is trained via backpropagation along with the rest of the network. This won’t be covered in this tutorial. Still, precomputed word embeddings can be very useful when working with neural nets: using already computed word vectors is called pretraining.

Word2Vec Algorithm

Word2Vec is probably the most popular algorithm for computing embeddings. It basically consists of a mini neural network that tries to learn a language model. Remember how we tried to generate text by probabilistically picking the next word? In its simplest form, the neural network can learn what the next word is after a given input word. Obviously, the results will be rather simplistic. We need more information about the context of a word in order to learn good embeddings.

CBOW vs Skip-Gram

CBOW (Continuous Bag-Of-Words) is about creating a network that tries to predict the word in the middle given some surrounding words: [W[-3], W[-2], W[-1], W[1], W[2], W[3]] => W[0]

Skip-Gram is the opposite of CBOW: it tries to predict the surrounding words given the word in the middle: W[0] => [W[-3], W[-2], W[-1], W[1], W[2], W[3]]
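The difference between the two setups is easiest to see in the training examples they produce. Here’s a minimal sketch in plain Python; the helper names (`context_windows`, `cbow_examples`, `skipgram_examples`) are hypothetical, and real implementations add further tricks that are not shown here.

```python
def context_windows(tokens, window=2):
    """Yield (context_words, center_word) for every position in `tokens`."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center


def cbow_examples(tokens, window=2):
    """CBOW: the whole context is the input, the center word is the target."""
    return list(context_windows(tokens, window))


def skipgram_examples(tokens, window=2):
    """Skip-Gram: the center word is the input, each context word a target."""
    return [
        (center, context_word)
        for context, center in context_windows(tokens, window)
        for context_word in context
    ]


sentence = "the quick brown fox jumps".split()

# One example per position: (context words, word to predict).
print(cbow_examples(sentence, window=1))

# One example per (center, neighbour) pair.
print(skipgram_examples(sentence, window=1))
```

Notice that Skip-Gram produces more, smaller training examples from the same sentence, while CBOW produces one example per word with the context bundled together.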

The weights the network learns (more precisely, the weights between the input and the hidden layer) are actually the word embeddings we were looking for. If you don’t have any neural network experience, don’t worry, it’s not needed for doing the practical exercises in this tutorial.

Word2Vec with Gensim

Gensim provides a quality implementation of the Word2Vec model. Let’s see it in action on the Brown Corpus:

Let’s now take the model for a spin:

If you’ve been coding along, you are probably pretty disappointed with the results. Let’s try a bigger corpus:

We opted to only use the most popular words so that it’s easier to make a visualization later. Let’s see how the new model performs (words and values will differ a bit):

Here’s how you should think about analogies:

Word embeddings linear relationships - analogies

From: https://www.tensorflow.org/images/linear-relationships.png

Visualizing Word2Vec Vectors with t-SNE
t-SNE (t-distributed stochastic neighbour embedding) is a popular algorithm in the deep learning crowd for displaying high-dimensional data in 2D/3D. Let’s see how we can use it to display our vectors.

Let’s have a look at the t-SNE word cloud for some neatly packed together words:

Cities Cluster in Word2Vec vectors t-SNE representation

Car Manufacturers Cluster in Word2Vec vectors t-SNE representation

Color Cluster in Word2Vec vectors t-SNE representation

Time Unit Cluster in Word2Vec vectors t-SNE representation

Country Cluster in Word2Vec vectors t-SNE representation

GloVe

GloVe (Global Vectors) is another method for deriving word vectors. It doesn’t have an implementation in the popular libraries we’re used to, but it should not be ignored. The algorithm is derived from algebraic methods (similar to matrix factorization), performs very well and is reported to converge faster than Word2Vec.

You can download pretrained vectors (on various corpora) from here: Stanford GloVe Website.
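Here’s a minimal sketch of reading such vectors by hand. It assumes the file is plain text with one word per line followed by its space-separated vector components, which is how the pretrained files are commonly distributed; check the file you actually downloaded. The function name `load_glove` is hypothetical, and an in-memory stand-in with invented numbers replaces the real file.

```python
import io


def load_glove(lines):
    """Parse GloVe-style text lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = [float(value) for value in values]
    return vectors


# Stand-in for an opened vectors file; the numbers are invented.
fake_file = io.StringIO(
    "the 0.418 0.24968 -0.41242\n"
    "cat 0.45281 -0.50108 -0.53714\n"
)

glove = load_glove(fake_file)
print(glove["cat"])
```

For a real download you would pass an opened file instead, for example `with open(path, encoding="utf-8") as f: glove = load_glove(f)`, where `path` points to wherever you saved the vectors.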

Many libraries offer support for GloVe vectors:

FastText with Python and Gensim

fastText is a library developed by Facebook that serves two main purposes:

  1. Learning of word vectors
  2. Text classification

If you are familiar with the other popular ways of learning word representations (Word2Vec and GloVe), fastText brings something innovative to the table. Rather than treating words as independent of one another (in the classic models, fast theoretically has nothing in common with faster or fastest), fastText takes into account all the character subsequences of a word (within a length range) when computing its representation. This means that:

  • Better vectors for rare words: If we don’t have enough context for a word because it appears very few times in the corpus, we might be able to use its related words’ context.
  • We can compute vectors even for unseen words: we combine the vectors of the word’s subsequences to get its vector.
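To get a feeling for those character subsequences, here’s a minimal sketch that extracts them in plain Python. The `<` and `>` boundary markers and the default length range of 3 to 6 are modeled on how the fastText authors describe the method, but treat them as assumptions of this sketch; `char_ngrams` is a hypothetical helper, not part of any library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with < and >."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


fast = set(char_ngrams("fast"))
faster = set(char_ngrams("faster"))

# All 3-grams of "where", including the boundary markers.
print(char_ngrams("where", min_n=3, max_n=3))

# Subsequences shared by "fast" and "faster".
print(sorted(fast & faster))
```

Because fast and faster share several subsequences, whatever the model learns for those pieces is shared by both words. The same mechanism is what lets an unseen word get a vector: its subsequences have usually been seen before.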

Installing fastText

Installing the Python wrapper for fastText is as easy as:

Unfortunately, the capabilities of the wrapper are pretty limited. You still need to work with on-disk text files rather than go about your normal Pythonesque way. Because of that, we’ll be using the gensim fastText implementation.

Create a fastText model

Building a fastText model with gensim is very similar to building a Word2Vec model. It does take more time though.

The most amazing feature of fastText is that it’s able to fill in the blanks and figure out word vectors for out-of-vocabulary words:

Visualizing fastText Vectors with t-SNE
fastText vectors are still numpy vectors. We can use the same code to create the word cloud.

t-SNE word cloud for fastText vectors

Religion Cloud in t-SNE word cloud for fastText vectors

What are Word Vectors good for?

Word vectors can be very useful. They open a bunch of new opportunities. Here are some of them:

  • Use them as the input of a neural network, usually boosting performance
  • Treat them as a form of automatic feature extraction
  • Establish synonyms
  • Dimensionality reduction
  • Create Document Vectors: either by averaging the word vectors or by using the Doc2Vec extension.
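The averaging approach from the last bullet can be sketched in a few lines. The 3-dimensional vectors below are invented for illustration (a trained model would supply the real ones), and `document_vector` is a hypothetical helper name.

```python
# Invented 3-D toy vectors; a trained model would supply real ones.
word_vectors = {
    "good": [0.9, 0.1, 0.0],
    "movie": [0.1, 0.8, 0.1],
    "great": [0.8, 0.2, 0.0],
    "film": [0.2, 0.7, 0.1],
}


def document_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


# Words without a vector ("a") are simply skipped.
print(document_vector("a good movie".split(), word_vectors))
print(document_vector("great film".split(), word_vectors))
```

Averaging throws away word order, so it works best as a quick baseline; Doc2Vec learns the document vectors directly instead.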