Complete Guide to Word Embeddings
Introduction
We talked briefly about word embeddings (also known as word vectors) in the spaCy tutorial.
spaCy includes word vectors in its models. This tutorial will go deeper into how word embeddings are computed and into their different applications.
Take, for example, this article: Text Classification Recipe. Using the BOW (bag-of-words) model we just keep counts of the words in the vocabulary; we don’t know anything about the words’ semantics.
Another drawback of the BOW model is that we work with very sparse vectors most of the time. We keep “slots” for words that appeared only once (or very rarely) in the corpus, and we keep different slots for very similar words. You might say that this can be solved by using a stemmer. That’s true: a stemmer can make the feature space smaller, but only to some extent, because there are highly related words that don’t share a stem.
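To make that concrete, here is a quick sketch (my own illustration, using NLTK's PorterStemmer): inflections of the same word collapse into one slot, but closely related words with different roots do not.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflections of the same word collapse to a single stem (a single BOW slot)
print(stemmer.stem('running'), stemmer.stem('runs'), stemmer.stem('run'))

# Highly related words with different roots keep separate stems (separate BOW slots)
print(stemmer.stem('go'), stemmer.stem('went'))
print(stemmer.stem('good'), stemmer.stem('better'), stemmer.stem('best'))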
A simple way of computing word vectors is to apply a dimensionality reduction algorithm to the Document-Term matrix, like we did in the Topic Modeling article. In that case we only computed 2-dimensional vectors; we can select a more realistic vector size.
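As a rough sketch of that idea (not the exact code from the Topic Modeling article; the toy corpus and the vector size are just for illustration), we can factorize the Document-Term matrix with scikit-learn's TruncatedSVD and read off one small dense vector per word:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make great pets",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)      # Document-Term matrix: (n_docs, n_words)

# Reduce the transposed (Term-Document) matrix so that every word gets a dense vector
svd = TruncatedSVD(n_components=2)
word_vectors = svd.fit_transform(dtm.T)     # shape: (n_words, 2)

# In newer scikit-learn versions this method is called get_feature_names_out()
for word, vector in zip(vectorizer.get_feature_names(), word_vectors):
    print(word, vector)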
Ok, so what are word vectors (or embeddings)? They are vector representations of words. For the representation to be useful (to mean something), some empirical conditions should be satisfied:
- Similar words should be closer to each other in the vector space
- Allow word analogies:
"King" - "Man" + "Woman" == "Queen"
In case the word analogy example looks a bit like black magic, bear with me; all will be uncovered. In fact, word analogies are so popular that they’re one of the best ways to check whether the word embeddings have been computed correctly.
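Under the hood, the analogy check is nothing more than vector arithmetic plus a nearest-neighbour lookup by cosine similarity. Here is a toy sketch with made-up 3-dimensional vectors (the numbers are purely illustrative, not real embeddings):

import numpy as np

# Made-up 3-dimensional "embeddings", for illustration only
vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'man':   np.array([0.5, 0.9, 0.1]),
    'woman': np.array([0.5, 0.1, 0.9]),
    'queen': np.array([0.9, 0.0, 0.9]),
    'apple': np.array([0.1, 0.2, 0.3]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "King" - "Man" + "Woman": the closest vector should be "Queen"
# (real implementations also exclude the query words from the candidates)
target = vectors['king'] - vectors['man'] + vectors['woman']
print(sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True))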
If you are a bit familiar with Recurrent Neural Networks (RNNs), I should mention that word embeddings can also be derived from an embedding layer trained via backpropagation along with the rest of the network. This won’t be covered in this tutorial. In fact, precomputed word embeddings can be very useful when working with neural nets: initializing a network with already computed word vectors is what we call using pretrained embeddings.
Word2Vec Algorithm
This is the most popular algorithm for computing embeddings. It basically consists of a mini neural network that tries to learn a language model. Remember how we tried to generate text by probabilistically picking the next word? In its simplest form, the neural network can learn which word follows a given input word. Obviously, the results will be rather simplistic; we need more information about the context of a word in order to learn good embeddings.
CBOW vs Skip-Gram
CBOW
(Continuous Bag-Of-Words) creates a network that tries to predict the word in the middle given some surrounding words: [W[-3], W[-2], W[-1], W[1], W[2], W[3]] => W[0]
Skip-Gram
is the opposite of CBOW: it tries to predict the surrounding words given the word in the middle: W[0] => [W[-3], W[-2], W[-1], W[1], W[2], W[3]]
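To make the two setups concrete, here is a small sketch (my own illustration, not part of the Word2Vec implementation itself) that prints the training examples each variant would see for a single sentence, using a window of 2:

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

for i, target in enumerate(sentence):
    # Context = the words within `window` positions on either side of the target
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # CBOW example: predict the middle word from its whole context
    print("CBOW:     ", context, "=>", target)
    # Skip-gram examples: predict each context word from the middle word
    for context_word in context:
        print("Skip-gram:", target, "=>", context_word)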
The computed network weights are actually the word embeddings we were looking for. If you don’t have any neural network experience, don’t worry, it’s not needed for doing the practical exercises in this tutorial.
Word2Vec with Gensim
Gensim provides a quality implementation of the Word2Vec model. Let’s see it in action on the Brown Corpus:
from nltk.corpus import brown
from gensim.models import Word2Vec

print(brown.sents())

w2v_model = Word2Vec(brown.sents(), size=128, window=5, min_count=3, workers=4)
Let’s now take the model for a spin:
# Getting the vector for a word
print(w2v_model.wv['Italy'], w2v_model.wv['France'])

# Getting most similar vectors
print(w2v_model.wv.most_similar('Paris'))
# [('Italy', 0.9691801071166992),
#  ('London', 0.9680684208869934),
#  ('Boston', 0.9672129154205322),
#  ('1949', 0.9662010669708252),
#  ('Rome', 0.9606728553771973),
#  ('Chicago', 0.9604784846305847),
#  ('Village', 0.9537506103515625),
#  ('questionnaire', 0.951071560382843),
#  ('Law', 0.9509029388427734),
#  ('College', 0.9507826566696167)]

# "King" - "Man" + "Woman" == "Queen"
print(w2v_model.wv.most_similar(positive=['woman', 'king'], negative=['man']))
print(w2v_model.wv.most_similar(positive=["Rome", "France"], negative=["Italy"]))
If you’ve been coding along, you’re probably pretty disappointed by the results. Let’s try a bigger corpus:
from gensim.models.word2vec import Text8Corpus

# Go here and download + unzip the Text8 Corpus: http://mattmahoney.net/dc/text8.zip
# We take only words that appear more than 150 times for doing a visualization later
w2v_model2 = Word2Vec(Text8Corpus('~/Downloads/text8'), size=100, window=5, min_count=150, workers=4)
We opted to keep only the most frequent words so that it’s easier to build a visualization later. Let’s see how the new model performs (your words and values may differ a bit):
# Getting most similar vectors
print(w2v_model2.wv.most_similar('paris'))
# [('louvre', 0.7243613004684448),
#  ('venice', 0.7047281265258789),
#  ('vienna', 0.7043783068656921),
#  ('montparnasse', 0.7016372680664062),
#  ('le', 0.6870340704917908),
#  ('sur', 0.6818796396255493),
#  ('chapelle', 0.6787714958190918),
#  ('rodin', 0.6766049265861511),
#  ('bologna', 0.6761612892150879),
#  ('munich', 0.6749240159988403)]

# "King" - "Man" + "Woman" == "Queen"
print(w2v_model2.most_similar(['woman', 'king'], ['man'], topn=3))
# [('queen', 0.6777610778808594), ('throne', 0.6143913269042969), ('elizabeth', 0.593910813331604)]

# "Father" - "Boy" + "Girl" == "Mother"
print(w2v_model2.most_similar(['girl', 'father'], ['boy'], topn=3))
# [('mother', 0.7972878813743591), ('wife', 0.7469687461853027), ('grandmother', 0.7419005632400513)]

# "Paris" - "France" + "Italy" = "Rome"
print(w2v_model2.most_similar(['paris', 'italy'], ['france'], topn=3))
# [('venice', 0.7461134195327759), ('vienna', 0.7134193778038025), ('florence', 0.7019181251525879)]
Here’s how you should think about analogies: word pairs that share the same relationship are separated by roughly the same vector offset (see the linear-relationships illustration from TensorFlow: https://www.tensorflow.org/images/linear-relationships.png).
To get a feel for the whole vector space, let’s project the vectors down to 2 dimensions with t-SNE and plot them with Bokeh:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet

output_notebook()

X = []
for word in w2v_model2.wv.vocab:
    X.append(w2v_model2.wv[word])

X = np.array(X)
print("Computed X: ", X.shape)

X_embedded = TSNE(n_components=2, n_iter=250, verbose=2).fit_transform(X)
print("Computed t-SNE", X_embedded.shape)

df = pd.DataFrame(columns=['x', 'y', 'word'])
# The words come out of the vocabulary in the same order we iterated it above
df['x'], df['y'], df['word'] = X_embedded[:, 0], X_embedded[:, 1], list(w2v_model2.wv.vocab)

source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')

plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
Let’s have a look at the t-SNE word cloud, where some related words end up neatly packed together:
Clusters visible in the Word2Vec vectors t-SNE representation: cities, car manufacturers, colors, time units, countries.
GloVe
GloVe (Global Vectors) is another method for deriving word vectors. It doesn’t have an implementation in the popular libraries we’re used to, but it should not be ignored. The algorithm is derived from algebraic methods (similar to matrix factorization), performs very well, and converges faster than Word2Vec.
You can download pretrained vectors (trained on various corpora) from the Stanford GloVe website: https://nlp.stanford.edu/projects/glove/
Many libraries offer support for pretrained GloVe vectors.
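Gensim, for instance, can convert a downloaded GloVe file into its word2vec text format and load it as regular KeyedVectors. A quick sketch (the file name assumes the glove.6B archive from the Stanford page; the conversion helper is the one shipped with gensim 3.x):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# The Stanford files lack the header line the word2vec text format expects, so convert first
glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.w2v.txt')

glove_vectors = KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt')
print(glove_vectors.most_similar('paris', topn=3))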
FastText with Python and Gensim
fastText is a library developed by Facebook that serves two main purposes:
- Learning of word vectors
- Text classification
If you are familiar with the other popular ways of learning word representations (Word2Vec and GloVe), fastText brings something innovative to the table. Rather than considering words to be independent of one another ("fast" theoretically has nothing in common with "faster" or "fastest" in the classic models), fastText takes into account all character subsequences (within a length range) when computing a representation for a word. A small sketch after the list below shows what these subsequences look like. This means that:
- Better vectors for rare words: If we don’t have enough context for a word because it appears very few times in the corpus, we might be able to use its related words’ context.
- We can compute vectors even for unseen words: Use the vectors of the subsequences to get the vector.
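Here is the small sketch promised above: a simplified illustration of the character n-grams fastText considers for a word (the real implementation also hashes the n-grams into buckets and keeps the whole word as an extra token), using the < and > boundary markers and a min_n/max_n range:

def char_ngrams(word, min_n=3, max_n=6):
    # fastText wraps the word in boundary markers before extracting n-grams
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams('fast'))
# 'fast' and 'fastest' now share subword units such as '<fa', 'fas' and 'fast'
print(set(char_ngrams('fast')) & set(char_ngrams('fastest')))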
Installing fastText
Installing the Python wrapper for fastText is as easy as:
pip install fasttext
Unfortunately, the capabilities of the wrapper are pretty limited: you still need to work with on-disk text files rather than going about things in your normal Pythonesque way. Because of that, we’ll be using the gensim fastText implementation.
Create a fastText model
Building a fastText model with gensim is very similar to building a Word2Vec model. It does take more time, though.
from gensim.models import FastText

ft_model = FastText(Text8Corpus('~/Downloads/text8'), size=100, window=5,
                    min_count=150, workers=4, min_n=3, max_n=10)

# Getting most similar vectors
print(ft_model.wv.most_similar('paris'))
# [('vienna', 0.7305958271026611),
#  ('venice', 0.7068097591400146),
#  ('florence', 0.6955196261405945),
#  ('brussels', 0.682724118232727),
#  ('leipzig', 0.6486526131629944),
#  ('francesco', 0.6461360454559326),
#  ('amsterdam', 0.6385960578918457),
#  ('france', 0.6323560476303101),
#  ('cemetery', 0.6285153031349182),
#  ('hamburg', 0.6284394264221191)]

# "King" - "Man" + "Woman" == "Queen"
print(ft_model.most_similar(['woman', 'king'], ['man'], topn=3))
# [('emperor', 0.68890380859375), ('queen', 0.6823415160179138), ('princess', 0.6764928102493286)]

# "Father" - "Boy" + "Girl" == "Mother"
print(ft_model.most_similar(['girl', 'father'], ['boy'], topn=3))
# [('mother', 0.7996115684509277), ('grandfather', 0.7629683613777161), ('wife', 0.7478234767913818)]

# "Paris" - "France" + "Italy" = "Rome"
print(ft_model.most_similar(['paris', 'italy'], ['france'], topn=3))
# [('vienna', 0.6932680606842041), ('venice', 0.652579128742218), ('moscow', 0.6098273992538452)]
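Since training takes noticeably longer than Word2Vec, it's worth persisting the model so you don't retrain it every time. A quick sketch using gensim's standard save/load methods (the file name is just an example):

# Save the trained model to disk
ft_model.save('text8_fasttext.model')

# ... later, possibly in another session ...
from gensim.models import FastText
ft_model = FastText.load('text8_fasttext.model')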
The most amazing feature of fastText is that it’s able to fill in the blanks and figure out word vectors for out-of-vocabulary words:
# Misspell something similar to Venice and we still get a vector ...
print(ft_model.wv['veniciaaaaaa'])
# [-6.31419778e-01 9.52705503e-01 1.35608479e-01 4.74076539e-01 ...

# Let's see if indeed it understood we're trying to say Venice
print(ft_model.most_similar('veniciaaaaaa', topn=3))
# [('venice', 0.7861752510070801), ('brussels', 0.771102786064148), ('francesco', 0.7474006414413452)]

# What?
print(ft_model.most_similar('whaaaa', topn=3))
# [('what', 0.8659393787384033), ('whatever', 0.7308462858200073), ('why', 0.6594464778900146)]
Let’s build the same t-SNE visualization for the fastText vectors:

X = []
for word in ft_model.wv.vocab:
    X.append(ft_model.wv[word])

X = np.array(X)
print("Computed X: ", X.shape)

X_embedded = TSNE(n_components=2, n_iter=250, verbose=2).fit_transform(X)
print("Computed t-SNE", X_embedded.shape)

df = pd.DataFrame(columns=['x', 'y', 'word'])
# Use the fastText model's vocabulary for the labels
df['x'], df['y'], df['word'] = X_embedded[:, 0], X_embedded[:, 1], list(ft_model.wv.vocab)

source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')

plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
Religion cluster in the t-SNE word cloud for the fastText vectors.
What are Word Vectors good for?
Word vectors can be very useful. They open a bunch of new opportunities. Here are some of them:
- Use them as the input of a neural network, usually boosting performance; this can be seen as automatic feature extraction
- Establish synonyms
- Dimensionality reduction
- Create document vectors: either by averaging the word vectors or by using the Doc2Vec extension (a minimal averaging sketch follows below)
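As a closing example, here is a minimal sketch of the averaging approach mentioned in the last bullet, reusing the text8 Word2Vec model trained earlier (out-of-vocabulary words are simply skipped; for anything serious you might prefer Doc2Vec or a TF-IDF weighted average):

import numpy as np

def document_vector(model, document):
    # Keep only the words the model has a vector for
    words = [word for word in document.lower().split() if word in model.wv.vocab]
    if not words:
        return np.zeros(model.wv.vector_size)
    # The document vector is simply the mean of its word vectors
    return np.mean([model.wv[word] for word in words], axis=0)

doc_vector = document_vector(w2v_model2, "the king and the queen visited paris")
print(doc_vector.shape)   # (100,)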