Complete Guide to spaCy

Updates

  • 29-Apr-2018 – Fixed import in extension code (Thanks Ruben)

spaCy is a relatively new framework in the Python Natural Language Processing environment, but it is quickly gaining ground and will most likely become the de facto standard library. There are some really good reasons for its popularity:

It's really FAST
Written in Cython, it was specifically designed to be as fast as possible
It's really ACCURATE
spaCy's dependency parser is one of the best-performing in the world, as shown in “It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool”.

Batteries included
  • Index preserving tokenization (details about this later)
  • Models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
  • Supports 8 languages out of the box
  • Easy and beautiful visualizations
  • Pretrained word vectors
Extensible
It plays nicely with the other tools you already know and love: Scikit-Learn, TensorFlow, gensim
Deep Learning Ready
It also has its own deep learning framework that’s especially designed for NLP tasks:
Thinc

Quickstart

spaCy is easy to install:
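Assuming a standard pip setup (a sketch; the exact command may differ for your environment):

```bash
pip install spacy
```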

Notice that the installation doesn’t automatically download the English model. We need to do that ourselves.
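The model can be fetched through spaCy's CLI. I'm assuming the small English model, `en_core_web_sm`, here:

```bash
python -m spacy download en_core_web_sm
```

With the model downloaded, tokenizing a text is just a matter of calling `nlp` on it. A minimal sketch (the example string and the printed output are illustrative):

```python
import spacy

# Load the English model we just downloaded
nlp = spacy.load('en_core_web_sm')

# Tokenize a string containing several consecutive spaces
doc = nlp("Hello     World!")
for token in doc:
    print('"{}"'.format(token.text))

# "Hello"
# "    "
# "World"
# "!"
```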

Notice the index preserving tokenization in action. Rather than only keeping the words, spaCy keeps the spaces too. This is helpful for situations when you need to replace words in the original text or add some annotations. With NLTK tokenization, there’s no way to know exactly where a tokenized word is in the original raw text. spaCy preserves this “link” between the word and its place in the raw text. Here’s how to get the exact index of a word:
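A sketch using `token.idx`, which holds each token's character offset into the raw text:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Hello     World!")
for token in doc:
    # idx is the character offset of the token in the original string
    print('"{}"'.format(token.text), token.idx)

# "Hello" 0
# "    " 6
# "World" 10
# "!" 15
```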

The Token class exposes a lot of word-level attributes. Here are a few examples:
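A sketch printing a handful of them (the example sentence is my own):

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Next week I'll be in Madrid.")
for token in doc:
    print(
        token.text,      # original token text
        token.idx,       # character offset
        token.lemma_,    # base form of the word
        token.is_punct,  # is it punctuation?
        token.is_space,  # is it whitespace?
        token.shape_,    # orthographic shape, e.g. 'Xxxx'
        token.pos_,      # coarse-grained part-of-speech
        token.tag_,      # fine-grained part-of-speech
    )
```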

The spaCy toolbox

Let’s now explore the models bundled with spaCy.

Sentence detection

Here’s how to achieve one of the most common NLP tasks with spaCy:
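A sketch; sentence boundaries are derived from the dependency parse, so the full model is needed:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("These are apples. These are oranges.")
for sent in doc.sents:
    print(sent)

# These are apples.
# These are oranges.
```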

Part Of Speech Tagging

We’ve already seen how this works but let’s have another look:
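A sketch; `tag_` holds the fine-grained tag, `pos_` the coarse one:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])

# With a recent English model, roughly:
# [('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'),
#  ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]
```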

Named Entity Recognition

Doing NER with spaCy is super easy and the pretrained model performs pretty well:
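A sketch; recognized entities live in `doc.ents`:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# Next week DATE
# Madrid GPE
```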

You can also view the IOB style tagging of the sentence like this:
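A sketch using the token-level `ent_iob_` and `ent_type_` attributes:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.ent_iob_, token.ent_type_) for token in doc])

# [('Next', 'B', 'DATE'), ('week', 'I', 'DATE'), ('I', 'O', ''), ("'ll", 'O', ''),
#  ('be', 'O', ''), ('in', 'O', ''), ('Madrid', 'B', 'GPE'), ('.', 'O', '')]
```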

The spaCy NER model also covers a healthy variety of entity types. You can view the full list here: Entity Types

Let’s use displaCy to view a beautiful visualization of the Named Entity annotated sentence:
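A sketch; `displacy.serve` starts a local web server (inside a Jupyter notebook you would use `displacy.render` instead):

```python
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Next week I'll be in Madrid.")
displacy.serve(doc, style='ent')
```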

displaCy Named Entity Visualization

Chunking

spaCy automatically detects noun-phrases as well:
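A sketch; noun phrases are exposed through `doc.noun_chunks`:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, '/', chunk.root.text)

# Wall Street Journal / Journal
# an interesting piece / piece
# crypto currencies / currencies
```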

Notice how the chunker also computes the root of each phrase, that is, its main word.

Dependency Parsing

This is what makes spaCy really stand out. Let’s see the dependency parser in action:
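A sketch that prints, for each token, its relation to its syntactic head:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for token in doc:
    # dep_ names the arc between the token and its head
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
```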

If this doesn’t help you visualize the dependency tree, displaCy comes in handy:
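A sketch, with the same local-server caveat as the entity visualizer:

```python
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
displacy.serve(doc, style='dep')
```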

displaCy dependency parse visualization

Word Vectors

spaCy comes shipped with a Word Vector model as well. We’ll need to download a larger model for that:
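I'm assuming the large English model here, since the small one doesn't ship with real word vectors:

```bash
python -m spacy download en_core_web_lg
```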

The vectors are attached to spaCy objects: Token, Lexeme (a sort of unattached token, part of the vocabulary), Span and Doc. Multi-token objects average the vectors of their constituent tokens.

Explaining word vectors (a.k.a. word embeddings) is not the purpose of this tutorial. Here are a few properties word vectors have:

  • If two words are similar, they appear in similar contexts
  • Word vectors are computed taking into account the context (surrounding words)
  • Given the two previous observations, similar words should have similar word vectors
  • Using vectors we can derive relationships between words

Let’s see how we can access the embedding of a word in spaCy:
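A sketch; every Token exposes its embedding through `token.vector`:

```python
import spacy

nlp = spacy.load('en_core_web_lg')
mango = nlp("mango")[0]
print(mango.vector.shape)  # (300,) for en_core_web_lg
print(mango.vector[:5])    # first few components of the embedding
```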

There’s a really famous example of word embedding math: "man" - "woman" + "queen" = "king". It sounds too crazy to be true, so let’s test it out:
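A sketch, assuming cosine similarity as the distance measure; iterating the whole vocabulary can take a while:

```python
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

def cosine(v1, v2):
    # Cosine similarity between two vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector

maybe_king = man - woman + queen

# Rank every lowercase, alphabetic vocabulary entry that has a vector
# by its similarity to the computed vector
similarities = []
for word in nlp.vocab:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarities.append((cosine(maybe_king, word.vector), word.text))

print(sorted(similarities, reverse=True)[:10])
```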

Surprisingly, the closest word vector in the vocabulary for “man” – “woman” + “queen” is still “Queen” but “King” comes right after. Maybe behind every King is a Queen?

Computing Similarity

Based on the word embeddings, spaCy offers a similarity interface for all of its building blocks: Token, Span, Doc and Lexeme. Here’s how to use that similarity interface:
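A sketch at the Token level:

```python
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("dog cat banana")
for t1 in doc:
    for t2 in doc:
        # similarity() is the cosine of the two word vectors
        print(t1.text, t2.text, t1.similarity(t2))
```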

Let’s now use this technique on entire texts:
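A sketch comparing a target sentence against a few candidates (the example sentences are my own):

```python
import spacy

nlp = spacy.load('en_core_web_lg')
target = nlp("Cats are beautiful animals.")

doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")

# Doc-level similarity works over the averaged token vectors
print(target.similarity(doc1))
print(target.similarity(doc2))
print(target.similarity(doc3))
```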

Extending spaCy

The entire spaCy architecture is built upon three building blocks: the Document (the big encompassing container), the Token (most of the time, a word) and the Span (a set of consecutive Tokens). The extensions you create can add extra functionality to any of these components. There are some examples out there of what you can do. Let’s create an extension ourselves.

Creating a Document-level Extension

One can easily create extensions for every component type, but such extensions only have access to the context of that component. What happens if you need the tokenized text along with the Part-Of-Speech tags? For that we’ll build a custom pipeline in the next section. Pipelines are another important spaCy abstraction: the nlp object runs the document through a list of pipeline components. For example, the tagger is run first, then the parser and ner components are applied to the already POS-annotated document; the default pipeline structure is printed after the extension sketch below. First, here’s a simple Document-level extension:
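A minimal sketch; the `is_question` attribute name and its logic are made up for illustration:

```python
import spacy
from spacy.tokens import Doc

# Hypothetical document-level property: does the text end in '?'
def get_is_question(doc):
    return doc.text.rstrip().endswith('?')

# Registered extensions become available under the `doc._.` namespace
Doc.set_extension('is_question', getter=get_is_question)

nlp = spacy.load('en_core_web_sm')
print(nlp("Is this a question?")._.is_question)   # True
print(nlp("This is a statement.")._.is_question)  # False
```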

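And here’s what the default nlp pipeline looks like (component names can vary between spaCy versions and models):

```python
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']
```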
Creating a custom pipeline

Let’s build a custom pipeline component that needs to run after the tagger component. We need the POS tags to look up the Synsets in WordNet.
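A sketch in the spaCy 2 style of function components, using NLTK’s WordNet interface; the `wordnet` component name, the `synsets` attribute and the POS mapping are my own choices:

```python
import spacy
from spacy.tokens import Token
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Map spaCy's coarse POS tags onto WordNet's POS constants
SPACY_TO_WN = {'NOUN': wn.NOUN, 'VERB': wn.VERB, 'ADJ': wn.ADJ, 'ADV': wn.ADV}

# Each token gets a custom attribute holding its synsets
Token.set_extension('synsets', default=None)

def wordnet_component(doc):
    for token in doc:
        wn_pos = SPACY_TO_WN.get(token.pos_)
        if wn_pos is not None:
            token._.synsets = wn.synsets(token.text, pos=wn_pos)
    return doc

nlp = spacy.load('en_core_web_sm')
# Insert the component right after the tagger, so POS tags are available
nlp.add_pipe(wordnet_component, name='wordnet', after='tagger')

doc = nlp("I just watched a movie.")
print(doc[2].text, doc[2]._.synsets[:3])
```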

Let’s see what the pipeline structure looks like now:
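Continuing from the previous snippet:

```python
print(nlp.pipe_names)  # e.g. ['tagger', 'wordnet', 'parser', 'ner']
```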

Conclusions

spaCy is a modern, reliable NLP framework that quickly became the standard for doing NLP with Python. Its main advantages are speed, accuracy and extensibility. It also comes shipped with useful assets like word embeddings, and it can act as the central part of your production NLP pipeline.