Quick Recipe: Building Word Clouds

What are Word Clouds?

Word Clouds are a popular way of displaying how important words are in a collection of texts. Basically, the more frequent a word is, the more space it occupies in the image. One use of Word Clouds is to help us get an intuition about what the collection of texts is about. Here are some classic cases where Word Clouds can be useful:

  • Take a quick peek at the word distribution of a collection of texts
  • Spot frequent stopwords you want to filter out while cleaning the texts
  • Compare the frequent words of two or more collections of texts
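To make the "more frequent means bigger" idea concrete, here is a minimal sketch in plain Python. It counts words and scales each count linearly into a font size. The function names, the sample sentence, and the linear scaling are assumptions made for illustration; the word_cloud library uses its own sizing and layout logic.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic tokens in a string."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def font_sizes(freqs, min_size=10, max_size=100):
    """Scale each word's count linearly into [min_size, max_size]."""
    top = max(freqs.values())
    return {
        word: round(min_size + (max_size - min_size) * count / top)
        for word, count in freqs.items()
    }


# Made-up sample sentence, used only to illustrate the scaling.
sample = "oil prices rose as oil demand rose and oil supply fell"
sizes = font_sizes(word_frequencies(sample))
```

In this toy input the most frequent word gets the largest size and words that occur once get the smallest of the sizes in use.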

Suppose you want to build a text classification system. To see how the frequent words differ across categories, you’d build a Word Cloud for each category and look at the most popular words in each one.
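The per-category comparison could be sketched roughly as follows. The two-category dataset is invented for illustration, and both helper names are hypothetical. The top_words helper is plain Python; clouds_by_category assumes the wordcloud package is installed and is not called here.

```python
from collections import Counter


def top_words(texts, n=3):
    """Return the n most frequent lowercase tokens across a list of texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(n)]


def clouds_by_category(docs_by_category):
    """Build one WordCloud per category (requires the wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily: third-party dependency

    return {
        category: WordCloud().generate(" ".join(texts))
        for category, texts in docs_by_category.items()
    }


# Hypothetical two-category dataset, invented for illustration only.
docs_by_category = {
    "finance": ["shares rose sharply", "bank shares fell", "bank profits rose"],
    "sports": ["team won the match", "the team lost", "match ended in a draw"],
}

top_by_category = {
    category: top_words(texts) for category, texts in docs_by_category.items()
}
```

Note that "the" ends up among the top sports words in this toy data, which is exactly the kind of stopword a quick look at the frequent words helps you catch.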

As you probably expected, there’s a Python library that makes building Word Clouds very easy: word_cloud (published on PyPI as wordcloud)

Build a simple Word Cloud

Let’s build a simple word cloud for the Reuters corpus:
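A minimal sketch of what this step might look like, assuming the nltk and wordcloud packages are installed. The helper names, image size, and output file name are illustrative choices, not taken from the original post.

```python
def corpus_text(words):
    """Join an iterable of tokens into one lowercase string for WordCloud."""
    return " ".join(word.lower() for word in words)


def reuters_word_cloud(output_path="reuters_cloud.png"):
    """Render a word cloud for the NLTK Reuters corpus and save it as a PNG."""
    # Third-party imports are kept inside the function so that corpus_text
    # can be used without them.
    import nltk
    from nltk.corpus import reuters
    from wordcloud import WordCloud

    nltk.download("reuters", quiet=True)  # fetch the corpus if it is missing
    cloud = WordCloud(width=800, height=400).generate(corpus_text(reuters.words()))
    cloud.to_file(output_path)
    return cloud
```

Calling reuters_word_cloud() downloads the corpus on first use and writes the image to the given path. By default WordCloud filters its built-in English stopword list, so common function words do not dominate the picture.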

[Figure: Reuters word cloud]

As you can see, the Reuters dataset seems to be composed of finance-related articles. Let’s do the same for nltk.corpus.brown:
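The same steps can be wrapped in a helper that works for any NLTK corpus exposing a words() method. This is a sketch under that assumption; the function names and the default file naming scheme are hypothetical.

```python
def default_output_path(corpus_name):
    """Derive an image file name from the corpus name."""
    return f"{corpus_name}_cloud.png"


def nltk_corpus_cloud(corpus_name="brown", output_path=None):
    """Render a word cloud for a named NLTK corpus and save it as a PNG."""
    import nltk
    from wordcloud import WordCloud

    nltk.download(corpus_name, quiet=True)  # fetch the corpus if it is missing
    corpus = getattr(nltk.corpus, corpus_name)  # e.g. nltk.corpus.brown
    text = " ".join(word.lower() for word in corpus.words())
    cloud = WordCloud(width=800, height=400).generate(text)
    cloud.to_file(output_path or default_output_path(corpus_name))
    return cloud
```

With the defaults, nltk_corpus_cloud() renders the Brown corpus and saves it as brown_cloud.png.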

[Figure: Brown word cloud]

As you can see, the Brown corpus is more general. The most frequent words seem to be the ones you’d expect.

Let’s check out a few more features:
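The original post does not say which features it used here, so the following is only a sketch of some commonly used WordCloud options: background color, a cap on the number of words, a maximum font size, and extra stopwords. All the specific values and the extra stopword list are assumptions.

```python
# Illustrative settings; every value here is an arbitrary choice.
CLOUD_OPTIONS = {
    "width": 800,
    "height": 400,
    "background_color": "white",  # background color of the image
    "max_words": 100,             # keep only the 100 most frequent words
    "max_font_size": 80,          # cap the size of the largest word
}


def styled_brown_cloud(extra_stopwords=("said", "would", "one"),
                       output_path="brown_styled_cloud.png"):
    """Render the Brown corpus with custom options and extra stopwords."""
    import nltk
    from nltk.corpus import brown
    from wordcloud import STOPWORDS, WordCloud

    nltk.download("brown", quiet=True)  # fetch the corpus if it is missing
    stopwords = STOPWORDS | set(extra_stopwords)  # built-in list plus our own
    text = " ".join(word.lower() for word in brown.words())
    cloud = WordCloud(stopwords=stopwords, **CLOUD_OPTIONS).generate(text)
    cloud.to_file(output_path)
    return cloud
```

Passing a custom stopwords set replaces the built-in list, which is why the sketch takes the union with STOPWORDS instead of passing only the extra words.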

[Figure: Another Brown word cloud]

Let’s now build something a bit more complex. Suppose we want to color the words in the cloud according to their part of speech. Here’s how we’d do that:
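One possible way to do this is sketched below. It assumes the tagged Brown corpus with NLTK's universal tagset, assigns each word the tag it carries most often, and passes a custom color function to WordCloud. The palette, the helper names, and the "most frequent tag wins" rule are illustrative assumptions, not necessarily what the original code did.

```python
from collections import Counter, defaultdict

# Illustrative palette keyed by universal POS tags; other tags fall back to gray.
POS_COLORS = {
    "NOUN": "royalblue",
    "VERB": "crimson",
    "ADJ": "seagreen",
    "ADV": "darkorange",
}
DEFAULT_COLOR = "gray"


def most_common_tags(tagged_words):
    """Map each lowercase word to the tag it carries most often."""
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word.lower()][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def make_pos_color_func(word_tags):
    """Build a color function with the call signature WordCloud uses."""
    def color_func(word, font_size=None, position=None, orientation=None,
                   random_state=None, **kwargs):
        return POS_COLORS.get(word_tags.get(word.lower()), DEFAULT_COLOR)
    return color_func


def pos_colored_brown_cloud(output_path="brown_pos_cloud.png"):
    """Render the Brown corpus with words colored by part of speech."""
    import nltk
    from nltk.corpus import brown
    from wordcloud import WordCloud

    nltk.download("brown", quiet=True)
    nltk.download("universal_tagset", quiet=True)
    tagged = brown.tagged_words(tagset="universal")
    word_tags = most_common_tags(tagged)
    text = " ".join(word for word, _ in tagged)
    # collocations=False keeps single words only, so every word in the cloud
    # can be looked up in word_tags.
    cloud = WordCloud(width=800, height=400, collocations=False,
                      color_func=make_pos_color_func(word_tags)).generate(text)
    cloud.to_file(output_path)
    return cloud
```

Words whose dominant tag is not in the palette, such as determiners or punctuation, are drawn in the default gray.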

[Figure: POS tagged word cloud]