Quick Recipe: Building Word Clouds
What are Word Clouds?
Word Clouds are a popular way of displaying how important words are in a collection of texts. Basically, the more frequent a word is, the more space it occupies in the image. Word Clouds can help us get an intuition about what a collection of texts is about. Here are some classic situations where Word Clouds are useful:
- Taking a quick peek at the word distribution of a collection of texts
- Spotting frequent stopwords you might want to filter out when cleaning the texts
- Comparing the frequent words of two or more collections of texts
Suppose you want to build a text classification system. To see which words are frequent in each category, you'd build a Word Cloud for each category and inspect its most popular words.
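For instance, the per-category frequencies you'd feed into each cloud can be computed with a plain Counter; the two tiny "category" corpora below are made up purely for illustration:

```python
from collections import Counter

# Toy example: two made-up "categories" of documents
categories = {
    'sports': ['the team won the match', 'the coach praised the team'],
    'finance': ['the bank raised the rates', 'the market fell again'],
}

# One word-frequency Counter per category
freqs = {
    cat: Counter(word for doc in docs for word in doc.split())
    for cat, docs in categories.items()
}

print(freqs['sports'].most_common(2))   # [('the', 4), ('team', 2)]
print(freqs['finance'].most_common(1))  # [('the', 3)]
```

Comparing the top entries of each Counter already hints at what each category is about, which is exactly what the per-category clouds visualize.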
As you probably expected, there's a Python library that makes building Word Clouds very easy: word_cloud
Build a simple Word Cloud
Let's build a simple word cloud for the reuters corpus:
```python
import matplotlib.pyplot as plt
from nltk.corpus import reuters
from wordcloud import WordCloud

wc = WordCloud().generate(' '.join(reuters.words()))

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
```
Reuters word cloud
As you can notice, the Reuters dataset seems to be composed of finance-related articles. Let's do the same for nltk.corpus.brown:
```python
import matplotlib.pyplot as plt
from nltk.corpus import brown
from wordcloud import WordCloud

wc = WordCloud().generate(' '.join(brown.words()))

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
```
Brown word cloud
As you can see, brown is more general. The most frequent words seem to be the ones you'd expect.
Let’s check out a few more features:
```python
import matplotlib.pyplot as plt
from nltk.corpus import brown, stopwords
from wordcloud import WordCloud

stop_words = stopwords.words('english')

# Set a max number of words, a list of stopwords and a max font size
wc = WordCloud(max_words=100, stopwords=stop_words,
               max_font_size=50).generate(' '.join(brown.words()))

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
```
Another Brown word cloud
Let’s now build something a bit more complex. Suppose we want to color the words in the cloud according to their part of speech. Here’s how we’d do that:
```python
import collections

import matplotlib.pyplot as plt
from nltk import pos_tag_sents
from nltk.corpus import brown
from wordcloud import WordCloud

# Tag the whole corpus and count how often each word gets each tag
tagged_sents = pos_tag_sents(brown.sents())

word_tag_counters = collections.defaultdict(collections.Counter)
for tagged_sent in tagged_sents:
    for w, t in tagged_sent:
        word_tag_counters[w.lower()][t] += 1


def get_color(word, **kwargs):
    # Get the most common tag for the given word
    try:
        most_common_tag = word_tag_counters[word].most_common(1)[0][0]
    except IndexError:
        # Word was never tagged
        return 'gray'

    if most_common_tag.startswith('NN'):      # Red for Nouns
        return 'red'
    elif most_common_tag.startswith('VB'):    # Blue for Verbs
        return 'blue'
    elif most_common_tag.startswith('JJ'):    # Orange for Adjectives
        return 'orange'
    elif most_common_tag.startswith('RB'):    # Green for Adverbs
        return 'green'

    # Gray for everything else
    return 'gray'


wc = WordCloud().generate(' '.join(brown.words()).lower())
wc.recolor(color_func=get_color)

plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```
POS tagged word cloud
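The tag-counting pattern above is easy to check in isolation. Here's the same defaultdict(Counter) idea on a couple of hand-tagged toy sentences (the tags are written out by hand rather than produced by pos_tag_sents):

```python
import collections

# Hand-tagged toy sentences standing in for pos_tag_sents output
tagged_sents = [
    [('The', 'DT'), ('run', 'NN'), ('was', 'VBD'), ('long', 'JJ')],
    [('They', 'PRP'), ('run', 'VBP'), ('fast', 'RB')],
    [('Another', 'DT'), ('run', 'NN')],
]

word_tag_counters = collections.defaultdict(collections.Counter)
for tagged_sent in tagged_sents:
    for w, t in tagged_sent:
        word_tag_counters[w.lower()][t] += 1

# 'run' appears twice as a noun and once as a verb, so NN wins
print(word_tag_counters['run'].most_common(1))  # [('NN', 2)]
```

Because each word keeps a full Counter of its tags, ambiguous words like "run" are colored by their dominant part of speech rather than by whichever tag happened to come last.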