Complete Guide to Topic Modeling
What is Topic Modeling?
Topic modeling, in the context of Natural Language Processing, is described as a method of uncovering hidden structure in a collection of texts. Although that is indeed true, it is also a pretty useless definition on its own. Let's define topic modeling in more practical terms.
Definitions:
- C: a collection of documents containing N texts
- V: the vocabulary (the set of unique words in the collection)

Given a text T, instead of representing it in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the text in its topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}. Notice that we're using Topics to denote the set of all topics.
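To make the two representations concrete, here is a toy illustration with made-up words, topics, and numbers (purely hypothetical, just to show the shape of the two dictionaries):

```python
# A short text T represented two ways (all values are made up for illustration)

# Feature space: {Word_i: count(Word_i, T) for Word_i in V}
feature_space = {"economy": 2, "market": 1, "stock": 1, "game": 0, "team": 0}

# Topic space: {Topic_i: weight(Topic_i, T) for Topic_i in Topics}
topic_space = {"finance": 0.7, "politics": 0.2, "sports": 0.1}
```

The topic space is much smaller than the vocabulary, which is exactly what makes it useful as a compact representation.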
Why is Topic Modeling useful?
There are several scenarios when topic modeling can prove useful. Here are some of them:
- Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature
- Recommender Systems – Using a similarity measure we can build recommender systems. If our system recommends articles to readers, it will recommend articles with a topic structure similar to the articles the user has already read.
- Uncovering Themes in Texts – Useful for detecting trends in online publications for example
Topic Modeling Algorithms
There are several algorithms for doing topic modeling. The most popular ones include:
- LDA – Latent Dirichlet Allocation – The one we'll be focusing on in this tutorial. Its foundations are Probabilistic Graphical Models
- LSA or LSI – Latent Semantic Analysis or Latent Semantic Indexing – Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra
- NMF – Non-Negative Matrix Factorization – Based on Linear Algebra
Here are some things all these algorithms have in common:
- All of them take the number of topics (n_topics) as a parameter. None of the algorithms can infer the number of topics in the document collection.
- All of them take the Document-Word Matrix (or Document-Term Matrix) as input, where DWM[i][j] = the number of occurrences of word_j in document_i.
- All of them output two matrices: WTM (Word-Topic Matrix) and TDM (Topic-Document Matrix). These matrices are significantly smaller than DWM, and the result of their multiplication should be as close as possible to the original DWM matrix.
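As a quick, shape-only sketch of what that factorization looks like (the sizes below are hypothetical and the matrices are random, just to show the dimensions involved):

```python
import numpy as np

# Hypothetical sizes: 500 documents, 10,000 unique words, 10 topics
n_docs, n_words, n_topics = 500, 10_000, 10

DWM = np.random.randint(0, 5, size=(n_docs, n_words))  # Document-Word Matrix (word counts)
WTM = np.random.rand(n_words, n_topics)                # Word-Topic Matrix
TDM = np.random.rand(n_topics, n_docs)                 # Topic-Document Matrix

# The two factors are much smaller than DWM; their product (transposed here to
# match DWM's documents-by-words orientation) is meant to approximate DWM.
reconstruction = (WTM @ TDM).T
print(DWM.shape, reconstruction.shape)  # (500, 10000) (500, 10000)
```

A topic model's job is to find WTM and TDM so that this reconstruction is close to the original counts, under each algorithm's own constraints (probability distributions for LDA, orthogonal factors for LSA/SVD, non-negative factors for NMF).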
The purpose of this guide is not to describe each algorithm in great detail, but rather to give a practical overview and concrete implementations in Python using Scikit-Learn and Gensim. We'll go over each algorithm in more detail later in this tutorial. Next, we're going to use Scikit-Learn and Gensim to perform topic modeling on a corpus.
Using Gensim for Topic Modeling
We're going to first study the gensim implementations because they offer more functionality out of the box, and then we'll replicate that functionality with sklearn. Let's first prepare the dataset we'll be working with.
```python
from nltk.corpus import brown

data = []

for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)

NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[:5])
```
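If the NLTK resources used in this tutorial aren't already on your machine, a one-time download step (a small aside, not part of the original code) takes care of them:

```python
import nltk

# One-time downloads: the Brown corpus plus the tokenizer and stopword list used below
nltk.download('brown')
nltk.download('punkt')
nltk.download('stopwords')
```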
Gensim doesn't have an NMF implementation, so we're only going to play with the LDA and LSI (Latent Semantic Indexing, a.k.a. Latent Semantic Analysis) models.
```python
import re

from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

NUM_TOPICS = 10
STOPWORDS = stopwords.words('english')

def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
    cleaned_text = [t for t in tokenized_text
                    if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text

# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))

# Build a Dictionary - a mapping between words and their numeric ids
dictionary = corpora.Dictionary(tokenized_data)

# Transform the collection of texts into a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

# Have a look at what the 20th document looks like: [(word_id, count), ...]
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2), ...

# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
```
Let’s now display the topics the two models have inferred:
```python
print("LDA Model:")

for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))

print("=" * 20)

print("LSI Model:")

for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))

print("=" * 20)
```
```
LDA Model:
Topic #0: 0.006*"would" + 0.006*"one" + 0.004*"said" + 0.003*"new" + 0.003*"two" + 0.003*"time" + 0.003*"could" + 0.002*"may" + 0.002*"man" + 0.002*"also"
Topic #1: 0.005*"one" + 0.005*"would" + 0.004*"said" + 0.003*"new" + 0.003*"could" + 0.003*"made" + 0.003*"two" + 0.003*"time" + 0.002*"first" + 0.002*"may"
Topic #2: 0.005*"one" + 0.005*"would" + 0.005*"said" + 0.004*"could" + 0.003*"time" + 0.002*"two" + 0.002*"even" + 0.002*"new" + 0.002*"way" + 0.002*"first"
Topic #3: 0.007*"would" + 0.005*"one" + 0.004*"could" + 0.004*"said" + 0.003*"first" + 0.003*"new" + 0.003*"may" + 0.002*"two" + 0.002*"time" + 0.002*"man"
Topic #4: 0.006*"one" + 0.004*"said" + 0.004*"would" + 0.003*"could" + 0.003*"new" + 0.003*"like" + 0.003*"even" + 0.002*"two" + 0.002*"time" + 0.002*"may"
Topic #5: 0.007*"one" + 0.006*"would" + 0.004*"time" + 0.003*"could" + 0.003*"may" + 0.003*"man" + 0.003*"said" + 0.002*"like" + 0.002*"new" + 0.002*"two"
Topic #6: 0.007*"one" + 0.003*"may" + 0.003*"would" + 0.003*"could" + 0.003*"time" + 0.003*"new" + 0.003*"first" + 0.003*"two" + 0.003*"said" + 0.002*"man"
Topic #7: 0.005*"one" + 0.004*"would" + 0.003*"new" + 0.003*"said" + 0.003*"first" + 0.003*"man" + 0.003*"two" + 0.003*"may" + 0.003*"state" + 0.003*"could"
Topic #8: 0.007*"one" + 0.004*"would" + 0.003*"may" + 0.003*"time" + 0.003*"new" + 0.003*"two" + 0.002*"said" + 0.002*"mrs." + 0.002*"many" + 0.002*"also"
Topic #9: 0.007*"would" + 0.006*"one" + 0.004*"said" + 0.004*"could" + 0.003*"new" + 0.003*"like" + 0.003*"man" + 0.003*"time" + 0.003*"even" + 0.003*"first"
====================
LSI Model:
Topic #0: 0.308*"one" + 0.280*"would" + 0.202*"said" + 0.175*"could" + 0.146*"time" + 0.144*"new" + 0.126*"man" + 0.125*"like" + 0.125*"two" + 0.120*"first"
Topic #1: -0.294*"said" + 0.219*"may" + 0.179*"state" + -0.176*"could" + -0.153*"would" + 0.143*"states" + 0.141*"new" + -0.140*"like" + -0.138*"back" + -0.105*"man"
Topic #2: -0.340*"said" + -0.338*"state" + 0.229*"one" + -0.190*"states" + -0.161*"year" + -0.152*"mrs." + -0.135*"would" + -0.132*"united" + -0.132*"federal" + -0.130*"government"
Topic #3: 0.262*"new" + 0.256*"mrs." + -0.155*"feed" + -0.151*"per" + 0.150*"world" + -0.144*"used" + 0.142*"church" + 0.117*"god" + 0.106*"life" + 0.100*"people"
Topic #4: 0.509*"mrs." + -0.237*"would" + -0.193*"states" + -0.153*"united" + -0.132*"could" + -0.122*"man" + -0.121*"state" + -0.109*"government" + 0.104*"year" + 0.099*"school"
Topic #5: 0.376*"would" + -0.373*"feed" + -0.269*"per" + -0.246*"state" + -0.129*"god" + -0.126*"daily" + -0.122*"man" + -0.119*"drug" + 0.117*"school" + -0.116*"name"
Topic #6: 0.274*"feed" + -0.265*"mrs." + 0.220*"per" + 0.179*"school" + -0.161*"states" + 0.161*"would" + -0.147*"state" + 0.144*"said" + -0.136*"united" + -0.132*"one"
Topic #7: -0.381*"mrs." + -0.277*"would" + 0.263*"state" + -0.230*"feed" + 0.223*"said" + 0.221*"school" + -0.147*"united" + -0.141*"per" + -0.105*"government" + 0.102*"education"
Topic #8: -0.373*"state" + -0.277*"mrs." + -0.277*"would" + 0.174*"new" + 0.159*"business" + 0.159*"united" + -0.157*"one" + -0.128*"feed" + 0.126*"development" + 0.117*"small"
Topic #9: -0.201*"may" + -0.192*"mrs." + 0.191*"new" + -0.182*"shall" + -0.171*"said" + -0.165*"united" + -0.156*"school" + -0.155*"states" + 0.148*"would" + -0.137*"form"
====================
```
Let’s now put the models to work and transform unseen documents to their topic distribution:
```python
text = "The economy is working better than ever"
bow = dictionary.doc2bow(clean_text(text))

print(lsi_model[bow])
# [(0, 0.091615426138426506), (1, -0.0085557463300508351), (2, 0.016744863677828108), (3, 0.040508186718598529), (4, 0.014201267714185898), (5, -0.012208538275305329), (6, 0.031254053085582149), (7, 0.017529584659403553), (8, 0.056957633371540077), (9, 0.025989149894888153)]

print(lda_model[bow])
# [(0, 0.020005183), (1, 0.020005869), (2, 0.02000626), (3, 0.020005472), (4, 0.020009108), (5, 0.020005926), (6, 0.81994385), (7, 0.020006068), (8, 0.020006327), (9, 0.020005994)]
```
The LDA result can be interpreted as a distribution over topics. Let's take an example: [(0, 0.020229582), (1, 0.48642197), (2, 0.020894188), (3, 0.020058075), (4, 0.022410348), (5, 0.025939714), (6, 0.20046122), (7, 0.13457063), (8, 0.048185956), (9, 0.02082831)]. This result suggests that topic 1 has the strongest representation in this text.
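If all you need is the dominant topic, a tiny helper (a hypothetical convenience function, not part of gensim's API) can pick it out of the distribution:

```python
# Hypothetical helper: return the (topic_id, weight) pair with the largest weight
def dominant_topic(topic_distribution):
    return max(topic_distribution, key=lambda item: item[1])

print(dominant_topic(lda_model[bow]))  # e.g. (6, 0.8199...) for the example text above
```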
Gensim offers a simple way of performing similarity queries using topic models.
```python
from gensim import similarities

lda_index = similarities.MatrixSimilarity(lda_model[corpus])

# Let's perform some queries
similarities = lda_index[lda_model[bow]]
# Sort the similarities
similarities = sorted(enumerate(similarities), key=lambda item: -item[1])

# Top most similar documents:
print(similarities[:10])
# [(104, 0.87591344), (178, 0.86124849), (31, 0.8604598), (77, 0.84932965), (85, 0.84843522), (135, 0.84421808), (215, 0.84184396), (353, 0.84038532), (254, 0.83498049), (13, 0.82832891)]

# Let's see what's the most similar document
document_id, similarity = similarities[0]
print(data[document_id][:1000])
```
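If you want to reuse the trained models and the similarity index later without retraining, gensim objects can be persisted to disk. A minimal sketch (the file names are arbitrary examples):

```python
from gensim import models, similarities

# Save the trained artifacts (file names are just examples)
lda_model.save('lda_brown.model')
lda_index.save('lda_brown.index')

# ...and in a later session, load them back
loaded_lda = models.LdaModel.load('lda_brown.model')
loaded_index = similarities.MatrixSimilarity.load('lda_brown.index')
```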
Using Scikit-Learn for Topic Modeling
Let's now go through the same process with sklearn. This library offers an NMF implementation as well. The algorithms are more bare-bones than what we've seen with gensim, but on the plus side they implement the fit/transform interface we're used to:
```python
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

NUM_TOPICS = 10

vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

# Build a Latent Dirichlet Allocation Model
# (older scikit-learn versions call this parameter n_topics instead of n_components)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

# Build a Latent Semantic Indexing Model via truncated SVD
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

# Let's see how the first document in the corpus looks in the different topic spaces
print(lda_Z[0])
print(nmf_Z[0])
print(lsi_Z[0])
```
In order to inspect the inferred topics we need to implement a print function ourselves:
```python
def print_topics(model, vectorizer, top_n=10):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # get_feature_names_out() is get_feature_names() on older scikit-learn versions
        print([(vectorizer.get_feature_names_out()[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])

print("LDA Model:")
print_topics(lda_model, vectorizer)
print("=" * 20)

print("NMF Model:")
print_topics(nmf_model, vectorizer)
print("=" * 20)

print("LSI Model:")
print_topics(lsi_model, vectorizer)
print("=" * 20)
```
Transforming an unseen document goes like this:
```python
text = "The economy is working better than ever"
x = nmf_model.transform(vectorizer.transform([text]))[0]
print(x)
```
Here's how to implement the similarity functionality we've seen in the gensim section:
```python
from sklearn.metrics.pairwise import euclidean_distances

def most_similar(x, Z, top_n=5):
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    most_similar = sorted(pairs, key=lambda item: item[1])[:top_n]
    return most_similar

similarities = most_similar(x, nmf_Z)
document_id, similarity = similarities[0]
print(data[document_id][:1000])
```
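Euclidean distance works here, but cosine similarity is another common choice for comparing topic vectors. Here's a variant of the same helper (my addition, not from the original tutorial) using sklearn's cosine_similarity:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Alternative helper: rank documents by cosine similarity (higher means more similar)
def most_similar_cosine(x, Z, top_n=5):
    sims = cosine_similarity(x.reshape(1, -1), Z)[0]
    return sorted(enumerate(sims), key=lambda item: -item[1])[:top_n]

print(most_similar_cosine(x, nmf_Z))
```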
Plotting words and documents in 2D with SVD
We can use SVD with 2 components (topics) to display words and documents in 2D. The process is very similar to the one we used for topic modeling above. Let's start with displaying documents, since it's a bit more straightforward.
In case you are running this in a Jupyter Notebook, run the following lines to initialize bokeh:
```python
import pandas as pd

from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet

output_notebook()
```
Let’s plot documents in 2D:
```python
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)

df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:, 0], documents_2d[:, 1], range(len(data))

source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')

plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
```
You can try going through the documents to see if, indeed, documents that are closer on the plot are more similar. To display words in 2D we just need to transpose the vectorized data: words_2d = svd.fit_transform(data_vectorized.T).
```python
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)

df = pd.DataFrame(columns=['x', 'y', 'word'])
# get_feature_names_out() is get_feature_names() on older scikit-learn versions
df['x'], df['y'], df['word'] = words_2d[:, 0], words_2d[:, 1], vectorizer.get_feature_names_out()

source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')

plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
```
To get a really good word representation we would need a significantly larger corpus. Even with this corpus, though, if we zoom around a bit we can find some meaningful clusters of words.
More about Latent Dirichlet Allocation
LDA is the most popular method for doing topic modeling in real-world applications. That is because it provides accurate results, can be trained online (so we don't need to retrain the model every time new data arrives) and can be run on multiple cores. Let's repeat the process from the previous sections with sklearn and LatentDirichletAllocation:
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

NUM_TOPICS = 10

vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

# Build a Latent Dirichlet Allocation Model
# (older scikit-learn versions call this parameter n_topics instead of n_components)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)

text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())
```
Notice how the factors corresponding to each component (topic) add up to 1. That’s not a coincidence. Indeed, LDA considers documents as being generated by a mixture of the topics. The purpose of LDA is to compute how much of the document was generated by which topic. In this example, more than half of the document has been generated by the second topic:
```
[ 0.02501077  0.5133853   0.02500456  0.02500208  0.02500785  0.02500306
  0.02500211  0.28657666  0.02500757  0.02500003]
```
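A quick aside on the "trained online" and "multiple cores" points from above: with sklearn these correspond to the partial_fit method and the n_jobs parameter. Here's a hedged sketch (the new example text is made up):

```python
# Online updates: feed a new batch of documents to the already-fitted model.
# Words not in the fitted vectorizer's vocabulary are simply ignored.
new_texts = ["Stocks rallied as the economy improved"]
lda_model.partial_fit(vectorizer.transform(new_texts))

# Multiple cores: n_jobs controls how many jobs are used in the E-step
parallel_lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10,
                                         learning_method='online', n_jobs=-1)
parallel_lda.fit(data_vectorized)
```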
LDA is an iterative algorithm. Here are the two main steps:
- In the initialization stage, each word is assigned to a random topic.
- Iteratively, the algorithm goes through each word and reassigns it to a topic, taking into consideration (see the sketch after this list):
- What's the probability of the word belonging to the topic
- What's the probability of the document being generated by the topic
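To make these two steps concrete, here is a deliberately simplified collapsed Gibbs sampling sketch. This is for intuition only: it is my own toy illustration, not how gensim or sklearn implement LDA internally (both rely on variational inference), and it assumes docs is a list of documents where each document is a list of word ids.

```python
import numpy as np

def lda_gibbs_sketch(docs, vocab_size, n_topics=10, n_iter=50, alpha=0.1, beta=0.01):
    """Toy collapsed Gibbs sampler for LDA (illustration only)."""
    rng = np.random.default_rng(0)

    doc_topic = np.zeros((len(docs), n_topics))    # how much each document uses each topic
    topic_word = np.zeros((n_topics, vocab_size))  # how much each topic uses each word
    topic_totals = np.zeros(n_topics)              # total words assigned to each topic

    # Step 1: initialization - assign every word occurrence to a random topic
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.integers(n_topics)
            doc_assignments.append(z)
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_totals[z] += 1
        assignments.append(doc_assignments)

    # Step 2: iteratively reassign each word, weighing (a) how likely the document
    # is to use each topic and (b) how likely each topic is to produce the word
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Temporarily remove the current assignment from the counts
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_totals[z] -= 1

                # P(topic | document) * P(word | topic), smoothed by alpha and beta
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) / (topic_totals + beta * vocab_size)
                z = rng.choice(n_topics, p=p / p.sum())

                # Record the new assignment
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_totals[z] += 1

    return doc_topic, topic_word
```

After enough iterations, normalizing the rows of doc_topic gives per-document topic distributions, and normalizing the rows of topic_word gives per-topic word distributions.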
Because of these properties, LDA results lend themselves nicely to visualization. We're going to use a specialized tool called pyLDAvis:
```python
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel
```
Let’s interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:
- Larger topics are more frequent in the corpus.
- Topics closer together are more similar, topics further apart are less similar.
- When you select a topic, you can see the most representative words for that topic. The ranking is a combination of how frequent and how discriminative the word is, and you can adjust the weight of each property using the slider.
- Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.
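If you are not running inside a notebook, the same panel can be written to a standalone HTML file. A small sketch (the file name is just an example):

```python
import pyLDAvis

# Write the interactive visualization to a standalone HTML file
with open('lda_visualization.html', 'w') as f:
    pyLDAvis.save_html(panel, f)
```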
As we mentioned before, LDA can be used for automatic tagging. We can go over each topic (pyLDAvis helps a lot here) and attach a label to it. In the screenshot above you can see that the topic is mainly about Education. In the next example, we can see that the topic is mostly about Music. You can try doing this for all the topics. Unfortunately, not all topics are as clearly defined as the ones we looked at. Results can be improved by experimenting with different num_topics values. In this case our corpus is not really that large; it only has 500 documents. A larger corpus will induce more clearly defined topics.
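If you want to experiment with num_topics more systematically, here is a hedged sketch on the gensim side, reusing the tokenized_data, dictionary and corpus objects built earlier and scoring each value with gensim's CoherenceModel (gensim's per-word likelihood bound, log_perplexity, is printed as well):

```python
from gensim.models import CoherenceModel

# Try a few values of num_topics and compare topic coherence (higher is better)
for num_topics in (5, 10, 20, 40):
    model = models.LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
    coherence = CoherenceModel(model=model, texts=tokenized_data,
                               dictionary=dictionary, coherence='c_v').get_coherence()
    print(num_topics, coherence, model.log_perplexity(corpus))
```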
Reader Q&A

Q: Thanks for the nice tutorial! Do you have any tips for finding an optimal number of topics? For instance, if I am trying to predict the number of views per post on a blog, would it be okay to iterate over different numbers of topics and keep the one that gives the best prediction?

A: Indeed, checking the predictive power for different numbers of topics is one way to go about it. If you are doing this fine-tuning manually, you can also use pyLDAvis and try to get the best-separated topics you can.