TextRank for Text Summarization
The task of summarization is a classic one and has been studied from different perspectives. It consists of picking a subset of a text such that the information conveyed by that subset is as close to the original text as possible. The subset, named the summary, should be human readable. The task is not about picking the most common words or entities. Think of it as a quick digest for a news article.
Extractive versus abstractive
There are 2 fundamentally different approaches to summarization.
The extractive approach entails selecting the X most representative sentences that best cover the information expressed by the original text. This is the most popular approach, especially because it’s a much easier task than the abstractive one.
In the abstractive approach, we basically build a summary of the text, in the way a human would build one. We pick ideas from several paragraphs or sentences, build readable sentences and present them in a concise form. The abstractive approach is usually a thing in the deep learning realm and we won’t cover it in this article. It’s not a solved problem and the resources available are not that handy or plentiful.
Why is summarization useful
Summarization is mainly useful because it condenses information for easier consumption and analysis. Imagine reading the summaries of all the news of the day and picking only the stories about events you really care about, instead of skimming through the whole newspaper. Or imagine reading only the most important clauses in a contract rather than wasting time on all the legal jargon. In fact, summaries are popular on the web. There’s an online web app, https://skim.it/, that does just that: it uses extractive summarization to help you read faster. Even when it’s not done automatically, many blog posts have a “teaser” that serves the same purpose, presenting the main subject of the article you are about to read. That teaser is usually an example of an abstractive summary.
The PageRank algorithm
In this article we’ll be learning about a very popular and accurate extractive text summarization algorithm. It’s called TextRank. Before diving into TextRank algorithm, we must first make sure we understand the PageRank algorithm, because it’s the foundation of TextRank.
You might have already heard of PageRank since it’s used heavily on the service you use most on the web: Google Search. It’s used to compute the rank of web pages, but funnily enough, it’s not named after its use (ranking pages) but after its creator: Larry Page, one of Google’s founders. Of course, the version of the algorithm running inside Google has been modified so heavily, mostly because it has to work at an incredible scale, that it’s barely recognisable as plain PageRank. We’ll be focusing here only on the original algorithm, which is very interesting in its own right and worth your attention. Here are its fundamentals:
- The main insight: Important pages are linked by important pages (a recurrent definition)
- The PageRank value of a page is essentially the probability of a user visiting that page
Let’s now define the variables we’ll be using:
- We have N pages with links between them
- P = the PageRank vector (P[i] = the PageRank of page i)
- A[i][j] is the probability of the user transitioning from page i to page j
- A[i][j] is initialized with 1.0 / outbound_links_count(from_node=i) if there’s a link between i and j, and 0 otherwise
- If a page i has no outbound links, then A[i][j] = 1.0 / N for every j: we assign equal probability to transitioning to any page. Pages with no outbound links are called dangling pages.
Let’s write a basic implementation of PageRank. First, we’ll write a simple 10 web page graph to apply the algorithm on:
```python
links = {
    'webpage-1': set(['webpage-2', 'webpage-4', 'webpage-5', 'webpage-6', 'webpage-8', 'webpage-9', 'webpage-10']),
    'webpage-2': set(['webpage-5', 'webpage-6']),
    'webpage-3': set(['webpage-10']),
    'webpage-4': set(['webpage-9']),
    'webpage-5': set(['webpage-2', 'webpage-4']),
    'webpage-6': set([]),  # dangling page
    'webpage-7': set(['webpage-1', 'webpage-3', 'webpage-4']),
    'webpage-8': set(['webpage-1']),
    'webpage-9': set(['webpage-1', 'webpage-2', 'webpage-3', 'webpage-8', 'webpage-10']),
    'webpage-10': set(['webpage-2', 'webpage-3', 'webpage-8', 'webpage-9']),
}
```
Next, we’ll write a function that builds an index of the webpages, assigning each one a numeric id. We’ll use this index to build the transition matrix.
```python
def build_index(links):
    website_list = links.keys()
    return {website: index for index, website in enumerate(website_list)}

website_index = build_index(links)
print(website_index)

# {'webpage-10': 3, 'webpage-9': 0, 'webpage-8': 1, 'webpage-1': 2, 'webpage-3': 4, 'webpage-2': 5, 'webpage-5': 6, 'webpage-4': 7, 'webpage-7': 8, 'webpage-6': 9}
```
We need to feed a transition probability matrix to PageRank, where A[i][j] is the probability of transitioning from page i to page j. Here’s how to build that matrix:
```python
import numpy as np

def build_transition_matrix(links, index):
    A = np.zeros((len(index), len(index)))
    for webpage in links:
        if not links[webpage]:
            # dangling page: assign equal probabilities of transitioning to all the other pages
            A[index[webpage]] = np.ones(len(index)) / len(index)
        else:
            for dest_webpage in links[webpage]:
                A[index[webpage]][index[dest_webpage]] = 1.0 / len(links[webpage])

    return A

A = build_transition_matrix(links, website_index)
print(A)
```
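Each row of this matrix is a probability distribution over the next page, so it should sum to 1 (including the row for the dangling page). If you want to convince yourself the matrix is well-formed, a quick check along these lines should do:

```python
# Every row of the transition matrix is a probability distribution,
# so each one should sum (approximately) to 1.0
print(A.sum(axis=1))
assert np.allclose(A.sum(axis=1), 1.0)
```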
Now we’ll write the actual PageRank algorithm. The algorithm takes 2 parameters:
1. eps: stop the algorithm when the difference between 2 consecutive iterations is smaller than or equal to eps
2. d: the damping factor. With a probability of 1-d, the user will simply pick a web page at random as the next destination, ignoring the link structure completely.
```python
def pagerank(A, eps=0.0001, d=0.85):
    P = np.ones(len(A)) / len(A)
    while True:
        new_P = np.ones(len(A)) * (1 - d) / len(A) + d * A.T.dot(P)
        delta = abs(new_P - P).sum()
        if delta <= eps:
            return new_P
        P = new_P

results = pagerank(A)
print("Results:", results)
# [ 0.13933698,  0.09044235,  0.1300934 ,  0.13148714,  0.08116465,
#   0.1305122 ,  0.09427366,  0.085402  ,  0.02301397,  0.09427366]

print(sum(results))  # 1.0

print([item[0] for item in sorted(enumerate(results), key=lambda item: -item[1])])
# [0, 3, 5, 2, 6, 9, 1, 7, 4, 8]
```
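If you’d like to double-check the numbers, here’s a minimal sketch that runs PageRank from the networkx library (assuming you have it installed; it’s not used anywhere else in this article) on the same graph. The values should come out roughly the same as ours:

```python
import networkx as nx

# Build a directed graph from the same `links` dictionary
G = nx.DiGraph()
G.add_nodes_from(links)
for source, destinations in links.items():
    for destination in destinations:
        G.add_edge(source, destination)

# networkx calls the damping factor `alpha`; dangling nodes are handled
# by spreading their probability mass uniformly, just like we did
nx_ranks = nx.pagerank(G, alpha=0.85)
for webpage, rank in sorted(nx_ranks.items(), key=lambda item: -item[1]):
    print(webpage, round(rank, 5))
```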
From PageRank to TextRank
You might be asking yourself: “How does PageRank help me with text summarization?”
Here’s the trick:
- We consider sentences the equivalent of web pages
- The probability of going from sentence A to sentence B is equal to the similarity of the 2 sentences
- Apply the PageRank algorithm over this sentence graph
A few observations:
- In the case of TextRank, the graph is symmetrical
- It’s up to us to come up with a similarity measure and the algorithm’s performance depends strongly upon this measure
- The algorithm is language agnostic. (It can become language dependent due to a language dependent implementation of the similarity measure)
- It doesn’t require any training, it’s a totally unsupervised method (which we like)
Here are some conditions a good similarity measure has to obey:
- 0 <= similarity(A, B) <= 1
- similarity(A, A) = 1
- similarity(A, B) != 1 if A != B
A widely used measure in Natural Language Processing is the cosine similarity. The cosine similarity computes the cosine of the angle between 2 vectors: if the vectors are identical, the cosine is 1.0, and if the vectors are orthogonal, the cosine is 0.0. This means the cosine similarity is a measure we can use.
NLTK implements cosine_distance, which is 1 - cosine_similarity. The concept of distance is the opposite of similarity: two identical vectors are located at distance 0 and are 100% similar.
Let’s write a good sentence similarity measure:
```python
from nltk.corpus import brown, stopwords
from nltk.cluster.util import cosine_distance

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)

# One out of 5 words differs => 0.8 similarity
print(sentence_similarity("This is a good sentence".split(), "This is a bad sentence".split()))

# One out of 2 non-stop words differs => 0.5 similarity
print(sentence_similarity("This is a good sentence".split(), "This is a bad sentence".split(), stopwords.words('english')))

# 0 out of 2 non-stop words differ => 1.0 similarity (identical sentences)
print(sentence_similarity("This is a good sentence".split(), "This is a good sentence".split(), stopwords.words('english')))

# Completely different sentences => 0.0 similarity
print(sentence_similarity("This is a good sentence".split(), "I want to go to the market".split(), stopwords.words('english')))
```
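A small heads-up before moving on (this comes up in the comments below as well): if one of the sentences contains only stopwords, its vector ends up all zeros and cosine_distance divides by zero, returning nan. If that can happen with your data, one possible guard, assuming a similarity of 0 is acceptable for such sentences, is a thin wrapper like this:

```python
def safe_sentence_similarity(sent1, sent2, stopwords=None):
    # Return 0.0 instead of nan when one of the sentences
    # has no non-stopword tokens left to compare.
    if stopwords is None:
        stopwords = []
    if not [w for w in sent1 if w.lower() not in stopwords]:
        return 0.0
    if not [w for w in sent2 if w.lower() not in stopwords]:
        return 0.0
    return sentence_similarity(sent1, sent2, stopwords)
```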
Let’s build the PageRank transition matrix by building the sentence similarity matrix:
```python
import numpy as np

# Get a text from the Brown Corpus
sentences = brown.sents('ca01')

print(sentences)
# [[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]

print(len(sentences))  # 98

# get the english list of stopwords
stop_words = stopwords.words('english')

def build_similarity_matrix(sentences, stopwords=None):
    # Create an empty similarity matrix
    S = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:
                continue

            S[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stopwords)

    # normalize the matrix row-wise so each row sums to 1
    for idx in range(len(S)):
        if S[idx].sum() == 0:
            # the sentence shares no words with any other sentence;
            # leave the row as zeros to avoid a division by zero
            continue
        S[idx] /= S[idx].sum()

    return S

S = build_similarity_matrix(sentences, stop_words)
print(S)
```
Now that we know how to compute the similarity matrix, we can compute the summary. For demonstration purposes, we’re using the first document from the Brown corpus. Here’s a rough version of the code:
```python
from operator import itemgetter

sentence_ranks = pagerank(S)
print(sentence_ranks)

# Get the sentences ordered by rank
ranked_sentence_indexes = [item[0] for item in sorted(enumerate(sentence_ranks), key=lambda item: -item[1])]
print(ranked_sentence_indexes)

# Suppose we want the 5 most important sentences
SUMMARY_SIZE = 5
SELECTED_SENTENCES = sorted(ranked_sentence_indexes[:SUMMARY_SIZE])
print(SELECTED_SENTENCES)

# Fetch the most important sentences, preserving document order
summary = itemgetter(*SELECTED_SENTENCES)(sentences)

# Print the actual summary
for sentence in summary:
    print(' '.join(sentence))
```
Let’s bundle all this code in a nice & neat function:
```python
def textrank(sentences, top_n=5, stopwords=None):
    """
    sentences = a list of tokenized sentences [[w11, w12, ...], [w21, w22, ...], ...]
    top_n = how many sentences the summary should contain
    stopwords = a list of stopwords
    """
    S = build_similarity_matrix(sentences, stopwords)
    sentence_ranks = pagerank(S)

    # Sort the sentence ranks
    ranked_sentence_indexes = [item[0] for item in sorted(enumerate(sentence_ranks), key=lambda item: -item[1])]
    selected_sentences = sorted(ranked_sentence_indexes[:top_n])
    summary = itemgetter(*selected_sentences)(sentences)
    return summary

for idx, sentence in enumerate(textrank(sentences, stopwords=stopwords.words('english'))):
    print("%s. %s" % ((idx + 1), ' '.join(sentence)))

# 1. `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
# 2. Nevertheless , `` we feel that in the future Fulton County should receive some portion of these available funds '' , the jurors said .
# 3. -- After a long , hot controversy , Miller County has a new school superintendent , elected , as a policeman put it , in the `` coolest election I ever saw in this county '' .
# 4. `` This was the coolest , calmest election I ever saw '' , Colquitt Policeman Tom Williams said .
# 5. `` Everything went real smooth '' , the sheriff said .
```
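If you want to try this on your own raw text instead of the pre-tokenized Brown corpus, a minimal sketch could look like this (it assumes you have downloaded NLTK’s punkt and stopwords data; the example text is just a placeholder):

```python
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

raw_text = ("Automatic summarization is a hot topic in natural language processing. "
            "Extractive summarization picks the most representative sentences from the original text. "
            "TextRank is a popular extractive summarization algorithm built on top of PageRank. "
            "PageRank itself was originally designed to rank web pages.")

# textrank expects a list of tokenized sentences
tokenized_sentences = [word_tokenize(sentence) for sentence in sent_tokenize(raw_text)]

for idx, sentence in enumerate(textrank(tokenized_sentences, top_n=2, stopwords=stopwords.words('english'))):
    print("%s. %s" % (idx + 1, ' '.join(sentence)))
```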
Conclusions
- TextRank uses the intuition behind the PageRank algorithm to rank sentences
- TextRank is an unsupervised method for computing the extractive summary of a text. No training is necessary.
- Given that it’s language independent, it can be used on any language, not only English.
- In this article we used cosine similarity as the similarity measure, converting sentences to vectors in order to compute it.
Comments
how can we build a text dictionary and make a comprehensive tf-idf from built dictionary
Hey,
Think maybe Google Translate failed you here. If you want to learn about Tf-Idf, head here: http://nlpforhackers.io/tf-idf/
Cheers,
Bogdan.
Hey,
Thanks for this great tutorial! I have been attempting to use this to summarize responses from free-text survey data, but running the function “build_similarity_matrix” on a data sample outputs a matrix with “nan” row values, which creates a Runtime Warning when the function “pagerank” runs. The “nan” values seem to map to incomplete sentences in my survey responses. Would you have any advice or best practices for handling “nan” values as an output from the “build_similarity_matrix” function?
Thanks!
Hi Jane,
That’s weird … can you send me the text that causes this? (You can use the contact form here: http://nlpforhackers.io/contact/)
Bogdan.
For others that might have this issue: I think it comes from

```python
# normalize the matrix row-wise
for idx in range(len(S)):
    S[idx] /= S[idx].sum()
```

when there are no words that are the same between the two sentences, thus there is a divide by zero. But luckily you don’t need to normalize a zero row! So I added:

```python
if S[idx].sum() == 0:
    continue
```

and it seemed to fix the problem.
Sounds like a good idea, good catch
I think your delta function should be abs(new_P - P).sum() in the PageRank section.
Wow, thank you for catching this! Very appreciated 😀
If idx1 == idx2, shouldn’t the similarity matrix be updated to 1?