TextRank for Text Summarization

The task of summarization is a classic one and has been studied from different perspectives. The task consists of picking a subset of a text so that the information disseminated by the subset is as close to the original text as possible. The subset, named the summary, should be human readable. The task is not about picking the most common words or entities. Think of it as a quick digest for a news article.

Extractive versus abstractive

The are 2 fundamentally different approaches in summarization.

The extractive approach entails selecting the X most representative sentences that best cover the whole information expressed by the original text. This is the most popular approach, especially because it’s a much easier task than the abstractive approach.

In the abstractive approach, we basically build a summary of the text, in the way a human would build one. We pick ideas from several paragraphs or sentences, build readable sentences and present them in a concise form. The abstractive approach is usually a thing in the deep learning realm and we won’t cover it in this article. It’s not a solved problem and the resources available are not that handy or plentiful.

Why is summarization useful

Summarization is mainly useful because it condenses information for easier consumption and analysis. Imagine reading the summaries of all the news of the day and picking only the ones about events you really care about, instead of needing to skim through the whole newspaper. Imagine reading only the most important clauses in contract rather than wasting time on all the legal jargon. In fact, summaries are popular on the web. There’s an online web app, https://skim.it/, that does just that. It uses extractive summarization to help you read faster. Even if not done automatically, many blog posts have a “teaser” that basically does just that, presents you the main subject of the article you are about to read. This is usually an example of abstractive summary,

The PageRank algorithm

In this article we’ll be learning about a very popular and accurate extractive text summarization algorithm. It’s called TextRank. Before diving into TextRank algorithm, we must first make sure we understand the PageRank algorithm, because it’s the foundation of TextRank.

You might have already heard of PageRank since it’s used heavily on the service you use most on the web: Google Search. It’s used to compute the rank of web pages, but funny enough, it’s not named after its use (ranking pages) but rather after its creator: Larry Page, one of Google’s founders. Obviously, the algorithm used inside Google is almost to transfigured to be recognised as PageRank. This is due to the fact that inside Google the algorithm works at an incredible scale. We’ll be focusing here only on the original, plain PageRank algorithm. The algorithm itself is very interesting and worth your attention. Here are the fundamentals of it:

  • The main insight: Important pages are linked by important pages (a recurrent definition)
  • The PageRank value of a page is essentially the probability of a user visiting that page

Let’s now define the variables we’ll be using:

  • We have N pages with links between them
  • P = PageRank vector (P[i] = the pagerank of page i)
  • A[i][j] is the probability of the user transitioning from page i to page j
  • A[i][j] is initialized with 1.0 / outbound_links_count(from_node=i) if there’s a link between i and j and 0 otherwise
  • If a page i has no outbound links, then A[i][j] = 1.0/N. Assign equal probability to transitioning to any page. Pages with no outbound links are called dangling pages.

Let’s write a basic implementation of PageRank. First, we’ll write a simple 10 web page graph to apply the algorithm on:

Next, we’ll write a function that builds an index of the webpages and assigns them a numeric index. We’ll use this to build an adjacency matrix.

We need to feed a transition probability matrix to PageRank. A[i][j] is the probability of transitioning from page i to page j]. Here’s how to build that:

Now we’ll write the actual PageRank algorithm. The algorithm takes 2 parameters:
1. eps: stop the algorithm when the difference between 2 consecutive iterations is smaller or equal to eps
2. d: damping factor: With a probability of 1-d the user will simply pick a web page at random as the next destination, ignoring the link structure completely.

From PageRank to TextRank

You might be asking yourself: “How does PageRank help me with text summarization?”

Here’s the trick:
* We consider sentences the equivalent of web pages
* The probability of going from sentence A to sentence B is equal to the similarity of the 2 sentences
* Apply the PageRank algorithm over this sentence graph

A few observations:
* In the case of TextRank, the graph is symmetrical
* It’s up to us to come up with a similarity measure and the algorithm’s performance depends strongly upon this measure
* The algorithm is language agnostic. (It can become language dependent due to a language dependent implementation of the similarity measure)
* It doesn’t require any training, it’s a totally unsupervised method (which we like)

Here are some conditions a good similarity measure has to obey:
* 0 <= similarity(A, B) <= 1
* similarity(A, A) = 1
* similarity(A, B) =/= 1 if A =/= B

A widely used measure in Natural Language Processing is the Cosine Similarity. The Cosine Similarity computes the cosine of the angle between 2 vectors. If the vectors are identical, the cosine is 1.0. If the vectors are orthogonal, the cosine is 0.0. This means the cosine similarity is a measure we can use.

NLTK implements cosine_distance, which is 1 - cosine_similarity. The concept of distance is opposite to similarity. Two identical vectors are located at 0 distance and are 100% similar.

Let’s write a good sentence similarity measure:

Let’s build the PageRank transition matrix by building the sentence similarity matrix:

Now that we know how to compute the similarity matrix, we can compute the summary. For demonstration purposes, we’re using the first document from the Brown corpus. Here’s a rough version of the code:

Let’s bundle all this code in a nice & neat function:

Conclusions

  • TextRank uses the intuition behind the PageRank algorithm to rank sentences
  • TextRank is an unsupervised method for computing the extractive summary of a text. No training is necessary.
  • Given that it’s language independent, can be used on any language, not only English.
  • In this article, we used as the similarity measure, the cosine similarity. We convert sentences to vectors to compute this similarity.