Weighting words using Tf-Idf
Updates
- 29-Apr-2018 – Added a string instance check for Python 2.7 / Python 3.6 compatibility (thanks Greg)
If I ask you “Do you remember the article about electrons in NY Times?” there’s a better chance you will remember it than if I asked “Do you remember the article about electrons in the physics books?”. Here’s why: an article about electrons in NY Times is far less common than one in a collection of physics books. You are far less likely to stumble upon the “electron” concept in NY Times than in a physics book.
Let’s now consider the scenario of a single article. Suppose you read an article and you’re asked to rank the concepts found in it by importance. Chances are you’ll basically order the concepts by frequency, simply because the important ones get mentioned repeatedly: the narrative gravitates around them.
Combining the two insights, given a term, a document and a collection of documents, we can loosely say that:
importance ~ appearances(term, document) / count(documents containing term in collection)
This technique is called Tf-Idf – Term Frequency – Inverse Document Frequency. Here’s how the measure is defined:
tf = count(word, document) / len(document) – term frequency
idf = log( len(collection) / count(documents_containing_term, collection) ) – inverse document frequency
tf-idf = tf * idf – term frequency – inverse document frequency
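To make the definition concrete, here’s a tiny worked example with made-up numbers: a 100-word document mentions “electron” 3 times, and 2 out of the 10 documents in the collection contain the word.

import math

# Toy numbers, purely illustrative
tf = 3 / 100.0               # "electron" appears 3 times in a 100-word document
idf = math.log(10 / 2.0)     # 2 out of 10 documents contain "electron": log(5) ~ 1.61
print tf * idf               # ~ 0.048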
Let’s test this theory on some data. We’re going to use the Reuters dataset bundled inside NLTK.
from nltk.corpus import reuters

print reuters.fileids()       # The list of file names inside the corpus
print len(reuters.fileids())  # Number of files in the corpus = 10788

# Print the categories associated with a file
print reuters.categories('training/999')  # [u'interest', u'money-fx']

# Print the contents of the file
print reuters.raw('test/14829')
Let’s build a tokenizer that ignores punctuation and stopwords:
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize

stop_words = stopwords.words('english') + list(punctuation)

def tokenize(text):
    words = word_tokenize(text)
    words = [w.lower() for w in words]
    return [w for w in words if w not in stop_words and not w.isdigit()]
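As a quick sanity check, here’s the tokenizer applied to a made-up sentence (my own example, not from the Reuters data); the exact tokens depend on NLTK’s word_tokenize:

print tokenize("The U.S. economy grew 3 percent in 1987.")
# roughly: ['u.s.', 'economy', 'grew', 'percent']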
We now need to know all the words inside the collection:
# build the vocabulary in one pass
vocabulary = set()
for file_id in reuters.fileids():
    words = tokenize(reuters.raw(file_id))
    vocabulary.update(words)

vocabulary = list(vocabulary)
word_index = {w: idx for idx, w in enumerate(vocabulary)}

VOCABULARY_SIZE = len(vocabulary)
DOCUMENTS_COUNT = len(reuters.fileids())

print VOCABULARY_SIZE, DOCUMENTS_COUNT  # 51581 10788
Let’s compute the Idf for every word in the vocabulary:
from collections import defaultdict
import math

word_idf = defaultdict(lambda: 0)

for file_id in reuters.fileids():
    words = set(tokenize(reuters.raw(file_id)))
    for word in words:
        word_idf[word] += 1

for word in vocabulary:
    word_idf[word] = math.log(DOCUMENTS_COUNT / float(1 + word_idf[word]))

print word_idf['deliberations']  # 7.49443021503
print word_idf['committee']      # 3.61286641709
Let’s write, as an exercise, the vectorized numpy version of the Idf computation:
import numpy as np

word_idf = np.zeros(VOCABULARY_SIZE)

for file_id in reuters.fileids():
    words = set(tokenize(reuters.raw(file_id)))
    indexes = [word_index[word] for word in words]
    word_idf[indexes] += 1.0

word_idf = np.log(DOCUMENTS_COUNT / (1 + word_idf).astype(float))

print word_idf[word_index['deliberations']]  # 7.49443021503
print word_idf[word_index['committee']]      # 3.61286641709
Since Idf doesn’t depend on the current document but only on the collection, we can precompute it as we did above. Here’s the code for the final computation:
from six import string_types

def word_tf(word, document):
    # If not tokenized yet, tokenize the document
    if isinstance(document, string_types):
        document = tokenize(document)

    return float(document.count(word)) / len(document)

def tf_idf(word, document):
    # If not tokenized yet, tokenize the document
    if isinstance(document, string_types):
        document = tokenize(document)

    if word not in word_index:
        return .0

    return word_tf(word, document) * word_idf[word_index[word]]
Let’s run a few computations:
print tf_idf('year', reuters.raw('test/14829'))        # 0.0209031169481
print tf_idf('following', reuters.raw('test/14829'))   # 0.0306117802726
print tf_idf('provided', reuters.raw('test/14829'))    # 0.0388082713404
print tf_idf('structural', reuters.raw('test/14829'))  # 0.0534999300236
print tf_idf('japanese', reuters.raw('test/14829'))    # 0.0613707825494
print tf_idf('downtrend', reuters.raw('test/14829'))   # 0.068131183773
Notice how I sneakily listed the words in increasing order of their Tf-Idf score.
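As a side note, here’s a small sketch built on the functions above (top_keywords is a helper name of my own, not from the original code) that ranks all the distinct words of a document by their Tf-Idf score:

# Rank the distinct words of a document by Tf-Idf, highest first
# (top_keywords is a made-up helper built on the functions defined above)
def top_keywords(document, n=10):
    tokens = tokenize(document)
    scores = [(word, tf_idf(word, tokens)) for word in set(tokens)]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:n]

print top_keywords(reuters.raw('test/14829'))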
That’s how we compute Tf-Idf ourselves. Let’s also use some libraries to make the job a bit easier. Note that the scores may differ, but the ordering should be the same; the differences are most likely due to different smoothing strategies.
Here’s the code for computing the Tf-Idf score using scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop_words, tokenizer=tokenize, vocabulary=vocabulary)

# Fit the TfIdf model
tfidf.fit([reuters.raw(file_id) for file_id in reuters.fileids()])

# Transform a document into TfIdf coordinates
X = tfidf.transform([reuters.raw('test/14829')])

# Check out some frequencies
print X[0, tfidf.vocabulary_['year']]        # 0.0562524229373
print X[0, tfidf.vocabulary_['following']]   # 0.057140265658
print X[0, tfidf.vocabulary_['provided']]    # 0.0689364372666
print X[0, tfidf.vocabulary_['structural']]  # 0.0900802810906
print X[0, tfidf.vocabulary_['japanese']]    # 0.114492409303
print X[0, tfidf.vocabulary_['downtrend']]   # 0.111137191743
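If you’re curious where the difference comes from, here’s a minimal sketch on a toy corpus, assuming scikit-learn’s documented defaults: raw counts for tf, a smoothed idf of log((1 + N) / (1 + df)) + 1, and L2 normalization of every document vector.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the cat sat", "the dog sat", "the cat ran"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy_docs).toarray()

vocab = vectorizer.get_feature_names()
N = float(len(toy_docs))

# Redo the computation by hand, following scikit-learn's default formula
counts = np.array([[doc.split().count(w) for w in vocab] for doc in toy_docs], dtype=float)
df = (counts > 0).sum(axis=0)
idf = np.log((1.0 + N) / (1.0 + df)) + 1.0   # smooth_idf=True
manual = counts * idf
norms = np.sqrt((manual ** 2).sum(axis=1))
manual = manual / norms[:, np.newaxis]       # norm='l2'

print np.allclose(X, manual)                 # should print True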
Here’s the code for computing the Tf-Idf score using gensim:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

documents = [tokenize(reuters.raw(file_id)) for file_id in reuters.fileids()]
dictionary = Dictionary(documents)

tfidf_model = TfidfModel([dictionary.doc2bow(d) for d in documents], id2word=dictionary)
tfidf_values = dict(tfidf_model[dictionary.doc2bow(tokenize(reuters.raw('test/14829')))])

print tfidf_values[dictionary.token2id['year']]        # 0.0367516096888
print tfidf_values[dictionary.token2id['following']]   # 0.0538505795815
print tfidf_values[dictionary.token2id['provided']]    # 0.0683210467787
print tfidf_values[dictionary.token2id['structural']]  # 0.0945807226371
print tfidf_values[dictionary.token2id['japanese']]    # 0.107960637598
print tfidf_values[dictionary.token2id['downtrend']]   # 0.122670341446
Conclusions
- TfIdf is a really popular technique for weighting the importance of terms in a collection of documents
- It is used in Information Retrieval to rank results
- It is used for extracting keywords from web pages
Hi,
I need to implement the normal tfidf on the 20 Newsgroups dataset and classify the results without calculating the similarity. Is that possible?
TfIdf doesn’t imply any similarity computations. If you want to compute similarities between documents, I suggest gensim’s interface: http://radimrehurek.com/gensim/tut3.html
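For reference, here’s a rough sketch of that interface, reusing the dictionary, tfidf_model and documents built in the gensim example in the post:

# Sketch of document similarity with gensim (see the tutorial linked above);
# reuses dictionary, tfidf_model and documents from the gensim example in the post.
# For a very large corpus, gensim's Similarity class (which shards to disk) is friendlier.
from gensim.similarities import MatrixSimilarity

corpus_tfidf = [tfidf_model[dictionary.doc2bow(d)] for d in documents]
index = MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))

query = tfidf_model[dictionary.doc2bow(tokenize(reuters.raw('test/14829')))]
print index[query][:10]  # cosine similarities of the query against the first 10 documents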
I need to run a few classifiers on both tfidf and NP tfidf.
If you need to use features from 2 vectorizers, you can combine them using a sklearn FeatureUnion: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
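For illustration, here’s a minimal sketch of such a combination; np_tokenizer stands for a hypothetical noun-phrase tokenizer (a real one is in the gist linked further down) and the classifier choice is arbitrary:

# Sketch: combine a plain word TfIdf with a noun-phrase TfIdf via FeatureUnion
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def np_tokenizer(text):
    # placeholder; swap in a real noun-phrase tokenizer (see the gist linked below)
    return text.split()

combined = FeatureUnion([
    ('word_tfidf', TfidfVectorizer()),
    ('np_tfidf', TfidfVectorizer(tokenizer=np_tokenizer)),
])

model = Pipeline([
    ('features', combined),
    ('classifier', LogisticRegression()),
])

# model.fit(train_texts, train_labels)  # train_texts / train_labels are placeholders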
Talking with you is very valuable.
Hi,
Can you help me modify this code from calculating the similarity to running a few classifiers in Python? I’m working in PyCharm 2016.
Thank you,
Regards.
# get training and testing data
import nltk
import string
import os
from nltk.corpus import conll2000
import re, math
from collections import Counter
from nltk.corpus import stopwords
#from sklearn import metrics
from nltk.corpus import conll2000
print (len(conll2000.chunked_sents()))
print (len(conll2000.chunked_words()))
print ((conll2000.chunked_words()))
WORD = re.compile(r'\w+')
patterns = " NP: {+}"
NPChunker = nltk.RegexpParser(patterns)  # create a chunk parser
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NNP', 'NP', 'NN'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NNP', 'NP', 'NN'])
stop = set(stopwords.words('english'))
from nltk.stem.porter import PorterStemmer
path = './tf-idf'

def tf(word, txt):
    return txt.words.count(word) / len(txt.words)

def n_containing(word, txtlist):
    return sum(1 for blob in txtlist if word in txtlist.words)

def idf(word, txtlist):
    return math.log(len(txtlist) / (1 + n_containing(word, txtlist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

token_dict = {}

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
    sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

vector1 = text_to_vector('test.txt')
vector2 = text_to_vector('train.txt')
cosine = get_cosine(vector1, vector2)
print('Cosine:', cosine)
# training the chunker, ChunkParser is a class defined in the next slide
class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
NPChunker = ChunkParser(train_sents)
print (NPChunker.evaluate(test_sents))
Here’s a script that:
1. Trains an NP-Chunker
2. Uses the Chunker to build a NP TfIdf Vectorizer
https://gist.github.com/bogdan-ivanov/909600fa93a04794b83a5a532df56dec
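Roughly, the noun-phrase tokenization in a script like that works along these lines (a rough sketch under my own assumptions, not the gist’s actual code), reusing the NPChunker trained in the snippet above:

# Rough sketch (not the gist's actual code): a tokenizer that emits noun phrases,
# using the NPChunker trained in the snippet above
def np_tokenizer(text):
    noun_phrases = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = NPChunker.parse(tagged)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                noun_phrases.append(' '.join(word for word, tag in subtree.leaves()))
    return noun_phrases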
C:\Python27\python.exe C:/Users/Inspiron/PycharmProjects/untitled8/newWork.py
Traceback (most recent call last):
File "C:/Users/Inspiron/PycharmProjects/untitled8/newWork.py", line 11, in
from sklearn.feature_extraction.text import TfidfVectorizer
File "C:\Users\Inspiron\AppData\Roaming\Python\Python27\site-packages\sklearn\__init__.py", line 56, in
from . import __check_build
ImportError: cannot import name __check_build
Process finished with exit code 1
First I tried to implement it and got the results above. When I delete this line:
from sklearn.feature_extraction.text import TfidfVectorizer
I get these results:
C:\Python27\python.exe C:/Users/Inspiron/PycharmProjects/untitled8/newWork.py
ChunkParse score:
IOB Accuracy: 95.3%%
Precision: 86.4%%
Recall: 90.8%%
F-Measure: 88.6%%
Traceback (most recent call last):
File "C:/Users/Inspiron/PycharmProjects/untitled8/newWork.py", line 132, in
NP_vectorizer = TfidfVectorizer(lowercase=True, tokenizer=np_tokenizer)
NameError: name 'TfidfVectorizer' is not defined
['This', 'a complex sentence', 'I', 'NP chunking']
What do you advise me to do?
Search Stackoverflow: http://stackoverflow.com/questions/15274696/importerror-in-importing-from-sklearn-cannot-import-name-check-build
(install scipy)
It doesn’t work.
I tried to download scipy but it doesn’t seem to work on my 64-bit system. For the conll2000 dataset: is all of it chunked, or can we also get plain, unchunked data for conll2000? My question is whether I can get data that I can use both with chunking and without chunking, because I have to compare like this:
Phase one: 1) input data set, 2) apply tfidf, 3) implement a few classifiers (do classification), 4) get results
Phase two: 1) input data set, 2) do chunking with noun phrases, 3) apply tfidf, 4) do classification, 5) get results
and compare these two results.
Hi,
I implemented the work that you sent me, thank you very much for your help. Please, can I implement these steps without chunking on this conll2000 data set?
And if possible, can you explain how? I want to prove that noun-phrase chunking based tfidf gives better accuracy than weighting terms without chunking. Thank you very much. Regards.
Just one doubt.
Is the IDF calculation right?
word_idf = np.log(DOCUMENTS_COUNT / (1 + word_idf).astype(float))
This will give a negative Idf value for a word if there are 100 documents and it is present in all 100, or 0 if it is present in 99 documents. Isn’t that counterintuitive?
The fact that it is present in all (or almost all) documents means that the term is not discriminative at all. On the other hand, if the term appears in only one document then we get a high value, log(50).
Hi,
I’m getting an error when I run the last example. Where should this be defined?
I’m using Python 3.6.5, but I don’t see how that would matter.
TIA!
P.S. Having fun with these blog posts! Keep ’em coming!
Greg–
Got it. See how-to-check-if-variable-is-string-with-python-2-and-3-compatibility
It was a Python 3 issue.
Yep, Python 2.7 code. Will add your suggestion, thanks!
Will do Sir 🙂
In word_idf = defaultdict(lambda: 0), defaultdict is not defined.
from collections import defaultdict