Recipe: Text clustering using NLTK and scikit-learn
A simple recipe for text clustering. Text sometimes causes trouble in scikit-learn because its features are sparse.
```python
import string
import collections

from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint


def process_text(text, stem=True):
    """ Tokenize text and stem words removing punctuation """
    # Python 3 form of the original Python 2 call
    # text.translate(None, string.punctuation)
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)

    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]

    return tokens


def cluster_texts(texts, clusters=3):
    """ Transform texts to Tf-Idf coordinates and cluster texts using K-Means """
    vectorizer = TfidfVectorizer(tokenizer=process_text,
                                 stop_words=stopwords.words('english'),
                                 max_df=0.5,
                                 min_df=0.1,
                                 lowercase=True)

    tfidf_model = vectorizer.fit_transform(texts)
    km_model = KMeans(n_clusters=clusters)
    km_model.fit(tfidf_model)

    # Map each cluster label to the indices of the texts assigned to it
    clustering = collections.defaultdict(list)

    for idx, label in enumerate(km_model.labels_):
        clustering[label].append(idx)

    return clustering


if __name__ == "__main__":
    articles = [...]  # placeholder: put your own document strings here
    clusters = cluster_texts(articles, 7)
    pprint(dict(clusters))
```
Can I use Levenshtein distance instead of Euclidean distance in this code? Please help. It's very urgent.
How do you define the Levenshtein distance between 2 documents?
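For context, here is a minimal sketch of what that question implies. Levenshtein distance is defined on pairs of strings, so treating whole documents as strings yields a pairwise distance matrix rather than the vector coordinates the recipe feeds to KMeans (scikit-learn's KMeans has no option for a custom distance). The helper names below are illustrative and not part of the recipe.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (classic dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(texts):
    """Symmetric matrix of pairwise edit distances (hypothetical helper)."""
    n = len(texts)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(texts[i], texts[j])
    return matrix
```

A matrix like this could be handed to an algorithm that accepts precomputed distances, for example DBSCAN with `metric="precomputed"`. The cost grows with the product of the two document lengths, so this is likely only practical for short strings.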
Can I evaluate the clustered text using F-score?
I strongly advise revisiting basic machine learning concepts. F-score (the harmonic mean of precision and recall) only makes sense for supervised machine learning. Clustering is a form of unsupervised machine learning: you don't "know" what the correct solution is.
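For reference, the F-score mentioned here is computed from counts that only exist when the correct answers are known. A small sketch, with illustrative function and argument names:

```python
def f_score(true_positives, false_positives, false_negatives):
    """F1 score: harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

All three inputs come from comparing predictions with known labels, which is exactly the information a plain clustering run does not have.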
I have 5 columns of text data in an Excel sheet, with a list of industries in every column.
How can I cluster them?
Hi Daivik,
Off the top of my head, you need to transform each industry into a feature: col1_industry1, col1_industry5, col2_industry2 …
Then use a DictVectorizer and perform a normal clustering.
Bogdan.
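A rough sketch of the feature-naming step described in this reply, assuming each spreadsheet row has already been read into a list of cell strings. The function name and the sample rows are made up for illustration.

```python
def rows_to_feature_dicts(rows):
    """Turn each row into a dict of {"col<k>_<industry>": 1} features."""
    feature_dicts = []
    for row in rows:
        features = {}
        for col_index, industry in enumerate(row, start=1):
            industry = industry.strip()
            if industry:  # skip empty cells
                features["col%d_%s" % (col_index, industry)] = 1
        feature_dicts.append(features)
    return feature_dicts


# Made-up rows standing in for the 5-column sheet
rows = [
    ["Banking", "Insurance", "", "Retail", "Telecom"],
    ["Banking", "Retail", "Energy", "", "Telecom"],
]
feature_dicts = rows_to_feature_dicts(rows)
```

A list of dicts in this shape is what `sklearn.feature_extraction.DictVectorizer().fit_transform(...)` accepts as input, and the matrix it returns can be passed to KMeans in the same way as the Tf-Idf matrix in the recipe.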
I have a text file with a list of phrases from reviews. Can I cluster them? Can you please help me with this?
Yes, you can cluster them. Just follow the instructions in this tutorial.
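As a starting point, here is a minimal sketch for getting such a file into the list of strings that `cluster_texts` expects. It assumes one phrase per line; the file name `phrases.txt` and the helper name are hypothetical.

```python
import io


def load_phrases(file_obj):
    """Return the non-empty, stripped lines of an open text file."""
    return [line.strip() for line in file_obj if line.strip()]


# In practice: with open("phrases.txt", encoding="utf-8") as f: ...
# An in-memory file stands in for the real one here.
sample = io.StringIO("great battery life\n\nscreen too dim\n")
phrases = load_phrases(sample)
```

The resulting list can be passed as `texts` to `cluster_texts`. With very short phrases, the recipe's `min_df=0.1` and `max_df=0.5` settings may discard much of the vocabulary, so those values might need adjusting.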
Very interesting indeed! Thank you very much for the great tutorial. I would also be interested in plotting the results. Any hints you could give me here? Thank you in advance.
Have a look at the t-SNE algorithm. You can find it in sklearn as well.
Hi, there.
How would you go about visualising the clusters in a 2D scatterplot? Is that possible from this resulting dictionary?
Thanks!
Hey Craig,
Yeah, check out the t-SNE algorithm for displaying the data in 2D.
Thanks,
Bogdan.
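On the second half of the question: the dictionary returned by `cluster_texts` only holds document indices per cluster, not coordinates, so the 2D points would have to come from reducing the Tf-Idf matrix (for example with `sklearn.manifold.TSNE`), which the function would also need to return. What the dictionary can supply is one label per document for colouring the points. A sketch with a hypothetical helper name:

```python
def labels_from_clustering(clustering, n_documents):
    """Invert {cluster_label: [doc_index, ...]} into one label per document."""
    labels = [None] * n_documents
    for label, indices in clustering.items():
        for idx in indices:
            labels[idx] = label
    return labels


# Example with a made-up clustering of five documents
clustering = {0: [0, 3], 1: [1, 4], 2: [2]}
labels = labels_from_clustering(clustering, 5)
```

A list like `labels` can then be used as the colour argument of a scatter plot, e.g. `c=labels` in matplotlib's `scatter`.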
Hi,
I have thousands of queries in Excel rows, one query per row. Can I cluster those queries into some meaningful intents using this?
Thanks in advance,
Veerendra
Yes, absolutely 🙂 Make sure you pick a meaningful distance measure (you might need to experiment with this) and the number of intents.
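As one example of a distance measure to experiment with, cosine distance on word counts is a common choice for short texts. The sketch below is only an illustration using whitespace tokens; the function name is made up.

```python
import collections
import math


def cosine_distance(text_a, text_b):
    """1 - cosine similarity of whitespace-token count vectors."""
    counts_a = collections.Counter(text_a.lower().split())
    counts_b = collections.Counter(text_b.lower().split())
    # Counter returns 0 for tokens missing from counts_b
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

With the recipe's default Tf-Idf settings the rows are L2-normalised, so the Euclidean distance used by K-Means already ranks pairs of documents the same way cosine distance does.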
Hi,
Normally translate() takes only ONE parameter. How did you manage to pass two of them? If I run this line by itself with a variable of string type, I get an error that says TypeError: translate() takes exactly one argument (2 given)
translate used to have a different API in Python 2: https://docs.python.org/2/library/string.html#string.translate
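On Python 3, the same punctuation stripping can be written with `str.maketrans`, whose third argument lists characters to delete. A minimal sketch (the function name is illustrative):

```python
import string


def strip_punctuation(text):
    """Python 3 equivalent of the Python 2 call text.translate(None, string.punctuation)."""
    return text.translate(str.maketrans("", "", string.punctuation))


cleaned = strip_punctuation("Hello, world! It's 5 o'clock.")
```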
Hi,
I am taking the code and running it on my end to see the results and to see if I can use it for my problem. I am running into an error saying:
File "<ipython-input-1-75e2ebe43bc8>", line 2
articles = [...]
^
SyntaxError: invalid syntax
And when I edit the line to articles = [] and run it, it gives me a
LookupError Traceback (most recent call last)
<ipython-input-1-60eaf331b4c0> in <module>()
if __name__ == "__main__":
articles = []
----> clusters = cluster_texts(articles, 7)
pprint(dict(clusters))
Any insight as to what is happening?
“…” is a placeholder telling you to put some articles in that list. I didn’t want to take up the article space with some news articles or whatever. Just put some strings in there.