Natural Language Processing Corpora
Learning, practicing, and experimenting with Natural Language Processing is hard partly because freely available corpora are scarce. Building a gold-standard corpus is seriously hard work, which is why such resources are rare or cost a lot of money. In this post, I’m going to aggregate some cool resources, some very well known, some a bit under the radar.
General
- Brown – Categorized and part-of-speech tagged corpus – available in NLTK:
nltk.corpus.brown
- Reuters – Categorized corpus – available in NLTK:
nltk.corpus.reuters
- CoNLL2000 – part of speech and chunk annotated corpus – available in NLTK:
nltk.corpus.conll2000
- CoNLL2002 – NER and part of speech and chunk annotated corpus – available in NLTK:
nltk.corpus.conll2002
- Information Extraction and Entity Recognition Corpus – NER annotated corpus – available in NLTK:
nltk.corpus.ieer
- Wordnet – large lexical database of English – available in NLTK:
nltk.corpus.wordnet
- 20 Newsgroups data set – Categorized corpus – available in Scikit-learn:
sklearn.datasets.fetch_20newsgroups
- American National Corpus – General corpus with various annotations, including part of speech, named entity, and shallow parsing. The POS annotations are available in NLTK:
nltk.corpus.masc_tagged
- Groningen Meaning Bank (GMB) – NER and part of speech annotated corpus
- WikiCorpus – The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia
- text8 – Cleaned up Wikipedia articles by Matt Mahoney
- webtext – User generated content on the web – available in NLTK:
nltk.corpus.webtext
- gutenberg – Text from the Gutenberg Project – available in NLTK:
nltk.corpus.gutenberg
- inaugural – US Presidential Inaugural Addresses – available in NLTK:
nltk.corpus.inaugural
- genesis – Bible text – available in NLTK:
nltk.corpus.genesis
- abc – Australian Broadcasting Commission 2006 – available in NLTK:
nltk.corpus.abc
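Most of the NLTK entries above follow the same pattern: download the data package once, then read it through its `nltk.corpus` accessor. Here is a minimal sketch using the Brown corpus. The helper names `tag_frequencies` and `brown_news_tag_counts` are made up for this example, and it assumes the `nltk` package is installed and the machine can reach the network for the one-time download.

```python
from collections import Counter


def tag_frequencies(tagged_words):
    """Count how often each tag occurs in an iterable of (word, tag) pairs."""
    return Counter(tag for _word, tag in tagged_words)


def brown_news_tag_counts(top_n=5):
    """Download the Brown corpus and return its most common POS tags in 'news'.

    Assumes nltk is installed and network access is available for the
    one-time data download.
    """
    import nltk  # imported here so tag_frequencies works without NLTK

    nltk.download("brown", quiet=True)
    from nltk.corpus import brown

    tagged = brown.tagged_words(categories="news")
    return tag_frequencies(tagged).most_common(top_n)
```

Calling `brown_news_tag_counts()` should return a list of `(tag, count)` pairs. The same `tag_frequencies` helper should also work on the tagged words of the other POS-annotated corpora in the list, such as `nltk.corpus.conll2000`.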
Sentiment Analysis
- IMDB Movie Reviews – 50,000 annotated IMDB movie reviews
- Multi-Domain Sentiment Dataset – contains product reviews taken from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics
- Opinion Lexicon – Curated list of positive/negative words – available in NLTK:
nltk.corpus.opinion_lexicon
- UMICH SI650 – Sentiment Classification on Kaggle – Positive/Negative Sentiment annotated sentences
- Sanders Analytics Twitter Sentiment Corpus – 5513 hand-classified tweets
- SentiWordNet – Polarity annotated Wordnet Synsets – available in NLTK:
nltk.corpus.sentiwordnet
- Movie Reviews – 2000 Sentiment annotated movie reviews – available in NLTK:
nltk.corpus.movie_reviews
- Twitter Samples – Sentiment annotated tweets – available in NLTK:
nltk.corpus.twitter_samples
- Subjectivity Dataset – 5000 subjective and 5000 objective processed sentences – available in NLTK:
nltk.corpus.subjectivity
- Opinion Dataset – Miscellaneous Opinion annotated datasets
- Twitter airline sentiment on Kaggle – What travelers expressed about their adventures with the airlines on Twitter in February 2015
- Amazon Fine food Reviews
- First GOP Debate Twitter Sentiment – Analyze tweets on the first 2016 GOP Presidential Debate
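The categorized sentiment corpora bundled in NLTK, such as `nltk.corpus.movie_reviews`, can be turned into (text, label) pairs for training a classifier. The sketch below is one way to do it, not the only one. `labeled_documents` and `movie_review_documents` are hypothetical helper names, and the second function assumes `nltk` is installed and the data can be downloaded.

```python
def labeled_documents(reader):
    """Return (raw_text, category) pairs from a categorized corpus reader.

    `reader` is expected to expose categories(), fileids(category) and
    raw(fileid), as NLTK's categorized corpus readers do.
    """
    return [
        (reader.raw(fileid), category)
        for category in reader.categories()
        for fileid in reader.fileids(category)
    ]


def movie_review_documents():
    """Download the Movie Reviews corpus and return its labeled documents.

    Assumes nltk is installed and network access is available for the
    one-time data download.
    """
    import nltk  # imported here so labeled_documents works without NLTK

    nltk.download("movie_reviews", quiet=True)
    from nltk.corpus import movie_reviews

    return labeled_documents(movie_reviews)
```

Each returned pair holds the full review text and its category, which for this corpus is either `pos` or `neg`.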
Spam/Not-Spam
Social Media
- OSU Twitter NLP Tools – Contains POS, Chunk and NER annotated tweets
- Tweebank – Twitter CoNLL-like annotated data: documentation
- Sanders Analytics Twitter Sentiment Corpus – 5513 hand-classified tweets
- Twitter Samples – Sentiment annotated tweets – available in NLTK:
nltk.corpus.twitter_samples
- Twitter airline sentiment on Kaggle – What travelers expressed about their adventures with the airlines on Twitter in February 2015
- First GOP Debate Twitter Sentiment
Fake News
- Getting Real about Fake News – Kaggle
- Fake News Challenge
- BuzzFeed Partisan Sites
- Fake News Corpus – curated list of 1001 domains from opensources.co
- Liar Liar Pants on Fire – Using POLITIFACT API
Miscellaneous
- Simpsons scripts – script lines for approximately 600 Simpsons episodes, dating back to 1989
- Enron Email Dataset – 0.5M email messages.
- Ubuntu Chat Corpus
Other Languages
- CoNLL2002 – NER and part of speech and chunk annotated corpus – Spanish and Dutch – available in NLTK:
nltk.corpus.conll2002
- CoNLL2007 – dependency relations annotated corpus – Italian Language – available in NLTK:
nltk.corpus.conll2007
- WikiCorpus – The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia
As I’ve mentioned repeatedly, a lot of resources are bundled in NLTK. You can also consult the full list here: NLTK Corpora.
I’m really keen to keep this list up to date, so if you know a cool corpus I should include here, please leave a comment or use the Contact Form.
How are you bogdani? good job with the post. can u post codes on how to leverage the combination of LSA and any anomaly detection technique in large corpus for detecting text anomaly.
Thanks
Hit me with some resources
Hi! Sorry for the late response, was a lil busy. I will definitely send u the Conceptual model of the proposed program as soon as am done designing it. Plus can I have your social media id or handle so I can chat with u better. Tnx for the usual post. It’s great to have brilliant minds like you
Hi bogdani,
Thanks for sharing good resource.
But I need Spanish language corpus. Specifically for q&a machine reading comprehension. Do we have any such?
Thanks.
Hi Bogdani . Thanks for sharing such good consolidated resources.
Do you happen to have unlabeled SMS dataset( text SMS dataset, not preprocessed ) ? I am trying to categorize SMS/messages into few category i.e. Promotional, E-Commerce etc.
If you know about such dataset, can you mail me the link.
Does this help? https://www.kaggle.com/uciml/sms-spam-collection-dataset