Natural Language Processing Corpora

One of the reasons it’s so hard to learn, practice and experiment with Natural Language Processing is the lack of available corpora. Building a gold standard corpus is seriously hard work, which is why such resources are scarce or cost a lot of money. In this post, I’m going to aggregate some cool resources, some very well known, some a bit under the radar.


  • Brown – Categorized and part of speech tagged annotated corpus – available in NLTK: nltk.corpus.brown
  • Reuters – Categorized corpus – available in NLTK: nltk.corpus.reuters
  • CoNLL2000 – part of speech and chunk annotated corpus – available in NLTK: nltk.corpus.conll2000
  • CoNLL2002 – NER, part of speech and chunk annotated corpus – available in NLTK: nltk.corpus.conll2002
  • Information Extraction and Entity Recognition Corpus – NER annotated corpus – available in NLTK: nltk.corpus.ieer
  • WordNet – large lexical database of English – available in NLTK: nltk.corpus.wordnet
  • 20 Newsgroups data set – Categorized corpus – available in Scikit-learn: sklearn.datasets.fetch_20newsgroups
  • Groningen Meaning Bank (GMB) – NER and part of speech annotated corpus
  • text8 – Cleaned up Wikipedia articles by Matt Mahoney
  • webtext – User generated content on the web – available in NLTK: nltk.corpus.webtext
  • gutenberg – Text from Project Gutenberg – available in NLTK: nltk.corpus.gutenberg
  • inaugural – US Presidential Inaugural Addresses – available in NLTK: nltk.corpus.inaugural
  • genesis – Bible text – available in NLTK: nltk.corpus.genesis
  • abc – Australian Broadcasting Commission 2006* – available in NLTK:
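
Most of the NLTK corpora above follow the same access pattern: download the data package once, then read it through the matching reader in nltk.corpus. Here is a minimal sketch using the Brown corpus, assuming NLTK is installed and the NLTK data server is reachable. The helper names tag_distribution and load_brown_news are mine, made up for illustration.

```python
from collections import Counter


def tag_distribution(tagged_words, top_n=5):
    """Return the top_n most frequent tags in an iterable of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_words).most_common(top_n)


def load_brown_news():
    """Download the Brown corpus if needed and return its tagged 'news' words."""
    # Imported lazily so tag_distribution can be used without NLTK installed.
    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)  # skipped when the data is already present
    return brown.tagged_words(categories="news")
```

Running print(tag_distribution(load_brown_news())) should list the most common part of speech tags in the news category. The other readers work in a similar way, though not every one of them exposes tagged_words or categories, so check what each reader offers before relying on it.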

Sentiment Analysis

  • Multi-Domain Sentiment Dataset – contains product reviews taken from 4 product types (domains): Kitchen, Books, DVDs, and Electronics
  • Opinion Lexicon – Curated list of positive/negative words – available in NLTK: nltk.corpus.opinion_lexicon
  • UMICH SI650 – Sentiment Classification on Kaggle – Positive/Negative Sentiment annotated sentences
  • Sanders Analytics Twitter Sentiment Corpus – 5513 hand-classified tweets
  • SentiWordNet – Polarity annotated WordNet Synsets – available in NLTK: nltk.corpus.sentiwordnet
  • Movie Reviews – 2000 Sentiment annotated movie reviews – available in NLTK: nltk.corpus.movie_reviews
  • Twitter Samples – Sentiment annotated tweets – available in NLTK: nltk.corpus.twitter_samples
  • Subjectivity Dataset – 5000 subjective and 5000 objective processed sentences – available in NLTK: nltk.corpus.subjectivity
  • Opinion Dataset – Miscellaneous Opinion annotated datasets
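
For the categorized sentiment corpora that ship with NLTK, a common first step is to turn the reader into a plain list of (tokens, label) pairs that any classifier can consume. The sketch below does that for Movie Reviews, assuming NLTK is installed and the data can be downloaded. The function names labeled_documents and load_movie_reviews are hypothetical helpers, not part of NLTK.

```python
def labeled_documents(corpus):
    """Build (tokens, label) pairs from a categorized corpus reader.

    Only relies on the reader's categories(), fileids() and words() methods.
    """
    return [
        (list(corpus.words(fileid)), category)
        for category in corpus.categories()
        for fileid in corpus.fileids(category)
    ]


def load_movie_reviews():
    """Download Movie Reviews if needed and return its labeled documents."""
    # Imported lazily so labeled_documents can be used without NLTK installed.
    import nltk
    from nltk.corpus import movie_reviews

    nltk.download("movie_reviews", quiet=True)
    return labeled_documents(movie_reviews)
```

Each item returned by load_movie_reviews() pairs the tokens of one review with its category label, which is a convenient starting point for feature extraction.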

Social Media


  • Simpsons scripts – script lines for approximately 600 Simpsons episodes, dating back to 1989

Other Languages

  • CoNLL2007 – dependency relations annotated corpus – Italian Language – available in NLTK: nltk.corpus.conll2007

As I’ve mentioned repeatedly, a lot of these resources come bundled with NLTK. You can also consult the full list here: NLTK Corpora.

