Natural Language Processing Corpora

One reason Natural Language Processing is so hard to learn, practice and experiment with is the lack of freely available corpora. Building a gold standard corpus is seriously hard work, which is why such resources are scarce or cost a lot of money. In this post, I’m going to aggregate some cool resources, some very well known, some a bit under the radar.

General

  • Brown – Categorized and part-of-speech tagged corpus – available in NLTK: nltk.corpus.brown
  • Reuters – Categorized corpus – available in NLTK: nltk.corpus.reuters
  • CoNLL2000 – part-of-speech and chunk annotated corpus – available in NLTK: nltk.corpus.conll2000
  • CoNLL2002 – NER, part-of-speech and chunk annotated corpus (Spanish and Dutch) – available in NLTK: nltk.corpus.conll2002
  • Information Extraction and Entity Recognition Corpus – NER annotated corpus – available in NLTK: nltk.corpus.ieer
  • Wordnet – large lexical database of English – available in NLTK: nltk.corpus.wordnet
  • 20 Newsgroups data set – Categorized corpus – available in Scikit-learn: sklearn.datasets.fetch_20newsgroups
  • American National Corpus – General corpus with various annotations, including part of speech, named entities and shallow parsing. The POS annotations are available in NLTK: nltk.corpus.masc_tagged
  • Groningen Meaning Bank (GMB) – NER and part of speech annotated corpus
  • WikiCorpus – The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia
  • text8 – Cleaned up Wikipedia articles by Matt Mahoney
  • webtext – User generated content on the web – available in NLTK: nltk.corpus.webtext
  • gutenberg – Text from the Gutenberg Project – available in NLTK: nltk.corpus.gutenberg
  • inaugural – US Presidential Inaugural Addresses – available in NLTK: nltk.corpus.inaugural
  • genesis – Bible text – available in NLTK: nltk.corpus.genesis
  • abc – Australian Broadcasting Commission 2006 – available in NLTK: nltk.corpus.abc
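
To make the NLTK entries above a bit more concrete, here is a minimal sketch of how one of them could be loaded and inspected. The helper names load_nltk_corpus and summarize are my own, not part of NLTK, and the sketch assumes a reader that exposes fileids() and words(), which is true of brown, reuters, webtext, gutenberg, inaugural, genesis and abc, but not of every reader (ieer and wordnet work differently).

```python
from collections import Counter


def load_nltk_corpus(name):
    """Fetch the data package for `name` and return its NLTK corpus reader.

    Hypothetical helper: assumes NLTK is installed and that `name` is both
    a downloadable package id and an attribute of nltk.corpus (e.g. "brown").
    """
    import nltk  # imported lazily so summarize() works without NLTK installed

    nltk.download(name, quiet=True)  # fetches the package if missing or stale
    return getattr(nltk.corpus, name)


def summarize(reader, top_n=5):
    """Return basic statistics for any reader exposing fileids() and words()."""
    # Keep alphabetic tokens only and lowercase them before counting.
    words = [w.lower() for w in reader.words() if w.isalpha()]
    return {
        "files": len(reader.fileids()),
        "tokens": len(words),
        "top_words": Counter(words).most_common(top_n),
    }
```

Calling summarize(load_nltk_corpus("brown")) should give a quick feel for the size of the corpus and its most frequent words. The first call needs network access to download the data.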

Sentiment Analysis

Spam/Not-Spam

Social Media

Fake News

Miscellaneous

Other Languages

  • CoNLL2002 – NER, part-of-speech and chunk annotated corpus (Spanish and Dutch) – available in NLTK: nltk.corpus.conll2002
  • CoNLL2007 – dependency relations annotated corpus – Italian Language – available in NLTK: nltk.corpus.conll2007
  • WikiCorpus – The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia
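
As a rough illustration of what NER annotations like those in CoNLL2002 look like in practice, the sketch below groups IOB-tagged tokens into entity spans. It assumes sentences shaped as lists of (word, pos, iob) triples, which is what the iob_sents() method of the NLTK reader returns. The function name extract_entities and the sample sentence are mine, for illustration only.

```python
def extract_entities(iob_sent):
    """Group (word, pos, iob) triples into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for word, _pos, iob in iob_sent:
        # "B-PER" -> ("B", "PER"); a bare "O" -> ("O", "")
        prefix, _, label = iob.partition("-")
        # A new entity starts on B-, or on an I- whose type differs.
        starts_new = prefix == "B" or (prefix == "I" and label != current_type)
        if current and (prefix == "O" or starts_new):
            entities.append((" ".join(current), current_type))
            current, current_type = [], None
        if prefix in ("B", "I"):
            current.append(word)
            current_type = label
    if current:  # flush an entity that runs to the end of the sentence
        entities.append((" ".join(current), current_type))
    return entities


# Hand-written sample in the same shape; the POS tags are placeholders.
sample = [
    ("Juan", "NP", "B-PER"),
    ("Carlos", "NP", "I-PER"),
    ("vive", "VMI", "O"),
    ("en", "SP", "O"),
    ("Madrid", "NP", "B-LOC"),
]
print(extract_entities(sample))
```

With NLTK and the conll2002 data installed, the same function should work on real sentences, for example those from nltk.corpus.conll2002.iob_sents("esp.train").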

As I’ve said more than once, a lot of these resources come bundled with NLTK. You can also consult the full list here: NLTK Corpora.

I’m really keen to keep this list up to date, so if you know of a cool corpus I should include here, please leave a comment or use the Contact Form.