NLP corpora

can be used to access the contents of a diverse set of corpora

datasets on every topic

paid datasets

list of free/public domain datasets for NLP

list of datasets/corpora for NLP

datasets for NLP

datasets for all fields

a big collection of free books that can be retrieved in plain text

the 100 top-ranked books

A big sample of English words

Google’s one billion words corpus

a large collection of conversations extracted from raw movie scripts

lists of datasets

BBC Datasets

open data for deep learning

NLTK corpora

an annotated list of resources

Alphabetical list of free/public domain datasets with text data for NLP

wikipedia list of text datasets

What are the major text corpora used by computational linguists and NLP

Sentiment Analysis lexicons and datasets

question answering corpora

machine learning repository of datasets

N-grams datasets

language datasets only for university researchers

optimized for quick inquiries into the usage of small sets of phrases

tool to download the Google corpora

A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books

data set containing Google Books n-gram corpora in Hadoop format

Google Research Blog

N-gram counts and language models from the CommonCrawl

data to accompany the chapter Natural Language Corpus Data from the book Beautiful Data

Linguistic Data Consortium corpora (paid for non-members)

based on the Corpus of Contemporary American English (COCA)

the Corpus of Contemporary American English (COCA)

build corpora

people will score reviews for sentiment

labelling data (manually by people)

courses

how to build your own corpora
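
As a rough illustration of building a small corpus of your own, the sketch below reads every plain-text file in a folder, tokenizes it, and counts n-grams. It uses only the Python standard library. The function names (tokenize, build_corpus, ngram_counts), the regular-expression tokenizer, and the one-document-per-.txt-file layout are assumptions made for this example and are not taken from any of the resources listed above.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a deliberately simple tokenizer (lowercase letters, digits, apostrophes).
# Real corpora usually need a language-aware tokenizer instead.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_corpus(directory):
    """Map each .txt file name in `directory` to its list of tokens."""
    corpus = {}
    for path in sorted(Path(directory).glob("*.txt")):
        corpus[path.name] = tokenize(path.read_text(encoding="utf-8"))
    return corpus


def ngram_counts(corpus, n=2):
    """Count n-grams inside each document, never across document boundaries."""
    counts = Counter()
    for tokens in corpus.values():
        # zip over n shifted copies of the token list yields consecutive n-tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

Calling build_corpus on a folder of .txt files and passing the result to ngram_counts returns a Counter, so something like ngram_counts(corpus, n=2).most_common(10) would list the ten most frequent bigrams. Counting per document is a design choice that keeps the last word of one file from pairing with the first word of the next.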