Full-text corpus data

from corpus.byu.edu




For more information on the texts and composition of each corpus, click on the icon at the top of that corpus's page.
 
Corpora: texts (95% available in full-text data) and focus / strengths
COCA: Corpus of Contemporary American English
(info on 2012-2015 update)
520 million words / 220,000 texts. US, 1990-2015. Best coverage across genres, from informal to formal: spoken, fiction, magazines, newspapers, academic. The most widely used corpus of English.
COHA: Corpus of Historical American English
400 million words / 107,000 texts. US, 1810-2009. Historical change. 100x as large as the next-largest historical corpus of English.
GloWbE: Global Web-based English
1.9 billion words / 1.8 million texts. 20 countries. About 60% blogs (very informal). Recent: 2013. Comparing varieties of English: American, British, Australian, etc. 100x as large as the next-largest corpus of English dialects.
New full-text data (December 2016)
NOW: News on the Web
4.79 billion words / 6.0+ million texts (as of early December 2016; continually growing). 20 countries. The most up-to-date corpus of English: 4-5 million words added each day (130 million each month, 1.5 billion each year). Wide range of online newspapers and magazines (technology, entertainment, sports, politics, etc.).
Wikipedia Corpus
1.9 billion words / 4.4 million texts. Best corpus for specialized language on an almost unlimited range of topics: science, entertainment, technology, history, sports, etc.
Corpus del Español (Spanish)
2.0 billion words / 2.0 million texts. 21 countries. The largest well-annotated corpus of Spanish. All of the strengths of GloWbE (above), but for Spanish.