ONLINE CORPORA

The following are some of the freely available linguistic corpora created by Mark Davies, Professor of Corpus Linguistics at Brigham Young University.


Corpus name | Number of words | Language / dialect | Time period | Content | Searches / architecture / interface
English

These five corpora now have exactly the same architecture and interface. Users can:

* Search by word, phrase, substring, part of speech (e.g. nouns or verbs), and lemma (e.g. all forms of go: goes, went, etc.)
* Find the collocates (nearby words) of a given word or phrase, which provides insight into the meaning of the word
* Compare the collocates of two words, to see differences in meaning or usage (e.g. collocates of rob vs. steal, warm vs. hot, men vs. women, or Democrats vs. Republicans)
* Compare the collocates across time periods (provides insight into changes in meaning, such as new uses with green)
* Compare the collocates across genres (to show differences in 'word sense', e.g. chair = 'committee leader' (academic) vs. 'piece of furniture' (fiction))
* Order results by Mutual Information Score (shows 'relevance', in addition to raw frequency)
* With integrated thesauruses, find the frequency and distribution of synonyms of a given word (to see which synonyms are most frequent, in which genres they are used most, which are increasing or decreasing in use, etc.)
* Create personalized lists of words and phrases (e.g. for a particular semantic field) and then re-use them as part of subsequent queries
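
The collocate and Mutual Information features in the list above can be illustrated with a small sketch. The page does not state the exact formula the corpora use, so this assumes one common formulation: the score is log2 of the observed co-occurrence count divided by the count expected by chance, where the expected count is f(node) * f(collocate) * window size / corpus size. The function name `collocates_by_mi` and the parameters `span` and `min_freq` are hypothetical and are not part of any corpus interface.

```python
import math
from collections import Counter


def collocates_by_mi(tokens, node, span=4, min_freq=1):
    """Rank words found near `node` by a mutual information score.

    Assumed formula (not taken from the corpus site):
        MI = log2(observed / expected)
        expected = f(node) * f(collocate) * (2 * span) / corpus_size
    """
    n = len(tokens)
    word_freq = Counter(tokens)

    # Count every word that appears within `span` tokens of the node word.
    co_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - span), min(n, i + span + 1)
        for j in range(lo, hi):
            if j != i:
                co_freq[tokens[j]] += 1

    scores = {}
    for word, observed in co_freq.items():
        if observed < min_freq:
            continue  # skip rare pairs, whose scores are unreliable
        expected = word_freq[node] * word_freq[word] * (2 * span) / n
        scores[word] = math.log2(observed / expected)

    # Highest score first: these are the most strongly associated collocates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy token list, purely for illustration.
sample = ["a", "b", "a", "c", "a", "b"]
ranked = collocates_by_mi(sample, "a", span=1)
```

Ordering by raw frequency alone tends to favor very common words such as "the". A score of this kind instead favors words that occur near the node more often than their overall frequency would predict, which is why it serves as a measure of relevance.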

The Corpus of Contemporary American English (COCA) | 385 million | American English | 1990-present | 20 million words each year, 1990-present. Equally divided into spoken, fiction, popular magazine, newspaper, and academic. Will be updated at least two times a year.
BYU-BNC: The British National Corpus | 100 million | British English | ~1980s-1993 | 90 million words written (fiction, newspaper, academic, etc.); 10 million spoken. [Website for the original BNC]
TIME Magazine | 100 million | American English | 1923-present | More than 275,000 articles from TIME Magazine. Wide range of topics: news, sports, business, culture, health, entertainment, etc.
Other languages      
Corpus del Español | 100 million | Spanish | 1200s-1900s | 20 million words 1900s, 20 million 1800s, 40 million 1500s-1700s, 20 million 1200s-1400s
Corpus do Português | 45 million | Portuguese | 1300s-1900s | 20 million words 1900s, including spoken, fiction, newspaper, and academic, equally divided between Brazil and Portugal. 10 million 1800s, 15 million 1300s-1700s
Corpus del Español: Registers | 20 million | Spanish | 1900s | Enhanced version of the 1900s component of the Corpus del Español. Equally divided between spoken, fiction, newspaper, and academic. Compare frequency of 110+ grammatical constructions in twenty different registers.
BYU-only (limited to on-campus use by BYU students and faculty)
Oxford English Dictionary (OED) [SEARCH] | 37 million | - | Old English-1900s | 2.2 million quotations in the Oxford English Dictionary. | Find the frequency of words, phrases, substrings, and constructions in each century since Old English. Results can be restricted by frequency limits in any century.
EEBO / LION [SEARCH] | 700 million | - | 1500s-1900s | Early English Books Online (1500s-1600s; 350 million words) and Literature Online (mainly 1700s-1800s; 350 million words). | Basic interface to these corpora. Find the frequency by decade and century for words, phrases, and substrings.
LDS General Conferences | 23 million | - | 1851-present | Every General Conference talk from 1851 to the current time. | Basic interface to these corpora. Find the frequency by decade for words, phrases, and substrings.