English-Corpora.org

INSIGHT INTO VARIATION

The corpora from www.english-corpora.org allow research on variation -- historical, between dialects, and between genres -- in ways that are not possible with other corpora. This is due to at least three factors:

1. CORPORA: texts from a wide range of genres, dialects, and time periods -- not just a huge "blob" of billions of words of easily-obtainable newspapers or web pages. With such a blob, you might have information on a linguistic feature in just one genre, in one country, at one time period, and miss out on the richness and variety of language.

2. SIZE: the corpora are 100-200 times as large as otherwise similar corpora, so they potentially yield many more tokens, and yet queries are still very fast.

3. QUERIES: our proprietary corpus architecture and interface are designed "from the ground up" to allow comparisons of different portions of the corpus (time periods, dialects, and genres).

Note: click on any link on this page to see the corpus data, and then click on the "BACK" image at the top of the page to come back to this page.

HISTORICAL VARIATION (1820s-2010s)
COHA: 475 million words, 1820s-2010s. 100-200 times as large as any other structured historical corpus of English.

Lexical: the frequency of any word or phrase, e.g. bestow, swell (ADJ), guys, of no little, as though to, freak out
Lexical: compare all words in different time periods, e.g. *ism words (compare earlier/later), *heart* words (earlier/later)
Phraseology: so ADJ as to V, BE but, HAVE quite V-ed, a most ADJ NOUN
Syntax/grammar: e.g. end up V-ing, post-verbal negation with need, need to VERB, sentence initial hopefully, get passive
Semantics/meaning: use collocates to see change over time, e.g. gay (compare earlier/later), chip, engine, web
Discourse/culture: use collocates to see what we're saying about topics over time: women (compare earlier/later), religion (earlier/later)
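
The earlier/later comparison of wildcard matches such as *ism can also be tried offline on text you have tokenized yourself. The sketch below is not the site's implementation: the function names are hypothetical, the token lists are invented toy data rather than corpus results, and the smoothing constant is an assumption added only to avoid division by zero. It matches a wildcard with Python's fnmatch and ranks words by the ratio of their per-million frequencies in the two periods.

```python
from collections import Counter
from fnmatch import fnmatchcase

def per_million(count, total_tokens):
    """Normalize a raw count to a frequency per million tokens."""
    return count * 1_000_000 / total_tokens

def compare_periods(earlier, later, pattern, smoothing=0.5):
    """Rank words matching `pattern` by their later/earlier frequency ratio.

    `earlier` and `later` are lists of lowercase tokens (hypothetical input).
    `smoothing` is added to every raw count so that a word missing from
    one period does not cause a division by zero.
    """
    c_early = Counter(t for t in earlier if fnmatchcase(t, pattern))
    c_late = Counter(t for t in later if fnmatchcase(t, pattern))
    rows = []
    for word in set(c_early) | set(c_late):
        f_early = per_million(c_early[word] + smoothing, len(earlier))
        f_late = per_million(c_late[word] + smoothing, len(later))
        rows.append((word, f_late / f_early))
    # Highest ratio first: words that are relatively more frequent later.
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Invented toy data, not real corpus counts.
earlier = "the heroism and patriotism of the age of heroism".split()
later = "racism and terrorism and the racism of the age".split()
ranked = compare_periods(earlier, later, "*ism")
for word, ratio in ranked:
    print(f"{word:12s} {ratio:.2f}")
```

Using a ratio of normalized frequencies, rather than raw counts, keeps the comparison meaningful when the two periods contain different amounts of text.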

HISTORICAL VARIATION (recent: 1990-2019)
COCA: 1 billion words, 1990-2019. The only large corpus that keeps the same genre balance year to year.

Lexical: the frequency of any word or phrase, e.g. morph, old-school, FREAK out, (think) outside the box, throw someone under the bus, BE likely a|the
Lexical: compare all words in different time periods, e.g. increases from 1990-94 (left) to 2010-2019 (right): *ism words, *gate words (potentially "scandal"), *friendly words (note increase), and phrasal verbs with up. Note that not every entry is relevant, but it's a good starting point.
Syntax/grammar: e.g. END up V-ing, GET passive (got hired), "quotative like" (he's like, I'm not going), so not ADJ (I'm so not interested in her)
Semantics/meaning: use collocates to see change over time, e.g. green, web, engine
Discourse/culture: changes in frequency: blacks, retarded; use collocates to see what we're saying about topics over time: crisis, terror, gay
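
The collocate comparisons listed above (for example green, web, engine) rest on a simple idea: count the words that occur near a node word in each period and contrast the two lists. The following is a minimal sketch of that idea under stated assumptions, not the interface's actual method: the function names, the window size, and the two toy sentences are all hypothetical, and a real study would use frequency thresholds or an association measure rather than a plain set difference.

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        counts.update(left + right)
    return counts

def distinctive(earlier, later, node, window=4):
    """Return collocates seen in only one of the two periods (a crude contrast)."""
    c_early = collocates(earlier, node, window)
    c_late = collocates(later, node, window)
    only_early = sorted(set(c_early) - set(c_late))
    only_late = sorted(set(c_late) - set(c_early))
    return only_early, only_late

# Invented toy sentences, not corpus data.
earlier = "the spider spun a web of silk".split()
later = "users browse the web on a phone".split()
only_early, only_late = distinctive(earlier, later, "web", window=2)
print("earlier only:", only_early)
print("later only:  ", only_late)
```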

HISTORICAL VARIATION (Google Books)
Google Books (Advanced): 155 billion words, 1810s-2000s. Much more advanced interface/searches than the standard Google Books n-grams.

Lexical: the frequency of any word or phrase, e.g. BESTOW, a swell NOUN (chart), guys, of no little, as though to, FREAK out
Lexical: compare all words in different time periods, e.g. *ism words (compare earlier/later), *heart* words (earlier/later)
Phraseology: so ADJ as to VERB (table), [be] but a NOUN (table), HAVE quite V-ed, a most ADJ NOUN (table)
Syntax/grammar: e.g. [end] up VERB-ing (chart | table), VERB someone into VERB-ing (chart | table), VERB one's way PREP (e.g. force his way into), and who / whom + did + PRON (e.g. who/whom did you (VERB); see chart showing increase in who). Also, must VERB, should VERB, ought to VERB, has to VERB, or need to VERB.
Semantics/meaning: synonyms: "beautiful" woman, "clever" person; collocates show change in meaning, e.g. gay (compare earlier/later)
Discourse/culture: changes in frequency: negro, colored person, blacks, deaf and dumb, retarded, handicapped; use collocates to see what we're saying about topics over time (1800s vs 1970s-2000s): fast, art, women, music, food

VARIATION BETWEEN DIALECTS: compare 20 dialects of World English
GloWbE: 1.9 billion words, 20 different countries. 100 times as large as the next-largest corpus of English dialects.

Lexical: the frequency of any word or phrase, e.g. fortnight, on holiday, banjax*, bikkies, thrice, eve teas*, ACT the maggot, lah!, ackee
Lexical: compare all words in different dialects, e.g. *ism words by dialect ("core" vs. South Asia), *ies nouns in Australian
Phraseology: e.g. BE different to, rather more ADJ, take ADJ food, in over ~ head, USE ~ head, MAKE ~ head spin
Syntax/grammar: VERB likely VERB (e.g. would likely remember), like construction, way construction, try and VERB, go + ADJ, STOP someone V-ing
Semantics/meaning: use collocates to see differences between dialects, e.g. scheme (US/CA = negative), cupboards (US/CA = mainly kitchen)
Discourse/culture: frequency of words, e.g. Quran, Buddh*, feminism. With collocates, e.g. ADJ belief (South Asia vs "core"), ADJ wife (+/- "core")

VARIATION BETWEEN GENRES: American (COCA)
COCA: 1 billion words, 1990-2019. The largest genre-balanced corpus that is freely available.

Lexical: the frequency of any word or phrase, e.g. (spoken) I guess, ", you know ,"; (fiction) muffled, frowned; (academic) validity, correlate
Lexical: compare all words in different genres (give these 10-15 seconds each to run), e.g. verbs (past tense) in fiction, ADJ in academic, verbs in religion magazines, adjectives in medical academic
Phraseology: e.g. ". In particular ,", a lot of, kind of NOUN, type of NOUN; phrasal verbs with out (FIC/ACAD)
Syntax/grammar: (spoken) "and I'm like ,", get passive, end up V-ing; (fiction) had been V-ing; (academic) be passive, appear to VERB, must + VERB
Semantics/meaning: use collocates to see differences between genres, e.g. FIC (left) vs ACAD (right): chair, chain, string; synonyms of strong, weak
Discourse/culture: frequency of words and phrases, e.g. global warming, climate change, crippled, people|person of color

VARIATION BETWEEN GENRES: British (BNC)
BNC: 100 million words, 1980s-1993. Note: raw counts are somewhat lower than in COCA, since the BNC is a much smaller corpus.

Lexical: the frequency of any word or phrase, e.g. (spoken) I reckon, ", you know ,"; (fiction) muffled, frowned; (academic) validity, correlate
Lexical: compare all words in different genres, e.g. verbs (past tense) in fiction, ADJ in academic, verbs in sermons, ADJ in tabloid news
Phraseology: e.g. ". In particular ,", a lot of, kind of NOUN, type of NOUN; phrasal verbs with out (FIC/ACAD)
Syntax/grammar: (spoken) get passive, BE V-ing; (fiction) had been V-ing; (academic) be passive, appear to V, HAVE to VERB, whom
Semantics/meaning: use collocates to see differences between genres, e.g. FIC (left) vs ACAD (right): chair, chain, string; synonyms of strong, weak
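
Because the BNC is about one tenth the size of COCA, raw counts from the two corpora are not directly comparable; a common remedy is to normalize to occurrences per million words. The sketch below illustrates the arithmetic only. The corpus sizes are the round figures given on this page, while the raw counts and the names CORPUS_SIZES and per_million are hypothetical, chosen for illustration.

```python
# Corpus sizes in words, using the round figures stated on this page.
CORPUS_SIZES = {"COCA": 1_000_000_000, "BNC": 100_000_000}

def per_million(raw_count, corpus):
    """Convert a raw count into a frequency per million words."""
    return raw_count * 1_000_000 / CORPUS_SIZES[corpus]

# Hypothetical raw counts for the same phrase in each corpus.
raw = {"COCA": 5_000, "BNC": 800}
normalized = {corpus: per_million(n, corpus) for corpus, n in raw.items()}

for corpus, freq in normalized.items():
    print(f"{corpus}: {raw[corpus]} raw, {freq:.1f} per million")
```

In this invented case the phrase has the lower raw count in the BNC but the higher per-million frequency, which is why normalized figures are the safer basis for cross-corpus comparison.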