English-Corpora.org

INSIGHT INTO VARIATION

The corpora from www.english-corpora.org allow research on variation -- historical, between dialects, and between genres -- in ways that are not possible with other corpora. This is due to at least three factors:

1. CORPORA: texts from a wide range of genres, dialects, and time periods -- not just a huge "blob" of billions of words of easily obtainable newspapers or web pages. With a "blob" corpus of that kind, you might have information on a linguistic feature in just one genre, in one country, at one time period, and miss out on the richness and variety of language.

2. SIZE: the corpora are 100-200 times as large as otherwise similar corpora, so they potentially yield many more tokens, and yet they remain very fast to search.

3. QUERIES: our proprietary corpus architecture and interface are designed "from the ground up" to allow comparisons of different portions of the corpus (time periods, dialects, and genres).

HISTORICAL VARIATION (1820s-2010s)
COHA: 475 million words, 1820s-2010s. 100-200 times as large as any other structured historical corpus of English.

HISTORICAL VARIATION (recent: 1990-2019)
COCA: 1 billion words, 1990-2019. The only large corpus that keeps the same genre balance year to year.

HISTORICAL VARIATION (Google Books)
Google Books (Advanced): 155 billion words, 1810s-2000s. Much more advanced interface/searches than the standard Google Books n-grams.

VARIATION BETWEEN DIALECTS: compare 20 dialects of World English
GloWbE: 1.9 billion words, 20 different countries. 100 times as large as the next-largest corpus of English dialects.

VARIATION BETWEEN GENRES: American (COCA)
COCA: 1 billion words, 1990-2019. The largest genre-balanced corpus that is currently freely available.

VARIATION BETWEEN GENRES: British (BNC)
BNC: 100 million words, 1980s-1993. Note: somewhat lower counts than COCA, since the BNC is a much smaller corpus.
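
Because the BNC is about a tenth the size of COCA, raw counts from the two corpora are not directly comparable. One common way to compare them is to normalize each count to a rate per million words. The following is a minimal sketch of that arithmetic, not a description of how the site itself computes frequencies. The corpus sizes are the approximate figures quoted above; the function name `per_million` and the raw counts are hypothetical and chosen only for illustration.

```python
# Approximate corpus sizes in words, taken from the figures quoted on this page.
CORPUS_SIZES = {
    "COCA": 1_000_000_000,
    "BNC": 100_000_000,
}


def per_million(raw_count: int, corpus: str) -> float:
    """Return occurrences per million words for a raw count in the named corpus."""
    return raw_count * 1_000_000 / CORPUS_SIZES[corpus]


# Hypothetical raw counts for some word or phrase (illustrative, not real corpus data).
coca_hits = 5_000
bnc_hits = 800

coca_rate = per_million(coca_hits, "COCA")
bnc_rate = per_million(bnc_hits, "BNC")

print(f"COCA: {coca_rate:.1f} per million")
print(f"BNC:  {bnc_rate:.1f} per million")
```

In this made-up case the BNC has far fewer raw hits (800 versus 5,000), yet the normalized rate is higher in the BNC, which is why a lower raw count in the smaller corpus does not by itself mean the feature is rarer there.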