corpus.byu.edu

corpora, size, queries = better resources, more insight



INSIGHT INTO VARIATION

The corpora from corpus.byu.edu allow research on variation -- historical, between dialects, and between genres -- in ways that are not possible with other corpora. This is due to at least three factors:

1) CORPORA: texts from a wide range of genres, dialects, and time periods -- not just hundreds of millions of words of easily obtainable newspapers or web pages, which would give you information on a linguistic feature in just one genre, in one country, at one time period.
2) SIZE: the corpora are 100-200 times as large as other (otherwise) similar corpora, and so they potentially yield many more tokens (and yet they are still very fast).
3) QUERIES: our proprietary corpus architecture and interface are designed "from the ground up" to allow comparisons of different portions of the corpus (time periods, dialects, and genres).

Note: click on any link on this page to see the corpus data, and then click on "RETURN" in the upper right-hand corner of the corpus to come back to this page.

If you are mainly interested in single words, remember that you can download a 100,000-word list that shows the frequency of each word by genre (COCA, BNC, SOAP), by time period (COHA), and by dialect (COCA and BNC). [ Samples: Web, Excel ]
 


HISTORICAL VARIATION (1810s-2000s)
COHA: 400 million words, 1810s-2000s. 100-200 times as large as any other structured historical corpus of English.


HISTORICAL VARIATION (recent: 1990-2012)
COCA: 450 million words, 1990-2012. The only large corpus that keeps the same genre balance year to year (more...)


HISTORICAL VARIATION (Google Books)
Google Books (Advanced): 155 billion words, 1810s-2000s. Much more advanced interface/searches than the standard Google Books n-grams.


VARIATION BETWEEN DIALECTS: compare 20 dialects of World English
GloWbE: 1.9 billion words, 20 different countries. 100 times as large as the next-largest corpus of English dialects (more...)


VARIATION BETWEEN DIALECTS: compare two corpora
COCA and BYU-BNC; comparing two corpora from different countries (more...)


VARIATION BETWEEN GENRES: American (COCA)
COCA: 450 million words, 1990-present. The largest genre-balanced corpus that is currently freely available.


VARIATION BETWEEN GENRES: British (BNC)
BYU-BNC: 100 million words, 1980s-1993. Note: expect somewhat lower counts than in COCA, since the BNC is a much smaller corpus.
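
Because the two corpora differ in size, raw counts are not directly comparable. One common way to handle this is to normalize each count to a rate per million words; a minimal sketch follows. The corpus sizes are the approximate figures given on this page. The function name `per_million` and the raw counts are hypothetical, invented for illustration, and are neither actual corpus data nor part of the corpus interface.

```python
# Approximate corpus sizes in words, as stated on this page.
CORPUS_SIZES = {"COCA": 450_000_000, "BYU-BNC": 100_000_000}


def per_million(raw_count: int, corpus_size: int) -> float:
    """Convert a raw token count into a frequency per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical raw counts for a single word (made up for illustration only).
raw_counts = {"COCA": 9_000, "BYU-BNC": 2_500}

# Normalize each raw count by the size of the corpus it came from.
normalized = {
    name: per_million(count, CORPUS_SIZES[name])
    for name, count in raw_counts.items()
}

for name, rate in normalized.items():
    print(f"{name}: {raw_counts[name]} tokens, {rate:.1f} per million words")
```

With these made-up numbers the word has the lower raw count in the BNC but the higher rate per million words, which is why normalized rates rather than raw counts are the safer basis for a cross-corpus comparison.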