corpus.byu.edu

corpora, size, queries = better resources, more insight



INSIGHT INTO VARIATION

The corpora from corpus.byu.edu allow research on variation -- historical, between dialects, and between genres -- in ways that are not possible with other corpora. This is due to at least three factors:

1. CORPORA: texts from a wide range of genres, dialects, and time periods -- not just hundreds of millions of words of easily obtainable newspapers or web pages. With a corpus of only newspapers or web pages, you might have information on a linguistic feature in just one genre in one country at one time period, which would be a fairly narrow view of language.

2. SIZE: the corpora are 100-200 times as large as (otherwise) similar corpora, and so they potentially yield many more tokens (and yet they are still very fast).

3. QUERIES: our proprietary corpus architecture and interface are designed "from the ground up" to allow comparisons of different portions of the corpus (time periods, dialects, and genres).


If you are mainly interested in single words, remember that you can download a 100,000-word list that shows the frequency of each word by genre (COCA, BNC, SOAP), by time period (COHA), and by dialect (COCA and BNC).
 


HISTORICAL VARIATION (1810s-2000s)
COHA: 400 million words, 1810s-2000s. 100-200 times as large as any other structured historical corpus of English.


HISTORICAL VARIATION (recent: 1990-2012)
COCA: 520 million words, 1990-2015. The only large corpus that keeps the same genre balance year to year.


HISTORICAL VARIATION (Google Books)
Google Books (Advanced): 155 billion words, 1810s-2000s. Much more advanced interface/searches than the standard Google Books n-grams.


VARIATION BETWEEN DIALECTS: compare 20 dialects of World English
GloWbE: 1.9 billion words, 20 different countries. 100 times as large as the next-largest corpus of English dialects.


VARIATION BETWEEN DIALECTS: compare two corpora
COCA and BYU-BNC: compare two corpora from different countries. Note that these comparisons between corpora use the older (pre-May 2016) version of the corpora. We may adapt them for the new interface at some point in the future.


VARIATION BETWEEN GENRES: American (COCA)
COCA: 520 million words, 1990-present. The largest genre-balanced corpus that is currently freely available.


VARIATION BETWEEN GENRES: British (BNC)
BYU-BNC: 100 million words, 1980s-1993. Note: raw counts are somewhat lower than in COCA, since the BNC is a much smaller corpus.
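
Because the two corpora differ in size, raw counts are easier to compare once they are converted to a rate per million words. The following is a minimal sketch of that arithmetic, not part of the corpus interface. The corpus sizes are the ones given on this page. The function name `per_million` and the example counts are hypothetical and are used only for illustration.

```python
# Corpus sizes in words, as stated on this page.
CORPUS_SIZES = {
    "COCA": 520_000_000,
    "BNC": 100_000_000,
}


def per_million(raw_count: int, corpus: str) -> float:
    """Convert a raw token count into a frequency per million words."""
    return raw_count * 1_000_000 / CORPUS_SIZES[corpus]


# Hypothetical raw counts for one search string (NOT real corpus data).
example_counts = {"COCA": 5_200, "BNC": 1_500}

# Normalize each raw count by the size of the corpus it came from.
normalized = {
    name: per_million(count, name) for name, count in example_counts.items()
}

for name, rate in normalized.items():
    print(f"{name}: {rate:.2f} per million words")
```

In this made-up example the raw count is lower in the BNC, yet the per-million rate is higher, which is why normalized rates rather than raw counts are usually the safer basis for comparing corpora of different sizes.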