corpus.byu.edu

corpora, size, queries = better resources, more insight



In addition to the regular corpus interface, there is a wide range of other corpus-based resources, some of which allow you to download large amounts of data for offline use.
 

Full-text NEW!

Download 440 million words of full-text data for COCA (190,000 texts), or 1.8 billion words for GloWbE (1,800,000 texts). With this data, you will have the texts from the corpora on your own computer, rather than having to use the web interface. The data comes in three formats: relational database, word/lemma/PoS (vertical format), and text (linear format).
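As a rough illustration of how the vertical and linear formats relate, here is a minimal Python sketch. It assumes one token per line with tab-separated word, lemma, and PoS columns; the actual column layout and tag set of the downloadable files may differ, and the sample data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample in "vertical" format: one token per line, with
# tab-separated word, lemma, and PoS columns (assumed layout and tags).
SAMPLE = "The\tthe\tat\ncats\tcat\tnn2\nwere\tbe\tvbdr\nsleeping\tsleep\tvvg\n"


def parse_vertical(text):
    """Return a list of (word, lemma, pos) tuples, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


def to_linear(rows):
    """Rebuild the linear (plain text) form from the word column."""
    return " ".join(word for word, _, _ in rows)


rows = parse_vertical(SAMPLE)
linear_text = to_linear(rows)

# Counting the lemma column groups inflected forms under one entry.
lemma_counts = Counter(lemma for _, lemma, _ in rows)
```

The vertical form keeps the annotation next to each token, which is what makes lemma-based or PoS-based counts straightforward; the linear form is easier to read or to feed to other tools.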

Word and Phrase
(analyze texts)

Enter entire texts to see detailed frequency information on the words in the text, and create word lists based on your text. Click through the words to see detailed information on any word. Highlight phrases in your text and search for related phrases in COCA.

Word and Phrase
(frequency lists)

Search and browse the most complete frequency dictionary of English. See detailed information for each word, all on one page -- definition, frequency by genre, collocates (nearby words), concordance lines, synonyms, and WordNet-related words, all with useful links from one resource to another.

Word Frequency
100,000 list

Download free lists, including the top 5,000 lemmas. Other lists show the frequency of the top 60,000 lemmas by genre (and sub-genre). You can also download an integrated 100,000-word list from COCA, COHA, BNC, and SOAP -- the largest corrected frequency list of English.

Collocates

Download lists with the top 200-300 collocates (nearby words) for 60,000 different lemmas -- 4,300,000 node/collocate pairs in all.
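To make "collocates (nearby words)" concrete, the following Python sketch counts the words that occur within a fixed window around a node word. The window size, the use of raw tokens rather than lemmas, and the function name are assumptions made for illustration; this is not necessarily how the downloadable lists were built.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words within `window` positions to either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # do not count the node occurrence itself
                counts[tokens[j]] += 1
    return counts


tokens = "strong tea and strong coffee but weak tea".split()
tea_collocates = collocates(tokens, "tea", window=2)
top_collocate = tea_collocates.most_common(1)
```

Each (node, collocate) key with its count corresponds to one node/collocate pair of the kind the lists contain.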

N-grams

Download free lists containing the top 1,000,000 2-grams (two-word sequences), 3-grams, 4-grams, and 5-grams in COCA. There are also other lists that contain the frequency of all 2-, 3-, and 4-grams (up to 155 million rows of data).
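For readers new to n-grams, this short Python sketch shows how n-word sequences can be extracted from a token list and ranked by frequency. The toy sentence and the function names are hypothetical and only illustrate the idea; the lists themselves are precomputed from COCA.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(k)


tokens = "the cat sat on the mat and the cat slept".split()
trigrams = ngrams(tokens, 3)
top_bigram = top_ngrams(tokens, 2, 1)
```

A frequency list of 2-grams is then a table of such tuples with their counts, sorted from most to least frequent.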

Academic Vocabulary

Download free lists from the 120 million words of COCA-Academic texts, including academic words grouped by word families, lists of "core" academic English, and "technical" word lists for the nine domains of COCA-Academic (e.g. Law, Medicine, or Business).

Word and Phrase
(academic)

Similar to the two other Word and Phrase resources, but limited strictly to the 120 million words of COCA-Academic. Get detailed information on words and phrases, frequency by sub-genre, and concordances and collocates in just the academic genre. You can also analyze entire academic texts that you input.