Full-text corpus data

from corpus.byu.edu



The full-text corpus data is available in three different formats. When you purchase the data, you purchase the rights to all three formats, and you can download whichever ones you want.

Samples: The sample data linked below is drawn at random from each of the corpora (usually about 1/100th of the total number of texts). No attempt has been made to "clean up" this sample data in any way. If you're happy with the sample data that you download, you should be equally happy with the full data.

Note that the size shown in the first column is the total number of words that you can download after purchasing the data. The sizes in the three format columns are for the samples. Note also that the shared files below (sources and lexicon) cover just the sample texts.

Corpus (size of complete     Shared files        Samples (mw = million words)
full-text data)                                  Database    Word/lemma/PoS    Linear text

COCA (440 million)           Lexicon, Sources    1.7 mw      1.7 mw            1.7 mw
COHA (385 million)           Lexicon, Sources    3.6 mw      3.6 mw            3.6 mw
GloWbE (1.8 billion)         Lexicon, Sources    2.1 mw      2.1 mw            2.1 mw

NOW (4.79 billion)           Lexicon, Sources    1.7 mw      1.7 mw            1.7 mw
Wikipedia (1.8 billion)      Lexicon, Sources    1.8 mw      1.8 mw            1.8 mw
Spanish (1.8 billion)        Lexicon, Sources    2.0 mw      2.0 mw            2.0 mw

Explanation and notes

Database: Most robust format, but requires knowledge of SQL. Allows for powerful JOINs across the corpus, lexicon, and sources tables.

Word/lemma/PoS: Word, lemma, and part of speech in vertical format; can be imported into a database. In most of the corpora, texts are separated by a line with ## and the textID. (In COHA, each text is its own file.)

Linear text: This format provides a textID for each text, and then the entire text on the same line. In this format, words are not annotated for part of speech or lemma. In addition, contracted words like <can't> are separated into two parts (ca n't) and punctuation is separated from words (eye level . As her).
Short sample

Database:
textID    ID          wordID
2002364   153180333   69
2002364   153180334   3
2002364   153180335   978
2002364   153180336   8880
2002364   153180337   8047
2002364   153180338   12
2002364   153180339   3
2002364   153180340   351
2002364   153180341   19630
2002364   153180342   134
2002364   153180343   6720
2002364   153180344   38
2002364   153180345   42
2002364   153180346   3355
2002364   153180347   3923
2002364   153180348   52
2002364   153180349   10985
2002364   153180350   3
2002364   153180351   44306
2002364   153180352   3792
2002364   153180353   22
2002364   153180354   3
2002364   153180355   809
2002364   153180356   449
2002364   153180357   3531
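
A minimal sketch of the kind of JOIN the database format allows, using Python's standard-library sqlite3 module. The table and column names (corpus, lexicon, word, lemma, PoS) and the three lexicon rows are assumptions for illustration, inferred by lining up the database sample with the word/lemma/PoS sample on this page. The split of each row into textID 2002364 plus a running ID is read off the samples in the same way. The schema in the downloaded data may differ.

```python
import sqlite3

# In-memory database with a hypothetical two-table layout (assumed, not the documented schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE corpus  (textID INTEGER, ID INTEGER, wordID INTEGER);
    CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, PoS TEXT);
""")

# First three rows of the database sample.
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)", [
    (2002364, 153180333, 69),
    (2002364, 153180334, 3),
    (2002364, 153180335, 978),
])

# Lexicon rows inferred by aligning the two samples (an assumption).
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (69, "But", "but", "ccb"),
    (3, "the", "the", "at"),
    (978, "huge", "huge", "jj"),
])

def tokens_for_text(conn, text_id):
    """Return (word, lemma, PoS) tuples for one text, in corpus order."""
    rows = conn.execute(
        """
        SELECT l.word, l.lemma, l.PoS
        FROM corpus AS c
        JOIN lexicon AS l ON l.wordID = c.wordID
        WHERE c.textID = ?
        ORDER BY c.ID
        """,
        (text_id,),
    )
    return rows.fetchall()

tokens = tokens_for_text(conn, 2002364)
print(tokens)
```

A further JOIN against a sources table on textID would work the same way, for example to restrict a query to one genre or year, assuming the sources table carries such columns.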
Word/lemma/PoS:
word         lemma        PoS
But          but          ccb
the          the          at
huge         huge         jj
bonus        bonus        nn1
prize        prize        nn1
is           be           vbz
the          the          at
real         real         jj
draw         draw         nn1@
--           --           x
announced    announce     vvn
by           by           ii
an           a            at1
electronic   electronic   jj
display      display      nn1
that         that         cst_dd1
resembles    resemble     vvz
the          the          at
ticking      ticking      jj
wheel        wheel        nn1
on           on           ii
the          the          at
TV           tv           nn1
game         game         nn1
show         show         nn1_vv0
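
A rough sketch of how the vertical word/lemma/PoS format might be read in Python. It assumes that the three columns are tab-separated and that each text starts with a line beginning with ## followed by the textID; the function name parse_wlp is hypothetical, and the delimiter should be checked against the downloaded sample.

```python
from collections import defaultdict

def parse_wlp(lines):
    """Group (word, lemma, PoS) rows by textID.

    Assumes tab-separated columns and '##<textID>' separator lines.
    """
    texts = defaultdict(list)
    current_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            current_id = line[2:].strip()  # new text begins here
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not have exactly three columns
        texts[current_id].append(tuple(fields))
    return dict(texts)

sample = [
    "##2002364\n",
    "But\tbut\tccb\n",
    "the\tthe\tat\n",
    "huge\thuge\tjj\n",
]
parsed = parse_wlp(sample)
print(parsed)
```

For COHA, where each text is its own file, the same row parsing would apply per file, with the textID presumably taken from the file name rather than from a ## line.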
Linear text:
##2002364 But the huge bonus prize is the real draw -- announced by an electronic display that resembles the ticking wheel on the TV game show , placed just above eye level . As her losses mounted to more than $200 , Budz fed the machine $5 tokens , pressing the Spin button almost rhythmically -- no serious slot player touches the pull handle on a one-armed bandit .
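
A small sketch of reading one line of the linear text format in Python: it splits off the textID and returns the tokens, then roughly reverses the tokenization described above (ca n't, eye level .). The function names are hypothetical, and the detokenizer only handles n't and a few punctuation marks; it is an illustration, not a full inverse of the corpus tokenizer.

```python
import re

def parse_linear(line):
    """Split one linear-text line into (textID, list of tokens)."""
    parts = line.strip().split(None, 1)  # textID, then the rest of the line
    if not parts or not parts[0].startswith("##"):
        raise ValueError("expected a line starting with '##'")
    text_id = parts[0][2:]
    tokens = parts[1].split() if len(parts) > 1 else []
    return text_id, tokens

def detokenize(tokens):
    """Reattach n't and common punctuation to the preceding word."""
    text = " ".join(tokens)
    text = re.sub(r" (n't)\b", r"\1", text)     # ca n't -> can't
    text = re.sub(r" ([.,;:!?])", r"\1", text)  # eye level . -> eye level.
    return text

line = "##2002364 But the huge bonus prize is the real draw"
text_id, tokens = parse_linear(line)
print(text_id, tokens[:3])
print(detokenize(["placed", "just", "above", "eye", "level", ".", "As", "her"]))
```

Keeping the tokens separated is usually more convenient for counting and searching; detokenizing is mainly useful when the text has to be displayed to readers.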