Full-text corpus data

from corpus.byu.edu



The full-text corpus data is available in three different formats. When you purchase the data, you purchase the rights to all three formats, and you can download whichever ones you want.

The sample data is taken completely at random from each of the corpora. No attempt has been made to "clean up" the data in any way. If you're happy with the sample data that you download, you should be equally happy with the full data. Note that the sizes shown below are the sizes of the samples, not of the actual downloadable corpora. Note also that the shared files below (sources and lexicon) cover only the sample texts (usually about 1/100th of the total number of texts).

Samples (mw: millions of words)

Samples (shared files)
COCA: lexicon, sources
COHA: lexicon, sources
GloWbE: lexicon, sources
NOW: lexicon, sources
Wikipedia: lexicon, sources
Spanish: lexicon, sources

Sample size by format

Corpus      Database (more information)   Word / lemma / PoS   Word: linear text
GloWbE      2.1 mw                        2.1 mw               2.1 mw
COCA        1.7 mw                        1.7 mw               1.7 mw
COHA        3.6 mw                        3.6 mw               3.6 mw
NOW         1.7 mw                        1.7 mw               1.7 mw
Wikipedia   1.8 mw                        1.8 mw               1.8 mw
Spanish     2.0 mw                        2.0 mw               2.0 mw

Explanation and notes

Database
Most robust format, but requires knowledge of SQL. Allows for powerful JOINs across the corpus, lexicon, and sources tables.

Word / lemma / PoS
Word, lemma, and part of speech in vertical format; can be imported into a database. In most of the corpora, texts are separated by a line with ## and the textID. (In COHA, each text is its own file.)

Word: linear text
This format provides a textID for each text, and then the entire text on the same line. In this format, words are not annotated for part of speech or lemma. In addition, contracted words like <can't> are separated into two parts (ca n't) and punctuation is separated from words (eye level . As her).
Short sample

Database

textID    ID          wordID
2002364   153180333   69
2002364   153180334   3
2002364   153180335   978
2002364   153180336   8880
2002364   153180337   8047
2002364   153180338   12
2002364   153180339   3
2002364   153180340   351
2002364   153180341   19630
2002364   153180342   134
2002364   153180343   6720
2002364   153180344   38
2002364   153180345   42
2002364   153180346   3355
2002364   153180347   3923
2002364   153180348   52
2002364   153180349   10985
2002364   153180350   3
2002364   153180351   44306
2002364   153180352   3792
2002364   153180353   22
2002364   153180354   3
2002364   153180355   809
2002364   153180356   449
2002364   153180357   3531
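
As a rough illustration of the kind of JOIN the database format allows, the sketch below loads the first five corpus rows from the sample into an in-memory SQLite database and joins them to a small lexicon table. The lexicon column names (wordID, word, lemma, PoS) and the wordID-to-word pairs are assumptions: the pairs are inferred by aligning these rows with the Word / lemma / PoS sample of the same text, and the real lexicon file may be laid out differently. The function name text_words is hypothetical.

```python
import sqlite3

# In-memory database; table and column names for the lexicon are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE corpus  (textID INTEGER, ID INTEGER PRIMARY KEY, wordID INTEGER);
CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, PoS TEXT);
""")

# First five rows of the database sample (textID, ID, wordID).
corpus_rows = [
    (2002364, 153180333, 69),
    (2002364, 153180334, 3),
    (2002364, 153180335, 978),
    (2002364, 153180336, 8880),
    (2002364, 153180337, 8047),
]

# Lexicon entries inferred by alignment with the word/lemma/PoS sample (assumption).
lexicon_rows = [
    (69, "But", "but", "ccb"),
    (3, "the", "the", "at"),
    (978, "huge", "huge", "jj"),
    (8880, "bonus", "bonus", "nn1"),
    (8047, "prize", "prize", "nn1"),
]

conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)", corpus_rows)
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", lexicon_rows)


def text_words(connection, text_id):
    """Return the words of one text in corpus order via a corpus-lexicon JOIN."""
    cursor = connection.execute(
        """
        SELECT l.word
        FROM corpus AS c
        JOIN lexicon AS l ON l.wordID = c.wordID
        WHERE c.textID = ?
        ORDER BY c.ID
        """,
        (text_id,),
    )
    return [row[0] for row in cursor]


words = text_words(conn, 2002364)
print(" ".join(words))
```

Ordering by ID is what restores the running text, since each corpus row stores only a position and a wordID. A sources table could be joined on textID in the same way.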

Word / lemma / PoS

word         lemma        PoS
But          but          ccb
the          the          at
huge         huge         jj
bonus        bonus        nn1
prize        prize        nn1
is           be           vbz
the          the          at
real         real         jj
draw         draw         nn1@
--           --           x
announced    announce     vvn
by           by           ii
an           a            at1
electronic   electronic   jj
display      display      nn1
that         that         cst_dd1
resembles    resemble     vvz
the          the          at
ticking      ticking      jj
wheel        wheel        nn1
on           on           ii
the          the          at
TV           tv           nn1
game         game         nn1
show         show         nn1_vv0
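
The vertical word / lemma / PoS format can be read with a few lines of code. The sketch below groups rows by text, using the ## separator lines described above. It assumes the three columns are tab-separated and that a separator line holds only ## and the textID; neither detail is confirmed here, so check the downloaded sample first. The function name parse_wlp is hypothetical.

```python
def parse_wlp(lines):
    """Group (word, lemma, PoS) rows by textID.

    Assumes tab-separated columns and separator lines of the form '##<textID>'.
    """
    texts = {}
    current_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            current_id = line[2:].strip()
            texts.setdefault(current_id, [])
            continue
        parts = line.split("\t")
        if current_id is None or len(parts) != 3:
            continue  # skip rows before the first separator or malformed rows
        texts[current_id].append(tuple(parts))
    return texts


# First six rows of the sample, written with the assumed tab delimiter.
sample_lines = [
    "##2002364",
    "But\tbut\tccb",
    "the\tthe\tat",
    "huge\thuge\tjj",
    "bonus\tbonus\tnn1",
    "prize\tprize\tnn1",
    "is\tbe\tvbz",
]

parsed = parse_wlp(sample_lines)
lemmas = [lemma for _word, lemma, _pos in parsed["2002364"]]
print(lemmas)
```

For COHA, where each text is its own file, the same function could be applied per file after supplying the textID by other means.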

Word: linear text

##2002364 But the huge bonus prize is the real draw -- announced by an electronic display that resembles the ticking wheel on the TV game show , placed just above eye level . As her losses mounted to more than $200 , Budz fed the machine $5 tokens , pressing the Spin button almost rhythmically -- no serious slot player touches the pull handle on a one-armed bandit .
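
A line in the linear text format can be split into its textID and tokens, and the separated contractions and punctuation can be roughly reattached. The sketch below is a heuristic under stated assumptions: it assumes whitespace between the ##textID and the text, and it reattaches only n't and a small set of punctuation marks. The names parse_linear_line, detokenize, and ATTACH_LEFT are hypothetical.

```python
# Tokens glued back onto the preceding token (heuristic, not exhaustive).
ATTACH_LEFT = {".", ",", ";", ":", "!", "?", "n't"}


def parse_linear_line(line):
    """Split '##<textID> <text>' into (textID, list of tokens)."""
    parts = line.strip().split(None, 1)
    if not parts or not parts[0].startswith("##"):
        raise ValueError("expected a line starting with ## and a textID")
    text_id = parts[0][2:]
    tokens = parts[1].split() if len(parts) > 1 else []
    return text_id, tokens


def detokenize(tokens):
    """Rejoin tokens, attaching n't and simple punctuation to the previous token."""
    out = []
    for token in tokens:
        if out and token in ATTACH_LEFT:
            out[-1] += token
        else:
            out.append(token)
    return " ".join(out)


text_id, tokens = parse_linear_line("##2002364 placed just above eye level . As her")
print(text_id, detokenize(tokens))
print(detokenize(["I", "ca", "n't", "go", "."]))
```

Other separated forms (for example possessives or quotation marks) are left untouched by this sketch and would need their own rules.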