Full-text corpus data

from corpus.byu.edu


 Purchase data 

Overview
Corpora
Using the data
Formats / samples
   Database/SQL

Limitations: "10/200"

Related sites
  Word frequency
  Collocates
  N-grams
  WordAndPhrase
  Academic vocabulary
  corpus.byu.edu

Contact us


Database is one of the three data formats. When you purchase the data, you purchase rights to all three formats, and you can download whichever ones you want.

This format is composed of three tables:

  1. [corpus]: one line for each word in the corpus, showing: the textID, the ID (offset value: 293, 294, 295, etc), and then an integer value for the word (wordID).

  2. [lexicon]: information on each wordID: word (walked), lemma (walk), and part of speech (vvd).

  3. [sources]: information on each textID: genre or country, source, title, etc,

This format allows you to carry out extremely powerful queries on the corpora. You can search by word form, lemma, or part of speech. You can also limit the search to just certain texts in the corpus via JOINs on the [sources] table (e.g. just Magazines-Sports in COCA, Great Britain in GloWbE, fiction from the 1870s in COHA, a particular website in NOW, or pages with a given word in the title in Wikipedia), . You can also modify the [lexicon] table to add whatever additional features you would like for each wordID, or modify the [sources] table to create your own sub-corpora.

corpus table lexicon table sources table
textID ID wordID
2040250110933753 848
2040250 1109337543
2040250 110933755560
2040250110933756459620
2040250 1109337576891
2040250 11093375810
20402511109337797140251
2040251 11093378011850
2040251 110933781 187678
2040251110933782 26
204025111093378319957
2040251 11093378419
2040251 11093378541
204025111093378664
2040251110933787 160
2040251110933788 4
20402511109337894155
wordID word lemma PoS
71186swabswab nn1
77653swagswag nn1
36155swaggerswagger nn1
62562Swahiliswahili jj_nn1
44807Swainswain np1
13782swallowswallow vvi
37384swallowswallow vv0
44376swallowswallow nn1@
73593swallowswallow nn1@_vv0
16873swallowed swallowvvd
22661swallowed swallowvvn
36293swallowed swallowvvd_vvn
23040swallowing swallowvvg
39247swallows swallowvvz
57594swallows swallownn2
14960swamswim vvd
15768swampswamp nn1
COCA
textID genre sub-genre year sourceTitle textTitle
2040250MAGReligion 2007Source_A Title_L
2040251MAGSports 2007Source_B Title_M
2040252NEWSFinancial 2012Source_C Title_N
2040253ACADScience 1993Source_D Title_O
2040254FICMovieScript 1997Source_E Title_P
2040255SPOKCNN 2001Source_F Title_Q
2040256NEWSFinancial 2012Source_G Title_R

COHA
textID Year genre sourceTitle textTitle
7282821837 FIC Source_ASampleTitle_N
7282831872 FIC Source_BSampleTitle_0
7282841904 NF Source_CSampleTitle_P
7282851938 MAG Source_DSampleTitle_Q
7282861959 NEWS Source_ESampleTitle_R
7282871987 MAG Source_FSampleTitle_S

GloWbE
textID country genre url textTitle
3282569AU BLOG SampleURL_A SampleTitle_N
3282570IN BLOG SampleURL_B SampleTitle_0
3282571US GENL SampleURL_C SampleTitle_P
3282572GB BLOG SampleURL_D SampleTitle_Q
3282573IE GENL SampleURL_E SampleTitle_R
3282574NZ GENL SampleURL_F SampleTitle_S
3282575SG BLOG SampleURL_G SampleTitle_T

There are similar tables for NOW, Wikipedia, and Spanish
 

NOTE: The database format assumes that you know SQL (Structured Query Language), and that you can create the tables, populate them with the downloaded data, and (most importantly) run the SQL queries to extract the data. Please do not use this format unless you are well-acquainted with databases and SQL. The following are just a handful of SQL queries that you can run on the data, but there is of course no limit to what you can do.

 

1. Find 1000 most frequent nouns in (COCA) ACAD-Science

select count(*),lex.word
from lexicon as lex, sources, corpus where
sources.genre = 'ACAD' and sources.sub-genre = 'Science' and
sources.textID = corpus.textID and
lex.pos like 'nn%' and
lex.wordID = corpus.wordID
group by lex.word
order by count(*) desc

2. Find top 500 strings of get V-ed (e.g. got married, gets paid)
(using "runtime" self-join on corpus; much faster with multi-column table; see below)

select count(*),lex1.word, lex2.word
from lexicon as lex1, lexicon as lex2, corpus as corpus1, corpus as corpus where
lex1.lemma = 'get' and
lex2.pos like 'v_n%' and
lex1.wordID = corpus1.ID and
lex2.wordID = corpus.ID and
corpus.ID = corpus1.ID + 1
group by lex1.word, lex2.word
order by count(*) desc

3. Find top 500 3-grams, with point in the second position
(using "runtime" self-join on corpus; much faster with multi-column table; see below)

select count(*),lex1.word, lex2.word, lex3.word
from lexicon as lex1, lexicon as lex2, lexicon as lex3, corpus as corpus1, corpus as corpus, corpus as corpus3 where
lex2.word = 'point' and lex2.wordID = corpus.wordID and
lex1.wordID = corpus1.wordID and
lex3.wordID = corpus3.wordID and
group by lex1.word, lex2.word, lex3.word
order by count(*) desc


Note: rather than using self-joins (as in #2 and 3 above) the BYU corpus architecture has tables like that shown below. The [w5] column here corresponds to the [wordID] column in the [corpus] table above, but a massive self-join has been done on this table (as the corpus was created; not as each query is run) to create "adjacent" [w1]-[w4] and [w6]-[w9] columns. As a result, the four preceding and four following words are already on the row when one searches [w5]. With the full-text data, you can create similar tables yourself.

w1 w2 w3 w4 w5 w6 w7 w8 w9
43 3 858 5 432 3319 9 132 2876
3 858 5 432 3319 9 132 2876 3643
858 5 432 3319 9 132 2876 3643 5
5 432 3319 9 132 2876 3643 5 1729
432 3319 9 132 2876 3643 5 1729 72
3319 9 132 2876 3643 5 1729 72 43
9 132 2876 3643 5 1729 72 43 21887
132 2876 3643 5 1729 72 43 21887 746929
2876 3643 5 1729 72 43 21887 746929 676
3643 5 1729 72 43 21887 746929 676 62900

This allows for much faster queries (than self-joins at SQL runtime). For example, to find the most frequent collocates for a given word, the SQL query would be:

4. Find top 200 noun collocates of break as a verb, in the four "slots" after break (columns w6-w9 above).

select top 200 count(*),w6 from (
SELECT x.w6 FROM corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID UNION ALL
SELECT x.w7 FROM corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID UNION ALL
SELECT x.w8 FROM corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID UNION ALL
SELECT x.w9 FROM corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID
) a, lexicon b where
b.pos like 'nn%' and
a.word2 = b.wordID
group by a.w6
order by count(*) desc

On a fairly fast machine, this will only take about two seconds for COCA (440 million words)