corpus.byu.edu

corpora, size, queries = better resources, more insight


Updates (May 2016)

SIZE

Corpus size matters a great deal for the richness of the corpus data. A one-million-word corpus is extremely limited in the phenomena it can be used to study, especially compared to a 400-million-word corpus, which may offer 400 times as much data.

The following examples show the importance of size for just two types of searches: low-level grammatical constructions and collocates. In each case, we show the number of tokens in the BNC (100 million words), COCA (520 million words), and GloWbE (1.9 billion words). Now imagine a corpus of only 1 million words: it would contain virtually no tokens of any of these phenomena. (More information on the importance of size for historical corpora and data)

Note: click on any link on this page to see the corpus data, and then click on "RETURN" in the upper right-hand corner of the corpus to come back to this page.

GRAMMATICAL CONSTRUCTIONS

Construction                                      Example                           BNC  COCA  GloWbE
[have] been being [v?n*]                          had been being considered         2    15    149
[vv*] + me into [v?g*] (grouped by lemma; >=3)    coerce + me into going            4    19    76
[love] for [p*] to [v*]                           (I'd) love for him to help (us)   2    106   900
[vv*] [ap*] way [i*] [a*] [nn*] (# strings >= 3)  pushed his way through the crowd  16   103   848
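
To make the point about a 1-million-word corpus concrete, the counts above can be scaled down in proportion to corpus size. The following is a rough sketch in Python, not part of the corpus interface. It assumes that token counts scale linearly with corpus size, which is a simplification (real frequencies also vary by genre, dialect, and period). It uses only the two rows that report plain token counts, since the other two rows are grouped by lemma or by string. The names CORPUS_SIZES, COUNTS, and expected_tokens are illustrative.

```python
# Corpus sizes in words, as stated in the text above.
CORPUS_SIZES = {"BNC": 100_000_000, "COCA": 520_000_000, "GloWbE": 1_900_000_000}

# Token counts copied from the table above (plain token-count rows only).
COUNTS = {
    "[have] been being [v?n*]": {"BNC": 2, "COCA": 15, "GloWbE": 149},
    "[love] for [p*] to [v*]": {"BNC": 2, "COCA": 106, "GloWbE": 900},
}


def expected_tokens(count, corpus_size, target_size=1_000_000):
    """Scale an observed count to a corpus of target_size words.

    Assumption: tokens are spread evenly, so counts scale linearly with size.
    """
    return count * target_size / corpus_size


for construction, by_corpus in COUNTS.items():
    for name, count in by_corpus.items():
        estimate = expected_tokens(count, CORPUS_SIZES[name])
        print(f"{construction:28} {name:7} {estimate:.3f} expected tokens per 1M words")
```

Under this assumption, every estimate comes out below one token per million words, which is consistent with the claim that a 1-million-word corpus would show virtually none of these constructions.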

COLLOCATES (see Excel file with hundreds of examples)

Corpus size is crucial for finding collocates (nearby words, which provide valuable insight into meaning and usage). The following chart shows the number of collocates for a small sample of "node" words. Many words are reasonably frequent as node words in the BNC (e.g. 166 tokens of browse) but have very few collocates, where a collocate is counted only if it occurs five times or more within a span of four words to the left (4L) through four words to the right (4R). The larger the corpus, the richer the collocates: a corpus like GloWbE (at nearly two billion words) provides much more insight than a (now) "small-ish" corpus like the BNC.

Node (PoS) + collocate (PoS)  Example      BNC (node)  COCA (node)  BNC (coll)  COCA (coll)  GloWbE
Verb + Noun                   browse       166         995          2           74           878
Noun + Adjective              stewardship  169         1415         0           39           123
Adjective + Noun              outlandish   97          745          0           22           168
Adverb + Verb                 rightfully   69          747          1           19           202
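
The collocate criterion described above (five or more occurrences within a span of 4L to 4R) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method the corpus site uses: it works on a plain list of word forms, whereas the corpus interface searches lemmas and part-of-speech tags. The function name collocates and its parameters are hypothetical.

```python
from collections import Counter


def collocates(tokens, node, span=4, min_freq=5):
    """Return words found within `span` tokens left or right of `node`,
    keeping only those seen at least `min_freq` times across all occurrences."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words to the left
        right = tokens[i + 1:i + 1 + span]     # up to `span` words to the right
        counts.update(left + right)
    return {word: n for word, n in counts.items() if n >= min_freq}


# Toy usage on an invented token list (not real corpus data).
toy_tokens = ("you can browse the web here " * 5).split()
print(collocates(toy_tokens, "browse"))
```

The min_freq threshold is why small corpora yield so few collocates: a node word with only a few dozen tokens rarely has any single neighbor that recurs five times.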