corpus.byu.edu

corpora, size, queries = better resources, more insight



SPEED

For very large corpora, Sketch Engine (which is based on Corpus Workbench and CQPWeb) is just about the fastest corpus architecture available. Our architecture, however, is even faster -- about six times as fast, on average, for "string searches" like those shown below. This means that with GloWbE, for example, you might spend 5 minutes doing a series of searches, whereas the same searches would take about 30 minutes (an extra 25 minutes of waiting for results) in a similar-sized corpus in Sketch Engine.

The following data is based on the 1.9 billion word GloWbE corpus and a 2.7 billion word corpus in Sketch Engine (enTenTen08 = 3.3 billion tokens, including punctuation, etc.). Since enTenTen08 is about 50% larger (2.7 vs 1.9 billion words), each search should take about 50% longer there. But in fact, it takes much longer than that. For example, the first search shown below -- [have] quite [vvn*] -- takes about 2.6 seconds in GloWbE. Allowing for the 50% larger size of enTenTen08, it should take about 3.9 seconds there (2.6 x 1.5). In fact, though, it takes about 25 seconds: 11 seconds for the concordance lines (SE1) plus 14 seconds to find and sort the node words (SE2). That is about 6.4 times the size-adjusted GloWbE time (25 / 3.9), and this ratio is the "Faster (x)" figure given for each search.

Note: click on any link on this page to see the corpus data, and then click on "RETURN" in the upper right-hand corner of the corpus to come back to this page.

 
GloWbE query                Sketch Engine query (enTenTen08)                                           GloWbE (s)  SE1 (s)  SE2 (s)  Faster (x)
[have] quite [vvn*]         [lemma = "have"] [word = "quite"] [tag = "VVN"]                            2.6         11       14       6.4
several [nn*]               [word = "several"] [tag = "NN."]                                           3.3         12       75       17.6
I [vv*] if                  [word = "I"] [tag = "VV."] [word = "if"]                                   5.7         24       29       6.2
just [vv*] [p*] [vv*] that  [word = "just"] [tag = "VV."] [tag = "PP$"] [tag = "VV."] [word = "that"]  5.5         36       5        5.0
[j*] places                 [tag = "AJ"] [word = "places"]                                             3.6         14       31       8.3
in no [nn*]                 [word = "in"] [word = "no"] [tag = "NN."]                                  4.9         14       7        2.9
to only [v*]                [word = "to"] [word = "only"] [tag = "VV."]                                5.0         21       5        3.5
[vv*] [p*] into [v?g*]      [tag = "VV."] [tag = "PP"] [word = "into"] [tag = "V.G"]                   5.0         30       8        5.1
[r*] [vv*] whether          [tag = "RB"] [tag = "VV."] [word = "whether"]                              3.0         26       14       8.9
[go] [j*]                   [lemma = "go"] [tag = "JJ"]                                                6.8         14       28       4.1

Average                                                                                                                              6.8
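The "Faster (x)" figures can be reproduced from the three timings in each row. The sketch below assumes the calculation is (SE1 + SE2) / (GloWbE time x 1.5), where 1.5 is the size adjustment for enTenTen08 described in the text. That formula is inferred from the worked example for [have] quite [vvn*], not stated explicitly on this page, and the names `TIMINGS`, `SIZE_FACTOR` and `speedup` are illustrative only.

```python
# Timings in seconds, copied from the table: (GloWbE query, GloWbE, SE1, SE2).
TIMINGS = [
    ("[have] quite [vvn*]", 2.6, 11, 14),
    ("several [nn*]", 3.3, 12, 75),
    ("I [vv*] if", 5.7, 24, 29),
    ("just [vv*] [p*] [vv*] that", 5.5, 36, 5),
    ("[j*] places", 3.6, 14, 31),
    ("in no [nn*]", 4.9, 14, 7),
    ("to only [v*]", 5.0, 21, 5),
    ("[vv*] [p*] into [v?g*]", 5.0, 30, 8),
    ("[r*] [vv*] whether", 3.0, 26, 14),
    ("[go] [j*]", 6.8, 14, 28),
]

# Assumption: enTenTen08 is treated as 50% larger than GloWbE, as in the text.
SIZE_FACTOR = 1.5


def speedup(glowbe, se1, se2, size_factor=SIZE_FACTOR):
    """Total Sketch Engine time divided by the size-adjusted GloWbE time."""
    return (se1 + se2) / (glowbe * size_factor)


# One factor per row, rounded to one decimal place as in the table.
factors = [round(speedup(g, s1, s2), 1) for _, g, s1, s2 in TIMINGS]
average = round(sum(factors) / len(factors), 1)

for (query, *_), factor in zip(TIMINGS, factors):
    print(f"{query:<28}{factor:>5.1f}")
print(f"{'Average':<28}{average:>5.1f}")
```

Under this assumption the computed factors match the table row by row, and their mean matches the stated average of 6.8.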