English-Corpora.org



SPEED

For very large corpora, Sketch Engine is just about the fastest corpus architecture available. Our architecture, however, is even faster: about 10-15 times as fast, on average, for "string searches" like those shown below. With a large corpus like iWeb, for example, a series of searches that takes 5 minutes would take a little more than an hour (i.e. about 60 minutes of just sitting there, waiting for results) in a similar-sized corpus in Sketch Engine.

The following data is based on the 14.0 billion word iWeb corpus and the 13.2 billion word enTenTen15 corpus from Sketch Engine (these two corpora are roughly the same size). The first two columns show the search strings in iWeb and enTenTen15. The next two columns show the search time (in seconds) in iWeb and in Sketch Engine. The (prelim[inary]) column provides a preliminary estimate of how much faster the search is in iWeb (e.g. 6.8 times as fast for ADJ plans). Because iWeb is a bit larger than enTenTen15 (and so any search should take a little longer in iWeb than in enTenTen15), the [x faster] column (the rightmost column) adjusts for this difference.
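The arithmetic behind the last two columns can be sketched as follows. This is a minimal illustration, not the site's own calculation: the function names are hypothetical, and the size correction (scaling by the ratio of the two corpus sizes, 14.0 / 13.2) is an assumption about how the adjustment works. The published [x faster] values sit slightly below what this ratio gives, so the correction actually used was probably a little smaller.

```python
IWEB_WORDS = 14.0e9       # iWeb size, from the text
ENTENTEN_WORDS = 13.2e9   # enTenTen15 size, from the text


def prelim_ratio(iweb_seconds, se_seconds):
    """Raw ratio: how many times longer the Sketch Engine search took."""
    return se_seconds / iweb_seconds


def size_adjusted_ratio(iweb_seconds, se_seconds):
    """Assumed correction: scale the raw ratio by the relative corpus size."""
    return prelim_ratio(iweb_seconds, se_seconds) * (IWEB_WORDS / ENTENTEN_WORDS)


# "ADJ plans" row: 3.7 s in iWeb, 25 s in Sketch Engine
print(round(prelim_ratio(3.7, 25), 1))         # 6.8, as in the (Prelim) column
print(round(size_adjusted_ratio(3.7, 25), 1))  # 7.2 under this assumption (table: 7.1)
```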

As you can see, the English-Corpora.org architecture is about 10-15 times as fast as the Sketch Engine interface. As far as we are aware, this is the fastest architecture available for full-featured, structured corpora (see note 3).

Note: click on any link on this page to see the corpus data, and then click on the "BACK" image (see left) at the top of the page to come back to this page.
 
iWeb (14 billion words) (note 1)  Sketch Engine: enTenTen15 (13.2 billion words) (note 2)           iWeb (s)  SE (s)  (Prelim)  x faster

ADJ plans                         [tag = "J.*"] [word = "plans"]                                    3.7       25      6.8       7.1
long NOUN *                       [word = "long"] [tag = "N.*"]                                     7.6       103     13.6      14.3
I VERB whether                    [word = "I"] [tag = "V.*"] [word = "whether"]                     3.2       54      16.9      17.8
never really VERB+ *              [word = "never"] [word = "really"] [tag = "V.*"]                  5.6       47      8.4       8.8
the best NOUN *                   [word="the"] [word="best"] [tag="N.*"]                            9.2       171     18.6      19.6
ADV ADJ places                    [tag = "R.*"] [tag = "J.*"] [word = "places"]                     3.7       34      9.2       9.7
VERB them make *                  [tag = "V.*"] [word = "them"] [word = "make"]                     10.3      39      3.8       4.0
NOUN PRON BUY                     [tag="N.*"] [tag="PP.*"] [lemma="buy"]                            4.8       86      17.9      18.9
THINK PRON VERB+ *                [lemma="think"] [tag="PP.*"] [tag="V.*"]                          7.3       67      9.2       9.7
DO NEG it seem                    [lemma="do"] [word="n't"] [word="seem"] [word="to"] [tag="V.*"]   1.9       44      23.2      24.4
VERB her way PREP *               [tag="V.*"] [word="her"] [word="way"] [tag="IN.*"]                10.3      80      7.8       8.2
VERB through the NOUN *           [tag="V.*"] [word="through"] [word="the"] [tag="N.*"]             10        210     21.0      22.1
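As a rough check on the "10-15 times" figure, the twelve [x faster] values in the table above can be summarized as in the sketch below. The variable name is illustrative, and the values are simply copied from the rightmost column; nothing here comes from the site's own code.

```python
from statistics import mean, median

# [x faster] values copied from the table, top to bottom
x_faster = [7.1, 14.3, 17.8, 8.8, 19.6, 9.7, 4.0, 18.9, 9.7, 24.4, 8.2, 22.1]

print(round(mean(x_faster), 1))      # 13.7
print(round(median(x_faster), 1))    # 12.0
print(min(x_faster), max(x_faster))  # 4.0 24.4
```

Both the mean and the median fall inside the 10-15 range, although individual searches vary from about 4 to about 24 times as fast.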

Notes:

1. Click on the link to do the search in iWeb.  If there is an asterisk after the search, the first results will be from pre-calculated "n-grams" tables, which should be much faster than the times shown here. But it wouldn't really be a fair comparison to Sketch Engine, since Sketch Engine doesn't have pre-calculated n-grams tables. Therefore, to have a "fair" comparison and to search "from scratch", click on "Use Large N-grams" and then "See Full List" in the iWeb results.

2. To search in Sketch Engine, select Concordance / Advanced / CQL and then insert the CQL string. Once it starts showing the KWIC results, click on the [Frequency] icon in the row of icons at the top. The time shown in the SE (Sketch Engine) column above combines the time it takes to start displaying KWIC results and the time it then takes to produce the frequency list of matching strings, minus the 2-3 seconds or so needed to click on the [Frequency] link.

3. Some people might wonder why we haven't compared our results to CQPWeb as well. This is because CQPWeb is limited to corpora of 2 billion words or fewer, and so there are no comparable corpora (10-15 billion words) in the CQPWeb format. Preliminary results from very small corpora like the BNC, however, show that Sketch Engine is much faster than CQPWeb. By extension, since English-Corpora.org is 10-15 times as fast as Sketch Engine, it is probably 30-40 times as fast as CQPWeb.