The Corpus of Contemporary American English (COCA) and WordBanks Online


Note: the comparison below is to WordBanks Online, which is related to and derived from the Bank of English. The Bank of English itself is only available to a small group of researchers at the University of Birmingham. The vast majority of people who use the data from the Bank of English from the 1990s (the period under discussion below) will do it via WordBanks Online.


The Bank of English (BoE) has been used as the basis for many insightful analyses of Modern English, especially by those from the "Birmingham School" of corpus linguists. Because of its composition, many books and articles on corpus linguistics have suggested that the BoE could be used as a "monitor corpus", to look at recent and ongoing changes in English.

In spite of its high degree of usefulness for purely synchronic studies, we would argue that WordBanks Online (WbO) / the Bank of English (BoE) has certain shortcomings that seriously limit its usefulness as a monitor corpus. The Corpus of Contemporary American English (COCA), on the other hand, was designed from the ground up as a monitor corpus, and it does provide rich, useful data that is not available from WbO/BoE.


Corpus size and growth

Both corpora are quite large. COCA has about 520 million words, while WordBanks Online (WbO) is about 455 million words. Both are much larger than the widely-used 100 million word British National Corpus (BNC; see comparison of COCA and the BNC). One important difference between COCA and WbO, however, is that COCA continues to be updated, as a true monitor corpus should be. Another 20 million words are added to COCA each year (the last update was December 2015), while work on WordBanks Online has (apparently) stopped -- the last texts in WbO are from 2005.
 


Genres

It is a bit difficult to know exactly what is in WordBanks Online, since there is only one page with a sketchy outline online. With the Corpus of Contemporary American English, on the other hand, we have details by year, genre, sub-genre, and even down to the level of each of the 160,000 individual texts.

It looks, however, like the following is the composition of WordBanks Online for the American and British sub-corpora, along with the equivalent sizes from COCA:

GENRE                        COCA     WbO: UK   WbO: US
(sizes in millions of words)
Spoken                       109      41.4      20.1
Fiction                      105      24.1      33.1
Popular magazines            110      16.3      15.3
Newspaper                    106      125.6     77.8
Academic                     103      ---       ---
Other (Non-fiction books)    ---      51.6      43.1
TOTAL                        533      259.4     189.4

As one can see, COCA is evenly balanced between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. WordBanks Online, on the other hand, is heavily weighted towards newspapers (nearly half of the UK sub-corpus and about 40% of the US sub-corpus) because they are easy to acquire from online sources. There are apparently no academic journals in WordBanks Online (or at least they are not labeled as such). Finally, much of the spoken material in WordBanks Online is taken from transcripts of material that is read aloud, whereas in COCA it comes from spontaneous speech on TV and radio programs.
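The proportions discussed above can be recomputed directly from the table. The following is a minimal Python sketch; the dictionary layout and the function name genre_shares are illustrative choices, and each share is calculated against the sum of the rows listed for that corpus rather than against the TOTAL row.

```python
# Genre sizes in millions of words, copied from the table above.
# Genres marked "---" in the table are simply left out of that corpus.
SIZES = {
    "COCA": {
        "Spoken": 109, "Fiction": 105, "Popular magazines": 110,
        "Newspaper": 106, "Academic": 103,
    },
    "WbO: UK": {
        "Spoken": 41.4, "Fiction": 24.1, "Popular magazines": 16.3,
        "Newspaper": 125.6, "Other (Non-fiction books)": 51.6,
    },
    "WbO: US": {
        "Spoken": 20.1, "Fiction": 33.1, "Popular magazines": 15.3,
        "Newspaper": 77.8, "Other (Non-fiction books)": 43.1,
    },
}


def genre_shares(sizes):
    """Return each genre's percentage of the corpus, based on the listed rows."""
    total = sum(sizes.values())
    return {genre: 100 * size / total for genre, size in sizes.items()}


for corpus, sizes in SIZES.items():
    shares = genre_shares(sizes)
    summary = ", ".join(f"{genre}: {share:.1f}%" for genre, share in shares.items())
    print(f"{corpus}: {summary}")
```

Run on these figures, every COCA genre falls within about one percentage point of 20%, while newspapers are by far the largest single genre in both WbO sub-corpora.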


Where is the informal speech?

In several searches for informal constructions in WordBanks Online, we have found far fewer tokens than we would expect, which suggests that the limited Spoken texts in WbO do not represent actual spoken English very well. (This is probably because the texts come just from Voice of America transcripts, with little or no spontaneous speech.) To give just one example, the following table shows the number of tokens of the "quotative like" (and she's like, "I don't know").
 

           WbO (just the American texts)            COCA
Years      tokens   size          per million       tokens   size          per million
1990-94    5        20,883,000    0.24              128      103,300,000   1.2
1995-99    1        19,187,000    0.05              336      102,900,000   3.3
2000-04    173      123,055,000   1.41              453      102,600,000   4.4

As can be seen, there is a huge disparity between COCA and WordBanks Online. In terms of normalized frequencies (per million words), this informal construction is 3.1 times as common in COCA as in WbO in 2000-04, 5.0 times as common in 1990-94, and 66.0 times as common in 1995-99. We could repeat this with many other phenomena (and in forthcoming publications we do so). The bottom line is that COCA, even though it has at times been (incorrectly) criticized for not having enough "informal" spoken texts, has much more of this material than WordBanks Online (and the Bank of English, on which it is based). (See COCA / Help-Information / Texts / Spoken texts)
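The normalization used here is simply tokens divided by corpus size, scaled to one million words. A minimal Python sketch of the calculation follows; the function name per_million and the dictionary layout are illustrative, and the counts and sizes are those given in the table above.

```python
def per_million(tokens, corpus_size):
    """Normalized frequency: occurrences per one million words of corpus."""
    return tokens / corpus_size * 1_000_000


# (tokens, corpus size in words) for the quotative "like", from the table above.
WBO_US = {
    "1990-94": (5, 20_883_000),
    "1995-99": (1, 19_187_000),
    "2000-04": (173, 123_055_000),
}
COCA = {
    "1990-94": (128, 103_300_000),
    "1995-99": (336, 102_900_000),
    "2000-04": (453, 102_600_000),
}

for period in WBO_US:
    wbo = per_million(*WBO_US[period])
    coca = per_million(*COCA[period])
    # Ratio of the unrounded rates: how many times more common in COCA.
    print(f"{period}: WbO {wbo:.2f}, COCA {coca:.2f}, COCA/WbO {coca / wbo:.1f}")
```

Because the ratios quoted in the text are taken from the rounded per-million figures, ratios computed from the raw counts differ slightly (for 1995-99, roughly 63 rather than 66). The overall picture is the same either way.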


Genre balance over time

In order to use frequency statistics to look at changes over time -- as we would want to do with a monitor corpus -- each historical period needs to have the same genre composition. To take a worst-case example, suppose that a corpus had only newspapers from the 1990s and then only fiction from the 2000s. For any change that we see from the 1990s to the 2000s, we would not know if the change had actually occurred in the language as a whole, or if it is just an "artifact" of the changing genre composition from one period to the next.

What we find is that COCA is balanced across genres -- almost perfectly -- from year to year. In each and every year from 1990-2010, the corpus has been divided between spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%), and academic journals (20%). Even at the level of sub-genre (e.g. Newspaper-Sports, or Academic-Medicine), the corpus composition changes very little from year to year.

In WordBanks Online, however, the genre composition varies widely from one year (or set of years) to another. For example, the following figures show the percentage of fiction in the US sub-corpus in different time periods:

Time period    Fiction        Total          % fiction
1960-1979      1,030,000      1,414,000      72.8%
1980-1989      3,087,000      8,792,000      35.1%
1990-1994      6,049,000      20,833,000     29.0%
1995-1999      3,100,000      19,187,000     16.2%
2000-2004      18,800,000     123,055,000    15.3%

Notice how the percentage of fiction decreases by roughly 45% in relative terms (from 29.0% to 16.2%) from the early 1990s to the late 1990s. Let us briefly look at how this distorts the corpus data for these periods.
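The size of this shift can be checked from the word counts in the table. The sketch below is illustrative only: the names pct_fiction and relative_change are our own, and "relative change" here means the change in the fiction share expressed as a percentage of the earlier share.

```python
# (fiction words, total words) for the US sub-corpus of WbO, from the table above.
FICTION = {
    "1960-1979": (1_030_000, 1_414_000),
    "1980-1989": (3_087_000, 8_792_000),
    "1990-1994": (6_049_000, 20_833_000),
    "1995-1999": (3_100_000, 19_187_000),
    "2000-2004": (18_800_000, 123_055_000),
}


def pct_fiction(fiction_words, total_words):
    """Fiction as a percentage of all words in the period."""
    return 100 * fiction_words / total_words


def relative_change(old, new):
    """Change from old to new as a percentage of old (negative = decrease)."""
    return 100 * (new - old) / old


shares = {period: pct_fiction(*counts) for period, counts in FICTION.items()}
for period, share in shares.items():
    print(f"{period}: {share:.1f}% fiction")

drop = relative_change(shares["1990-1994"], shares["1995-1999"])
print(f"Change in fiction share, 1990-94 to 1995-99: {drop:.1f}%")
```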
 
 

                            ALL      per       ALL      per       FIC      per       FIC      per
                            90-94    million   95-99    million   90-94    million   95-99    million
mutter (all forms)          378      18.1      269      14.0      326      53.9      159      51.3
she said                    3948     189.5     2783     145.0     3271     540.8     1793     578.4
had + VBN (e.g. had seen)   56239    2699.5    31125    1622.2    21590    3569.2    10418    3360.7

All three of these forms (mutter, she said, and had + VBN) are characteristic of fiction. Notice that in just the US fiction part of WbO (the FIC columns), the frequency per million words stays about the same from 1990-94 to 1995-99, as we would expect. But in the entire US part of WbO (all genres; the ALL columns), the normalized frequency (per million words) decreases much more from 1990-94 to 1995-99. For example, had + VBN decreases by about 40%. Why is this? Well, notice in the fiction table above that the percentage of the US corpus in WbO that is fiction decreased by about 45% during the same period (from 29.0% to 16.2%). In other words, the decrease in the corpus is probably just a function of the change in genre balance, rather than any change in "real world" language. (It would, after all, be quite strange if people really did all of a sudden say had eaten, had noticed, etc. only 60% as much in the late 1990s as in the early 1990s!)
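One way to probe this explanation is to hold the fiction share fixed and recompute the overall rate. The sketch below does this for mutter only, treating the corpus as just two strata, fiction and everything else. This two-way split is a simplifying assumption of the sketch: the non-fiction counts are obtained by subtraction and are not reported separately in the tables above, and the function names are illustrative.

```python
# Tokens of "mutter" (all forms) and corpus sizes in words for the US part
# of WbO, copied from the tables above: "all" = all genres, "fic" = fiction.
DATA = {
    "1990-94": {"all": (378, 20_833_000), "fic": (326, 6_049_000)},
    "1995-99": {"all": (269, 19_187_000), "fic": (159, 3_100_000)},
}


def rate(tokens, size):
    """Occurrences per million words."""
    return tokens / size * 1_000_000


def split(period):
    """Return (fiction share, fiction rate, non-fiction rate) for a period.

    Non-fiction figures are derived by subtraction (an assumption of this
    sketch, not data reported in the source).
    """
    all_tokens, all_size = DATA[period]["all"]
    fic_tokens, fic_size = DATA[period]["fic"]
    fic_share = fic_size / all_size
    other_rate = rate(all_tokens - fic_tokens, all_size - fic_size)
    return fic_share, rate(fic_tokens, fic_size), other_rate


def standardized_rate(period, fic_share):
    """Overall rate the period would show if fiction made up fic_share of it."""
    _, fic_rate, other_rate = split(period)
    return fic_share * fic_rate + (1 - fic_share) * other_rate


early_share, _, _ = split("1990-94")
observed_late = rate(*DATA["1995-99"]["all"])
adjusted_late = standardized_rate("1995-99", early_share)
print(f"Observed 1995-99 rate: {observed_late:.1f} per million")
print(f"1995-99 rate at the 1990-94 fiction share: {adjusted_late:.1f} per million")
```

For this one item, the adjusted 1995-99 figure comes out at roughly 19.7 per million, against an observed 14.0, which is consistent with the apparent drop being an effect of the smaller fiction share. A fuller check would standardize across all genres rather than a two-way split.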

In COCA, on the other hand, the relative frequency of these three forms in the overall corpus stays quite flat from 1990-94 until 2005-09, because the percentage of texts in the corpus that are from fiction (20% each year) stays the same.
 

mutter        1990-1994   1995-1999   2000-2004   2005-2009
PER MIL       14.9        13.4        14.8        15.9
SIZE (MW)     103.3       102.9       102.6       93.6
FREQ          1542        1378        1516        1484

she said      1990-1994   1995-1999   2000-2004   2005-2009
PER MIL       197.9       210.7       190.4       204.5
SIZE (MW)     103.3       102.9       102.6       93.6
FREQ          20444       21684       19531       19130

had [VVN]     1990-1994   1995-1999   2000-2004   2005-2009
PER MIL       1,173.1     1,066.2     1,059.0     1,095.4
SIZE (MW)     103.3       102.9       102.6       93.6
FREQ          121208      109731      108624      102491


Strange data from WordBanks Online

Even beyond this serious problem with genre balance, it appears that there might be an even more fundamental problem with WordBanks Online (which again, is quite similar to the Bank of English for the 1990s). To see what this is, consider the following table:

WordBanks Online

            1990-94           1995-99           2000-04            90-94 > 95-99   95-99 > 00-04
was VVN     1550 (32370)      1071 (20558)      1458 (179367)      0.69            1.36
to be       1411 (29467)      1289 (24726)      1153 (141880)      0.91            0.89
is          6443 (134551)     8225 (157808)     6558 (810686)      1.28            0.80
and         22400 (467783)    22517 (432037)    18580 (2286364)    1.01            0.83

This table shows the frequency of two common words (is and and), a phrase (to be), and a grammatical construction (was VVN: was seen, was considered) in WordBanks Online in three periods: 1990-94, 1995-99, and 2000-04. (In each cell the normalized value, per million words, comes first and the raw frequency follows in parentheses.) The two columns at the right (90-94 > 95-99 and 95-99 > 00-04) show the ratio of the normalized figure in the later period to that in the earlier period, so that 0.69 is a 31% decrease and 1.36 is a 36% increase. For example, in WbO the frequency of the passive "decreased" 31% between 1990-94 and 1995-99, and then "increased" 36% between 1995-99 and 2000-04.

One might wonder why the passive would increase or decrease 30-35% between two adjacent periods, or why a very common word like is or and would vary by 20-30% from one period to the next. And notice that it is not just a problem with corpus sizes and bad calculations: with one word the frequency might increase dramatically between two periods in WbO, while with another word it might decrease dramatically during the same period. With frequency statistics this strange for common, predictable words, it is difficult to have confidence that WordBanks Online will provide accurate data for other words, phrases, and grammatical constructions that we might be researching.

COCA

            1990-94    1995-99    2000-04    90-94 > 95-99   95-99 > 00-04
was VVN     1305       1235       1234       0.95            1.00
to be       1560       1517       1490       0.97            0.98
is          9549       9414       9190       0.99            0.98
and         26606      26731      26782      1.00            1.00

The table above shows the normalized frequencies (per million words) for the same four common words, phrases, and grammatical constructions in COCA from the early 1990s to 2004 (data is also available for 2005-2010, but we have omitted it here to enable easier comparison with the WbO data). Notice that the frequency of these items is essentially flat over time (as we would expect it to be), and we do not have the strange anomalies that are found in WordBanks Online (which again are very similar to the data from the Bank of English for the 1990s).
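The contrast between the two corpora can be summarized with the same period-to-period ratios used in the tables. The following Python sketch is one illustrative way to do so; the function names period_ratios and max_swing, and the idea of reporting the largest deviation from 1.0, are our own choices rather than a standard measure.

```python
# Normalized frequencies (per million words) for 1990-94, 1995-99, 2000-04,
# copied from the two tables above.
WBO = {
    "was VVN": (1550, 1071, 1458),
    "to be": (1411, 1289, 1153),
    "is": (6443, 8225, 6558),
    "and": (22400, 22517, 18580),
}
COCA = {
    "was VVN": (1305, 1235, 1234),
    "to be": (1560, 1517, 1490),
    "is": (9549, 9414, 9190),
    "and": (26606, 26731, 26782),
}


def period_ratios(freqs):
    """Ratio of each period's frequency to that of the preceding period."""
    return [later / earlier for earlier, later in zip(freqs, freqs[1:])]


def max_swing(freqs):
    """Largest absolute deviation of any period-to-period ratio from 1.0."""
    return max(abs(ratio - 1) for ratio in period_ratios(freqs))


for item in WBO:
    wbo_ratios = ", ".join(f"{r:.2f}" for r in period_ratios(WBO[item]))
    coca_ratios = ", ".join(f"{r:.2f}" for r in period_ratios(COCA[item]))
    print(f"{item}: WbO {wbo_ratios} (max swing {max_swing(WBO[item]):.0%}); "
          f"COCA {coca_ratios} (max swing {max_swing(COCA[item]):.0%})")
```

On these figures, every item in WbO moves by more than 10% between at least one pair of adjacent periods, while no item in COCA moves by more than about 5%.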


Summary

For those who can afford the $1,150 per year, WordBanks Online provides very good data for British and American English. Contrary to what has commonly been said about WordBanks Online (and the Bank of English, to which it is related, and which is very similar for the 1990s), it is probably not a very reliable "monitor corpus", because its genre balance varies so much from one period to the next. One can never know whether the changes that one sees are a function of the changing genre balance or whether they represent actual changes in the "real world".

The freely-available Corpus of Contemporary American English (COCA), on the other hand, was explicitly and carefully designed as a monitor corpus. This is especially apparent in the corpus design, where the corpus maintains the same genre balance from year to year. As far as we are aware, COCA is the only large corpus that is designed this way, and which can thus be used to accurately measure recent shifts in English.