GENRE | COCA (millions of words) | | |
Spoken | 109 | 41.4 | 20.1 |
Fiction | 105 | 24.1 | 33.1 |
Popular magazines | 110 | 16.3 | 15.3 |
Newspaper | 106 | 125.6 | 77.8 |
Academic | 103 | --- | --- |
Other (Non-fiction books) | --- | 51.6 | 43.1 |
TOTAL | 533 | 259.4 | 189.4 |
As one can see, COCA is evenly balanced between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. WordBanks Online, on the other hand, is heavily weighted towards newspapers (about 50%) because they are easy to acquire from online sources. There are apparently no academic journals in WordBanks Online (or at least they are not labeled as such). Finally, much of the spoken material in WordBanks Online is taken from transcripts that are read, whereas in COCA it comes from spontaneous speech on TV and radio programs.
Where is the informal speech?
In several searches for informal constructions that we have run in WordBanks Online, WbO appears to have far too little data, which suggests that its limited Spoken texts do not represent actual spoken English very well. (This is probably because the texts come only from Voice of America transcripts, with little or no spontaneous speech.) To give just one example, the following table shows the number of tokens of "quotative like" (and she's like, "I don't know").
Years | WbO (just the American texts) | | | COCA | | |
| tokens | size | per million | tokens | size | per million |
1990-94 | 5 | 20,883,000 | 0.24 | 128 | 103,300,000 | 1.2 |
1995-99 | 1 | 19,187,000 | 0.05 | 336 | 102,900,000 | 3.3 |
2000-04 | 173 | 123,055,000 | 1.41 | 453 | 102,600,000 | 4.4 |
As can be seen, there is a huge disparity between COCA and WordBanks Online. In terms of normalized frequencies (per million words), this informal construction is 3.1 times as common in COCA as in WbO in 2000-04, 5.0 times as common in 1990-94, and 66.0 times as common in 1995-99. We could repeat this with many other phenomena (and in forthcoming publications we do so). The bottom line is that COCA, even though it has at times been (incorrectly) criticized for not having enough "informal" spoken texts, has much more of this material than WordBanks Online (and the Bank of English, on which it is based). (See COCA / Help-Information / Texts / Spoken texts)
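The arithmetic behind these figures can be sketched as follows. This is only an illustration: the token counts and corpus sizes are copied from the table above, while the names `QUOTATIVE_LIKE`, `per_million`, and `coca_to_wbo_ratio` are our own hypothetical labels. The `as_printed` switch is also our assumption about how the ratios in the text were obtained, namely from the per-million values as rounded in the table.

```python
# Token counts and corpus sizes (in words) copied from the table above.
QUOTATIVE_LIKE = {
    "1990-94": {"wbo": (5, 20_883_000), "coca": (128, 103_300_000)},
    "1995-99": {"wbo": (1, 19_187_000), "coca": (336, 102_900_000)},
    "2000-04": {"wbo": (173, 123_055_000), "coca": (453, 102_600_000)},
}


def per_million(tokens: int, corpus_size: int) -> float:
    """Normalized frequency: occurrences per million words of text."""
    return tokens / corpus_size * 1_000_000


def coca_to_wbo_ratio(period: str, as_printed: bool = True) -> float:
    """How many times as frequent the construction is in COCA as in WbO.

    With as_printed=True the per-million values are first rounded to the
    precision shown in the table (two decimals for WbO, one for COCA).
    """
    wbo = per_million(*QUOTATIVE_LIKE[period]["wbo"])
    coca = per_million(*QUOTATIVE_LIKE[period]["coca"])
    if as_printed:
        wbo, coca = round(wbo, 2), round(coca, 1)
    return coca / wbo


for period in QUOTATIVE_LIKE:
    printed = coca_to_wbo_ratio(period)
    exact = coca_to_wbo_ratio(period, as_printed=False)
    print(f"{period}: {printed:.1f}x from rounded values, {exact:.1f}x unrounded")
```

With a single WbO token in 1995-99, the ratio for that period is very sensitive to rounding: the unrounded figure comes out at roughly 63 rather than 66. The disparity is large either way.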
Genre balance over time
In order to use frequency statistics to look at changes over time -- as we would want to do with a monitor corpus -- each historical period needs to have the same genre composition. To take a worst-case example, suppose that a corpus had only newspapers from the 1990s and then only fiction from the 2000s. For any change that we see from the 1990s to the 2000s, we would not know if the change had actually occurred in the language as a whole, or if it is just an "artifact" of the changing genre composition from one period to the next.
What we find is that COCA is balanced across genres -- almost perfectly -- from year to year. In each and every year from 1990-2010, the corpus has been divided between spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%), and academic journals (20%). Even at the level of sub-genre (e.g. Newspaper-Sports, or Academic-Medicine), the corpus composition changes very little from year to year.
In WordBanks Online, however, the genre composition varies widely from one year (or set of years) to another. For example, the following figures show the percentage of fiction in the US sub-corpus in different time periods:
Time period | Fiction | Total | % fiction |
1960-1979 | 1,030,000 | 1,414,000 | 72.8% |
1980-1989 | 3,087,000 | 8,792,000 | 35.1% |
1990-1994 | 6,049,000 | 20,833,000 | 29.0% |
1995-1999 | 3,100,000 | 19,187,000 | 16.2% |
2000-2004 | 18,800,000 | 123,055,000 | 15.3% |
Notice how the percentage of fiction falls from 29.0% in the early 1990s to 16.2% in the late 1990s, a relative decrease of about 44%. Let us briefly look at how this distorts the corpus data for these periods.
| ALL 1990-94 | per million | ALL 1995-99 | per million | FIC 1990-94 | per million | FIC 1995-99 | per million |
mutter (all forms) | 378 | 18.1 | 269 | 14.0 | 326 | 53.9 | 159 | 51.3 |
she said | 3948 | 189.5 | 2783 | 145.0 | 3271 | 540.8 | 1793 | 578.4 |
had + VBN (e.g. had seen) | 56239 | 2699.5 | 31125 | 1622.2 | 21590 | 3569.2 | 10418 | 3360.7 |
All three of these forms (mutter, she said, and had + VBN) are characteristic of fiction. Notice that in just the US fiction part of WbO (the FIC columns; green cells), the frequency per million words stays about the same from 1990-94 to 1995-99, as we would expect. But in the entire US part of WbO (all genres; the ALL columns, in blue), the normalized frequency (per million words) decreases much more over the same period. For example, had + VBN decreases by about 40%. Why is this? Notice in the earlier table of fiction percentages that the share of the US corpus in WbO that is fiction decreased by about 44% during the same period (from 29.0% to 16.2%, i.e. to about 55% of its earlier share). In other words, the decrease in the overall corpus is probably just a function of the change in genre balance, rather than any change in "real world" language. (It would, after all, be quite strange if people really did all of a sudden say had eaten, had noticed, etc. only 60% as much in the late 1990s as in the early 1990s!)
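The contrast between the whole US sub-corpus and its fiction portion can be checked with a short calculation. This is a rough sketch: the per-million values are copied from the table above, on our assumption that the first value in each ALL or FIC pair is for 1990-94 and the second for 1995-99. The names `WBO_US`, `FICTION_SHARE`, and `pct_change` are hypothetical labels of our own.

```python
# Per-million frequencies in the US part of WbO, as (1990-94, 1995-99) pairs.
# Assumption: the first ALL/FIC pair in the table is 1990-94, the second 1995-99.
WBO_US = {
    "mutter": {"all": (18.1, 14.0), "fiction": (53.9, 51.3)},
    "she said": {"all": (189.5, 145.0), "fiction": (540.8, 578.4)},
    "had + VBN": {"all": (2699.5, 1622.2), "fiction": (3569.2, 3360.7)},
}

# Percentage of the US sub-corpus that is fiction in 1990-94 and 1995-99.
FICTION_SHARE = (29.0, 16.2)


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100


for form, rates in WBO_US.items():
    overall = pct_change(*rates["all"])
    fiction = pct_change(*rates["fiction"])
    print(f"{form}: all genres {overall:+.1f}%, fiction only {fiction:+.1f}%")

print(f"fiction share of corpus: {pct_change(*FICTION_SHARE):+.1f}%")
```

Within fiction, every form changes by less than 10% in either direction, while the all-genre figures fall by more than 20% in each case.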
In COCA, on the other hand, the relative frequency of these three forms
in the overall corpus stays quite flat from 1990-94 until
2005-09, because the percentage of texts in the corpus that are from
fiction (20% each year) stays the same.
mutter | 1990-1994 | 1995-1999 | 2000-2004 | 2005-2009 |
PER MIL | 14.9 | 13.4 | 14.8 | 15.9 |
SIZE (MW) | 103.3 | 102.9 | 102.6 | 93.6 |
FREQ | 1542 | 1378 | 1516 | 1484 |
she said | 1990-1994 | 1995-1999 | 2000-2004 | 2005-2009 |
PER MIL | 197.9 | 210.7 | 190.4 | 204.5 |
SIZE (MW) | 103.3 | 102.9 | 102.6 | 93.6 |
FREQ | 20444 | 21684 | 19531 | 19130 |
had [VVN] | 1990-1994 | 1995-1999 | 2000-2004 | 2005-2009 |
PER MIL | 1,173.1 | 1,066.2 | 1,059.0 | 1,095.4 |
SIZE (MW) | 103.3 | 102.9 | 102.6 | 93.6 |
FREQ | 121208 | 109731 | 108624 | 102491 |
Strange data from WordBanks Online
Even beyond this serious problem with genre balance, it appears
that there might be an even more fundamental problem with
WordBanks Online (which again, is quite similar to
the Bank of English for the 1990s). To see what this is, consider the following table:
WordBanks Online | 1990-94 | 1995-99 | 2000-04 | 90-94 > 95-99 | 95-99 > 00-04 |
was VVN | 1550 | 1071 | 1458 | 0.69 | 1.36 |
to be | 1411 | 1289 | 1153 | 0.91 | 0.89 |
is | 6443 | 8225 | 6558 | 1.28 | 0.80 |
and | 22400 | 22517 | 18580 | 1.01 | 0.83 |
This table shows the frequency of four common items in WordBanks Online: two words (is and and), a phrase (to be), and a grammatical construction (was VVN, e.g. was seen, was considered), in three periods (1990-94, 1995-99, and 2000-04). (The raw frequency data is in parentheses, while the normalized value, per million words, is in bold.) The two columns at the right (90-94 / 95-99 and 95-99 / 00-04) show the change in the normalized figures between 1990-94 and 1995-99 and between 1995-99 and 2000-04, expressed as a ratio (so 0.69 is a 31% decrease and 1.36 is a 36% increase). For example, in WbO the frequency of the passive "decreased" 31% between 1990-94 and 1995-99, and then "increased" 36% between 1995-99 and 2000-04.
One might wonder why the passive would increase or decrease 30-35% between two adjacent periods, or why a very common word like is or and would vary by 20-30% from one period to the next. And notice that it is not just a problem with corpus sizes and bad calculations: with one word the frequency might increase dramatically between two periods in WbO, while with another word it might decrease dramatically during the same period. With frequency statistics this strange for common, predictable words, it is difficult to have confidence that WordBanks Online will provide accurate data for other words, phrases, and grammatical constructions that we might be researching.
COCA | 1990-94 | 1995-99 | 2000-04 | 90-94 > 95-99 | 95-99 > 00-04 |
was VVN | 1305 | 1235 | 1234 | 0.95 | 1.00 |
to be | 1560 | 1517 | 1490 | 0.97 | 0.98 |
is | 9549 | 9414 | 9190 | 0.99 | 0.98 |
and | 26606 | 26731 | 26782 | 1.00 | 1.00 |
The table above shows the normalized frequencies (per million words) for four common words, phrases, and grammatical constructions in COCA from the early 1990s to 2004 (data is also available for 2005-2010, but we have omitted it here to enable easier comparison with the WbO data). Notice that the frequency of these words is essentially flat over time (as we would expect it to be), and we do not have the strange anomalies that are found in WordBanks Online (which again are very similar to the data from the Bank of English for the 1990s).
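The period-to-period ratios in the two tables, and the difference in stability between the corpora, can be reproduced with a few lines. This is a minimal sketch using the per-million values from the WordBanks Online and COCA tables above; `period_ratios` and `largest_swing` are hypothetical helper names of our own, not part of either corpus interface.

```python
# Per-million frequencies for 1990-94, 1995-99, and 2000-04, from the tables above.
WBO = {
    "was VVN": (1550, 1071, 1458),
    "to be": (1411, 1289, 1153),
    "is": (6443, 8225, 6558),
    "and": (22400, 22517, 18580),
}
COCA = {
    "was VVN": (1305, 1235, 1234),
    "to be": (1560, 1517, 1490),
    "is": (9549, 9414, 9190),
    "and": (26606, 26731, 26782),
}


def period_ratios(freqs):
    """Ratio of each period's frequency to that of the preceding period."""
    return [later / earlier for earlier, later in zip(freqs, freqs[1:])]


def largest_swing(corpus):
    """Largest deviation from 1.0 (no change) over all items and periods."""
    return max(
        abs(ratio - 1)
        for freqs in corpus.values()
        for ratio in period_ratios(freqs)
    )


for name, corpus in (("WbO", WBO), ("COCA", COCA)):
    for item, freqs in corpus.items():
        ratios = ", ".join(f"{r:.2f}" for r in period_ratios(freqs))
        print(f"{name} {item}: {ratios}")
    print(f"{name} largest swing: {largest_swing(corpus):.0%}")
```

On these figures the largest swing in WbO is the 36% rise in was VVN, whereas no COCA item moves by more than about 5% between adjacent periods.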
Summary
For those who can afford the $1,150 per year, WordBanks Online provides very good data for British and American English. Contrary to what has commonly been said about WordBanks Online (and the Bank of English, to which it is related, and which is very similar for the 1990s), it is probably not a very reliable "monitor corpus", because its genre balance varies so much from year to year. One can never know whether the changes that one sees are a function of the changing genre balance or whether they represent actual changes in the "real world".
The freely-available Corpus of Contemporary American English (COCA), on the other hand, was explicitly and carefully designed as a monitor corpus. This is especially apparent in the corpus design, where the corpus maintains the same genre balance from year to year. As far as we are aware, COCA is the only large corpus that is designed this way, and which can thus be used to accurately measure recent shifts in English.