The Corpus of Contemporary American English (COCA)
and
the British National Corpus (BNC)

The British National Corpus (BNC) and the Corpus of Contemporary American English (COCA) complement each other nicely, since they are the only large, well-balanced corpora of English that are freely available online. Here we will briefly compare the two corpora in terms of corpus size, genre coverage, and how up-to-date they are.


Corpus size

The Corpus of Contemporary American English (450+ million words) is more than four times as large as the British National Corpus (100 million words). As a result, it often provides data for lower-frequency constructions that are not available from the BNC. In terms of concrete examples, let us focus here on just two types of phenomena -- collocates and syntax.

Collocates / semantics. The following table shows the number of different collocates that occur at least 3-5 times with the given node words. Notice that with a word like nibble, the word itself occurs only 4-5 times as often in COCA as in the BNC (1194 to 244; to be expected from a corpus four times the size). But in terms of collocates, there are 14 times as many in COCA that occur 5 times or more as there are in the BNC (29 to 2). For low-frequency words like these, there is often a real difference between a 100-million-word corpus and a 450-million-word corpus.

(Note: the COCA numbers below are based on a smaller 410-million-word version of the corpus (1990-2010). The newest 2012 update (450 million words) will have more tokens for each construction.)

Word (PoS)        COCA freq   BNC freq   Collocate PoS / span   COCA collocates                       BNC collocates
click (noun)      3145        445        adj, 2L / 0R           27: loud, audible, double, sharp      5: double, sharp, loud
nibble (verb)     1194        244        noun, 0L / 3R          29: edges, grass, ear, lip            2: ear, bait
serenely (adv)    308         83         verb, 4L / 4R          23: smile, float, gaze, glide         2: said, smiled
crumbled (adj)    446         27         noun, 0L / 3R          32: cheese, bacon, bread, cornbread   0: ---
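
The span notation in the table (for example 0L / 3R: no words to the left of the node word and up to three to the right) can be illustrated with a minimal sketch. The function name, the toy token list, and the threshold are assumptions made for illustration only. This is not how the corpus interface is implemented, and it ignores lemmatization and part-of-speech filtering.

```python
from collections import Counter

def collocates(tokens, node, left, right, min_freq=1):
    """Count words found within `left` words before or `right` words after `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Slice the window around this occurrence, excluding the node itself.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    # Keep only collocates that reach the minimum frequency.
    return {w: c for w, c in counts.items() if c >= min_freq}

# Toy data invented for illustration (not corpus output).
tokens = "the goats nibble grass while the fish nibble bait and rabbits nibble grass".split()
result = collocates(tokens, "nibble", left=0, right=3, min_freq=2)
print(result)
```

Raising `min_freq` is what makes corpus size matter: in a smaller corpus, fewer collocates clear the threshold, which is the pattern the table shows for low-frequency node words.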

Syntax. Consider the following three examples.

  • [like] for [p*] to [v*] (I'd really like for you to stay)
    There are 5 tokens in the BNC, but 352 tokens in COCA. With the BNC there aren't enough examples to see if this is a feature of informal or formal English, but the data from COCA show that it is clearly a feature of spoken English. The data also show that it is increasing slowly over time, when compared as a ratio to the construction [ like -- him to V ].
     

  • Is it excel in V-ing, or excel at V-ing? (she excels in/at playing the piano)
    Granted, this is a very narrow issue, but it is precisely the kind of thing that translators and non-native speakers are interested in. With the BNC there are 5 tokens with at and 6 with in -- probably not enough to say which is more common. In COCA, however, there are 136 with at and 47 with in. This is enough to begin to see which genres prefer one or the other, as well as which subordinate clause verbs occur with each. Such granularity is not possible with the BNC.

  • [have] been being [vvn] (she had been being watched)
    There are 2 tokens in the BNC (1 spoken, 1 fiction), and this is not enough data to see any possible genre variation. In COCA, on the other hand, there are 14 tokens (10 spoken, 2 fiction, 1 news, 1 academic). This is enough to suggest that this is a feature of spoken English, and the data also show that it has been increasing since 1990. (By the way, most native speakers of both dialects will cringe at sentences like this, but they are in the corpora.)
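
One way to make "not enough to say which is more common" concrete for the excel at / excel in counts is an exact two-sided binomial test against an even split. This is a sketch of a standard test chosen here for illustration, not a method the corpus pages themselves apply; the function name is hypothetical, and only the token counts come from the text.

```python
from math import comb

def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= cutoff))

# Token counts quoted in the text: "excel at V-ing" vs. "excel in V-ing".
p_bnc = two_sided_binomial_p(5, 5 + 6)        # BNC: 5 with at, 6 with in
p_coca = two_sided_binomial_p(136, 136 + 47)  # COCA: 136 with at, 47 with in
print(f"BNC p = {p_bnc:.3f}, COCA p = {p_coca:.2e}")
```

With 5 against 6, the split is indistinguishable from chance, whereas 136 against 47 would be very unlikely if the two prepositions were equally common.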

In summary, while 100 million words is often adequate for studying syntax, for some very low-frequency phenomena, there is a real difference between 100 million words (BNC) and 450 million words (COCA).


How up-to-date are the corpora?

COCA has 20 million words for each year since the early 1990s (more than 450 million words in total), and the most recent texts are from Summer 2012. The most recent texts in the BNC, on the other hand, are from the early 1990s -- 20 years ago. This has important implications in terms of how the two corpora represent contemporary English.

Lexical. Perhaps the easiest comparison deals with words that have recently come into English, or which are used a lot more now than 20 years ago. The following lists show a few words (just a tiny sample of all such words) that are found less than half as often in the BNC as in COCA (per million words), and the words in italics are found less than 10% as often (often, there are no tokens in the BNC). Obviously, some are American words and wouldn't be in a corpus of British English. Many others, however, are words that are simply much more common in COCA, because it alone contains texts from the last 20 years.

Noun: website (COCA/BNC), blog (COCA/BNC), globalization/globalisation (COCA/BNC), SUV, RPG, Taliban, e-mail, anthrax, recount, adolescent, prep, tsunami, affiliation, Sunni, insurgent, insurgency, terrorism, coping, terrorist, cleric, yoga, homeland, genome, steroid, detainee, militant

Adjective: same-sex (COCA/BNC), Islamist (COCA/BNC), upscale (COCA/BNC), terrorist, faith-based, web-based, nonstick, dot-com, performance-enhancing, high-stakes, 21st-century, old-school, pandemic, iconic, insurgent, online, broadband, gated, wireless, clueless

Adverb: wirelessly (COCA/BNC), healthfully (COCA/BNC), multiculturally (COCA/BNC), preemptively, inferiorly, counterintuitively, online, forensically, intraoperatively, postoperatively, famously

Verb: mentor (COCA/BNC), morph (COCA/BNC), download (COCA/BNC), e-mail, makeover, prep, upload, workout, freak, transition, vaccinate, encrypt, reconnect, click, host, splurge, preheat, co-write, outsource, snack, partner
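
The "per million words" comparison behind these lists can be sketched as follows. Because no raw counts are given for the words listed here, the example reuses the four word frequencies from the collocates table earlier on this page (with COCA at 410 million words, as noted there); the function names and labels are assumptions made for illustration.

```python
def per_million(count, corpus_size_millions):
    """Normalized frequency: occurrences per million words."""
    return count / corpus_size_millions

def bnc_to_coca_ratio(coca_count, bnc_count, coca_size=410, bnc_size=100):
    """BNC per-million frequency divided by COCA per-million frequency."""
    return per_million(bnc_count, bnc_size) / per_million(coca_count, coca_size)

# Raw counts (COCA, BNC) from the collocates table; sizes are in millions of words.
counts = {
    "click (noun)": (3145, 445),
    "nibble (verb)": (1194, 244),
    "serenely (adv)": (308, 83),
    "crumbled (adj)": (446, 27),
}

for word, (coca, bnc) in counts.items():
    ratio = bnc_to_coca_ratio(coca, bnc)
    label = "under 10%" if ratio < 0.1 else "under half" if ratio < 0.5 else "at least half"
    print(f"{word}: BNC/COCA per-million ratio = {ratio:.2f} ({label})")
```

Of these four words, only crumbled (adj) falls below the "less than half as often" threshold once corpus size is taken into account, which shows why raw counts alone are not a fair basis for comparing the two corpora.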

Although we have focused just on new "words" here, the same thing holds for other areas of language -- morphology (word formation), syntax (grammar), semantics (word meaning, such as green = "environmentally friendly"), and discourse analysis (what we are saying about immigrants, or women, or the environment). Any changes that have occurred since the early 1990s will not show up in the BNC, but should be modeled quite nicely with COCA.


Genre balance

The BNC is 10% spoken / 90% written, while in COCA the corpus is nearly evenly divided (20% in each genre) between spoken, fiction, popular magazines, newspaper, and academic.

GENRE               COCA (millions of words)   BNC (millions of words)
Spoken              95                         10
Fiction             91                         17
Popular magazines   95                         16
Newspaper           92                         11
Academic            91                         16
Other                                          30
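
As a quick check of the proportions described here, the genre sizes in the table can be converted to percentage shares. This is only an arithmetic sketch; the variable and function names are illustrative, and the figures are the rounded values from the table.

```python
# Genre sizes in millions of words, copied from the table above.
coca = {"Spoken": 95, "Fiction": 91, "Popular magazines": 95, "Newspaper": 92, "Academic": 91}
bnc = {"Spoken": 10, "Fiction": 17, "Popular magazines": 16, "Newspaper": 11,
       "Academic": 16, "Other": 30}

def shares(genres):
    """Return each genre's percentage share of the corpus total, rounded to one decimal."""
    total = sum(genres.values())
    return {g: round(100 * n / total, 1) for g, n in genres.items()}

coca_shares = shares(coca)
bnc_shares = shares(bnc)
print(coca_shares)
print(bnc_shares)
```

Each COCA genre comes out at roughly one fifth of the corpus, while spoken material makes up one tenth of the BNC.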

The BNC has a much wider range of spoken sub-genres, while the spoken portion of COCA is composed of unscripted conversation on TV and radio shows (see notes on the naturalness of these conversations: COCA / Help-Information / Texts / Spoken). Both corpora are very well balanced in terms of sub-genres for the written genres (e.g. Newspaper-Sports, or Academic-Medicine). In addition, because COCA has a diachronic aspect (coverage over time), the distribution of 20% in each of its five genres stays constant from year to year.
 


Summary

COCA and the BNC complement each other nicely, and they are the only large, well-balanced corpora of English that are publicly available. The BNC has better coverage of informal, everyday conversation, while COCA is much larger and more recent, which has important implications for the quantity and quality of the data overall.

Unless one is inherently interested in only British or American English, there is really no reason not to take advantage of both corpora. This is especially true when -- as with the interface at corpus.byu.edu -- both corpora can be used side by side, with the same interface. For most types of studies, academic publications and presentations that rely on just the BNC for data from Modern English will look increasingly outdated and insular as time goes on.