The British National Corpus (BNC) and the Corpus of Contemporary American English (COCA) complement each other nicely, since they are the only large, well-balanced corpora of English that are freely-available online. Here we will briefly compare the two corpora in terms of corpus size, genre coverage, and how up-to-date they are.

Corpus size

The Corpus of Contemporary American English (520+ million words) is more than five times as large as the British National Corpus (100 million words). As a result, it often provides data for lower-frequency constructions that are not available from the BNC. In terms of concrete examples, let us focus here on just two types of phenomena -- collocates and syntax.

Collocates / semantics. The following table shows collocates that occur at least 3-5 times with the given node words. Notice that with a word like nibble, the word itself occurs only about five times as often in COCA as in the BNC (1194 to 244 tokens, which is to be expected from a corpus about five times the size). But in terms of collocates, there are 14 times as many in COCA that occur 5 times or more as there are in the BNC. For low-frequency words like these, there is often a real difference between a 100 million word corpus and a 520 million word corpus.

Word (PoS)        Tokens (COCA / BNC)   Span      COCA collocates                   BNC collocates
click (noun)      3145 / 445            2L / 0R   loud, audible, double, sharp      double, sharp, loud
nibble (verb)     1194 / 244            0L / 3R   edges, grass, ear, lip            ear, bait
serenely (adv)    308 / 83              4L / 4R   smile, float, gaze, glide         said, smiled
crumbled (adj)    446 / 27              0L / 3R   cheese, bacon, bread, cornbread
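
To make the span notation concrete (for example, 2L / 0R means two words to the left of the node word and none to the right), here is a minimal Python sketch of span-based collocate counting with a minimum-frequency cutoff. It is only an illustration under simple assumptions: the function name `collocates`, the toy token list, and the lack of lemmatization or part-of-speech filtering are all invented here, and this is not how the corpus interface itself is implemented.

```python
from collections import Counter

def collocates(tokens, node, left, right, min_freq=1):
    """Count words found within `left` words before and `right` words
    after each occurrence of `node`, keeping those seen >= min_freq times."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words in the left span, then words in the right span.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return {w: c for w, c in counts.items() if c >= min_freq}

# Toy, made-up token stream (not real corpus data).
tokens = ("a loud click then a sharp click and a loud click "
          "after one more loud click").split()

# Span 2L / 0R, minimum frequency of 3.
result = collocates(tokens, "click", left=2, right=0, min_freq=3)
print(result)
```

With only four occurrences of the node word, just two collocates survive a cutoff of 3. The same effect, at a much larger scale, is why a larger corpus yields disproportionately more collocates for low-frequency words.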


Syntax. Consider the following three examples.

  • [like] for [p*] to [v*] (I'd really like for you to stay)
    There are 5 tokens in the BNC, but 330 tokens in COCA. With the BNC there aren't enough examples to see whether this is a feature of informal or formal English, but the data from COCA show that it is clearly a feature of spoken English. The data also show that it is increasing slowly over time, when compared as a ratio to the construction [ like -- him to V ].

  • Is it excel in V-ing, or excel at V-ing ? (she excels in/at playing the piano)
    Granted, this is a very narrow issue, but it is precisely the thing that translators and non-native speakers are interested in. With the BNC there are 5 tokens with at and 6 with in -- probably not enough to say which is more common. In COCA, however, there are 122 with at and 42 with in. This is enough to begin to see which genres prefer one or the other, as well as which subordinate clause verbs occur with each. Such granularity is not possible with the BNC.

  • [have] been being [vvn] (she had been being watched)
    There are 2 tokens in the BNC (1 spoken, 1 fiction), which is not enough data to see any possible genre variation. In COCA, on the other hand, there are 13 tokens (10 spoken, 2 fiction, 1 news). This is enough to suggest that the construction is a feature of spoken English, and the data also show that it has been increasing since 1990. (By the way, most native speakers of both dialects will cringe at sentences like this, but they are in the corpora.)
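
One way to see why 5 versus 6 tokens is "not enough" while 122 versus 42 is informative is a simple significance test. The sketch below applies an exact two-sided binomial test to the excel at / excel in counts quoted above, asking how surprising each split would be if the two prepositions were equally likely. This test is an illustrative choice made here, not a method used by the corpus interface; the helper name `binomial_two_sided_p` is hypothetical, and the test treats tokens as independent, which corpus data only approximates.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k out of n."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so outcomes tied with the observed one are included.
    total = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, total)

# Counts of "excel at V-ing" out of all at/in tokens, from the text above.
p_bnc = binomial_two_sided_p(5, 5 + 6)        # BNC: 5 at, 6 in
p_coca = binomial_two_sided_p(122, 122 + 42)  # COCA: 122 at, 42 in

print(f"BNC:  p = {p_bnc:.3g}")
print(f"COCA: p = {p_coca:.3g}")
```

The BNC split is entirely compatible with an even preference, whereas the COCA split is very unlikely under that assumption, which matches the intuition in the text that only the larger sample lets us say which preposition is more common.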

In summary, while 100 million words is often adequate for studying syntax, for some very low-frequency phenomena there is a real difference between 100 million words (BNC) and 520 million words (COCA).

How up-to-date are the corpora?

COCA has about 20 million words for each year since the early 1990s (for a total of more than 520 million words), and the most recent texts are from late 2015. The most recent texts in the BNC, on the other hand, are from the early 1990s -- more than twenty years (nearly a generation) ago. This has important implications for how well the two corpora represent contemporary English.

Lexical. Perhaps the easiest comparison deals with words that have recently come into English, or which are used a lot more now than 15-20 years ago. The following lists show a few words (just a tiny sample of all such words) that are found less than half as often in the BNC as in COCA (per million words), and the words in italics are found less than 10% as often (often, there are no tokens in the BNC). Obviously, some are American words and wouldn't be expected in a corpus of British English. Many others, however, are words that are simply much more common in COCA, because it alone contains texts from the last 20 years.

Noun: website (COCA/BNC), blog (COCA/BNC), globalization/globalisation (COCA/BNC), SUV, RPG, Taliban, e-mail, anthrax, recount, adolescent, prep, tsunami, affiliation, Sunni, insurgent, insurgency, terrorism, coping, terrorist, cleric, yoga, homeland, genome, steroid, detainee, militant

Adjective: same-sex (COCA/BNC), Islamist (COCA/BNC), upscale (COCA/BNC), terrorist, faith-based, web-based, nonstick, dot-com, performance-enhancing, high-stakes, 21st-century, old-school, pandemic, iconic, insurgent, online, broadband, gated, wireless, clueless

Adverb: wirelessly (COCA/BNC), healthfully (COCA/BNC), multiculturally (COCA/BNC), preemptively, inferiorly, counterintuitively, online, forensically, intraoperatively, postoperatively, famously

Verb: mentor (COCA/BNC), morph (COCA/BNC), download (COCA/BNC), e-mail, makeover, prep, upload, workout, freak, transition, vaccinate, encrypt, reconnect, click, host, splurge, preheat, co-write, outsource, snack, partner
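
Because the two corpora differ so much in size, comparisons like "less than half as often" only make sense on per-million-word rates, not raw counts. The sketch below shows that normalization, using the raw token counts for nibble and crumbled from the collocates table earlier in this section. The corpus sizes are the rounded figures given in the text (COCA is described as "520+" million words, so its rates are approximate), and the function names and the `label` thresholds are illustrative assumptions that simply mirror the two cutoffs described above.

```python
# Approximate corpus sizes in words, as stated in the text.
COCA_SIZE = 520_000_000
BNC_SIZE = 100_000_000

def per_million(count, corpus_size):
    """Normalized frequency: tokens per million words."""
    return count / corpus_size * 1_000_000

def bnc_to_coca_ratio(coca_count, bnc_count):
    """BNC rate divided by COCA rate (assumes coca_count > 0)."""
    return per_million(bnc_count, BNC_SIZE) / per_million(coca_count, COCA_SIZE)

def label(ratio):
    """Bucket a ratio using the two cutoffs mentioned in the text."""
    if ratio < 0.10:
        return "less than 10% as often in the BNC"
    if ratio < 0.50:
        return "less than half as often in the BNC"
    return "comparable or more frequent in the BNC"

# Raw (COCA, BNC) token counts taken from the collocates table.
counts = {"nibble": (1194, 244), "crumbled": (446, 27)}

ratios = {w: bnc_to_coca_ratio(c, b) for w, (c, b) in counts.items()}
for word, ratio in ratios.items():
    print(f"{word}: BNC/COCA rate ratio = {ratio:.2f} ({label(ratio)})")
```

On these figures nibble turns out to be about equally frequent per million words in both corpora, while crumbled falls into the "less than half as often" band, even though both words have far more raw tokens in COCA.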

Although we have focused just on new "words" here, the same thing holds for other areas of language -- morphology (word formation), syntax (grammar), semantics (word meaning, such as green = "environmentally friendly"), and discourse analysis (what we are saying about immigrants, or women, or the environment). Any changes that have occurred since the early 1990s will not show up in the BNC, but should be modeled quite nicely with COCA.

Genre balance

The BNC is 10% spoken / 90% written, while in COCA the corpus is nearly evenly divided (20% in each genre) between spoken, fiction, popular magazines, newspaper, and academic.

GENRE               COCA (millions of words)   BNC (millions of words)
Spoken              95                         10
Fiction             90                         17
Popular magazines   95                         16
Newspaper           92                         11
Academic            91                         16
Other                                          30

The BNC has a much wider range of spoken sub-genres, while the spoken portion of COCA is composed of unscripted conversation on TV and radio shows. Both corpora are very well balanced in terms of sub-genres for the written genres (e.g. Newspaper-Sports, or Academic-Medicine). In addition, COCA has a diachronic aspect (coverage over time): the distribution of roughly 20% in each of the five genres stays constant from year to year.


COCA and the BNC complement each other nicely, and they are the only large, well-balanced corpora of English that are publicly available. The BNC has better coverage of informal, everyday conversation, while COCA is much larger and more recent, which has important implications for the quantity and quality of the data overall.

Unless one is inherently interested in only British or American English, there is really no reason to not take advantage of both corpora. This is especially true when -- as with the interface at -- both corpora can be used side-by-side, with the same interface. For most types of studies, academic publications and presentations that rely on just the BNC for data from Modern English will look increasingly outdated and insular as time goes on.