The British National Corpus (BNC) and the Corpus of Contemporary American English (COCA) complement each other nicely, since they are the only large, well-balanced corpora of English that are freely available online. Here we briefly compare the two corpora in terms of corpus size, genre coverage, and how up-to-date they are. (Note: some of the data below is from just the part of the BNC that was released a generation ago, in the early 1990s, which is the version available at http://bncweb.lancs.ac.uk/.)
Corpus size
(Note that the data in this section is from 5-6 years ago, when COCA was about 400 million words in size.) The Corpus of Contemporary American English (560+ million words) is 5-6 times as large as the British National Corpus (100 million words). As a result, it often provides data for lower-frequency constructions that are not available from the BNC. As concrete examples, let us focus here on just two types of phenomena: collocates and syntax.
Collocates / semantics. The following table shows the number of different collocates that occur at least 3-5 times with the given node words. Notice that with a word like nibble, the word itself occurs only 4-5 times as often in COCA as in the BNC (1194 to 244, which is to be expected from a corpus 5-6 times the size). But in terms of collocates, there are 14 times as many in COCA that occur 5 times or more as there are in the BNC. For low-frequency words like these, there is often a real difference between a 100 million word corpus and a 560 million word corpus.
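The size comparison for nibble can be checked with a little arithmetic. The following is a minimal sketch, assuming the corpus sizes stated in the text (560 million and 100 million words) and the raw counts quoted for nibble; the function name `per_million` is only illustrative. Because the counts may date from when COCA was closer to 400 million words, the per-million figures should be read as rough illustrations rather than exact values.

```python
# Corpus sizes in words, as stated in the text (approximate).
COCA_SIZE = 560_000_000
BNC_SIZE = 100_000_000


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw token count to a frequency per million words."""
    return raw_count / corpus_size * 1_000_000


# Raw token counts for "nibble" quoted in the text.
nibble_coca = 1194
nibble_bnc = 244

# How many times more tokens COCA has, compared with how much larger it is.
raw_ratio = nibble_coca / nibble_bnc
size_ratio = COCA_SIZE / BNC_SIZE

print(f"raw token ratio (COCA/BNC): {raw_ratio:.2f}")
print(f"corpus size ratio (COCA/BNC): {size_ratio:.2f}")
print(f"COCA per million words: {per_million(nibble_coca, COCA_SIZE):.2f}")
print(f"BNC per million words: {per_million(nibble_bnc, BNC_SIZE):.2f}")
```

Under these assumptions the raw token ratio is close to the size ratio, so the normalized frequency of the word itself is similar in both corpora. The advantage of the larger corpus shows up in the number of distinct collocates that clear a minimum frequency threshold, not in the relative frequency of the node word.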
Syntax. Consider the following three examples.
In summary, while 100 million words is often adequate for studying syntax, for some very low-frequency phenomena there is a real difference between 100 million words (BNC) and 560 million words (COCA).
How up-to-date are the corpora?
COCA has 20 million words in each year since the early 1990s (for a total of more than 520 million words), and the most recent texts are from December 2017. The BNC was created in the late 1980s and was released in the early 1990s, and there was an update three years ago, in 2014. This has important implications for how the two corpora represent contemporary English.
Lexical. Perhaps the easiest comparison deals with words that have recently come into English, or which are used a lot more now than 20-25 years ago. The following lists show a few words (just a tiny sample of all such words) that are found less than half as often in the BNC as in COCA (per million words), and the words in italics are found less than 10% as often (often, there are no tokens in the BNC). Obviously, some are American words and wouldn't be in a corpus of British English. Many others, however, are words that are simply much more common in COCA because it is more recent.
Noun: website (COCA/BNC), blog (COCA/BNC), globalization/globalisation (COCA/BNC), SUV, RPG, Taliban, e-mail, anthrax, recount, adolescent, prep, tsunami, affiliation, Sunni, insurgent, insurgency, terrorism, coping, terrorist, cleric, yoga, homeland, genome, steroid, detainee, militant
Adjective: same-sex (COCA/BNC), Islamist (COCA/BNC), upscale (COCA/BNC), terrorist, faith-based, web-based, nonstick, dot-com, performance-enhancing, high-stakes, 21st-century, old-school, pandemic, iconic, insurgent, online, broadband, gated, wireless, clueless
Adverb: wirelessly (COCA/BNC), healthfully (COCA/BNC), multiculturally (COCA/BNC), preemptively, inferiorly, counterintuitively, online, forensically, intraoperatively, postoperatively, famously
Verb: mentor (COCA/BNC), morph (COCA/BNC), download (COCA/BNC), e-mail, makeover, prep, upload, workout, freak, transition, vaccinate, encrypt, reconnect, click, host, splurge, preheat, co-write, outsource, snack, partner
Although we have focused just on new "words" here, the same thing holds for other areas of language: morphology (word formation), syntax (grammar), semantics (word meaning, such as green = "environmentally friendly"), and discourse analysis (what we are saying about immigrants, or women, or the environment). Any changes that have occurred since the early 1990s may not show up in the BNC, but should be modeled quite nicely with COCA.
Genre balance
The BNC is 10% spoken / 90% written, while COCA is nearly evenly divided (20% in each genre) between spoken, fiction, popular magazines, newspaper, and academic.
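To make these proportions concrete, here is a rough sketch of the arithmetic, assuming the figures given in this comparison: a 100 million word BNC that is 10% spoken, and 20 million words per year in COCA split evenly across five genres. The variable names are only illustrative, and since COCA is described as "nearly" evenly divided, the results are approximations.

```python
# Figures quoted in the text (all approximate).
BNC_SIZE = 100_000_000
BNC_SPOKEN_PERCENT = 10
COCA_WORDS_PER_YEAR = 20_000_000
COCA_GENRES = ["spoken", "fiction", "popular magazines", "newspaper", "academic"]

# Spoken portion of the BNC: 10% of the whole corpus.
bnc_spoken_words = BNC_SIZE * BNC_SPOKEN_PERCENT // 100

# An even split of each COCA year across the five genres (20% each).
coca_words_per_genre_per_year = COCA_WORDS_PER_YEAR // len(COCA_GENRES)

print(f"BNC spoken words (approx.): {bnc_spoken_words:,}")
for genre in COCA_GENRES:
    print(f"COCA {genre}, words per year (approx.): {coca_words_per_genre_per_year:,}")
```

The per-year figure matters for studies of change over time: because the split stays the same from year to year, a difference between two years is less likely to be a side effect of a shift in genre mix.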
The BNC has a much wider range of spoken sub-genres, while the spoken portion of COCA is composed of unscripted conversation on TV and radio shows (see the notes on the naturalness of these conversations: COCA / Help-Information / Texts / Spoken). Both corpora are very well balanced in terms of sub-genres for the written genres (e.g. Newspaper-Sports, or Academic-Medicine). In addition, because there is a diachronic aspect to COCA (coverage over time), the distribution of 20% in each of the five genres stays constant from year to year.
Summary
COCA and the BNC complement each other nicely, and they are the only large, well-balanced corpora of English that are publicly available. The BNC has better coverage of informal, everyday conversation, while COCA is much larger and more recent, which has important implications for the quantity and quality of the data overall. Unless one is inherently interested in only British or American English, there is really no reason not to take advantage of both corpora. This is especially true when, as with the interface at www.english-corpora.org, both corpora can be used side by side with the same interface. For most types of studies, academic publications and presentations that rely on just the BNC for data from Modern English will look increasingly outdated and insular as time goes on.