Corpus size

The Corpus of Contemporary American English (COCA; 450+ million words) is 4-5 times as large as the British National Corpus (BNC; 100 million words). As a result, it often provides data for lower-frequency constructions that are not available from the BNC. As concrete examples, let us focus here on just two types of phenomena -- collocates and syntax.

Collocates / semantics. The following table shows the number of different collocates that occur at least 3-5 times with the given node words. Notice that with a word like nibble, the word itself occurs only 4-5 times as often in COCA as in the BNC (1194 to 244 tokens, as expected from a corpus 4-5 times the size). But in terms of collocates, there are 14 times as many in COCA that occur 5 times or more as there are in the BNC. For low-frequency words like these, there is often a real difference between a 100 million word corpus and a 450 million word corpus.
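The nibble figures above can be put on a common scale by normalizing to tokens per million words. The following is a minimal sketch, assuming round corpus sizes of 450 million (COCA) and 100 million (BNC) words; the function and variable names are illustrative and not part of either corpus interface.

```python
# Assumed round corpus sizes, taken from the figures quoted in the text.
COCA_SIZE = 450_000_000
BNC_SIZE = 100_000_000


def per_million(tokens: int, corpus_size_words: int) -> float:
    """Return the frequency of a word normalized to tokens per million words."""
    return tokens / corpus_size_words * 1_000_000


# Token counts for "nibble" as quoted in the text.
coca_tokens = 1194
bnc_tokens = 244

coca_rate = per_million(coca_tokens, COCA_SIZE)  # roughly 2.65 per million
bnc_rate = per_million(bnc_tokens, BNC_SIZE)     # roughly 2.44 per million

raw_ratio = coca_tokens / bnc_tokens             # roughly 4.9
normalized_ratio = coca_rate / bnc_rate          # roughly 1.09

print(f"COCA: {coca_rate:.2f} per million, BNC: {bnc_rate:.2f} per million")
print(f"raw ratio: {raw_ratio:.2f}, normalized ratio: {normalized_ratio:.2f}")
```

Under these assumptions the raw counts differ by a factor of about 4.9, while the per-million rates differ by less than 10%. The word is about equally common in both corpora, so the 14-fold difference in collocates reflects corpus size rather than usage.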
Syntax. Consider the following three examples.
In summary, while 100 million words is often adequate for studying syntax, for some very low-frequency phenomena there is a real difference between 100 million words (BNC) and 450 million words (COCA).

How up-to-date are the corpora?

COCA has 20 million words for each year since the early 1990s (for a total of more than 450 million words), and the most recent texts are from Summer 2012. The most recent texts in the BNC, on the other hand, are from the early 1990s -- more than twenty years (nearly a generation) ago. This has important implications for how well the two corpora represent contemporary English.

Lexical. Perhaps the easiest comparison deals with words that have recently come into English, or which are used a lot more now than 15-20 years ago. The following lists show a few words (just a tiny sample of all such words) that are found less than half as often in the BNC as in COCA (per million words), and the words in italics are found less than 10% as often (often, there are no tokens in the BNC). Obviously, some are American words and wouldn't be in a corpus of British English. Many others, however, are words that are simply much more common in COCA, because it alone contains texts from the last 20 years.
Noun: website (COCA/BNC), blog (COCA/BNC), globalization/globalisation (COCA/BNC), SUV, RPG, Taliban, e-mail, anthrax, recount, adolescent, prep, tsunami, affiliation, Sunni, insurgent, insurgency, terrorism, coping, terrorist, cleric, yoga, homeland, genome, steroid, detainee, militant

Adjective: same-sex (COCA/BNC), Islamist (COCA/BNC), upscale (COCA/BNC), terrorist, faith-based, web-based, nonstick, dot-com, performance-enhancing, high-stakes, 21st-century, old-school, pandemic, iconic, insurgent, online, broadband, gated, wireless, clueless

Adverb: wirelessly (COCA/BNC), healthfully (COCA/BNC), multiculturally (COCA/BNC), preemptively, inferiorly, counterintuitively, online, forensically, intraoperatively, postoperatively, famously

Verb: mentor (COCA/BNC), morph (COCA/BNC), download (COCA/BNC), e-mail, makeover, prep, upload, workout, freak, transition, vaccinate, encrypt, reconnect, click, host, splurge, preheat, co-write, outsource, snack, partner

Although we have focused just on new "words" here, the same thing holds for other areas of language -- morphology (word formation), syntax (grammar), semantics (word meaning, such as green = "environmentally friendly"), and discourse analysis (what we are saying about immigrants, or women, or the environment). Any changes that have occurred since the early 1990s will not show up in the BNC, but should be modeled quite nicely with COCA.

Genre balance

The BNC is 10% spoken / 90% written, while COCA is nearly evenly divided (20% in each genre) among spoken, fiction, popular magazines, newspaper, and academic.
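To make the genre proportions concrete, they can be converted into approximate word counts. This is a rough sketch only: it assumes the rounded totals quoted in this document (450 million words and 20 million words per year for COCA, 100 million for the BNC) and exact 20% and 10% shares, whereas the actual corpora are only "nearly" evenly divided. All names are illustrative.

```python
# Rounded figures quoted in the text; the real totals differ somewhat.
COCA_TOTAL = 450_000_000
COCA_PER_YEAR = 20_000_000
BNC_TOTAL = 100_000_000


def words_in_share(total_words: int, percent: int) -> int:
    """Return the number of words covered by a whole-number percentage share."""
    return total_words * percent // 100


coca_spoken = words_in_share(COCA_TOTAL, 20)            # 90,000,000
bnc_spoken = words_in_share(BNC_TOTAL, 10)              # 10,000,000
coca_genre_per_year = words_in_share(COCA_PER_YEAR, 20)  # 4,000,000

print(f"COCA spoken: {coca_spoken:,} words")
print(f"BNC spoken:  {bnc_spoken:,} words")
print(f"COCA words per genre per year: {coca_genre_per_year:,}")
```

On these rounded assumptions, COCA contains roughly nine times as much spoken material as the BNC, and about 4 million words per genre for each year.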
The BNC has a much wider range of spoken sub-genres, while the spoken portion of COCA is composed of unscripted conversation on TV and radio shows. Both corpora are very well balanced in terms of sub-genres for the written genres (e.g. Newspaper-Sports, or Academic-Medicine). In addition, because there is a diachronic aspect to COCA (coverage over time), the distribution of 20% in each of the five genres stays constant from year to year.

Summary

COCA and the BNC complement each other nicely, and they are the only large, well-balanced corpora of English that are publicly available. The BNC has better coverage of informal, everyday conversation, while COCA is much larger and more recent, which has important implications for the quantity and quality of the data overall. Unless one is inherently interested in only British or American English, there is really no reason not to take advantage of both corpora. This is especially true when -- as with the interface at corpus.byu.edu -- both corpora can be used side by side, with the same interface. For most types of studies, academic publications and presentations that rely on just the BNC for data from Modern English will look increasingly outdated and insular as time goes on.