There are significant differences between the Corpus of Contemporary American English (COCA) and the American National Corpus (ANC), as summarized in the following table. These differences lead to important differences in the quantity and quality of the data from the two corpora, as noted below.
Notes
1 The Corpus of Contemporary American English contained about 365 million words when it was released in early 2008 (20 million words each year, 1990-2007). As of Dec 2019, it has more than one billion words. It will continue
to grow by 20 million words each year.

With such a difference in the composition of the two corpora, it is not surprising that they yield very different data. In this section, we compare the 5,000-10,000 most frequent words in the two corpora, but similar comparisons could be made for syntax, semantics, etc.

+COCA / -ANC

About 20-25% of the words in the top 5,000 COCA wordlist are not in the ANC list. In other words, for these lemmas from the COCA top 5,000, the word ranks at least twice as far down the ANC list (e.g. COCA #4000, ANC #8000 or lower). Things get much messier at lower frequency levels, where the ANC lists are missing 50-60% of the words in the COCA lists. The following words are examples. They are in the top 3,000-4,000 words in COCA, but (in this case) they are at least four times farther down the ANC list (for example, #2000 in COCA, #9000 in the ANC). As one can see, the lists are full of "everyday" words:

Adjectives: left, far, concerned, involved, supposed, Christian, growing, clean, alone, married, Catholic, English, used, surprised, spiritual, existing, living, fun, remaining, leading

Nouns: university, back, data, American, Republican, congress, south, east, Democrat, troop, institute, Christmas, learning, sir, fat, Jew, e-mail, academy, Indian, navy, teen, pine, Muslim, Olympics, handle

Verbs: need, stand, thank, lay, laugh, shake, smile, stare, drink, lift, grab, lean, nod, stir, dance, bend, slide, kiss, whisper, glance, pray, wave, bake, pause, shrug, cope, brush, sigh, excuse, hurry, burst, spill, hug, blend

+ANC / -COCA

On the other hand, about 20-25% of the words in the ANC top 5,000 list are not in the COCA list, and things are much messier for lower-frequency words. The following are words in the top 5,000 of the ANC list that are at least four times farther down the COCA list (e.g. ANC #2000, COCA #9000). As one can see, they are either errors (bad part of speech or lemma) in the ANC, or are a function of the skewed text composition of the ANC (apparently, lots of academic journal articles on DNA sequencing):

Adjectives: uh-huh, um-hum, binding, e-mail, amino, conserved, mutant, genomic, molecular, incubated, viral, wild-type, purified, bye-bye, cultured, locus, correlated, putative, phylogenetic, endogenous, cytoplasmic, downstream, mammalian, catalytic, sequenced, transfected, recombinant, transgenic, terminus, gene-expression, eukaryotic

Nouns: yeah, um, cell, gene, datum, protein, sequence, gonna, tissue, acid, receptor, genome, mutation, tumor, huh, www, probe, cdna, mhm, mrna, clone, assay, membrane, activation, transcription, chromosome

Verbs: accord, detect, induce, calculate, isolate, label, activate, usee, controll, bind, stain, clone, cluster, inhibit, code, underlie, rang, amplify, overlap, school, sequence, encode, splice
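
The rank comparison used in this section (a word in the top N of one list that sits at least k times farther down the other list) can be sketched in Python as follows. This is only an illustration, not the procedure that produced the lists above. The function name `rank_gaps`, its parameters, and the toy ranks are all assumptions made for the example, as is the choice to treat a word that is absent from the second list as a gap.

```python
def rank_gaps(ranks_a, ranks_b, top_n=5000, factor=2.0):
    """Find words in the top `top_n` of list A that are absent from list B
    or whose rank in list B is at least `factor` times their rank in list A.

    `ranks_a` and `ranks_b` map each word (lemma) to its frequency rank,
    where rank 1 is the most frequent word.
    """
    flagged = []
    for word, rank_a in ranks_a.items():
        if rank_a > top_n:
            continue  # only consider the top of list A
        rank_b = ranks_b.get(word)  # None if the word is not in list B
        if rank_b is None or rank_b >= factor * rank_a:
            flagged.append((word, rank_a, rank_b))
    # Report the words in order of their rank in list A.
    return sorted(flagged, key=lambda item: item[1])


# Toy ranks, invented purely for illustration (not real COCA or ANC data).
coca = {"need": 100, "smile": 2000, "gene": 3000, "whisper": 4000}
anc = {"need": 150, "smile": 9000, "gene": 400}

# Words in the COCA top 5,000 that are at least four times farther
# down the ANC list (or missing from it altogether).
print(rank_gaps(coca, anc, top_n=5000, factor=4.0))
```

Swapping the two arguments gives the comparison in the other direction (+ANC / -COCA), and dividing the number of flagged words by `top_n` gives the share of the list that is affected.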