The Corpus of Contemporary American English (COCA) and the American National Corpus (ANC)


There are significant differences between the Corpus of Contemporary American English (COCA) and the American National Corpus (ANC), as summarized in the following table. These differences in size and composition lead to important differences in the quantity and quality of the data from the two corpora, as noted below.

 

Each entry below gives the value for the Corpus of Contemporary American English (COCA) first, then the value for the American National Corpus (ANC) (note 2).

Size
    COCA: 410+ million words (note 1)
    ANC: 22 million words

Dates
    COCA: 1990 - 2010
    ANC: 1990 - ?? (note 3)

Date distribution
    COCA: 20 million words each year
    ANC: 0.5-3 million

Updated
    COCA: Yes, 1-2 times/year
    ANC: No (??) (note 4)

Availability / price
    COCA: Free access (but only via web interface)
    ANC: Free (via Open ANC), or DVD ($75, from the LDC). Full text access.

Spoken
    COCA: 85 million words (4m each year, 1990-2010). Transcripts of unscripted conversation on 150+ different TV and radio programs (ABC, CBS, NBC, Fox, PBS, NPR, etc.)
    ANC: 4 million words. Call-Home, Charlotte, MICASE, Switchboard

Fiction
    COCA: 81 million words (4m each year, 1990-2010). Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, movie and TV scripts
    ANC: 0.5 million words. From the publishers Hargraves and Eggan

Magazines (popular)
    COCA: 86 million words (4m each year, 1990-2010). 100 magazines; balanced between news, health, home and gardening, women, financial, religion, sports, etc.
    ANC: 5 million words. 2 magazines: Slate (politics) and Verbatim (linguistics)

Newspapers
    COCA: 81 million words (4m each year, 1990-2010). 10 newspapers, including USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc.
    ANC: 4 million words. 1 newspaper: New York Times

Academic (journals)
    COCA: 81 million words (4m each year, 1990-2010). 100 journals. Balanced coverage of the entire range of the Library of Congress classification system (K = education, T = technology, etc.).
    ANC: 4 million words. 2 journals: BioMed and PLOS (Public Library of Science)

Other text types
    COCA: (none)
    ANC: 3 million words: Blog (Buffy the Vampire Slayer)
         1m: Travel guide (Berlitz)
         1m: Government ("web data" [??])
         <1m: Miscellaneous (9/11 report, letters, other non-fiction)
 

Notes

1 The Corpus of Contemporary American English contained about 365 million words when it was released in early 2008 (20 million words each year, 1990-2007). As of mid-2010, it has more than 410 million words. It will continue to grow by 20 million words each year.
2 Refers to the Second Release (2005) of the American National Corpus. There has not been a Third Release since that time.
3 The end date probably depends on whether and when the ANC is completed.
4 The ANC was projected to have 100 million words upon completion in c. 2005. No plans have been announced to expand the corpus beyond that size, if/when the corpus is completed.


Comparison of data

With such a difference in composition, it is not surprising that the two corpora yield very different data. In this section, we compare the 5,000-10,000 most frequent words in the two corpora, but similar comparisons could be made for syntax, semantics, etc.
 

+COCA / -ANC

About 20-25% of the words in the top 5,000 COCA wordlist are not in the ANC list. In other words, for about 20-25% of the top 5,000 lemmas in COCA, the word ranks at least twice as far down the ANC list (e.g. COCA #4000, ANC #8000 or lower). Things get much, much messier at lower frequency levels, where the ANC lists will be missing 50-60% of the words in the COCA lists.

The following words are examples. These words are in the top 3,000-4,000 words in COCA, but (in this case) they are at least four times farther down the ANC list (for example, #2000 in COCA, #9000 in the ANC). As one can see, many of them are "everyday" words:

Adjectives: left, far, concerned, involved, supposed, Christian, growing, clean, alone, married, Catholic, English, used, surprised, spiritual, existing, living, fun, remaining, leading

Nouns: university, back, data, American, Republican, congress, south, east, Democrat, troop, institute, Christmas, learning, sir, fat, Jew, e-mail, academy, Indian, navy, teen, pine, Muslim, Olympics, handle

Verbs: need, stand, thank, lay, laugh, shake, smile, stare, drink, lift, grab, lean, nod, stir, dance, bend, slide, kiss, whisper, glance, pray, wave, bake, pause, shrug, cope, brush, sigh, excuse, hurry, burst, spill, hug, blend

(Note that many of these words come from fiction and from "popular magazines". They occur very infrequently in the ANC, since the ANC has essentially no texts from fiction or popular magazines. COCA, on the other hand, has 150+ million words from these genres.)
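
The rank comparison behind these lists can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used to build the lists: the function names (rank_map, underrepresented, share_underrepresented) and the five-word toy rankings are hypothetical, and each corpus is assumed to be available as a list of lemmas ordered from most to least frequent. A word counts as under-represented in the target corpus when its rank there is at least `factor` times its rank in the source corpus, or when it is missing from the target list altogether.

```python
from typing import Dict, List


def rank_map(words_by_frequency: List[str]) -> Dict[str, int]:
    """Map each word to its 1-based rank in a frequency-ordered list."""
    ranks: Dict[str, int] = {}
    for rank, word in enumerate(words_by_frequency, start=1):
        ranks.setdefault(word, rank)  # keep the first (best) rank if a word repeats
    return ranks


def underrepresented(source: List[str], target: List[str],
                     top_n: int, factor: float) -> List[str]:
    """Words in the top_n of `source` that are absent from `target`
    or sit at least `factor` times farther down the `target` list."""
    target_rank = rank_map(target)
    result = []
    for rank, word in enumerate(source[:top_n], start=1):
        other = target_rank.get(word)
        if other is None or other >= factor * rank:
            result.append(word)
    return result


def share_underrepresented(source: List[str], target: List[str],
                           top_n: int, factor: float) -> float:
    """Fraction of the top_n source words that are under-represented in target."""
    top = source[:top_n]
    if not top:
        return 0.0
    return len(underrepresented(source, target, top_n, factor)) / len(top)


# Hypothetical toy rankings (invented for illustration, not real corpus data).
coca_toy = ["the", "need", "smile", "gene", "data"]
anc_toy = ["the", "gene", "data", "protein", "need"]

print(underrepresented(coca_toy, anc_toy, top_n=5, factor=2))  # ['need', 'smile']
print(underrepresented(anc_toy, coca_toy, top_n=5, factor=2))  # ['gene', 'protein']
print(share_underrepresented(coca_toy, anc_toy, top_n=5, factor=2))  # 0.4
```

With real wordlists, calling the function with top_n=5000 and factor=2 would correspond to the "at least twice as far down" criterion, and factor=4 to the stricter criterion used for the example words; swapping the two lists gives the comparison in the opposite direction.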

+ANC / -COCA

On the other hand, about 20-25% of the words in the ANC top 5,000 list are not in the COCA list, and things are much messier for lower-frequency words. The following words are in the top 5,000 of the ANC list but are at least four times farther down the COCA list (e.g. ANC #2000, COCA #9000). As one can see, they either reflect errors in the ANC (bad part of speech or lemma) or are a function of the skewed text composition of the ANC (apparently, lots of academic journal articles on DNA sequencing):

Adjectives: uh-huh, um-hum, binding, e-mail, amino, conserved, mutant, genomic, molecular, incubated, viral, wild-type, purified, bye-bye, cultured, locus, correlated, putative, phylogenetic, endogenous, cytoplasmic, downstream, mammalian, catalytic, sequenced, transfected, recombinant, transgenic, terminus, gene-expression, eukaryotic

Nouns: yeah, um, cell, gene, datum, protein, sequence, gonna, tissue, acid, receptor, genome, mutation, tumor, huh, www, probe, cdna, mhm, mrna, clone, assay, membrane, activation, transcription, chromosome

Verbs: accord, detect, induce, calculate, isolate, label, activate, usee, controll, bind, stain, clone, cluster, inhibit, code, underlie, rang, amplify, overlap, school, sequence, encode, splice