Corpus of Historical American English

In March 2009 we received a large grant from the National Endowment for the Humanities to create a 300 million word corpus of historical American English (early 1800s to the present). The corpus will be balanced in each decade (and therefore overall) between fiction, popular magazines, newspapers, and non-fiction. The 300 million word COHA (1800s-2000s) will complement the 400+ million word Corpus of Contemporary American English (COCA; 1990s-2000s). Most importantly, it will allow researchers to examine a wide range of changes in American English with much more accuracy and detail than is possible with any other available corpus. An online beta version of the corpus should be available in summer 2010, and the final version will be available in December 2010.


The major sources for each genre are as follows:

Fiction: Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), COCA (1990-2010)
Magazine: Making of America (1810-1900), scanned and PDF sources (1900-1990), COCA (1990-2010)
  - In each decade, the magazine texts are balanced across at least ten magazines (with equivalent sub-genres for the 1900s)
Newspaper: PDF converted to TXT for at least five newspapers (1850-1980), COCA etc. (1990-2010)
Non-fiction: Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010)
  - In each decade, the non-fiction is balanced across the Library of Congress classification system


The following table gives an overview of the composition of the corpus, showing the number of words in each genre in each decade.
 
DECADE      Fiction     Magazine    Newspaper  Non-fiction        TOTAL
1810      2,600,000      100,000            -    1,500,000    4,200,000
1820      4,000,000    1,600,000            -    2,400,000    8,000,000
1830      6,200,000    3,100,000            -    3,600,000   12,900,000
1840      6,200,000    3,200,000            -    3,500,000   12,900,000
1850      6,500,000    4,000,000      400,000    3,100,000   14,000,000
1860      6,500,000    4,000,000      500,000    3,100,000   14,100,000
1870      6,600,000    4,000,000      600,000    3,100,000   14,300,000
1880      6,600,000    4,000,000      700,000    3,100,000   14,400,000
1890      6,600,000    4,100,000      800,000    3,100,000   14,600,000
1900      6,600,000    4,200,000    1,000,000    3,100,000   14,900,000
1910      6,600,000    4,200,000    1,800,000    3,100,000   15,700,000
1920      6,600,000    4,200,000    3,800,000    3,100,000   17,700,000
1930      6,600,000    4,200,000    3,800,000    3,100,000   17,700,000
1940      6,600,000    4,300,000    3,800,000    3,100,000   17,800,000
1950      6,700,000    4,300,000    3,800,000    3,000,000   17,800,000
1960      6,700,000    4,300,000    3,800,000    3,000,000   17,800,000
1970      6,700,000    4,300,000    3,800,000    3,000,000   17,800,000
1980      6,700,000    4,300,000    3,800,000    3,000,000   17,800,000
1990      6,700,000    4,300,000    3,800,000    3,000,000   17,800,000
2000      6,700,000    4,300,000    3,800,000    3,000,000   17,800,000
TOTAL   125,000,000   75,000,000   40,000,000   60,000,000  300,000,000
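
The arithmetic behind the table can be checked with a short script. The following is only a sketch: the figures are copied from the table above, the empty newspaper cells for 1810-1840 are assumed to mean zero words, and the names GENRES, COUNTS, decade_total, genre_totals, and genre_shares are illustrative and not part of any COHA tooling.

```python
# Genre order used for every row below.
GENRES = ("fiction", "magazine", "newspaper", "non-fiction")

# Planned word counts per decade, copied from the table above.
# Assumption: an empty newspaper cell (1810-1840) is treated as 0.
COUNTS = {
    1810: (2_600_000, 100_000, 0, 1_500_000),
    1820: (4_000_000, 1_600_000, 0, 2_400_000),
    1830: (6_200_000, 3_100_000, 0, 3_600_000),
    1840: (6_200_000, 3_200_000, 0, 3_500_000),
    1850: (6_500_000, 4_000_000, 400_000, 3_100_000),
    1860: (6_500_000, 4_000_000, 500_000, 3_100_000),
    1870: (6_600_000, 4_000_000, 600_000, 3_100_000),
    1880: (6_600_000, 4_000_000, 700_000, 3_100_000),
    1890: (6_600_000, 4_100_000, 800_000, 3_100_000),
    1900: (6_600_000, 4_200_000, 1_000_000, 3_100_000),
    1910: (6_600_000, 4_200_000, 1_800_000, 3_100_000),
    1920: (6_600_000, 4_200_000, 3_800_000, 3_100_000),
    1930: (6_600_000, 4_200_000, 3_800_000, 3_100_000),
    1940: (6_600_000, 4_300_000, 3_800_000, 3_100_000),
    1950: (6_700_000, 4_300_000, 3_800_000, 3_000_000),
    1960: (6_700_000, 4_300_000, 3_800_000, 3_000_000),
    1970: (6_700_000, 4_300_000, 3_800_000, 3_000_000),
    1980: (6_700_000, 4_300_000, 3_800_000, 3_000_000),
    1990: (6_700_000, 4_300_000, 3_800_000, 3_000_000),
    2000: (6_700_000, 4_300_000, 3_800_000, 3_000_000),
}


def decade_total(decade):
    """Sum of all genres for one decade (the TOTAL column)."""
    return sum(COUNTS[decade])


def genre_totals():
    """Sum of all decades for each genre (the TOTAL row)."""
    return {
        genre: sum(row[i] for row in COUNTS.values())
        for i, genre in enumerate(GENRES)
    }


def genre_shares(decade):
    """Fraction of a decade's words contributed by each genre."""
    total = decade_total(decade)
    return {genre: n / total for genre, n in zip(GENRES, COUNTS[decade])}


if __name__ == "__main__":
    for decade in sorted(COUNTS):
        shares = genre_shares(decade)
        line = "  ".join(f"{g} {shares[g]:6.1%}" for g in GENRES)
        print(f"{decade}s  {decade_total(decade):>10,}  {line}")
    print("overall:", genre_totals())
```

Under these figures the row and column sums match the stated totals, and the per-decade shares make the genre balance easy to compare: fiction, for example, is roughly 62% of the 1810s but roughly 38% of the 2000s.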