Corpus of Historical American English
In March 2009 we received a large grant from the National Endowment for
the Humanities to create a 300-million-word Corpus of Historical American
English (COHA), covering the early 1800s to the present. The corpus will be
balanced in each decade (and therefore overall) across fiction, popular
magazines, newspapers, and non-fiction. The 300-million-word COHA
(1800s-2000s) will complement the 400+ million-word Corpus of Contemporary
American English (COCA; 1990s-2000s). Most importantly, it will allow
researchers to examine a wide range of changes in American English with
much more accuracy and detail than is possible with any other available
corpus. An online beta version of the corpus should be available in
Summer 2010, and the final version will be available in December 2010.
The major sources for each genre are as follows:
| Genre | Major sources |
|---|---|
| Fiction | Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), COCA (1990-2010) |
| Magazine | Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010). In each decade, the texts are balanced across at least ten magazines (with equivalent sub-genres for the 1900s) |
| Newspaper | PDF converted to TXT for at least five newspapers (1850-1980), COCA etc. (1990-2010) |
| Non-fiction | Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010). In each decade, the non-fiction is balanced across the Library of Congress classification system |
The following table gives an overview of the composition of the corpus,
showing the number of words in each genre in each decade.
| DECADE | Fiction | Magazine | Newspaper | Non-fiction | TOTAL |
|---|---|---|---|---|---|
| 1810 | 2,600,000 | 100,000 | | 1,500,000 | 4,200,000 |
| 1820 | 4,000,000 | 1,600,000 | | 2,400,000 | 8,000,000 |
| 1830 | 6,200,000 | 3,100,000 | | 3,600,000 | 12,900,000 |
| 1840 | 6,200,000 | 3,200,000 | | 3,500,000 | 12,900,000 |
| 1850 | 6,500,000 | 4,000,000 | 400,000 | 3,100,000 | 14,000,000 |
| 1860 | 6,500,000 | 4,000,000 | 500,000 | 3,100,000 | 14,100,000 |
| 1870 | 6,600,000 | 4,000,000 | 600,000 | 3,100,000 | 14,300,000 |
| 1880 | 6,600,000 | 4,000,000 | 700,000 | 3,100,000 | 14,400,000 |
| 1890 | 6,600,000 | 4,100,000 | 800,000 | 3,100,000 | 14,600,000 |
| 1900 | 6,600,000 | 4,200,000 | 1,000,000 | 3,100,000 | 14,900,000 |
| 1910 | 6,600,000 | 4,200,000 | 1,800,000 | 3,100,000 | 15,700,000 |
| 1920 | 6,600,000 | 4,200,000 | 3,800,000 | 3,100,000 | 17,700,000 |
| 1930 | 6,600,000 | 4,200,000 | 3,800,000 | 3,100,000 | 17,700,000 |
| 1940 | 6,600,000 | 4,300,000 | 3,800,000 | 3,100,000 | 17,800,000 |
| 1950 | 6,700,000 | 4,300,000 | 3,800,000 | 3,000,000 | 17,800,000 |
| 1960 | 6,700,000 | 4,300,000 | 3,800,000 | 3,000,000 | 17,800,000 |
| 1970 | 6,700,000 | 4,300,000 | 3,800,000 | 3,000,000 | 17,800,000 |
| 1980 | 6,700,000 | 4,300,000 | 3,800,000 | 3,000,000 | 17,800,000 |
| 1990 | 6,700,000 | 4,300,000 | 3,800,000 | 3,000,000 | 17,800,000 |
| 2000 | 6,700,000 | 4,300,000 | 3,800,000 | 3,000,000 | 17,800,000 |
| TOTAL | 125,000,000 | 75,000,000 | 40,000,000 | 60,000,000 | 300,000,000 |
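For readers who want to work with these planned figures, the following is a minimal Python sketch that re-enters the per-decade counts from the table above and recomputes the row and column totals, along with each genre's share of a given decade. The names `PLAN`, `GENRES`, `UNIT`, `decade_totals`, `genre_totals`, and `genre_shares` are illustrative choices and not part of any corpus tooling. Treating the blank newspaper cells for 1810-1840 as zero is also an assumption.

```python
# Planned word counts per decade, in units of 100,000 words, copied from the
# composition table. Column order: fiction, magazine, newspaper, non-fiction.
# Assumption: a blank newspaper cell in the table is recorded here as 0.
GENRES = ("fiction", "magazine", "newspaper", "non-fiction")
UNIT = 100_000

PLAN = {
    1810: (26, 1, 0, 15),
    1820: (40, 16, 0, 24),
    1830: (62, 31, 0, 36),
    1840: (62, 32, 0, 35),
    1850: (65, 40, 4, 31),
    1860: (65, 40, 5, 31),
    1870: (66, 40, 6, 31),
    1880: (66, 40, 7, 31),
    1890: (66, 41, 8, 31),
    1900: (66, 42, 10, 31),
    1910: (66, 42, 18, 31),
    1920: (66, 42, 38, 31),
    1930: (66, 42, 38, 31),
    1940: (66, 43, 38, 31),
    1950: (67, 43, 38, 30),
    1960: (67, 43, 38, 30),
    1970: (67, 43, 38, 30),
    1980: (67, 43, 38, 30),
    1990: (67, 43, 38, 30),
    2000: (67, 43, 38, 30),
}


def decade_totals(plan):
    """Return the total planned word count for each decade (row sums)."""
    return {decade: sum(counts) * UNIT for decade, counts in plan.items()}


def genre_totals(plan):
    """Return the total planned word count for each genre (column sums)."""
    return {
        genre: sum(counts[i] for counts in plan.values()) * UNIT
        for i, genre in enumerate(GENRES)
    }


def genre_shares(plan, decade):
    """Return each genre's fraction of the planned words in one decade."""
    counts = plan[decade]
    total = sum(counts)
    return {genre: counts[i] / total for i, genre in enumerate(GENRES)}


if __name__ == "__main__":
    for genre, words in genre_totals(PLAN).items():
        print(f"{genre:12s} {words:>12,}")
    print(f"{'overall':12s} {sum(decade_totals(PLAN).values()):>12,}")
```

Under these assumptions the recomputed column sums are 125, 75, 40, and 60 million words, and the overall sum is 300 million, which agrees with the TOTAL row of the table.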