Download a detailed list of all 107,000 texts
(Excel format, 17 MB; includes tables and charts by genre, sub-genre, and source)

The corpus is composed of more than 400 million words of text in more than 100,000 individual texts. The major sources for each genre are as follows:

Fiction: Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010)
Magazine: Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010)
- In each decade, the magazine texts are balanced across at least ten magazines (with equivalent sub-genres for the 1900s)
Newspaper: PDF converted to text from at least five newspapers (1850-1980), COCA etc. (1990-2010)
Non-fiction: Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010)
- In each decade, the non-fiction is balanced across the Library of Congress classification system

The corpus is balanced by genre across the decades. For example, fiction accounts for 48-55% of the total in each decade (1810s-2000s), and the corpus is balanced across decades for sub-genres and domains as well (e.g. by Library of Congress classification for non-fiction, and by sub-genre for fiction: prose, poetry, drama, etc.). This balance across genres and sub-genres allows researchers to examine changes over time and be reasonably certain that the data reflect actual changes in the "real world", rather than artifacts of a shifting genre balance.

Download all 115,000 texts, for use on your own computer.

DECADE  FICTION  POPULAR MAGAZINES  NEWSPAPERS  NON-FICTION BOOKS  TOTAL  % FICTION
1810s 641,164 88,316 0 451,542 1,181,022 0.54
1820s 3,751,204 1,714,789 0 1,461,012 6,927,005 0.54
1830s 7,590,350 3,145,575 0 3,038,062 13,773,987 0.55
1840s 8,850,886 3,554,534 0 3,641,434 16,046,854 0.55
1850s 9,094,346 4,220,558 0 3,178,922 16,493,826 0.55
1860s 9,450,562 4,437,941 262,198 2,974,401 17,125,102 0.55
1870s 10,291,968 4,452,192 1,030,560 2,835,440 18,610,160 0.55
1880s 11,215,065 4,481,568 1,355,456 3,820,766 20,872,855 0.54
1890s 11,212,219 4,679,486 1,383,948 3,907,730 21,183,383 0.53
1900s 12,029,439 5,062,650 1,433,576 4,015,567 22,541,232 0.53
1910s 11,935,701 5,694,710 1,489,942 3,534,899 22,655,252 0.53
1920s 12,539,681 5,841,678 3,552,699 3,698,353 25,632,411 0.49
1930s 11,876,996 5,910,095 3,545,527 3,080,629 24,413,247 0.49
1940s 11,946,743 5,644,216 3,497,509 3,056,010 24,144,478 0.49
1950s 11,986,437 5,796,823 3,522,545 3,092,375 24,398,180 0.49
1960s 11,578,880 5,803,276 3,404,244 3,141,582 23,927,982 0.48
1970s 11,626,911 5,755,537 3,383,924 3,002,933 23,769,305 0.49
1980s 12,152,603 5,804,320 4,113,254 3,108,775 25,178,952 0.48
1990s 13,272,162 7,440,305 4,060,570 3,104,303 27,877,340 0.48
2000s 14,590,078 7,678,830 4,088,704 3,121,839 29,479,451 0.49
TOTAL 207,633,395 97,207,399 40,124,656 61,266,574 406,232,024 0.51
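
The TOTAL and % FICTION columns can be recomputed from the four genre counts in each row. The following is a minimal sketch, assuming the last column is the fiction word count divided by the row total, rounded to two decimal places. The three rows are copied from the table above; the names ROWS and fiction_share are illustrative and do not come from the corpus documentation.

```python
# Word counts copied from three rows of the table above, in column order:
# (fiction, popular magazines, newspapers, non-fiction books)
ROWS = {
    "1810s": (641_164, 88_316, 0, 451_542),
    "1920s": (12_539_681, 5_841_678, 3_552_699, 3_698_353),
    "2000s": (14_590_078, 7_678_830, 4_088_704, 3_121_839),
}


def fiction_share(counts):
    """Return (total words, fiction fraction rounded to two places).

    Assumes the first element of `counts` is the fiction word count.
    """
    total = sum(counts)
    return total, round(counts[0] / total, 2)


for decade, counts in ROWS.items():
    total, share = fiction_share(counts)
    # Print in the same layout as the table: decade, total, share.
    print(f"{decade}  {total:>12,}  {share:.2f}")
```

Running the same check on every row is a quick way to confirm that the fiction share stays within the 48-55% range stated above.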