DOWNLOAD LIST OF ALL 220,225 TEXTS AND SUMMARY BY YEAR, GENRE, AND SUB-GENRE


The corpus contains more than 520 million words in 220,225 texts, with about 20 million words for each year from 1990 to 2015. The most recent batch of texts (July 2012 - December 2015) was added in December 2015.

For each year (and therefore overall), the corpus is divided roughly evenly among five genres: spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

  • Spoken: (109 million words [109,391,643]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See notes on the naturalness and authenticity of the language from these transcripts.)

  • Fiction: (105 million words [104,900,827]) Short stories and plays from literary magazines, children's magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.

  • Popular Magazines: (110 million words [110,110,637]) Nearly 100 different magazines, with a good mix (overall, and by year) across specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.

  • Newspapers: (106 million words [105,963,844]) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.

  • Academic Journals: (103 million words [103,421,981]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year.
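
The bracketed word counts in the list above can be used to check how close the five genres are to an even split. The following is a minimal sketch in Python; the names `GENRE_WORDS`, `genre_shares`, `TOTAL_WORDS`, and `SHARES` are illustrative assumptions and not part of any corpus tooling.

```python
# Exact word counts per genre, copied from the bracketed figures in the list above.
GENRE_WORDS = {
    "Spoken": 109_391_643,
    "Fiction": 104_900_827,
    "Popular Magazines": 110_110_637,
    "Newspapers": 105_963_844,
    "Academic Journals": 103_421_981,
}


def genre_shares(counts):
    """Return each genre's fraction of the combined word count."""
    total = sum(counts.values())
    return {genre: words / total for genre, words in counts.items()}


# Hypothetical helper names used only for this illustration.
TOTAL_WORDS = sum(GENRE_WORDS.values())
SHARES = genre_shares(GENRE_WORDS)

if __name__ == "__main__":
    print(f"Total words: {TOTAL_WORDS:,}")
    for genre, share in SHARES.items():
        # A perfectly even split would give 20.0% per genre.
        print(f"{genre:<18} {share:6.2%}")
```

With these figures every genre falls between 19% and 21% of the corpus, so the division is close to even without being exact.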

YEAR SPOKEN FICTION MAGAZINE NEWSPAPER ACADEMIC TOTAL
1990 4,332,983 4,176,786 4,061,059 4,072,572 3,943,968 20,587,368
1991 4,275,641 4,152,690 4,170,022 4,075,636 4,011,142 20,685,131
1992 4,493,738 3,862,984 4,359,784 4,060,218 3,988,593 20,765,317
1993 4,449,330 3,936,880 4,318,256 4,117,294 4,109,914 20,931,674
1994 4,416,223 4,128,691 4,360,184 4,116,061 4,008,481 21,029,640
1995 4,506,463 3,925,121 4,355,396 4,086,909 3,978,437 20,852,326
1996 4,060,792 3,938,742 4,348,339 4,062,397 4,070,075 20,480,345
1997 3,874,976 3,750,256 4,330,117 4,114,733 4,378,426 20,448,508
1998 4,424,874 3,754,334 4,353,187 4,096,829 4,070,949 20,700,173
1999 4,417,997 4,130,984 4,353,229 4,079,926 3,983,704 20,965,840
2000 4,414,772 3,925,331 4,353,049 4,034,817 4,053,691 20,781,660
2001 3,987,514 3,869,790 4,262,503 4,066,589 3,924,911 20,111,307
2002 4,329,856 3,745,852 4,279,955 4,085,554 4,014,495 20,455,712
2003 4,404,978 4,094,865 4,295,543 4,022,457 4,007,927 20,825,770
2004 4,330,018 4,076,462 4,300,735 4,084,584 3,974,453 20,766,252
2005 4,396,030 4,075,210 4,328,642 4,089,168 3,890,318 20,779,368
2006 4,304,513 4,081,287 4,279,043 4,085,757 4,028,620 20,779,220
2007 3,882,586 4,028,998 4,185,161 3,975,474 4,267,452 20,339,671
2008 3,635,622 4,155,298 4,205,477 4,031,769 4,015,545 20,043,711
2009 3,969,587 4,143,814 3,855,815 3,971,607 4,144,064 20,084,887
2010 4,095,393 3,929,160 3,806,011 4,258,633 3,816,420 19,905,617
2011 4,033,627 4,166,029 4,199,378 3,982,299 4,064,535 20,445,868
2012 4,379,692 4,348,845 4,267,066 4,125,682 4,300,876 21,422,161
2013 4,001,833 4,184,681 4,126,129 4,091,086 3,467,083 19,870,812
2014 3,985,902 4,101,083 4,215,001 4,094,162 3,383,971 19,780,119
2015 3,986,703 4,216,654 4,141,556 4,081,631 3,523,931 19,950,475
TOTAL 109,391,643 104,900,827 110,110,637 105,963,844 103,421,981 533,788,932
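
Because the table is plain space-separated text with comma-grouped numbers, it can be loaded and sanity-checked with a few lines of Python. The sketch below copies three rows of the table verbatim as sample input; `SAMPLE_ROWS`, `parse_table`, and `row_is_consistent` are hypothetical names chosen for this illustration.

```python
# Header plus three rows copied verbatim from the table above (sample input only).
SAMPLE_ROWS = """\
YEAR SPOKEN FICTION MAGAZINE NEWSPAPER ACADEMIC TOTAL
1990 4,332,983 4,176,786 4,061,059 4,072,572 3,943,968 20,587,368
2012 4,379,692 4,348,845 4,267,066 4,125,682 4,300,876 21,422,161
2015 3,986,703 4,216,654 4,141,556 4,081,631 3,523,931 19,950,475
"""

GENRES = ("SPOKEN", "FICTION", "MAGAZINE", "NEWSPAPER", "ACADEMIC")


def parse_table(text):
    """Parse the space-separated table into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, body = lines[0], lines[1:]
    rows = []
    for fields in body:
        row = {"YEAR": fields[0]}
        for name, value in zip(header[1:], fields[1:]):
            # Strip the thousands separators before converting to int.
            row[name] = int(value.replace(",", ""))
        rows.append(row)
    return rows


def row_is_consistent(row):
    """Check that the five genre columns add up to the TOTAL column."""
    return sum(row[genre] for genre in GENRES) == row["TOTAL"]


if __name__ == "__main__":
    for row in parse_table(SAMPLE_ROWS):
        status = "ok" if row_is_consistent(row) else "MISMATCH"
        print(row["YEAR"], f"{row['TOTAL']:,}", status)
```

The same functions should work on the full table if all rows are pasted in, since the final TOTAL row has the same column layout as the yearly rows.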