DOWNLOAD LIST OF ALL 248,752 TEXTS AND SUMMARY BY YEAR, GENRE, AND SUB-GENRE


The corpus contains more than 560 million words in 220,225 texts, including roughly 20 million words for each year from 1990 to 2017. The most recent texts (January 2016 to December 2017) were added in December 2017.

For each year (and therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

  • Spoken: (118 million words [118,167,133]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See notes on the naturalness and authenticity of the language from these transcripts.)

  • Fiction: (113 million words [113,404,735]) Short stories and plays from literary magazines, children's magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.

  • Popular Magazines: (118 million words [118,450,563]) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc). A few examples are Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.

  • Newspapers: (114 million words [114,341,164]) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.

  • Academic Journals: (112 million words [111,537,393]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year.

YEAR          SPOK          FIC          MAG         NEWS         ACAD        TOTAL
1990     4,241,820    4,100,296    3,993,642    4,000,927    3,914,328   20,251,013
1991     4,183,317    4,075,428    4,099,198    4,003,173    3,980,425   20,341,541
1992     4,367,946    3,792,255    4,292,672    3,984,942    3,957,009   20,394,824
1993     4,336,787    3,860,406    4,250,973    4,041,673    4,078,421   20,568,260
1994     4,305,046    4,046,747    4,293,745    4,040,013    3,977,781   20,663,332
1995     4,396,172    3,847,142    4,288,730    4,009,933    3,948,436   20,490,413
1996     3,965,565    3,858,640    4,277,667    3,987,828    4,037,870   20,127,570
1997     3,774,994    3,678,700    4,259,465    4,036,195    4,342,502   20,091,856
1998     4,314,807    3,683,747    4,283,190    4,019,406    4,038,454   20,339,604
1999     4,286,305    4,045,331    4,281,338    3,998,758    3,951,864   20,563,596
2000     4,297,830    3,850,344    4,282,437    3,949,191    4,019,668   20,399,470
2001     3,896,284    3,789,875    4,194,943    3,984,202    3,895,326   19,760,630
2002     4,230,138    3,674,168    4,210,790    4,001,474    3,980,495   20,097,065
2003     4,297,895    4,015,842    4,222,326    3,937,025    3,972,378   20,445,466
2004     4,224,432    3,999,217    4,229,015    4,003,463    3,938,459   20,394,586
2005     4,300,773    3,998,572    4,252,853    4,010,857    3,856,046   20,419,101
2006     4,210,862    4,004,822    4,205,020    4,005,230    3,994,522   20,420,456
2007     3,774,535    3,948,324    4,112,852    3,891,029    4,226,689   19,953,429
2008     3,533,287    4,076,895    4,191,580    3,969,842    3,917,939   19,689,543
2009     3,883,612    4,069,557    3,897,508    3,955,928    3,992,413   19,799,018
2010     4,023,555    3,885,982    3,765,169    4,219,629    3,787,581   19,681,916
2011     4,760,687    4,166,029    4,199,378    3,986,321    4,551,005   21,663,420
2012     4,336,058    4,335,155    4,294,190    4,173,813    4,337,823   21,477,039
2013     4,019,619    4,225,162    4,173,336    4,133,917    3,531,695   20,083,729
2014     4,004,868    4,134,220    4,266,683    4,142,500    3,456,761   20,005,032
2015     4,005,894    4,255,674    4,195,487    4,130,818    3,609,226   20,197,099
2016     4,371,199    4,197,883    4,087,037    4,134,560    4,005,824   20,796,503
2017     4,404,291    4,228,709    4,252,889    4,242,760    4,109,588   21,238,237
TOTAL  116,748,578  111,845,122  117,354,113  112,995,407  111,410,528  570,353,748
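
The claim that the corpus is evenly divided between the five genres can be checked against the TOTAL row of the table above. The following is a minimal Python sketch, not part of the corpus tools: the names GENRE_TOTALS and genre_shares are illustrative assumptions, and the figures are copied by hand from the TOTAL row.

```python
# Word counts per genre, copied from the TOTAL row of the table above.
# GENRE_TOTALS and genre_shares are hypothetical names used only for this sketch.
GENRE_TOTALS = {
    "SPOK": 116_748_578,
    "FIC": 111_845_122,
    "MAG": 117_354_113,
    "NEWS": 112_995_407,
    "ACAD": 111_410_528,
}


def genre_shares(totals):
    """Return each genre's fraction of the summed word count."""
    grand_total = sum(totals.values())
    return {genre: count / grand_total for genre, count in totals.items()}


if __name__ == "__main__":
    # Print each genre's share and how far it sits from an even 20% split.
    for genre, share in genre_shares(GENRE_TOTALS).items():
        print(f"{genre:<5} {share:6.2%}  (deviation from 20%: {share - 0.20:+.2%})")
```

With these figures, the five genre totals sum to the table's overall total of 570,353,748, and each genre's share falls within one percentage point of an even 20% split.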