DOWNLOAD LIST OF ALL TEXTS AND SUMMARY BY YEAR, GENRE, AND SUB-GENRE (XLS version)
Download file with information for each text: tokens, types, avg. word length, # nouns, # tokens
The corpus is composed of more than 425 million words in more than
175,000 texts (176,389, to be exact), including about 20 million words
for each year from 1990 to 2010 and a partial sample for 2011. For each
year (and therefore overall as well), the corpus is divided nearly evenly
among the five genres of spoken, fiction, popular magazines, newspapers,
and academic journals. The texts come from a variety of sources:
-
Spoken: (90 million words [90,065,764]) Transcripts of unscripted
conversation from more than 150 different TV and radio programs
(examples: All Things Considered (NPR), Newshour (PBS),
Good Morning America (ABC), Today Show (NBC), 60 Minutes
(CBS), Hannity and Colmes (Fox), Jerry Springer, etc.).
(See notes on the naturalness and authenticity of the language from
these transcripts.)
-
Fiction: (85 million words
[84,965,507]) Short stories and plays
from literary magazines, children’s magazines, popular magazines, first
chapters of first edition books 1990-present, and movie scripts.
-
Popular Magazines: (90 million words [90,292,046]) Nearly 100
different magazines, with a good mix (overall, and by year) among
specific domains (news, health, home and gardening, women, financial,
religion, sports, etc.). A few examples are Time,
Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian
Century, and Sports Illustrated.
-
Newspapers: (87 million words [86,670,481]) Ten newspapers from
across the US, including USA Today, New York Times, Atlanta Journal
Constitution, San Francisco Chronicle, etc. In most cases, there is a good
mix among different sections of the newspaper, such as local news,
opinion, sports, financial, etc.
-
Academic Journals: (86 million words [85,791,918]) Nearly 100
different peer-reviewed journals. These were selected to cover the
entire range of the Library of Congress classification system (e.g. a
certain percentage from B (philosophy, psychology, religion), D (world
history), L (education), T (technology), etc.), both overall and by
number of words per year.
| YEAR | SPOKEN | FICTION | MAGAZINE | NEWSPAPER | ACADEMIC | TOTAL |
| 1990 | 4,332,983 | 4,176,786 | 4,061,059 | 4,072,572 | 3,943,968 | 20,587,368 |
| 1991 | 4,275,641 | 4,152,690 | 4,170,022 | 4,075,636 | 4,011,142 | 20,685,131 |
| 1992 | 4,493,738 | 3,862,984 | 4,359,784 | 4,060,218 | 3,988,593 | 20,765,317 |
| 1993 | 4,449,330 | 3,936,880 | 4,318,256 | 4,117,294 | 4,109,914 | 20,931,674 |
| 1994 | 4,416,223 | 4,128,691 | 4,360,184 | 4,116,061 | 4,008,481 | 21,029,640 |
| 1995 | 4,506,463 | 3,925,121 | 4,355,396 | 4,086,909 | 3,978,437 | 20,852,326 |
| 1996 | 4,060,792 | 3,938,742 | 4,348,339 | 4,062,397 | 4,070,075 | 20,480,345 |
| 1997 | 3,874,976 | 3,750,256 | 4,330,117 | 4,114,733 | 4,378,426 | 20,448,508 |
| 1998 | 4,424,874 | 3,754,334 | 4,353,187 | 4,096,829 | 4,070,949 | 20,700,173 |
| 1999 | 4,417,997 | 4,130,984 | 4,353,229 | 4,079,926 | 3,983,704 | 20,965,840 |
| 2000 | 4,414,772 | 3,925,331 | 4,353,049 | 4,034,817 | 4,053,691 | 20,781,660 |
| 2001 | 3,987,514 | 3,869,790 | 4,262,503 | 4,066,589 | 3,924,911 | 20,111,307 |
| 2002 | 4,329,856 | 3,745,852 | 4,279,955 | 4,085,554 | 4,014,495 | 20,455,712 |
| 2003 | 4,404,978 | 4,094,865 | 4,295,543 | 4,022,457 | 4,007,927 | 20,825,770 |
| 2004 | 4,330,018 | 4,076,462 | 4,300,735 | 4,084,584 | 3,974,453 | 20,766,252 |
| 2005 | 4,396,030 | 4,075,210 | 4,328,642 | 4,089,168 | 3,890,318 | 20,779,368 |
| 2006 | 4,304,513 | 4,081,287 | 4,279,043 | 4,085,757 | 4,028,620 | 20,779,220 |
| 2007 | 3,882,586 | 4,028,998 | 4,185,161 | 3,975,474 | 4,267,452 | 20,339,671 |
| 2008 | 3,635,622 | 4,155,298 | 4,205,477 | 4,031,769 | 4,015,545 | 20,043,711 |
| 2009 | 3,969,587 | 4,143,814 | 3,855,815 | 3,971,607 | 4,144,064 | 20,084,887 |
| 2010 | 4,095,393 | 3,929,160 | 3,806,011 | 4,258,633 | 3,816,420 | 19,905,617 |
| 2011 | 1,061,878 | 1,081,972 | 1,130,539 | 1,081,497 | 1,110,333 | 5,466,219* |
| TOTAL | 90,065,764 | 84,965,507 | 90,292,046 | 86,670,481 | 85,791,918 | 437,785,716 |
* The latest update for 2011 was in April
2011, and includes texts from Jan-Mar 2011.
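
The genre totals in the TOTAL row of the table can be checked against the overall total and converted into shares of the corpus. The following is a minimal sketch in Python; the names GENRE_TOTALS and genre_shares are illustrative only and are not part of the corpus or its web interface.

```python
# Genre totals copied from the TOTAL row of the table above.
GENRE_TOTALS = {
    "spoken": 90_065_764,
    "fiction": 84_965_507,
    "magazine": 90_292_046,
    "newspaper": 86_670_481,
    "academic": 85_791_918,
}


def genre_shares(totals):
    """Return each genre's fraction of the overall word count."""
    overall = sum(totals.values())
    return {genre: count / overall for genre, count in totals.items()}


# Sum of the five genre totals; should equal the table's overall total.
overall_total = sum(GENRE_TOTALS.values())
shares = genre_shares(GENRE_TOTALS)

print(f"overall: {overall_total:,}")
for genre, share in shares.items():
    # Print each genre's share as a percentage with one decimal place.
    print(f"{genre:<10} {share:6.1%}")
```

Each genre comes out at between roughly 19% and 21% of the total, which is what "evenly divided" means in practice here.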
Because of copyright and licensing
issues, the texts themselves are not available for download, under any
circumstances. All access to the texts is via this web interface.