|
DOWNLOAD LIST OF ALL 189,431 TEXTS AND
SUMMARY BY YEAR, GENRE, AND SUB-GENRE
The corpus is composed of more than 450 million words in 189,431 texts, including about 20 million words for each year from 1990 to 2011 and about 11 million words for January-June 2012.
The most recent addition of texts (Apr 2011 - Jun 2012) was completed in June 2012.
For each year (and therefore overall as well), the corpus is divided almost evenly among the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:
- Spoken: (95 million words [95,385,672]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See notes on the naturalness and authenticity of the language from these transcripts.)
- Fiction: (90 million words [90,344,134]) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first-edition books from 1990 to the present, and movie scripts.
- Popular Magazines: (95 million words [95,564,706]) Nearly 100 different magazines, with a good mix (overall, and by year) across specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.
- Newspapers: (92 million words [91,680,966]) Ten newspapers from across the US, including USA Today, New York Times, Atlanta Journal Constitution, and San Francisco Chronicle. In most cases, there is a good mix among different sections of the newspaper, such as local news, opinion, sports, and financial.
- Academic Journals: (91 million words [91,044,778]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year.
| YEAR | SPOKEN | FICTION | MAGAZINE | NEWSPAPER | ACADEMIC | TOTAL |
| 1990 | 4,332,983 | 4,176,786 | 4,061,059 | 4,072,572 | 3,943,968 | 20,587,368 |
| 1991 | 4,275,641 | 4,152,690 | 4,170,022 | 4,075,636 | 4,011,142 | 20,685,131 |
| 1992 | 4,493,738 | 3,862,984 | 4,359,784 | 4,060,218 | 3,988,593 | 20,765,317 |
| 1993 | 4,449,330 | 3,936,880 | 4,318,256 | 4,117,294 | 4,109,914 | 20,931,674 |
| 1994 | 4,416,223 | 4,128,691 | 4,360,184 | 4,116,061 | 4,008,481 | 21,029,640 |
| 1995 | 4,506,463 | 3,925,121 | 4,355,396 | 4,086,909 | 3,978,437 | 20,852,326 |
| 1996 | 4,060,792 | 3,938,742 | 4,348,339 | 4,062,397 | 4,070,075 | 20,480,345 |
| 1997 | 3,874,976 | 3,750,256 | 4,330,117 | 4,114,733 | 4,378,426 | 20,448,508 |
| 1998 | 4,424,874 | 3,754,334 | 4,353,187 | 4,096,829 | 4,070,949 | 20,700,173 |
| 1999 | 4,417,997 | 4,130,984 | 4,353,229 | 4,079,926 | 3,983,704 | 20,965,840 |
| 2000 | 4,414,772 | 3,925,331 | 4,353,049 | 4,034,817 | 4,053,691 | 20,781,660 |
| 2001 | 3,987,514 | 3,869,790 | 4,262,503 | 4,066,589 | 3,924,911 | 20,111,307 |
| 2002 | 4,329,856 | 3,745,852 | 4,279,955 | 4,085,554 | 4,014,495 | 20,455,712 |
| 2003 | 4,404,978 | 4,094,865 | 4,295,543 | 4,022,457 | 4,007,927 | 20,825,770 |
| 2004 | 4,330,018 | 4,076,462 | 4,300,735 | 4,084,584 | 3,974,453 | 20,766,252 |
| 2005 | 4,396,030 | 4,075,210 | 4,328,642 | 4,089,168 | 3,890,318 | 20,779,368 |
| 2006 | 4,304,513 | 4,081,287 | 4,279,043 | 4,085,757 | 4,028,620 | 20,779,220 |
| 2007 | 3,882,586 | 4,028,998 | 4,185,161 | 3,975,474 | 4,267,452 | 20,339,671 |
| 2008 | 3,635,622 | 4,155,298 | 4,205,477 | 4,031,769 | 4,015,545 | 20,043,711 |
| 2009 | 3,969,587 | 4,143,814 | 3,855,815 | 3,971,607 | 4,144,064 | 20,084,887 |
| 2010 | 4,095,393 | 3,929,160 | 3,806,011 | 4,258,633 | 3,816,420 | 19,905,617 |
| 2011 | 4,033,627 | 4,166,029 | 4,199,378 | 3,982,299 | 4,064,535 | 20,445,868 |
| 2012 | 2,348,159 | 2,294,570 | 2,203,821 | 2,109,683 | 2,298,658 | 11,254,891* |
| TOTAL | 95,385,672 | 90,344,134 | 95,564,706 | 91,680,966 | 91,044,778 | 464,020,256 |
* The latest update for 2012 was in Summer 2012, and includes texts from Jan-Jun 2012.
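As a quick check on how evenly the corpus is balanced, the genre totals from the TOTAL row of the table above can be turned into percentage shares. The following is only an illustrative sketch: the names `GENRE_TOTALS` and `genre_shares` are made up for this example and are not part of the corpus interface.

```python
# Genre totals copied from the TOTAL row of the table above.
# GENRE_TOTALS and genre_shares are hypothetical names used for illustration.
GENRE_TOTALS = {
    "spoken": 95_385_672,
    "fiction": 90_344_134,
    "magazine": 95_564_706,
    "newspaper": 91_680_966,
    "academic": 91_044_778,
}


def genre_shares(totals):
    """Return each genre's share of the whole corpus as a percentage."""
    corpus_size = sum(totals.values())
    return {genre: 100 * words / corpus_size for genre, words in totals.items()}


shares = genre_shares(GENRE_TOTALS)
for genre, pct in shares.items():
    print(f"{genre:<10} {pct:5.2f}%")
```

With these totals, every genre falls within about one percentage point of an even 20% share, so the balance is close but not exact.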
Because of copyright and licensing issues, the texts themselves are not available for download under any circumstances. All access to the texts is via this web interface.
|