Download complete list of all 25,094 texts, with metadata
The Movies Corpus is composed of 200 million words in
25,094 texts from the 1930s to the 2010s (the last texts are from 2018).
The following table shows the number of words by country and
decade. (Note that Misc means that the first country listed in IMDB was
not one of the six shown below, although in most cases one of these
countries is listed as an "additional country".)
| Decade | US / CA | UK / IE | AU / NZ | Misc | TOTAL |
|---|---:|---:|---:|---:|---:|
| 1930s | 6,013,722 | 445,980 | 2,245 | 104,255 | 6,566,202 |
| 1940s | 8,679,722 | 1,077,429 | --- | 51,151 | 9,808,302 |
| 1950s | 8,570,819 | 1,826,174 | 21,777 | 197,173 | 10,615,943 |
| 1960s | 5,851,067 | 2,687,175 | 6,594 | 557,976 | 9,102,812 |
| 1970s | 6,972,688 | 2,060,309 | 112,715 | 958,968 | 10,104,680 |
| 1980s | 10,739,129 | 2,153,349 | 308,640 | 917,461 | 14,118,579 |
| 1990s | 19,259,078 | 2,983,322 | 384,607 | 1,986,577 | 24,613,584 |
| 2000s | 38,572,824 | 6,970,252 | 793,610 | 4,893,749 | 51,230,435 |
| 2010s | 48,649,187 | 8,705,479 | 1,337,876 | 4,626,223 | 63,318,765 |
| TOTAL | 153,308,236 | 28,909,469 | 2,968,064 | 14,293,533 | 199,479,302 |
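If you want to work with these figures yourself, the following is a minimal Python sketch that copies the word counts from the table, checks that each decade and each column adds up to its stated total, and computes each decade's share of the corpus. The variable names are illustrative only, and the empty AU / NZ cell for the 1940s is treated as 0, which is an assumption.

```python
# Word counts copied from the table above, in the order:
# US/CA, UK/IE, AU/NZ, Misc, TOTAL.
# Assumption: the "---" cell (AU/NZ, 1940s) is counted as 0.
counts = {
    "1930s": (6_013_722, 445_980, 2_245, 104_255, 6_566_202),
    "1940s": (8_679_722, 1_077_429, 0, 51_151, 9_808_302),
    "1950s": (8_570_819, 1_826_174, 21_777, 197_173, 10_615_943),
    "1960s": (5_851_067, 2_687_175, 6_594, 557_976, 9_102_812),
    "1970s": (6_972_688, 2_060_309, 112_715, 958_968, 10_104_680),
    "1980s": (10_739_129, 2_153_349, 308_640, 917_461, 14_118_579),
    "1990s": (19_259_078, 2_983_322, 384_607, 1_986_577, 24_613_584),
    "2000s": (38_572_824, 6_970_252, 793_610, 4_893_749, 51_230_435),
    "2010s": (48_649_187, 8_705_479, 1_337_876, 4_626_223, 63_318_765),
}
totals_row = (153_308_236, 28_909_469, 2_968_064, 14_293_533, 199_479_302)

# Each decade's four country groups should add up to its TOTAL cell.
for decade, (*regions, total) in counts.items():
    assert sum(regions) == total, decade

# Each column, summed over the decades, should match the TOTAL row.
column_sums = tuple(sum(column) for column in zip(*counts.values()))
assert column_sums == totals_row

# Share of the whole corpus contributed by each decade.
shares = {decade: row[-1] / totals_row[-1] for decade, row in counts.items()}
for decade, share in shares.items():
    print(f"{decade}: {share:.1%}")
```

The checks pass without error for the figures shown, and the shares make it easy to see how heavily the corpus is weighted toward the 2000s and 2010s.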
The texts were taken from the
OpenSubtitles collection. In cases where there were multiple
subtitle files for a given movie (which was the norm), we used the
"highest ranked" file, in terms of accuracy (based on the ratings at
OpenSubtitles). We then matched each movie with the corresponding
page from IMDB, which
provides rich metadata for each movie (and which can be used to create
your own Virtual Corpus).
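The selection step described above can be sketched roughly as follows. This is only an illustration of the idea of keeping one highest-rated subtitle file per movie, not the actual code used to build the corpus. The `SubtitleFile` class, its fields, the `pick_highest_rated` function, and the sample records are all hypothetical, and the sketch assumes that each file carries a single numeric rating and that ties go to the file seen first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleFile:
    """Hypothetical record for one subtitle file (field names are assumptions)."""
    movie_id: str   # identifier shared by all files for the same movie
    file_id: str    # identifier of this particular subtitle file
    rating: float   # user rating of the file; higher is assumed to be better


def pick_highest_rated(files):
    """Return a dict mapping each movie_id to its highest-rated file.

    When two files for a movie have the same rating, the first one seen is kept.
    """
    best = {}
    for f in files:
        current = best.get(f.movie_id)
        if current is None or f.rating > current.rating:
            best[f.movie_id] = f
    return best


# Made-up sample data: two candidate files for one movie, one for another.
candidates = [
    SubtitleFile("movie-1", "a", 7.5),
    SubtitleFile("movie-1", "b", 9.0),
    SubtitleFile("movie-2", "c", 6.0),
]
chosen = pick_highest_rated(candidates)
for movie_id, f in chosen.items():
    print(movie_id, f.file_id, f.rating)
```

Once one file per movie has been chosen, the shared movie identifier is what would let each text be joined to its metadata record.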