Download complete list of all 25,094 texts, with metadata

The Movies Corpus contains about 200 million words in 25,094 texts from the 1930s to the 2010s (the most recent texts are from 2018). The following table shows the number of words by country and decade. (Note that Misc means that the first country listed in IMDB was not one of the six shown below, although in most cases one of these countries is listed as an "additional country".)

Decade      US / CA      UK / IE      AU / NZ         Misc        TOTAL
1930s     6,013,722      445,980        2,245      104,255    6,566,202
1940s     8,679,722    1,077,429          ---       51,151    9,808,302
1950s     8,570,819    1,826,174       21,777      197,173   10,615,943
1960s     5,851,067    2,687,175        6,594      557,976    9,102,812
1970s     6,972,688    2,060,309      112,715      958,968   10,104,680
1980s    10,739,129    2,153,349      308,640      917,461   14,118,579
1990s    19,259,078    2,983,322      384,607    1,986,577   24,613,584
2000s    38,572,824    6,970,252      793,610    4,893,749   51,230,435
2010s    48,649,187    8,705,479    1,337,876    4,626,223   63,318,765
TOTAL   153,308,236   28,909,469    2,968,064   14,293,533  199,479,302
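
The figures in the table above can be cross-checked and turned into proportions with a few lines of Python. This is only a minimal sketch: the numbers are copied by hand from the table, the "---" cell (1940s, AU / NZ) is assumed to mean zero words, and the names COUNTS, REGIONS, decade_totals, region_totals and decade_shares are hypothetical helpers, not part of any corpus download.

```python
# Word counts per decade, copied from the table above.
# Column order: US/CA, UK/IE, AU/NZ, Misc.
# Assumption: the "---" cell for 1940s AU/NZ is treated as 0.
REGIONS = ("US/CA", "UK/IE", "AU/NZ", "Misc")

COUNTS = {
    "1930s": (6_013_722, 445_980, 2_245, 104_255),
    "1940s": (8_679_722, 1_077_429, 0, 51_151),
    "1950s": (8_570_819, 1_826_174, 21_777, 197_173),
    "1960s": (5_851_067, 2_687_175, 6_594, 557_976),
    "1970s": (6_972_688, 2_060_309, 112_715, 958_968),
    "1980s": (10_739_129, 2_153_349, 308_640, 917_461),
    "1990s": (19_259_078, 2_983_322, 384_607, 1_986_577),
    "2000s": (38_572_824, 6_970_252, 793_610, 4_893_749),
    "2010s": (48_649_187, 8_705_479, 1_337_876, 4_626_223),
}


def decade_totals(counts):
    """Sum the four regional columns for each decade (the TOTAL column)."""
    return {decade: sum(row) for decade, row in counts.items()}


def region_totals(counts):
    """Sum each regional column over all decades (the TOTAL row)."""
    return {
        region: sum(row[i] for row in counts.values())
        for i, region in enumerate(REGIONS)
    }


def decade_shares(counts):
    """Return each decade's fraction of the whole corpus."""
    totals = decade_totals(counts)
    grand_total = sum(totals.values())
    return {decade: total / grand_total for decade, total in totals.items()}


if __name__ == "__main__":
    for decade, share in decade_shares(COUNTS).items():
        print(f"{decade}: {share:.1%}")
```

With these inputs the row and column sums match the TOTAL figures in the table, and the 2010s alone account for roughly 32% of the corpus.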

The texts were taken from the OpenSubtitles collection. Where there were multiple subtitle files for a given movie (which was the norm), we used the "highest ranked" file in terms of accuracy, based on the ratings at OpenSubtitles. We then matched each movie with its corresponding page on IMDB, which provides rich metadata for each movie (and which can be used to create your own Virtual Corpus).
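
The selection step described above (one subtitle file per movie, chosen by rating) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the procedure actually used to build the corpus: the SubtitleFile record and its fields (imdb_id, file_name, rating) are hypothetical, as is the choice to keep the first file seen when two ratings tie.

```python
from typing import Dict, Iterable, NamedTuple


class SubtitleFile(NamedTuple):
    """Hypothetical record for one candidate subtitle file."""
    imdb_id: str      # identifier of the movie the file belongs to
    file_name: str    # name of the subtitle file
    rating: float     # accuracy rating; higher is assumed to be better


def pick_highest_ranked(files: Iterable[SubtitleFile]) -> Dict[str, SubtitleFile]:
    """Keep, for each movie, the single file with the highest rating.

    Ties are resolved in favor of the file encountered first.
    """
    best: Dict[str, SubtitleFile] = {}
    for candidate in files:
        current = best.get(candidate.imdb_id)
        if current is None or candidate.rating > current.rating:
            best[candidate.imdb_id] = candidate
    return best


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        SubtitleFile("tt0000001", "a_v1.srt", 6.5),
        SubtitleFile("tt0000001", "a_v2.srt", 9.0),
        SubtitleFile("tt0000002", "b_v1.srt", 7.0),
    ]
    for imdb_id, chosen in pick_highest_ranked(sample).items():
        print(imdb_id, chosen.file_name)
```

Keying the result on a movie identifier also makes the later metadata match straightforward, since each retained file can be looked up by the same key.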