English-Corpora.org



COCA: TV and Movie subtitles (informal language)

Some researchers have taken an interesting approach: using easily available texts that do a very good job of "modeling" spoken language. In projects like SUBTLEXus, rather than using transcriptions of actual recorded speech, they use data from subtitles of movies and TV, on the theory that the dialogue in most TV shows and movies represents the spoken language really well. (For examples of this research, see A, B, and C.) As this research shows, the data from subtitles agrees with native speaker intuitions about their language even better than the data from actual everyday conversation (like that in the BNC).

The TV and Movies data (128 million words) in COCA is taken from the 1990s-2010s American portion of the TV (325 million words) and Movies (200 million words) corpora. This data is based on texts that are very similar to those in SUBTLEXus.

One might be suspicious of dialogue from TV shows and movies. After all, it is written by a scriptwriter. How well does it really represent authentic, "spoken" language? Let's take a look at this in some detail. In each case, we'll compare the TV and Movies data with the spoken portion of the BNC. We'll see that in most cases, the language in the TV shows and movies is actually much more informal than the BNC.


Phrases: TV/Movies vs BNC-Spoken
(For more details and even more carefully constructed data, see Davies 2021)

The following table shows the raw frequency (columns 3-5) and the frequency per million (PM) words (columns 6-8). The two rightmost columns show how much more frequent the phrase is in the TV and Movies corpora than in BNC-Spoken (e.g. 12.4 = more than ten times as frequent, per million words). You can click on any of the entries to see the actual examples from the three corpora. For the BNC, look at the SPOKEN column of the chart. For the movies, look at the ALL column at the left.

Note: click on any link on this page to see the corpus data, and then click on the "BACK" image (see left) at the top of the page to come back to this page.
Query            Example                                                    TV       Movies  BNC-Spok  TV-PM  Movies-PM  BNC-S-PM  TV/BNC  Movies/BNC
. you VERB me ?  . You heard me? (=subject ellipsis)                        3,491    2,946   0         10.7   14.7       0.0       107.4   147.3
, ok|okay ?      we're leaving now, OK?                                     100,866  59,288  344       310.4  296.4      34.8      8.9     8.5
, right ?        you're pretty tired, right?                                111,195  59,081  274       342.1  295.4      27.7      12.4    10.7
I told you       I told you to get out of here                              45,899   31,302  385       141.2  156.5      38.7      3.6     4.0
DO n't get it    I don't get it -- why do you hate me so much?              9,188    4,847   89        28.3   24.2       9.0       3.1     2.7
how can you      How can you even say that?                                 10,155   7,331   193       31.2   36.7       19.5      1.6     1.9
my God           My God -- she's horrible!                                  102,515  57,812  572       315.4  289.1      20.0      15.8    14.5
. it 's ADJ .    . It's sad. She's totally forgotten him. (=short phrases)  56,198   36,161  126       172.9  180.8      34.3      5.0     5.3

Situational (shows that the movie scripts are very oriented to the "here and now")
hand me * NOUN   Hand me a towel.                                           1,641    1,107   2         5.0    5.5        0.2       25.2    27.7
. Get out        . Get out before I call the police!                        11,263   10,374  23        34.7   51.9       2.7       12.8    19.2
do n't leave     Don't leave! I need you!                                   4,890    4,667   39        15.0   23.3       0.7       21.5    33.3
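
The per-million and ratio columns in the table can be reproduced with a few lines of Python. This is only a minimal sketch under stated assumptions, not the method actually used to build the table. The TV and Movies corpus sizes come from the text above. The BNC-Spoken size (about 9.9 million words) and the 0.1 floor used when a phrase does not occur in the BNC are assumptions inferred from the table values. The function names are hypothetical.

```python
# Corpus sizes in millions of words.
# TV and Movies sizes are stated in the text; the BNC-Spoken size is an
# ASSUMPTION (about 9.9 million words) inferred from the table values.
CORPUS_SIZE_MILLIONS = {"tv": 325.0, "movies": 200.0, "bnc_spoken": 9.9}


def per_million(raw_count, corpus):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_count / CORPUS_SIZE_MILLIONS[corpus]


def frequency_ratio(pm_a, pm_b, floor=0.1):
    """How many times more frequent A is than B, per million words.

    The floor is an ASSUMPTION that avoids division by zero when a phrase
    does not occur in corpus B at all.
    """
    return pm_a / max(pm_b, floor)


# Example: the ", right ?" row (111,195 tokens in TV, 274 in BNC-Spoken).
tv_pm = per_million(111_195, "tv")
bnc_pm = per_million(274, "bnc_spoken")
print(round(tv_pm, 1), round(bnc_pm, 1), round(frequency_ratio(tv_pm, bnc_pm), 1))
# prints: 342.1 27.7 12.4
```

Normalizing to a per-million rate before dividing is what makes corpora of very different sizes (325 million words vs. roughly 10 million) comparable.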
 

Syntax: TV/Movies vs BNC-Spoken
(For more details and even more carefully constructed data, see Davies 2021)

In many cases, the data from the TV shows and Movies is more informal than the Spoken portion of the BNC (British National Corpus) in terms of syntax (grammar).

Features #1-3 below are informal features of English syntax (click on the BNC link to see evidence of this). In each case, these informal features are more common in the TV and Movies section of COCA than in the Spoken section of the BNC. Feature #4 (the BE passive) works the other way: it is least common in informal, spoken English (see the BNC and COCA links), and it is even less common in the TV and Movies section of COCA.

The charts show the normalized frequency (per million words) in the BNC (left bar) and in the last three decades of the TV and Movies corpora (with the TV chart on the left and the Movies chart on the right). On these features, the TV and Movies data is also much more informal than COCA Spoken, which is not shown in the charts below but can be seen from the COCA link.

 1  Progressive (BE _vvg): you're making a huge mess
Links:  BNC  COCA  TV  Movies
Higher = more informal

 2  get passive (GET _vvn): he got fired from his job
Links:  BNC  COCA  TV  Movies
Higher = more informal

 3  1st and 2nd person pronouns (I/you): you're my best friend
Links:  BNC  COCA  TV  Movies
Higher = more informal

 4  BE passive (BE _vvn): they were colonized in the 1880s
Links:  BNC  COCA  TV  Movies
Lower = more informal