Most corpora (large collections of searchable text) aim to capture the informal, more "spoken" variety of a language, as opposed to (or at least in addition to) more formal fiction, newspapers, magazines, or academic writing. This is hard to do, however: creating a large corpus of spoken language is very time-consuming and expensive, because the texts have to be recorded, transcribed, and then annotated.

As a result, spoken corpora tend to be quite small. For English, for example, the MICASE, CALLHOME and CALLFRIEND corpora are all between about 1 and 2 million words. Even the largest corpus of "everyday, conversational" English, in the British National Corpus, is only 5 million words. In addition to being small, most of these corpora are now a bit dated as well -- they are from 15-25 years ago, or almost a full generation.

The Corpus of Contemporary American English (COCA) is much larger and more recent than these other corpora. COCA contains 95 million words of spoken English -- about 4 million words each year from 1990 to the present (2012). These transcripts are of unscripted conversation on TV and radio programs like Good Morning America, the Today Show, All Things Considered, and Oprah. Unfortunately, the conversations often deal not with "everyday" topics but with politics, the economy, science, business, entertainment personalities, and other current events.

Some researchers have hit upon an interesting approach. In projects like SUBTLEXus, rather than using transcriptions of actual recorded speech, they use subtitles from movies and TV, on the theory that the dialogue in most TV shows and movies represents the spoken language pretty well. As nice as this data is, however, it only gives the frequency of individual words -- you can't search for phrases, grammatical constructions, collocates (nearby words), etc.

The new SOAP corpus -- based on American soap operas -- is designed to provide the best of these different approaches. In terms of size, the corpus is very large -- 100 million words. It is also very recent -- all of the texts are from 2001 to 2012. And as with SUBTLEX, we would suggest that subtitles from informal TV shows and movies do represent informal, everyday language quite well -- especially those from soap operas. The topics of the episodes deal almost exclusively with "everyday life" -- love, hate, jobs, kids, crime, desire and passion. Finally, since SOAP is a full-blown corpus, you can go beyond simple word frequencies to look at grammar, collocates, and so on.