English-Corpora.org

We will soon have several links here that will allow users to help annotate the corpora via crowdsourcing (where lots of people do what one person could not do alone).

Right now there is just one project open. We are creating a 220-million-word corpus of TV subtitles, and we need your help correcting some of the annotation. You will be shown a word from the corpus along with 20 sample lines containing that word, and you just need to 1) indicate whether the word is a "typo" and 2) indicate its part of speech (noun, verb, adjective, or adverb). Each word takes just a few seconds, so even 5-10 words at a time helps. If you'd like to help, please log in and then go to https://www.english-corpora.org/crowdsourcing/tv.asp.

Just a bit more about this new corpus and why we're creating it:

Researchers have found that TV and movie subtitles are extremely good at mirroring native speaker intuitions about language. For example, response times in lexical decision tasks (LDT) correlate best with word frequencies taken from subtitles. To this point, though, virtually the only data available from these subtitles is in the form of word frequency lists -- there is no way to actually search the subtitle texts (like you can with the other BYU corpora) to look for grammatical constructions, collocates, etc.

To get at this really informal English, we've collected 220 million words from about 58,000 different TV episodes. (See a 2.2 million word sample -- every 100th text from the 58,000 texts.)
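
For anyone curious how a sample like this can be drawn, here is a minimal sketch in Python. It is only an illustration, not the script actually used for the corpus: the function name `every_nth` and the `episode_...` identifiers are made up, and we assume "every 100th text" means the 100th, 200th, 300th text and so on. Taking one text in 100 from 58,000 texts gives 580 texts, which fits with the sample being about 1% of the 220 million words.

```python
def every_nth(texts, n=100):
    """Return every n-th item of a list: the n-th, 2n-th, 3n-th, ...

    Counting is 1-based here, which is an assumption; starting from the
    first text instead (texts[::n]) would work equally well for sampling.
    """
    return texts[n - 1::n]


# Hypothetical stand-ins for the roughly 58,000 episode texts.
text_ids = [f"episode_{i:05d}" for i in range(1, 58_001)]

sample = every_nth(text_ids, 100)
print(len(sample))   # number of texts in the sample
print(sample[:3])    # first few sampled identifiers
```

A systematic sample like this is easy to reproduce, and it spreads the sampled texts evenly across the whole collection rather than clustering them at one end.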

The problem is that some of the texts are kind of a mess, and they need to be cleaned up a bit. That's where you come in. With several hundred people helping, in just a few months we can have a nice corpus that will be much more informal than the spoken portions of the BNC or COCA, or even the informal SOAP corpus. The language from these subtitles will pretty much be the most informal language found in any large corpus, and you will have helped to create it!

Again: https://www.english-corpora.org/crowdsourcing/tv.asp. Thanks for your help. (And by the way, if you're able to do at least 500 words, we'll include your name in the "acknowledgments" for the corpus, once it's finished in Summer 2015.)