We will soon have several links here that will allow users to
help annotate the corpora via crowdsourcing (where lots of people do what one
person could not do alone).
Right now there is just one project open. We are
creating a 220 million word corpus of TV subtitles, and we need your help
in correcting some of the annotation for the corpus. You will be presented with
a word from the corpus and 20 sample lines containing that word, and you just
need to 1) indicate whether the word is a "typo" and 2) indicate what the part
of speech is (noun, verb, adjective, or adverb). If you'd like to help with this
(even just 5-10 words at a time, and each word takes just a few seconds), please
log in, and then go to
https://www.english-corpora.org/crowdsourcing/tv.asp.
Just a bit more about this new corpus and why we're
creating it:
Researchers have found that TV and movie subtitles are extremely good at
mirroring native speaker intuitions about language. For example, response times
in lexical decision tasks (LDT) correlate best with word frequencies drawn from
subtitles. To this point, though, virtually the only data from these subtitles
is found in word frequency lists
-- there is no way to actually search these subtitle texts (like you can with
the other BYU corpora) to look for
grammatical constructions, collocates, etc.
To make this kind of language searchable, we've collected 220 million words of
extremely informal English from about 58,000 different TV episodes. (See a
2.2 million word sample: every 100th text from the 58,000 texts.)
The problem is that some of the texts are kind of a mess,
and they need to be cleaned up a bit. That's where you come in. With several
hundred people helping, in just a few months we can have a nice corpus that will
be much more informal than the spoken sections of the BNC or COCA, or even the
informal SOAP corpus. The
language from these subtitles will pretty much be the most informal language
found in any large corpus, and you will have helped to create it!
Again:
https://www.english-corpora.org/crowdsourcing/tv.asp. Thanks for your help. (And by
the way, if you're able to do at least 500 words, we'll include your name in the
"acknowledgments" for the corpus, once it's finished in Summer 2015.)