|
The corpora were created by Mark Davies, Professor of Linguistics at Brigham Young University in Provo, Utah, USA. In most cases (though see #2 below) this involved designing the corpora, collecting the texts, editing and annotating them, creating the corpus architecture, and designing and programming the web interfaces. Even though I use the term "we" on this and other pages, most activities related to the development of most of these corpora were actually done by just one person.
3. What's the history of these corpora? The first large online corpus was the Corpus del Español in 2002, followed by the BYU-BNC in 2004, the Corpus do Português in 2006, the TIME Corpus in 2007, and the Corpus of Contemporary American English (COCA) in 2008. Major improvements to the corpus architecture were made in early 2008. In March 2009 we received a large grant from the National Endowment for the Humanities to create a 300 million word Corpus of Historical American English (COHA). (More details on this history...)

4. What is the advantage of these corpora over other ones that are available? For some languages and time periods, these are really the only corpora available. For example, in spite of earlier corpora like the American National Corpus, our Corpus of Contemporary American English is the only large, balanced corpus of American English. The TIME corpus is the only large, annotated corpus of English from throughout the 1900s. The Corpus del Español and the Corpus do Português are the only large, annotated corpora of these two languages. Beyond the "textual" corpora, however, the corpus architecture and interface that we have developed allow for speed, size, annotation, and a range of queries that we believe are unmatched by other architectures. This makes the interface useful even for corpora such as the British National Corpus, which already has other interfaces. Also, they're free -- a nice feature.

5. What software is used to index, search, and retrieve data from these corpora? We have created our own corpus architecture, using Microsoft SQL Server as the backbone of the relational database approach. Our proprietary architecture allows for size, speed, and very good scalability. For example, the 400+ million word Corpus of Contemporary American English is nearly as fast as the 100 million word British National Corpus. Even complex queries of the more than 400 million word corpus typically take only one or two seconds.
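A relational corpus design of the kind described above might look roughly like the following. This is only an illustrative sketch, not the actual architecture: it uses Python's built-in `sqlite3` module as a stand-in for Microsoft SQL Server, and the table names (`lexicon`, `corpus`), column names, function names, and toy data are all hypothetical.

```python
import sqlite3

# Toy data (hypothetical): each sentence is a list of (word, part-of-speech) pairs.
TOY_SENTENCES = [
    [("the", "DET"), ("old", "ADJ"), ("house", "NOUN"), ("fell", "VERB")],
    [("an", "DET"), ("old", "ADJ"), ("house", "NOUN"), ("and", "CONJ"),
     ("a", "DET"), ("new", "ADJ"), ("car", "NOUN")],
]


def build_toy_corpus(sentences):
    """Load tagged sentences into an in-memory relational corpus."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- One row per distinct (word, part of speech) pair.
        CREATE TABLE lexicon (
            word_id INTEGER PRIMARY KEY,
            word    TEXT NOT NULL,
            pos     TEXT NOT NULL,
            UNIQUE (word, pos)
        );
        -- One row per running token, pointing into the lexicon.
        CREATE TABLE corpus (
            position INTEGER PRIMARY KEY,
            text_id  INTEGER NOT NULL,
            word_id  INTEGER NOT NULL REFERENCES lexicon (word_id)
        );
        CREATE INDEX corpus_word ON corpus (word_id);
    """)
    position = 0
    for text_id, sentence in enumerate(sentences, start=1):
        for word, pos in sentence:
            db.execute(
                "INSERT OR IGNORE INTO lexicon (word, pos) VALUES (?, ?)",
                (word, pos),
            )
            (word_id,) = db.execute(
                "SELECT word_id FROM lexicon WHERE word = ? AND pos = ?",
                (word, pos),
            ).fetchone()
            position += 1
            db.execute(
                "INSERT INTO corpus VALUES (?, ?, ?)",
                (position, text_id, word_id),
            )
    db.commit()
    return db


def pos_bigram_counts(db, first_pos, second_pos):
    """Count adjacent word pairs by part of speech, most frequent first."""
    rows = db.execute(
        """
        SELECT l1.word, l2.word, COUNT(*) AS freq
        FROM corpus c1
        JOIN corpus c2
          ON c2.position = c1.position + 1 AND c2.text_id = c1.text_id
        JOIN lexicon l1 ON l1.word_id = c1.word_id
        JOIN lexicon l2 ON l2.word_id = c2.word_id
        WHERE l1.pos = ? AND l2.pos = ?
        GROUP BY l1.word, l2.word
        ORDER BY freq DESC, l1.word, l2.word
        """,
        (first_pos, second_pos),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_toy_corpus(TOY_SENTENCES)
    for row in pos_bigram_counts(db, "ADJ", "NOUN"):
        print(row)
```

In this sketch, words and their annotation live in a separate `lexicon` table, so a query such as "every adjective followed by a noun" becomes a join on indexed integer keys rather than a scan of raw text.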
In addition, because of the relational database design, we can keep adding on more annotation "modules" with little or no performance hit. Finally, the relational database design allows for a range of queries that we believe is unmatched by any other architecture for large corpora.

6. How many people use the corpora? As measured by Google Analytics, as of April 2009 the corpora are used by more than 25,000 unique people each month. (In other words, if the same person uses three different corpora a total of ten times that month, it counts as just one of the 25,000 unique users.) The most widely used corpus is the Corpus of Contemporary American English -- with about 12,000 unique users each month -- and usage of this corpus is currently doubling about every 4-5 months. And people don't just come in, look for one word, and move on -- the average time at the site each visit is between 10 and 15 minutes.

7. What do they use the corpora for? For lots of things. Linguists use the corpora to analyze variation and change in the different languages. Some are materials developers, who use the data to create teaching materials. A large number of users are language teachers and learners, who use the corpus data to model native speaker performance and intuition. Translators use the corpora to get precise data on the target languages. Some businesses purchase data from the corpora to use in natural language processing projects. And lots of people are just curious about language, and (believe it or not) just use the corpora for fun, to see what's going on with the languages currently. Feel free to look at the list of user profiles (which was just introduced in Spring 2009, and which should grow over the next few months).

8. Are there any published materials that are based on these corpora? We continue to receive anecdotal reports of the corpora being used as the backbone for publications, conference papers, theses and dissertations, but we certainly need more systematic data on this.
We are aware of at least four textbooks of English published in the last year that used lots of data from our interface to the BNC. And we ourselves have published (or will publish) three frequency dictionaries that are based on data from the corpora -- Spanish (2005), Portuguese (2007), and American English (late 2009).

9. How can I collaborate with other users? Before May 2009, corpus users pretty much used the corpora by themselves, or as part of a class. However, due to the large number of users, we think that now is a good time to create a "community" of these users. There are a number of ways that you can now collaborate with others, and we'd be interested in other ideas that you have as well.

This is a huge issue. Our corpora contain hundreds of millions of words of copyrighted material. Their use is legal (under US Fair Use Law) only because of the limited "Keyword in Context" (KWIC) displays. It's kind of like the "snippet defense" used by Google. They retrieve and index billions of words of copyrighted material, but they only allow end users to access "snippets" of this data from their servers. Click here for an extended discussion of US Fair Use Law and how it applies to our COCA texts.

11. Can I get access to the full text of these corpora? Unfortunately, no, for reasons of copyright discussed above. We would love to allow end users to have access to full text, but we simply cannot. Even when "no one else will ever use it" and even when "it's only one article or one page" of text, we can't. We have to be 100% compliant with US Fair Use Law, and that means no full text for anyone under any circumstances -- ever. Sorry about that.

12. I want more data than what's available via the standard interface. What can I do? Users can purchase derived data -- such as frequency lists, n-gram lists (e.g. all two- or three-word strings), or even blocks of sentences from the corpus.
Basically anything, as long as it does not involve full-text access (e.g. paragraphs or pages of text), which would violate copyright restrictions. Click here for much more detailed information on this data, as well as downloadable samples.

13. Can my class have additional access to a corpus on a given day? Yes. Sometimes your school will be blocked after an hour or so of heavy use from a classroom full of students. (This is a security mechanism, to prevent "bots" from running thousands of queries in a short time.) To avoid this, sign up ahead of time for "group access".

14. Can you create a corpus for us, based on our own materials? Well, I probably could, but I'm inclined not to at this point. Creating and maintaining corpora is extremely time-intensive, even when you give me the data "all ready" to import into the database. The one exception, I guess, would be if you get a large grant to create and maintain the corpus. Feel free to contact me with questions.

15. Can you come do a workshop at our university? The short answer is yes. I use these corpora extensively in the classes that I teach at BYU, and I have developed many activities and projects around the corpora. I have visited other universities to teach workshops on using the corpora, and I enjoy this interaction with students and faculty at other schools. The problem is that between academic conferences and workshops at different universities throughout the world, I'm already a bit over-extended for 2009 and even into 2010. But feel free to contact me if you'd be interested in having me come do a series of workshops at your university.

16. How do I cite the corpora in my published articles? Please see the [MORE INFORMATION / CITING THE CORPUS] link on each of the individual corpus websites. And then please remember to add your publication to the list on this website as well. Thanks!
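
For readers unfamiliar with the "Keyword in Context" (KWIC) displays mentioned in the copyright discussion above and the n-gram lists mentioned in #12, here is a minimal sketch of what each one computes. It is an illustration only, not the code behind the corpus interface; the function names `kwic` and `ngram_counts`, the `width` parameter, and the sample sentence are all made up for this example.

```python
from collections import Counter


def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for every match.

    Only `width` tokens on each side are kept, so the display never
    exposes more than a short snippet of the underlying text.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens (n=2 gives two-word strings)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept".split()
    for left, word, right in kwic(sample, "cat", width=2):
        print(f"{left:>10} [{word}] {right}")
    print(ngram_counts(sample, 2).most_common(3))
```

Both outputs are derived data: a KWIC line shows only a few words around each hit, and an n-gram list records short word strings and their counts, not the running text they came from.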
|