For some languages and time periods, these are really the only corpora available. For example, in spite of earlier corpora like the American National Corpus and the Bank of English, our Corpus of Contemporary American English is the only large, balanced corpus of contemporary American English. In spite of the Brown family of corpora and the ARCHER corpus, the Corpus of Historical American English is the only large and balanced corpus of historical American English. And the Corpus del Español and the Corpus do Português are the only large, annotated corpora of these two languages. Beyond the "textual" corpora, however, the corpus architecture and interface that we have developed allow for speed, size, annotation, and a range of queries that we believe are unmatched by other architectures, and which make them useful for corpora such as the British National Corpus, which does have other interfaces. Also, they're free -- a nice feature.
We have created our own corpus architecture, using Microsoft SQL Server as the backbone of the relational database approach. Our proprietary architecture allows for size, speed, and very good scalability that we don't believe are available with any other architecture. Even complex queries of the more than 450 million word COCA corpus or the 400 million word COHA corpus typically only take two or three seconds. In addition, because of the relational database design, we can keep adding on more annotation "modules" with little or no performance hit. Finally, the relational database design allows for a range of queries that we believe is unmatched by any other architecture for large corpora.
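To make the relational approach a little more concrete, here is a minimal sketch of how a corpus might be stored as one row per token, with text-level metadata joined in at query time. This is not our actual schema: the table names, column names, and the `frequency_by_genre` helper are all hypothetical, and Python's built-in `sqlite3` stands in for Microsoft SQL Server so that the example runs anywhere.

```python
import sqlite3

# Hypothetical schema: one row per token, plus a table of text-level metadata.
# sqlite3 is only a stand-in here; the real corpora run on Microsoft SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE texts (
        text_id INTEGER PRIMARY KEY,
        genre   TEXT,
        year    INTEGER
    );
    CREATE TABLE tokens (
        token_id INTEGER PRIMARY KEY,
        text_id  INTEGER REFERENCES texts(text_id),
        word     TEXT,
        pos      TEXT
    );
    CREATE INDEX idx_tokens_word ON tokens(word);
""")

# Toy data, invented for illustration only.
conn.executemany(
    "INSERT INTO texts VALUES (?, ?, ?)",
    [(1, "spoken", 2005), (2, "academic", 2010)],
)
conn.executemany(
    "INSERT INTO tokens (text_id, word, pos) VALUES (?, ?, ?)",
    [(1, "run", "VERB"), (1, "run", "NOUN"), (2, "run", "VERB"), (2, "data", "NOUN")],
)

def frequency_by_genre(conn, word, pos):
    """Count occurrences of a word with a given part of speech, per genre."""
    rows = conn.execute(
        """
        SELECT t.genre, COUNT(*)
        FROM tokens AS k
        JOIN texts AS t ON t.text_id = k.text_id
        WHERE k.word = ? AND k.pos = ?
        GROUP BY t.genre
        ORDER BY t.genre
        """,
        (word, pos),
    ).fetchall()
    return dict(rows)

print(frequency_by_genre(conn, "run", "VERB"))
```

In a design along these lines, a new annotation "module" would be one more table keyed on the token or text ID, which is joined in only by the queries that need it.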
As measured by Google Analytics, as of October 2014 the corpora are used by more than 170,000 unique people each month. (In other words, if the same person uses three different corpora a total of ten times that month, it counts as just one of the 170,000 unique users.) The most widely used corpus is the Corpus of Contemporary American English -- with more than 55,000 unique users each month. And people don't just come in, look for one word, and move on -- the average time at the site each visit is between 10 and 15 minutes.
For lots of things. Linguists use the corpora to analyze variation and change in the different languages. Some are materials developers, who use the data to create teaching materials. A high number of users are language teachers and learners, who use the corpus data to model native speaker performance and intuition. Translators use the corpora to get precise data on the target languages. Other people in the humanities and social sciences look at changes in culture and society. Some businesses purchase data from the corpora to use in natural language processing projects. And lots of people are just curious about language, and (believe it or not) just use the corpora for fun, to see what's going on with the languages currently.
Our corpora contain hundreds of millions of words of copyrighted material. Their use is legal (under US Fair Use Law) only because of the limited "Keyword in Context" (KWIC) displays. It's kind of like the "snippet defense" used by Google. They retrieve and index billions of words of copyrighted material, but they only allow end users to access "snippets" of this data from their servers. Click here for an extended discussion of US Fair Use Law and how it applies to our COCA texts.
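For readers who haven't seen a KWIC display before, the idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code behind our interface; the `kwic` function, its `width` parameter, and the bracket formatting are all assumptions made for the example.

```python
def kwic(text, keyword, width=4):
    """Return one line per match: up to `width` words on each side of the keyword."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring punctuation attached to the token.
        if tok.strip('.,;:!?"\'()').lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "The cat sat on the mat. The dog saw the cat again."
for line in kwic(sample, "cat", width=2):
    print(line)
```

Each line shows only a small window around the match, so the end user never sees more than a snippet of the underlying text.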
Full-text data for COCA and GloWbE is now available (COCA = 440 million words, 190,000 texts / GloWbE = 1.8 billion words, 1.8 million texts). There is currently no full-text access for the other corpora.
No, there isn't. There are two main reasons for this. First, we don't hold the copyright to the texts in the corpora, and so we can only provide limited access to them, via the corpus interface. Second, we're already pretty "maxed out" in terms of the two corpus servers, and API access would probably lead to quite a few more queries, which we can't handle right now. Although we don't allow API access, some people have programmed browsers (via VB.NET for IE, or Perl for Firefox) to allow for semi-automated queries (note, though, that we don't provide tech support for this).
"Non-researchers" (Level 1) have 100 queries a day, or about 3,000 queries per month. For most people, this is way more than enough. But if you are in fact a graduate student in languages or linguistics, there isn't a web page with your name on it, and you really do need more than 3,000 queries per month, then click here.
Users can purchase offline data -- such as full-text copies of the texts, frequency lists, collocate lists, and n-gram lists (e.g. all two- or three-word strings). Click here for much more detailed information on this data, as well as downloadable samples.
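As a rough illustration of what an n-gram list contains, the sketch below counts every two-word string in a tiny token sequence. It is not how the downloadable lists were produced; the `ngram_counts` function and the sample sentence are made up for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the cat".split()
bigrams = ngram_counts(tokens, 2)
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

Passing `n=3` gives the three-word strings in the same way.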
Yes. Sometimes your school will be blocked after an hour or so of heavy use from a classroom full of students. (This is a security mechanism, to prevent "bots" from running thousands of queries in a short time.) To avoid this, sign up ahead of time for "group access".
Please use the following information when you cite the corpus in academic publications or conference papers. Thanks.
In the first reference to the corpus in your paper, please use the full name. For example, for COCA: "the Corpus of Contemporary American English" with the appropriate citation to the references section of the paper, e.g. (Davies 2008-). After that reference, feel free to use something shorter, like "COCA" (for example: "...and as seen in COCA, there are..."). Also, please do not refer to the corpus in the body of your paper as "Davies' COCA corpus", "a corpus created by Mark Davies", etc. The bibliographic entry itself is enough to indicate who created the corpus. Otherwise, it just kind of sounds strange, and overly proprietary.