corpus.byu.edu: FAQ / questions

  1. Who created these corpora?

  2. Who else contributed?

  3. What is the advantage of these corpora over other ones that are available?

  4. What software is used to index, search, and retrieve data from these corpora?

  5. How many people use the corpora?

  6. What do they use them for?

  7. What about copyright?

  8. Can I get access to the full text of these corpora?

  9. Is there API access to the corpora?

  10. My access limits (for "non-researchers") are too low. Can I increase them?

  11. I want more data than what's available via the standard interface. What can I do?

  12. Can my class have additional access to a corpus on a given day?

  13. Why the new push for contributions?

  14. Can our company, university, or organization "sponsor" the corpora?

  15. I don't want to see the messages that appear every 10 searches or so, as I use the corpora.

  16. How do I cite the corpora in my published articles?

  17. Things aren't working (I'm getting error messages). Can you help?


1. Who created these corpora?

The underlying corpus architecture and web interface were created by Mark Davies, Professor of Linguistics at Brigham Young University in Provo, Utah, USA. In most cases, he also designed, collected, edited, and annotated the corpora. In the case of the BYU-BNC, Strathy, and Hansard corpora, he received the texts from others and "just" created the architecture and interface. So although the terms "we" and "us" appear on this and other pages, most activities related to the development of most of these corpora were actually carried out by just one person.

2. Who else contributed?

All corpora

Brigham Young University (especially the College of Humanities and the Department of Linguistics and English Language) has provided generous support to buy hardware and software.

All corpora

Microsoft generously provided the 64-bit Enterprise version of SQL Server that is the backbone for the architecture.

Multiple corpora

The Corpus del Español, the Corpus do Português, and the new Corpus of Historical American English were funded by large grants from the National Endowment for the Humanities.

Multiple corpora

Paul Rayson provided the CLAWS tagger, which was used for all of the English corpora.

COCA

Some BYU students helped to scan a few of the novels.

COHA

Several BYU students helped to scan novels, magazines, and non-fiction books, and to process and correct the files and lexicon.

GloWbE

Michael Bean helped install the VM to run JusText, which removed the boilerplate material from the web pages.

   
Google Books

Based on the datasets from Google Books.

Corpus do Português

This was a joint project with Michael Ferreira of Georgetown University, who helped select, acquire, edit, and annotate the older texts (1300s-1700s), and who provided the translations of the web interface, among other activities.

BYU-BNC

The original texts were licensed for re-use from Oxford University Press.

Strathy

The textual corpus was designed and created at the Strathy Language Unit at Queen's University in Canada.

Hansard

The vast majority of the work on this corpus (including the semantic tagging) was done by other participants in the SAMUELS project; we simply created the corpus architecture and interface.

3. What is the advantage of these corpora over other ones that are available?

For some languages and time periods, these are really the only corpora available. For example, in spite of earlier corpora like the American National Corpus and the Bank of English, our Corpus of Contemporary American English is the only large, balanced corpus of contemporary American English. In spite of the Brown family of corpora and the ARCHER corpus, the Corpus of Historical American English is the only large, balanced corpus of historical American English. And the Corpus del Español and the Corpus do Português are the only large, carefully annotated corpora of these two languages. Beyond the "textual" corpora, however, the corpus architecture and interface that we have developed allow for speed, size, annotation, and a range of queries that we believe is unmatched by other architectures, and which makes them useful even for corpora such as the British National Corpus, which already has other interfaces. Also, they're free -- a nice feature.

4. What software is used to index, search, and retrieve data from these corpora?

We have created our own corpus architecture, using Microsoft SQL Server as the backbone of a relational database approach. This architecture provides a combination of size, speed, and scalability that we don't believe is available with any other architecture. Even complex queries of the more than 450 million word COCA corpus or the 400 million word COHA corpus typically take only two or three seconds. In addition, because of the relational database design, we can keep adding annotation "modules" with little or no performance hit. Finally, the relational design allows for a range of queries that we believe is unmatched by any other architecture for large corpora.
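
To make the relational approach a bit more concrete, here is a minimal, hypothetical sketch. The actual BYU schema is not public, and the table and column names below (lexicon, corpus, wordID, etc.) are purely illustrative assumptions. The general idea is that a lexicon table holds each distinct word form with its lemma and part-of-speech tag, while a corpus table stores the running text as a sequence of integer word IDs; annotation is then joined on at query time, and a new annotation "module" is just another table keyed on the same IDs, which is why it adds little or no performance cost. The sketch uses Python's built-in sqlite3 module only for illustration (the production system uses SQL Server):

    # Illustrative only: schema and names are assumptions, not the real BYU design.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- One row per distinct word form, with its annotation.
        CREATE TABLE lexicon (
            wordID INTEGER PRIMARY KEY,
            word   TEXT,   -- surface form, e.g. 'running'
            lemma  TEXT,   -- e.g. 'run'
            pos    TEXT    -- e.g. a CLAWS-style verb tag
        );
        -- The running text, stored as a sequence of integer word IDs.
        CREATE TABLE corpus (
            textID   INTEGER,  -- which source text the token comes from
            position INTEGER,  -- token offset within the corpus
            wordID   INTEGER REFERENCES lexicon(wordID)
        );
    """)

    # A query like "all forms of the lemma RUN used as a verb" becomes a simple
    # join plus GROUP BY, which stays fast even over very large token tables
    # (no data is loaded in this sketch, so the result here is simply empty).
    rows = db.execute("""
        SELECT l.word, COUNT(*) AS freq
        FROM corpus c JOIN lexicon l ON c.wordID = l.wordID
        WHERE l.lemma = 'run' AND l.pos LIKE 'v%'
        GROUP BY l.word
        ORDER BY freq DESC
    """).fetchall()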

5. How many people use the corpora?

As measured by Google Analytics, as of October 2014 the corpora are used by more than 130,000 unique visitors each month. The most widely used corpus is the Corpus of Contemporary American English, with more than 65,000 unique users each month. And people don't just come in, look up one word, and move on: the average visit to the site lasts between 10 and 15 minutes.

6. What do they use the corpora for?

For lots of things. Linguists use the corpora to analyze variation and change in the different languages. Some users are materials developers, who use the data to create teaching materials. Many are language teachers and learners, who use the corpus data to model native-speaker performance and intuition. Translators use the corpora to get precise data on the target languages. Other people in the humanities and social sciences look at changes in culture and society (especially with COHA and Hansard). Some businesses purchase data from the corpora to use in natural language processing projects. And lots of people are just curious about language and (believe it or not) use the corpora for fun, to see what's going on with the languages at the moment. To get a better idea of what people are doing with the corpora, check out (or search through) the entries on the Researchers page.

7. What about copyright?

Our corpora contain hundreds of millions of words of copyrighted material. Their use is legal (under US Fair Use Law) only because of the limited "Keyword in Context" (KWIC) displays. It's much like the "snippet defense" used by Google: they retrieve and index billions of words of copyrighted material, but they only allow end users to access short "snippets" of this data from their servers. Click here for an extended discussion of US Fair Use Law and how it applies to our COCA texts.
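
For readers who haven't seen the term before, a KWIC display simply shows each hit as the search word plus a few words of context on either side, never the full text. The toy function below (not the site's actual code, just a sketch of the idea) illustrates this:

    # Toy illustration of a Keyword-in-Context (KWIC) display: each match is
    # shown only as a short snippet of surrounding words, never the whole text.
    def kwic(tokens, keyword, window=4):
        lines = []
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>30}  [{tok}]  {right}")
        return lines

    text = "the quick brown fox jumps over the lazy dog while the fox looks on"
    for line in kwic(text.split(), "fox"):
        print(line)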

8. Can I get access to the full text of these corpora?

Full-text data for COCA and GloWbE is now available (COCA = 440 million words, 190,000 texts / GloWbE = 1.8 billion words, 1.8 million texts). There is currently no full-text access for the other corpora, although we will probably release full-text data from COHA in early 2015.

9. Is there API access to the corpora?

No, there isn't. There are two main reasons for this. First, we don't hold the copyright to the texts in the corpora, and so we can only provide limited access to them, via the corpus interface. Second, we're already pretty "maxed out" in terms of the two corpus servers, and API access would probably lead to many more queries, which we can't handle right now. Although we don't allow API access, some people have programmed browsers (via VB.NET for IE, or Perl for Firefox) to allow for semi-automated queries (note, though, that we don't provide tech support for this).

10. My access limits (for "non-researchers") are too low. Can I increase them?

"Non-researchers" (Level 1) have 50 queries a day, or about 3,000 queries per month. For most people, this is way more than enough. But if you are in fact a graduate student in languages or linguistics, but there isn't a web page with your name on it, and you really do need more than 1,500 queries per month, then click here. If that's not possible, you might want to contribute to help support the corpora, in which case you will have 200 queries a day.

11. I want more data than what's available via the standard interface. What can I do?

Users can purchase offline data, such as full-text copies of the texts, frequency lists, collocate lists, and n-gram lists (e.g. all two- or three-word strings of words). Click here for much more detailed information on this data, as well as downloadable samples.
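
If the term is unfamiliar, an n-gram list is just a frequency count of every string of n consecutive words in the corpus. A rough sketch of the idea (purely illustrative; the purchasable datasets are precomputed offline and are vastly larger):

    # Rough sketch of how an n-gram (e.g. 3-word) frequency list is derived
    # from running text; illustrative only, not the actual dataset pipeline.
    from collections import Counter

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    tokens = "in terms of the data and in terms of the interface".lower().split()
    for gram, freq in ngram_counts(tokens, 3).most_common(3):
        print(" ".join(gram), freq)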

12. Can my class have additional access to a corpus on a given day?

There is a limit of 250 queries per 24 hours for a "group", where a group is typically a class of students or a department at a university. If you need more queries than this, you'll want an academic site license.

13. Why the new push for contributions?

There are a number of reasons for our move to a contributions-based model in early 2015. One important factor is that Mark Davies, the creator and administrator of the corpora, will probably retire in 2018 or 2019, and there needs to be some viable model for the financial sustainability of the corpora beyond that date. It's probably not realistic to expect the College of Humanities at BYU (which has been extremely supportive to this point) to keep spending $15,000-20,000 for a new server every year or two after 2018-19. In addition, someone will need to work 10-15 hours each week as an administrator for the corpora (a total of about $10,000-15,000/year). Hopefully, with a few years of contributions stored up by 2018-19, and with contributions coming in after that date as well, this will provide the needed financial viability for the corpora (~$15,000-20,000/year). The other option, of course, is to move to a strict subscription-based model like some other corpora, and this is something that we really don't want to have to do.

14. Can our company, university, or organization "sponsor" the corpora?

In addition to the basic "contributions", we're considering the possibility of allowing organizations (such as publishers, ESL schools, universities, etc.) to "sponsor" the corpora. The sponsorship would be for a limited time (e.g. 1-3 months), and it could be targeted at just those users from a particular country. Everyone from that country who uses the corpora during that time period would see a small logo for your organization in the header at the top of the corpus page, linking to your organization's website.

We've already done this for an ESL publisher from South Korea, targeted at all users of WordAndPhrase coming from South Korea, and it has resulted in thousands of additional visits to their website. To give another example, if you're at a university with a graduate program in which corpus linguistics plays an important part, this might be a great way to attract students who are already interested in corpus linguistics to your program.

Anyway, we're just considering the possibility of doing this, and we probably wouldn't start until mid-2015. But if you think that your organization might be interested (and there's no obligation to follow through on this), please let us know (corpus@byu.edu).

15. I don't want to see the messages that appear every 10 searches or so, as I use the corpora.

Part of the rationale for these messages is to let you know about useful resources related to the corpora (such as the word frequency, n-grams, collocates, and full-text data, or WordAndPhrase, AcademicWords, etc.). The other purpose is to help motivate people to contribute, in order to ensure the financial viability of the corpora.

Once you have made a minimal contribution to the corpora (or purchased data from one of the sites just listed, some for as little as $20), you won't see these messages anymore (during the month or year of your contribution or purchase).

If you don't want to contribute to the BYU corpora and are really bothered by the messages, you might want to consider other web-based corpora -- like those from Lancaster University (including BNCweb), CorpusEye, or the many excellent corpora from Sketch Engine. (Please be aware, though, that the subscription fee for the Sketch Engine corpora is somewhat more expensive than the suggested contribution for the BYU corpora.)

16. How do I cite the corpora in my published articles?

Please use the following information when you cite the corpus in academic publications or conference papers. Thanks.

COCA

Davies, Mark. (2008-) The Corpus of Contemporary American English: 450 million words, 1990-present. Available online at http://corpus.byu.edu/coca/.

COHA

Davies, Mark. (2010-) The Corpus of Historical American English: 400 million words, 1810-2009. Available online at http://corpus.byu.edu/coha/.

TIME

Davies, Mark. (2007-) TIME Magazine Corpus: 100 million words, 1920s-2000s. Available online at http://corpus.byu.edu/time/.

BYU-BNC

Davies, Mark. (2004-) BYU-BNC. (Based on the British National Corpus from Oxford University Press). Available online at http://corpus.byu.edu/bnc/.

GloWbE

Davies, Mark. (2013) Corpus of Global Web-Based English: 1.9 billion words from speakers in 20 countries. Available online at http://corpus.byu.edu/glowbe/.

Corpus del Español

Davies, Mark. (2002-) Corpus del Español: 100 million words, 1200s-1900s. Available online at http://www.corpusdelespanol.org.

Corpus do Português

Davies, Mark and Michael Ferreira. (2006-) Corpus do Português: 45 million words, 1300s-1900s. Available online at http://www.corpusdoportugues.org.

Google Books

Davies, Mark. (2011-) Google Books Corpus. (Based on Google Books n-grams). Available online at http://googlebooks.byu.edu/. Based on:
Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331 (2011) [Published online ahead of print 12/16/2010].

In the first reference to the corpus in your paper, please use the full name. For example, for COCA: "the Corpus of Contemporary American English" with the appropriate citation to the references section of the paper, e.g. (Davies 2008-). After that reference, feel free to use something shorter, like "COCA" (for example: "...and as seen in COCA, there are..."). Also, please do not refer to the corpus in the body of your paper as "Davies' COCA corpus", "a corpus created by Mark Davies", etc. The bibliographic entry itself is enough to indicate who created the corpus. Otherwise, it just kind of sounds strange, and overly proprietary.