Who created these corpora?
What is the advantage of these corpora over other ones that are available?
What software is used to index, search, and retrieve data from these corpora?
How many people use the corpora?
What do they use them for?
What about copyright?
Can I get access to the full text of these corpora?
Is there API access to the corpora?
My access limits (for "non-researchers") are too low. Can I increase them?
I want more data than what's available via the standard interface. What can I do?
Can my class have additional access to a corpus on a given day?
Why the new push for contributions?
I don't want to see the messages that appear every 10 searches or so, as I use the corpora.
How do I cite the corpora in my
Things aren't working (I'm getting error messages). Can you help?
1. Who created these corpora?
The underlying corpus
architecture and web interface were created by
Mark Davies, Professor of Linguistics at
Brigham Young University
in Provo, Utah, USA. In most cases, he also designed, collected, edited,
and annotated the corpora. In the case of the BYU-BNC, Strathy,
and Hansard corpora, I received the texts from others, and "just"
created the architecture and interface. So although I use the
terms "we" and "us" on this and other pages, most activities related to the
development of most of these corpora were actually carried out by just one person.
2. Who else
Brigham Young University (especially
the College of Humanities and
the Department of Linguistics and
English Language) has provided generous support to buy hardware and
Microsoft generously provided
the 64-bit Enterprise version of
SQL Server that is the
backbone for the architecture.
The Corpus del
Español, the Corpus
do Português, and the new
Corpus of Historical American English were funded by large grants from the
National Endowment for the Humanities.
Paul Rayson provided
the CLAWS tagger, which
was used for all of the English corpora.
Some BYU students helped to scan a
few of the novels.
Several BYU students helped to scan
novels, magazines, and non-fiction books, and to help process
and correct the files and lexicon.
helped install the VM to run
removed the web page boilerplate material
Based on the
datasets from Google Books
Corpus do Português
This was a joint project with Michael Ferreira
of Georgetown University, who
helped select, acquire, edit, and annotate the older texts
(1300s-1700s), and who provided the translations of the web
interface, among other activities.
original texts were
licensed for re-use from
Oxford University Press.
textual corpus was designed and created at the
Strathy Language Unit at
Queen's University in Canada.
The vast majority of the work on the corpus
(including semantic tagging) was done by
participants in the
I simply created the corpus architecture and interface.
3. What is the advantage of these
corpora over other ones that are available?
For some languages and time periods,
these are really the only corpora available. For example, in spite of
earlier corpora like the
National Corpus and the
Bank of English, our Corpus of
Contemporary American English is the only large, balanced corpus of
American English. In spite of the
Brown family of corpora and the
ARCHER corpus, the Corpus of
Historical American English is the only large and balanced corpus of
historical American English. And the
Corpus del Español and
the Corpus do Português
are the only large, carefully annotated corpora of these two languages. Beyond the
"textual" corpora, however, the corpus architecture and interface that
we have developed allow for speed, size, annotation, and a range of
queries that we believe is unmatched by other architectures, and which
make them useful for corpora such as the
British National Corpus, which
does have other interfaces. Also, they're free -- a nice feature.
4. What software is used to index,
search, and retrieve data from these corpora?
We have created our own corpus
architecture, using Microsoft
SQL Server as the
backbone of the relational database approach.
This architecture allows for size, speed,
and very good scalability that
we don't believe are available with any other architecture. Even
complex queries of the more than 520-million-word COCA corpus or the 400-million-word COHA corpus typically
take only two or three seconds. In addition, because of the relational database
design, we can keep adding on more annotation "modules" with little or
no performance hit. Finally, the relational database design allows for a
range of queries that we
believe is unmatched by any other architecture for large corpora.
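To make the relational idea more concrete, here is a minimal sketch of how a corpus might be stored as joined tables. Everything in it is an assumption made for illustration: it uses Python's built-in sqlite3 module rather than SQL Server, and the table names, column names, tags, and sample sentences are hypothetical and are not the actual schema behind the corpora. It only shows the general pattern of one row per token linked to a lexicon, with a later annotation "module" added as a separate table that joins on the same key.

```python
import sqlite3


def build_demo_corpus():
    """Create a tiny in-memory corpus: one row per token, linked to a lexicon.

    All table and column names here are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE texts   (text_id INTEGER PRIMARY KEY, genre TEXT, year INTEGER);
        CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
        CREATE TABLE tokens  (position INTEGER PRIMARY KEY, text_id INTEGER, word_id INTEGER);
        CREATE INDEX idx_tokens_word ON tokens (word_id);
        """
    )
    conn.executemany(
        "INSERT INTO texts VALUES (?, ?, ?)",
        [(1, "fiction", 1995), (2, "news", 2010)],
    )
    conn.executemany(
        "INSERT INTO lexicon VALUES (?, ?, ?, ?)",
        [
            (1, "she", "she", "pron"),
            (2, "walks", "walk", "verb"),
            (3, "home", "home", "noun"),
            (4, "they", "they", "pron"),
            (5, "walked", "walk", "verb"),
        ],
    )
    # Text 1: "she walks home"; text 2: "they walked home"
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 4), (5, 2, 5), (6, 2, 3)],
    )
    return conn


def lemma_counts_by_genre(conn, lemma):
    """Count every token of a lemma, grouped by the genre of its text."""
    rows = conn.execute(
        """
        SELECT x.genre, COUNT(*)
        FROM tokens t
        JOIN lexicon l ON l.word_id = t.word_id
        JOIN texts x ON x.text_id = t.text_id
        WHERE l.lemma = ?
        GROUP BY x.genre
        ORDER BY x.genre
        """,
        (lemma,),
    ).fetchall()
    return dict(rows)


def add_semantic_module(conn):
    """Add a new annotation layer as its own table; existing tables are untouched."""
    conn.execute("CREATE TABLE semantic_tags (word_id INTEGER, tag TEXT)")
    conn.executemany(
        "INSERT INTO semantic_tags VALUES (?, ?)",
        [(2, "motion"), (5, "motion"), (3, "place")],
    )


def words_with_tag(conn, tag):
    """Return the word forms, in corpus order, whose lexicon entry carries a tag."""
    rows = conn.execute(
        """
        SELECT l.word
        FROM tokens t
        JOIN lexicon l ON l.word_id = t.word_id
        JOIN semantic_tags s ON s.word_id = t.word_id
        WHERE s.tag = ?
        ORDER BY t.position
        """,
        (tag,),
    )
    return [r[0] for r in rows]
```

In this sketch, the new annotation layer is queried with one more join and the token table is never rewritten, which is one plausible reading of why extra "modules" could be added cheaply in a relational design.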
5. How many people use the corpora?
As measured by
Google Analytics, as of
October 2014 the corpora are used by more than 130,000 unique people
each month. The most widely-used corpus is the
Corpus of Contemporary American
English -- with more than 65,000
unique users each month. And people
don't just come in, look for one word, and move on -- average time at
the site each visit is between 10 and 15 minutes.
6. What do they use the corpora for?
For lots of things. Linguists use the
corpora to analyze variation and change in the different languages. Some
are materials developers, who use the data to create teaching materials.
A high number of users are language teachers and learners, who use the
corpus data to model native speaker performance and intuition.
Translators use the corpora to get precise data on the target languages.
Other people in the humanities and social sciences look at changes in
culture and society (especially with
Hansard). Some businesses purchase data from the
corpora to use in natural language processing projects. And lots of
people are just curious about language, and (believe it or not) just use
the corpora for fun, to see what's going on with the languages
currently. To get a better idea of what people are doing with the
corpora, check out (or search through) the entries from the
7. What about copyright?
The corpora contain hundreds of millions of words of copyrighted material. Their
use is legal (under
US Fair Use Law) only because of the limited "Keyword in Context" (KWIC)
displays. It's kind of like the "snippet defense" used by Google. They
retrieve and index billions of words of copyrighted material, but they
only allow end users to access "snippets" of this data from their
servers. Click here for an
extended discussion of US Fair Use Law and how it applies to our
corpora.
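For readers unfamiliar with the term, a "Keyword in Context" display shows each hit for a search word with only a few words on either side, never the whole text. The following is a minimal sketch of that idea, not the code used by the corpus interface; the function name `kwic`, the whitespace tokenization, and the default window size are all assumptions made for illustration.

```python
def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for every hit.

    Tokenization is a plain whitespace split, and matching ignores case
    and surrounding punctuation. Both are simplifying assumptions.
    """
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.strip('.,;:!?"').lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits
```

Calling `kwic("The quick brown fox jumps over the lazy dog. The fox sleeps.", "fox", window=2)` yields two short snippets, one per occurrence of "fox", each limited to two words of context on either side.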
8. Can I get access to the full text
of these corpora?
Full-text data for COCA and
GloWbE is now available (COCA
= 440 million words, 190,000 texts /
GloWbE = 1.8
billion words, 1.8 million texts). There is currently no full-text
access for the other corpora, although we will probably release
full-text data from COHA in
9. Is there API access to the corpora?
No, there isn't. There are two main
reasons for this. First, we don't have copyright access
to the texts in the corpora, and so we can only provide limited access to the
corpora, via the corpus interface. Second, we're already pretty "maxed out" in
terms of the two corpus servers, and API access would probably lead to quite a
few more queries, which we can't handle right now. Although we don't allow API
access, some people have programmed browsers (via
Perl for Firefox) to allow for semi-automated queries (note, though, that we
don't provide tech support for this).
10. My access limits (for "non-researchers") are too low. Can I increase them?
Non-researchers (Level 1) have 50 queries
a day, or about 1,500 queries per month. For most people, this is way more than
enough. But if you are in fact a graduate student in languages or linguistics, but
there isn't a web page with your name on it, and you really do need more than
1,500 queries per month, then click here. If that's
not possible, you might want to contribute to help
support the corpora, in which case you will have 200 queries a day.
11. I want more data than what's
available via the standard interface. What can I do?
Users can purchase offline data -- such
as full text copies of
the texts, frequency lists,
n-grams lists (e.g. all two or three word strings of
Click here for much more detailed
information on this data, as well as downloadable samples.
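As a rough illustration of what an n-gram list contains, the sketch below counts every run of n adjacent words in a list of tokens. It is only an example of the general technique; the function name `ngram_counts` and the sample sentence are hypothetical, and the purchasable data may be produced and formatted quite differently.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every sequence of n adjacent tokens.

    zip() over n staggered copies of the token list yields each n-gram once;
    it stops at the shortest copy, so no partial n-grams are produced.
    """
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)
```

For example, `ngram_counts("the cat sat on the cat mat".split(), 2)` counts the two-word string "the cat" twice and every other two-word string once.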
12. Can my
class have additional access to a corpus on a given day?
There is a limit of 250 queries per
24 hours for a "group", where a group is typically a class of students or a
department at a university. If you need more queries than this, you'd want an
academic / site license.
13. Why the
new push for contributions?
There are a
number of reasons for our move to a
contributions-based model in early 2015. One
important factor is that Mark Davies, the creator and
administrator for the corpora, will probably be retiring in 2018 or 2019, and
there needs to be some viable model for financial sustainability of the corpora beyond
that date. It's probably not realistic to expect the
College of Humanities at
BYU (which has been extremely supportive to this point) to keep spending
$15,000-20,000 for a new server every year or two after 2018-19. In
addition, there will need to be someone working 10-15 hours per week as an
administrator for the corpora (for a total of $10,000-15,000/year). Hopefully, with
a few years of contributions stored up by 2018-19, and with contributions coming in
after that date as well, this will provide the needed financial viability of the
corpora (~$15,000-20,000/year). The other option, of course, is to go to a
subscription-based model like some other corpora, but this is something that
we really don't want to have to do.
14. I don't
want to see the messages that appear every 10 searches or so, as I use the
corpora.
Part of the rationale for these
messages is to let you know about useful resources that are related to the
corpora (such as the word frequency data,
full-text data, or
AcademicWords). The other
purpose is to help motivate people to contribute, in order to ensure the
financial viability of the corpora.
Once you have made a minimal
contribution to the corpora (or purchased data from one of the sites just
listed) you won't see these messages anymore (during the month or
year of your contribution or purchase).
If you don't want to contribute to the BYU corpora and are
really bothered by the messages, you might want to consider other web-based
corpora -- like those from Lancaster
University (including BNCweb),
CorpusEye, or the many
excellent corpora from Sketch Engine.
(Please be aware, though, that the subscription fee for the Sketch Engine
corpora is somewhat
more expensive than the suggested contribution for the BYU corpora.)
15. How do I cite the corpora in my
Please use the following information when
you cite the corpus in academic publications or conference papers. Thanks.
Davies, Mark. (2008-) The Corpus of Contemporary American English: 520 million words, 1990-present. Available online at http://corpus.byu.edu/coca/.
Davies, Mark. (2010-) The Corpus of Historical American English: 400 million
words, 1810-2009. Available online at http://corpus.byu.edu/coha/.
Davies, Mark. (2007-) TIME Magazine Corpus: 100 million words,
1920s-2000s. Available online at http://corpus.byu.edu/time/.
Davies, Mark. (2004-) BYU-BNC.
(Based on the British National Corpus from Oxford University
Press). Available online at http://corpus.byu.edu/bnc/.
Davies, Mark. (2013) Corpus of Global
Web-Based English: 1.9 billion words from speakers in 20 countries. Available online at http://corpus.byu.edu/glowbe/.
Davies, Mark. (2015) The Wikipedia Corpus: 4.6 million articles, 1.9 billion words.
Adapted from Wikipedia. Available online at http://corpus.byu.edu/wiki/.
Corpus del Español
Davies, Mark. (2002-) Corpus del Español: 100 million words,
1200s-1900s. Available online at http://www.corpusdelespanol.org.
Corpus do Português
Davies, Mark and Michael Ferreira. (2006-)
Corpus do Português: 45
million words, 1300s-1900s. Available online at
Davies, Mark. (2011-) Google Books
(Based on Google Books n-grams). Available online at
Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres,
Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg,
Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak,
and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using
Millions of Digitized Books. Science 331 (2011) [Published online ahead
of print 12/16/2010].
In the first reference to the corpus in your paper, please use the
full name. For example, for COCA: "the Corpus of Contemporary American English"
with the appropriate citation to the references section of the paper, e.g.
(Davies 2008-). After that
reference, feel free to use something shorter, like "COCA" (for example: "...and
as seen in COCA, there are..."). Also, please
do not refer to the
corpus in the body of your paper as "Davies' COCA corpus", "a
corpus created by Mark Davies", etc. The bibliographic entry
itself is enough
to indicate who created the corpus. Otherwise, it just kind of sounds
strange, and overly proprietary.