|
Corpus name |
Number of words |
Language / dialect
Time period |
Content |
Searches / architecture / interface |
|
English |
These five corpora now have exactly the same architecture and interface. Users can:
* Search by word, phrase, substring, part of speech (e.g. nouns or verbs), and lemma (e.g. all forms of go: goes, went, etc.)
* Find the collocates (nearby words) of a given word or phrase, which gives insight into the meaning of the word
* Compare the collocates of two words, to see differences in meaning or usage (e.g. collocates of rob vs. steal, warm vs. hot, men vs. women, or Democrats vs. Republicans)
* Compare the collocates across time periods (gives insight into changes in meaning, such as new uses of green)
* Compare the collocates across genres (to show differences in 'word sense', e.g. chair = 'committee leader' (academic) vs. 'piece of furniture' (fiction))
* Order results by Mutual Information Score (shows 'relevance', in addition to raw frequency)
* With integrated thesauruses, find the frequency and distribution of synonyms of a given word (to see which synonyms are most frequent, in which genres they are used most, which are increasing or decreasing in use, etc.)
* Create personalized lists of words and phrases (e.g. for a particular semantic field) and then re-use them as part of subsequent queries |
|
The Corpus of Contemporary American English
(COCA) |
385 million |
American English
1990-present |
20 million words each
year, 1990-present. Equally divided into spoken,
fiction, popular magazine, newspaper, and academic. Will
be updated at least two times a year. |
|
BYU-BNC: The British National Corpus |
100 million |
British English
~1980s-1993 |
90 million words written
(fiction, newspaper, academic, etc); 10 million spoken.
[Website for the
original BNC] |
|
TIME Magazine |
100 million |
American English
1923-present |
More than 275,000 articles from TIME Magazine. Wide range of topics: news, sports,
business, culture, health, entertainment, etc. |
|
Other languages |
|
|
|
|
Corpus del Español |
100 million |
Spanish
1200s-1900s |
20 million words 1900s,
20m 1800s, 40m 1500s-1700s, 20m 1200s-1400s |
|
Corpus do Português |
45 million |
Portuguese
1300s-1900s |
20 million words 1900s,
including spoken, fiction, newspaper, and academic.
Equally divided Brazil/Portugal. 10m 1800s, 15m
1300s-1700s |
|
Corpus del Español: Registers |
20 million |
Spanish
1900s |
Enhanced version of the 1900s component of the Corpus del Español. Equally
divided between spoken, fiction, newspaper, and academic. |
Compare frequency of 110+
grammatical constructions in twenty different registers. |
|
BYU-only (limited to on-campus
use by BYU students and faculty) |
Oxford English Dictionary (OED) [SEARCH]
|
37 million |
Old English-1900s |
2.2 million quotations in the Oxford English Dictionary. |
Find the frequency of words, phrases, substrings, and constructions in each
century since Old English. Hits can be limited by their frequency in any century. |
|
EEBO / LION
[SEARCH] |
700 million |
1500s-1900s |
Early English Books Online (1500s-1600s; 350m words) and
Literature Online (mainly 1700s-1800s; 350m words) |
Basic interface to these corpora. Find the frequency by
decade and century for words, phrases, and substrings. |
|
LDS General Conferences |
23 million |
1851-present |
Every General Conference talk from 1851 to the present |
Basic interface to this corpus. Find the frequency by
decade for words, phrases, and substrings. |
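
The collocate search and Mutual Information ordering listed among the features above can be illustrated with a short sketch. The table does not state the formula these corpora use, so what follows is an assumption: one common pointwise formulation, the base-2 log of the observed co-occurrence count over the count expected by chance within the search window. The function name `collocates_by_mi`, its parameters, and the sample sentence are all hypothetical and are not part of any corpus interface.

```python
import math
from collections import Counter


def collocates_by_mi(tokens, node, window=4, min_freq=1):
    """Rank words found near `node` by a Mutual Information score.

    Assumed formula (a common one, not confirmed for these corpora):
        MI = log2(observed / expected)
        expected = freq(node) * freq(word) * span / corpus_size
    where span is the number of positions searched around the node.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    co_freq = Counter()

    # Count every word that falls within `window` tokens of the node word.
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                co_freq[tokens[j]] += 1

    span = 2 * window  # positions searched left and right of the node
    results = []
    for word, observed in co_freq.items():
        if observed < min_freq:
            continue
        expected = word_freq[node] * word_freq[word] * span / n
        results.append((word, observed, math.log2(observed / expected)))

    # Highest MI first: words that co-occur more often than chance predicts.
    results.sort(key=lambda r: r[2], reverse=True)
    return results


if __name__ == "__main__":
    sample = ("the strong tea and the strong coffee and "
              "the weak tea and the cat and the dog").split()
    for word, freq, mi in collocates_by_mi(sample, "strong", window=1):
        print(f"{word}\t{freq}\t{mi:.2f}")
```

In this toy sample, "the" is the most frequent neighbor of "strong" by raw count but ranks last by MI, because it is common everywhere in the text. That is the sense in which an MI ordering shows 'relevance' in addition to raw frequency.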