The Corpus of Contemporary American English (COCA) is probably the best corpus of
English (online or anywhere else) for looking at a wide range of ongoing
changes in the language (see
specific examples below). In order to look at ongoing changes, a corpus
would ideally have the following characteristics:
-
Large (probably 100
million words or more)
-
Recent texts
(ideally, it would be updated to within a year of the present time)
-
Balance between several
genres (e.g. not just newspapers)
-
Roughly the same genre
balance from year to year
-
An architecture that
shows frequency over time and which allows one to compare
frequencies between different periods
As we have discussed in an article in the journal
Literary and Linguistic Computing, other corpora have some of
these features, but none of them has all five. For example:
-
The
British National Corpus is
fairly large
and has many genres, but for most genres in the corpus, it is now almost a generation out of date.
-
The
Brown family of corpora (Brown,
Frown, LOB, FLOB) are neither large nor recent (the last set of texts
at least the same size as those from the 1960s is
from 1991)
-
The
American National Corpus
has none of the five characteristics listed above
-
The
Bank of English
(now WordBanks Online) is large and fairly recent (up through 2005/2006), but
its genre
balance varies a great deal from year to year. As a result, there is no way to
know whether the changes it shows are indicative of actual changes in the
"real world", or whether they just reflect changes in the corpus
itself. To give a simple example, a higher frequency of "fiction"
words (pale, smile, sparkle, etc.) in the early 2000s
than in the late 1990s might simply reflect an increase in the total
number of words in fiction texts during that time, but this would
give no evidence that these words or phrases had actually increased
in real-world usage. (In addition, another serious problem with
these two corpora is that neither is freely available to the
public.)
-
The
Web (via Google)
and text archives are not genre-balanced, and (most importantly)
there is no way to measure change over time. In order to do so, one
would have to know the frequency of an item in a given year and then
know the overall size of all texts in that year (to get normalized
frequency statistics). There are also
real problems
in terms of searches involving phrases, and not just individual
words.
[Note that the corpora mentioned above might be great for other
things, just not as a monitor corpus to look at ongoing changes.]
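The two points made above about normalized frequency and genre balance can be illustrated with a small sketch. All of the counts and corpus sizes below are invented purely for illustration (they are not taken from any real corpus), and the function names are hypothetical. The sketch shows a word whose rate per million words is unchanged within each genre, but whose overall rate appears to rise simply because the share of fiction in the corpus grew.

```python
def per_million(raw_count, corpus_size):
    """Normalized frequency: occurrences per million words of text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical data, invented for illustration only.
# Each genre maps to (occurrences of the word, total words in that genre).
late_1990s = {"fiction": (400, 10_000_000), "newspaper": (100, 40_000_000)}
early_2000s = {"fiction": (800, 20_000_000), "newspaper": (75, 30_000_000)}


def genre_rates(period):
    """Per-million rate of the word within each genre separately."""
    return {genre: per_million(count, size)
            for genre, (count, size) in period.items()}


def overall_rate(period):
    """Per-million rate of the word across all genres pooled together."""
    hits = sum(count for count, _ in period.values())
    words = sum(size for _, size in period.values())
    return per_million(hits, words)


print(genre_rates(late_1990s))    # {'fiction': 40.0, 'newspaper': 2.5}
print(genre_rates(early_2000s))   # {'fiction': 40.0, 'newspaper': 2.5}
print(overall_rate(late_1990s))   # 10.0
print(overall_rate(early_2000s))  # 17.5
```

In this made-up case the pooled rate rises from 10.0 to 17.5 per million even though usage within each genre is identical in both periods, which is why a roughly constant genre balance from year to year matters for a monitor corpus.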
The Corpus of Contemporary American English
was designed from the ground up as a "monitor corpus" (a corpus that
allows us to look at changes over time), and it is the only corpus
(online or elsewhere) that has all of the five characteristics listed
above.
Let's briefly look at a few
examples of data from the corpus relating to ongoing changes in English.
Just click on any of the following links to run the queries. Note that in
the comparisons below, the 2000s are on the left and the 1990s are on
the right. Also, note that not every entry in the tables is meaningful,
but the tables are a good starting point.
-
Lexical change
(words and phrases): What is the
frequency of
old-school,
gift (as a
verb), freak out,
(think) outside the box, on
the hook for, throw
someone under the bus, or
[be] likely a|the
over time? (Note that you can click on [See all sections] in the Chart
Display to see the frequency by individual years as well). What
-ed
verbs or
-tion nouns or
-ing adjectives or
phrasal verbs with up are used a lot
more in 2010-19 than in 1990-99? (Wait 5-6 seconds for the noun and
adjective queries to run.)
-
Morphological change (word
formation): Are words with the suffix
-gate
(indicating "scandal") and the suffix
-friendly more frequent
in the 1990s or the 2000s? What is the frequency of
words ending in -ism
(e.g. communism, terrorism) in each time period since the
early 1990s, and which -ism
words are more common in the 2000s than the 1990s (and vice
versa)?
-
Syntactic change
(grammar): Are the following increasing or decreasing (and when):
end up V-ing,
get passive
(got hired),
so not ADJ
(I'm so not interested in her), and "quotative
like" (he's like, I'm not going).
-
Semantic change (word meaning):
Changes over time with collocates (nearby words) can often indicate
changes in meaning or the usage of a given word. See if this is true
for the following words:
green,
web,
engine.
-
Discourse analysis
("what are we saying about X?"): Compare the collocates for the given
words in the 1990s and the 2000s:
crisis,
terror,
gay. Or
look at the collocates of
nuclear and
crisis in each time period since the early 1990s. How does
this data give insight into changes in American culture and society
during this time?
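For readers who want to reproduce this kind of period-to-period comparison on their own frequency lists, the following is a minimal sketch. It is not the corpus interface's actual method: the function names, the add-one smoothing, the `min_count` threshold, and all of the counts and sizes are assumptions made for illustration. It ranks words by the ratio of their normalized (per-million) frequency in one period to that in another.

```python
def per_million(raw_count, corpus_size):
    """Normalized frequency: occurrences per million words of text."""
    return raw_count * 1_000_000 / corpus_size


def rank_by_change(counts_a, size_a, counts_b, size_b, min_count=5):
    """Rank words by (rate in period A) / (rate in period B), highest first.

    Add-one smoothing on the raw counts is an illustrative choice that
    avoids division by zero for words missing from one period.
    """
    rows = []
    for word in set(counts_a) | set(counts_b):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        if a + b < min_count:
            continue  # too rare in both periods to say anything useful
        ratio = per_million(a + 1, size_a) / per_million(b + 1, size_b)
        rows.append((word, ratio))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical counts and period sizes, invented for illustration only.
counts_2000s = {"blog": 199, "fax": 19, "table": 1999}
counts_1990s = {"blog": 9, "fax": 99, "table": 999}
ranking = rank_by_change(counts_2000s, 20_000_000, counts_1990s, 10_000_000)

for word, ratio in ranking:
    print(word, ratio)
```

With these invented numbers, "blog" ranks first (about ten times more frequent per million words in the later period), "table" is unchanged, and "fax" ranks last. Because the two periods differ in size, comparing raw counts instead of normalized rates would wrongly suggest that "table" had doubled.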