The Corpus of Contemporary American English (COCA) is probably the best corpus of English (online or anywhere else) for looking at a wide range of ongoing changes in the language (see specific examples below). In order to look at ongoing changes, a corpus would ideally have the following characteristics:

  1. Large (probably 100 million words or more)

  2. Recent texts (ideally, it would be updated to within a year of the present time)

  3. Balance between several genres (e.g. not just newspapers)

  4. Roughly the same genre balance from year to year

  5. An architecture that shows frequency over time and allows one to compare frequencies between different periods

As we have discussed in an article in the journal Literary and Linguistic Computing, other corpora have some of these features, but none of them has all five. For example:

  • The British National Corpus is fairly large and has many genres, but for most genres in the corpus, it is now almost a generation out of date.

  • The Brown family of corpora (Brown, Frown, LOB, FLOB) is neither large nor recent (the last set of texts that is at least the same size as those from the 1960s is from 1991).

  • The American National Corpus has none of the five characteristics listed above.

  • The Bank of English (now WordBanks Online) is large and fairly recent (up through 2005/2006), but its genre balance varies a great deal from year to year. As a result, there is no way to know whether the changes it shows are indicative of actual changes in the "real world" or whether they just reflect changes in the corpus itself. To give a simple example, a higher frequency of "fiction" words (pale, smile, sparkle, etc.) in the early 2000s than in the late 1990s might simply reflect an increase in the total number of words in fiction texts during that time, and this would give no evidence that these words had actually increased in real-world usage. (In addition, another serious problem with these two corpora is that neither is freely available to the public.)

  • The Web (via Google) and text archives are not genre-balanced, and (most importantly) there is no way to measure change over time. In order to do so, one would have to know the frequency of an item in a given year and then know the overall size of all texts in that year (to get normalized frequency statistics). There are also real problems in terms of searches involving phrases, and not just individual words.

  • [Note that the corpora mentioned above might be great for other things, just not as a monitor corpus to look at ongoing changes.]
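
Two of the points above lend themselves to a small worked example: the normalized frequency needed to measure change over time (occurrences divided by the total size of the texts for that period), and the way a shifting genre balance can mimic a real change. The following Python sketch is only an illustration. All of the rates and corpus sizes are invented for the example, and the function and variable names are assumptions rather than anything taken from an actual corpus interface.

```python
def per_million(hits, corpus_words):
    """Normalized frequency: occurrences per one million words of text."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 1_000_000 / corpus_words


# Hypothetical rates (per million words) for a "fiction" word.
# They are deliberately held constant, so real-world usage does not change.
RATE_PER_MILLION = {"fiction": 120.0, "newspaper": 10.0}

# Hypothetical corpus composition (in words) for two periods.
# Only the share of fiction differs between them.
WORDS = {
    "late 1990s": {"fiction": 10_000_000, "newspaper": 40_000_000},
    "early 2000s": {"fiction": 30_000_000, "newspaper": 20_000_000},
}


def overall_rate(period):
    """Per-million frequency over the whole (unbalanced) corpus for one period."""
    sizes = WORDS[period]
    hits = sum(RATE_PER_MILLION[g] * n / 1_000_000 for g, n in sizes.items())
    return per_million(hits, sum(sizes.values()))


for period in WORDS:
    print(period, overall_rate(period))
```

With these made-up numbers, the overall rate rises from 32 to 76 per million even though the word is exactly as frequent within each genre in both periods. The apparent increase comes entirely from the larger share of fiction, which is why a roughly constant genre balance from year to year matters.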

The Corpus of Contemporary American English was designed from the ground up as a "monitor corpus" (a corpus that allows us to look at changes over time), and it is the only corpus (online or elsewhere) that has all of the five characteristics listed above.


Let's briefly look at a few examples of data from the corpus relating to ongoing changes in English. Just click on any of the following links to run the queries. Note that in the comparisons below, the 2000s are on the left and the 1990s are on the right. Also, note that not every entry in the tables is meaningful, but they are a good starting point.

  • Lexical change (words and phrases): What is the frequency of old-school, gift (as a verb), freak out, (think) outside the box, on the hook for, throw someone under the bus, or [be] likely a|the over time? (Note that you can click on [See all sections] in the Chart Display to see the frequency by individual years as well.) What -ed verbs or -tion nouns or -ing adjectives or phrasal verbs with up are used a lot more in 2010-19 than in 1990-99? (Wait 5-6 seconds for the noun and adjective queries to run.)

  • Morphological change (word formation): Are words with the suffix -gate (indicating "scandal") and the suffix -friendly more frequent in the 1990s or the 2000s? What is the frequency of words ending in -ism (e.g. communism, terrorism) in each time period since the early 1990s, and which -ism words are more common in the 2000s than the 1990s (and vice versa)?

  • Syntactic change (grammar): Are the following increasing or decreasing (and when): end up V-ing, get passive (got hired), so not ADJ (I'm so not interested in her), and "quotative like" (he's like, I'm not going)?

  • Semantic change (word meaning): Changes over time with collocates (nearby words) can often indicate changes in meaning or the usage of a given word. See if this is true for the following words: green, web, engine.

  • Discourse analysis ("what are we saying about X?"): Compare the collocates for the given words in the 1990s and the 2000s: crisis, terror, gay. Or look at the collocates of nuclear and crisis in each time period since the early 1990s. How does this data give insight into changes in American culture and society during this time?
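
The period comparisons above (for example, which words are used a lot more in one decade than in another) rest on comparing normalized frequencies rather than raw counts. A minimal Python sketch of that idea follows. It is not the corpus's own method: the placeholder words, their counts, and the period sizes are all invented, and the handling of words that are absent from the earlier period is an assumption made for the example.

```python
import math

# Hypothetical total sizes (in words) of the two periods being compared.
LATER_SIZE = 200_000_000
EARLIER_SIZE = 100_000_000

# Hypothetical raw counts per placeholder word: (later period, earlier period).
COUNTS = {
    "word_a": (300, 100),
    "word_b": (50, 200),
    "word_c": (40, 0),
}


def per_million(hits, corpus_words):
    """Normalized frequency: occurrences per one million words."""
    return hits * 1_000_000 / corpus_words


def period_ratio(later_hits, earlier_hits):
    """Ratio of the later period's per-million rate to the earlier period's."""
    later = per_million(later_hits, LATER_SIZE)
    earlier = per_million(earlier_hits, EARLIER_SIZE)
    if earlier == 0:
        # Assumption: a word unseen earlier counts as infinitely "more frequent";
        # a word unseen in both periods is treated as unchanged.
        return math.inf if later > 0 else 1.0
    return later / earlier


# Words most characteristic of the later period come first.
ranked = sorted(COUNTS, key=lambda w: period_ratio(*COUNTS[w]), reverse=True)
for word in ranked:
    print(word, period_ratio(*COUNTS[word]))
```

Note that word_a has three times as many raw hits in the later period but a ratio of only 1.5, because the later period in this example is twice as large. Very rare words can produce extreme ratios from a handful of hits, which is one reason not every entry in such a comparison is meaningful.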