The Corpus of Contemporary American English (COCA) is probably the best corpus of English (online or anywhere else) for looking at a wide range of ongoing changes in the language (see specific examples below). In order to look at ongoing changes, a corpus would ideally have the following characteristics:

  1. Large (probably 100 million words or more)

  2. Recent texts (ideally, it would be updated to within a year of the present time)

  3. Balance between several genres (e.g. not just newspapers)

  4. Roughly the same genre balance from year to year

  5. An architecture that shows frequency over time and allows one to compare frequencies between different periods

As we have discussed in a recent article in the journal Literary and Linguistic Computing, other corpora have some of these features, but none of them has all five. For example:

  • The British National Corpus is fairly large and has many genres, but it is now almost a generation out of date, and will likely never be updated.

  • The Brown family of corpora (Brown, Frown, LOB, FLOB) is neither large nor recent (the last texts are from 1991).

  • The American National Corpus has none of the five characteristics listed above.

  • The Bank of English and the Oxford English Corpus are both large and fairly recent (up through 2005 / 2006), but their genre balance varies a great deal from year to year. As a result, there is no way to know whether the changes they show reflect actual changes in the "real world" or just changes in the composition of the corpus itself. To give a simple example, a higher frequency of "fiction" words (pale, smile, sparkle, etc.) in the early 2000s than in the late 1990s might simply reflect an increase in the proportion of fiction texts in the corpus during that time; it would give no evidence that these words had actually increased in real-world usage. (Another serious problem with these two corpora is that neither is freely available to the public.)

  • The Web (via Google) and text archives are not genre-balanced, and (most importantly) there is no way to measure change over time. In order to do so, one would have to know the frequency of an item in a given year and also the overall size of all texts in that year (to get normalized frequency statistics). There are also real problems with searches that involve phrases rather than just individual words.

  • [Note that the corpora mentioned above might be great for other things, just not as a monitor corpus to look at ongoing changes.]
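
The two points above (normalized frequency, and the effect of an unstable genre balance) can be illustrated with a small sketch. This is only an illustration, not how any of these corpora compute their statistics: the helper names `per_million` and `overall_rate`, the per-genre rates, and the corpus sizes are all invented for the example. The per-genre rates are deliberately held constant, so any difference between the two periods comes purely from the change in genre mix.

```python
def per_million(count, total_words):
    """Normalized frequency: occurrences per million words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count * 1_000_000 / total_words


# Hypothetical rates (per million words) for a "fiction" word such as "pale".
# Assumption: usage within each genre is identical in both periods.
RATE_PER_MILLION = {"fiction": 100.0, "newspaper": 10.0}

# Hypothetical corpus composition in words. Only the genre mix differs.
SIZES = {
    "late 1990s": {"fiction": 10_000_000, "newspaper": 90_000_000},
    "early 2000s": {"fiction": 40_000_000, "newspaper": 60_000_000},
}


def overall_rate(sizes_by_genre):
    """Pooled per-million rate for a corpus with the given genre sizes."""
    total_words = sum(sizes_by_genre.values())
    # Expected raw count in each genre = rate * (genre size in millions).
    total_count = sum(
        RATE_PER_MILLION[genre] * words / 1_000_000
        for genre, words in sizes_by_genre.items()
    )
    return per_million(total_count, total_words)


for period, sizes in SIZES.items():
    print(f"{period}: {overall_rate(sizes):.1f} per million words")
```

With these invented numbers the pooled rate rises from 19.0 to 46.0 per million words even though usage within each genre has not changed at all, which is why a roughly constant genre balance from year to year matters.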

The Corpus of Contemporary American English was designed from the ground up as a "monitor corpus" (a corpus that allows us to look at changes over time), and it is the only corpus (online or elsewhere) that has all five of the characteristics listed above.


Let's briefly look at a few examples of data from the corpus relating to ongoing changes in English. Just click on any of the following links to run the queries. Note that in the comparisons below, the 2000s are on the left and the 1990s are on the right. Note also that not every entry in the tables is meaningful, but the tables are a good starting point.

  • Lexical change (words and phrases): What is the frequency of jonesing, morph, old-school, gift (as a verb), freak out, perfect storm, (think) outside the box, on the hook for, throw someone under the bus, or [be] likely a|the over time? (Note that you can click on [See all sections] in the Chart Display to see the frequency by individual years as well.) Which verbs, nouns, adjectives, or phrasal verbs with up are used a lot more in 2005-09 than in 1990-94? (Wait 5-6 seconds for the noun and adjective queries to run.)

  • Morphological change (word formation): Are words with the suffix -gate (indicating "scandal") and the suffix -friendly more frequent in the 1990s or the 2000s? What is the frequency of words ending in -ism (e.g. communism, terrorism) in each time period since the early 1990s, and which -ism words are more common in the 2000s than the 1990s (and vice versa)?

  • Syntactic change (grammar): Are the following increasing or decreasing (and when): end up V-ing, the get passive (got hired), "quotative like" (he's like, I'm not going), so not ADJ (I'm so not interested in her)?

  • Semantic change (word meaning): Changes over time in collocates (nearby words) can often indicate changes in the meaning or usage of a given word. See if this is true for the following words: green, web, engine.

  • Discourse analysis ("what are we saying about X?"): Compare the collocates for the given words in the 1990s and the 2000s: crisis, terror, gay, religion. Or look at the collocates of nuclear and crisis in each time period since the early 1990s. How does this data give insight into changes in American culture and society during this time?
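
Comparisons between periods like those in the queries above come down to comparing normalized frequencies rather than raw counts. A rough sketch of the arithmetic follows. The helper name `period_ratio` and every count and size in it are hypothetical and are not taken from the corpus; the corpus interface may compute and display its comparisons differently.

```python
import math


def per_million(count, total_words):
    """Occurrences per million words of text."""
    return count * 1_000_000 / total_words


def period_ratio(count_a, size_a, count_b, size_b):
    """Ratio of the normalized frequency in period A to that in period B.

    Returns math.inf when the item occurs in A but never in B, and
    math.nan when it occurs in neither period.
    """
    rate_a = per_million(count_a, size_a)
    rate_b = per_million(count_b, size_b)
    if rate_b == 0:
        return math.inf if rate_a > 0 else math.nan
    return rate_a / rate_b


# Invented figures for one word: (raw count, words of text in that period).
count_2005_09, size_2005_09 = 300, 120_000_000
count_1990_94, size_1990_94 = 50, 100_000_000

ratio = period_ratio(count_2005_09, size_2005_09, count_1990_94, size_1990_94)
print(f"2005-09: {per_million(count_2005_09, size_2005_09):.2f} per million")
print(f"1990-94: {per_million(count_1990_94, size_1990_94):.2f} per million")
print(f"ratio:   {ratio:.1f}")
```

In this made-up case the raw counts differ by a factor of six (300 vs. 50), but because the later period is assumed to contain more text, the normalized rates (2.50 vs. 0.50 per million) differ by a factor of five.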