Historical English corpora and diachronic syntax

Mark Davies
Brigham Young University

Updated 23 Jan 2009


Until recently, there were no large, annotated corpora of English that could be used to study diachronic syntax (i.e. how the grammar has changed over time). Either the corpora were extremely small (0.5-2.0 million words), or were not annotated, or were both small and unannotated. This compares poorly with other languages such as Spanish and Portuguese, where large, annotated corpora (45-100 million words) are available, and where they have been used to study diachronic syntax in some detail (see a partial listing of some of these studies, especially for the period 1994-2003).

In the past three years, however, three large, annotated corpora of historical English have come online. For more details, including examples of the types of searches that can be done with each corpus, click here.

Name Size Time period
BYU-OED 37 million words Old English to present
TIME Corpus of American English 100 million words 1920s-2000s
Corpus of Contemporary American English (COCA) 385+ million words 1990-2009

The question with any new corpus, of course, is whether it models well what is actually occurring in the language. Some might object to the materials that were used to create these three corpora. The proof of their value, however, is seeing whether they do provide data that agree with what people have already extracted from other sources. We believe that in fact they do meet this requirement very nicely. In addition, they contain much, much more data than other historical corpora, which means that we can use them to research more phenomena, and with much more precision. Finally, because they are annotated and because the corpus architecture is very advanced, it is possible to extract the data much more quickly and easily than with other corpora.

In the sections that follow, we will give just a few examples of the type of data on historical grammar which can be extracted from each of these three corpora. We invite you to use the corpora for other topics that may be of interest to you.

First, a couple of notes on searching the corpora. Just copy the [SEARCH STRING] shown in the tables below into the WORD(S) field of the corpus interface, and choose [CHARTS]. Then click on any of the bars in the chart to see the matching strings, and click on any of those strings to see KWIC (Keyword in Context) displays. Also:

  • Elements like [know] represent lemmas -- all forms of the word.
  • Elements like [vvg] represent part-of-speech tags (in this case, gerunds: walking, reading, etc.).
  • Elements like [davies:gone-come] represent "customized lists". Via the web interface, any user can create a list of words to be used in that "slot" in the query. This is particularly useful for the BYU-OED corpus, where lemmatization and part-of-speech tagging are not complete. For example, [davies:gone-come] refers to a list that we have created with nine different spellings of gone and come: gon, coome, cumme, etc. See the help files at the corpus websites for more information (or click on the "question mark" to the right of [USER LISTS]).
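A customized list, in effect, lets one slot in a query match any of a set of user-defined spellings. A minimal Python sketch of the idea (the spellings below are only the ones mentioned above; the actual [davies:gone-come] list contains nine):

```python
# Sketch of how a "customized list" works: the list fills one slot in a
# query and matches any of its variant spellings. The spellings here are
# illustrative -- only those mentioned in the text.
gone_come = {"gone", "gon", "come", "coome", "cumme"}

def matches_slot(token, user_list):
    """True if this token fills the customized-list slot in a query."""
    return token.lower() in user_list

# e.g. scanning a quotation for the [davies:gone-come] slot
quotation = "he is coome home".split()
hits = [(i, w) for i, w in enumerate(quotation) if matches_slot(w, gone_come)]
print(hits)  # -> [(2, 'coome')]
```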
     

BYU-OED (37 million words, Old English - Present-Day English)

This corpus is based on 37 million words in 2.2 million quotations in the Oxford English Dictionary. Some might wonder whether these quotations -- which form the entries for specifically-chosen words -- can form an accurate corpus. Let's see.

(Before starting, remember that -- due to the extreme variation in spelling in historical texts -- the corpus is only tagged and lemmatized at a relatively basic level. As mentioned above, however, it is possible for individual users to modify the lemmatization and tagging to suit their own purposes, via "customized lists".)

1. Do-support (measured here by negation with a single verb, know: we know not vs. we do not (or don't) know): 1,676 tokens

Notice how clearly the data from these 1,676 tokens show the steady rise of do-support from the 1500s to the 1900s. If the 2.2 million quotations in the corpus didn't really form a good corpus, would we see such a clear, sustained shift?

  Search string 1500s 1600s 1700s 1800s 1900s
A [*pps*] [know] [not] 89 295 164 155 15
B [*pps*] [do] [not] [know] 1 23 64 155 112
C [*pps*] don't|doesn't|didn't [know] 0 7 31 166 399
  Proportion (B+C) (i.e. pre-verbal do) 0.01 0.09 0.37 0.67 0.97
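The final row of the table is simply the share of pre-verbal do among all the negated tokens, i.e. (B + C) / (A + B + C). A short Python check, using the token counts from the table above:

```python
# Derive the final row of the do-support table: the proportion of
# pre-verbal do -- (B + C) / (A + B + C) -- per century.
counts = {  # century: (A "know not", B "do not know", C "don't know")
    "1500s": (89, 1, 0),
    "1600s": (295, 23, 7),
    "1700s": (164, 64, 31),
    "1800s": (155, 155, 166),
    "1900s": (15, 112, 399),
}

for century, (a, b, c) in counts.items():
    share = (b + c) / (a + b + c)
    print(f"{century}: {share:.2f}")
# -> 0.01, 0.09, 0.37, 0.67, 0.97 -- the steady rise of do-support
```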


2. Be/have + past participle with intransitive verbs (they have / are come): 5,026 tokens

Notice the beautiful "S-shaped curve": a few examples before the 1600s, then the sharp increase from the 1600s to the 1800s, and then the leveling off since then. In addition, the parts of the S-curve (slow initial increase, sharp increase from the 1600s to the 1800s, especially the 1700s, etc.) agree precisely with what others have found using other corpora. This again adds support for the BYU-OED Corpus as a valuable corpus of historical English (and with 10-20 times the data of other corpora).

  Search string 1200s 1300s 1400s 1500s 1600s 1700s 1800s 1900s 2000s
A [be] [davies:gone-come] 4 48 99 324 493 276 375 141 2
B [have] [davies:gone-come] 0 10 26 95 155 165 1229 1586 30
  Proportion B (i.e. has come) 0.00 0.17 0.21 0.23 0.24 0.37 0.77 0.92 0.94
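The "S-shaped curve" of language change is, in mathematical terms, a logistic trajectory. As a rough illustration only -- the parameters k and t0 below are invented for the sketch, not fitted to the table -- here is what such a curve looks like:

```python
import math

# Illustrative logistic ("S-shaped") curve for the spread of an incoming
# form. k controls how sharp the transition is; t0 is the (hypothetical)
# year at which the incoming form reaches 50%. Neither value is fitted
# to the be/have data above.
def logistic(t, k=1.3, t0=1750):
    return 1 / (1 + math.exp(-k * (t - t0) / 100))

for century in range(1300, 2001, 100):
    print(century, round(logistic(century), 2))
```

The output shows the same qualitative shape as the table: near-zero early on, a steep rise through the middle centuries, and a leveling off near 1.0 at the end.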


3. [like] + to VERB / V-ing (I like to watch / watching sunsets): 344 tokens

Even a small corpus would probably have enough tokens for high-frequency phenomena like #1-2 to provide useful data. For example, a small 1-2 million word corpus (about 1/20th the size of the BYU-OED Corpus) would have about 80 tokens for do-support (compared to our 1,600+ tokens), and this might be enough. But there are many lower-frequency phenomena that cannot be studied well with a corpus of that size. For example, in the BYU-OED corpus there are 344 tokens for like + [to VERB / V-ing], and in a corpus 1/20th this size there would be only about 17 tokens -- far too few to really understand what is going on in terms of change.

  SEARCH STRING 1300s 1400s 1500s 1600s 1700s 1800s 1900s 2000s
A [*pps*] [like] [*vvg*] 0 0 0 1 0 9 27 1
B [*pps*] [like] to [*v*] 1 2 2 7 9 98 189 3
  % A (i.e. V-ing) 0.0% 0.0% 0.0% 12.5% 0.0% 8.4% 12.5% 25.0%

There is a little "blip" in the 1600s (due to just one token), and then the construction comes on strong in the 1800s. Interestingly, that one token from the 1600s is really a noun (1649: Of all pastimes and exercises I like sailing worst; Fam. Ep. Wks. 146). Notice, however, that this surface-level ambiguity with gerunds is exactly what leads to (and therefore increases the frequency of) constructions like I like talking to her.

Remember, though, that a corpus 1/20th this size might be too small to catch low-frequency occurrences like this. Changes would instead appear to "come out of nowhere", rather than being visible at the very beginning of their "construction life".

 


TIME Corpus (100 million words, 1920s-2000s)

This corpus is based on 100 million words in 275,000+ articles from TIME Magazine. Some might wonder whether data from just one source in just one genre can form an accurate corpus. Let's see.

1. end up V-ing (we ended up paying too much): 637 tokens

The chart below is about as good as you'll get for the emergence of a new construction. There is a clear and sustained increase in this construction since the 1920s, and it looks like we're now in the "fast change" part of the S-curve. You couldn't make up data better than this, and it all comes from one source in one genre.
 
SEARCH STRING 1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s 2000s  
[end] up [vvg*] 0 6 14 30 76 93 101 150 167 (tokens)
  0 0.5 0.9 1.8 4.7 6.8 8.9 15.4 26.0 (per million words)
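The per-million-words row normalizes the raw token counts by the size of each decade, since the decades of the TIME Corpus contain different numbers of words. A minimal sketch of that normalization (the decade word count below is an illustrative assumption, not a figure from the corpus):

```python
# Normalize a raw token count to a rate per million words, so decades
# of different sizes can be compared directly.
def per_million(tokens, decade_words):
    """Frequency as tokens per million words of text."""
    return tokens / (decade_words / 1_000_000)

# e.g. 150 tokens in a hypothetical 9.7-million-word decade
rate = per_million(150, 9_700_000)
print(round(rate, 1))  # -> 15.5
```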


2. will / shall (I will / shall consider the following...): 184,450 tokens

Everyone knows that will is slowly crowding out shall. The chart below shows a very nice curve with the increase in will, leaving just a bit of space for shall by the 1990s/2000s (well, some people still use it some of the time). But notice -- no backtracking -- just a clear increase (decade by decade) since the 1920s.

  SEARCH STRING 1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s 2000s
A will [v*] 14190 17714 23754 27615 25789 23453 19075 16439 10719
B shall [v*] 1295 1177 1162 857 590 375 193 109 34
  % A (i.e. will) 91.6% 93.8% 95.3% 97.0% 97.8% 98.4% 99.0% 99.3% 99.7%


3. help / help to (they helped her (to) clean the room): 17,887 tokens

The move towards bare infinitives with help (help her clean, instead of help her to clean) is supposedly one of the features that distinguishes American English from British English. The TIME data show an overall increase in the bare infinitive since the 1920s: although the percentage has meandered just a bit over time, the overall shift is clearly towards the bare infinitive.

  SEARCH STRING 1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s 2000s
A [help].[v*] [v*] 209 825 1333 1707 1769 1665 1858 1662 1336
B [help].[v*] [p*] [v*] 68 204 289 339 260 279 247 372 346
C [help].[v*] to [v*] 137 259 391 489 566 492 284 116 86
D [help].[v*] [p*] to [v*] 15 33 47 54 53 54 24 11 8
  % (A+B) (i.e. bare infinitive) 64.6% 77.9% 78.7% 79.0% 76.6% 78.1% 87.2% 94.1% 94.7%


4. start/begin + [to V / V-ing] (he began/started to sing/singing): 47,028 tokens

There has been a gradual shift in English away from the [to V] construction towards [V-ing] (she started to walk away > she started walking away). The data below show this is definitely the case.

  SEARCH STRING 1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s 2000s
A [begin] [vvg*] 199 1194 1753 2412 2326 2344 1953 1326 799
B [start] [vvg*] 99 495 957 1311 1179 879 732 922 1027
C [begin] to [v*] 1193 2789 3575 3709 3233 2618 2101 1584 792
D [start] to [v*] 131 400 364 503 322 373 445 505 484
  %A (vs C) (i.e. V-ing w/ begin) 14.3% 30.0% 32.9% 39.4% 41.8% 47.2% 48.2% 45.6% 50.2%
  %B (vs D) (i.e. V-ing w/ start) 43.0% 55.3% 72.4% 72.3% 78.5% 70.2% 62.2% 64.6% 68.0%


5. be/get V-ed (it was / got run over): 536,066 tokens

Another shift in English has been the slow increase in the "get passive" (it got run over) at the expense of the "be passive" (it was run over). This is definitely the case in the TIME Corpus, especially since the 1980s. Note that because of the high frequency of the "be passive", it was split into two different searches (B and C below).
 
  SEARCH STRING 1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s 2000s
A [get] [vvn] 224 533 1113 1279 1099 836 728 1129 1115
B are|were|been|being [vvn] 21,048 29,364 35,136 31,933 33,325 33,323 27,375 18,656 10,895
C is|was [vvn] 31,365 40,925 38,097 40,102 39,345 34,886 29,291 20,720 12,224
  % A (i.e. get) 0.4% 0.8% 1.5% 1.7% 1.5% 1.2% 1.3% 2.8% 4.6%

 


Corpus of Contemporary American English [COCA] (385+ million words, 1990-present)

This corpus is based on more than 385 million words, evenly divided by year (20 million words each year since 1990) and genre (spoken, fiction, popular magazine, newspaper, and academic; 20% in each genre each year). It doesn't have exactly the same composition as other corpora like the British National Corpus, and so some might question just how reliable it is. Let's see.

1. end up V-ing (we ended up paying too much): 7,546 tokens

The construction just continues its increase -- nearly every year is higher than the year before. By the way, the charts below show the frequency in each five-year block (1990-1994, etc.), and at the corpus website it is also possible to see the frequency year by year.

SEARCH STRING 1990-1994 1995-1999 2000-2004 2005-2009  
[end] up [vvg*] 1632 2011 2090 1813 (tokens)
  15.8 19.5 20.4 23.3 (per million words)

2. will / shall (I will / shall consider the following...): 592,165 tokens

Here too, the shift towards will continues. Probably not too much should be made of the larger increase from 1995-99: shall is now used at such a low rate that even a small difference from one time block to the next may look large, at least in relative terms.

  SEARCH STRING 1990-1994 1995-1999 2000-2004 2005-2009
A will [v*] 166,445 154,623 155,676 105,992
B shall [v*] 3,072 2,732 2,213 1,412
  % A (i.e. will) 98.2% 98.3% 98.6% 98.7%


3. help / help to (they helped her (to) clean the room): 78,763 tokens

There is a consistent, sustained increase in the bare infinitive (they helped her clean the room) in each five-year time period during the last two decades.

  SEARCH STRING 1990-1994 1995-1999 2000-2004 2005-2009
A [help].[v*] [v*] 9640 10073 11448 9023
B [help].[v*] [p*] [v*] 5355 6312 6956 5816
C [help].[v*] to [v*] 2998 2919 3074 2248
D [help].[v*] [p*] to [v*] 841 808 727 525
  % (A+B) (i.e. bare infinitive) 79.6% 81.5% 82.9% 84.3%


 

4. start/begin + [to V / V-ing] (he began/started to sing/singing): 190,737 tokens

An increase in [V-ing] in each five-year block, with both start and begin:

  SEARCH STRING 1990-1994 1995-1999 2000-2004 2005-2009
A [begin] [vvg*] 8873 8879 9385 6443
B [start] [vvg*] 11393 13291 14007 11199
C [begin] to [v*] 22820 20184 18534 12675
D [start] to [v*] 8069 8816 9023 7146
  %A (vs C) (i.e. V-ing w/ begin) 28.0% 30.6% 33.6% 33.7%
  %B (vs D) (i.e. V-ing w/ start) 58.5% 60.1% 60.8% 61.0%


 

5. be/get V-ed (it was / got run over): 1,817,092 tokens

Once again, a shift in the same direction in each five-year block, here with an increase in the "get passive" at the expense of the "be passive". (Notice that here again, we've divided the query for the "be passive" into two parts, because of the huge number of tokens in the corpus.)

  SEARCH STRING 1990-1994 1995-1999 2000-2004 2005-2009
A [get] [vvn] 13,971 15,669 15,645 12,607
B are|were|been|being [vvn] 264,372 243,881 240,575 179,178
C is|was [vvn] 231,587 222,825 216,851 159,931
  % A (i.e. get) 2.7% 3.2% 3.3% 3.6%

 


Summary

Some might quibble about the composition of the BYU-OED, TIME, or COCA corpora. But the fact that they agree so nicely with what is found in other corpora shows that the data are in fact reliable. And as we have shown, each of these corpora is much larger than other corpora for these time periods (and for the TIME Corpus from the 1900s, there really isn't anything else available, at least in terms of a "structured" corpus).

As good as these corpora are, they are only the beginning in terms of corpora of historical English. We have begun work on a 300 million word corpus of American English (1810-2010), which will be balanced (in each decade, and therefore overall as well) between fiction, popular magazines, newspapers, and non-fiction books. This corpus will allow researchers to examine the history of English with much more precision than is possible with any other corpus.

In addition, we have in raw text format tens of millions of words of British fiction and magazines from the 1930s-1970s (in addition to all of the Project Gutenberg materials up through the 1920s), as well as hundreds of millions of words of English from the 1500s-1700s, which could also be used for other historical corpora. If you are interested in working on one of these corpora (especially as part of a government-funded project), feel free to contact me.