Historical English corpora and diachronic syntax
Mark Davies
Updated 23 Jan 2009

Until recently, there were no large, annotated corpora of English that could be used to study diachronic syntax (i.e. how the grammar has changed over time). Either the corpora were extremely small (0.5-2.0 million words), or they were not annotated, or they were both small and unannotated. This compares poorly with other languages such as Spanish and Portuguese, for which large, annotated corpora (45-100 million words) are available and have been used to study diachronic syntax in some detail (see a partial listing of some of these studies, especially for the period 1994-2003). In the past three years, however, three large, annotated corpora of historical English have come online. For more details, including examples of the types of searches that can be done with each corpus, click here.
The question with any new corpus, of course, is whether it models well what is actually occurring in the language. Some might object to the materials that were used to create these three corpora. The proof of their value, however, is whether they provide data that agree with what researchers have already extracted from other sources. We believe that they do meet this requirement very nicely. In addition, they contain much, much more data than other historical corpora, which means that we can use them to research more phenomena, and with much more precision. Finally, because they are annotated and because the corpus architecture is very advanced, it is possible to extract the data much more quickly and easily than with other corpora. In the sections that follow, we will give just a few examples of the type of data on historical grammar that can be extracted from each of these three corpora. We invite you to use the corpora for other topics that may be of interest to you. First, a couple of notes on searching the corpora. Just copy the [SEARCH STRING] shown in the tables below into the WORD(S) field of the corpus interface, and choose [CHARTS]. Then click on any of the bars in the chart to see the matching strings, and click on any of those strings to see KWIC (Keyword in Context) displays.
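The KWIC display mentioned here can be sketched locally: given a stretch of text and a keyword, show each hit with a fixed window of context on either side. This is only a minimal illustration of the idea, not the corpus interface's actual implementation; the function name and window width are my own.

```python
def kwic(text, keyword, width=30):
    """Return simple Keyword-in-Context lines: each occurrence of
    `keyword`, with `width` characters of left and right context."""
    lines = []
    lower, key = text.lower(), keyword.lower()
    start = 0
    while True:
        i = lower.find(key, start)
        if i == -1:
            break
        left = text[max(0, i - width):i].rjust(width)
        right = text[i + len(key):i + len(key) + width].ljust(width)
        lines.append(f"{left}[{text[i:i + len(key)]}]{right}")
        start = i + len(key)
    return lines

sample = "We know not what we do. We do not know what we want."
for line in kwic(sample, "know"):
    print(line)
```

Real corpus interfaces do this server-side over millions of words, but the output format is the same: the keyword aligned in a column, context on both sides.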
BYU-OED (37 million words, Old English - Present-Day English)

This corpus is based on 37 million words in 2.2 million quotations from the Oxford English Dictionary. Some might wonder whether these quotations -- which illustrate the entries for specifically chosen words -- can form an accurate corpus. Let's see. (Before starting, remember that -- due to the extreme variation in spelling in historical texts -- the corpus is only tagged and lemmatized at a relatively basic level. As mentioned above, however, individual users can modify the lemmatization and tagging to suit their own purposes, via "customized lists".)

1. Do-support (measured here by negation with one verb (know): we know not / we do not (or don't) know): 1,676 tokens

Notice how nicely the data from these 1,676 tokens show the steady rise from the 1500s to the 1900s. If the 2.2 million quotations in the corpus didn't really form a good corpus, would we see such a clear, sustained shift?
Notice the beautiful "S-shaped curve": a few examples before the 1600s, then the sharp increase from the 1600s to the 1800s, and then the "wrap-up" since then. In addition, the shape of the S-curve (slow initial increase, sharp increase from the 1600s to the 1800s, especially the 1700s, etc.) agrees precisely with what others have found using other corpora. This again supports the BYU-OED Corpus as a valuable corpus of historical English (and one with 10-20 times the data of other corpora).
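The S-shaped curve of language change is conventionally modeled with a logistic function: the proportion of the incoming variant rises slowly at first, accelerates in the middle, and then levels off. A minimal sketch; the midpoint and rate values below are purely illustrative, not fitted to the OED data:

```python
import math

def logistic(t, midpoint=1700, rate=0.02):
    """Proportion of the incoming variant at year t under a logistic
    model of change: slow start, rapid middle, leveling off."""
    return 1 / (1 + math.exp(-rate * (t - midpoint)))

# Proportion of the new variant at century intervals (illustrative):
for year in (1500, 1600, 1700, 1800, 1900):
    print(year, round(logistic(year), 3))
```

With these toy parameters the proportion sits near zero in 1500, crosses 50% at the midpoint year, and approaches 100% by 1900; fitting the real midpoint and rate to corpus frequencies is what gives statements like "sharp increase, especially in the 1700s" their quantitative content.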
Even a small corpus would probably have enough tokens for high-frequency phenomena like #1-2 to provide useful data. For example, a small 1-2 million word corpus (about 1/20th the size of the BYU-OED Corpus) would have about 80 tokens for "do-support" (compared to our 1,600+ tokens), and this might be enough. But there are many lower-frequency phenomena that cannot be studied well with a corpus of that size. For example, in the BYU-OED corpus there are 344 tokens for like + [to VERB / V-ING], and in a corpus 1/20th this size there would only be about 17 tokens -- far too few to really understand what's going on in terms of change.
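The scaling argument here is simple proportionality: if the frequency of a construction per million words stays roughly constant, token counts shrink linearly with corpus size, so a corpus 1/20th the size of the BYU-OED yields about 1/20th the tokens:

```python
# Expected token counts in a corpus 1/20th the size of the BYU-OED,
# assuming frequency per million words stays roughly constant.
scale = 1 / 20

do_support = 1676        # do-support tokens in the BYU-OED corpus
like_to_v_ing = 344      # like + [to VERB / V-ING] tokens

print(round(do_support * scale))     # roughly 80: perhaps still usable
print(round(like_to_v_ing * scale))  # roughly 17: too few to chart change
```

Spread 17 tokens across eight or ten time periods and most periods get one or two hits, which is why low-frequency changes are invisible in small corpora.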
There is a little "blip" in the 1600s (due to just one token), and then the construction comes on strong in the 1800s. Interestingly, this one token from the 1600s is really a noun (1649: Of all pastimes and exercises I like sailing worst; Fam. Ep. Wks. 146). Notice, however, that this surface-level ambiguity with gerunds is exactly what leads to (and therefore helps drive the increase in) constructions like I like talking to her. Remember, though, that a corpus 1/20th this size might be too small to catch low-frequency occurrences like this one. Changes would instead appear to "come out of nowhere", rather than being visible at the "beginning" of their "construction-life".
TIME Corpus (100 million words, 1920s-2000s)

This corpus is based on 100 million words in 275,000+ articles from TIME Magazine. Some might wonder whether data from just one source in just one genre can form an accurate corpus. Let's see.

1. end up V-ing (we ended up paying too much): 637 tokens

The chart below is about as good as you'll get for the emergence of a new construction. There is a clear and sustained increase in this construction since the 1920s, and it looks like we're now in the "fast change" part of the S-curve. You couldn't make up data that are better than this, and it all comes from one source in one genre.
2. will / shall

Everyone knows that will is slowly crowding out shall. The chart below shows a very nice curve with the increase in will, leaving just a bit of space for shall by the 1990s/2000s (well, some people still use it some of the time). But notice -- no backtracking -- just a clear increase (decade by decade) since the 1920s.
3. help + [bare infinitive / to + infinitive]

The move towards bare infinitives with help (help her clean, instead of help her to clean) is supposedly one of the features that distinguish American English from British English. The TIME data show an overall increase in the bare infinitive since the 1920s. The darker line below shows that although it has meandered just a bit over time, the overall shift is towards the bare infinitive.
4. start/begin + [to V / V-ing]

There has been a gradual shift in English away from the [to V] construction towards [V-ing] (she started to walk away > she started walking away). The data below show that this is definitely the case.
5. be/get V-ed

Another shift in English has been the slow increase in the "get passive" (it got run over) at the expense of the "be passive" (it was run over). This is definitely the case in the TIME Corpus, especially since the 1980s. Note that because of the high frequency of the "be passive", it was split into two different searches (B and C below).
Corpus of Contemporary American English [COCA] (385+ million words, 1990-present)

This corpus is based on more than 385 million words, evenly divided by year (20 million words each year since 1990) and genre (spoken, fiction, popular magazines, newspapers, and academic; 20% in each genre each year). It doesn't have exactly the same composition as other corpora like the British National Corpus, and so some might question just how reliable it is. Let's see.

1. end up V-ing (we ended up paying too much): 7,546 tokens

The construction just continues its increase -- nearly every year is higher than the year before. By the way, the charts below show the frequency in each five-year block (1990-1994, etc.), and at the corpus website it is also possible to see the frequency year by year.
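The stated composition works out arithmetically: 20 million words per year split 20% across five genres gives 4 million words per genre per year, and two decades of coverage gives a total on the order of the 385+ million words reported. A quick check (the end year of 2008 is an assumption for illustration; some years evidently contain slightly more than 20 million words):

```python
words_per_year = 20_000_000
genres = ["spoken", "fiction", "magazine", "newspaper", "academic"]

# 20% of each year's words in each of the five genres:
per_genre_per_year = words_per_year // len(genres)
print(per_genre_per_year)  # 4,000,000 words per genre per year

# Assuming coverage from 1990 through 2008 (19 full years):
years = 2008 - 1990 + 1
print(years * words_per_year)  # 380 million, consistent with "385+ million"
```

The even year-by-year and genre-by-genre split is what makes the five-year-block charts directly comparable: each block rests on roughly the same amount of text.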
2. will / shall (I will / shall consider the following...): 592,165 tokens

Here also the shift towards will continues. Probably not too much should be made of the larger increase from 1995-99. Shall is used at such a low rate now that even a little difference from one time block to the next may look large, at least in relative terms.
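The point about shall can be made concrete: when the absolute rate is already small, a modest absolute change is a large relative one. The counts below are hypothetical, chosen only to illustrate the arithmetic, not taken from the corpus:

```python
# Hypothetical shall counts in two adjacent time blocks (illustrative only).
block_a, block_b = 500, 400

absolute_change = block_b - block_a
relative_change = absolute_change / block_a

print(absolute_change)           # -100 tokens: tiny against a 385M-word corpus
print(f"{relative_change:.0%}")  # -20%: looks dramatic in relative terms
```

This is why percentage charts for a dying form can swing between blocks even when the underlying change in raw frequency is negligible.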
3. help + [bare infinitive / to + infinitive] (they helped her clean the room)

There is a consistent, sustained increase in the bare infinitive in each five-year time period during the last two decades.
4. start/begin + [to V / V-ing] (he began/started to sing/singing): 190,737 tokens

An increase in [V-ing] in each five-year block, with both start and begin:
5. be/get V-ed (it was / got run over): 1,817,092 tokens

Once again, a shift in the same direction in each five-year block, here with an increase in the "get passive" at the expense of the "be passive". (Notice that here again, we've divided the query for the "be passive" into two parts, because of the huge number of tokens in the corpus.)
Summary

Some might quibble about the composition of the BYU-OED, TIME, or COCA corpora. But because they agree so nicely with what has been found in other corpora, we can be confident that the data are in fact reliable. And as we have shown, each of these corpora is much larger than other corpora for these time periods (and for the TIME Corpus in the 1900s, there really isn't anything else available, at least in terms of a "structured" corpus). As good as these corpora are, they are only the beginning in terms of corpora of historical English. We have begun work on a 300 million word corpus of American English (1810-2010), which will be balanced (in each decade, and therefore overall as well) between fiction, popular magazines, newspapers, and non-fiction books. This corpus will allow researchers to examine the history of English with much more precision than is possible with any other corpus. In addition, we have in raw text format tens of millions of words of British fiction and magazines from the 1930s-1970s (in addition to all of the Project Gutenberg materials up through the 1920s), as well as hundreds of millions of words of English from the 1500s-1700s, which could also be used for other historical corpora. If you are interested in working on one of these corpora (especially as part of a government-funded project), feel free to contact me.