Google Books (and its newest interface via Culturomics) is much larger than 400 million words. So why not just use Google to look at language change?

The main problem with Google Books (and by extension Culturomics) is that you're basically limited to looking at the frequency of exact words and phrases. Of course, COHA does simple searches like this as well. But with Google (unlike COHA) it is difficult or impossible to research the following:

  • How words are used. Sure, with Google you can see the frequency of the word women over time (yet see Note 1 below). But what does that tell you, really? What would be interesting is to see what we're saying about women. With COHA, you can search 400 million words to compare the words used with woman in the 1930s-1950s (left) vs the 1960s-1980s (right) in 2-3 seconds. How could you do that with Google -- ever??

  • Context, please. Take the example of women in Google above. What if you want to actually see the word or phrase in context? With Google, you would have to download each of the thousands or millions of books or articles -- one by one -- to see the context. (And even more problematic, many of these can't be downloaded -- they're just graphic images in Google Books or PDF files in Google News Archive, which cost $2.95 each). For hundreds of thousands of occurrences of a word, that would take months or years. With COHA, you can easily get the frequency listing in just 3-4 seconds, and then see all of the occurrences in context (in this frame, by clicking on the word or phrase in the frequency listing) in just one or two seconds more.

  • Semantics (word meaning). We all know that the word gay has changed meaning in the last 40-50 years. But how would we know that with simple Google frequency charts? We wouldn't. But with COHA, we can find the collocates (nearby words) for any word, which provides valuable evidence of changes in meaning and usage. For example, with COHA we can easily find the collocates of gay decade by decade, and we can see how they have changed since the 1960s. As above, we can also directly compare the collocates in different sets of decades (e.g. gay in the 1830s-1910s vs the 1970s-2000s) to see how its meaning has changed. Again, no such luck with simple search interfaces like Google. You would have to manually download thousands or millions of hits -- one by one -- and then use another program to look for and categorize the collocates. Not fun. Actually, not really doable at all.

  • Syntax (grammar). With Google, you can't search by lemma or part of speech, but with COHA you can. For example, in the case of end up V-ing in the Google Books example above, we would have to search individually -- one by one -- for all forms of end + up + any V-ing form of the verb (ended up paying, ends up being, end up knowing). There would be thousands of individual forms, which would take weeks or months to search for. With COHA, we can do this in less than two seconds.

  • Morphology (word formation). With Google, you can't use wildcards to search for parts of words, but with COHA you can. For example, in COHA you can search for the root -heart- (compare earlier and later) or the suffix -ism (earlier/later). With Google, you can only search for entire words -- not parts of words.

  • More semantics (synonyms). With COHA, you can search for the frequency of synonyms of a word like lovely by decade. Or you could use synonyms and customized lists (e.g. body) to run a semantically-oriented search like "briefly touch someone" (stroking her hair, rubbed his chin, patted her shoulder, etc). With Google, you're just looking at exact (strings of) words -- nothing semantically-oriented at all.

  • Lexis (words). With COHA, the corpus "knows" the frequency of words and phrases by time period, so you can look, for example, for adjectives that are much more frequent now than 100 years ago. There's no way to do anything even remotely like this with Google Books, Google News Archive, or text archives.
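To make the idea of comparing collocates across periods concrete, here is a minimal sketch in Python. It is not how COHA is implemented. The function name `collocates`, the window size, and the two toy "periods" are all invented for illustration. It simply counts the words that occur within a few tokens of a node word in two small samples and reports which neighbors are unique to each.

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count the words found within `window` tokens of each occurrence of `node`.

    Hypothetical helper for illustration only; not COHA's actual method.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

# Toy, invented samples standing in for an earlier and a later period.
early = "the gay and merry party danced in bright gay colors".split()
late = "the gay rights movement and gay community leaders".split()

early_c = collocates(early, "gay", window=2)
late_c = collocates(late, "gay", window=2)

# Neighbors that appear in only one of the two periods.
only_early = set(early_c) - set(late_c)
only_late = set(late_c) - set(early_c)

print(sorted(only_early))
print(sorted(only_late))
```

A sketch like this only works if you already have the running text for each period, which is exactly what a frequency-only interface does not give you.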

So if you want to do a simple search -- like finding the frequency of an exact word or phrase over time -- then Google is fine. And for some people this is about the extent of their historical linguistics research.

But using Google or text archives as a full-blown linguistic search engine to look at language change has real drawbacks. None of the preceding types of searches (morphological, syntactic, semantic, and lexical) are possible with Google (or any other historical corpus of English that we're aware of). But they are all possible -- quickly and easily -- with the Corpus of Historical American English.


Note 1: Frequency problems? Is the Google data really reliable? Take a look again at the frequency for women. Even with societal shifts in the 1960s-1970s, does it really make sense that we'd be using the word women four times as much now as we did fifty years ago? In COHA, there is an increase in the last fifty years, but it's much more reasonable and believable. Is the Google frequency simply a function of more books in the last fifty years, and hence more of any word??
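The question in Note 1 is essentially about raw versus normalized frequency. As a rough sketch (the numbers below are invented, and `per_million` is a hypothetical helper, not part of any Google or COHA interface), a word can show four times as many raw hits in a later period and still have exactly the same rate per million words, if the amount of text is also four times larger.

```python
def per_million(hits, total_words):
    """Normalize a raw count to occurrences per million words."""
    return hits * 1_000_000 / total_words

# Invented toy figures: four times the hits, but also four times the text.
decades = {
    "1950s": {"hits": 2_000, "total_words": 10_000_000},
    "2000s": {"hits": 8_000, "total_words": 40_000_000},
}

rates = {
    decade: per_million(d["hits"], d["total_words"])
    for decade, d in decades.items()
}

for decade, rate in rates.items():
    print(decade, decades[decade]["hits"], "raw hits,", rate, "per million")
```

Comparing the two columns shows why it matters whether a frequency chart is based on raw counts or on a rate relative to the size of each period's text.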

Note 2: Accuracy. There is one other problem that we're aware of with Google Books, and it's presumably a problem with the new interface as well, since it's just based on Google Books data. Google Books often thinks that a book is from a given decade when in fact it's not -- it's just talking about that decade. For example, take the end up V-ing construction (end up watching, ended up paying, etc), which has increased over time. COHA shows that it starts in about the 1920s/1930s, so we might search in Google Books for end up paying in an earlier period -- 1800-1920. There are six books listed, but none are actually from the period 1810-1919 -- they are just talking about that period.