The Corpus of Contemporary American English (COCA)
and Google / Web as Corpus

The Web is much larger than the Corpus of Contemporary American English (COCA), and Google is a great search engine. So why not just use Google to see what's happening in contemporary American English? Well, as good as it is for most searches, there are things that neither Google (nor any other search engine) can do (or which they do only very poorly), but which are possible with our corpus. These include the following:

  • Looking at differences between different styles or types of English. Is the "I'm like..." construction, the string ". Well ,", or the word "attitudinal" used more in informal (e.g. spoken) or formal (e.g. academic) English? Google is pretty good at knowing what domain something comes from (e.g. cbs.com or neh.org), but it can't reliably relate that to "genre" or "styles of speech".

  • Measuring changes over time. Is the word jonesing or the phrase outside the box used more or less now than in the early 1990s? Which verbs are really on the increase during the last 2-3 years? There is no way to check this with Google/GloWbE or other search engines.

  • Grammar-based searches. Is end up VERB-ing (e.g. ended up paying too much) on the increase or decrease? Is the get passive (e.g. get married) used more in spoken or academic English? Google doesn't allow you to search by part of speech or lemma (i.e. all of the forms of a word). You'd have to search for each string individually (e.g. all forms of end + up + every conceivable verb).

  • Wildcard-based searches. What are the most common word forms with the strings -friendly (e.g. kid-friendly), -backed (e.g. Soviet-backed) or hyper- (e.g. hyperspace)? Wildcards are no problem with COCA, but you can't use them with Google.

  • Semantically-based searches. How are fair, or strike, or sign used in the language? In order to find out, you need to look at collocates (nearby words), since (as corpus linguists are fond of saying) "the words that a word 'hangs out with' can tell you a lot about its meaning". But Google doesn't do collocates.

  • And more semantically-based searches. Since Google can't do collocates, it obviously can't use them to compare word meanings in different genres (e.g. chair in fiction and academic), or to see how they're changing over time (e.g. green = "environmentally friendly").

  • And even more complex semantically-based searches. Google only really knows how to search for specific words and strings. It doesn't let you search by words that are related in meaning, such as all of the synonyms of a given word, or all of the words in personalized lists you've created (related to fashion, or food, or clothing, or whatever) as part of a query. Our corpus can do both of these.

  • Finding the word when you don't know what the word is. What are the adjectives that are found mainly in medical articles, collocates of hard that are used more in fiction or newspapers, or synonyms of strong that are found mainly in fiction or academic? Google allows you to find the occurrence of a given form that you already know, but it can't produce a list of words for you that match criteria like these.

  • Searching for strings of words. Sure, on Google you can search for a phrase like "might be taken for a". Go ahead and try it. How many hits does it say there are? Our search today shows 955,000. Start paging through the hits, though, and they run out at about 450 (i.e. about 44 pages of 10 links each, and then they end). In other words, Google's "guess" is more than 2,000 times what it should be. This is because Google usually doesn't "know" the frequency of anything more than single words -- it's usually just guessing.
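
To make the idea of collocates from the list above more concrete, here is a minimal Python sketch of how nearby words can be counted around a node word. It is not how COCA works internally. The function name `collocates`, the `span` parameter, and the three sample sentences are all made up for illustration, and a real corpus query would also normalize by frequency and filter out function words.

```python
from collections import Counter
import re

def collocates(sentences, node, span=3):
    """Count words found within `span` tokens of `node`, one sentence at a time."""
    counts = Counter()
    for sentence in sentences:
        # Very rough tokenizer (assumption): lowercase runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented sentences standing in for a real corpus (illustration only).
sample = [
    "The workers voted to strike over pay.",
    "Lightning can strike the same tree twice.",
    "The union called a strike over pay and hours.",
]

top = collocates(sample, "strike", span=3)
print(top.most_common(5))
```

Even in this toy sample, "over" and "pay" each turn up twice near strike, which hints at the labor-dispute sense of the word, while "lightning" points to a different sense.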

So if you want to find web pages dealing with a certain topic, then Google is fine. But using Google as a full-blown linguistic search engine has real drawbacks. None of the preceding types of searches -- which are some of the most interesting ones that you can carry out to see what's going on with the language -- are possible with Google (or any other search engine). But they are all possible -- quickly and easily -- with the Corpus of Contemporary American English.