Google Books (and its newest interface via Culturomics) is much larger than 400 million words. So why not just use Google to look at language change?
The main problem with Google Books (and by extension Culturomics) is that you're basically limited to looking at the frequency of exact words and phrases. Of course, COHA does simple searches like this as well. But with Google (unlike COHA) it is difficult or impossible to research the following:
-
How words are used. Sure, with Google you can see the frequency of the word women over time (but see Note 1 below). But what does that really tell you? What would be more interesting is to see what we're saying about women. With COHA, you can search 400 million words and compare the words used with woman in the 1930s-1950s (left) vs the 1960s-1980s (right) in 2-3 seconds. How could you ever do that with Google?
-
Context, please. Take the example of women in Google above. What if you want to actually see the word or phrase in context? With Google, you would have to download each of the thousands or millions of books or articles -- one by one -- to see the context. (Even more problematic, many of these can't be downloaded: they're just graphic images in Google Books, or PDF files in Google News Archive that cost $2.95 each.) For hundreds of thousands of occurrences of a word, that would take months or years. With COHA, you can easily get the frequency listing in just 3-4 seconds, and then see all of the occurrences in context (in this frame, by clicking on the word or phrase in the frequency listing) in just one or two seconds more.
-
Semantics (word meaning). We all know that the word gay has changed meaning in the last 40-50 years. But how would we know that from simple Google frequency charts? We wouldn't. With COHA, we can find the collocates (nearby words) for any word, which provide valuable evidence of changes in meaning and usage. For example, with COHA we can easily find the collocates of gay decade by decade, and we can see how they have changed since the 1960s. As above, we can also directly compare the collocates in different sets of decades (e.g. gay in 1830s-1910s vs 1970s-2000s) to see how its meaning has changed. Again, no such luck with simple search interfaces like Google. You would have to manually download thousands or millions of hits -- one by one -- and then use another program to look for and categorize the collocates. Not fun. Actually, not really doable at all.
-
Syntax (grammar). With Google, you can't search by lemma or part of speech, but with COHA you can. For example, in the case of end up V-ing (the Google Books example in Note 2 below), we would have to search individually -- one by one -- for every form of end + up + the V-ing form of any verb (ended up paying, ends up being, end up knowing). There would be thousands of individual forms, which would take weeks or months to search for. With COHA, we can do this in less than two seconds.
-
Morphology (word formation). With Google, you can't use wildcards to search for parts of words, but with COHA you can. For example, in COHA you can search for the root -heart- (compare earlier and later) or the suffix -ism (earlier/later). With Google, you can only search for entire words -- not parts of words.
-
More semantics (synonyms). With COHA, you can search for the frequency of synonyms of a word like lovely by decade. Or you could use synonyms and customized lists (e.g. body) to run a semantically-oriented search like "briefly touch someone" (stroking her hair, rubbed his chin, patted her shoulder, etc). With Google, you're just looking at exact (strings of) words -- nothing semantically-oriented at all.
-
Lexis (words). With COHA, the corpus "knows" the frequency of words and phrases by time period, so you can look, for example, for adjectives that are much more frequent now than 100 years ago. There's no way to do anything even remotely like this with Google Books, Google News Archive, or text archives.
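To make the idea of comparing collocates across periods more concrete, here is a minimal Python sketch. It is not how COHA itself is implemented; the function names (`collocates`, `distinctive`), the window size, and the two toy "period" texts are all assumptions made up for illustration.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words that occur within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Words to the left and to the right of the node word.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def distinctive(counts_a, counts_b):
    """Collocates that are more frequent in A than in B, largest difference first."""
    more_in_a = [w for w in counts_a if counts_a[w] > counts_b[w]]
    return sorted(more_in_a, key=lambda w: counts_a[w] - counts_b[w], reverse=True)

# Toy texts standing in for two sets of decades (invented, not corpus data).
early = "a gay and merry party with gay bright colors".split()
late = "the gay rights movement and gay community leaders".split()

early_counts = collocates(early, "gay", span=2)
late_counts = collocates(late, "gay", span=2)

print(distinctive(early_counts, late_counts))
print(distinctive(late_counts, early_counts))
```

A real comparison would run over millions of words per period and would normally rank collocates with an association measure rather than a raw count difference, but the windowing step is the same.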
So if you want to do a simple search -- like finding the frequency of an exact word or phrase over time -- then Google is fine. And for some people this is about the extent of their historical linguistics research.
But using Google or text archives as a full-blown linguistic search engine to look at language change has real drawbacks. None of the preceding types of searches (morphological, syntactic, semantic, and lexical) are possible with Google (or any other historical corpus of English that we're aware of). But they are all possible -- quickly and easily -- with the Corpus of Historical American English.
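As a rough illustration of what the syntactic and morphological searches described above involve, the following Python sketch approximates an end up V-ing search and a wildcard search over plain text. It is an assumption-laden stand-in: COHA uses lemma and part-of-speech information, whereas this sketch only uses a regular expression (so any word ending in -ing counts as a "verb"), and the names `find_end_up_ving` and `wildcard_matches` and the sample sentence are invented for the example.

```python
import re
from fnmatch import fnmatchcase

# All forms of "end" + "up" + a word ending in -ing.
# Without part-of-speech tags this over-matches (e.g. "end up string").
END_UP_VING = re.compile(r"\b(?:end|ends|ended|ending)\s+up\s+(\w+ing)\b", re.IGNORECASE)

def find_end_up_ving(text):
    """Return every 'end up V-ing' style match found in the text."""
    return [m.group(0) for m in END_UP_VING.finditer(text)]

def wildcard_matches(tokens, pattern):
    """Return the distinct lowercased tokens matching a wildcard pattern such as '*ism'."""
    return sorted({t.lower() for t in tokens if fnmatchcase(t.lower(), pattern)})

# Invented sample sentence, not corpus data.
sample = ("He ended up paying more, she ends up being late, and they "
          "end up knowing nothing about socialism or heartache.")
tokens = re.findall(r"[A-Za-z]+", sample)

print(find_end_up_ving(sample))
print(wildcard_matches(tokens, "*ism"))
print(wildcard_matches(tokens, "*heart*"))
```

The point of the sketch is that one pattern covers every form at once, instead of one exact-phrase search per form.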
Note 1: Frequency problems? Is the Google data really reliable? Take another look at the frequency for women. Even with societal shifts in the 1960s-1970s, does it really make sense that we'd be using the word women four times as much now as we did fifty years ago? In COHA, there is an increase in the last fifty years, but it's much more reasonable and believable. Is the Google frequency simply a function of more books in the last fifty years, and hence more of any word?
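One way to test the question raised in Note 1 is to normalize raw counts by the amount of text in each period, for example as occurrences per million words. The sketch below shows the arithmetic only; the function name `per_million` and all of the numbers are invented for illustration and are not taken from Google Books or COHA.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw count to occurrences per million words of text."""
    return raw_count * 1_000_000 / corpus_size

# Invented figures: (raw hits for a word, total words of text in that period).
periods = {
    "earlier": (2_000, 10_000_000),
    "later": (8_000, 40_000_000),
}

rates = {name: per_million(hits, size) for name, (hits, size) in periods.items()}

for name, rate in rates.items():
    print(name, rate)
```

In this made-up case the raw count is four times higher in the later period, yet the per-million rate is identical, because the later period simply contains four times as much text.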
Note 2: Accuracy. One other problem that we're aware of with Google Books is presumably a problem with the new interface as well, since it's just based on Google Books data: Google Books often thinks that a book is from a given decade when in fact it's not -- it's just talking about that decade. For example, take the end up V-ing construction (end up watching, ended up paying, etc), which has increased over time. COHA shows that it starts in about the 1920s/1930s, so we might search in Google Books for end up paying in an earlier period -- 1800-1920. There are six books listed, but none are actually from that period -- they are just talking about it.