Side-by-side comparisons of corpora
(American and British English)


Until recently, if you wanted to use the BYU corpus interface to compare frequencies in two corpora (e.g. COCA and the BNC), you had to do two separate searches and then compare the data in another program, like Excel. Now, however, with just one click, you can compare the results of a search in two corpora side-by-side -- right from the corpus interface.

The examples that follow are from just two of the five English corpora -- COCA (450 million words of American English, 1990-present) and the BNC (100 million words of British English, 1980s-1993). And these are just a handful of ad-hoc examples -- certainly just the "tip of the iceberg". Feel free to send me links to other pages that you've created, which use the new side-by-side "compare corpora" functionality.


(Note: to come back to this page after each search, just click RETURN in the upper right-hand section of the corpus interface)

1. Obvious differences between words.

There are many sites that list pairs of words for the "same" concept -- one word that is used in American English, and the other in British English. We can see the frequency of these words side-by-side in the corpus interface, such as the following:

(American) pants, store (n), can (of nouns), railroad, freeway, truck, trash can, and vacuum (v)
(British) trousers, shop (n), tin (of nouns), railway, motorway, lorry, dustbin, and hoover (v).

There is nothing overly earth-shattering or surprising here, but for people who are skeptical about using any corpus besides the BNC, the fact that the corpora provide the expected results for such queries ought to be at least mildly reassuring.
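
When reading side-by-side counts like these, it helps to remember that the two corpora differ in size (about 450 million words for COCA and 100 million for the BNC), so raw token counts are not directly comparable. The following is a minimal sketch of per-million normalization. The function name `per_million` and the example counts are hypothetical and are not actual corpus results.

```python
# Corpus sizes as stated on this page (approximate).
CORPUS_SIZE = {"COCA": 450_000_000, "BNC": 100_000_000}

def per_million(raw_count, corpus):
    """Convert a raw token count into a frequency per million words."""
    return raw_count / CORPUS_SIZE[corpus] * 1_000_000

# Hypothetical raw counts, for illustration only (not real corpus results).
example_counts = {"COCA": 900, "BNC": 1500}

for corpus, count in example_counts.items():
    rate = per_million(count, corpus)
    print(f"{corpus}: {count} tokens = {rate:.1f} per million words")
```

With these made-up numbers, the word would be far more frequent in the BNC relative to corpus size, even though the raw counts are of a similar order of magnitude.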

2. Idioms

You can also see side-by-side frequencies for less obvious words and phrases, like the following idioms with head. I'd like to also have a list of idioms with head that are a lot more frequent in British English -- feel free to send them to me if you find them. And of course, there are tens of thousands of idioms that one could compare -- this is just a tiny sampling with just one word. (Note: ~ = his/her/their/our, etc)

About the same in COCA and BNC:
head over heels in love
head-on
price on ~ head
head for the hills
head and shoulders above
talk over ~ head
talk ~ head off
two heads are better (than one)
use ~ head
make ~ head spin
put ~ heads together
bury ~ head (in the sand)
from head to toe
have a head for (something)
hanging over ~ head
off the top of ~ head

More in COCA (American):
head (v) up
head (v) toward(s)
head (v) back to
head (v) out
in over ~ head
(hit the) nail on the head
head ~ off at the pass
cooler heads (+ prevail)
go head-to-head
head start
heads or tails
talking head
head game
head rush (n)
head trip
(like a) deer in the headlights

3. More powerful lexical comparisons

In the examples above, we mostly compared exact words or phrases, with some cases of phrases where a given part of speech was used in a particular slot (e.g. in over his/her/my head; [=ap*]). But you can also do searches like the following:

  • head* (COCA) heads-up, headliner, headspace, headware; (BNC) headteacher, headstock, headhunted, headmistress

  • a(n) *head (COCA) crackhead, knucklehead, bobblehead, hothead, bonehead, trailhead, wellhead

  • head + prep (COCA) toward(s), to, up, around, into

  • verb ~ head: this construction has about the same frequency overall. But there are clear differences in the particular verbs, e.g. (COCA) poke, stick, tilt, cock, lift, bob, shake, nod; (BNC) mind, feel, raise
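
To give a rough feel for what wildcard patterns like head* and *head match, here is a small sketch using Python's standard `fnmatch` module on a toy word list. It is not the corpus interface's own query engine, which also handles lemmas and part-of-speech tags. The helper `wildcard_search` and the word list are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

def wildcard_search(pattern, words):
    """Return the words matching a pattern in which * stands for any run of characters."""
    return [w for w in words if fnmatchcase(w, pattern)]

# A tiny toy word list (illustrative only, not a corpus lexicon).
words = ["headliner", "headteacher", "crackhead", "bonehead", "head", "header", "ahead"]

print(wildcard_search("head*", words))  # words beginning with "head"
print(wildcard_search("*head", words))  # words ending in "head"
```

Note that the bare word head matches both patterns, since * can also match an empty string.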

Moving away from the words and phrases with head, the following are just a handful of other lexical comparisons:

  • *ism words: (COCA) bioterrorism, volunteerism, Islamism, Pentecostalism, ecotourism, globalism; (BNC) Owenism, Toryism, Fabianism, teetotalism. Comparing such lists of words provides some interesting insight into cultural differences between the two countries as well.

  • Adjectives used to describe men: (COCA) nuts, liable, scary, smarter, tougher, relentless, focused, easygoing, low-key, astounded; (BNC) redundant, wont, spotty, chuffed, dotty, cheeky, posh

  • Phrasal verbs with up (COCA) ratchet, fess, hike, crank, listen, bust, scare, cuddle, scrounge, rack; (BNC) nip, stump, plant, top, phone, cash, tot, pluck, cock, mug, bugger, knock

4. Morphological differences. It's interesting to compare forms of words -- side-by-side -- in the two dialects. A few examples:

  • [have] gotten: obviously much more common in American English (COCA)

  • have + proved / proven: the first is more common in the BNC, the second in COCA. To do this right, you'd want to get a ratio of the two forms in Excel (for example), but here we're just doing the two forms individually.

  • [PRON] + sneaked / snuck: both are more common in COCA than in the BNC, but it is much more pronounced with snuck.
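
The ratio mentioned above for proved / proven could be computed in a few lines rather than in Excel. This is only a sketch: the function `form_ratio` is hypothetical, and the counts are placeholders rather than real results from either corpus. Because the ratio is taken within each corpus, the difference in corpus size cancels out.

```python
def form_ratio(count_a, count_b):
    """Ratio of form A to form B within one corpus; None if form B never occurs."""
    if count_b == 0:
        return None
    return count_a / count_b

# Placeholder counts for "have proved" / "have proven" (not real corpus results).
counts = {
    "COCA": {"proved": 400, "proven": 800},
    "BNC": {"proved": 300, "proven": 50},
}

for corpus, c in counts.items():
    ratio = form_ratio(c["proved"], c["proven"])
    label = "n/a" if ratio is None else f"{ratio:.2f}"
    print(f"{corpus}: proved/proven = {label}")
```

A ratio above 1 means proved is the preferred form in that corpus; a ratio below 1 means proven is preferred.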

5. Syntactic differences. The following compare -- side by side -- particular grammatical phenomena in the two dialects. These are just a random sample of quick examples; hundreds of other phenomena could be studied in more depth.

  • need NEG VERB (e.g. you needn't worry): much more common in British English (BNC).

  • must VERB (e.g. we must work more): more common in British (BNC).

  • end up V-ing (e.g. he ended up paying): more common in American (COCA).

  • "quotative" like (e.g. and he's like "that's not fair"): much more common in American (COCA).

We can also search for very "narrow" phenomena, like the following:

6. Semantic differences. We can tell a lot about the meaning of words by the "collocates" (nearby words) with which they occur. Consider the following differences in meaning:

  • napkin/nappy: the BNC has more collocates referring to children (e.g. baby, children, rash, toy, child), showing that this word has roughly the same meaning as the American diaper. In American English, though, it refers to the British serviette, and this shows up with collocates referring to food and dining, like cocktail, silverware, plates, and cups.

  • cupboard: unlike American English (COCA), in British English (BNC) it can refer to a place where you store clothes as well, hence the collocates like wardrobe, linen, clothes, and bedroom.

  • scheme: in American English, it has a much more negative sense than in British English, hence the collocates like risky, hazardous, offensive, aggressive, evil, and diabolical.

  • stick: in American English, it refers to many objects that it wouldn't in British English, like butter, margarine, needles, and gum. (In British English, it would be a knob of butter, right?) Notice newer words like memory (stick) as well, which won't occur in a 20-year-old corpus like the BNC.

  • dumb: it looks like in British English it still has the meaning it had in American English 50-100 years ago ("can't speak" (well)); whereas in American English it now usually means "stupid" (e.g. things, luck, idea, investment).

  • neat: in British English it still refers mainly (??) to something being "orderly" (e.g. finish, collar, control, button), whereas in American English it has expanded its meaning to "nice / cool" (e.g. place, trick, guy, stuff, part).

  • boost: it looks like in British English, it refers primarily to "increasing" something (e.g. finances, figures), whereas in American English it has expanded its meaning to "improvement" (e.g. mood, spirits, security).

  • flip: it's not clear how to characterize what it means in American English that it doesn't (yet) in British English, but the list of nouns is quite interesting: light, hair, phone, bird, head, channels, etc. Any ideas?

  • strip: in the US, we have lots and lots of strip malls in our cities, where we go shopping. BTW, what are these called in England?

  • web: this one is pretty obvious. In COCA, it refers to the World Wide Web (site, world, page, email, company, browser, Internet), but the Web wasn't really in existence when the BNC was released a generation ago -- back in the early 1990s. This shows the value of having up-to-date texts that reflect recent changes in the language.
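
The idea of collocates used in the comparisons above can be sketched as counting the words that fall within a few tokens of the search word. The code below is a simplified illustration, not the method the corpus interface itself uses. The function `collocates`, the `span` parameter, and the two sentences are all made up for the example.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens to either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Words to the left of the node, then words to the right of it.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Two invented sentences standing in for corpus text (not real corpus data).
american = "add a stick of butter to the pan".split()
british = "he leaned on his walking stick in the garden".split()

print(collocates(american, "stick", span=2).most_common())
print(collocates(british, "stick", span=2).most_common())
```

In a real comparison the counts would come from millions of words per corpus, and the collocate lists would usually be ranked by how much more often a word appears in one corpus than in the other.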


7. Conclusions

7.1 Size matters. Even with 450 million words (COCA) and 100 million words (BNC), in some of the cases above we still only have a handful of tokens. Imagine if we were using tiny corpora of just two million words or so for each dialect. Very few of the searches above would be possible, and we'd be reduced to looking at just highly frequent phenomena, like modals, other auxiliary verbs, and prepositions.

7.2 It really helps to have a corpus that is up-to-date. For several of these searches, it looks like British English (BNC) is different from American English (COCA), but this may just be due in some cases to the fact that COCA is so much more up-to-date (the BNC ends nearly a generation ago, in the early 1990s, whereas COCA goes up through mid-2012). For the best comparisons, it would really help to have an up-to-date, balanced corpus of British English. Any takers?

7.3 It is now possible to compare corpora side-by-side with just one click of the mouse. It doesn't make much sense to limit oneself to just one corpus (like the BNC), and completely ignore all other corpora (like the much larger and much more recent COCA), especially when these "side-by-side" comparisons of multiple corpora are so simple.