DIALECTAL (GloWbE)
The Corpus of Global Web-Based
English (GloWbE) [1.9 billion words] can provide data on differences between
dialects of English, in ways that are not possible with any
other corpus. The following are just a few samples of how the corpus
can be used to compare the 20 different countries in the corpus.
For an academic overview of the corpus, please see:
Davies, Mark, and Robert Fuchs. 2015.
“Expanding Horizons in the Study of World Englishes with the 1.9 Billion
Word Global Web-Based English Corpus (GloWbE).”
English World-Wide 36: 1-28. (Note:
several other articles in this issue are dedicated to GloWbE as
well.)
Note:
click on any link
on this page to see the corpus data, and then
click on "RETURN" in the upper right-hand corner of the corpus to come back to
this page.
Lexical (vocabulary):
You can search for any word or phrase, and see its frequency in all
20 dialects. For example, all of the following are more common in [British]
than in American English:
fortnight,
trousers,
rained off,
on holiday,
at university,
[be] different to,
rather more ADJ. More examples: [Irish]
jackeen*,
banjax*,
culchie*,
childer,
soft day,
[act] the maggot*; [Australia]
bikkies,
thongs,
rockmelon*; [Malaysia]
(+Singapore)
rakyat,
makan,
hand phone,
[take] ADJ food,
lah!; [Jamaica]
ackee,
bammy,
guinep,
callaloo. You can also see comparisons across groups of
countries, e.g. [South Asia]
out of station,
eve teas*,
be elder to,
keep in view; [Non-"core" countries]:
equipments,
thrice,
godown,
same to the,
[discuss] about,
[cope] up.
In all of the preceding searches, you
input a specific word or phrase, and then see the frequency in each
country. But because the corpus has already stored the frequency of
each word and phrase in each country, you can also do more complicated searches, in which you
have GloWbE show you what words or phrases occur in a given country
(or set of countries), but not in another. For example, you could
compare all *ism words in the six "core" countries
(left) and the four countries in South Asia (right). Or you could
find all
*ies nouns that are more common in Australia (left)
than in other countries (e.g. cockies, pollies, furphies).
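Because the country sections differ greatly in size (the United States section has roughly 387 million words, the Jamaica section about 40 million), raw counts are normally converted to a rate per million words before they are compared. The following is a minimal sketch of that arithmetic, not the corpus interface itself. The section sizes are taken from the composition table on this page; the raw counts and the function names are invented for illustration.

```python
# Section sizes (total words), copied from the composition table on this page.
SECTION_WORDS = {
    "US": 386_809_355,
    "GB": 387_615_074,
    "AU": 148_208_169,
    "JM": 39_663_666,
}

# Hypothetical raw counts for a single word; these are NOT real GloWbE figures.
raw_counts = {"US": 500, "GB": 9_000, "AU": 4_000, "JM": 150}


def per_million(count, section_words):
    """Convert a raw count into a frequency per million words."""
    return count / section_words * 1_000_000


def frequency_ratio(counts, a, b):
    """How many times more frequent the item is in section a than in section b."""
    rate_a = per_million(counts[a], SECTION_WORDS[a])
    rate_b = per_million(counts[b], SECTION_WORDS[b])
    return rate_a / rate_b


for code, count in raw_counts.items():
    rate = per_million(count, SECTION_WORDS[code])
    print(f"{code}: {rate:.2f} per million")
```

With these made-up counts, the 150 tokens in the Jamaican section correspond to a higher rate than the 500 tokens in the much larger US section, which is why the comparison is made on rates rather than raw counts.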
Idioms (and phrases):
The following are a few idioms related to "head" that are
more common in American (and Canadian) English:
in over ~ head,
head start,
heads or tails,
talking [head],
(like) a deer in the headlights,
cooler heads (will prevail). On the other hand, the following
are spread more evenly across the dialects:
price on ~ head,
head over heels (in love),
head and shoulders above,
two heads are better (than one),
[use] ~ head,
[make] ~ head spin,
[put] ~ head* together,
from head to toe,
hanging over ~ head,
off the top of ~ head. Note, by the way, how sensitive idiom counts
are to corpus size. In a "small" corpus like the BNC, which is 1/20th the
size of GloWbE, there might only be 1/20th as many tokens (so
perhaps just 5 or 6 total), and in a tiny
one million
word corpus, there probably wouldn't be any tokens at all.
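The scaling behind this point can be written out as a short sketch. It assumes, as a simplification, that an idiom occurs at the same rate in every corpus, so that the expected number of tokens is proportional to corpus size. The corpus sizes follow the figures given on this page; the idiom count of 110 is a hypothetical value chosen only to illustrate the arithmetic.

```python
GLOWBE_WORDS = 1_900_000_000    # "1.9 billion words", as stated on this page
BNC_WORDS = GLOWBE_WORDS / 20   # "1/20th the size of GloWbE", as stated on this page
SMALL_CORPUS_WORDS = 1_000_000  # the one-million-word corpus mentioned above


def expected_tokens(tokens_in_glowbe, corpus_words):
    """Scale a GloWbE token count to a corpus of a different size,
    assuming the same underlying rate of occurrence."""
    return tokens_in_glowbe * corpus_words / GLOWBE_WORDS


idiom_tokens = 110  # hypothetical count in GloWbE, for illustration only

print(expected_tokens(idiom_tokens, BNC_WORDS))           # about 5.5 tokens
print(expected_tokens(idiom_tokens, SMALL_CORPUS_WORDS))  # well below 1 token
```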
Again, because you can easily compare
anything in different countries or regions, you could for example
compare
V-ed me out (e.g. stressed, freaked, creeped me out)
in the six "core" countries (left) and the countries in South Asia
(right). Or you could see, for example,
what prepositions are
used with a given adjective (like integrated) in
different countries (notice the "non-standard" ones in India: in
and to, instead of into).
Morphology (word forms):
Just a few examples show that
[be] spoilt (vs spoiled) and
[have] learnt (vs. learned) are less common in the US
and Canada than in other varieties, whereas American and Canadian
English prefer
dove (vs dived) more than other "core" dialects.
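One simple way to express such a preference is the share of one variant among all tokens of the pair. The sketch below shows that calculation; it is one possible measure rather than the method used by the corpus interface, and the counts are hypothetical placeholders, not GloWbE data.

```python
def variant_share(variant_count, alternative_count):
    """Share of one variant among all tokens of the pair (None if there are no tokens)."""
    total = variant_count + alternative_count
    if total == 0:
        return None
    return variant_count / total


# Hypothetical counts per country, for illustration only.
counts = {
    "US": {"learnt": 40, "learned": 960},
    "GB": {"learnt": 450, "learned": 550},
}

for country, forms in counts.items():
    share = variant_share(forms["learnt"], forms["learned"])
    print(f"{country}: learnt = {share:.0%} of learnt/learned tokens")
```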
Syntax (grammar): You
can enter any grammatical construction and then see its frequency
across each of the 20 countries. For example, you could look for
V likely V
(e.g. would likely remember), the
subjunctive (e.g. if I were king),
verb agreement (e.g. none of them
are),
try and verb (e.g. you should try and do it), or the
"like" construction
(and he's like ,...). You can
also look for constructions like the "go
+ ADJ" construction (e.g. go crazy, go bankrupt), the "way"
construction (e.g. he pushed his way through the crowd)
or the
Verb someone into V-ing construction (e.g. he talked her
into coming) and see the different verbs or adjectives by
country.
Because of its size, GloWbE can compare
low frequency constructions in different dialects. For example,
compared to the UK, Ireland, Australia, and New Zealand,
[stop] someone V-ing and
[prevent] someone V-ing (they stopped / prevented him going)
are quite infrequent in American and Canadian English (which would
need "from" as well:
stop,
prevent). We can also examine "discourse markers", such as
"that
said ,", which is most common in the US (and then in descending
order through the other "core" dialects).
Semantics (meaning):
You can use collocates (nearby words) to compare the meaning of a
word in two dialects. For example, the collocates of
scheme in the US (left; e.g. evil, fraudulent, nefarious) are much more
negative than those in the UK (right). In the
UK (right),
cupboards are not limited just to kitchens (as in the
US; left), and so you get collocates like wardrobe and
clothes. And finally, it looks like in British English (right)
boost (verb) refers primarily to "increasing" something
(e.g. finances, figures), whereas in American English
(left) it has expanded its meaning to "improvement" (e.g. mood,
spirits, security).
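The idea of collocates as "nearby words" can be illustrated with a small window-based count. This is only a sketch of the general technique on toy data: the two example sentences, the function name, and the window size are all invented here and say nothing about how the corpus itself computes collocates.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count the words that occur within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        # Words to the left and to the right of the node word, excluding the node itself.
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts


# Toy "sub-corpora" (invented sentences, not corpus data).
us_tokens = "an evil scheme and a fraudulent scheme".split()
uk_tokens = "a pension scheme and a training scheme".split()

print(collocates(us_tokens, "scheme", window=1))
print(collocates(uk_tokens, "scheme", window=1))
```

Comparing the two resulting counters side by side is the toy equivalent of the left/right collocate displays described above.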
Discourse (cultural):
Finally, one of the most interesting uses of the corpus is the
ability to compare frequency or collocates across countries. For
example, it is probably no surprise in which countries the words
Quran or
Allah are most common (Pakistan and other Muslim
countries), or
Buddh* (Sri Lanka), or
feminism (six "core" countries). Using collocates (nearby
words), we can also compare "what is being said" about specific
concepts in different countries or regions. For example,
ADJ book in the Asian countries (left) refers much more
to religious texts (divine, revealed, Buddhist) than in the
six "core" (more secular) countries (right).
ADJ belief in South Asia (left) contains Hindu, corrupt,
wrong, Islamic, heretical, etc., compared to silly,
contradictory, liberal, and Catholic in the six "core"
(more secular) countries (right). Finally, the
adjectives with wife in the "non-core"
countries (left) contain chaste, temporary, obedient, Muslim,
virtuous, etc., much more than in the (more secular) "core"
countries (right).
COMPOSITION OF THE CORPUS (# web sites (distinct domains), web pages, and words)
General = general web pages (may also include blogs); Blogs = (only) blogs.
Country | Code | General Sites | General Pages | General Words | Blogs Sites | Blogs Pages | Blogs Words | Total Sites | Total Pages | Total Words
United States | US | 43,249 | 168,771 | 253,536,242 | 48,116 | 106,385 | 133,061,093 | 82,260 | 275,156 | 386,809,355
Canada | CA | 22,178 | 81,644 | 90,846,732 | 16,745 | 54,048 | 43,814,827 | 33,776 | 135,692 | 134,765,381
Great Britain | GB | 39,254 | 232,428 | 255,672,390 | 35,229 | 149,413 | 131,671,002 | 64,351 | 381,841 | 387,615,074
Ireland | IE | 12,978 | 75,432 | 80,530,794 | 5,512 | 26,715 | 20,410,027 | 15,840 | 102,147 | 101,029,231
Australia | AU | 19,619 | 81,683 | 104,716,366 | 13,516 | 47,561 | 43,390,501 | 28,881 | 129,244 | 148,208,169
New Zealand | NZ | 11,202 | 54,862 | 58,698,828 | 4,970 | 27,817 | 22,625,584 | 14,053 | 82,679 | 81,390,476
India | IN | 11,217 | 76,609 | 68,032,551 | 9,289 | 37,156 | 28,310,511 | 18,618 | 113,765 | 96,430,888
Sri Lanka | LK | 3,307 | 25,310 | 33,793,772 | 1,672 | 13,079 | 12,760,726 | 4,208 | 38,389 | 46,583,115
Pakistan | PK | 3,070 | 25,852 | 38,005,985 | 2,899 | 16,917 | 13,332,245 | 4,955 | 42,769 | 51,367,152
Bangladesh | BD | 4,415 | 30,813 | 28,700,158 | 2,332 | 14,246 | 10,922,869 | 5,712 | 45,059 | 39,658,255
Singapore | SG | 5,775 | 28,332 | 29,229,186 | 4,255 | 17,127 | 13,711,412 | 8,339 | 45,459 | 42,974,705
Malaysia | MY | 6,225 | 29,302 | 29,026,896 | 4,591 | 16,299 | 13,357,745 | 8,966 | 45,601 | 42,420,168
Philippines | PH | 6,169 | 28,391 | 29,758,446 | 5,979 | 17,951 | 13,457,087 | 10,224 | 46,342 | 43,250,093
Hong Kong | HK | 6,720 | 27,896 | 27,906,879 | 2,892 | 16,040 | 12,508,796 | 8,740 | 43,936 | 40,450,291
South Africa | ZA | 7,318 | 28,271 | 31,683,286 | 4,566 | 16,993 | 13,645,623 | 10,308 | 45,264 | 45,364,498
Nigeria | NG | 3,448 | 23,329 | 30,622,738 | 2,072 | 13,956 | 11,996,583 | 4,516 | 37,285 | 42,646,098
Ghana | GH | 3,161 | 32,189 | 27,644,721 | 1,053 | 15,162 | 11,088,160 | 3,616 | 47,351 | 38,768,231
Kenya | KE | 4,222 | 31,166 | 28,552,920 | 2,073 | 14,796 | 12,480,777 | 5,193 | 45,962 | 41,069,085
Tanzania | TZ | 3,829 | 27,533 | 24,883,840 | 1,414 | 13,823 | 10,253,840 | 4,575 | 41,356 | 35,169,042
Jamaica | JM | 3,049 | 30,928 | 28,505,416 | 1,049 | 15,820 | 11,124,273 | 3,488 | 46,748 | 39,663,666
TOTAL | | 220,405 | 1,140,741 | 1,300,348,146 | 170,224 | 651,304 | 583,923,681 | 340,619 | 1,792,045 | 1,885,632,973
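The word totals above can be used to see how much of the corpus a group of countries accounts for. The sketch below does this for the six "core" countries; the assumption that this group consists of the US, Canada, Great Britain, Ireland, Australia, and New Zealand is made here for illustration, and the variable names are invented. The word counts are copied from the "Total" columns of the table.

```python
# Total words per country, copied from the "Total" columns of the table above.
# The membership of the "core" group is an assumption made for this sketch.
CORE_TOTAL_WORDS = {
    "US": 386_809_355,
    "CA": 134_765_381,
    "GB": 387_615_074,
    "IE": 101_029_231,
    "AU": 148_208_169,
    "NZ": 81_390_476,
}

CORPUS_TOTAL_WORDS = 1_885_632_973  # TOTAL row of the table

core_words = sum(CORE_TOTAL_WORDS.values())
core_share = core_words / CORPUS_TOTAL_WORDS

print(f"{core_words:,} words in the six sections")
print(f"{core_share:.1%} of the whole corpus")
```

Under that assumption, the six sections make up roughly two thirds of the corpus, which is worth keeping in mind when comparing "core" and "non-core" groups by raw counts.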