corpus.byu.edu


DIALECTAL (GloWbE)

The Corpus of Global Web-Based English (GloWbE) [1.9 billion words] can provide data on differences between dialects of English, in ways that are not possible with any other corpus. The following are just a few samples of how the corpus can be used to compare the 20 different countries in the corpus.

For an academic overview of the corpus, please see:

Davies, Mark, and Robert Fuchs. 2015. “Expanding Horizons in the Study of World Englishes with the 1.9 Billion Word Global Web-Based English Corpus (GloWbE).” English World-Wide 36: 1-28. (Note: several other articles in this issue are dedicated to GloWbE as well.)

Note: click on any link on this page to see the corpus data, and then click on "RETURN" in the upper right-hand corner of the corpus to come back to this page.

Lexical (vocabulary): You can search for any word or phrase, and see its frequency in all 20 dialects. For example, all of the following are more common in [British] than American English: fortnight, trousers, rained off, on holiday, at university, [be] different to, rather more ADJ. More examples: [Irish] jackeen*, banjax*, culchie*, childer, soft day, [act] the maggot*; [Australia] bikkies, thongs, rockmelon*; [Malaysia] (+Singapore) rakyat, makan, hand phone, [take] ADJ food, lah!; [Jamaica] ackee, bammy, guinep, callaloo. You can also see comparisons across groups of countries, e.g. [South Asia] out of station, eve teas*, be elder to, keep in view; [Non-"core" countries]: equipments, thrice, godown, same to the, [discuss] about, [cope] up.

In all of the preceding searches, you input a specific word or phrase, and then see the frequency in each country. But because the corpus has already stored the frequency of each word and phrase in each country, you can also do more complicated searches, in which you have GloWbE show you what words or phrases occur in a given country (or set of countries), but not in another. For example, you could compare all *ism words in the six "core" countries (left) and the four countries in South Asia (right). Or you could find all *ies nouns that are more common in Australia (left) than in other countries (e.g. cockies, pollies, furphies).
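Because the countries differ greatly in size, a comparison like this only makes sense on normalized frequencies. The following is a minimal sketch of the idea, not a description of how GloWbE itself computes its results. The word totals come from the composition table on this page; the raw counts, the function names, and the smoothing constant are hypothetical and only illustrate the arithmetic.

```python
# Total words per country, taken from the composition table on this page.
CORPUS_WORDS = {
    "US": 386_809_355,
    "GB": 387_615_074,
    "IN": 96_430_888,
    "PK": 51_367_152,
}


def per_million(raw_count, corpus_words):
    """Convert a raw token count to a frequency per million words."""
    return raw_count / corpus_words * 1_000_000


def group_rate(counts, countries):
    """Pooled per-million frequency of one word across a group of countries."""
    tokens = sum(counts.get(c, 0) for c in countries)
    words = sum(CORPUS_WORDS[c] for c in countries)
    return per_million(tokens, words)


def compare_groups(word_counts, left, right, smoothing=0.01):
    """Rank words by the ratio of pooled per-million rates (left / right).

    `smoothing` is an assumed small constant that avoids division by zero
    when a word does not occur at all in one group.
    """
    rows = []
    for word, counts in word_counts.items():
        left_rate = group_rate(counts, left)
        right_rate = group_rate(counts, right)
        ratio = (left_rate + smoothing) / (right_rate + smoothing)
        rows.append((word, left_rate, right_rate, ratio))
    # Words most characteristic of the left group come first.
    return sorted(rows, key=lambda row: row[3], reverse=True)


# HYPOTHETICAL raw counts, invented for illustration only (not GloWbE data).
example_counts = {
    "word_a": {"US": 40, "GB": 37, "IN": 2, "PK": 1},
    "word_b": {"US": 5, "GB": 4, "IN": 60, "PK": 30},
}

ranking = compare_groups(example_counts, left=["US", "GB"], right=["IN", "PK"])
for word, left_rate, right_rate, ratio in ranking:
    print(f"{word}: left={left_rate:.3f} right={right_rate:.3f} ratio={ratio:.2f}")
```

A ratio well above 1 marks a word as characteristic of the left group, and a ratio well below 1 marks it as characteristic of the right group.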

Idioms (and phrases): The following are a few idioms related to "head" that are more common in American (and Canadian) English: in over ~ head, head start, heads or tails, talking [head], (like) a deer in the headlights, cooler heads (will prevail). On the other hand, the following are spread more evenly across the dialects: price on ~ head, head over heels (in love), head and shoulders above, two heads are better (than one), [use] ~ head, [make] ~ head spin, [put] ~ head* together, from head to toe, hanging over ~ head, off the top of ~ head. Note, by the way, how sensitive idioms are to size. In a "small" corpus like the BNC, which is 1/20th the size of GloWbE, there might only be 1/20th as many tokens (so perhaps just 5 or 6 total), and in a tiny one million word corpus, there probably wouldn't be any tokens at all.
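The point about corpus size is simple proportional scaling, which can be sketched as follows. The GloWbE total comes from the composition table on this page; the figure of roughly 100 million words for the BNC is its commonly cited size; the idiom count of 110 is hypothetical and chosen only to illustrate the arithmetic.

```python
GLOWBE_WORDS = 1_885_632_973  # total from the composition table on this page


def expected_tokens(tokens_in_glowbe, other_corpus_words):
    """Scale a GloWbE count to a corpus of another size.

    Assumes the idiom occurs at the same rate in both corpora, which is
    a simplification.
    """
    return tokens_in_glowbe * other_corpus_words / GLOWBE_WORDS


# HYPOTHETICAL idiom count, chosen only to illustrate the scaling.
idiom_tokens = 110

bnc_estimate = expected_tokens(idiom_tokens, 100_000_000)  # BNC-sized corpus
tiny_estimate = expected_tokens(idiom_tokens, 1_000_000)   # one million words

print(f"100 million words: about {bnc_estimate:.1f} tokens")
print(f"1 million words: about {tiny_estimate:.2f} tokens")
```

Under these assumptions an idiom with about a hundred tokens in GloWbE would yield only five or six tokens in a BNC-sized corpus, and on average less than one in a one million word corpus.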

Again, because you can easily compare anything in different countries or regions, you could for example compare V-ed me out (e.g. stressed, freaked, creeped me out) in the six "core" countries (left) and the countries in South Asia (right). Or you could see, for example, what prepositions are used with a given adjective (like integrated) in different countries (notice the "non-standard" ones in India: in and to, instead of into).

Morphology (word forms): Just a few examples show that [be] spoilt (vs. spoiled) and [have] learnt (vs. learned) are less common in the US and Canada than in other varieties, whereas American and Canadian English prefer dove (vs. dived) more than other "core" dialects.

Syntax (grammar): You can enter any grammatical construction and then see its frequency across each of the 20 countries. For example, you could look for V likely V (e.g. would likely remember), the subjunctive (e.g. if I were king), verb agreement (e.g. none of them are), try and verb (e.g. you should try and do it), or the "like" construction (and he's like ,...). You can also look for constructions like the "go + ADJ" construction (e.g. go crazy, go bankrupt), the "way" construction (e.g. he pushed his way through the crowd) or the Verb someone into V-ing construction (e.g. he talked her into coming) and see the different verbs or adjectives by country.

Because of its size, GloWbE can compare low frequency constructions in different dialects. For example, compared to the UK, Ireland, Australia, and New Zealand, [stop] someone V-ing and [prevent] someone V-ing (they stopped / prevented him going) are quite infrequent in American and Canadian English (which would need from as well: stop, prevent). We can also examine "discourse markers", such as "that said ,", which is most common in the US (and then in descending order through the other "core" dialects).

Semantics (meaning): You can use collocates (nearby words) to compare the meaning of a word in two dialects. For example, the collocates of scheme in the US (left; e.g. evil, fraudulent, nefarious) are much more negative than those in the UK (right). In the UK (right), cupboards are not limited just to kitchens (as in the US; left), and so you get collocates like wardrobe and clothes. And finally, it looks like in British English (right) boost (verb) refers primarily to "increasing" something (e.g. finances, figures), whereas in American English (left) it has expanded its meaning to "improving" something (e.g. mood, spirits, security).
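To make the idea of collocates concrete, here is a minimal sketch that counts the words occurring within a fixed window around a node word and then compares two texts. It is only an illustration of the general technique, not GloWbE's own implementation; the function name, the window size, and the toy sentences are all hypothetical.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# HYPOTHETICAL toy sentences, invented for illustration (not corpus data).
us_text = "the fraudulent scheme collapsed and the evil scheme failed".split()
gb_text = "the pension scheme helps and the housing scheme grows".split()

us = collocates(us_text, "scheme", window=1)
gb = collocates(gb_text, "scheme", window=1)

# Collocates that appear on one side only.
only_us = sorted(set(us) - set(gb))
only_gb = sorted(set(gb) - set(us))
print("US only:", only_us)
print("GB only:", only_gb)
```

With real data you would compare normalized frequencies rather than simple set differences, since the two sub-corpora are rarely the same size.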

Discourse (cultural): Finally, one of the most interesting uses of the corpus is the ability to compare frequency or collocates across countries. For example, it is probably no surprise in which countries the words Quran or Allah are most common (Pakistan and other Muslim countries), or Buddh* (Sri Lanka), or feminism (six "core" countries). Using collocates (nearby words), we can also compare "what is being said" about specific concepts in different countries or regions. For example, ADJ book in the Asian countries (left) refers much more to religious texts (divine, revealed, Buddhist) than in the six "core" (more secular) countries (right). ADJ belief in South Asia (left) contains Hindu, corrupt, wrong, Islamic, heretical, etc. compared to silly, contradictory, liberal, and Catholic in the six "core" (more secular) countries (right). Finally, the adjectives with wife in the "non-core" countries (left) contain chaste, temporary, obedient, Muslim, virtuous, etc. much more than in the (more secular) "core" countries (right).

COMPOSITION OF THE CORPUS (# web sites (distinct domains), web pages, and words)
 

General = general web pages (may also include blogs); Blogs = blogs only.

| Country | Code | General sites | General pages | General words | Blogs sites | Blogs pages | Blogs words | Total sites | Total pages | Total words |
|---|---|---|---|---|---|---|---|---|---|---|
| United States | US | 43,249 | 168,771 | 253,536,242 | 48,116 | 106,385 | 133,061,093 | 82,260 | 275,156 | 386,809,355 |
| Canada | CA | 22,178 | 81,644 | 90,846,732 | 16,745 | 54,048 | 43,814,827 | 33,776 | 135,692 | 134,765,381 |
| Great Britain | GB | 39,254 | 232,428 | 255,672,390 | 35,229 | 149,413 | 131,671,002 | 64,351 | 381,841 | 387,615,074 |
| Ireland | IE | 12,978 | 75,432 | 80,530,794 | 5,512 | 26,715 | 20,410,027 | 15,840 | 102,147 | 101,029,231 |
| Australia | AU | 19,619 | 81,683 | 104,716,366 | 13,516 | 47,561 | 43,390,501 | 28,881 | 129,244 | 148,208,169 |
| New Zealand | NZ | 11,202 | 54,862 | 58,698,828 | 4,970 | 27,817 | 22,625,584 | 14,053 | 82,679 | 81,390,476 |
| India | IN | 11,217 | 76,609 | 68,032,551 | 9,289 | 37,156 | 28,310,511 | 18,618 | 113,765 | 96,430,888 |
| Sri Lanka | LK | 3,307 | 25,310 | 33,793,772 | 1,672 | 13,079 | 12,760,726 | 4,208 | 38,389 | 46,583,115 |
| Pakistan | PK | 3,070 | 25,852 | 38,005,985 | 2,899 | 16,917 | 13,332,245 | 4,955 | 42,769 | 51,367,152 |
| Bangladesh | BD | 4,415 | 30,813 | 28,700,158 | 2,332 | 14,246 | 10,922,869 | 5,712 | 45,059 | 39,658,255 |
| Singapore | SG | 5,775 | 28,332 | 29,229,186 | 4,255 | 17,127 | 13,711,412 | 8,339 | 45,459 | 42,974,705 |
| Malaysia | MY | 6,225 | 29,302 | 29,026,896 | 4,591 | 16,299 | 13,357,745 | 8,966 | 45,601 | 42,420,168 |
| Philippines | PH | 6,169 | 28,391 | 29,758,446 | 5,979 | 17,951 | 13,457,087 | 10,224 | 46,342 | 43,250,093 |
| Hong Kong | HK | 6,720 | 27,896 | 27,906,879 | 2,892 | 16,040 | 12,508,796 | 8,740 | 43,936 | 40,450,291 |
| South Africa | ZA | 7,318 | 28,271 | 31,683,286 | 4,566 | 16,993 | 13,645,623 | 10,308 | 45,264 | 45,364,498 |
| Nigeria | NG | 3,448 | 23,329 | 30,622,738 | 2,072 | 13,956 | 11,996,583 | 4,516 | 37,285 | 42,646,098 |
| Ghana | GH | 3,161 | 32,189 | 27,644,721 | 1,053 | 15,162 | 11,088,160 | 3,616 | 47,351 | 38,768,231 |
| Kenya | KE | 4,222 | 31,166 | 28,552,920 | 2,073 | 14,796 | 12,480,777 | 5,193 | 45,962 | 41,069,085 |
| Tanzania | TZ | 3,829 | 27,533 | 24,883,840 | 1,414 | 13,823 | 10,253,840 | 4,575 | 41,356 | 35,169,042 |
| Jamaica | JM | 3,049 | 30,928 | 28,505,416 | 1,049 | 15,820 | 11,124,273 | 3,488 | 46,748 | 39,663,666 |
| TOTAL | | 220,405 | 1,140,741 | 1,300,348,146 | 170,224 | 651,304 | 583,923,681 | 340,619 | 1,792,045 | 1,885,632,973 |
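As a quick check on the table, and to show how unevenly the words are spread across regions, the following sketch sums the total word counts by country group. The figures are copied from the Total words column of the table. The group membership (six "core" countries, four South Asian countries) is an assumption inferred from the order of the table rows, and the function names are hypothetical.

```python
# Total words per country, copied from the Total words column of the table.
TOTAL_WORDS = {
    "US": 386_809_355, "CA": 134_765_381, "GB": 387_615_074,
    "IE": 101_029_231, "AU": 148_208_169, "NZ": 81_390_476,
    "IN": 96_430_888, "LK": 46_583_115, "PK": 51_367_152,
    "BD": 39_658_255, "SG": 42_974_705, "MY": 42_420_168,
    "PH": 43_250_093, "HK": 40_450_291, "ZA": 45_364_498,
    "NG": 42_646_098, "GH": 38_768_231, "KE": 41_069_085,
    "TZ": 35_169_042, "JM": 39_663_666,
}
REPORTED_TOTAL = 1_885_632_973  # TOTAL row of the table

# ASSUMPTION: group membership inferred from the row order of the table.
CORE = ["US", "CA", "GB", "IE", "AU", "NZ"]
SOUTH_ASIA = ["IN", "LK", "PK", "BD"]


def group_words(codes):
    """Sum the total word counts for a list of country codes."""
    return sum(TOTAL_WORDS[code] for code in codes)


def share(codes):
    """Fraction of the whole corpus contributed by a group of countries."""
    return group_words(codes) / REPORTED_TOTAL


print("country totals match TOTAL row:", sum(TOTAL_WORDS.values()) == REPORTED_TOTAL)
print(f"core countries: {share(CORE):.1%} of the corpus")
print(f"South Asia: {share(SOUTH_ASIA):.1%} of the corpus")
```

Since the "core" group is several times larger than the South Asian group, raw counts for the two groups are not directly comparable and should be normalized (for example, per million words) before being compared.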