English-Corpora.org

DIALECTAL (GloWbE)

The Corpus of Global Web-Based English (GloWbE) [1.9 billion words] can provide data on differences between dialects of English, in ways that are not possible with any other corpus. The following are just a few samples of how the corpus can be used to compare the 20 different countries in the corpus.

For an academic overview of the corpus, please see:

Davies, Mark, and Robert Fuchs. 2015. “Expanding Horizons in the Study of World Englishes with the 1.9 Billion Word Global Web-Based English Corpus (GloWbE).” English World-Wide 36: 1-28. (Note: several other articles in this issue are dedicated to GloWbE as well.)

Note: click on any link on this page to see the corpus data, and then click on "RETURN" in the upper right-hand corner of the corpus to come back to this page.

Lexical (vocabulary): You can search for any word or phrase, and see its frequency in all 20 dialects. For example, all of the following are more common in [British] than in American English: fortnight, trousers, rained off, on holiday, at university, [be] different to, rather more ADJ. More examples: [Irish] jackeen*, banjax*, culchie*, childer, soft day, [act] the maggot*; [Australia] bikkies, thongs, rockmelon*; [Malaysia] (+Singapore) rakyat, makan, hand phone, [take] ADJ food, lah!; [Jamaica] ackee, bammy, guinep, callaloo. You can also see comparisons across groups of countries, e.g. [South Asia] out of station, eve teas*, be elder to, keep in view; [Non-"core" countries]: equipments, thrice, godown, same to the, [discuss] about, [cope] up.

In all of the preceding searches, you input a specific word or phrase, and then see the frequency in each country. But because the corpus has already stored the frequency of each word and phrase in each country, you can also do more complicated searches, in which you have GloWbE show you what words or phrases occur in a given country (or set of countries), but not in another. For example, you could compare all *ism words in the six "core" countries (left) and the four countries in South Asia (right). Or you could find all *ies nouns that are more common in Australia (left) than in other countries (e.g. cockies, pollies, furphies).
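The kind of group comparison described above can be approximated offline as a ratio of normalized frequencies. The following is a minimal sketch, not the method GloWbE itself uses: the raw counts are invented for illustration, the function names (per_million, group_ratio) and the smoothing value are hypothetical, and only the per-country word totals come from the composition table on this page.

```python
# Word totals per country, copied from the "Total / Words" column
# of the composition table on this page.
TOTAL_WORDS = {
    "US": 386_809_355, "CA": 134_765_381, "GB": 387_615_074,
    "IE": 101_029_231, "AU": 148_208_169, "NZ": 81_390_476,
    "IN": 96_430_888, "LK": 46_583_115, "PK": 51_367_152, "BD": 39_658_255,
}

CORE = ["US", "CA", "GB", "IE", "AU", "NZ"]   # the six "core" countries
SOUTH_ASIA = ["IN", "LK", "PK", "BD"]         # the four South Asian countries


def per_million(counts, countries):
    """Frequency per million words for a group of countries."""
    tokens = sum(counts.get(c, 0) for c in countries)
    words = sum(TOTAL_WORDS[c] for c in countries)
    return tokens / words * 1_000_000


def group_ratio(counts, left, right, smoothing=0.01):
    """Ratio of per-million frequencies (left / right).

    The small smoothing constant is an assumption of this sketch; it only
    keeps the ratio finite when a word never occurs in the right-hand group.
    """
    return (per_million(counts, left) + smoothing) / (
        per_million(counts, right) + smoothing
    )


# HYPOTHETICAL raw counts for a single word, invented for illustration only.
example_counts = {"US": 40, "GB": 55, "IN": 900, "PK": 450, "LK": 120, "BD": 200}

ratio = group_ratio(example_counts, SOUTH_ASIA, CORE)
print(f"South Asia / core ratio: {ratio:.1f}")
```

A ratio well above 1 would suggest that the word is characteristic of the left-hand group. Normalizing by group size matters here because the six "core" countries together contain several times as many words as the four South Asian countries.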

Idioms (and phrases): The following are a few idioms related to "head" that are more common in American (and Canadian) English: in over ~ head, head start, heads or tails, talking [head], (like) a deer in the headlights, cooler heads (will prevail). On the other hand, the following are spread more evenly across the dialects: price on ~ head, head over heels (in love), head and shoulders above, two heads are better (than one), [use] ~ head, [make] ~ head spin, [put] ~ head* together, from head to toe, hanging over ~ head, off the top of ~ head. Note, by the way, how sensitive idiom counts are to corpus size. In a "small" corpus like the BNC, which is 1/20th the size of GloWbE, there might be only 1/20th as many tokens (so perhaps just 5 or 6 in total), and in a tiny one-million-word corpus, there probably wouldn't be any tokens at all.
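The point about corpus size can be made concrete with a little arithmetic. The sketch below assumes that token counts scale linearly with corpus size, which is a simplification (real corpora differ in genre and date), and the observed count of 110 is a hypothetical figure, not a value taken from GloWbE. The function name expected_tokens is likewise illustrative.

```python
def expected_tokens(observed, source_size, target_size):
    """Scale a token count linearly from one corpus size to another."""
    return observed * target_size / source_size


GLOWBE_WORDS = 1_900_000_000        # 1.9 billion words, as stated on this page
BNC_WORDS = GLOWBE_WORDS // 20      # using this page's "1/20th the size" figure
SMALL_WORDS = 1_000_000             # a "tiny" one-million-word corpus

observed_in_glowbe = 110            # HYPOTHETICAL idiom count, for illustration

in_bnc = expected_tokens(observed_in_glowbe, GLOWBE_WORDS, BNC_WORDS)
in_small = expected_tokens(observed_in_glowbe, GLOWBE_WORDS, SMALL_WORDS)

print(f"Expected in a BNC-sized corpus: {in_bnc:.1f}")
print(f"Expected in a 1-million-word corpus: {in_small:.2f}")
```

Under these assumptions, an idiom with about a hundred tokens in GloWbE would be expected to occur only a handful of times in a BNC-sized corpus, and well under once in a one-million-word corpus.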

Again, because you can easily compare anything in different countries or regions, you could for example compare V-ed me out (e.g. stressed, freaked, creeped me out) in the six "core" countries (left) and the countries in South Asia (right). Or you could see, for example, what prepositions are used with a given adjective (like integrated) in different countries (notice the "non-standard" ones in India: in and to, instead of into).

Morphology (word forms): Just a few examples show that [be] spoilt (vs. spoiled) and [have] learnt (vs. learned) are less common in the US and Canada than in other varieties, whereas American and Canadian English prefer dove (vs. dived) more than other "core" dialects.

Syntax (grammar): You can enter any grammatical construction and then see its frequency across each of the 20 countries. For example, you could look for V likely V (e.g. would likely remember), the subjunctive (e.g. if I were king), verb agreement (e.g. none of them are), try and verb (e.g. you should try and do it), or the "like" construction (and he's like ,...). You can also look for constructions like the "go + ADJ" construction (e.g. go crazy, go bankrupt), the "way" construction (e.g. he pushed his way through the crowd) or the Verb someone into V-ing construction (e.g. he talked her into coming) and see the different verbs or adjectives by country.

Because of its size, GloWbE can compare low-frequency constructions in different dialects. For example, compared to the UK, Ireland, Australia, and New Zealand, [stop] someone V-ing and [prevent] someone V-ing (they stopped / prevented him going) are quite infrequent in American and Canadian English (which would need from as well: stop, prevent). We can also examine "discourse markers", such as "that said ,", which is most common in the US (and then in descending order through the other "core" dialects).

Semantics (meaning): You can use collocates (nearby words) to compare the meaning of a word in two dialects. For example, the collocates of scheme in the US (left; e.g. evil, fraudulent, nefarious) are much more negative than those in the UK (right). In the UK (right), cupboards are not limited just to kitchens (as in the US; left), and so you get collocates like wardrobe and clothes. And finally, it looks like in British English (right) boost (verb) refers primarily to "increasing" something (e.g. finances, figures), whereas in American English (left) it has expanded its meaning to "improvement" (e.g. mood, spirits, security).

Discourse (cultural): Finally, one of the most interesting uses of the corpus is the ability to compare frequency or collocates across countries. For example, it is probably no surprise in which countries the words Quran or Allah are most common (Pakistan and other Muslim countries), or Buddh* (Sri Lanka), or feminism (six "core" countries). Using collocates (nearby words), we can also compare "what is being said" about specific concepts in different countries or regions. For example, ADJ book in the Asian countries (left) refers much more to religious texts (divine, revealed, Buddhist) than in the six "core" (more secular) countries (right). ADJ belief in South Asia (left) contains Hindu, corrupt, wrong, Islamic, heretical, etc., compared to silly, contradictory, liberal, and Catholic in the six "core" (more secular) countries (right). Finally, the adjectives with wife in the "non-core" countries (left) contain chaste, temporary, obedient, Muslim, virtuous, etc., much more than in the (more secular) "core" countries (right).

COMPOSITION OF THE CORPUS (# web sites (distinct domains), web pages, and words)

Country	Code	General (may also include blogs)			(Only) Blogs			Total
		Sites	Pages	Words	Sites	Pages	Words	Sites	Pages	Words
United States	US	43,249	168,771	253,536,242	48,116	106,385	133,061,093	82,260	275,156	386,809,355
Canada	CA	22,178	81,644	90,846,732	16,745	54,048	43,814,827	33,776	135,692	134,765,381
Great Britain	GB	39,254	232,428	255,672,390	35,229	149,413	131,671,002	64,351	381,841	387,615,074
Ireland	IE	12,978	75,432	80,530,794	5,512	26,715	20,410,027	15,840	102,147	101,029,231
Australia	AU	19,619	81,683	104,716,366	13,516	47,561	43,390,501	28,881	129,244	148,208,169
New Zealand	NZ	11,202	54,862	58,698,828	4,970	27,817	22,625,584	14,053	82,679	81,390,476
India	IN	11,217	76,609	68,032,551	9,289	37,156	28,310,511	18,618	113,765	96,430,888
Sri Lanka	LK	3,307	25,310	33,793,772	1,672	13,079	12,760,726	4,208	38,389	46,583,115
Pakistan	PK	3,070	25,852	38,005,985	2,899	16,917	13,332,245	4,955	42,769	51,367,152
Bangladesh	BD	4,415	30,813	28,700,158	2,332	14,246	10,922,869	5,712	45,059	39,658,255
Singapore	SG	5,775	28,332	29,229,186	4,255	17,127	13,711,412	8,339	45,459	42,974,705
Malaysia	MY	6,225	29,302	29,026,896	4,591	16,299	13,357,745	8,966	45,601	42,420,168
Philippines	PH	6,169	28,391	29,758,446	5,979	17,951	13,457,087	10,224	46,342	43,250,093
Hong Kong	HK	6,720	27,896	27,906,879	2,892	16,040	12,508,796	8,740	43,936	40,450,291
South Africa	ZA	7,318	28,271	31,683,286	4,566	16,993	13,645,623	10,308	45,264	45,364,498
Nigeria	NG	3,448	23,329	30,622,738	2,072	13,956	11,996,583	4,516	37,285	42,646,098
Ghana	GH	3,161	32,189	27,644,721	1,053	15,162	11,088,160	3,616	47,351	38,768,231
Kenya	KE	4,222	31,166	28,552,920	2,073	14,796	12,480,777	5,193	45,962	41,069,085
Tanzania	TZ	3,829	27,533	24,883,840	1,414	13,823	10,253,840	4,575	41,356	35,169,042
Jamaica	JM	3,049	30,928	28,505,416	1,049	15,820	11,124,273	3,488	46,748	39,663,666
TOTAL		220,405	1,140,741	1,300,348,146	170,224	651,304	583,923,681	340,619	1,792,045	1,885,632,973
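The figures in the table above can be read programmatically to see how much of the corpus each country contributes. The following is a minimal sketch that assumes the rows are available as tab-separated text, as on this page; the Row type and the parse_row helper are hypothetical names introduced for illustration, and only three rows are included to keep the example short.

```python
from typing import NamedTuple


class Row(NamedTuple):
    country: str
    code: str
    general_words: int
    blog_words: int
    total_words: int


def parse_row(line):
    """Parse one tab-separated table row into a Row.

    Column order after Country and Code: General (sites, pages, words),
    Blogs (sites, pages, words), Total (sites, pages, words).
    """
    cells = line.split("\t")
    nums = [int(c.replace(",", "")) for c in cells[2:]]
    return Row(cells[0], cells[1], nums[2], nums[5], nums[8])


# Three rows copied from the table above.
ROWS = [
    "United States\tUS\t43,249\t168,771\t253,536,242\t48,116\t106,385\t133,061,093\t82,260\t275,156\t386,809,355",
    "Great Britain\tGB\t39,254\t232,428\t255,672,390\t35,229\t149,413\t131,671,002\t64,351\t381,841\t387,615,074",
    "Jamaica\tJM\t3,049\t30,928\t28,505,416\t1,049\t15,820\t11,124,273\t3,488\t46,748\t39,663,666",
]

CORPUS_TOTAL_WORDS = 1_885_632_973  # from the TOTAL row of the table

parsed = [parse_row(line) for line in ROWS]
for row in parsed:
    share = row.total_words / CORPUS_TOTAL_WORDS * 100
    print(f"{row.code}: {share:.1f}% of the corpus")
```

Shares like these are one reason to compare normalized (per-million) frequencies rather than raw counts: the largest countries in the table contain roughly ten times as many words as the smallest.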