corpus.byu.edu

corpora, size, queries = better resources, more insight




GloWbE: Insight into variation in World Englishes

The Corpus of Global Web-Based English (GloWbE) [details on corpus] can provide data on differences between dialects of English, in ways that are not possible with any other corpus. The following are just a few samples of how the corpus can be used to compare the 20 different countries in the corpus.

Note: click on any link on this page to see the corpus data, and then click on "RETURN" in the upper right-hand corner of the corpus to come back to this page.

Lexical (vocabulary): You can search for any word or phrase, and see its frequency in all 20 dialects. For example, all of the following are more common in [British] than American English: fortnight, trousers, rained off, on holiday, at university, [be] different to, rather more ADJ. More examples: [Irish] jackeen*, banjax*, culchie*, childer, soft day, [act] the maggot*; [Australia] bikkies, thongs, rockmelon*; [Malaysia] (+Singapore) rakyat, makan, hand phone, [take] ADJ food, lah!; [Jamaica] ackee, bammy, guinep, callaloo. You can also see comparisons across groups of countries, e.g. [South Asia] out of station, eve teas*, be elder to, keep in view; [Non-"core" countries] equipments, thrice, godown, same to the, [discuss] about, [cope] up.

In all of the preceding searches, you input a specific word or phrase, and then see the frequency in each country. But because the corpus has already stored the frequency of each word and phrase in each country, you can also do more complicated searches, in which you have GloWbE show you what words or phrases occur in a given country (or set of countries), but not in another. For example, you could compare all *ism words in the six "core" countries (left) and the four countries in South Asia (right). Or you could find all *ies nouns that are more common in Australia (left) than in other countries (e.g. cockies, pollies, furphies).
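The kind of comparison described above can be sketched in a few lines. This is only an illustration of the general idea (normalize stored counts per million words, then rank matching words by the ratio between two groups); it is not GloWbE's actual implementation. The counts, group sizes, word list, and the names `TOY_COUNTS`, `TOY_SIZES`, `per_million`, and `compare` are all hypothetical.

```python
from fnmatch import fnmatchcase

# Made-up counts for illustration only (NOT GloWbE data):
# word -> {country group: raw frequency}
TOY_COUNTS = {
    "tourism": {"core": 900, "south_asia": 300},
    "casteism": {"core": 2, "south_asia": 60},
    "communalism": {"core": 4, "south_asia": 80},
    "prism": {"core": 120, "south_asia": 20},
}

# Made-up group sizes in words (also NOT GloWbE figures).
TOY_SIZES = {"core": 10_000_000, "south_asia": 5_000_000}


def per_million(count, size):
    """Normalize a raw count to a rate per million words."""
    return count * 1_000_000 / size


def compare(pattern, group_a, group_b,
            counts=TOY_COUNTS, sizes=TOY_SIZES, smoothing=0.01):
    """Rank words matching a wildcard pattern by how much more frequent
    they are (per million words) in group_a than in group_b.

    The small smoothing constant avoids division by zero when a word
    does not occur at all in group_b.
    """
    rows = []
    for word, freq in counts.items():
        if not fnmatchcase(word, pattern):
            continue
        rate_a = per_million(freq.get(group_a, 0), sizes[group_a])
        rate_b = per_million(freq.get(group_b, 0), sizes[group_b])
        ratio = (rate_a + smoothing) / (rate_b + smoothing)
        rows.append((word, rate_a, rate_b, ratio))
    # Most distinctive words for group_a come first.
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for word, rate_a, rate_b, ratio in compare("*ism", "south_asia", "core"):
        print(f"{word:12s} {rate_a:8.2f} {rate_b:8.2f} {ratio:8.2f}")
```

Comparing normalized rates rather than raw counts matters because the country groups differ in size; a raw count of 300 in a small sub-corpus can represent a higher rate than 900 in a larger one.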

Idioms (and phrases): The following are a few idioms related to "head" that are more common in American (and Canadian) English: in over ~ head, head start, heads or tails, talking [head], (like) a deer in the headlights, cooler heads (will prevail). On the other hand, the following are spread more evenly across the dialects: price on ~ head, head over heels (in love), head and shoulders above, two heads are better (than one), [use] ~ head, [make] ~ head spin, [put] ~ head* together, from head to toe, hanging over ~ head, off the top of ~ head. Note, by the way, how sensitive idiom counts are to corpus size. In a "small" corpus like the BNC, which is 1/20th the size of GloWbE, there might be only 1/20th as many tokens of a given idiom (so perhaps just 5 or 6 in total), and in a tiny 10-20 million word corpus, there probably wouldn't be any tokens at all.
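The size argument above is simple proportional arithmetic, sketched below. The idiom count of 110 is a hypothetical figure chosen to match the "5 or 6" example, the 1:200 corpus is likewise hypothetical, and the Poisson step is an added simplifying assumption (tokens occurring independently at a constant rate), not something the corpus documentation states.

```python
import math


def expected_tokens(count_in_reference, reference_size, target_size):
    """Scale a count linearly from a reference corpus to a corpus of another size."""
    return count_in_reference * target_size / reference_size


def prob_zero(expected):
    """Chance of seeing no tokens at all, under a Poisson assumption."""
    return math.exp(-expected)


if __name__ == "__main__":
    count = 110  # hypothetical number of tokens of one idiom in the large corpus

    # A corpus 1/20th the size (the BNC-to-GloWbE ratio given in the text).
    small = expected_tokens(count, reference_size=20, target_size=1)
    # A hypothetical corpus 1/200th the size.
    tiny = expected_tokens(count, reference_size=200, target_size=1)

    for label, value in (("1/20 size", small), ("1/200 size", tiny)):
        print(f"{label}: expect {value:.2f} tokens, "
              f"P(no tokens) = {prob_zero(value):.3f}")
```

Under these assumptions, an idiom that is comfortably attested in the large corpus drops to a handful of tokens at 1/20th the size, and at 1/200th the size there is a substantial chance of finding none.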

Again, because you can easily compare anything in different countries or regions, you could, for example, compare V-ed me out (e.g. stressed, freaked, creeped me out) in the six "core" countries (left) and the countries in South Asia (right). Or you could see what prepositions are used with a given adjective (like integrated) in different countries (notice the "non-standard" ones in India: in and to, instead of into).

Morphology (word forms): Just a few examples show that [be] spoilt (vs. spoiled) and [have] learnt (vs. learned) are less common in the US and Canada than in other varieties, whereas American and Canadian English prefer dove (vs. dived) more than the other "core" dialects do.

Syntax (grammar): You can enter any grammatical construction and then see its frequency across each of the 20 countries. For example, you could look for V likely V (e.g. would likely remember), the subjunctive (e.g. if I were king), verb agreement (e.g. none of them are), try and verb (e.g. you should try and do it), or the "like" construction (and he's like ,...). You can also look for constructions like the "go + ADJ" construction (e.g. go crazy, go bankrupt), the "way" construction (e.g. he pushed his way through the crowd) or the Verb someone into V-ing construction (e.g. he talked her into coming) and see the different verbs or adjectives by country.

Because of its size, GloWbE can compare low-frequency constructions in different dialects. For example, compared to the UK, Ireland, Australia, and New Zealand, [stop] someone V-ing and [prevent] someone V-ing (they stopped / prevented him going) are quite infrequent in American and Canadian English (which would need from as well: stop, prevent). We can also examine "discourse markers", such as "that said ,", which is most common in the US (and then in descending order through the other "core" dialects).

Semantics (meaning): You can use collocates (nearby words) to compare the meaning of a word in two dialects. For example, the collocates of scheme in the US (left; e.g. evil, fraudulent, nefarious) are much more negative than those in the UK (right). In the UK (right), cupboards are not limited just to kitchens (as in the US; left), and so you get collocates like wardrobe and clothes. And finally, it looks like in British English (right) boost (verb) refers primarily to "increasing" something (e.g. finances, figures), whereas in American English (left) it has expanded its meaning to include "improvement" (e.g. mood, spirits, security).
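As a rough illustration of what "collocates" means here, the sketch below counts the words that occur within a fixed window around a node word and then lists the collocates found in one sample but not in another. It is a generic window-count approach, not GloWbE's actual method, and the two one-sentence "dialect samples" are invented toy text; the function names `collocates` and `only_in` are hypothetical.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts


def only_in(counts_a, counts_b):
    """Collocates attested in the first sample but absent from the second."""
    return sorted(set(counts_a) - set(counts_b))


if __name__ == "__main__":
    # Invented toy sentences, NOT corpus data.
    sample_a = "the fraudulent scheme was an evil scheme".split()
    sample_b = "the pension scheme and the housing scheme".split()

    a = collocates(sample_a, "scheme", window=1)
    b = collocates(sample_b, "scheme", window=1)
    print(only_in(a, b))
    print(only_in(b, a))
```

With real data one would normally also filter out function words and weight collocates by an association measure rather than raw presence, since raw window counts are dominated by very frequent words.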

Discourse (cultural): Finally, one of the most interesting uses of the corpus is the ability to compare frequency or collocates across countries. For example, it is probably no surprise in which countries the words Quran or Allah are most common (Pakistan and other Muslim countries), or Buddh* (Sri Lanka), or feminism (six "core" countries). Using collocates (nearby words), we can also compare "what is being said" about specific concepts in different countries or regions. For example, ADJ book in the Asian countries (left) refers much more to religious texts (divine, revealed, Buddhist) than in the six "core" (more secular) countries (right). ADJ belief in South Asia (left) includes Hindu, corrupt, wrong, Islamic, heretical, etc., compared to silly, contradictory, liberal, and Catholic in the six "core" (more secular) countries (right). Finally, the adjectives with wife in the "non-core" countries (left) include chaste, temporary, obedient, Muslim, virtuous, etc. much more than in the (more secular) "core" countries (right).