Note: If you want to use the older version of the corpus (from before May 2016), add /old to the web address, e.g. (new/regular), (old).

See also the NOW corpus (189 billion words, automatically adds 4-5 million words each day) and the CORE corpus (the first carefully categorized corpus of web registers) - both released in May 2016.


Comments?  Problems?

In May 2016 we released a new version of the BYU corpora. The following are the major changes to the corpus architecture and interface.

1. More mobile-friendly

The previous interface had lots of frames. These worked well on laptop and and desktop computers, but not very well with mobile phones or tablets. The new interface is designed from the ground up to work on screens of any size. The following are some screenshots from a mobile phone for (left to right) the search interface, results display, Keyword in Context display (KWIC), and expanded KWIC. The interface looks even better on a device with a larger screen, but the bottom line is that the corpora now look and work fine, no matter what device you're using. Note that the help files for mobile devices can always be accessed via the right-most tab (or just click HELP in the search form). Note also that the older interface will still be online as well, at least for the foreseeable future.


2. Cleaner, more simple interface

The search form in the previous interface was a bit overwhelming (below, left). The newer interface is much cleaner and simple to use (below, right). All of the previous functionality is still there -- the ability to limit by and compare sections of the corpus, deciding how to sort the data, etc -- but now those form elements only appear when you need them.

Previous interface

 

New: simple list view

New: collocates view


3. More helpful help files

Context-sensitive help files now appear whenever you click on a form element -- list, collocates, compare words, sections, virtual corpora, etc. And there are sample searches in each of these files, which you can modify to make your own searches. Note that the help files for mobile devices can always be accessed via the right-most tab (or just click HELP in the search form), and that it is context-sensitive (new instructions load every time you click on an element in the search form).


4. Simpler, more intuitive search syntax

Some search syntaxes are (in our view) unnecessarily complex, like the CQP syntax on the left. The previous BYU search syntax had a much simpler syntax, but there were still too many square brackets, full stops, asterisks, etc (no fun to type these on a mobile phone keyboard). We have now simplified the search syntax even more, as is shown on the right. But while they're learning the newer, simpler syntax, people can still use any combination of the older and newer syntax. (For more information, including the new part of speech codes, click on LIST in the search form, and then Part of Speech)

Type of search CQP syntax Previous BYU syntax New syntax Example
Word [word = "nooks"] nooks nooks nooks and crannies
Lemma (forms of word) [lemma = "decide"] [decide] DECIDE DECIDE that it
Part of speech [tag = "NN."] [nn*] NOUN fast NOUN
Synonyms Not possible [=soft] =soft soft, smooth, quiet
Customized word lists Not possible [emailAddress@clothes] @clothes dress, shoe, sock
Combinations of preceding [lemma = "end" & pos = "VV."] [end].[v*] END_v end, ends, ended, ending
Combinations of preceding [lemma = "eat"] [tag = "NN."] [eat] * [nn*] EAT * NOUN ate the bananas, eat some cake
Combinations of preceding Not possible [[emailAddress@clothes]] @CLOTHES dress, dresses, shoe, shoes
Combinations of preceding Not possible [[=clean]].[v*] =CLEAN_v cleans, scoured, washing
Combinations of preceding Not possible [wear] * [=nice] [email@clothes] WEAR * =nice @clothes wore some good-looking pants


5. Virtual corpora

In early 2015 we added the ability to create "virtual corpora" for the Wikipedia corpus. In just a few seconds, users could create a virtual corpus of texts related to biology, Buddhism, investments, basketball -- or thousands of other topics. They could then modify these corpora -- adding, deleting, or moving texts. They could limit their searches to a particular virtual corpus (e.g. collocates of stress in psychology or engineering), and compare the frequency of a word or phrase in their different virtual corpora. And best of all, they could create keyword lists for any of the virtual corpora -- in just a few seconds.

We have now added the "virtual corpus" functionality to all of the BYU corpora, which allows you to quickly and easily create and use virtual corpora from any of the texts in these corpora. For example, you could create a virtual corpus of texts from Cosmopolitan or Astronomy magazines (COCA), newspaper articles dealing with the New Deal from 1932-1938 (COHA), web pages from a particular website dealing with cricket in the UK (GloWbE), speeches by Winston Churchill from 1939 to 1945 that mention Germany (Hansard), or newspaper articles from September 2015 dealing with the European refugee crisis (NOW). Click on VIRTUAL/TEXTS in any of the corpora for much more detail and some great examples of these virtual corpora.

6. New corpora

In addition to all of these new features, in May 2016 we also released two new corpora, which of course use the new interface. The first is the NOW corpus ("Newspapers on the Web"), which contains nearly three billion words of text in newspapers from throughout the world from 2010 to the present. Best of all, the corpus automatically grows by about 5,000,000 words each day (~150 million per month, 1.8 billion words per year). So when you search the corpus, you are searching a corpus that is never more than 24 hours old -- not just stale texts in a corpus from the 1980s or early 1990s.

Second, we have released the CORE corpus ("Corpus of Online Registers of English". The corpus contains more than 50 million words of text from the web, and it is the first large web-based corpus that is carefully categorized into many different "web registers" (personal blogs, interviews, instructional pages, etc). This is quite different from other very large corpora that simply present huge amounts of data from web pages as giant "blobs", with no real attempt to categorize them into linguistically distinct registers.


We hope that these new features and corpora will be of benefit to you in your teaching, learning, and research.