English-Corpora.org


In May 2016 we released a new version of the corpora from English-Corpora.org. The following are the major changes to the corpus architecture and interface.

1. More mobile-friendly

The previous interface had lots of frames. These worked well on laptop and desktop computers, but not very well on mobile phones or tablets. The new interface is designed from the ground up to work on screens of any size. The following are some screenshots from a mobile phone for (left to right) the search interface, results display, Keyword in Context display (KWIC), and expanded KWIC. The interface looks even better on a device with a larger screen, but the bottom line is that the corpora now look and work fine, no matter what device you're using. (Note: the older interface will still be online as well, at least for the foreseeable future.)


2. Cleaner, simpler interface

The search form in the previous interface was a bit overwhelming (below, left). The newer interface is much cleaner and simpler to use (below, right). All of the previous functionality is still there -- limiting searches to sections of the corpus, comparing sections, deciding how to sort the data, etc. -- but now those form elements only appear when you need them.

Previous interface

 

New: simple list view

New: collocates view


3. More helpful help files

Context-sensitive help files now appear whenever you click on a form element -- list, collocates, compare words, sections, virtual corpora, etc. And there are sample searches in each of these files, which you can modify to make your own searches.


4. Simpler, more intuitive search syntax

Some search syntaxes are (in our view) unnecessarily complex, like the CQP syntax on the left. The previous search syntax was much simpler, but it still required too many square brackets, full stops, asterisks, etc. (no fun to type on a mobile phone keyboard). We have now simplified the search syntax even more, as shown on the right. While they are learning the newer, simpler syntax, people can still use any combination of the older and newer syntax. (For more information, including the new part of speech codes, click on LIST in the search form of a corpus, and then Part of Speech.)

| Type of search | CQP syntax | Previous search syntax | New syntax | Example |
|---|---|---|---|---|
| Word | [word = "nooks"] | nooks | nooks | nooks and crannies |
| Lemma (forms of word) | [lemma = "decide"] | [decide] | DECIDE | DECIDE that it |
| Part of speech | [tag = "NN."] | [nn*] | NOUN | fast NOUN |
| Synonyms | Not possible | [=soft] | =soft | soft, smooth, quiet |
| Customized word lists | Not possible | [emailAddress@clothes] | @clothes | dress, shoe, sock |
| Combinations of preceding | [lemma = "end" & pos = "VV."] | [end].[v*] | END_v | end, ends, ended, ending |
| Combinations of preceding | [lemma = "eat"] [tag = "NN."] | [eat] * [nn*] | EAT * NOUN | ate the bananas, eat some cake |
| Combinations of preceding | Not possible | [[emailAddress@clothes]] | @CLOTHES | dress, dresses, shoe, shoes |
| Combinations of preceding | Not possible | [[=clean]].[v*] | =CLEAN_v | cleans, scoured, washing |
| Combinations of preceding | Not possible | [wear] * [=nice] [email@clothes] | WEAR * =nice @CLOTHES | wore some good-looking pants |
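
To make the correspondence between the previous and the new syntax concrete, the following Python sketch rewrites a query from the previous syntax into the new one. It is only an illustration inferred from the rows of the table above, not the code the corpora actually use. The function names and the two lookup tables are our own assumptions, only the part of speech codes that appear in the table are included, and customized word lists are left out because their previous form is not fully shown here.

```python
import re

# Only the mappings that appear in the table; other part of speech codes
# are listed in the corpus help pages and are deliberately not guessed here.
STANDALONE_POS = {"nn*": "NOUN"}  # [nn*]  -> NOUN
SUFFIX_POS = {"v*": "v"}          # .[v*]  -> _v

# One bracketed item, optionally doubled ([[...]]), optionally followed
# by a part of speech restriction such as .[v*]
TOKEN = re.compile(r"^(\[\[|\[)([^\[\]]+)(\]\]|\])(?:\.\[([^\[\]]+)\])?$")


def convert_token(token: str) -> str:
    """Rewrite one whitespace-separated item of a previous-syntax query."""
    m = TOKEN.match(token)
    if m is None:
        return token  # plain word (nooks) or wildcard (*): unchanged
    opening, body, closing, pos = m.groups()
    if len(opening) != len(closing):
        raise ValueError(f"unbalanced brackets in {token!r}")
    if "@" in body:
        raise ValueError("customized word lists are not handled in this sketch")
    doubled = opening == "[["
    if pos is None and not doubled and body in STANDALONE_POS:
        return STANDALONE_POS[body]
    if body.startswith("="):
        # [=soft] -> =soft ; [[=clean]] -> =CLEAN (all forms of each synonym)
        new = "=" + (body[1:].upper() if doubled else body[1:])
    else:
        # [decide] -> DECIDE (capitals mean "all forms of the word")
        new = body.upper()
    if pos is not None:
        if pos not in SUFFIX_POS:
            raise ValueError(f"part of speech code {pos!r} is not in the table")
        new += "_" + SUFFIX_POS[pos]
    return new


def to_new_syntax(query: str) -> str:
    """Rewrite a whole query, item by item."""
    return " ".join(convert_token(t) for t in query.split())
```

For example, to_new_syntax("[eat] * [nn*]") gives EAT * NOUN and to_new_syntax("[[=clean]].[v*]") gives =CLEAN_v, matching the corresponding rows of the table.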


5. Virtual corpora

In early 2015 we added the ability to create "virtual corpora" for the Wikipedia corpus. In just a few seconds, users could create a virtual corpus of texts related to biology, Buddhism, investments, basketball -- or thousands of other topics. They could then modify these corpora -- adding, deleting, or moving texts. They could limit their searches to a particular virtual corpus (e.g. collocates of stress in psychology or engineering), and compare the frequency of a word or phrase in their different virtual corpora. And best of all, they could create keyword lists for any of the virtual corpora -- in just a few seconds.

We have now added the "virtual corpus" functionality to all of the corpora from English-Corpora.org, which allows you to quickly and easily create and use virtual corpora from any of the texts in these corpora. For example, you could create a virtual corpus of texts from Cosmopolitan or Astronomy magazines (COCA), newspaper articles dealing with the New Deal from 1932-1938 (COHA), web pages from a particular website dealing with cricket in the UK (GloWbE), speeches by Winston Churchill from 1939 to 1945 that mention Germany (Hansard), or newspaper articles from September 2015 dealing with the European refugee crisis (NOW). Click on VIRTUAL/TEXTS in any of the corpora for much more detail and some great examples of these virtual corpora.
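
As a rough way to picture what a virtual corpus is, the Python sketch below treats it as a subset of texts selected by their metadata, and then compares the normalized frequency of a word across two such subsets. This is a toy model under our own assumptions, not a description of how the corpora are implemented: the Text class, the function names, and the three tiny sample texts are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Text:
    """One text plus the metadata used to select it (hypothetical fields)."""
    text_id: int
    source: str
    year: int
    words: Tuple[str, ...]


def make_virtual_corpus(texts: Iterable[Text],
                        keep: Callable[[Text], bool]) -> List[Text]:
    """A virtual corpus is just the texts that satisfy some condition."""
    return [t for t in texts if keep(t)]


def per_million(corpus: List[Text], word: str) -> float:
    """Frequency of `word` per million words of the (virtual) corpus."""
    total = sum(len(t.words) for t in corpus)
    hits = sum(t.words.count(word) for t in corpus)
    return 1_000_000 * hits / total if total else 0.0


# Invented toy data, far too small to say anything about real usage
texts = [
    Text(1, "Cosmopolitan", 2012, ("stress", "at", "work", "can", "ruin", "sleep")),
    Text(2, "Astronomy", 2013, ("tidal", "stress", "heats", "the", "moon")),
    Text(3, "Astronomy", 2014, ("the", "comet", "passed", "the", "sun")),
]

astronomy = make_virtual_corpus(texts, lambda t: t.source == "Astronomy")
cosmo = make_virtual_corpus(texts, lambda t: t.source == "Cosmopolitan")

astronomy_rate = per_million(astronomy, "stress")
cosmo_rate = per_million(cosmo, "stress")
```

Normalizing per million words is what makes the comparison between virtual corpora of different sizes meaningful.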


In May 2016 we also released the following new corpora:

NOW corpus ("News On the Web"). This 3-billion-word corpus is like a "GloWbE monitor corpus" (allowing you to look at changes over time), and it will never be more than 24 hours out of date.

We have created a series of scripts that add about four million words of data (from the same twenty countries as GloWbE) every night (so ~130 million words a month / 1.5 billion words a year). The scripts run automatically from 10 PM to 1 AM: they get the URLs from Google News; download the 7,000-8,000 web pages with HTTrack; clean them up with jusText (to remove boilerplate material); tag and lemmatize them with CLAWS 7; and then integrate them into our existing relational database architecture. So when people search the NOW corpus, the data will never be more than 24 hours old, which should be useful for research that benefits from up-to-date corpora (i.e. no more stale examples from corpora that only contain texts from the 1980s or 1990s -- a full generation ago).
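
The growth figures above are rounded, and a few lines of Python show how they fit together. This is only a back-of-the-envelope check using the numbers quoted in this section. The variable names are ours, and the words-per-page figures are derived here rather than stated by the corpus.

```python
WORDS_PER_NIGHT = 4_000_000        # "about four million words" each night
PAGES_PER_NIGHT = (7_000, 8_000)   # "7,000-8,000 web pages"
QUOTED_PER_MONTH = 130_000_000     # "~130 million words a month"
DAYS_PER_YEAR = 365

# Yearly and monthly growth implied by exactly four million words a night
words_per_year = WORDS_PER_NIGHT * DAYS_PER_YEAR
words_per_month = words_per_year / 12

# Nightly total implied by the quoted monthly figure
implied_per_night = QUOTED_PER_MONTH * 12 / DAYS_PER_YEAR

# Average length of a cleaned page, for the low and high page counts
words_per_page = tuple(WORDS_PER_NIGHT / pages for pages in PAGES_PER_NIGHT)

print(f"{words_per_year:,} words a year")
print(f"{words_per_month:,.0f} words a month")
print(f"{implied_per_night:,.0f} words a night implied by the monthly figure")
print(f"{words_per_page[1]:.0f}-{words_per_page[0]:.0f} words per page")
```

Exactly four million words a night comes to about 1.46 billion words a year, so the quoted ~130 million a month and 1.5 billion a year correspond to a nightly total slightly above four million, which is consistent with "about four million".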

The interface also allows users to find keywords and key phrases for any date or range of dates, and to quickly and easily find the "most recent 100" tokens of any word, phrase, or construction. Finally, in summer 2016 we will also make the ~130 million words of cleaned text added each month available by subscription, in a format similar to the other full-text data.

Corpus of Web Genres. Douglas Biber, Jesse Egbert, and Mark Davies received a grant from the US National Science Foundation to create "A Linguistic Taxonomy of English Web Registers", and this corpus is the fruit of that research (see also articles 1, 2, and our 2017 book on "web registers" from Cambridge University Press). The corpus contains more than 50 million words of text from the web, and it is the first large web-based corpus that is so carefully categorized into so many different registers. This is quite different from other very large corpora that simply present huge amounts of data from web pages as giant "blobs", with no real attempt to categorize them into linguistically distinct registers.


We hope that these new features and corpora will be of benefit to you in your teaching, learning, and research.