corpora, size, queries = better resources, more insight

The following is a history of the different corpora, as well as changes and improvements to the corpus architecture and interface.

2015. Jul

Released the Hansard corpus, which is based on 1.6 billion words in 7.6 million speeches from the British Parliament, 1803-2005. In addition to allowing for virtual corpora, this corpus is also semantically tagged, which allows for powerful meaning-based searches.

2015. Jan

Released the Wikipedia corpus, which is based on 1.9 billion words in 4.4 million articles from Wikipedia. You can create "virtual corpora" from the 4.4 million web pages (e.g. electrical engineering, investments, or basketball), and then search just that corpus, or create keyword lists based on that virtual corpus.

2014. Mar

Released full-text versions of COCA and GloWbE, which allow users to search the downloaded texts on their own computer

2013. Aug

Released free downloadable lists for academic English: word families, core academic, and genre-specific technical words

2013. Aug

Released the same interface as the WordAndPhrase resources below, but just for COCA-Academic

2013. Apr

Released the Corpus of Global Web-Based English (GloWbE) (1.9 billion words, 2012-13)

2013. Jan

Released the Strathy Corpus (Canadian English) (50 million words, ~1970s-2000s)

2012. Aug

Created ability to compare results from different corpora (side by side) within the web interface, e.g. COCA and BNC

2012. Aug

Updated the British National Corpus with the CLAWS 7 tagset; inclusion of speech indicators, XML World Edition

2012. Jul

Released the Corpus of American Soap Operas (100 million words, 2001-2012)

2012. Jul

Added the following datasets to the Google Books corpora: British English (34 billion words), Fiction (91 billion), One Million Books (89 billion), Spanish (45 billion)

2012. Jun

Added about 25 million words to the Corpus of Contemporary American English (COCA), for Apr 2011 - Jun 2011.

2012. Feb

Modified ability to enter entire texts and then see detailed information about words and phrases

2012. Jan

Released integrated frequency and genre data, definitions, collocates, concordances, synonyms, and WordNet

2011. Dec

Released free n-gram lists for COCA and COHA; millions of rows of data for 2-grams (two-word sequences), 3-grams, 4-grams, and 5-grams.

2011. May

Released beta version of the Google Books (American English) Corpus (155 billion words, 1810-2009)

2011. Apr

Added about 15 million words to the Corpus of Contemporary American English (COCA), for July 2010 - Mar 2011.

2011. Feb

Added concordance view

2010. Oct

Improved functionality for interaction with other users (see queries, researchers, publications) and ability to save and manipulate Keyword in Context entries.

2010. Sep

Released beta version of the Corpus of Historical American English (COHA)

2010. Aug

Added about 20 million words to the Corpus of Contemporary American English (COCA), for July 2009 - June 2010.

2010. Feb

Released the frequency lists and dictionary that are based on the Corpus of Contemporary American English.

2009. Aug

Added about 15 million words to the Corpus of Contemporary American English (COCA), for October 2008 - June 2009.

2009. May

Added new tools for collaboration: links to previous queries (including annotations/notes) and ability to share them with others

2008. Oct

Added about 15 million words to the Corpus of Contemporary American English (COCA), for Jan-Sep 2008.

2008. Jun

Applied the new architecture to the Corpus do Português

2008. Apr

Applied the new architecture to the British National Corpus and the TIME Corpus

2008. Mar

Released the Corpus of Contemporary American English

2007. Oct

Finished new (current) corpus architecture; applied it to the Corpus del Español. Major updates in this corpus as well, including much-improved tagging and lemmatization for Modern Spanish.

2007. May

Released the TIME Corpus of American English

2006. Aug

Released the Corpus do Português

2005. Apr

Interface for Register Variation in Spanish

2004. Apr

Released VIEW, our first version of the British National Corpus

2002. Sep

Released the first version of the Corpus del Español


There are several other corpora with an older, non-standard architecture and interface: Polyglot Bible, Polyglot Book of Mormon, Medieval Spanish bibles, and Latin/OSp/ModSp bibles.