corpus.byu.edu


The following is a history of the different corpora, as well as changes and improvements to the corpus architecture and interface.

Upcoming (Apr 2018) The 15-16 billion word iWeb ("Intelligent Web") corpus. Unlike other large corpora of English, this one allows users to compare and contrast 20+ million web pages from 100,000 different websites.
2017. Oct Released the Early English Books Online (EEBO) corpus, which contains 755 million words in more than 25,000 texts from the 1470s to the 1690s.
2017. Sep All of the corpora and the corpus portal (as well as corpus-based resources) are now available over a secure HTTPS connection.
2017. Feb Released the US Supreme Court corpus, which contains 130 million words in US Supreme Court opinions during the last 200 years.
2016. May Released a major update to the corpus interface, which works great on mobile devices and which allows the use of "virtual corpora"
2016. May Released the NOW corpus, which automatically adds about four million words of data every night. In other words, your searches will show what is going on in English right now.
2016. May Released the CORE corpus, which is the first corpus of web pages (about 50 million words of data) that are carefully tagged for register (personal blog, advice, interviews, etc)
2015. Jul Released the Hansard corpus, which is based on 1.6 billion words in 7.6 million speeches from the British Parliament, 1803-2005.
2015. Jan Released the Wikipedia corpus, which is based on 1.9 billion words in 4.4 million articles from Wikipedia.
2014. Mar Released full-text versions of COCA and GloWbE, which allow users to search the downloaded texts on their own computer
2013. Aug Released www.academicvocabulary.info; free downloadable lists for academic English: word families, core academic, and genre-specific technical words
2013. Aug Released www.wordandphrase.info/academic: same interface as the WordAndPhrase resources below, but just for COCA-Academic
2013. Apr Released the Corpus of Global Web-Based English (GloWbE) (1.9 billion words, 2012-13)
2013. Jan Released the Strathy Corpus (Canadian English) (50 million words, ~1970s-2000s)
2012. Aug Created ability to compare results from different corpora (side by side) within the web interface, e.g. COCA and BNC
2012. Aug Updated the British National Corpus with the CLAWS 7 tagset; inclusion of speech indicators, XML World Edition
2012. Jul Released the Corpus of American Soap Operas (100 million words, 2001-2012)
2012. Jul Added the following datasets to the Google Books corpora: British English (34 billion words), Fiction (91 billion), One Million Books (89 billion), Spanish (45 billion)
2012. Jun Added about 25 million words to the Corpus of Contemporary American English (COCA), for Apr 2011 - Jun 2011.
2012. Feb Modified www.wordandphrase.info: ability to enter entire texts and then see detailed information about words and phrases
2012. Jan Released www.wordandphrase.info: integrated frequency and genre data, definitions, collocates, concordances, synonyms, and WordNet
2011. Dec Released free n-grams lists for COCA and COHA; millions of rows of data for 2-grams (two word sequences), 3-grams, 4-grams, and 5-grams.
2011. May Released beta version of the Google Books (American English) Corpus (155 billion words, 1810-2009)
2011. Apr Added about 15 million words to the Corpus of Contemporary American English (COCA), for July 2010 - Mar 2011.
2011. Feb Added concordance view
2010. Oct Improved functionality for interaction with other users (see queries, researchers, publications) and ability to save and manipulate Keyword in Context entries.
2010. Sep Released beta version of the Corpus of Historical American English (COHA)
2010. Aug Added about 20 million words to the Corpus of Contemporary American English (COCA), for July 2009 - June 2010.
2010. Feb Released the frequency lists and dictionary that are based on the Corpus of Contemporary American English.
2009. Aug Added about 15 million words to the Corpus of Contemporary American English (COCA), for October 2008 - June 2009.
2009. May Added new tools for collaboration: links to previous queries (including annotations/notes) and ability to share them with others
2008. Oct Added about 15 million words to the Corpus of Contemporary American English (COCA), for Jan-Sep 2008.
2008. Jun Applied the new architecture to the Corpus do Português
2008. Apr Applied the new architecture to the British National Corpus and the TIME Corpus
2008. Mar Released the Corpus of Contemporary American English
2007. Oct Finished new (current) corpus architecture; applied it to the Corpus del Español. Major updates in this corpus as well, including much-improved tagging and lemmatization for Modern Spanish.
2007. May Released the TIME Corpus of American English
2006. Aug Released the Corpus do Português
2005. Apr Interface for Register Variation in Spanish
2004. Apr Released VIEW, our first version of the British National Corpus
2002. Sep Released the first version of the Corpus del Español
Misc There are several other corpora with older, non-standard architecture and interface: Polyglot Bible, Polyglot Book of Mormon, Medieval Spanish bibles, and Latin/OSp/ModSp bibles