corpus.byu.edu


A great deal of data from iWeb is available for download, just as it already is for COCA: word frequency, collocates, n-grams, full-text data, etc.

Click on any of the links below for more information and samples of this data.

Data             Description
Word frequency   Top 60,000 words, along with word forms and range information (number of websites in which they occur)
Collocates       Top 1,000 collocates for each of the top 60,000 words in the corpus (60,000,000 node/collocate pairs)
N-grams          Top 100 million n-grams for each of the following: 2-grams (two word strings), 3-grams, 4-grams, and 5-grams
URLs             22 million URLs for the corpus, along with website, title, and # words in the web page
Full-text data   95% of the text from the 14 billion words of text, including a listing of all 22+ million web pages used in the corpus
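
To make the terms in the table concrete, here is a minimal Python sketch of how word frequency, range (number of websites in which a word occurs), and 2-grams could be computed. It runs on a tiny made-up corpus. The site names, the sample texts, and the ngrams helper are all hypothetical, and nothing here reflects the actual format of the iWeb download files or how the iWeb data was produced.

```python
from collections import Counter

# Toy stand-in for web pages grouped by website (hypothetical data, not iWeb).
pages = {
    "site-a.example": ["the cat sat on the mat", "the dog sat"],
    "site-b.example": ["the cat ran"],
}


def ngrams(tokens, n):
    """Return every contiguous n-word string in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


word_freq = Counter()    # total occurrences of each word
word_range = Counter()   # number of websites in which each word occurs
bigram_freq = Counter()  # total occurrences of each 2-gram

for site, texts in pages.items():
    seen_on_site = set()
    for text in texts:
        tokens = text.split()  # naive whitespace tokenization (an assumption)
        word_freq.update(tokens)
        bigram_freq.update(ngrams(tokens, 2))
        seen_on_site.update(tokens)
    # Each website counts at most once per word toward its range.
    word_range.update(seen_on_site)

print("frequency of 'the':", word_freq["the"])
print("range of 'the':", word_range["the"])
print("most common 2-grams:", bigram_freq.most_common(3))
```

Passing 3, 4, or 5 as the second argument to ngrams gives the longer strings in the same way.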