Download file with information on all 94,391 websites
 

The iWeb corpus contains about 14 billion words in 22,388,141 web pages from 94,391 websites. As far as we are aware, this makes it one of only three large web-based corpora that contain more than 12-13 billion words.

Unlike other large web-based corpora, iWeb was created by focusing on particular websites, rather than just scraping data from web pages on random websites. The following are the steps we took to create the corpus.



1. We downloaded from Alexa.com (created by Amazon) data on the top 1,000,000 websites from throughout the world; Alexa's ranking is based on the number of users of each website.

2. For each of these one million websites, we used the Alexa data to find what percentage of the users are from the US, Canada, Ireland, the UK, Australia, and New Zealand. The idea was that we wanted websites that would mainly be in English, as opposed to websites from India or Nigeria or Singapore (or obviously China or Japan or Russia), which might contain a lot of material in other languages. (A simplified sketch of this kind of filter appears after this list of steps.)

3. For each of the top 200,000 websites (from step #2), we obtained (and stored in a relational database) the URLs from searches on either Google or Bing. We basically just searched for web pages containing the word "of" from each of these websites, and (because nearly every page contains the word "of") Google and Bing gave us essentially "random" web pages from each website, which is what we wanted. Because Google will block repeated queries from the same IP address (somewhat less of a problem with Bing), we had to do these searches very slowly and methodically (to "stay below their radar"), and it took approximately three months to get all of the URLs. (The throttled collection loop is sketched after this list of steps.)

4. We then downloaded all of the web pages for each of these websites, using custom software written by Spencer Davies in the Go programming language. It took about three days (using five different machines) to download the approximately 27 million web pages, at about 100 pages per second. (A rough illustration of the concurrent-download approach appears after this list of steps.)

5. Approximately 30,000 of the 200,000 websites (from #3) were eliminated from the corpus, because of one of the following:

  • These were what might be called "transaction" websites, where there is little if any publicly-available data on "static" web pages. Examples might be VPN sites, torrent sites, or sites that require users to log in or run a specific search before they can see much of anything. For example, think of Google itself: 99.9% of anything valuable from Google will be the results of a specific search, not a static web page at www.google.com, so a list of random URLs (for static web pages) from www.google.com would not be very useful.
     
  • Websites that were blocked by the proxy server at BYU. The vast majority of these were p@rn sites, although there were a few for gambling, "hate speech", proxy avoidance, and other "blocked" sites.

6. We used JusText to remove "boilerplate" material (headers, footers, sidebars, etc.), and we then tagged each of the pages with the CLAWS 7 tagger. (A minimal example of the jusText step appears after this list of steps.)

7. At this point, there were about 170,000 websites. We needed websites that had enough words and web pages to give a good sampling of the language of each website, so we set a minimum threshold of 10,000 words in at least 30 different web pages per website. This eliminated another 65,000 websites, leaving us with about 105,000 websites. (This threshold check is sketched after this list of steps.)

8. We then ran repeated tests and procedures to find duplicate web pages and phrases. We searched for duplicate n-grams (primarily 11-grams), looking for long strings of words that are repeated, such as "This newspaper is copyrighted by Company_X. You are not permitted..." ( = 11 words, including punctuation). We ran these searches many times, in many different ways, trying to find and eliminate duplicate texts, as well as duplicate strings within different texts. Because there is a lot of duplicate material on web pages (even after running programs like JusText), this process eliminated approximately 10,600 more websites, which fell below the threshold of 10,000 words or 30 web pages once the duplicates were removed. (The n-gram check is sketched after this list of steps.)

9. After all of these steps, we ended up with 94,391 websites (containing 22,388,141 web pages).
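
To make step 2 more concrete, here is a minimal Python sketch of that kind of country-based filter. The input file layout, the column names, and the 80% cutoff are illustrative assumptions, not the actual values used to build iWeb.

    import csv

    ENGLISH_COUNTRIES = {"US", "CA", "IE", "GB", "AU", "NZ"}
    MIN_ENGLISH_SHARE = 0.80  # assumed cutoff; the real threshold is not documented here

    def mostly_english_sites(path):
        """Return domains whose combined user share from the six countries meets
        the cutoff, given a CSV with one row per (domain, country, share)."""
        shares = {}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if row["country"] in ENGLISH_COUNTRIES:
                    shares[row["domain"]] = shares.get(row["domain"], 0.0) + float(row["share"])
        return [domain for domain, share in shares.items() if share >= MIN_ENGLISH_SHARE]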
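
Step 3 was essentially a slow, throttled harvesting loop that stored the URLs in a relational database. The sketch below shows that loop in Python under heavy assumptions: fetch_result_urls() is only a placeholder for the actual Google/Bing query (not shown), and the SQLite schema and pause lengths are invented for illustration.

    import random
    import sqlite3
    import time

    def fetch_result_urls(domain):
        """Placeholder: return result URLs for a search for pages on `domain` containing "of"."""
        raise NotImplementedError  # the real queries went to Google or Bing

    def collect_urls(domains, db_path="urls.db"):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS urls (domain TEXT, url TEXT)")
        for domain in domains:
            for url in fetch_result_urls(domain):
                con.execute("INSERT INTO urls VALUES (?, ?)", (domain, url))
            con.commit()
            # Long, randomized pauses keep the query rate low; in the real
            # collection the process was spread over about three months.
            time.sleep(random.uniform(30, 120))
        con.close()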
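
The downloader in step 4 was custom software written in Go; the following is only a rough Python illustration of the underlying idea, fetching many pages concurrently with a bounded pool of workers. The output layout, pool size, and timeout are assumptions.

    import hashlib
    import os
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url, out_dir="pages"):
        """Download one page and save it under a hash of its URL; return the URL on success."""
        try:
            with urlopen(url, timeout=20) as resp:
                html = resp.read()
        except Exception:
            return None
        os.makedirs(out_dir, exist_ok=True)
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(html)
        return url

    def download_all(urls, workers=200):
        """Fetch URLs concurrently; return the ones that were saved successfully."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return [u for u in pool.map(fetch, urls) if u is not None]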
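
For step 6, jusText is available as a Python package (pip install justext), and a minimal use of it looks like the sketch below. The CLAWS 7 tagging is a separate step and is not shown here, since CLAWS is an external tagger rather than a Python library.

    import justext

    def extract_text(html_bytes):
        """Return the non-boilerplate paragraphs of a page as a single string."""
        paragraphs = justext.justext(html_bytes, justext.get_stoplist("English"))
        return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)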
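
The size threshold in step 7 is a simple filter. Here is a sketch, assuming the data is available as a mapping from each website to a list of per-page word counts (an assumed format):

    MIN_WORDS = 10_000   # minimum words per website
    MIN_PAGES = 30       # minimum web pages per website

    def passes_threshold(page_word_counts):
        """page_word_counts: one word count per page of a website."""
        return len(page_word_counts) >= MIN_PAGES and sum(page_word_counts) >= MIN_WORDS

    def filter_websites(sites):
        """sites: dict mapping website -> list of per-page word counts."""
        return {site: counts for site, counts in sites.items() if passes_threshold(counts)}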
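
Step 8 boils down to finding long word sequences (primarily 11-grams) that recur across texts. The simplified in-memory sketch below shows the core idea; the real procedure was run repeatedly, in many different ways, over a much larger database.

    from collections import defaultdict

    N = 11  # n-gram size; punctuation marks count as tokens

    def ngrams(tokens, n=N):
        return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def find_shared_ngrams(texts):
        """texts: dict mapping doc_id -> list of tokens.
        Return the n-grams that occur in two or more documents, with their doc_ids."""
        seen = defaultdict(set)
        for doc_id, tokens in texts.items():
            for gram in ngrams(tokens):
                seen[gram].add(doc_id)
        return {gram: docs for gram, docs in seen.items() if len(docs) > 1}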


In other large web-based corpora, the websites are essentially "random", and the vast majority of websites might contain just a handful of web pages that happened to be scooped up as the list of URLs was generated from links on other pages, which means that users cannot really search by website. But because of the systematic, principled way in which the websites were selected for iWeb, there is an average of 245 web pages and 140,000 words for each of the 94,391 websites. And because of the underlying architecture of the corpus, you can quickly and easily create Virtual Corpora to search by website, and to find the websites that refer the most to a particular word, phrase, or even topic.


(Just to be fair, we should note that there are advantages to just grabbing "random" web pages, regardless of what website they are from, and this is essentially the way that we created the GloWbE corpus. As mentioned above, however, the downside to such an approach is that the vast majority of the websites will probably have just a few pages, which means that it wouldn't make sense to use the website as part of the search, as one can do with iWeb.)