This corpus contains about 1.9 billion words in 4.4 million web pages. The pages were downloaded from Wikipedia, and because of its Creative Commons license there are essentially no copyright restrictions on our version (or on any other version of the corpus).

After downloading the single large file that contained all 4.4+ million web pages, we used VB.NET (and lots and lots of regular expressions) to process the data. Everything went into MS SQL Server databases: the metadata, the list of links, and the full text of each page. We started with the same architecture and interface as the rest of the BYU corpora, but we then modified this quite a bit to support virtual corpora.
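The actual pipeline was written in VB.NET against MS SQL Server, but the general idea (regular expressions pulling the title, links, and text out of each page, then inserting those into database tables) can be sketched in a few lines. This is only an illustrative sketch, not the real code: the page HTML, regexes, and table layout here are hypothetical, and Python with sqlite3 stands in for VB.NET and SQL Server so the example is self-contained.

```python
import re
import sqlite3

# Hypothetical sample of one downloaded page (the real input was one
# very large file containing millions of pages).
page_html = """<html><head><title>Example article - Wikipedia</title></head>
<body><p>See <a href="/wiki/Corpus_linguistics">corpus linguistics</a> and
<a href="/wiki/Regular_expression">regular expressions</a>.</p></body></html>"""

# Regexes for the three things the text mentions storing:
# metadata (here just the title), the links, and the page text.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)
LINK_RE = re.compile(r'href="(/wiki/[^"#]+)"')
TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper for the plain text

def process_page(conn, page_id, html):
    """Extract title, links, and text from one page and store them."""
    title = TITLE_RE.search(html).group(1)
    links = LINK_RE.findall(html)
    text = " ".join(TAG_RE.sub(" ", html).split())
    cur = conn.cursor()
    cur.execute("INSERT INTO pages (id, title, text) VALUES (?, ?, ?)",
                (page_id, title, text))
    cur.executemany("INSERT INTO links (page_id, target) VALUES (?, ?)",
                    [(page_id, t) for t in links])
    conn.commit()
    return title, links

# sqlite3 stands in for the MS SQL Server databases described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER, title TEXT, text TEXT)")
conn.execute("CREATE TABLE links (page_id INTEGER, target TEXT)")
title, links = process_page(conn, 1, page_html)
```

In the real pipeline this step would run once per page over the whole dump, and the resulting link table is what makes page-to-page navigation and virtual-corpus selection queryable later.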