One of the problems with a corpus built from web pages is the duplicate text that you will find on different pages, and even within the same page. For example, there might be 10-15 pages from the same website that include the same copyright notice (e.g. ...you are not permitted to copy this text...). Or there might be a web page with reader comments, in which a comment at the top of the page gets repeated two or three times later on that page.

We have used several methods to remove these duplicates:

1. As we created lists of web pages from Google searches, we only used each web page once, even if it was returned by multiple searches.
2. JusText removed most boilerplate material (e.g. headers, footers, sidebars), which contains a lot of duplicate material on pages from the same website.
3. Once we had downloaded all 25 million web pages, we then searched for duplicate n-grams (primarily 11-grams, in our case), looking for long strings of words that are repeated, such as "This newspaper is copyrighted by Company_X. You are not permitted..." (= 11 words, counting punctuation). We ran these searches many times, in many different ways, trying to find and eliminate duplicate texts, and also duplicate strings within different texts.
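
The n-gram search in step 3 can be sketched roughly as follows. This is only a minimal illustration in Python, not the scripts that were actually run: the function names (`tokenize`, `ngrams`, `shared_ngrams`), the tokenization rule, and the in-memory dictionary of pages are all assumptions made for the example.

```python
import re
from collections import defaultdict

N = 11  # n-gram length; 11-grams were the primary choice described above


def tokenize(text):
    # Assumed rule: each word and each punctuation mark is its own token,
    # so punctuation counts toward the 11 tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n=N):
    # All contiguous runs of n tokens, as hashable tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_ngrams(pages, n=N):
    """Map each n-gram found on more than one page to the set of page ids.

    `pages` is a hypothetical {page_id: text} dictionary.
    """
    seen = defaultdict(set)
    for page_id, text in pages.items():
        # set() so that a repeat inside one page is not mistaken for a
        # repeat across pages.
        for gram in set(ngrams(tokenize(text), n)):
            seen[gram].add(page_id)
    return {gram: ids for gram, ids in seen.items() if len(ids) > 1}
```

Calling `shared_ngrams` on two pages that both carry the copyright notice quoted in step 3 would return that 11-token string together with both page ids. For a collection of millions of pages, the n-grams would more plausibly be hashed and stored on disk than held in one dictionary.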

Even with these steps, however, there are still duplicate texts and (more commonly) duplicate portions of text in different pages, especially since the corpus is so big (1.9 billion words, in 1.8 million web pages). It will undoubtedly be impossible to eliminate every single one of these duplicates. But at this point, we are continuing to do the following:

4. In the Keyword in Context (KWIC) display, you will see a number in parentheses, e.g. (1), after web pages for which a duplicate was found.
5. As these duplicates are found -- one by one as KWIC displays are generated for thousands of corpus users -- they will get logged in the database. Every month or so, we will run scripts to eliminate these duplicate texts / strings. In this way, the corpus will continue to get "cleaner and cleaner" over time.
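
Steps 4 and 5 can be pictured with a small sketch like the one below. It is purely illustrative: the function name `collapse_kwic`, the row layout, and the returned log are hypothetical, and the actual database tables and scripts are not described here. The sketch also assumes that the number in parentheses counts the further copies of a line.

```python
def collapse_kwic(rows):
    """Collapse identical KWIC lines and record which texts they came from.

    `rows` is a hypothetical iterable of (text_id, left, keyword, right).
    Returns (display, log):
      display: one [text_id, left, keyword, right, duplicate_count] per
               distinct context, in first-seen order
      log:     (kept_text_id, duplicate_text_id) pairs for later clean-up
    """
    first = {}    # normalized context -> index into display
    display = []
    log = []
    for text_id, left, keyword, right in rows:
        # Assumed normalization: ignore case and runs of whitespace.
        key = " ".join(f"{left} {keyword} {right}".lower().split())
        if key in first:
            kept = display[first[key]]
            kept[4] += 1                     # the number shown in parentheses
            log.append((kept[0], text_id))   # entry to store in the database
        else:
            first[key] = len(display)
            display.append([text_id, left, keyword, right, 0])
    return display, log
```

Under these assumptions, a periodic clean-up script would read the logged pairs and decide which text or string to remove. A pair whose two ids are the same would point to a repeat within a single page.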

One final issue: what to do about intra-page duplicates, i.e. cases where the same text is repeated on the same web page? As was mentioned above, there might be a web page with reader comments, in which a comment at the top of the page gets repeated two or three times later on that page. Our approach at this point is to log these in the database as users generate KWIC displays (#5 above), but not to delete the duplicates for now. If a comment is copied on a page, it may be because the comment is an important one, and perhaps it deserves to be preserved twice in the corpus. We're still debating this, however.
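
To make the "log but do not delete" idea concrete, here is a minimal sketch that only reports where a string repeats inside one page and leaves the text untouched. The function name `repeated_spans` and the tokenization rule are assumptions for illustration, not the method actually used.

```python
import re
from collections import defaultdict


def repeated_spans(text, n=11):
    """Return {n-gram: [token offsets]} for n-grams that occur more than
    once in a single page. Nothing is removed from the text."""
    # Assumed rule: words and punctuation marks are separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    positions = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        positions[tuple(tokens[i:i + n])].append(i)
    return {gram: starts for gram, starts in positions.items() if len(starts) > 1}
```

The offsets could be stored alongside the page id, so that a later decision to keep or remove the repeated comment can be applied without searching the page again.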

If you have feedback on any of these issues, please feel free to email us. Thanks.