One of the problems with a web page-based corpus is the duplicate text that you will find on different pages, and even within the same page. For example, there might be 10-15 pages from the same website that include a copyright notice (e.g. ...you are not permitted to copy this text...). Or there might be a web page with reader comments, in which a comment at the top of the page gets repeated two or three times later on that page.
We have used several methods to remove these duplicates:
1. As we created lists of web pages from Google searches, we used each web page only once, even if it was returned by multiple searches.
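Step 1 amounts to keeping a record of which URLs have already been accepted and skipping any repeats. A minimal sketch of that idea (the function name, the light URL normalization, and the sample URLs are all illustrative assumptions, not the actual pipeline):

```python
def unique_pages(search_results):
    """Yield each URL the first time it appears, skipping later repeats."""
    seen = set()
    for url in search_results:
        # Normalize lightly so trivial variants (case, trailing slash)
        # count as the same page. A real pipeline might normalize more.
        key = url.strip().lower().rstrip("/")
        if key not in seen:
            seen.add(key)
            yield url

results = [
    "http://example.com/page1",
    "http://example.com/page1/",  # same page returned by another search
    "http://example.com/page2",
]
print(list(unique_pages(results)))
# prints ['http://example.com/page1', 'http://example.com/page2']
```

The point is simply that deduplication by URL is cheap and exact, which is why it can be done up front as the page lists are built.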
Even with these steps, however, there are still duplicate texts and (more commonly) duplicate portions of text in different pages, especially since the corpus is so big (1.9 billion words, in 1.8 million web pages). It will undoubtedly be impossible to eliminate every single one of these duplicates. But at this point, we are continuing to do the following:
4. In the Keyword in Context (KWIC) display, you will see a number in parentheses (e.g. (1) ) after web pages where there was a duplicate.
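Detecting duplicate *portions* of text across different pages is harder than matching whole URLs, because the surrounding text differs. One common approach (shown here only as an illustration; the corpus's actual method may differ) is to compare sets of overlapping word n-grams, or "shingles", and flag page pairs whose overlap is high:

```python
def shingles(text, n=5):
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=5):
    """Jaccard similarity of two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

notice = "you are not permitted to copy this text without written consent"
page_a = "Welcome to our site. " + notice
page_b = "Contact us here. " + notice
print(round(overlap(page_a, page_b), 2))
```

A pair scoring above some threshold (say, 0.5) would be flagged as containing duplicated material, such as a site-wide copyright notice repeated on 10-15 pages.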
One final issue: what to do about intra-page duplicates, i.e. cases where the same text is copied within a single web page. As mentioned above, a web page with reader comments might repeat a comment from the top of the page two or three times later on that page. Our approach at this point is to log these in the database as users run KWIC displays (#5 above), but not to delete the duplicates. If a comment is copied on a page, that may be because the comment is an important one, and perhaps it deserves to appear twice in the corpus. We're still debating this, however.
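The intra-page case can be sketched as counting repeated blocks within one page and reporting them rather than removing them, mirroring the "log but don't delete" policy described above. The paragraph-splitting heuristic and the minimum-length threshold below are assumptions for illustration:

```python
from collections import Counter

def repeated_blocks(page_text, min_words=8):
    """Report paragraphs that occur more than once on the same page.

    Returns a dict mapping each repeated paragraph to its count.
    The min_words threshold (an assumed value) skips short lines
    like navigation links that repeat for harmless reasons.
    """
    paras = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    counts = Counter(p for p in paras if len(p.split()) >= min_words)
    return {p: c for p, c in counts.items() if c > 1}

page = ("I think this article really misses the point about funding.\n\n"
        "Reply posted at 10:42.\n\n"
        "I think this article really misses the point about funding.")
print(repeated_blocks(page))
```

Logging the output instead of deleting matches leaves the final decision (keep the repeat because it may be significant, or remove it) open, which is exactly the question still under debate.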
If you have feedback on any of these issues, please feel free to email us. Thanks.