corpus.byu.edu

The following is an extended discussion of why we believe that our use of the texts in the Corpus of Contemporary American English (COCA) is within the bounds of US Fair Use Law. Similar arguments apply to the other corpora that we have created.

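The following are the four criteria used to determine whether a use of copyrighted material falls under the provisions of the Fair Use Law. For each criterion, we list what favors Fair Use status and then how it applies to COCA: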

Criteria

What favors Fair Use status

The Corpus of Contemporary American English

The amount and substantiality of the portion taken

Small portions of the original text, rather than full-text access

Under no circumstances whatsoever do end users have access to entire texts (e.g. newspaper, magazine, or journal articles, or short stories). All access is via the web interface, and the vast majority of what users see is simply frequency charts showing the frequency of words or phrases in different parts of the corpus. Access to small portions of the original text is more of an "afterthought" than a central feature of the interface.

Access to actual portions of the original text is limited to very short "Keyword in Context" displays, where users see just a handful of words to the left and the right of the word(s) searched for. In addition, all access is logged, and users can only perform a limited number of searches per day. As a result, it would be difficult for end users to re-create even one paragraph from the original text, and it would be virtually impossible to re-create an entire page of text, much less the entire article.

This "snippet defense" (which relies on limiting access to the original text to small snippets in the web interface) is the same one that Google Books relies on for its use of millions of copyrighted works. In addition, we have consulted two lawyers who specialize in Internet copyright law (names available upon request). Both have stated that because of the limited access we give end users, as well as our status with regard to the other three factors shown here, we are clearly in accord with the provisions of the Fair Use statute.

The purpose and character of the use

Academic, non-commercial

Our use of the texts is strictly for academic research, and is purely non-commercial.

The nature of the copyrighted work

Non-creative works

There are some creative works (e.g. short stories and small sections of novels) in the corpus, but more than 80% of the corpus is composed of transcripts of TV shows, and articles from newspapers, magazines, and academic journals.

The effect of the use upon the potential market

Little or no effect on the copyright holder

Because of the very limited access via our web interface (see the first item above), it is extremely unlikely that anyone would use this corpus as a "substitute" for other access to the original texts. Other sources make these texts available as "complete articles", which are meant to be read in their entirety. That is completely impossible with our interface.

Access to the texts via our interface, as compared to access via other sources, serves two completely different audiences. Our interface is designed for linguists and language learners who want to see the frequency of words, phrases, synonyms, etc., and it is completely inadequate for anyone who wishes to read the entire text of an article. As a result, there is very little or no "competition" between our service and that provided by others, and therefore virtually no market impact.
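
To make the "Keyword in Context" display described under the first criterion concrete, here is a minimal Python sketch. It is not the actual implementation behind the web interface: the function name `kwic`, the whitespace tokenization, and the window size are all assumptions chosen for illustration.

```python
def kwic(text, keyword, window=5):
    """Return (left, match, right) snippets around each occurrence of keyword.

    Hypothetical helper: `window` (the number of words shown on each side)
    is an assumed value, not a documented setting of the corpus interface.
    """
    tokens = text.split()
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if tok.strip('.,;:!?"\'()').lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


sample = ("The committee said the corpus was balanced across genres, "
          "and the corpus grew by twenty million words each year.")
for left, match, right in kwic(sample, "corpus", window=3):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>30} [{match}] {right}")
```

Each hit exposes at most `2 * window + 1` words, so rebuilding even a paragraph would require many separate, carefully chosen queries, which is the point made under the first criterion.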

 

In addition to the copyright issues, there are also licensing issues relating to the sources from which we obtained some of the texts in the corpus. We were very careful, however, to retrieve the materials over a very long period of time (four years, 2005-2008), so as not to violate licensing agreements on how much material could be retrieved in a particular timeframe.