Frequency lists, n-grams, and customized data

While the corpora themselves are free to use, some users may want to download frequency lists, n-grams, or customized data from them. This data can then be used offline for many different purposes, such as:

  • Developing teaching and testing materials

  • Creating frequency-based dictionaries and other lexicographical resources

  • Natural language processing

Several types of frequency data are available, including the following:

Type | Explanation | Sample files (click on links)

Word/lemma

The top 20,000 or 55,000 words (depending on the corpus), grouped by lemma (so go = go, goes, went, etc.). You can also obtain the frequency of each individual word form of each lemma (for goes, for went, etc.), as well as the frequency of the lemma in each of the five major genres in the corpus. (See the links to the right for examples of these two specialized lists from COCA; similar lists can be created for any of the corpora.)

COCA See files
Spanish 20,000
Portuguese 20,000
BNC (See note 1)
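As a rough illustration of what "grouped by lemma" means, the sketch below adds up word-form counts under their lemma while keeping the per-form breakdown. The `FORM_TO_LEMMA` mapping, the function name, and the sample sentence are hypothetical and exist only for this example; real lists depend on a tagged, lemmatized corpus.

```python
from collections import Counter, defaultdict

# Hypothetical form-to-lemma mapping, for illustration only.
# A real list would rely on a lemmatized, part-of-speech-tagged corpus.
FORM_TO_LEMMA = {"go": "go", "goes": "go", "went": "go", "gone": "go", "going": "go"}


def lemma_frequencies(tokens, form_to_lemma):
    """Return (lemma counts, per-lemma word-form counts) for a token list."""
    lemma_counts = Counter()
    form_counts = defaultdict(Counter)
    for token in tokens:
        form = token.lower()
        # Forms missing from the mapping are treated as their own lemma.
        lemma = form_to_lemma.get(form, form)
        lemma_counts[lemma] += 1
        form_counts[lemma][form] += 1
    return lemma_counts, form_counts


tokens = "She goes where he went and they go where we went".split()
lemma_counts, form_counts = lemma_frequencies(tokens, FORM_TO_LEMMA)

print(lemma_counts["go"])       # total for the lemma "go"
print(dict(form_counts["go"]))  # breakdown by word form
```

The same two-level structure extends to genre counts by keying the inner counter on genre instead of word form.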

N-grams

The frequency of all two-word (2-gram), three-word (3-gram), or other n-gram strings. With these lists, you can quickly and easily find the frequency of combinations of words across the corpus, without having to use the corpus interface. In addition, you can specify the words for which you want n-grams (e.g. the top 20,000 lemmas, the top 10,000 NOUN+NOUN combinations, or all words in your customized 30,000-word list).

COCA 2-grams 3-grams
BNC 2-grams 3-grams
Spanish 2-grams 3-grams
Portuguese 2-grams 3-grams
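For readers unfamiliar with n-gram lists, here is a minimal sketch of how 2-gram and 3-gram frequencies can be counted from a token sequence, and how the result can be restricted to a chosen word list. The function names and the toy sentence are assumptions made for this example; this is not a description of how the downloadable lists are actually produced, and it ignores sentence boundaries and part-of-speech tags.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    # zip over n shifted copies of the list yields each n-gram as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))


def restrict_to_vocabulary(counts, vocabulary):
    """Keep only n-grams whose words all appear in the given word list."""
    return Counter(
        {gram: freq for gram, freq in counts.items()
         if all(word in vocabulary for word in gram)}
    )


tokens = "the end of the day is the end of the week".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# Hypothetical customized word list.
vocabulary = {"the", "end", "of"}
restricted = restrict_to_vocabulary(bigrams, vocabulary)

print(bigrams.most_common(3))
print(trigrams[("the", "end", "of")])
print(restricted)
```

Once such counts are stored in a file, looking up the frequency of a word combination is a simple dictionary access, which is the practical advantage of having the lists offline.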

Sample sentences

You can get any number of sample sentences (with year, genre, and source) for any number of words in lists that you send us. For example, we recently selected 3-6 sentences for each of 100,000+ words and phrases for an online dictionary. These sentences are selected by using collocates (with frequency and mutual information score) to find good samples for each word.
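The mutual information score mentioned above can be illustrated with the common pointwise formulation, MI = log2(f(pair) x N / (f(node) x f(collocate))). This page does not state which variant is used for the sample sentences, so the formula, the function name, and the frequencies below are assumptions for illustration only.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information (base 2) for a node word and a collocate.

    Assumed formula: log2(pair_freq * corpus_size / (node_freq * collocate_freq)).
    Some tools also adjust for the size of the collocation window; that
    adjustment is omitted here.
    """
    if min(pair_freq, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all frequencies must be positive")
    return math.log2(pair_freq * corpus_size / (node_freq * collocate_freq))


# Made-up frequencies: the pair occurs far more often than chance predicts.
score = mutual_information(pair_freq=8, node_freq=100,
                           collocate_freq=80, corpus_size=1_000_000)

# A pair that co-occurs exactly as often as chance predicts scores 0.
baseline = mutual_information(pair_freq=1, node_freq=1000,
                              collocate_freq=1000, corpus_size=1_000_000)

print(round(score, 2), baseline)
```

A high score paired with a reasonable raw frequency suggests a strong collocate, and sentences containing both the word and such a collocate tend to be more typical examples of its use.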

(From COCA)

Other data

If there is other data that you could use (without having access to the full text: see note 2 below), please let us know. Examples might be the frequency of each word or phrase in a 30,000 word/phrase list, or the frequency of all synonyms for the top 10,000 lemmas in the corpus.

 

The prices for these lists depend on the corpus. Some lists (such as the 20,000 word lists from the Corpus del Español and the Corpus do Português) cost only about $200 for academic use. The lists from COCA, and the n-grams and sample sentences, cost somewhat more. For exact prices, click on "MORE INFORMATION / FREQUENCY LISTS" at any of these four corpora, or contact us.


Note 1: There are other sources for frequency lists from the BNC (site 1, site 2). If, however, you want customized frequency lists (by PoS, by genre, etc.), please feel free to contact us. Also, please be aware that the wordlists are from a fifteen-year-old corpus, and that they are probably not completely applicable to American English (more information: MORE INFORMATION / COMPARE WORDLISTS TO BNC).

Note 2: We can provide nearly any type of data you want, with one exception. Because of serious copyright issues, we cannot redistribute the corpora in any format that would allow end users to re-create even one entire article from the original texts. Again, feel free to contact us if you have questions.