American English Word Frequency

We have what is perhaps the most complete and accurate list of word frequency for American English. The data comes from the Corpus of American English. This corpus:

  • Contains 385+ million words of text (20 million words each year, 1990-2007)

  • Is divided evenly among spoken, fiction, popular magazines, newspapers, and academic journals

In other words, this is the only word list based on a large corpus of American English that includes texts from many different genres.

This frequency data can be used for many different purposes:

  • Developing teaching and testing materials

  • Creating frequency-based dictionaries and other lexicographical resources

  • Natural language processing

To see a sample frequency listing of the top 20,000 words, click on one of the following: text file, Excel. You can also see examples that show frequency information from each of the five genres (spoken, fiction, popular magazines, newspapers, and academic journals), or the frequency information for each word form of each lemma. In addition, I can create pretty much any other format that you want (such as this hierarchical list, which has lemmas grouped without POS, then by POS, then by word form for a given POS).
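
As a rough illustration of the hierarchical grouping described above (lemma without POS, then by POS, then by word form), here is a minimal Python sketch. It assumes each row of the data is a (lemma, POS, word form, frequency) tuple; that row layout, the function names, and the counts are all hypothetical placeholders chosen for the example, not the actual file format or real corpus figures.

```python
from collections import defaultdict

# Hypothetical rows: (lemma, part of speech, word form, frequency).
# The counts are made-up placeholders, NOT real corpus frequencies.
ROWS = [
    ("sing", "v", "sing", 50),
    ("sing", "v", "sings", 20),
    ("sing", "v", "sang", 30),
    ("sing", "v", "sung", 10),
    ("sing", "v", "singing", 40),
    ("run", "v", "run", 60),
    ("run", "v", "ran", 25),
    ("run", "n", "run", 15),
    ("run", "n", "runs", 5),
]


def build_hierarchy(rows):
    """Group rows as lemma -> part of speech -> word form -> frequency."""
    tree = defaultdict(lambda: defaultdict(dict))
    for lemma, pos, form, freq in rows:
        forms = tree[lemma][pos]
        forms[form] = forms.get(form, 0) + freq
    return tree


def pos_total(tree, lemma, pos):
    """Sum the frequencies of all word forms of one lemma/POS pair."""
    return sum(tree[lemma][pos].values())


def lemma_total(tree, lemma):
    """Sum the frequencies of a lemma across all of its parts of speech."""
    return sum(pos_total(tree, lemma, pos) for pos in list(tree[lemma]))


def print_hierarchy(tree):
    """Print lemmas by descending total, then POS, then word forms."""
    lemmas = sorted(tree, key=lambda l: lemma_total(tree, l), reverse=True)
    for lemma in lemmas:
        print(f"{lemma}\t{lemma_total(tree, lemma)}")
        for pos in sorted(tree[lemma]):
            print(f"  {lemma} ({pos})\t{pos_total(tree, lemma, pos)}")
            by_freq = sorted(tree[lemma][pos].items(), key=lambda kv: -kv[1])
            for form, freq in by_freq:
                print(f"    {form}\t{freq}")


if __name__ == "__main__":
    print_hierarchy(build_hierarchy(ROWS))
```

The nested dictionary keeps the three levels separate, so a total at any level (lemma, lemma plus POS, or single word form) is a simple sum over the level below it.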


Interested users are able to purchase these frequency lists. The one complication, however, is that in 2009 I will be publishing a frequency dictionary of the top 5000 lemmas of English, based on this list. As a result, we need to control the distribution of the frequency data before late 2009 (so that the data is not used for a product that might compete with this frequency dictionary), and users will therefore need to sign a non-disclosure agreement.

Until the frequency dictionary is published in late 2009, the prices are as follows. If you are interested in purchasing one of these lists or if you have other questions, please contact us.

Price ($)                 Frequency data
Academic    Commercial
750         1500          Top 20,000 lemmas, with information on rank order, frequency, lemma, and part of speech [See example]
1250        2500          Top 60,000 lemmas, with information on rank order, frequency, lemma, and part of speech.
250         500           (Additional $) Same as either of the above, but with frequency information for each word in each of the five main genres (spoken, fiction, popular magazines, newspapers, academic journals). [See example]
250         500           (Additional $) Same as either of the above, but with each form for each lemma (e.g. separate lines for sing, sings, sang, sung, singing -- with their associated frequencies -- under the entry sing (v)). [See examples: Example 1 / Example 2]