English-Corpora.org


Note: after clicking on any of the links below, click RETURN in the upper right-hand corner to return to this page.

Note: these help files are for the older interface, which was used before May 2016. If you are using the newer interface, the layout will be slightly different, but the functionality is the same.

The Wikipedia corpus from English-Corpora.org, released in early 2015, contains 1.9 billion words in 4.4 million web pages. You can search the entire corpus with the same types of queries as the other corpora from English-Corpora.org.

More importantly, though, you can also quickly and easily create "virtual" corpora "on the fly" for any topic that you want, such as:
     biology, investments, Buddhism, psychology, cars, basketball.
The topics can be as narrow as you want; a virtual corpus might contain just 5-10 Wikipedia pages.
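As a rough illustration of what a "virtual" corpus is, the sketch below selects a subset of pages whose title or body text contains a given word. This is not the site's implementation; `Page`, `build_virtual_corpus`, and the sample pages are hypothetical names and data invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """A hypothetical stand-in for one Wikipedia page in the corpus."""
    title: str
    text: str


def build_virtual_corpus(pages, term, by="text"):
    """Return the pages whose title or body text contains `term` (case-insensitive)."""
    term = term.lower()
    if by == "title":
        return [p for p in pages if term in p.title.lower()]
    if by == "text":
        return [p for p in pages if term in p.text.lower()]
    raise ValueError("by must be 'title' or 'text'")


# Invented sample data, only to show the difference between the two modes.
pages = [
    Page("Buddhism", "Buddhism is a religion and philosophy that originated in India."),
    Page("Stock market", "A stock market is a place where investors buy and sell shares."),
    Page("Zen", "Zen is a school of Mahayana Buddhism."),
]

by_title = build_virtual_corpus(pages, "buddhism", by="title")
by_text = build_virtual_corpus(pages, "buddhism", by="text")

print([p.title for p in by_title])  # ['Buddhism']
print([p.title for p in by_text])   # ['Buddhism', 'Zen']
```

Searching by title gives a smaller, more focused set, while searching by words in the text also picks up related pages such as "Zen" in this toy data.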

Once you have created these corpora via the web interface, you can then quickly and easily search in the corpora. First, you can find keywords, such as nouns in:
    biology, investments, Buddhism, psychology, cars, or basketball (overall frequency)
    biology, investments, Buddhism, psychology, cars, or basketball (more specific words for these corpora)
Of course, you can search for other words too, such as verbs in Buddhism, adjectives in biology, or noun+noun in investments.
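To make the idea of "more specific words" concrete, here is a minimal sketch that ranks words by how much more frequent they are in a virtual corpus than in a reference corpus. The scoring measure that the site actually uses is not described here; the smoothed frequency ratio, the function name `specificity`, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def specificity(sub_tokens, ref_tokens, min_count=2):
    """Rank words by (relative frequency in sub-corpus) / (smoothed relative frequency in reference)."""
    sub = Counter(sub_tokens)
    ref = Counter(ref_tokens)
    sub_total = sum(sub.values())
    ref_total = sum(ref.values())
    scores = {}
    for word, count in sub.items():
        if count < min_count:
            continue  # skip rare words, whose ratios are unreliable
        sub_rate = count / sub_total
        # Add-one smoothing so words absent from the reference do not divide by zero.
        ref_rate = (ref[word] + 1) / (ref_total + 1)
        scores[word] = sub_rate / ref_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data: an "investments" sub-corpus and a general reference corpus.
sub_tokens = "the market fell as the bond market rallied".split()
ref_tokens = "the cat sat on the mat and the dog saw the market".split()

ranked = specificity(sub_tokens, ref_tokens)
print([word for word, _ in ranked])  # ['market', 'the']
```

An overall frequency list would put "the" and "market" level in this toy data, whereas the ratio pushes "market" to the top because it is comparatively rare in the reference corpus.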

In addition to finding keywords, you can also search within your virtual corpora, such as matching words (e.g. financ*), strings of words (e.g. market + NOUN), collocates (e.g. of market), and concordance lines (e.g. for market). (All of these examples are from the investments corpus, but you can do the same searches in any corpus you create.)
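The search types above can be sketched on a plain list of tokens. This is only an illustration of the concepts, not the corpus interface itself: the function names `match_words`, `collocates`, and `concordance` and the sample sentence are hypothetical. String searches such as market + NOUN are left out because they need part-of-speech tagged text.

```python
import fnmatch
from collections import Counter

# Invented sample text standing in for an "investments" corpus.
tokens = ("the financial market rose while the bond market fell and "
          "investors financed new market research").split()


def match_words(tokens, pattern):
    """Frequency of word forms matching a wildcard pattern such as 'financ*'."""
    return Counter(t.lower() for t in tokens if fnmatch.fnmatchcase(t.lower(), pattern))


def collocates(tokens, node, span=2):
    """Count the words that occur within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, t in enumerate(tokens):
        if t.lower() == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w.lower() for w in window)
    return counts


def concordance(tokens, node, span=3):
    """Return keyword-in-context lines as (left context, node, right context) tuples."""
    lines = []
    for i, t in enumerate(tokens):
        if t.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, t, right))
    return lines


print(match_words(tokens, "financ*"))
print(collocates(tokens, "market").most_common(1))  # [('the', 2)]
for left, node, right in concordance(tokens, "market"):
    print(f"{left:>20} | {node} | {right}")
```

Sorting the concordance tuples by their left or right context gives the kind of re-sortable display that makes recurring patterns around a word easier to see.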

There are a number of tutorials for the corpus on YouTube (* = alternate site, if YouTube is not accessible in your country):

General topic                         Length  Individual topics

Overview *                            8:59    - Creating virtual corpora
                                              - Finding keywords in your corpora
                                              - Basic searches in your corpus (frequency, strings, collocates, concordances)
                                              - Editing and managing your corpora

Creating virtual corpora: basic *     2:57    - Creating corpora by word or phrase in the Wikipedia article
                                              - Creating corpora by the title of the Wikipedia article

Finding keywords in your corpus *     4:55    - Frequency listing of corpus by part of speech (noun, verb, adjective, adverb)
                                              - Frequency listing by multi-word expression (Noun+Noun, Adj+Noun)
                                              - Finding words that are more specific to your corpus

Searching within your corpus *        5:58    - Frequency listings (substrings)
                                              - String search, e.g. market + NOUN
                                              - Collocates (nearby words); useful insight into meaning and usage of word
                                              - Concordance lines (re-sortable); see the patterns in which a word occurs

Comparing across corpora *            4:33    - Finding the frequency in the different corpora that you've created
                                              - Example: the frequency of words for "obedience" in different religions
                                              - Example: the frequency of the word gods in different religions
                                              - Comparing concordance lines, e.g. stress in engineering and psychology

Managing your corpora *               3:04    - Deleting your virtual corpora
                                              - "Hiding" or "ignoring" corpora (without completely deleting them)
                                              - Renaming corpora
                                              - Grouping virtual corpora by topic (e.g. science or finance)

Editing your corpora *                7:24    - Deleting individual pages from a corpus
                                              - Deleting pages from your corpus from concordance lines
                                              - Moving pages from one corpus to another
                                              - Adding pages from one corpus to another
                                              - Searching for words and then adding multiple pages to an existing corpus

Creating virtual corpora: advanced *  6:53    - Comparison of searching by words in text and searching by title
                                              - When searching by title is better than searching by words in text
                                              - When searching by title (alone) may not be enough
                                              - By title: adding words that are not in the title
                                              - By title: adding words that are or are not in the text of the page
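
For the "Comparing across corpora" topic listed above, raw counts are hard to compare when virtual corpora differ in size, so a per-million rate is a common way to put them on the same scale. The sketch below assumes that normalization; `per_million` and the two tiny sample corpora are hypothetical and are not taken from the site.

```python
def per_million(tokens, word):
    """Occurrences of `word` per million tokens (0.0 for an empty corpus)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() == word)
    return hits * 1_000_000 / len(tokens)


# Invented ten-word "corpora", only to show the calculation.
corpora = {
    "engineering": "stress on the beam exceeds the yield stress of steel".split(),
    "psychology": "chronic stress affects memory and mood in many adults today".split(),
}

rates = {name: per_million(toks, "stress") for name, toks in corpora.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} per million")
```

With real virtual corpora the token lists would be far larger, but the comparison works the same way: the higher rate shows where the word is relatively more frequent.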