corpus.byu.edu

Note: click RETURN in the upper right-hand corner to return to this page, after clicking on any of the links below.

Note: these help files are for the older interface, which was used before May 2016. If you are using the newer interface, the layout will be slightly different, but the functionality is the same.

The BYU Wikipedia corpus, which was released in early 2015, was created by Mark Davies (professor of linguistics at Brigham Young University). It contains 1.9 billion words in 4.4 million web pages, and you can search the entire corpus with the same type of queries as the other BYU corpora.

More importantly, though, you can also quickly and easily create "virtual" corpora "on the fly" for any topic that you want, such as:
     biology, investments, Buddhism, psychology, cars, basketball.
The topics can be as narrow as you want; a corpus can consist of as few as 5-10 Wikipedia pages.
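As a rough illustration of what a "virtual" corpus is, the following Python sketch selects pages from a tiny in-memory collection by a word in the text or in the title. The `Page` type, the sample pages, and the `build_virtual_corpus` function are all hypothetical and invented for this example; this is not how the web interface is implemented, only a sketch of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    title: str
    text: str


# Toy stand-ins for Wikipedia articles (invented for illustration).
PAGES = [
    Page("Buddhism", "Buddhism is a religion based on the teachings of the Buddha."),
    Page("Stock market", "A stock market is where investors buy and sell shares."),
    Page("Bond (finance)", "A bond is a debt security; investors lend money to the issuer."),
    Page("Basketball", "Basketball is a team sport played on a court."),
]


def build_virtual_corpus(pages, word, field="text"):
    """Return the pages whose title or text contains `word` (case-insensitive)."""
    if field not in ("text", "title"):
        raise ValueError("field must be 'text' or 'title'")
    needle = word.lower()
    return [p for p in pages if needle in getattr(p, field).lower()]


# Select by a word in the article text, or by a word in the title.
investments = build_virtual_corpus(PAGES, "investors", field="text")
by_title = build_virtual_corpus(PAGES, "basketball", field="title")
print([p.title for p in investments])
print([p.title for p in by_title])
```

A corpus built this way is just a list of pages, which is why it can be as small as a handful of articles.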

Once you have created these corpora via the web interface, you can then quickly and easily search in the corpora. First, you can find keywords, such as nouns in:
    biology, investments, Buddhism, psychology, cars, or basketball (overall frequency)
    biology, investments, Buddhism, psychology, cars, or basketball (more specific words for these corpora)
You can also search for other kinds of words, such as verbs in Buddhism, adjectives in biology, or noun+noun in investments.
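The difference between the two keyword lists (overall frequency versus words that are more specific to a corpus) can be sketched in Python. The scoring below is a simple ratio of relative frequencies with add-one smoothing, chosen here as an assumption for illustration; the measure the site actually uses is not described on this page. The function names and sample texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_scores(corpus_texts, reference_texts, min_count=2):
    """Return (raw counts, specificity scores) for words in the corpus."""
    corpus = Counter(t for text in corpus_texts for t in tokenize(text))
    reference = Counter(t for text in reference_texts for t in tokenize(text))
    corpus_total = sum(corpus.values())
    reference_total = sum(reference.values())
    scores = {}
    for word, count in corpus.items():
        if count < min_count:
            continue  # skip rare words, which give unstable ratios
        corpus_rate = count / corpus_total
        # Add-one smoothing so words absent from the reference do not divide by zero.
        reference_rate = (reference[word] + 1) / (reference_total + 1)
        scores[word] = corpus_rate / reference_rate
    return corpus, scores


corpus_texts = ["the market fell as the bond market rallied", "the market rose"]
reference_texts = ["the cat sat on the mat", "the dog chased the cat and the bird"]
counts, scores = keyword_scores(corpus_texts, reference_texts)
```

In this toy data "the" and "market" are equally frequent in the corpus, but only "market" scores well above 1, because "the" is just as common in the reference texts.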

In addition to finding keywords, you can also search within your virtual corpora for matching words (e.g. financ*), strings of words (e.g. market + NOUN), collocates (e.g. of market), and concordance lines (e.g. for market). (All of these examples are from the investments corpus, but you can run the same searches in any corpus you create.)
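The search types named above (wildcard matching, collocates, and concordance lines) can be approximated with a short Python sketch over a list of tokens. The sample text and the function names are hypothetical, and the sketch does not cover part-of-speech searches such as market + NOUN, which require tagged text.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

TEXT = ("The stock market fell sharply. Analysts said the market may recover "
        "after the financial report. Financing costs rose across the market.")
TOKENS = re.findall(r"[a-z]+", TEXT.lower())


def match_words(tokens, pattern):
    """Count tokens matching a wildcard pattern such as 'financ*'."""
    return Counter(t for t in tokens if fnmatchcase(t, pattern))


def collocates(tokens, node, span=2):
    """Count the words within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])   # left context
            counts.update(tokens[i + 1:i + 1 + span])   # right context
    return counts


def concordance(tokens, node, span=3):
    """Return keyword-in-context lines with the node word in brackets."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


print(match_words(TOKENS, "financ*"))
print(collocates(TOKENS, "market").most_common(3))
for line in concordance(TOKENS, "market"):
    print(line)
```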

There are a number of tutorials for the corpus on YouTube (* = alternate site, if YouTube is not accessible in your country):

General topic (length): individual topics

Overview * (8:59)
- Creating virtual corpora
- Finding keywords in your corpora
- Basic searches in your corpus (frequency, strings, collocates, concordances)
- Editing and managing your corpora

Creating virtual corpora: basic * (2:57)
- Creating corpora by word or phrase in the Wikipedia article
- Creating corpora by the title of the Wikipedia article

Finding keywords in your corpus * (4:55)
- Frequency listing of corpus by part of speech (noun, verb, adjective, adverb)
- Frequency listing by multi-word expression (Noun+Noun, Adj+Noun)
- Finding words that are more specific to your corpus

Searching within your corpus * (5:58)
- Frequency listings (substrings)
- String search, e.g. market + NOUN
- Collocates (nearby words); useful insight into meaning and usage of word
- Concordance lines (re-sortable); see the patterns in which a word occurs

Comparing across corpora * (4:33)
- Finding the frequency in the different corpora that you've created
- Example: the frequency of words for "obedience" in different religions
- Example: the frequency of the word gods in different religions
- Comparing concordance lines, e.g. stress in engineering and psychology

Managing your corpora * (3:04)
- Deleting your virtual corpora
- "Hiding" or "ignoring" corpora (without completely deleting them)
- Renaming corpora
- Grouping virtual corpora by topic (e.g. science or finance)

Editing your corpora * (7:24)
- Deleting individual pages from a corpus
- Deleting pages from your corpus from concordance lines
- Moving pages from one corpus to another
- Adding pages from one corpus to another
- Searching for words and then adding multiple pages to an existing corpus

Creating virtual corpora: advanced * (6:53)
- Comparison of searching by words in text and searching by title
- When searching by title is better than searching by words in text
- When searching by title (alone) may not be enough
- By title: adding words that are not in the title
- By title: adding words that are or are not in the text of the page
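
The tutorial on comparing across corpora covers finding the frequency of a word in different corpora that you have created. Because virtual corpora differ in size, raw counts are hard to compare directly; a common approach, assumed here for illustration rather than taken from the site, is to normalize to a rate per million words. The function name and the two sample texts below are hypothetical.

```python
import re


def per_million(word, text):
    """Return the frequency of `word` per million tokens of `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens) * 1_000_000


# Tiny invented samples standing in for two virtual corpora.
corpora = {
    "engineering": "stress on the beam exceeds the yield stress",
    "psychology": "chronic stress affects memory and mood",
}
rates = {name: per_million("stress", text) for name, text in corpora.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} per million")
```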