BYU Corpus of American English

The freely searchable Corpus of Contemporary American English (COCA), at more than 385 million words, is the largest corpus of American English currently available, and the only publicly available corpus of American English to contain a wide array of texts from a number of genres. In addition, because new texts are added at least twice a year (about 20 million new words each year), it will serve as a unique linguistic history of American English since 1990.

Content

The corpus is composed of more than 385 million words in more than 150,000 texts, including 20 million words for each year from 1990 to 2008. For each year (and therefore overall as well), the corpus is evenly divided among the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

  • Spoken: (79 million words) Transcripts of unscripted conversation from nearly 150 different TV and radio programs.
  • Fiction: (75 million words) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first-edition books from 1990 to the present, and movie scripts.
  • Popular Magazines: (81 million words) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.).
  • Newspapers: (76 million words) Ten newspapers from across the US, with a good mix between different sections of the newspapers, such as local news, opinion, sports, financial, etc.
  • Academic Journals: (76 million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system.
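
As a quick arithmetic check on the figures listed above, the following minimal sketch sums the per-genre word counts and computes each genre's approximate share. The variable names are illustrative only, and the inputs are the rounded figures given in this article, so the results are approximate.

```python
# Genre sizes in millions of words, copied from the list above (rounded figures).
genre_millions = {
    "spoken": 79,
    "fiction": 75,
    "popular magazines": 81,
    "newspapers": 76,
    "academic journals": 76,
}

# Overall size implied by the per-genre figures.
total = sum(genre_millions.values())

# Percentage share of each genre, rounded to one decimal place.
shares = {genre: round(100 * n / total, 1) for genre, n in genre_millions.items()}

# Rough cross-check: about 20 million words per year, 1990 through 2008 inclusive.
years = 2008 - 1990 + 1
approx_from_years = 20 * years

print(f"total from genres: {total} million words")
print(f"estimate from yearly additions: {approx_from_years} million words")
for genre, pct in shares.items():
    print(f"{genre}: {pct}%")
```

Both figures land close to the "more than 385 million words" stated for the corpus, and each genre comes out at roughly one fifth of the total, in line with the even division described above.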

Queries

  • The interface is the same as the BYU-BNC interface for the 101 million word British National Corpus and the 100 million word TIME Magazine corpus
  • Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms (see below), and customized lists (see below)
  • The corpus is tagged by CLAWS, the same tagger that was used for the BNC and the TIME corpus
  • Chart listings (totals for all matching forms in each genre or year, 1990-present, as well as for sub-genres) and table listings (frequency for each matching form in each genre or year)
  • Full collocates searching (up to ten words left and right of node word)
  • Comparisons between genres or time periods (e.g. collocates of 'chair' in fiction or academic, nouns with 'break the [N]' in newspapers or academic, adjectives that occur primarily in sports magazines, or verbs that are more common in 2004-2008 than previously)
  • One-step comparisons of collocates of related words, to study semantic or cultural differences between words (e.g. comparison of collocates of 'small' and 'little', or 'Democrats' and 'Republicans', or 'men' and 'women', or 'rob' vs 'steal')
  • Users can include semantic information from a 60,000 entry thesaurus directly as part of the query syntax (e.g. frequency and distribution of synonyms of 'beautiful', synonyms of 'strong' occurring in fiction but not academic, synonyms of 'clean' + noun ('clean the floor', 'washed the dishes'))
  • Users can also create their own 'customized' word lists, and then re-use these as part of subsequent queries (e.g. lists related to a particular semantic category (clothes, foods, emotions), or a user-defined part of speech)
  • Note that the corpus is only available through the web interface, due to copyright restrictions.
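
To illustrate what a collocates search and a comparison of collocates involve, here is a minimal sketch on a made-up sentence. It is not the corpus interface or its query syntax: the function `collocates`, the `span` parameter, and the sample text are all hypothetical, and since the corpus itself is available only through the web interface, the sketch runs on toy data rather than corpus data.

```python
import re
from collections import Counter


def collocates(text, node, span=10):
    """Count the words found within `span` tokens to the left or right of `node`.

    Hypothetical helper for illustration; the default window of ten words on
    each side mirrors the limit mentioned in the list above.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


# Made-up sample text, not taken from the corpus.
sample = "The small town had a small budget but the little boy had a little dog"

# A narrow window of one word on each side keeps the toy output easy to read.
small = collocates(sample, "small", span=1)
little = collocates(sample, "little", span=1)

# Collocates that appear with one word but not the other.
only_small = set(small) - set(little)
only_little = set(little) - set(small)

print(sorted(only_small))
print(sorted(only_little))
```

On real data, the same idea of contrasting two collocate lists is what makes comparisons such as 'small' versus 'little' informative; a fuller treatment would also weigh frequencies rather than only presence or absence.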

References

  • Davies, Mark (2008), "Relational databases as a robust architecture for the analysis of word frequency". In AHRC ICT Methods Network: Expert Seminar on Linguistics: Word Frequency and Keyword Extraction, ed. Dawn Archer. Ashgate.
  • Davies, Mark (2005), "The advantage of using relational databases for large corpora: speed, advanced queries, and unlimited annotation". International Journal of Corpus Linguistics 10: 301-28.

Wikipedia content modification information:

  • This page was last modified on 28 October 2008, at 21:28.

Wikipedia Authorship and Review

Wikipedia content provided here is not reviewed directly by MedLibrary.org. Wikipedia content is authored by an open community of volunteers and is not produced by or in any way affiliated with MedLibrary.org.

Wikipedia Usage Guidelines

This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article on "BYU Corpus of American English".
