Corpus Linguistics
Categories
Chomsky@ (8)
Computers and Language@ (265)
Dictionaries Thesauri and Reference@ (116)
Discourse Analysis@ (10)
Ebonics@ (17)
Linguists- Homepages@ (14)
Modality@ (8)
Pragmatics@ (8)
Pronouns@ (2)
Standard English@ (5)
Teaching@ (30)
Teaching and Teacher Resources@ (363)
Technology in Education@ (33)
Links
American National Corpus
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.
When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). Beyond this, the corpus will include an additional component of potentially several hundred million words, chosen to provide both the broadest and largest selection of texts possible.
An online course in Corpus Linguistics (from the University of Lancaster)
Web pages to be used to supplement the book "Corpus Linguistics"
Bookmarks for Corpus-based Linguists
These annotated links (c. 1,000 of them) are meant mainly for linguists and language teachers who work with corpora, not computational linguists/NLP (natural language processing) people, so although the language-engineering-type links here are fairly extensive, they are not exhaustive (for such info, you'll have to look elsewhere).
Brown Corpus Manual
This Standard Corpus of Present-Day American English consists of 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961. So far as it has been possible to determine, the writers were native speakers of American English. Although all of the material first appeared in print in the year 1961, some of it was undoubtedly written earlier. However, no material known to be a second edition or reprint of earlier text has been included.
Centre for English Corpus Linguistics
The UCL Centre for English Corpus Linguistics (CECL) is a specialist research centre with two core areas of research activity:
1. Computer learner corpus research
2. Cross-linguistic research
Computer Corpora - What relevance do they have for ELT?
Some examples of ways you can use concordancers in ELT.
Corpus.Byu.Edu
The following are some of the corpora that have been created by Mark Davies, Professor of Corpus Linguistics at Brigham Young University.
Dialogue Diversity Corpus
DDC is intended to facilitate all varieties of research that require dialogues from multiple situations as data. For studies of dialogue dynamics, situational effects in dialogue, dialogue coherence, dialogue genre comparison, studies of role and status in dialogue and many other topics, very diverse dialogue data must be brought to bear on single studies.
Framenet
The Berkeley FrameNet project is a lexicon-building effort in which we (1) study words; (2) describe the frames or conceptual structures which underlie these; (3) examine sentences, using a very large corpus of contemporary English that contains these words; and (4) record the ways in which information from the associated frames is expressed in these sentences.
Google as a Quick 'n Dirty Corpus Tool
Both "very gorgeous" and "very wonderful" could be found in a Google search. (Petersen:[jalttalk 25029]) While this may be true, anyone can make a web page these days including non-native speakers and first graders. This would seem to make Google useless as a source for such answers—or is it?
Hong Kong Virtual Language Center's Concordancer
This concordancer contains a number of old but still popular corpora like LOB and Brown and provides a fuller context for searched items. It can also display up to 1500 instances of searched items.
Michigan Corpus of Academic Spoken English
Welcome to the on-line, searchable part of our collection of transcripts of academic speech events recorded at the University of Michigan.
There are currently 152 transcripts (totaling 1,848,364 words) available at this site.
Natural Language and Computational Linguistics
The Natural Language and Computational Linguistics (NLCL) group (part of the Department of Informatics at the University of Sussex) is one of the largest groups in the UK of researchers focusing on statistical and corpus-based approaches to natural language processing.
Online Concordancers
A concordance gives a list of several words, phrases, or distributed structures along with immediate contexts, from a corpus or other collection of texts assembled for language study.
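As a rough illustration of what a concordancer produces, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not taken from any of the tools listed on this page; the function name `kwic`, the `width` parameter, the simple letters-and-apostrophes tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context rows: (left context, match, right context).

    `width` is the number of tokens kept on each side (an illustrative choice).
    """
    # Naive tokenizer: runs of letters and apostrophes only.
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Made-up sample text standing in for a real corpus.
sample = ("The corpus is large. A corpus of spoken English "
          "differs from a corpus of written English.")

for left, word, right in kwic(sample, "corpus"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {word} | {right}")
```

Aligning the keyword in a central column is the conventional concordance layout; a real concordancer would also handle punctuation, sentence boundaries, and much larger texts.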
Variation in English words and phrases
This website allows you to quickly and easily search for a wide range of words and phrases of English in the 100 million word British National Corpus. You can search for words and phrases by exact word or phrase, wildcard or part of speech, or combinations of these.
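The wildcard style of search described above can be sketched in a few lines of Python. This is only an illustration over a made-up sentence, not the site's actual query engine or syntax; the function name `find_phrases` and the use of shell-style `*` wildcards via the standard `fnmatch` module are assumptions. Part-of-speech search is left out because it would need a tagged corpus.

```python
import re
from fnmatch import fnmatchcase

def find_phrases(text, pattern):
    """Find token sequences matching a space-separated wildcard pattern.

    Each word of `pattern` is matched against one token; `*` matches any
    run of characters, as in shell-style globbing.
    """
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    parts = pattern.lower().split()
    n = len(parts)
    hits = []
    # Slide a window of n tokens across the text and test each position.
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(fnmatchcase(tok, pat) for tok, pat in zip(window, parts)):
            hits.append(" ".join(window))
    return hits

# Made-up sample sentence standing in for corpus text.
sample = "She was very happy, very tired and very nearly late."

print(find_phrases(sample, "very *ly"))
print(find_phrases(sample, "very *"))
```

The first query keeps only "very" followed by a word ending in "-ly"; the second returns every word that follows "very".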
WebCorp: The Web as Corpus
WebCorp is a suite of tools which allows access to the World Wide Web as a corpus - a large collection of texts from which facts about the language can be extracted.
What is Computational Linguistics?
Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition. Computational linguistics has applied and theoretical components.
WordSmith Tools
WordSmith Tools is lexical analysis software for the PC, published by Oxford University Press since 1996 and now at version 4.0.