The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.
When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). Beyond this, the corpus will include an additional component of potentially several hundred million words, chosen to provide both the broadest and largest selection of texts possible.
This Standard Corpus of Present-Day American English consists of 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961. So far as it has been possible to determine, the writers were native speakers of American English. Although all of the material first appeared in print in the year 1961, some of it was undoubtedly written earlier. However, no material known to be a second edition or reprint of earlier text has been included.
DDC is intended to facilitate all varieties of research that require dialogues from multiple situations as data. For studies of dialogue dynamics, situational effects in dialogue, dialogue coherence, dialogue genre comparison, studies of role and status in dialogue and many other topics, very diverse dialogue data must be brought to bear on single studies.
The Berkeley FrameNet project is a lexicon-building effort in which we (1) study words; (2) describe the frames or conceptual structures which underlie these; (3) examine sentences, using a very large corpus of contemporary English that contains these words; and (4) record the ways in which information from the associated frames is expressed in these sentences.
Welcome to the on-line, searchable part of our collection of transcripts of academic speech events recorded at the University of Michigan.
There are currently 152 transcripts (totaling 1,848,364 words) available at this site.
The Natural Language and Computational Linguistics (NLCL) group (part of the Department of Informatics at the University of Sussex) is one of the largest groups of researchers in the UK focusing on statistical and corpus-based approaches to natural language processing.
This website allows you to quickly and easily search for a wide range of words and phrases of English in the 100-million-word British National Corpus. You can search by exact word or phrase, wildcard or part of speech, or combinations of these.
Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition. Computational linguistics has applied and theoretical components.