Bookmarks for Corpus-based Linguists

On-line Study Resources/Syllabi | Help with Statistics | Helpdesks/FAQs | Glossaries | E-mail Discussion Lists | Introductory NLP/Computational Linguistics | Courses in Natural Language Processing (NLP) & Programming | Standards and Special Interest Groups (SIGs) in Corpus-Building and NLP

Courses, FAQs, Info, E-lists, Standards


On-line Study Resources on Corpus-based Linguistics

(& humanities computing) See also my page on References, Papers, Journals (use the left-frame menu)

Corpus Linguistics by McEnery & Wilson (Lancaster University)

an introductory course on corpus-based linguistics; companion site to the textbook. For errata, click here.

W3-Corpora Project’s pages on CBL

introductory tutorial and information on corpora and CBL; + search engine for a number of corpora

Introduction to the Use of Computer Corpora in Linguistics

Susan Hockey’s very basic, foundational tutorial for absolute beginners

* Course Outlines/Syllabi

Berber-Sardinha’s Corpus Linguistics Course

course outline and bibliography (some parts in Brazilian Portuguese)

Hongyin Tao’s Seminar in Corpus Linguistics

Course outline and lecture notes (UCLA)

On-line Help with Statistics & on-line tools

SISA Statistical Computation Web Site

allows you to do statistical analysis directly on the Internet. Choose a statistical procedure, fill in the form, click the button, and the analysis will take place on the spot. Has guides to which statistical procedure is appropriate for your specific problem.

VassarStats Statistical Computation Web Site

Similar to the SISA site (above). Handy on-line stats calculators (user-friendly tools for performing statistical computations) + statistical tables calculator

Brigitte Krenn’s and Christer Samuelsson’s The Linguist’s Guide to Statistics [Postscript file]

*Despite the title, it’s more for NLP people than linguists.* Basic probability theory, information theory and stochastic grammars. A downloadable book (Postscript format: [Help with file formats here]).

Joakim Nivre’s web course on statistical NLP

based on the Krenn & Samuelsson book (see above). More for NLP people than linguists.

See also the link to the Log-likelihood Wizard (a better alternative to chi-squared).
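
For readers who want to see what a log-likelihood comparison involves, here is a minimal sketch of one common two-term form of the statistic (G2) for comparing a word's frequency in two corpora. It is not taken from the Wizard itself; the function name and the example counts are assumptions made for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """G2 for a word seen freq_a times in a corpus of size_a tokens
    and freq_b times in a corpus of size_b tokens (hypothetical helper)."""
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Made-up counts: 20 hits in 1,000 tokens versus 5 hits in 1,000 tokens.
score = log_likelihood(20, 1000, 5, 1000)
```

With one degree of freedom, a score above 3.84 is conventionally read as significant at p < 0.05, the same critical value used for chi-squared.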


Helpdesks/FAQs (Frequently Asked Questions) / Info on CBL, corpora, concordancing, NLP

Concordances: Producing and Using them

Copious personal comments on concordancers and concordancing, including a comparison of various programs.

An introduction to concordancing

"What can the computer show us?" from the TACTweb Online Workbook

The history of computer assisted text analysis & concordancing

from the TACTweb Online Workbook
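
As a rough illustration of what a concordancer does, the sketch below prints keyword-in-context (KWIC) lines with plain Python. It is a toy under stated assumptions (character-based context, whole-word matching, hypothetical function and parameter names), not a description of how any of the programs discussed on these pages work.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.
    `width` is the number of characters of context kept on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = "The cat sat on the mat. The dog saw the cat."
for line in kwic(sample, "cat", width=12):
    print(line)
```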

A Survey of Electronic Corpora and Related Resources for Language Researchers

rather dated, but historically useful survey of electronic resources up until the early 90s. Electronic version of Chapter 10 (pp. 263-310) of the following book: Edwards, Jane A. & Martin D. Lampert (eds). (1993) Talking Data: transcription and coding in discourse research. London and Hillsdale, NJ: Erlbaum.

Help with recording and transcribing a spoken corpus

Some tips from the people at Cornell on getting high-quality recordings, and on how to code/mark-up and transcribe.

Glossaries for Corpus-based linguistics and NLP

A Glossary of Corpus-based Linguistics & NLP (at Mannheim)

a handy on-line glossary of terminology and concepts used in the field

Systematic Dictionary of Corpus Linguistics (Lithuania)

a slim glossary explaining basic terms and concepts in corpus-based linguistics and NLP/language engineering

Glossary of Corpus Linguistics (W3C site)

a short and rather impoverished glossary

For language teachers new to computing and the internet, have a look at the ICT4LT glossary of terms specific to ICT, CALL and language learning/teaching.


Electronic/E-Mail Discussion Lists

CORPORA list

the main discussion list for corpus-based linguists and NLP (natural language processing) researchers (warning: tends to be dominated by the latter); this link is to the hypertext archive (see the info page on how to sign up)

OR try the SIGLEX index of this discussion list (Selected messages to the CORPORA mailing list have been categorized and links to the threads have been provided. The categorization is based on a SIGLEX ontology. The links have been generated automatically based on subject, the date, and the sender. The links include only the years 1997 to the present. Before 2000, the CORPORA archive is not threaded)

CLLT (Corpus Linguistics and Language Teaching)

largely replicates the CORPORA list, but is friendlier towards those with more pedagogical and less computational proclivities

Linguist List

one of the oldest, and still one of the best; covers all aspects and branches of linguistics, including CBL (to a limited degree); has links to fonts, software, corpora (but not comprehensive, and rather dated). Warning: very active and very comprehensive, so watch as your mailbox clutters up with postings.

Humanist list

an international electronic discussion list on the application of computers to the humanities (i.e. history, literature, etc. rather than linguistics/language teaching); allied with the Association for Computers and the Humanities (ACH) and the Association for Literary and Linguistic Computing (ALLC).

Didn’t find the e-list you wanted? Search among the hundreds listed at the Linguist List list archives. Also relevant for TESL teachers is the TESL-CA list (the TESL and Technology Branch of TESL-L List...the 'CA' stands for 'computer-assisted').

On-line Introductory Books or Papers on NLP/Computational Linguistics

Data-Intensive Linguistics

Chris Brew’s unfinished manuscript for beginning students of computational linguistics/NLP. Some broken links and missing graphics, but informative nonetheless.

Information Theory Primer

primer written for molecular biologists who are unfamiliar with information theory.

A Mathematical Theory of Communication

Shannon’s original paper on information theory.
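
The central quantity in both the primer and Shannon's paper is entropy, H = -Σ p(x) log2 p(x), measured in bits. A minimal sketch, assuming probabilities are simply estimated from relative frequencies in a sample (the function name is hypothetical):

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits per symbol, with probabilities estimated
    from the relative frequencies in `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Two equally likely symbols carry one bit each.
bits = entropy("aabb")
```

A sequence made of a single repeated symbol has zero entropy, and four equally likely symbols give two bits per symbol.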

Courses in Natural Language Processing (NLP) & Programming

(a small sample of syllabi and course material)

Applied Computational Linguistics

a practically-oriented on-line virtual course/hypertextbook on computational linguistics, focussing on the application of NLP techniques to a specific project (the development of an intelligent dictionary look-up program for reading/translating web pages). Aimed at advanced undergraduate and graduate students. Based at the University of Tübingen.

Language and Statistics

CMU introductory course on computational linguistics (syllabus & notes)

LING 361 Intro to Computational Linguistics

at Georgetown University

LING-360 Perl Programming

at Georgetown University

LING 5200 Computational Methods in Linguistics

at the University of Colorado at Boulder

Statistical Natural Language Processing

web-based course in statistical natural language processing

Statistical Models in Natural-Language Processing

Eugene Charniak's course: statistical methods for learning a natural language and applying the knowledge to specific tasks.

UNIX Tutorial

for those who want to go beyond Windows

Unix for Linguists

some links to on-line Unix courses

Lectures on Tools for Corpus Linguistics

an introduction to processing corpora with UNIX tools and Perl programming

Corpus Mining: Perl and Python Programming (Jon Fernquest’s page)

Perl programming for teachers, translators, and writers, for creating language discovery tools.
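
To give a flavour of the kind of task these corpus-processing courses cover, here is a minimal word-frequency list. The courses above use UNIX tools and Perl; this sketch is in Python, and the crude tokenisation rule (runs of lowercase letters and apostrophes) and the function name are assumptions made for brevity.

```python
import re
from collections import Counter

def frequency_list(text, top=10):
    """Return the `top` most frequent word forms as (word, count) pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)

sample = "the cat and the dog and the bird"
for word, count in frequency_list(sample, top=3):
    print(f"{count:5d}  {word}")
```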

Standards and Special Interest Groups (SIGs) in Corpus-Building and NLP

Corpus Encoding Standard (CES)

(see also: XCES: XML Version of the CES)

for NLP people and larger-corpus builders; a set of encoding standards for corpus-based work and natural language processing applications

COCOSDA

The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques

established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing, especially for Speech Input/Output. COCOSDA supports the development of spoken language resources and speech technology evaluation. For the former, COCOSDA promotes the development of distinctive types of spoken language data corpora for the purpose of building and/or evaluating current or future spoken language technology. For the latter, COCOSDA offers coordination of projects and research efforts to improve their efficiency.

Dublin Core Metadata Initiative (DCMI)

an organization dedicated to promoting the widespread adoption of interoperable metadata standards and developing specialized metadata vocabularies for describing resources that enable more intelligent information discovery systems. Includes An XML Encoding of Simple Dublin Core Metadata, XML Schemas and RDF Schemas.
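
For a sense of what a simple Dublin Core record looks like in XML, here is a minimal sketch using Python's standard library. The namespace URI is the Dublin Core element set namespace; the wrapper element name `metadata`, the function name and the field values are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Serialise a dict of Dublin Core element names to values as XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

# Made-up description of a small corpus.
record = dublin_core_record({
    "title": "Sample Spoken Corpus",
    "language": "en",
    "type": "Dataset",
})
print(record)
```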

DISC Best Practice Guide

extends and specialises software engineering best practice to the particular purposes of dialogue engineering, that is, to the development and evaluation of spoken language dialogue systems (SLDSs). The Guide consists of the DISC Dialogue Engineering Model, a series of SLDS design support tools, state-of-the-art overviews and references to relevant sites and running SLDSs world-wide, a glossary, DISC publications, and Guide evaluation stuff for you to use.

EAGLES (Expert Advisory Group on Language Engineering Standards)

An initiative of the European Commission which aims to accelerate the provision of standards for: (i) Very large-scale language resources (such as text corpora, computational lexicons and speech corpora); (ii) Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools; (iii) Means of assessing and evaluating resources, tools and products.

International Standards for Language Engineering (ISLE)

(operates under the aegis of the EAGLES initiative)

aim is to develop human language technology (HLT) standards within an international framework, in the context of the EU-US International Research Cooperation initiative. There is increasing Asian interest in the initiative and in the relevance of standards in the field of HLT. Its objectives are to support national projects, HLT RTD projects and the language technology industry in general by developing, disseminating and promoting de facto HLT standards and guidelines for language resources, tools and products.

ISLE Metadata Initiative (IMDI)

(operates under the aegis of the EAGLES initiative)

goal is to propose a standard of meta-data descriptions of Multi-Media/Multi-Modal Language resources. Using such a standard it will become possible to create a browsable and searchable universe of such resources on the Internet. This will enable interested parties to efficiently locate suitable resources and thus increase their reusability.

LE-PAROLE (Language Engineering - Preparatory Action for Linguistic Resources Organization for Language Engineering)

The European PAROLE project aimed to create a large-scale harmonised set of "core" corpora and lexica for over a dozen West European languages in compliance with common guidelines. 14 Western European language groups participated in the PAROLE project. Language corpora and lexica were built according to the same design and composition principles, in the period 1996-1998. For each of these languages, the project has resulted in a 20-million-word text corpus composed according to similar design principles and TEI encoded according to the PAROLE DTD. 250,000 words are POS encoded. Another product of the PAROLE project is a set of harmonised lexica containing a minimum of 20,000 entries provided with morphosyntactic and syntactic information. More info on PAROLE here and here. Restricted access site here (hosted by the Istituto di Linguistica Computazionale). Detailed descriptions of PAROLE text corpora and lexica may be found in the ELDA catalogue here.

Multilevel Annotation, Tools Engineering (MATE)

The MATE project aims to facilitate re-use of language resources by addressing the problems of creating, acquiring, and maintaining language corpora. This is done through: (i) the development of a standard for annotating resources; (ii) the provision of tools which will make the processes of knowledge acquisition and extraction more efficient. Specifically, MATE will treat spoken dialogue corpora at multiple levels, focusing on prosody, (morpho-) syntax, co-reference, dialogue acts, and communicative difficulties, as well as inter-level interaction. The results of the project will be of particular benefit to developers of spoken language dialogue systems but will also be directly useful for other applications of language engineering.

Open Language Archives Community (OLAC)

A metadata initiative for language data and NLP tools

offers a standard format for describing/cataloguing corpora and tools in a search-engine-friendly way. See fuller description here

Resource Description Framework (RDF)

a framework for supporting resource description, or metadata (data about data), for the Web. RDF provides common structures that can be used for interoperable XML data exchange, and follows the W3C design principles of interoperability, evolution, and decentralization.

SIGDIAL (Special Interest Group on Discourse and Dialogue; affiliated with the Association for Computational Linguistics (ACL) and the International Speech Communication Association (ISCA))

Links on coding schemes, language resources, methods, tools, projects, references, conferences, etc. Aims to: promote development and distribution of reusable discourse processing components; explore techniques for evaluation of dialogue systems; share resources and data among the international community; encourage empirical methods in research; agree upon standards for discourse transcription, segmentation, and annotation; promote collaboration among developers of various dialogue system components; and support student participation in the discourse and dialogue community.

SIGLEX (Special Interest Group on the Lexicon)

an umbrella for a variety of research interests ranging from lexicography and the use of online dictionaries to computational lexical semantics.

Text Encoding Initiative (TEI)

TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent. The TEI’s Guidelines for Electronic Text Encoding and Interchange is here.

Unicode

Description from the Linguistic Annotation page: The Unicode Consortium brings together software corporations and researchers at the leading edge of standardizing international character encoding. The outcome of this cooperation is The Unicode Standard, which provides the foundation for internationalization and localization of software. Unicode has a conference series and an FAQ. There are character charts, including one for IPA extensions.
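
To check which characters in a transcription fall in the IPA Extensions block, Python's standard `unicodedata` module can help. This is a minimal sketch: the block range U+0250 to U+02AF is the published range for IPA Extensions, while the function name and sample string are assumptions for illustration.

```python
import unicodedata

# Code points of the IPA Extensions block (U+0250 to U+02AF inclusive).
IPA_EXTENSIONS = range(0x0250, 0x02B0)

def describe(text):
    """Return (code point, character name, in IPA Extensions?) per character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "UNKNOWN"),
            ord(ch) in IPA_EXTENSIONS,
        ))
    return rows

# Esh and schwa, written as escapes so the source stays ASCII-safe.
for row in describe("\u0283\u0259"):
    print(row)
```

Note that many symbols used in IPA transcription (ordinary Latin letters, for instance) live in other blocks, so a False in the last column does not mean a character is not valid IPA.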

XML/SGML standards and associated resources

(For XML software/tools, see the 'Software, Tools, Frequency Lists...' page)

The XML Cover Pages

including Academic Applications of XML

Edinburgh XML Research Dissemination Workshop

XML Markup Technologies for Working with Linguistic Data, 10-11 May 2001, Edinburgh.

XML Corpus Encoding Standard (XCES)

instantiates the EAGLES Corpus Encoding Standard (CES) DTDs for linguistic corpora, developed by the Department of Computer Science, Vassar College, and Equipe Langue et Dialogue, LORIA/CNRS.

Center for Electronic Texts in the Humanities (CETH)

has some info and links on SGML/XML


If you need help with file formats for some of the downloads, [click here]

Did you find this web site useful? Do let me know, to encourage me to keep updating the site.
