Courses, FAQs, Info, E-lists, Standards

On-line Study Resources on Corpus-based Linguistics (& humanities computing) See also my page on *References, Papers, Journals* (use the left-frame menu)
Corpus Linguistics by McEnery & Wilson (Lancaster University)	an introductory course on corpus-based linguistics; companion site to the textbook. For errata, click here.
W3-Corpora Project’s pages on CBL	introductory tutorial and information on corpora and CBL; + search engine for a number of corpora
Introduction to the Use of Computer Corpora in Linguistics	Susan Hockey’s very basic, foundational tutorial for absolute beginners
* Course Outlines/Syllabi
Berber-Sardinha’s Corpus Linguistics Course	course outline and bibliography (some parts in Brazilian Portuguese)
Hongyin Tao’s Seminar in Corpus Linguistics	Course outline and lecture notes (UCLA)

For notes on using concordancers in the language classroom and corpus-related courses for language teachers, see the section on CALL/CBL & Language Teaching
If you want to attend intensive courses on corpus-based methods, try the Tuscan Word Centre
The ITRI at Brighton, UK also conducts courses on corpus-based lexicography. In 2003, they also had a short course on "Corpora and Language Teaching".

On-line Help with Statistics & on-line tools
SISA Statistical Computation Web Site	allows you to do statistical analysis directly on the Internet. Choose a statistical procedure, fill in the form, click the button, and the analysis will take place on the spot. Has guides to which statistical procedure is appropriate for your specific problem.
VassarStats Statistical Computation Web Site	Similar to the SISA site (above). Handy on-line stats calculators (user-friendly tools for performing statistical computations) + statistical tables calculator
Brigitte Krenn’s and Christer Samuelsson’s The Linguist’s Guide to Statistics [Postscript file]	Despite the title, it’s more for NLP people than linguists. Basic probability theory, information theory and stochastic grammars. A downloadable book (Postscript format: [Help with file formats here]).
Joakim Nivre’s web course on statistical NLP	based on the Krenn & Samuelsson book (see above). More for NLP people than linguists.

See also the link to the Log-likelihood Wizard (a better alternative to chi-squared).
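
The log-likelihood statistic mentioned above is typically used to test whether a word is significantly more frequent in one corpus than in another. The following is a minimal sketch in Python, assuming the standard G² formulation over a 2x2 contingency table (word vs. all other words, corpus A vs. corpus B). It is not taken from the linked wizard, whose exact computation is not described here; all function and parameter names are illustrative. Pearson's chi-squared is included on the same table for comparison.

```python
import math


def contingency(freq_a, size_a, freq_b, size_b):
    """Build observed and expected 2x2 tables (illustrative helper).

    Rows are the two corpora; columns are 'the word' and 'all other words'.
    Expected counts assume the word is equally frequent in both corpora.
    """
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]
    expected = [
        [row_totals[i] * col_totals[j] / total for j in range(2)]
        for i in range(2)
    ]
    return observed, expected


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """G2 = 2 * sum(O * ln(O / E)) over the four cells; empty cells add 0."""
    observed, expected = contingency(freq_a, size_a, freq_b, size_b)
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                total += o * math.log(o / expected[i][j])
    return 2.0 * total


def chi_squared(freq_a, size_a, freq_b, size_b):
    """Pearson's chi-squared = sum((O - E)^2 / E) over the four cells."""
    observed, expected = contingency(freq_a, size_a, freq_b, size_b)
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = expected[i][j]
            if e > 0:
                total += (observed[i][j] - e) ** 2 / e
    return total


if __name__ == "__main__":
    # Hypothetical counts: a word seen 100 times in a 10,000-word corpus
    # and 10 times in another 10,000-word corpus.
    print("G2  :", log_likelihood(100, 10_000, 10, 10_000))
    print("chi2:", chi_squared(100, 10_000, 10, 10_000))
```

With one degree of freedom, either statistic is conventionally compared against 3.84 for significance at p < 0.05. The usual argument for preferring log-likelihood is that chi-squared becomes unreliable when expected counts are small, which is common for rare words.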

Helpdesks/FAQs (Frequently Asked Questions) / Info on CBL, corpora, concordancing, NLP
Concordances: Producing and Using them	Copious personal comments on concordancers and concordancing, including a comparison of various programs.
An introduction to concordancing	"What can the computer show us?" from the TACTweb Online Workbook
The history of computer assisted text analysis & concordancing	from the TACTweb Online Workbook
A Survey of Electronic Corpora and Related Resources for Language Researchers	rather dated, but historically useful survey of electronic resources up until the early 90s. Electronic version of Chapter 10 (pp. 263-310) of the following book: Edwards, Jane A. & Martin D. Lampert (eds). (1993) Talking Data: transcription and coding in discourse research. London and Hillsdale, NJ: Erlbaum.
Help with recording and transcribing a spoken corpus	Some tips from the people at Cornell on getting high-quality recordings, and on how to code/mark-up and transcribe.

Glossaries for Corpus-based linguistics and NLP
A Glossary of Corpus-based Linguistics & NLP (at Mannheim)	a handy on-line glossary of terminology and concepts used in the field
Systematic Dictionary of Corpus Linguistics (Lithuania)	a slim glossary explaining basic terms and concepts in corpus-based linguistics and NLP/language engineering
Glossary of Corpus Linguistics (W3C site)	a short and rather impoverished glossary

For language teachers new to computing and the internet, have a look at the ICT4LT glossary of terms specific to ICT, CALL and language learning/teaching.

Electronic/E-Mail Discussion Lists
CORPORA list	the main discussion list for corpus-based linguists and NLP (natural language processing) researchers (warning: tends to be dominated by the latter); this link is to the hypertext archive (see the info page on how to sign up) OR try the SIGLEX index of this discussion list (Selected messages to the CORPORA mailing list have been categorized and links to the threads have been provided. The categorization is based on a SIGLEX ontology. The links have been generated automatically based on subject, the date, and the sender. The links include only the years 1997 to the present. Before 2000, the CORPORA archive is not threaded)
CLLT (Corpus Linguistics and Language Teaching)	largely replicates the CORPORA list in many ways, but is more friendly towards those with more pedagogical and less computational proclivities
Linguist List	one of the oldest, and still one of the best; covers all aspects and branches of linguistics, including CBL (to a limited degree); has links to fonts, software, corpora (but not comprehensive, and rather dated). *Warning: very active and very comprehensive, so watch as your mailbox clutters up with postings.
Humanist list	an international electronic discussion list on the application of computers to the humanities (i.e. history, literature, etc. rather than linguistics/language teaching); allied with the Association for Computers and the Humanities (ACH) and the Association for Literary and Linguistic Computing (ALLC).

Didn’t find the e-list you wanted? Search among the hundreds listed at the Linguist List list archives. Also relevant for TESL teachers is the TESL-CA list (the TESL and Technology Branch of TESL-L List...the 'CA' stands for 'computer-assisted').

On-line Introductory Books or Papers on NLP/Computational Linguistics
Data-Intensive Linguistics	Chris Brew’s unfinished manuscript for beginning students of computational linguistics/NLP. Some broken links and missing graphics, but informative nonetheless.
Information Theory Primer	primer written for molecular biologists who are unfamiliar with information theory.
A Mathematical Theory of Communication	Shannon’s original paper on information theory.

Courses in Natural Language Processing (NLP) & Programming (a small sample of syllabi and course material)
Applied Computational Linguistics	a practically-oriented on-line virtual course/hypertextbook on computational linguistics, focussing on the application of NLP techniques to a specific project (the development of an intelligent dictionary look-up program for reading/translating web pages). Aimed at advanced undergraduate and graduate students. Based at the University of Tübingen
Language and Statistics	CMU introductory course on computational linguistics (syllabus & notes)
LING 361 Intro to Computational Linguistics	at Georgetown University
LING-360 Perl Programming	at Georgetown University
LING 5200 Computational Methods in Linguistics	at the University of Colorado at Boulder
Statistical Natural Language Processing	web-based course in statistical natural language processing
Statistical Models in Natural-Language Processing	Eugene Charniak's course: statistical methods for learning a natural language and applying the knowledge to specific tasks.
UNIX Tutorial	for those who want to go beyond Windows
Unix for Linguists	some links to on-line Unix courses
Lectures on Tools for Corpus Linguistics	an introduction to processing corpora with UNIX tools and Perl programming
Corpus Mining: Perl and Python Programming (Jon Fernquest’s page)	Perl programming for teachers, translators, and writers, for creating language discovery tools.

Standards and Special Interest Groups (SIGs) in Corpus-Building and NLP
Corpus Encoding Standard (CES) (see also: XCES: XML Version of the CES)	for NLP people and larger-corpus builders; a set of encoding standards for corpus-based work and natural language processing applications
COCOSDA The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques	established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing, especially for Speech Input/Output. COCOSDA supports the development of spoken language resources and speech technology evaluation. For the former, COCOSDA promotes the development of distinctive types of spoken language data corpora for the purpose of building and/or evaluating current or future spoken language technology. For the latter, COCOSDA offers coordination of projects and research efforts to improve their efficiency.
Dublin Core Metadata Initiative (DCMI)	an organization dedicated to promoting the widespread adoption of interoperable metadata standards and developing specialized metadata vocabularies for describing resources that enable more intelligent information discovery systems. Includes An XML Encoding of Simple Dublin Core Metadata, XMLS Schemas and RDF Schemas.
DISC Best Practice Guide	extends and specialises software engineering best practice to the particular purposes of dialogue engineering, that is, to the development and evaluation of spoken language dialogue systems (SLDSs). The Guide consists of the DISC Dialogue Engineering Model, a series of SLDS design support tools, state-of-the-art overviews and references to relevant sites and running SLDSs world-wide, a glossary, DISC publications, and Guide evaluation stuff for you to use.
EAGLES (Expert Advisory Group on Language Engineering Standards)	An initiative of the European Commission which aims to accelerate the provision of standards for: (i) Very large-scale language resources (such as text corpora, computational lexicons and speech corpora); (ii) Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools; (iii) Means of assessing and evaluating resources, tools and products.
International Standards for Language Engineering (ISLE) (operates under the aegis of the EAGLES initiative)	aim is to develop human language technology (HLT) standards within an international framework, in the context of the EU-US International Research Cooperation initiative. There is an increasing Asian interest for the initiative and the relevance of standards in the field of HLT. Its objectives are to support national projects, HLT RTD projects and the language technology industry in general by developing, disseminating and promoting de facto HLT standards and guidelines for language resources, tools and products.
ISLE Metadata Initiative (IMDI) (operates under the aegis of the EAGLES initiative)	goal is to propose a standard of meta-data descriptions of Multi-Media/Multi-Modal Language resources. Using such a standard it will become possible to create a browsable and searchable universe of such resources in the Internet. This will enable interested parties to efficiently locate suitable resources and thus increase their reusability.
LE-PAROLE (Language Engineering - Preparatory Action for Linguistic Resources Organization for Language Engineering)	The European PAROLE project aimed to create a large-scale harmonised set of "core" corpora and lexica for over a dozen West European languages in compliance with common guidelines. 14 Western European language groups participated in the PAROLE project. Language corpora and lexica were built according to the same design and composition principles, in the period 1996-1998. For each of these languages, the project has resulted in a 20-million-word text corpus composed according to similar design principles and TEI encoded according to the PAROLE DTD. 250,000 words are POS encoded. Another product of the PAROLE project is a set of harmonised lexica containing a minimum of 20,000 entries provided with morphosyntactic and syntactic information. More info on PAROLE here and here. Restricted access site here (hosted by the Istituto di Linguistica Computazionale). Detailed descriptions of PAROLE text corpora and lexica may be found in the ELDA catalogue here.
Multilevel Annotation, Tools Engineering (MATE)	The MATE project aims to facilitate re-use of language resources by addressing the problems of creating, acquiring, and maintaining language corpora. Done through: (i) the development of a standard for annotating resources; (ii) the provision of tools which will make the processes of knowledge acquisition and extraction more efficient. Specifically, MATE will treat spoken dialogue corpora at multiple levels, focusing on prosody, (morpho-) syntax, co-reference, dialogue acts, and communicative difficulties, as well as inter-level interaction. The results of the project will be of particular benefit to developers of spoken language dialogue systems but will also be directly useful for other applications of language engineering.
Open Language Archives Community (OLAC) A metadata initiative for language data and NLP tools	offers a standard format for describing/cataloguing corpora and tools in a search-engine-friendly way. See fuller description here
Resource Description Framework (RDF)	a framework for supporting resource description, or metadata (data about data), for the Web. RDF provides common structures that can be used for interoperable XML data exchange, and follows the W3C design principles of interoperability, evolution, and decentralization.
SIGDIAL (Special Interest Group on Discourse and Dialogue; affiliated with the Association for Computational Linguistics (ACL) and the International Speech Communication Association (ISCA))	Links on coding schemes, language resources, methods, tools, projects, references, conferences, etc. Aims to: promote development and distribution of reusable discourse processing components; explore techniques for evaluation of dialogue systems; share resources and data among the international community; encourage empirical methods in research; agree upon standards for discourse transcription, segmentation, and annotation; promote collaboration among developers of various dialogue system components; support student participation in the discourse and dialogue community.
SIGLEX (Special Interest Group on the Lexicon)	an umbrella for a variety of research interests ranging from lexicography and the use of online dictionaries to computational lexical semantics.
Text Encoding Initiative (TEI)	TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent. The TEI’s Guidelines for Electronic Text Encoding and Interchange is here.
Unicode	Description from the Linguistic Annotation page: The Unicode Consortium brings together software corporations and researchers at the leading edge of standardizing international character encoding. The outcome of this cooperation is The Unicode Standard, which provides the foundation for internationalization and localization of software. Unicode has a conference series and an FAQ. There are character charts, including one for IPA extensions.
XML/SGML standards and associated resources (For XML software/tools, see the 'Software, Tools, Frequency Lists...' page)
The XML Cover Pages	including Academic Applications of XML
Edinburgh XML Research Dissemination Workshop	XML Markup Technologies for Working with Linguistic Data, 10-11 May 2001, Edinburgh.
XML Corpus Encoding Standard (XCES)	instantiates the EAGLES Corpus Encoding Standard (CES) DTDs for linguistic corpora, developed by the Department of Computer Science, Vassar College, and Equipe Langue et Dialogue, LORIA/CNRS.
Center for Electronic Texts in the Humanities (CETH)	has some info and links on SGML/XML

If you need help with file formats for some of the downloads, [click here]

Did you find this web site useful? Do let me know, to encourage me to keep updating the site.
