Multimedia/Multimodal Corpora

The list below includes historical digital library initiatives. Not all are structured/formatted as other standardized text corpora.

American Rhetoric/Online Speech Bank An index of 400+ active links to 5000+ full-text, audio & video (streaming) versions of public speeches, sermons, legal proceedings, lectures, debates, interviews, other recorded media events, & a declaration or two.
BACKBONE (European languages, incl. English) BACKBONE is a European project offering web-based pedagogic corpora of video-recorded spoken interviews with native speakers of English, French, German, Polish, Spanish and Turkish as well as non-native speakers of English as a Lingua Franca (ELF). On-line Search here.
Conversations with History A collection of interviews (edited transcripts that are not faithful to the original, plus streaming videos for most interviews) with distinguished people from all over the world about their lives & their work (diplomats, statesmen, & soldiers; economists & political analysts; scientists & historians; writers & foreign correspondents; activists & artists). At the heart of each interview is a focus on individuals & ideas that make a difference. The series is produced at the Institute of International Studies at the University of California at Berkeley. Conceived in 1982 as a way to capture & preserve through conversation & technology the intellectual ferment of our times, Conversations with History includes over 300 interviews.
GeM Corpus (Genre & Multimodality) The GeM project ran from 1999 until 2002 & was concerned with developing the first XML annotation scheme for multilayered description of illustrated documents with complex layout. The GeM framework allows layout, rhetorical structure, content & language of different text types to be represented & interrogated. Output:
  • An annotated corpus of newspapers, illustrated bird guides, instruction manuals & websites
  • An XML annotation scheme for illustrated documents
  • A prototype generator (implemented with XSLT) that produces laid-out pages expressed in terms of XSL-FO.
Gesture Database (Max Planck Institute in Nijmegen) Consists of video recordings (no accompanying transcript/corpus texts, as far as I know) of speech & the gestures that spontaneously accompany it, plus annotations of the gesture & speech in each recording. The recordings were made in different cultures, including the Netherlands, Italy, the USA, Japan, Turkey, Australian Aboriginal communities, Mexico, Belize, & Ghana. The recorded speech events are ones that elicit spontaneous gestures, such as narration of traditional stories & autobiographical stories, description of the local environment, & route directions. Unfortunately, the website only contains sparse information and no links to any data or actual examples.
Historical Voices “The purpose of Historical Voices is to create a significant, fully searchable online database of spoken word collections spanning the 20th century - the first large-scale repository of its kind. Historical Voices will both provide storage for these digital holdings & display public galleries that cover a variety of interests & topics.” Includes synchronised text-&-audio RealMedia presentations (see, e.g., the Flint Sit-Down Strike). Transcripts are not formatted like standardised corpora, but have the advantage of being linked to sound recordings.
MICASE (Michigan Corpus of Academic Spoken English; American English). See fuller description here.
Multimedia Adult English Learner Corpus (MAELC) A database of video of classroom interaction & associated written materials collected from university-level Intensive English Language Program classes at Portland State University; adult ESL classes from beginning to upper-intermediate proficiency; more than 3,600 hours of classroom interaction recorded by six cameras and multiple microphones. Access upon request.
Multimedia Movie Corpus on the Web Read about this corpus of American movies created using subtitles in five languages (English, French, German, Italian & Spanish). Not accessible as of 2 January 2020.
New South Voices A project of the Special Collections Unit, J. Murrey Atkins Library, Univ of North Carolina at Charlotte. Provides online access to a unique collection of over 800 interviews, narratives & conversations collected by UNC Charlotte faculty & students & several community organizations documenting the Charlotte region in the 20th century (transcriptions, audio & video files & supplementary materials, primarily photographs). The interviews cover a wide range of historical subjects, from African American churches & Billy Graham crusades to women’s basketball & World War II. Other interviews, narratives & conversations document the experiences & language of new arrivals to the area.
Corpus of Video-Mediated English as a Lingua Franca Conversations (ViMELF) ViMELF contains 20 fully transcribed Skype conversations between 40 speakers from Germany (20 speakers), Spain (5), Italy (5), Finland (5), and Bulgaria (5), totaling 744.5 minutes (ca. 12.5 hours), with an average conversation length of 37.23 minutes.
The corpus comprises 113 670 words in the plain text version and 152 472 items in the annotated version.
The transcripts are available as .docx and .txt files; the videos in MPEG4 format.
Several versions are available: the fully annotated pragmatic version as text and XML, a lexical version, and a POS-tagged version (auto-tagged with the CLAWS C7 tagset).
