AuCo: Audio Corpora of languages of Vietnam and neighbouring countries

The AuCo collection hosts audio recordings of languages of Vietnam and neighbouring countries, including data on endangered and under-resourced languages. AuCo stands for Audio Corpora; it is also a reference to Âu Cơ, a fairy who bore an egg sac that hatched a hundred children: the Hundred Peoples (Bách Việt), ancestors of the Vietnamese and of the multitude of other ethnic groups of the area. The round dots in the logo of the AuCo collection are an allusion to these hundred eggs – a symbol of the cultural and linguistic diversity reflected in the collection.

The aim of the AuCo collection is to gather the documents recorded by researchers in the course of their research. The AuCo collection thereby fulfills an important function: it allows for cumulative progress in speech data collection. Preparing, recording and annotating audio data sets is highly time-consuming; with a little extra investment of time and effort, the data can be prepared in such a way as to be re-usable by others, for various purposes (including phonetic and phonological analysis and automatic speech processing, but also language teaching and language revitalization). The AuCo collection aims to contribute to the documentation of a precious human heritage: the languages of the world. It also aims to facilitate interdisciplinary research involving engineers and linguists, through the sharing of data, tools and methods.

The AuCo collection is open to documents of various types: from unique heritage recordings dating back several decades, to everyday recordings of national languages collected for one-off research purposes. Because there is no telling when and how documents will be re-used, the AuCo collection chooses not to exclude any type of data.

Fieldwork on the Naxi language (Yunnan)

The documents in the AuCo collection were recorded and transcribed or annotated by researchers from very different backgrounds, including members of the “Speech Communication” department of the International Research Institute MICA (HUST - CNRS/UMI-2954 - Grenoble INP, Hanoi University of Science and Technology). The preparation of the documents for archiving and online distribution is carried out by members of the “Speech Communication” department of the MICA Institute. Long-term preservation (perennial archiving) and online distribution are handled by the Très Grand Equipement Adonis, in partnership with CINES and IN2P3. Data deposit is carried out with the help of the two centres that serve as archive entry points: the Pangloss Collection / Cocoon data repository (CNRS-LACITO), and the Speech and Language Data Repository: SLDR (CNRS-LPL).

Fieldwork on Mo Piu (Hmong)

From 2009 to 2013, data collection work by MICA members focused on Mơ Piu (Hmongic group of the Hmong-Mien language family), which has fewer than 250 speakers. Extensive data sets were collected through field trips (in Lao Cai Province) and by inviting speakers to Hanoi, gradually changing the status of this language from fully unknown to well-documented. Data are in the process of being archived at the SLDR archive.

Current research highlights (2014-2016) include:


Languages: Tai Yo and Tai Pao (Tai-Kadai family)
Team: Frédéric Pain, Matthew Deo, Đinh Thị Hằng
Main objectives:
- creating a multimedia online edition of a manuscript in Lai Pao writing, a writing system unique to Vietnam
- publishing entirely phonemicized, annotated and translated audio recordings

Languages: Việt-Mường (Vietic) languages: Cuối Chăm, Arem, and Mường
Team: Nguyễn Thị Minh Châu, Phạm Thị Thu Hà, Cao Thành Việt
Main objectives:
- fundamental documentation and study of a contemporary Mường dialect
- annotation and online publication of historical recordings
- inputting data from (verified) field notes, and synchronizing with recordings

Languages: Vietnamese dialect of Phong Nha, Quảng Bình
Team: Alexis Michaud, Nguyễn Thị Minh Châu
Main objectives:
- analysis of the dialect in historical perspective
- data publication

New in November 2014:

The vocabulary list for Southeast Asia used by Michel Ferlus is now online in a multilingual version suitable for automatic processing. A PDF document presenting this list is available here; the word list in XLS format is available here. This document is also deposited in the HAL archive. A script for generating XML documents from this list combined with Praat annotations is now available too (author: Mạc Đăng-Khoa). Its principles are explained in this PowerPoint slideshow, which also presents ongoing documentation work on data by Michel Ferlus. The tools are now (8 November 2014) fully functional and we shall begin producing XML documents this month. Announcements will be posted on this page. We plan to make Arem data (Vietic group of Austroasiatic) available in 2015.
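As a rough illustration of the general idea (this is not the script by Mạc Đăng-Khoa, whose internals are not described on this page), the Python sketch below combines word-list entries with time-aligned annotation intervals and produces a small XML document. The element names (`WORDLIST`, `W`, `AUDIO`, `FORM`, `TRANSL`), the function name `build_wordlist_xml`, the interval layout and the placeholder data are all assumptions made for the example; a real workflow would read the intervals from a Praat TextGrid and the glosses from the word list.

```python
import xml.etree.ElementTree as ET


def build_wordlist_xml(entries, intervals, lang_code):
    """Combine word-list entries with annotated intervals (hypothetical layout).

    entries   -- dict mapping a word-list item number to its English gloss
    intervals -- list of (start_sec, end_sec, item_number, transcription)
    lang_code -- language identifier stored on the root element
    """
    root = ET.Element("WORDLIST", {"lang": lang_code})
    for start, end, item, form in intervals:
        if item not in entries:
            continue  # skip intervals with no matching word-list entry
        word = ET.SubElement(root, "W", {"id": f"w{item}"})
        # Time codes tie the transcription to the audio recording.
        ET.SubElement(word, "AUDIO", {"start": f"{start:.3f}", "end": f"{end:.3f}"})
        ET.SubElement(word, "FORM").text = form
        ET.SubElement(word, "TRANSL", {"lang": "en"}).text = entries[item]
    return root


# Placeholder data only: no real linguistic forms are implied.
entries = {1: "gloss-1", 2: "gloss-2"}
intervals = [
    (0.50, 1.25, 1, "form-1"),
    (1.80, 2.40, 2, "form-2"),
    (3.00, 3.50, 99, "form-x"),  # item 99 is not in the word list
]

root = build_wordlist_xml(entries, intervals, "und")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Keying each annotated interval to a word-list item number is one simple way to keep glosses in a single shared list while transcriptions and time codes stay specific to each recording.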

New in August 2015:

Three data sets are available online: Cuối Chăm, Arem, and Mường. Data from seven languages of the Tai-Kadai group are being finalized; upload is planned for the near future.

An online tool for aligning manuscripts with annotation and audio recordings has been developed. See a demo (in English) here.

See also a video (in Vietnamese) presenting the re-discovery of the Lai Pao (Lai Paw) writing system, which, along with Tai Don, will be among the first writing systems to be processed with this tool. The author of this tool, Matthew Deo, received financial support from the LabEx "Empirical Foundations of Linguistics" in 2015.