Ms. TRAN Thi Anh Xuan, a PhD student jointly supervised by MICA and GIPSA-Lab, successfully defended her thesis in Grenoble on 30 March 2016 and was thereby awarded the degree of Doctor of Science, specialty EEATS (Electronics, Electrical Engineering, Automatic Control, Signal Processing).


Title: Acoustic gesture modeling. Application to a Vietnamese speech recognition system


Thesis supervisor (MICA, French side): Mr. Eric Castelli
Thesis co-supervisor (MICA, Vietnamese side): Ms. PHAM Thi Ngoc Yen
Co-advisor (GIPSA-Lab): Ms. Nathalie Vallée

Jury members:

Ms. Martine ADDA-DECKER, Research Director, CNRS, Laboratoire de Phonétique et de Phonologie (Chair)
Mr. Georges LINARES, Professor, Université d'Avignon et des Pays de Vaucluse, LIA, Avignon (Reviewer)
Mr. François PELLEGRINO, Research Director, CNRS, Laboratoire Dynamique du Langage, Lyon (Reviewer)
Mr. Eric CASTELLI, Professor & CNRS research scientist, Institut MICA, Hanoi (Thesis supervisor)
Ms. PHAM Thi Ngoc Yen, Professor, Institut Polytechnique de Hanoi (Thesis co-supervisor)
Ms. Nathalie VALLEE, CNRS Research Scientist, GIPSA-Lab, Grenoble (Co-advisor)



Speech plays a vital role in human communication. Selecting relevant acoustic speech features is key in the design of any system that uses speech processing. For some 40 years, speech has typically been considered as a sequence of quasi-stable portions of signal (vowels) separated by transitions (consonants). Despite a wealth of studies that clearly document the importance of coarticulation and reveal that articulatory and acoustic targets are not context-independent, the view that each vowel has an acoustic target that can be specified in a context-independent manner remains widespread. This point of view entails strong limitations.

It is well known that formant frequencies are acoustic characteristics that bear a clear relationship with speech production and that can distinguish among vowels. Vowels are therefore generally described with static articulatory configurations represented by targets in the acoustic space, typically by formant frequencies in the F1-F2 and F2-F3 planes. Plosive consonants can be described in terms of places of articulation, represented by a locus or by locus equations in an acoustic plane. But formant frequency trajectories in fluent speech rarely display a steady state for each vowel. They vary with speaker, consonantal environment (coarticulation) and speaking rate (related to the continuum between hypo- and hyper-articulation). In view of the inherent limitations of static approaches, the approach adopted here consists in studying both vowels and consonants from a dynamic point of view.

First, we studied the effects of the impulse response at the beginning, at the end and during transitions of the signal, both in the speech signal itself and at the perception level. Variations of the phases of the components were then examined. Results show that the effects of these parameters can be observed in spectrograms. Crucially, the amplitudes of the spectral components distinguished under the approach advocated here are sufficient for perceptual discrimination. Based on this result, all subsequent speech analysis focuses only on the amplitude domain, deliberately leaving aside phase information. Next, we extended the work to vowel-consonant-vowel perception from a dynamic point of view. These perceptual results, together with those obtained earlier by Carré (2009a), show that vowel-to-vowel and vowel-consonant-vowel stimuli can be characterized and separated by the direction and rate of the transitions in the formant plane, even when absolute frequency values are outside the vowel triangle (i.e. the vowel acoustic space in absolute values).
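As a rough illustration of what "direction and rate of the transitions in the formant plane" can mean numerically, the sketch below computes, for each pair of consecutive analysis frames, the angle of the displacement in the F1-F2 plane and its speed in Hz per second. It is only a minimal sketch and not the method of the thesis: the function name `transition_direction_and_rate`, the 10 ms frame step and the formant values are all hypothetical, chosen for illustration.

```python
import math


def transition_direction_and_rate(f1_hz, f2_hz, frame_step_s):
    """Direction and rate of a trajectory in the F1-F2 plane.

    Hypothetical helper: returns one (angle_deg, rate_hz_per_s) pair per
    pair of consecutive frames. The angle is measured from the F1 axis
    towards the F2 axis.
    """
    if len(f1_hz) != len(f2_hz):
        raise ValueError("F1 and F2 tracks must have the same length")
    if frame_step_s <= 0:
        raise ValueError("frame_step_s must be positive")

    steps = []
    for i in range(1, len(f1_hz)):
        d_f1 = f1_hz[i] - f1_hz[i - 1]
        d_f2 = f2_hz[i] - f2_hz[i - 1]
        # Direction of the displacement vector (d_f1, d_f2).
        angle_deg = math.degrees(math.atan2(d_f2, d_f1))
        # Length of the displacement divided by the time between frames.
        rate_hz_per_s = math.hypot(d_f1, d_f2) / frame_step_s
        steps.append((angle_deg, rate_hz_per_s))
    return steps


# Invented trajectory: F1 falls while F2 rises, one frame every 10 ms.
f1 = [700.0, 600.0, 500.0, 400.0, 300.0]
f2 = [1200.0, 1500.0, 1800.0, 2100.0, 2400.0]
steps = transition_direction_and_rate(f1, f2, frame_step_s=0.010)

# The same trajectory shifted by a constant offset in both formants.
f1_shifted = [value + 1000.0 for value in f1]
f2_shifted = [value + 1000.0 for value in f2]
steps_shifted = transition_direction_and_rate(f1_shifted, f2_shifted, frame_step_s=0.010)

for angle_deg, rate_hz_per_s in steps:
    print(f"angle = {angle_deg:.1f} deg, rate = {rate_hz_per_s:.0f} Hz/s")
```

Because only frame-to-frame differences enter the computation, the shifted trajectory yields the same directions and rates as the original one. This is consistent with the observation above that such descriptors remain usable when the absolute frequency values fall outside the vowel triangle.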

Because of the limitations of formant measurements, the dynamic approach requires new tools based on parameters that can replace formant frequency estimation. Other features were therefore studied.

On this basis, these features are used as a tool to compute dynamic characteristics. We propose a new way to model dynamic speech features: our analysis was performed on the transitions of vowel-to-vowel (V1V2) sequences in both Vietnamese and French. Finally, these dynamic acoustic speech features are used in a Vietnamese automatic speech recognition system, with several interesting results.

Key words: vowel gesture, dynamic acoustic features, magnitude of speech, transition direction and rate, SSCF Angles, automatic speech recognition.