International Research Institute MICA - The 1st top multimedia unit in Vietnam - HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design, System Design, and Evaluation

Seminar by Ms. NGUYEN Thi Thu Trang, PhD student under joint supervision (cotutelle) between Université Paris-Sud 11 and the MICA Institute - Date: Friday 11 September, 9:30 am - Venue: "seminar" room, B1, MICA Institute, Hanoi University of Science and Technology

Speaker:
Ms. NGUYEN Thi Thu Trang, PhD student under joint supervision (cotutelle) between the LIMSI laboratory, Université Paris-Sud 11, and the Speech Communication Department of the MICA Institute

Date: Friday 11 September, 9:30 am
Venue: "seminar room", 9th floor, building B1, MICA Institute, Hanoi University of Science and Technology
Language: the seminar will be presented in English

Abstract:
The objective of the thesis is to design and build a high-quality Hidden Markov Model (HMM)-based Text-To-Speech (TTS) system for Vietnamese, a tonal language. The system is called VTED (Vietnamese TExt-to-speech Development system). In view of the great importance of lexical tones, a "tonophone" (an allophone in tonal context) was proposed as a new speech unit in our TTS system. A new training corpus, VDTS (Vietnamese Di-Tonophone Speech corpus), was designed for 100% coverage of di-phones in tonal contexts (i.e. di-tonophones), with sentences selected from a huge raw text by a greedy algorithm. About 4,000 VDTS sentences were recorded and pre-processed as the training corpus for VTED.
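The greedy selection mentioned above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names and the toy unit extractor (adjacent token pairs standing in for di-tonophones) are assumptions made for illustration only.

```python
def toy_units(sentence):
    """Hypothetical stand-in for di-tonophone extraction: adjacent token pairs."""
    tokens = sentence.split()
    return set(zip(tokens, tokens[1:]))


def greedy_select(sentences, units_of):
    """Repeatedly pick the sentence that covers the most still-uncovered units."""
    pool = [(s, frozenset(units_of(s))) for s in sentences]
    remaining = set()
    for _, units in pool:
        remaining |= units  # the target is every unit that occurs in the raw text

    selected = []
    while remaining:
        # max() returns the first best candidate, so ties keep the input order
        best, best_units = max(pool, key=lambda item: len(item[1] & remaining))
        selected.append(best)
        remaining -= best_units
    return selected


if __name__ == "__main__":
    raw_text = ["a b c", "a b", "b c", "c d"]
    print(greedy_select(raw_text, toy_units))
```

Each pass rescans the whole pool, which is acceptable for a sketch; a real selection over a very large raw text would presumably need an index from units to sentences.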
In HMM-based speech synthesis, although pause duration can be modeled as a phoneme, the appearance of pauses cannot be predicted by HMMs. Lower phrasing levels above words may not be completely modeled with basic features. This research aimed at automatic prosodic phrasing for Vietnamese TTS using durational cues alone, as it appeared too difficult to disentangle intonation from lexical tone. Syntactic blocks, i.e. syntactic phrases with a bounded number of syllables (n), were proposed for predicting final lengthening (n = 6) and pause appearance (n = 10). Final-lengthening prediction was improved by several strategies for grouping single syntactic blocks. The predictive J48 decision-tree model for pause appearance, using syntactic blocks combined with syntactic link and POS (Part-Of-Speech) features, reached an F-score of 81.4% (Precision = 87.6%, Recall = 75.9%), much better than the models using only POS (F-score = 43.6%) or only syntactic link (F-score = 52.6%).
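As a quick check of how the reported precision and recall relate to the F-score, here is a small sketch. It assumes the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not say which variant was used, and the function name is illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Rounded values quoted in the abstract for the combined-feature model
    print(f"{f_score(0.876, 0.759):.3f}")
```

With the rounded inputs this gives about 0.813, which agrees with the reported 81.4% to within rounding of the published precision and recall.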
The system architecture was based on the core architecture of HTS, extended with a Natural Language Processing part for Vietnamese. Pause appearance was predicted by the proposed model. The contextual feature set included phone identity features, locational features, tone-related features, and prosodic features (i.e. POS, final lengthening, break levels). Mary TTS was chosen as the platform for implementing VTED. In the MOS (Mean Opinion Score) test, the first VTED, trained with the old corpus and basic features, was rated rather good: 0.81 points (on a 5-point MOS scale) higher than the previous system, HoaSung (which uses non-uniform unit selection with the same training corpus), but still 1.2-1.5 points lower than natural speech. The final VTED, trained with the new corpus and the prosodic phrasing model, improved by about 1.04 points over the first VTED, and its gap with natural speech was much reduced. In the tone intelligibility test, the final VTED achieved a high correct rate of 95.4%, only 2.6% lower than natural speech and 18% higher than the initial one. In the intelligibility test with the Latin square design, the error rate of the first VTED was about 6-12% higher than that of natural speech, depending on the level (syllable, tone or phone). The final one differed from natural speech by only about 0.4-1.4%.
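For readers unfamiliar with the Latin square design mentioned for the intelligibility test, the sketch below builds a simple cyclic Latin square. The condition names and the reading of rows as listener groups and columns as stimulus sets are assumptions for illustration; the abstract does not describe the actual test layout.

```python
def latin_square(conditions):
    """Cyclic Latin square: each condition appears once per row and per column."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


if __name__ == "__main__":
    # Hypothetical conditions; rows = listener groups, columns = stimulus sets
    systems = ["natural speech", "first VTED", "final VTED"]
    for group, row in enumerate(latin_square(systems), start=1):
        print(f"group {group}: {row}")
```

With this layout every listener group hears every condition, while no group hears the same stimulus set under two different conditions.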
