Poster Presentation: Machine Learning in CALL
Exploring Prosodic Prominence Control in Synthesized Golden Speaker’s Speech for Pronunciation Training
Advances in text-to-speech (TTS) technology have created new opportunities for pronunciation training. Zero-shot TTS (ZS-TTS) models, for example, can synthesize speech in a learner's own voice while producing more native-like pronunciation, yielding a so-called golden speaker. In addition, some models support instruction-based generation, which allows expressive modifications such as emphasizing specific words or inserting breath pauses and laughter within an utterance. Prosodic elements such as prominence affect intelligibility for listeners and can modify the meaning of discourse, yet computer-assisted pronunciation training (CAPT) research has largely focused on segmental features. This study investigates whether instruction-based ZS-TTS can generate pedagogically exaggerated prominence patterns in the learner's voice, with the potential to enhance both pronunciation and listening training. Using CosyVoice2, a ZS-TTS model with emphasis instruction control, the study examines whether marking target words with emphasis instructions produces measurable acoustic changes associated with prominence. Controlled sentence pairs will be synthesized in neutral and emphasis-marked conditions, including multiple-hypothesis cases in which different words within the same sentence are emphasized to alter discourse nuance. Acoustic analyses will focus on relative pitch variation, word duration, and intensity. This exploratory work aims to determine whether instruction-based prominence control is consistent and pedagogically meaningful.
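The planned acoustic comparison can be illustrated with a minimal sketch. The function below is hypothetical (not part of CosyVoice2 or any cited toolkit): it takes per-frame F0 and amplitude tracks for a target word, which in the actual study would come from a pitch tracker run on the synthesized audio, and returns the three measures the abstract names — word duration, mean relative pitch (semitones re 100 Hz), and RMS intensity in dB. The dummy input values are illustrative, not real measurements.

```python
import numpy as np

def prominence_measures(f0_hz, amplitude, frame_dur=0.01):
    """Return (duration in s, mean F0 in semitones re 100 Hz, RMS intensity in dB)
    for one word, given per-frame F0 (Hz; 0 = unvoiced) and amplitude tracks.
    Hypothetical helper for illustration; real tracks would be extracted from
    the neutral and emphasis-marked CosyVoice2 outputs."""
    voiced = f0_hz > 0
    # Semitone scale makes pitch comparisons relative rather than absolute.
    semitones = 12.0 * np.log2(f0_hz[voiced] / 100.0)
    duration = len(f0_hz) * frame_dur
    rms_db = 20.0 * np.log10(np.sqrt(np.mean(amplitude ** 2)))
    return duration, float(np.mean(semitones)), float(rms_db)

# Dummy tracks for one word in the two synthesis conditions.
neutral = prominence_measures(np.full(20, 120.0), np.full(20, 0.1))
emphasized = prominence_measures(np.full(30, 160.0), np.full(30, 0.2))

# A pedagogically meaningful emphasis effect would appear as longer duration,
# higher mean pitch, and greater intensity in the emphasis-marked condition.
for name, n, e in zip(("duration", "pitch", "intensity"), neutral, emphasized):
    print(f"{name}: neutral={n:.2f} emphasized={e:.2f}")
```

In the study itself, such per-word measures would be compared across the controlled sentence pairs to test whether the emphasis instruction shifts all three correlates of prominence consistently.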