Control-TTS: Zero-Shot Audio-Guided Speaker-Specific and Style-Controllable Text-to-Speech Synthesis
Abstract. Text-to-speech (TTS) synthesis is essential for cross-modal human-computer interaction and is widely applied in smart homes, voice assistants, and mobile devices. In recent years, user demands on TTS have extended beyond generating natural, fluent speech to personalized customization of voice timbre and speaking style, posing new challenges for speech synthesis systems. Research in this area has primarily pursued two directions: zero-shot speech synthesis and style-controllable speech synthesis. However, zero-shot synthesis cannot control speaking style, while traditional style-controllable synthesis cannot specify speaker timbre; neither approach balances style control with timbre specification. To address this gap, we formulate a novel task, Controllable Speech Synthesis with Reference Speech Examples, in which speaker timbre and speaking style are controlled directly through two reference speech samples to generate new timbre-style combinations. For this task we propose Control-TTS, a model that decouples speaker timbre and speaking style from the reference examples and recombines them, synthesizing novel timbre-style combinations and thereby enhancing the diversity of synthesized speech. Experiments on the VccmDataset demonstrate that Control-TTS achieves comparable or state-of-the-art performance in terms of Naturalness Mean Opinion Score (NMOS), Word Error Rate (WER), speaker similarity, and style similarity.
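At a high level, the decouple-and-recombine conditioning described above can be sketched as follows. Note that this is an illustrative toy, not the Control-TTS architecture: the real model uses learned neural encoders, whereas the stubs here use simple spectrogram statistics purely to show the data flow (timbre from one reference utterance, style from another, combined into one conditioning vector).

```python
import numpy as np

def timbre_encoder(ref_audio: np.ndarray) -> np.ndarray:
    """Stand-in timbre encoder (illustrative only): collapses a
    (frames, mels) spectrogram to a fixed-size speaker embedding,
    here the per-mel-bin mean over time."""
    return ref_audio.mean(axis=0)

def style_encoder(ref_audio: np.ndarray) -> np.ndarray:
    """Stand-in style encoder (illustrative only): summarizes
    temporal variation, here the per-mel-bin standard deviation."""
    return ref_audio.std(axis=0)

def condition_vector(timbre_ref: np.ndarray, style_ref: np.ndarray) -> np.ndarray:
    """Recombination step: timbre embedding from one reference,
    style embedding from the other, concatenated into a single
    conditioning vector for a downstream decoder."""
    return np.concatenate([timbre_encoder(timbre_ref), style_encoder(style_ref)])

# Two reference utterances as dummy mel spectrograms (frames x 80 mel bins).
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=(120, 80))  # supplies the target timbre
speaker_b = rng.normal(size=(200, 80))  # supplies the target style
cond = condition_vector(speaker_a, speaker_b)
print(cond.shape)  # (160,)
```

Swapping the two arguments yields the opposite pairing (speaker B's timbre with speaker A's style), which is exactly the kind of new timbre-style combination the task targets.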
Authors:
- Tianwei Lan*
- Mengyuan Deng*
* Equal contribution.