Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations

Submitted

This paper aims to enhance low-resource TTS by reducing data requirements through compact speech representations. We train a Multi-Stage Multi-Codebook (MSMC) VQ-GAN to learn the representation, MSMCR, and decode it to waveforms. Subsequently, we train a multi-stage predictor to predict MSMCRs from text for TTS synthesis. Moreover, we optimize the training strategy by using more audio to learn MSMCRs for low-resource languages. This strategy selects audio from other languages based on speaker similarity to augment the training set, and applies transfer learning to improve training quality. In experiments, the proposed system significantly outperforms FastSpeech and VITS in both standard and low-resource scenarios, demonstrating lower data requirements. The proposed training strategy effectively enhances MSMCRs on waveform reconstruction, improving TTS performance in low-resource scenarios. Finally, we apply the proposed system to Cantonese TTS, further validating its effectiveness for low-resource languages.
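The speaker-similarity selection step described above can be illustrated with a short sketch. This is not the paper's implementation: the function name, embedding dimension, and random "embeddings" are illustrative assumptions; a real pipeline would obtain embeddings from a speaker-verification model (e.g., x-vectors or d-vectors) and rank candidate utterances from other languages by cosine similarity to the target speaker.

```python
import numpy as np

def select_augmentation_utterances(target_embs, candidate_embs, top_k=10):
    """Rank candidate utterances by cosine similarity between their
    speaker embeddings and the centroid of the target speaker's
    embeddings; return indices and scores of the top_k candidates.

    (Illustrative sketch only -- embeddings here are stand-ins for
    outputs of a speaker-verification model.)
    """
    # Centroid of the target speaker's embeddings, L2-normalized.
    centroid = np.mean(target_embs, axis=0)
    centroid /= np.linalg.norm(centroid)
    # Cosine similarity of each candidate to the centroid.
    norms = np.linalg.norm(candidate_embs, axis=1)
    sims = candidate_embs @ centroid / norms
    # Indices of the top_k most similar candidates, best first.
    ranked = np.argsort(-sims)[:top_k]
    return ranked, sims[ranked]

# Toy example with random vectors standing in for speaker embeddings.
rng = np.random.default_rng(0)
target = rng.normal(size=(5, 16))       # 5 utterances from target speaker
candidates = rng.normal(size=(100, 16)) # 100 candidate utterances
idx, scores = select_augmentation_utterances(target, candidates, top_k=10)
```

The selected utterances would then be added to the VQ-GAN training set before transfer learning, as the abstract describes.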


Samples
Sample page based on the HiFi-GAN demo page.