A Multi-Scale Time-Frequency Spectrogram Discriminator
for GAN-based Non-Autoregressive TTS

Haohan Guo, Hui Lu, Xixin Wu, Helen Meng

Submitted to INTERSPEECH 2022

Generative adversarial networks (GANs) have been validated as a useful approach to improving non-autoregressive TTS through adversarial training with an extra model that discriminates between real and generated speech. For effective training, a powerful discriminator is needed to capture the rich differences between the two. In this paper, a multi-scale time-frequency discriminator is proposed to help TTS generate more realistic Mel-spectrograms. It treats the spectrogram as a 2D image to exploit the correlations among components in both the time and frequency domains, and it employs an encoder-decoder U-Net to capture richer information at different scales. Subjective tests are conducted to verify its effectiveness. Both multi-scale and time-frequency discrimination show significant improvements in the preference test. Moreover, the MOS test shows that a TTS model trained with this discriminator can already perform well even if the vocoder is not fine-tuned on TTS outputs. Finally, visualization of the discriminator outputs validates that both low-resolution and high-resolution information are utilized.
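
As a rough illustration of the idea described in the abstract, below is a minimal PyTorch sketch of a U-Net spectrogram discriminator that treats the Mel-spectrogram as a 2D image and outputs real/fake score maps at two scales: the encoder bottleneck (low resolution) and the decoder output (high resolution). The class name, channel widths, kernel sizes, and heads are assumptions for illustration, not the architecture released with the paper.

# Minimal sketch of a multi-scale U-Net spectrogram discriminator.
# NOTE: channel widths, kernel sizes, and the two heads are assumptions
# for illustration; this is not the authors' released architecture.
import torch
import torch.nn as nn


class UNetSpectrogramDiscriminator(nn.Module):
    """Treats a Mel-spectrogram as a 1-channel 2D image and returns
    real/fake score maps at two scales: the encoder bottleneck
    (low resolution, global structure) and the decoder output
    (high resolution, fine detail)."""

    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        # Encoder: strided 2D convolutions downsample time and frequency.
        self.encoders = nn.ModuleList()
        in_ch = 1
        for out_ch in channels:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            ))
            in_ch = out_ch
        # Low-resolution head at the bottleneck.
        self.low_res_head = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        # Decoder: transposed convolutions upsample; each stage is followed
        # by concatenation with the matching encoder feature (skip connection).
        self.decoders = nn.ModuleList()
        for skip_ch in reversed(channels[:-1]):
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(in_ch, skip_ch, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            ))
            in_ch = skip_ch * 2  # channels after the skip concatenation
        self.up_final = nn.ConvTranspose2d(in_ch, channels[0], kernel_size=4,
                                           stride=2, padding=1)
        # High-resolution head at full spectrogram resolution.
        self.high_res_head = nn.Conv2d(channels[0], 1, kernel_size=3, padding=1)

    def forward(self, mel):
        # mel: (batch, n_mels, frames); add a channel axis for 2D convolutions.
        # Both n_mels and frames are assumed divisible by 2 ** len(channels).
        x = mel.unsqueeze(1)
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        low_res_score = self.low_res_head(x)       # score map at the coarsest scale
        for dec, skip in zip(self.decoders, reversed(skips[:-1])):
            x = dec(x)
            x = torch.cat([x, skip], dim=1)        # U-Net skip connection
        x = self.up_final(x)
        high_res_score = self.high_res_head(x)     # score map at full resolution
        return low_res_score, high_res_score


if __name__ == "__main__":
    disc = UNetSpectrogramDiscriminator()
    mel = torch.randn(2, 80, 128)                  # 80 Mel bins, 128 frames
    low, high = disc(mel)
    print(low.shape, high.shape)                   # (2, 1, 10, 16) (2, 1, 80, 128)

In such a setup, both score maps would typically be fed to a standard GAN objective (e.g. least-squares or hinge loss), so the acoustic model receives feedback on both the global spectrogram structure and local time-frequency detail.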


Samples
Sample page based on the HiFi-GAN demo page.