Abstract

Adversarial waveform generation is a popular backend for singing voice conversion (SVC) because it generates high-quality singing audio efficiently. However, the instability of GAN models also causes problems such as pitch jitter and unvoiced/voiced (U/V) errors. These artifacts impair the smoothness and continuity of the harmonic components and thus seriously degrade conversion quality. To tackle this problem, this paper proposes an approach based on harmonic signals to enhance audio generation in SVC. It first extracts a sine excitation from F0, then filters it with an estimated linear time-varying (LTV) filter, and finally feeds both signals to the waveform generation module. Two mainstream models, MelGAN and ParallelWaveGAN, are used in the experiments to validate the effectiveness of the proposed approach. The MOS test results show that the models with harmonic signals significantly outperform the baseline models in both sound quality and singer similarity. The analysis also shows that the approach effectively improves the smoothness and continuity of harmonics in the generated audio.
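
To make the harmonic-signal pipeline concrete, the following is a minimal sketch and not the paper's implementation. It assumes a frame-level F0 contour in Hz with 0 marking unvoiced frames, a fixed hop size, and one FIR kernel per frame standing in for the estimated LTV filter. The function names `sine_excitation` and `ltv_filter` and the parameters `hop` and `sr` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sine_excitation(f0_frames, hop, sr):
    """Sample-level sine wave driven by frame-level F0 in Hz (0 = unvoiced). Illustrative only."""
    # Upsample the frame-level F0 to one value per audio sample.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop)
    # Integrate the instantaneous frequency to obtain a continuous phase.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    # Output silence where the frame is unvoiced.
    return np.where(f0 > 0, np.sin(phase), 0.0)

def ltv_filter(x, kernels, hop):
    """Filter each hop-sized frame with its own FIR kernel and overlap-add (assumed scheme)."""
    n_frames, k = kernels.shape
    y = np.zeros(n_frames * hop + k - 1)
    for i in range(n_frames):
        frame = x[i * hop:(i + 1) * hop]
        # The convolution tail of each frame spills into the following frames.
        y[i * hop:(i + 1) * hop + k - 1] += np.convolve(frame, kernels[i])
    return y[:n_frames * hop]

# Toy input: four voiced frames at 220 Hz followed by two unvoiced frames.
f0_frames = [220.0] * 4 + [0.0] * 2
hop, sr = 100, 16000
excitation = sine_excitation(f0_frames, hop, sr)

# Placeholder kernels (unit impulses); a real system would predict them per frame.
kernels = np.zeros((len(f0_frames), 8))
kernels[:, 0] = 1.0
filtered = ltv_filter(excitation, kernels, hop)

# Both harmonic signals stacked as extra conditioning channels for a generator.
harmonic_inputs = np.stack([excitation, filtered])
```

In this sketch the unit-impulse kernels leave the excitation unchanged; in practice the per-frame kernels would come from a learned estimator, and the stacked signals would be passed to the waveform generator alongside its usual conditioning features.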