Abstract

This paper describes an end-to-end adversarial singing voice conversion (EA-SVC) approach. It directly generates an arbitrary singing waveform from phonetic posteriorgrams (PPGs) representing content, F0 representing pitch, and a speaker embedding representing timbre. The proposed system is composed of three modules: a generator, an audio generation discriminator, and a feature disentanglement discriminator. The generator encodes the features in parallel and inversely transforms them into the target waveform. To make timbre conversion more stable and controllable, the speaker embedding is further decomposed into a weighted sum of a group of trainable vectors representing different timbre clusters. To make singing conversion more robust and accurate, the disentanglement discriminator is proposed to remove pitch- and timbre-related information that remains in the encoded PPGs. Finally, a two-stage training strategy is used to keep the adversarial training process stable and effective. Subjective evaluation results demonstrate the effectiveness of the proposed methods: the proposed system outperforms the conventional cascade approach and the WaveNet-based end-to-end approach in terms of both singing quality and singer similarity. Further objective analysis reveals that the model trained with the proposed two-stage training strategy produces smoother and sharper formants, which leads to higher audio quality.
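The speaker embedding decomposition mentioned above can be illustrated with a minimal sketch. The abstract only states that the embedding becomes a weighted sum of trainable timbre-cluster vectors. How the weights are computed is not given here, so the dot-product scoring with a softmax below is an assumption, and the names `softmax` and `decompose_speaker_embedding` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decompose_speaker_embedding(speaker_embedding, timbre_vectors):
    """Re-express a speaker embedding as a weighted sum of timbre vectors.

    ASSUMPTION: weights come from dot-product similarity followed by a
    softmax. The abstract does not specify the weighting scheme. In a real
    model the timbre vectors would be trainable parameters; here they are
    plain lists for illustration.
    """
    scores = [
        sum(a * b for a, b in zip(speaker_embedding, vec))
        for vec in timbre_vectors
    ]
    weights = softmax(scores)
    dim = len(timbre_vectors[0])
    combined = [
        sum(w * vec[i] for w, vec in zip(weights, timbre_vectors))
        for i in range(dim)
    ]
    return weights, combined
```

Because the weights are non-negative and sum to one under this assumption, the resulting timbre representation always stays inside the span of the learned cluster vectors, which is one plausible reading of why the decomposition makes conversion more stable for unseen singers.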

Audio Samples

The following models are presented:

WaveNet: The WaveNet-based end-to-end SVC approach.
C-SVC: The conventional cascade SVC approach.
EA-SVC: The proposed end-to-end adversarial SVC approach.
- EA-SVC-1: EA-SVC without speaker embedding decomposition and feature disentanglement.
- EA-SVC-2: EA-SVC without feature disentanglement.

Singing Voice Conversion for Unseen Singers

Male to male

Source	Target

Use original pitch

WaveNet	C-SVC	EA-SVC

Use shifted pitch

WaveNet	C-SVC	EA-SVC

Male to female

Source	Target

Use original pitch

WaveNet	C-SVC	EA-SVC

Use shifted pitch

WaveNet	C-SVC	EA-SVC

Female to female

Source	Target

Use original pitch

WaveNet	C-SVC	EA-SVC

Use shifted pitch

WaveNet	C-SVC	EA-SVC

Female to male

Source	Target

Use original pitch

WaveNet	C-SVC	EA-SVC

Use shifted pitch

WaveNet	C-SVC	EA-SVC

Timbre Transfer

alpha = 0	alpha = 0.25	alpha = 0.5	alpha = 0.75	alpha = 1.0

Pitch Control

alpha = 0.5	alpha = 0.8	alpha = 1.0	alpha = 1.2	alpha = 1.5
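As a rough illustration of the pitch control samples above, the sketch below scales an F0 contour by a factor alpha before it would be passed to the generator. This page does not define alpha, so treating it as a multiplicative factor on F0 in Hz is an assumption, as are the hypothetical function name `scale_f0` and the convention that unvoiced frames are marked with 0.

```python
def scale_f0(f0_hz, alpha):
    """Scale the voiced frames of an F0 contour by alpha.

    ASSUMPTIONS: alpha multiplies F0 in Hz, and unvoiced frames are
    encoded as 0 and must stay 0 after scaling.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [f * alpha if f > 0 else 0.0 for f in f0_hz]


# Under this assumption, alpha = 0.5 lowers the pitch by one octave.
contour = [0.0, 220.0, 440.0]
lowered = scale_f0(contour, 0.5)
```

With this reading, alpha = 1.0 leaves the source pitch unchanged, values below 1.0 lower it, and values above 1.0 raise it.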

Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training

Anonymous Authors
Anonymous Affiliations

