InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself


Abstract

It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which can converge much faster compared with other neural vocoders while maintaining good performance by integrating differentiable digital signal processing and adversarial training. It includes one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio as an instructive signal. Subsequently, the HN module is connected with an extended WaveNet by an UNet-based module, which transforms the output of the HN module to a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet also takes the mel-spectrogram as input to generate 48kHz high-fidelity singing voices. In terms of discriminators, we combine a multi-period discriminator, as originally proposed in HiFiGAN, with a multi-resolution multi-band STFT discriminator. Notably, InstructSing achieves comparable voice quality to other neural vocoders but with only one-tenth of the training steps on a 4 NVIDIA V100 GPU machine.

Model Architecture

Audio Samples

All of the spectrogram use same acoustic model.

Audio Quality

变色的生活任性的挑拨 疯狂地冒出了头

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

风停了也无所谓 只因为你曾说 Everything

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

我们也不为谁掉眼泪

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

你有我的蝴蝶

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

小小的一片云呀 慢慢地走过来 请你们歇歇脚呀

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

我们走到分叉路口

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

变成我记忆里的明信片

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

原来你也在这里啊爱一个人

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

啊啊哦我为你歌唱

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

拥抱着夜来香 吻着夜来香

GT InstructSing HN-UnifiedSourceFilterGAN RefineGAN DDSP

Compare different steps

小小的一片云呀 慢慢地走过来 请你们歇歇脚呀

InstructSing-10ksteps HN-UnifiedSourceFilterGAN-10ksteps HN-UnifiedSourceFilterGAN-50ksteps RefineGAN-10ksteps RefineGAN-50ksteps DDSP-10ksteps DDSP-50ksteps

我们走到分叉路口

InstructSing-10ksteps HN-UnifiedSourceFilterGAN-10ksteps HN-UnifiedSourceFilterGAN-50ksteps RefineGAN-10ksteps RefineGAN-50ksteps DDSP-10ksteps DDSP-50ksteps

变成我记忆里的明信片

InstructSing-10ksteps HN-UnifiedSourceFilterGAN-10ksteps HN-UnifiedSourceFilterGAN-50ksteps RefineGAN-10ksteps RefineGAN-50ksteps DDSP-10ksteps DDSP-50ksteps

原来你也在这里啊爱一个人

InstructSing-10ksteps HN-UnifiedSourceFilterGAN-10ksteps HN-UnifiedSourceFilterGAN-50ksteps RefineGAN-10ksteps RefineGAN-50ksteps DDSP-10ksteps DDSP-50ksteps

啊啊哦我为你歌唱

InstructSing-10ksteps HN-UnifiedSourceFilterGAN-10ksteps HN-UnifiedSourceFilterGAN-50ksteps RefineGAN-10ksteps RefineGAN-50ksteps DDSP-10ksteps DDSP-50ksteps

拥抱着夜来香 吻着夜来香

InstructSing-10ksteps HN-UnifiedSourceFilterGAN-10ksteps HN-UnifiedSourceFilterGAN-50ksteps RefineGAN-10ksteps RefineGAN-50ksteps DDSP-10ksteps DDSP-50ksteps

Singing Voice Quality Evaluation

The evaluation experiments are conducted on the server with 12 Intel Xeon CPU, 256GB memory and 1 NVIDIA V100 GPU.

Quality evaluation at different training steps (objective and subjective)
Relative improvement at different steps compared with the score at 10k step

Ablation Study

Ablation Study