XiaoiceSing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network


Abstract

XiaoiceSing is a singing voice synthesis (SVS) system that aims to generate 48 kHz singing voices. However, the mel-spectrogram it generates is over-smoothed in the middle- and high-frequency regions, because the model has no dedicated design for the details of these bands. In this paper, we propose XiaoiceSing2, which generates the details of the middle- and high-frequency parts to better reconstruct the full-band mel-spectrogram. Specifically, XiaoiceSing2 adopts a generative adversarial network (GAN) consisting of a FastSpeech-based generator and a multi-band discriminator. In the generator, we improve the feed-forward Transformer (FFT) block by adding multiple residual convolutional blocks in parallel with the self-attention block to balance local and global features. The multi-band discriminator contains three sub-discriminators, responsible for the low-, middle-, and high-frequency parts of the mel-spectrogram, respectively. Each sub-discriminator is composed of several segment discriminators (SD) and detail discriminators (DD), which assess the audio from different aspects. Experiments on our internal 48 kHz singing voice datasets show that XiaoiceSing2 significantly improves singing voice quality over XiaoiceSing.
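To make the multi-band idea concrete, the sketch below slices a mel-spectrogram into low-, middle-, and high-frequency bands, one per sub-discriminator. It is an illustration only: the function name `split_bands` and the band edges (equal thirds of the 120 mel bins) are assumptions, since this page does not state the actual split points or whether the bands overlap.

```python
from typing import List, Tuple

Mel = List[List[float]]  # frames x mel bins


def split_bands(mel: Mel, edges: Tuple[int, ...] = (0, 40, 80, 120)) -> List[Mel]:
    """Slice every frame of `mel` into consecutive frequency bands.

    `edges` are hypothetical bin boundaries (equal thirds of 120 bins);
    the real split points used by the model are not given in the source.
    """
    n_bins = edges[-1]
    for frame in mel:
        if len(frame) != n_bins:
            raise ValueError(f"expected {n_bins} mel bins, got {len(frame)}")
    # One band per pair of adjacent edges: [lo, hi) along the bin axis.
    return [
        [frame[lo:hi] for frame in mel]
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


# Toy input: 5 frames, 120 bins, each value equal to its bin index.
mel = [[float(b) for b in range(120)] for _ in range(5)]
low, mid, high = split_bands(mel)
print(len(low), len(low[0]), mid[0][0], high[0][-1])  # 5 40 40.0 119.0
```

In a real system each band would be fed to its own sub-discriminator; the slicing step is the only part shown here.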

Model Architecture

Audio Samples

All spectrograms are converted to audio with the same vocoder.

Audio Quality

Systems compared for each lyric: GT, Xiaoice2, Xiaoice1(WORLD), Xiaoice1(HiFi-WaveGAN), +ConvFFT, ++SD

明天你好,含着泪微笑,越美好 (Hello, tomorrow; smiling through tears; the more beautiful it is)

滴滴答,我早茶月光洒在你头发,平行的画滴滴 (Tick-tick-tock; my morning tea; moonlight spills over your hair; parallel paintings; tick-tick)

曾经将征服全世界,到最后回首才发现 (Once set on conquering the whole world; only when looking back at the end did I realize)

那天晚上满天星星,平行时空下的约定 (That night the sky was full of stars; a promise made across parallel worlds)

第一次遇见阴天遮住你侧脸 (The first time we met, the overcast sky hid your profile)

喵咪咪喵咪咪喵咪咪,一个尾巴细又长 (Meow-mi-mi, meow-mi-mi, meow-mi-mi; a tail thin and long)

我是个小小交通警,漂亮的制服穿在身 (I am a little traffic officer, wearing a handsome uniform)

我唱着妈妈唱着的歌谣 (I sing the ballad my mother sings)

在乡间的小路上 (On the country lane)

若那一刻重来我不哭 (If that moment came again, I would not cry)

但你不肯觉醒 (But you refuse to wake up)

春蚕到死 (The spring silkworm, until death)

长在哨所旁 (Growing beside the sentry post)

从来我就不曾后悔 (I have never once regretted it)

让疾风吹呀吹,尽管给我俩考验 (Let the gale blow and blow; go ahead and put the two of us to the test)

可是我现在只想把你手儿牵 (But right now I only want to hold your hand)

难免有不能控制的宣泄 (Outbursts that cannot be controlled are hard to avoid)

一起郊游,今天别想太多 (Let's go on an outing together; don't overthink things today)

哀愁,我们下手拉大手 (Sorrow; we go hand in hand)

为什么舍不得熄灭 (Why can't I bear to put it out)

信仰有时尽 (Faith sometimes runs out)

有时在相同的地方 (Sometimes in the same place)

一个人游荡 (Wandering alone)

压抑着自己心跳 (Holding back my own heartbeat)

该说的都说了 (Everything that needed saying has been said)

谈及关于你的话题 (When the conversation turns to you)

亲手将他放走,无牵的手 (I let him go with my own hands; hands with no one to hold)

寻马碲铁还要敲多少吉他,才能买得到 (Seeking a horseshoe; how many more guitars must be struck before one can be bought)

轻刷着和弦,情人节卡片 (Lightly strumming chords; a Valentine's Day card)

至少让我不和谁牵扯 (At least let me not get entangled with anyone)

Subjective Evaluation

The evaluation experiments were conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, and one NVIDIA V100 GPU.

MOS Test Result
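For readers unfamiliar with how a mean opinion score (MOS) is usually summarized, the sketch below computes the mean rating and a 95% confidence half-width. This is a generic illustration, not the procedure used for the results on this page: the function name `mos_with_ci`, the normal-approximation interval (z = 1.96), and the example ratings are all assumptions.

```python
import math
import statistics
from typing import List, Tuple


def mos_with_ci(ratings: List[int], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean rating, confidence half-width).

    Uses a normal approximation with the sample standard deviation;
    z = 1.96 corresponds to a 95% interval. This is an assumed convention,
    not something stated in the source.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 1-5 ratings from five listeners for one system (made-up values).
mean, half_width = mos_with_ci([4, 5, 3, 4, 4])
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")  # MOS = 4.00 +/- 0.62
```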

Mel-spectrogram Analysis

The mel-spectrograms are 120-dimensional.


Ablation Study

The evaluation experiments were conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, and one NVIDIA V100 GPU.
