XiaoiceSing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network
Abstract
XiaoiceSing is a singing voice synthesis (SVS) system that aims to generate 48 kHz singing voices. However, the mel-spectrogram it generates is over-smoothed in the middle- and high-frequency regions, because the model has no component designed to capture the detail of these parts. In this paper, we propose XiaoiceSing2, which generates the details of the middle- and high-frequency parts to better reconstruct the full-band mel-spectrogram. Specifically, XiaoiceSing2 adopts a generative adversarial network (GAN) consisting of a FastSpeech-based generator and a multi-band discriminator. We improve the feed-forward Transformer (FFT) block by adding multiple residual convolutional blocks in parallel with the self-attention block to balance local and global features. The multi-band discriminator contains three sub-discriminators responsible for the low-, middle-, and high-frequency parts of the mel-spectrogram, respectively. Each sub-discriminator is composed of several segment discriminators (SD) and detail discriminators (DD) that judge the audio from different aspects. Experiments on our internal 48 kHz singing voice datasets show that XiaoiceSing2 significantly improves the quality of the singing voice over XiaoiceSing.
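To make the multi-band idea concrete, the following minimal sketch shows one way a mel-spectrogram could be sliced into low-, middle-, and high-frequency parts before each part is passed to its own sub-discriminator. It is an illustration under assumptions, not the authors' implementation: the abstract does not give the band boundaries, so the equal-width contiguous split, the helper names `band_edges` and `split_bands`, and the toy input are all hypothetical. The 120 mel bins match the dimension reported further down this page.

```python
from typing import List, Tuple

Mel = List[List[float]]  # layout: [frames][mel_bins]


def band_edges(n_mels: int, n_bands: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) bin indices for n_bands contiguous bands.

    Assumption: bands are contiguous and as equal in width as possible.
    The source does not state the actual band boundaries.
    """
    base, extra = divmod(n_mels, n_bands)
    edges = []
    start = 0
    for i in range(n_bands):
        # Spread any remainder over the first `extra` bands.
        width = base + (1 if i < extra else 0)
        edges.append((start, start + width))
        start += width
    return edges


def split_bands(mel: Mel, n_bands: int = 3) -> List[Mel]:
    """Slice every frame into n_bands frequency bands (low to high)."""
    n_mels = len(mel[0])
    return [[frame[lo:hi] for frame in mel] for lo, hi in band_edges(n_mels, n_bands)]


# Toy input: 4 frames x 120 mel bins; each value is its own bin index,
# which makes the slicing easy to verify by eye.
mel = [[float(b) for b in range(120)] for _ in range(4)]
low, mid, high = split_bands(mel)

print(band_edges(120))                          # [(0, 40), (40, 80), (80, 120)]
print(len(low[0]), len(mid[0]), len(high[0]))   # 40 40 40
```

In a real system each of the three slices would feed a separate sub-discriminator; overlapping or unequal bands would be an equally plausible design.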
Model Architecture
Audio Samples
All of the spectrograms use the same vocoder.
Audio Quality
明天你好,含着泪微笑,越美好
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
滴滴答,我早茶月光洒在你头发,平行的画滴滴
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
曾经将征服全世界,到最后回首才发现
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
那天晚上满天星星,平行时空下的约定
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
第一次遇见阴天遮住你侧脸
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
喵咪咪喵咪咪喵咪咪,一个尾巴细又长
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
我是个小小交通警,漂亮的制服穿在身
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
我唱着妈妈唱着的歌谣
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
在乡间的小路上
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
若那一刻重来我不哭
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
但你不肯觉醒
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
春蚕到死
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
长在哨所旁
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
从来我就不曾后悔
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
让疾风吹呀吹,尽管给我俩考验
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
可是我现在只想把你手儿牵
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
难免有不能控制的宣泄
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
一起郊游,今天别想太多
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
哀愁,我们下手拉大手
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
为什么舍不得熄灭
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
信仰有时尽
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
有时在相同的地方
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
一个人游荡
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
压抑着自己心跳
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
该说的都说了
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
谈及关于你的话题
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
亲手将他放走,无牵的手
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
寻马碲铁还要敲多少吉他,才能买得到
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
轻刷着和弦,情人节卡片
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
至少让我不和谁牵扯
GT
Xiaoice2
Xiaoice1(WORLD)
Xiaoice1(HiFi-WaveGAN)
+ConvFFT
++SD
Subjective Evaluation
The evaluation experiments were conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, and one NVIDIA V100 GPU.
MOS Test Result
Mel-spectrogram Analysis
The mel-spectrogram has 120 dimensions (mel bins).
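When reading the spectrogram plots, it can help to know roughly which frequency each of the 120 mel bins covers. The sketch below estimates the center frequency of every bin for 48 kHz audio. It rests on assumptions the page does not confirm: a filterbank spanning 0 Hz to the 24 kHz Nyquist frequency, the common HTK-style mel formula, and evenly spaced centers on the mel scale. The function names are hypothetical.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """HTK-style Hz-to-mel conversion (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(n_mels: int = 120, fmin: float = 0.0, fmax: float = 24000.0) -> List[float]:
    """Center frequencies (Hz) of n_mels triangular filters.

    A triangular filterbank uses n_mels + 2 points evenly spaced on the
    mel scale; the inner n_mels points are the filter centers.
    Assumption: fmin = 0 Hz and fmax = Nyquist for 48 kHz audio.
    """
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_centers()
# Approximate centers of the first, middle, and last bins.
for idx in (0, 59, 119):
    print(f"bin {idx:3d}: {centers[idx]:8.1f} Hz")
```

Because the mel scale is roughly logarithmic at high frequencies, the upper bins each span far more hertz than the lower ones, which is worth keeping in mind when comparing high-frequency detail across systems.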
Ablation Study
The evaluation experiments were conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, and one NVIDIA V100 GPU.