HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation
Abstract
Entertainment-oriented singing voice synthesis (SVS) requires a vocoder to generate high-fidelity (e.g., 48 kHz) audio.
However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario.
In this paper, we propose HiFi-WaveGAN to synthesize high-quality 48 kHz singing voices in real time.
Specifically, it consists of an Extended WaveNet serving as the generator, the multi-period discriminator proposed in HiFiGAN, and the multi-resolution spectrogram discriminator borrowed from UnivNet.
To better reconstruct the high-frequency part from the full-band mel-spectrogram, we incorporate a pulse extractor to generate the constraint for the synthesized waveform.
Additionally, a novel auxiliary spectrogram-phase loss is proposed to further approximate the real distribution.
The experimental results show that the proposed HiFi-WaveGAN achieves a mean opinion score (MOS) of 4.23 on the 48 kHz SVS task, significantly outperforming other neural vocoders.
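The abstract does not define the auxiliary spectrogram-phase loss. As a rough illustration of how a loss can combine STFT magnitude and phase terms, the NumPy sketch below adds an L1 distance on log-magnitudes to a `1 - cos` penalty on phase differences. The formulation, the function names (`stft`, `spectrogram_phase_loss`), and the STFT settings are all assumptions made for illustration, not the paper's formula.

```python
import numpy as np

def stft(signal, n_fft=1024, hop=256):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def spectrogram_phase_loss(real, fake, n_fft=1024, hop=256, eps=1e-7):
    """Illustrative magnitude-plus-phase loss (an assumption, not the paper's definition)."""
    s_real = stft(real, n_fft, hop)
    s_fake = stft(fake, n_fft, hop)
    # Magnitude term: mean absolute difference of log-magnitudes.
    mag_term = np.mean(
        np.abs(np.log(np.abs(s_real) + eps) - np.log(np.abs(s_fake) + eps))
    )
    # Phase term: 1 - cos(phase difference) is 0 when phases match
    # and is unaffected by 2*pi phase wrapping.
    phase_term = np.mean(1.0 - np.cos(np.angle(s_real) - np.angle(s_fake)))
    return mag_term + phase_term

# Hypothetical test signals: a 440 Hz tone at 48 kHz and a noisy copy of it.
sample_rate = 48000
t = np.arange(sample_rate // 4) / sample_rate
real = np.sin(2 * np.pi * 440.0 * t)
rng = np.random.default_rng(0)
fake = real + 0.1 * rng.standard_normal(len(real))

loss_same = spectrogram_phase_loss(real, real)
loss_noisy = spectrogram_phase_loss(real, fake)
```

In a GAN vocoder, a term of this kind would typically be added to the adversarial generator loss with a weighting coefficient. Both the weight and the exact terms would need to be taken from the paper.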
Model Architecture
Audio Samples
All of the spectrograms are generated by the same acoustic model.
Audio Quality
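让你在回味,自不醉人人自醉,因为回忆 (Leaves you reminiscing; it is not what intoxicates, people intoxicate themselves, because of memories)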
GT
hifi-wavegan
hifigan
Parallel WaveGAN
我好无聊 (I am so bored)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
怀念从前,是因为太留恋 (I miss the past because I am too attached to it)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
印象假设,然后当真了 (An impression, an assumption, and then taken for real)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
我还在寻找着我的小森林 (I am still searching for my little forest)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
春风吹,阳光照,红领巾,胸前飘 (The spring breeze blows, the sun shines, the red scarf flutters on my chest)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
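甩裙摆画著圆圈,花美得兴高采烈,那香味有点阴险 (Swinging the skirt hem to trace circles; the flowers are joyfully beautiful; that fragrance is a little sinister)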
GT
hifi-wavegan
hifigan
Parallel WaveGAN
每一个岔路口 沿着路标 寻找着自由 (At every fork in the road, following the signposts, searching for freedom)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
你是谁,为了谁,我的兄弟姐妹不流泪 (Who are you, and for whom; my brothers and sisters shed no tears)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
二朵玫瑰, 每一朵都像你那样美,你的美 (Two roses, each one as beautiful as you, your beauty)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
不怕晚餐孤单的滋味,只是难免会 (I am not afraid of the taste of a lonely dinner; it is just that, inevitably, I will)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
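因为爱情总是难舍难分,何必在意那一点点温存,要知道伤心总是难免的 (Because love is always hard to let go of, why care about that little bit of tenderness; know that heartbreak is always inevitable)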
GT
hifi-wavegan
hifigan
Parallel WaveGAN
到底怎么了,我快被打败爱你的路,不完全合拍 (What on earth is wrong; I am almost defeated; the road of loving you is not entirely in step)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
以为把这一座城市抛开 就可以终结伤害 却不明白 (I thought leaving this city behind would end the hurt, yet I did not understand)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
有什么熬不过,大不了唱首歌 (What can't be endured? At worst, just sing a song)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
不知情,不责怪,热闹散场,躲回安静的床 (Unaware, not blaming; when the party is over, hiding back in a quiet bed)
GT
hifi-wavegan
hifigan
Parallel WaveGAN
Singing Voice Quality Evaluation
The evaluation experiments were conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, and one NVIDIA V100 GPU.
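Real-time capability on hardware like this is commonly summarized by the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where a value below 1 means faster than real time. This page does not state how speed was measured, so the sketch below is only a generic illustration. `real_time_factor`, `dummy_vocoder`, and the hop size of 240 samples are hypothetical stand-ins.

```python
import time

def real_time_factor(synthesize, features, sample_rate=48000):
    """Return synthesis time divided by the duration of the generated audio."""
    start = time.perf_counter()
    waveform = synthesize(features)
    # When timing a GPU model, synchronize the device before reading the clock;
    # otherwise asynchronous kernel launches make the measurement too optimistic.
    elapsed = time.perf_counter() - start
    duration_seconds = len(waveform) / sample_rate
    return elapsed / duration_seconds

def dummy_vocoder(mel_frames, hop_size=240):
    """Hypothetical stand-in for a vocoder: emits hop_size silent samples per frame."""
    return [0.0] * (len(mel_frames) * hop_size)

# 100 placeholder mel frames -> 24,000 samples, i.e. 0.5 s of audio at 48 kHz.
mel = [[0.0] * 80 for _ in range(100)]
rtf = real_time_factor(dummy_vocoder, mel)
```

Averaging the RTF over many utterances, after a few warm-up runs, gives a more stable figure than a single measurement.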
Quality Evaluation (objective and subjective)
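For the subjective part, a MOS is the arithmetic mean of listener ratings on a 1 to 5 scale, usually reported together with a confidence interval. The sketch below shows that computation under a normal approximation. The ratings are made-up placeholders, not data from this evaluation, and `mos_with_ci` is a hypothetical helper name.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    mean = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * standard_error

# Placeholder ratings from ten hypothetical listeners (1 = bad, 5 = excellent).
ratings = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
mos, half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With few listeners, a Student's t quantile in place of `z = 1.96` gives a slightly wider and more conservative interval.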
Spectrogram Analysis
Spectrogram Comparison of Parallel WaveGAN (PWG) and HiFi-WaveGAN