HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation


Abstract

Entertainment-oriented singing voice synthesis (SVS) requires a vocoder that generates high-fidelity (e.g., 48 kHz) audio. However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize high-quality 48 kHz singing voices in real time. Specifically, it consists of an Extended WaveNet serving as the generator, the multi-period discriminator proposed in HiFi-GAN, and the multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part from the full-band mel-spectrogram, we incorporate a pulse extractor that generates a constraint for the synthesized waveform. Additionally, a novel auxiliary spectrogram-phase loss is proposed to bring the synthesized waveform closer to the real distribution. The experimental results show that the proposed HiFi-WaveGAN achieves a mean opinion score (MOS) of 4.23 on the 48 kHz SVS task, significantly outperforming other neural vocoders.
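To make the multi-period discriminator more concrete, the sketch below shows the input folding step as it is done in the public HiFi-GAN code: the 1-D waveform is reflect-padded to a multiple of a period p and reshaped into rows of p samples, so that each column holds samples spaced p apart. The function name `reshape_for_period` is hypothetical, plain Python lists stand in for tensors, and the periods (2, 3, 5, 7, 11) are HiFi-GAN's defaults. This page does not state which periods HiFi-WaveGAN uses, so treat them as an assumption.

```python
def reshape_for_period(waveform, period):
    """Fold a 1-D waveform into rows of `period` samples each.

    The tail is reflect-padded so that the length becomes a multiple of
    `period` (assumption: same padding scheme as the HiFi-GAN code).
    """
    samples = list(waveform)
    remainder = len(samples) % period
    if remainder:
        pad = period - remainder
        if pad >= len(samples):
            raise ValueError("waveform too short to reflect-pad for this period")
        # Reflect padding: mirror the samples that precede the last one.
        samples += samples[-2:-2 - pad:-1]
    # Each row has `period` samples; column k holds samples k, k+p, k+2p, ...
    return [samples[i:i + period] for i in range(0, len(samples), period)]


# Periods taken from HiFi-GAN (assumed here, not confirmed for HiFi-WaveGAN).
HIFIGAN_PERIODS = (2, 3, 5, 7, 11)

if __name__ == "__main__":
    toy_waveform = list(range(10))
    for p in HIFIGAN_PERIODS:
        print(p, reshape_for_period(toy_waveform, p))
```

In a real discriminator each folded 2-D array would then be processed by 2-D convolutions; that part is omitted here.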

Model Architecture

Audio Samples

All of the spectrograms are produced by the same acoustic model.

Audio Quality

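让你在回味,自不醉人人自醉,因为回忆 (English: "Leaving you to savor it; it does not intoxicate anyone, people intoxicate themselves, because of memories")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN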

我好无聊 (English: "I am so bored")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

怀念从前,是因为太留恋 (English: "I miss the past because I am too attached to it")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

印象假设,然后当真了 (English: "An impression assumed, then taken as real")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

我还在寻找着我的小森林 (English: "I am still searching for my little forest")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

春风吹,阳光照,红领巾,胸前飘 (English: "The spring breeze blows, the sun shines, the red scarf flutters on my chest")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

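甩裙摆画著圆圈,花美得兴高采烈,那香味有点阴险 (English: "Swinging the skirt hem to draw circles, the flowers are exuberantly beautiful, that scent is a little sinister")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN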

每一个岔路口 沿着路标 寻找着自由 (English: "At every fork in the road, following the signposts, searching for freedom")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

你是谁,为了谁,我的兄弟姐妹不流泪 (English: "Who are you, and for whom; my brothers and sisters do not weep")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

二朵玫瑰, 每一朵都像你那样美,你的美 (English: "Two roses, each one as beautiful as you, your beauty")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

不怕晚餐孤单的滋味,只是难免会 (English: "I am not afraid of the taste of a lonely dinner, it is just that I inevitably will")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

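因为爱情总是难舍难分,何必在意那一点点温存,要知道伤心总是难免的 (English: "Because love is always hard to part with, why care about that little bit of tenderness; know that heartbreak is always inevitable")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN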

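到底怎么了,我快被打败爱你的路,不完全合拍 (English: "What on earth is wrong? I am nearly defeated; the road of loving you is not quite in step")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN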

以为把这一座城市抛开 就可以终结伤害 却不明白 (English: "I thought leaving this city behind would end the hurt, yet I did not understand")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

有什么熬不过,大不了唱首歌 (English: "What is there that cannot be endured? At worst, just sing a song")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

不知情,不责怪,热闹散场,躲回安静的床 (English: "Unaware, no blame; the bustle is over, hiding back in a quiet bed")

GT HiFi-WaveGAN HiFi-GAN Parallel WaveGAN

Singing Voice Quality Evaluation

The evaluation experiments are conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, and one NVIDIA V100 GPU.
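Synthesis speed on hardware like this is commonly reported as a real-time factor (RTF): the time spent synthesizing divided by the duration of the audio produced, where a value below 1 means faster than real time. The sketch below is one way to measure it. The names `real_time_factor`, `measure_rtf`, and `synthesize` are hypothetical, and this page does not say how speed was actually measured, so this is an illustration rather than the authors' protocol.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sample_rate=48000):
    """Return synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("the generated audio must contain at least one sample")
    return elapsed_seconds / audio_seconds


def measure_rtf(synthesize, features, sample_rate=48000):
    """Time one call to `synthesize` (a hypothetical vocoder callable that
    returns a sequence of samples) and convert the result to an RTF."""
    start = time.perf_counter()
    waveform = synthesize(features)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform), sample_rate)
```

When timing a GPU model, the device should be synchronized before reading the clock and the first few warm-up runs discarded; otherwise the measured time can be misleading.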

Quality Evaluation (objective and subjective)
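For readers unfamiliar with the subjective metric, a mean opinion score is the average of listener ratings on a 1 to 5 scale, usually reported with a confidence interval. The sketch below computes the mean and a 95% interval half-width under a normal approximation. The function name `mos_with_ci` and the choice of approximation are assumptions for illustration; the page does not describe how its own intervals, if any, were derived.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 listener ratings.

    `half_width` is z * s / sqrt(n), i.e. a normal-approximation
    confidence interval (z = 1.96 corresponds to roughly 95%).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate the spread")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width
```

With few listeners, a Student's t quantile in place of `z` gives a more conservative interval.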

Spectrogram Analysis

Spectrogram comparison of Parallel WaveGAN (PWG) and HiFi-WaveGAN

Ablation Study
