CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers
Abstract
Building a multi-singer, high-fidelity singing voice synthesis system with cross-lingual ability is challenging when only monolingual singers are available for training. In this paper, we propose CrossSinger, a cross-lingual singing voice synthesizer based on Xiaoicesing2. Specifically, we use the International Phonetic Alphabet to unify the representation of all languages in the training data. Moreover, we leverage conditional layer normalization to incorporate language information into the model, which improves pronunciation when singers meet unseen languages. Additionally, a gradient reversal layer (GRL) is used to remove singer biases contained in the lyrics: because all singers are monolingual, a singer's identity is implicitly associated with the text. Experiments are conducted on a combination of three singing voice datasets: the Japanese Kiritan dataset, the English NUS-48E dataset, and one internal Chinese dataset. The results show that CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switching cases.
Model Architecture
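The abstract states that language information enters the model through conditional layer normalization. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation: a feature vector is normalized, then scaled and shifted by values predicted from a language embedding. The class name, the toy dimensions, the language keys, and all weights are hypothetical and chosen only for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(vec, weight, bias):
    """Affine map; `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


class ConditionalLayerNorm:
    """Layer norm whose scale and shift are predicted from a language embedding."""

    def __init__(self, lang_embeddings, scale_w, scale_b, shift_w, shift_b):
        self.lang_embeddings = lang_embeddings
        self.scale_w, self.scale_b = scale_w, scale_b
        self.shift_w, self.shift_b = shift_w, shift_b

    def __call__(self, x, lang_id):
        emb = self.lang_embeddings[lang_id]
        gamma = linear(emb, self.scale_w, self.scale_b)  # per-dimension scale
        beta = linear(emb, self.shift_w, self.shift_b)   # per-dimension shift
        return [g * n + b for g, n, b in zip(gamma, layer_norm(x), beta)]


# Toy setup (assumed values): 4 hidden dimensions, 2-dimensional language embeddings.
HIDDEN = 4
lang_embeddings = {"zh": [1.0, 0.0], "en": [0.0, 1.0], "ja": [1.0, 1.0]}
cln = ConditionalLayerNorm(
    lang_embeddings,
    scale_w=[[0.5, -0.5] for _ in range(HIDDEN)],
    scale_b=[1.0] * HIDDEN,
    shift_w=[[0.1, 0.2] for _ in range(HIDDEN)],
    shift_b=[0.0] * HIDDEN,
)

x = [1.0, 2.0, 3.0, 4.0]
out_zh = cln(x, "zh")  # same input, Chinese conditioning
out_en = cln(x, "en")  # same input, English conditioning
```

With these toy weights the same hidden vector is scaled and shifted differently for each language, which is the mechanism a model could use to adapt pronunciation to a language the singer never sang during training.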
Audio Samples
All of the spectrograms use the same vocoder.
Code Switch
原来你是这种 bad boy
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Bad boy bad boy 你的坏让我不明白
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Bad boy bad boy 我必须要离
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
I'm letting go 我终于舍得为你放开手
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
世界 let's start from here 无所谓
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Chinese Song
笑着哭着都快活
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
在我心里面
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
有什么话让人会受伤
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Song
I'm just a little bit caught in the middle life is a maze
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
and love is a riddle I don't know where to go
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
and I don't know why
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Song
きみわ つかみどうころがない ばかり
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
かってなきたい ばかりしちゃう
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
めおあけたら ひろがるぶらにゅわる
Chinese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
English Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Japanese Singer
GT
CrossSinger
Xiaoicesing2
w/o singer bias eliminator
w/o conditional layer norm
Subjective Evaluation
We conducted a series of experiments to evaluate the performance
of our proposed model, using the average Mean Opinion Score (MOS)
as the evaluation metric. The evaluation covered three aspects:
sound quality, pronunciation accuracy, and naturalness. Each
sub-experiment involved 20 segments for each language per
singer. We recruited a total of 60 listeners, with 20 assigned
to each language. The experimental results are presented in Table 1 and
indicate that CrossSinger significantly outperforms Xiaoicesing2 in terms of cross-lingual ability.
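As a small illustration of how an average MOS per system and aspect can be computed from raw listener ratings, here is a minimal sketch. The 1 to 5 rating scale, the function name, and the system labels are assumptions, and the ratings are made-up placeholders, not results from this study.

```python
from collections import defaultdict
from statistics import mean


def average_mos(ratings):
    """Average ratings per (system, aspect) pair.

    `ratings` is an iterable of (system, aspect, score) tuples.
    A 1-5 scale is assumed here; the source does not state the scale.
    """
    buckets = defaultdict(list)
    for system, aspect, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        buckets[(system, aspect)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Placeholder ratings for illustration only (NOT data from the paper).
toy_ratings = [
    ("system_a", "pronunciation", 4),
    ("system_a", "pronunciation", 5),
    ("system_b", "pronunciation", 3),
    ("system_b", "pronunciation", 2),
    ("system_a", "naturalness", 4),
]

table = average_mos(toy_ratings)
for (system, aspect), score in sorted(table.items()):
    print(f"{system:10s} {aspect:15s} {score:.2f}")
```

Keeping the three aspects (sound quality, pronunciation accuracy, naturalness) as separate keys avoids mixing them into a single score, which mirrors how the evaluation above reports them.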
MOS Test Result
Ablation Study
For the ablation study, we individually removed
the language conditional layer normalization (CLN) and the
singer bias eliminator, as indicated in Table 2.
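The abstract describes removing singer bias with a gradient reversal layer (GRL). The sketch below shows the general GRL mechanism in plain Python with hand-written forward and backward passes, under the assumption that the singer bias eliminator is built around such a layer. The class name, the `lambd` scaling factor, and the example gradient values are hypothetical; in a deep learning framework this would normally be written as a custom autograd operation instead.

```python
class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, lambd=1.0):
        self.lambd = lambd  # assumed strength of the reversal

    def forward(self, features):
        # Features pass through unchanged on the way to the singer classifier.
        return list(features)

    def backward(self, grad_output):
        # The classifier's gradient is flipped before it reaches the text encoder,
        # so the encoder is pushed to make singer identity harder to predict.
        return [-self.lambd * g for g in grad_output]


grl = GradientReversal(lambd=0.5)

encoder_features = [0.2, -1.0, 3.0]
passed = grl.forward(encoder_features)

# Hypothetical gradient of a singer-classification loss w.r.t. `passed`.
grad_from_classifier = [0.4, -0.2, 1.0]
grad_to_encoder = grl.backward(grad_from_classifier)
```

The classifier itself still receives ordinary gradients and learns to identify the singer, while the encoder receives the reversed signal. Removing this component, as in the ablation, would leave the text representation free to encode singer identity.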