CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers


Abstract

Building a multi-singer, high-fidelity singing voice synthesis system with cross-lingual ability is challenging when only monolingual singers are available for training. In this paper, we propose CrossSinger, a cross-lingual singing voice synthesizer based on Xiaoicesing2. Specifically, we use the International Phonetic Alphabet to unify the representation of all languages in the training data. Moreover, we leverage conditional layer normalization to incorporate language information into the model, improving pronunciation when singers meet unseen languages. Additionally, a gradient reversal layer (GRL) is used to remove singer biases carried by the lyrics: because all singers are monolingual, a singer's identity is implicitly associated with the text. Experiments are conducted on a combination of three singing voice datasets: the Japanese Kiritan dataset, the English NUS-48E dataset, and one internal Chinese dataset. The results show that CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switching cases.
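As a rough illustration of the shared phoneme representation described in the abstract, the sketch below maps language-specific phoneme labels to IPA symbols so that a single vocabulary serves all three languages. The heavily truncated lookup tables, the choice of input label sets (pinyin-style, ARPAbet-style, romaji-style), and the names `TO_IPA`, `to_ipa`, and `build_vocab` are assumptions made for illustration; this page does not specify the actual front end or symbol inventory used by CrossSinger.

```python
# Hypothetical, heavily truncated lookup tables; a real front end would cover
# every phoneme of each language.
TO_IPA = {
    "zh": {"sh": "ʂ", "x": "ɕ", "a": "a", "i": "i"},  # pinyin-style labels
    "en": {"SH": "ʃ", "AA": "ɑ", "IY": "i"},          # ARPAbet-style labels
    "ja": {"sh": "ɕ", "a": "a", "i": "i"},            # romaji-style labels
}


def to_ipa(phonemes, lang):
    """Convert language-specific phoneme labels to shared IPA symbols."""
    table = TO_IPA[lang]
    return [table[p] for p in phonemes]


def build_vocab(tables):
    """Build one symbol-to-id vocabulary shared by all languages."""
    symbols = sorted({ipa for table in tables.values() for ipa in table.values()})
    return {sym: idx for idx, sym in enumerate(symbols)}


vocab = build_vocab(TO_IPA)

# Sounds that exist in several languages collapse onto the same symbol, so a
# singer trained on one language has already seen part of another's inventory.
zh_ids = [vocab[s] for s in to_ipa(["x", "i"], "zh")]
ja_ids = [vocab[s] for s in to_ipa(["sh", "i"], "ja")]
```

In this toy setup the Chinese and Japanese sequences map to identical ids, which is the kind of sharing a unified IPA representation is meant to provide.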

Model Architecture
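The two components named in the abstract, conditional layer normalization and the gradient reversal layer, can be sketched in a few lines. This is a minimal pure-Python sketch of the general techniques, not the CrossSinger implementation: where these layers sit in the network, how the language embedding is produced, and the names `conditional_layer_norm`, `GradientReversal`, `w_gamma`, `w_beta`, and `lam` are all assumptions for illustration.

```python
import math


def conditional_layer_norm(x, lang_emb, w_gamma, w_beta, eps=1e-5):
    """Layer-normalize x, then scale and shift it with values predicted
    from a language embedding.

    x:        hidden vector (list of floats)
    lang_emb: language embedding (list of floats)
    w_gamma, w_beta: weight matrices with one row per hidden dimension
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    # Scale and bias are linear projections of the language embedding,
    # so each language gets its own normalization parameters.
    gamma = [sum(w * e for w, e in zip(row, lang_emb)) for row in w_gamma]
    beta = [sum(w * e for w, e in zip(row, lang_emb)) for row in w_beta]
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]


class GradientReversal:
    """Identity in the forward pass; multiplies the incoming gradient
    by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


# Toy usage with made-up numbers: one hidden vector, two "languages".
hidden = [1.0, 2.0, 3.0]
w_gamma = [[1.0, 2.0]] * 3
w_beta = [[0.0, 0.5]] * 3
out_lang0 = conditional_layer_norm(hidden, [1.0, 0.0], w_gamma, w_beta)
out_lang1 = conditional_layer_norm(hidden, [0.0, 1.0], w_gamma, w_beta)

grl = GradientReversal(lam=2.0)
passed = grl.forward([1.0, -2.0])      # unchanged on the way forward
reversed_grad = grl.backward([0.5, -1.0])  # sign-flipped and scaled on the way back
```

In an adversarial setup of this kind, a singer classifier would typically be attached behind the reversal layer, so that minimizing the classifier loss pushes the upstream text representation to carry less singer information.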

Audio Samples

All spectrograms use the same vocoder.

Code Switch

原来你是这种 bad boy

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Bad boy bad boy 你的坏让我不明白

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Bad boy bad boy 我必须要离

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

I'm letting go 我终于舍得为你放开手

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

世界 let's start from here 无所谓

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Chinese Song

笑着哭着都快活

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

在我心里面

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

有什么话让人会受伤

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Song

I'm just a little bit caught in the middle life is a maze

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

and love is a riddle I don't know where to go

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

and I don't know why

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Song

きみわ つかみどうころがない ばかり

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

かってなきたい ばかりしちゃう

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

めおあけたら ひろがるぶらにゅわる

Chinese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

English Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Japanese Singer

GT CrossSinger Xiaoicesing2 w/o singer bias eliminator w/o conditional layer norm

Subjective Evaluation

We conducted a series of experiments to evaluate the proposed model, using the Mean Opinion Score (MOS) as the evaluation metric. The evaluation covered three aspects: sound quality, pronunciation accuracy, and naturalness. Each sub-experiment involved 20 segments for each language per singer. We recruited 60 listeners in total, with 20 assigned to each language. The results are presented in Table 1 and indicate that CrossSinger significantly outperforms Xiaoicesing2 in cross-lingual ability.
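For readers unfamiliar with the metric, the sketch below shows one common way to aggregate listener ratings into a MOS with an approximate 95% confidence interval. The 1 to 5 rating scale, the normal-approximation interval, the function name `mos_with_ci`, and the example ratings are assumptions for illustration; the page does not state how its scores were aggregated, and the numbers here are not results from the paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of an approximate 95% CI)."""
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    # Standard error of the mean, scaled by the normal quantile z.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up ratings on a 1-5 scale, for illustration only.
example_ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

The same function would be applied separately to each aspect (sound quality, pronunciation accuracy, naturalness) and each system being compared.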

Table 1: MOS Test Result

Ablation Study

For the ablation study, we individually removed the language conditional layer normalization (CLN) and the singer bias eliminator, as indicated in Table 2.

Table 2: Ablation Study