Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Authors: Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li, Deng Cai, Dong Yu

Abstract: Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. The experimental results on both automatic evaluation and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of automatic prosodic boundary annotations is comparable to human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems that use manual ones.

Architecture of the proposed text-speech automatic prosody annotation model:

The proposed framework consists of three main components: a text encoder, an audio encoder, and a multi-modal fusion decoder.
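As a rough illustration of the data format (hypothetical names, not the authors' released code), each fine-tuning example can be represented as the {speech, text, prosody} triplet described in the abstract, with one boundary label per Chinese character:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for one fine-tuning example: the triplet
# {speech, text, prosody} described in the abstract.
@dataclass
class ProsodyTriplet:
    speech: List[float]   # raw audio samples (or pre-extracted features)
    text: str             # the transcript, one Chinese character per position
    prosody: List[str]    # one boundary label per character: CC/LW/PW/PPH/IPH

    def __post_init__(self) -> None:
        # Every character must carry exactly one boundary label.
        if len(self.text) != len(self.prosody):
            raise ValueError("text and prosody must be aligned per character")

example = ProsodyTriplet(
    speech=[0.0] * 16000,  # one second of dummy audio at 16 kHz
    text="没有很爱你",
    prosody=["CC", "LW", "CC", "CC", "PW"],
)
print(len(example.text))  # 5
```

The per-character alignment is what lets the model treat annotation as a sequence-labeling task over characters.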

[Figure: overall architecture of the text-speech prosody annotation model]

Prosodic boundaries of Mandarin:

The hierarchical prosody annotation adopted in this work categorizes the prosodic boundaries of Mandarin speech into five levels, which from low to high are Chinese Character (CC), Lexicon Word (LW), Prosodic Word (PW), Prosodic Phrase (PPH), and Intonational Phrase (IPH). PW, PPH, and IPH correspond to three different lengths of pause in speech, from short to long. LW marks the syntactic boundary between words, and CC, the Chinese character, is the smallest unit of written Chinese.
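The five levels form an ordered hierarchy, which can be encoded directly; the helper below (an illustrative sketch, not from the paper) picks the strongest boundary in a label sequence and checks whether a label implies an audible pause:

```python
# The five-level hierarchy, from lowest to highest, as described above.
# The numeric ranks are just an illustrative encoding.
PROSODY_LEVELS = ["CC", "LW", "PW", "PPH", "IPH"]
RANK = {label: i for i, label in enumerate(PROSODY_LEVELS)}

def strongest_boundary(labels):
    """Return the highest-level boundary among per-character labels."""
    return max(labels, key=RANK.__getitem__)

def has_pause(label) -> bool:
    """Only PW, PPH and IPH correspond to audible pauses (short to long)."""
    return RANK[label] >= RANK["PW"]

print(strongest_boundary(["CC", "LW", "PPH", "PW"]))  # PPH
```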

[Figure: the five-level Mandarin prosodic boundary hierarchy]

Input and output of the model:

We experiment with three different audio encoders in our model: CNN-Char, Conformer-Char, and Conformer-PPG. We take BERT as our baseline; it shares the same architecture as the text encoder in our model but takes only text as input.
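The audio encoders produce frame-level representations, while boundary labels are per character. One simple way to bridge the two granularities, sketched here under the assumption that character time spans are available (e.g. from forced alignment; in the actual model the fusion is learned by the decoder), is to average-pool the frames within each character's span:

```python
def pool_frames_per_char(frames, spans):
    """Average frame-level feature vectors over each character's [start, end) span.

    frames: list of feature vectors (lists of floats), one per audio frame.
    spans:  list of (start_frame, end_frame) pairs, one per character.
    Returns one pooled feature vector per character.
    """
    pooled = []
    for start, end in spans:
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled

# Toy example: 6 frames of 2-dim features, two characters of 3 frames each.
frames = [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0],
          [0.0, 4.0], [0.0, 4.0], [0.0, 4.0]]
print(pool_frames_per_char(frames, [(0, 3), (3, 6)]))  # [[2.0, 0.0], [0.0, 4.0]]
```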

Suppose the input audio and text are as follows.

Raw Audio: [audio sample]

Raw Text: 没有很爱你,只是在大街上拼命追逐,和你背影很像的人。 ("I didn't love you that much; I was just chasing desperately down the street after someone whose silhouette looked like yours.")

Here are example annotations from manual annotation, BERT, and our model:

Manual Annotation:

[annotation example image]

BERT:

[annotation example image]

Our model (with Conformer-PPG as the audio encoder):

[annotation example image]

TTS MOS test:

The primary motivation of this work is to reduce the annotation cost of building TTS systems, so it is worth studying whether automatic annotation is a sufficient alternative to human annotation for TTS training. We take the DurIAN TTS system as our test-bed and conduct crowd-sourced MOS tests to compare TTS systems trained with automatic prosody annotations, with manual annotations, and without prosody annotations. For all TTS systems, we use the same text and prosody content from the original test set as input and randomly shuffle the order of the utterances to exclude other interfering factors, so that only the prosody of the audio is examined. Note that each input used in the MOS test contains at least one PW or IPH prosodic boundary and is at least 12 Chinese characters long. For the system trained without prosodic boundaries, the prosodic boundaries in the input text are omitted. Each audio sample is rated by 24 testers, who are asked to evaluate the prosodic naturalness of the synthesized speech on a five-point scale, from 1 ("Bad") to 5 ("Excellent"). The MOS results with 95% confidence intervals are shown in the following table.
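Each reported MOS value is a mean over raters with a 95% confidence interval; assuming a normal approximation, it can be computed from raw ratings as below (the ratings here are made up purely for illustration):

```python
import math

def mos_with_ci(ratings, z: float = 1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# 24 made-up ratings on the 1-5 scale (one per tester).
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 4, 5,
           3, 4, 4, 4, 5, 4, 3, 4, 4, 4, 5, 4]
mean, hw = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} ± {hw:.2f}")
```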

[Table: MOS results with 95% confidence intervals]

Here are some of the speech samples used in the MOS test. Besides speech generated by the three TTS systems mentioned above, we also provide the raw human-recorded speech, together with the text and manually annotated prosody used for inference.

Each entry lists the sentence text, its per-character prosodic-boundary label sequence, and four audio versions: the human recording, TTS with automatic prosody, TTS with manual prosody, and TTS without prosody.

1. Text:
   Prosody: CC CC CC LW CC LW CC LW CC PW CC LW CC LW CC CC LW CC LW CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

2. Text: 广区。
   Prosody: CC PPH CC LW CC LW CC LW LW CC LW CC PW CC LW CC PPH CC LW CC LW CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

3. Text: 西
   Prosody: LW CC LW CC PW CC CC LW CC PPH CC LW CC LW CC PPH CC LW CC CC PW CC LW CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

4. Text:
   Prosody: CC PW CC LW CC CC LW LW CC PPH CC LW CC CC PW CC LW CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

5. Text: 西
   Prosody: CC LW CC CC LW CC LW CC PPH CC CC CC LW LW CC LW CC PW CC LW CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

6. Text:
   Prosody: CC PW CC CC LW CC LW CC LW LW CC CC LW CC CC LW CC CC PPH CC LW CC CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

7. Text:
   Prosody: CC LW CC LW CC LW CC PPH CC CC LW LW CC CC LW CC LW CC CC PW CC CC LW CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

8. Text:
   Prosody: CC LW CC CC PPH CC LW LW CC LW CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

9. Text:
   Prosody: LW CC LW CC PPH CC LW CC CC CC LW CC PPH CC LW CC CC LW CC CC IPH CC
   Audio: Human recorded | Automatic prosody | Manual prosody | No prosody

10. Text:
    Prosody: CC CC CC LW CC CC LW LW CC CC PPH CC CC LW CC PPH CC LW CC IPH CC
    Audio: Human recorded | Automatic prosody | Manual prosody | No prosody