Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Authors: Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li, Deng Cai, Dong Yu
Abstract: Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. The experimental results on both automatic evaluation and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of automatic prosodic boundary annotations is comparable to that of human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems that use manual ones.
The proposed framework consists of three main components: a text encoder, an audio encoder, and a multi-modal fusion decoder.
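Below is a minimal sketch of how these three components could fit together, assuming a transformer text encoder over characters, a small convolutional stand-in for the audio encoder, and a fusion step in which each character position cross-attends to audio frames before a per-character boundary classifier. All module choices, names, and dimensions here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the three-component framework: text encoder, audio encoder,
# multi-modal fusion, and a per-character prosodic-boundary classifier.
import torch
import torch.nn as nn

NUM_LEVELS = 5  # CC, LW, PW, PPH, IPH


class ProsodyAnnotator(nn.Module):
    def __init__(self, vocab_size=6000, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Stand-in audio encoder: a small CNN over 80-dim log-mel frames.
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(80, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
        )
        # Fusion: each text position cross-attends to the audio frames.
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, NUM_LEVELS)

    def forward(self, char_ids, mels):
        # char_ids: (batch, text_len); mels: (batch, 80, n_frames)
        text = self.text_encoder(self.char_embed(char_ids))
        audio = self.audio_encoder(mels).transpose(1, 2)  # (batch, frames, d)
        fused, _ = self.fusion(query=text, key=audio, value=audio)
        return self.classifier(text + fused)  # per-character level logits


logits = ProsodyAnnotator()(torch.randint(0, 6000, (2, 20)), torch.randn(2, 80, 300))
print(logits.shape)  # torch.Size([2, 20, 5])
```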
The hierarchical prosody annotation adopted in this work categorizes the prosodic boundaries of Mandarin speech into five levels, from low to high: Chinese Character (CC), Lexicon Word (LW), Prosodic Word (PW), Prosodic Phrase (PPH), and Intonational Phrase (IPH). Prosodic Word (PW), Prosodic Phrase (PPH), and Intonational Phrase (IPH) correspond to three different lengths of pause in speech, from short to long. Lexicon Word (LW) indicates the syntactic boundary between words, and Chinese Character (CC) is the smallest unit of written Chinese.
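As a concrete illustration, the five levels can be encoded as per-character labels, where each character carries the highest-level boundary that follows it. The numeric encoding and the annotated fragment below are assumptions for illustration only, not the paper's annotation scheme verbatim.

```python
# Hypothetical per-character encoding of the five-level hierarchy.
PROSODY_LEVELS = {
    "CC": 0,   # Chinese Character: no boundary after this character
    "LW": 1,   # Lexicon Word: syntactic word boundary
    "PW": 2,   # Prosodic Word: shortest pause
    "PPH": 3,  # Prosodic Phrase: medium pause
    "IPH": 4,  # Intonational Phrase: longest pause
}

# Hypothetical annotated fragment: each character is paired with the
# highest-level boundary that follows it.
example = [("没", "CC"), ("有", "PW"), ("很", "CC"), ("爱", "CC"), ("你", "IPH")]
print(" ".join(f"{ch}/{PROSODY_LEVELS[lv]}" for ch, lv in example))
# 没/0 有/2 很/0 爱/0 你/4
```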
We experiment with three kinds of audio encoders in our model: CNN-Char, Conformer-Char, and Conformer-PPG. We take BERT as our baseline; it shares the same architecture as the text encoder in our model and takes only text as input.
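For reference, a text-only baseline of this shape can be sketched as a Chinese BERT encoder with a per-token classification head; the checkpoint name and the linear head below are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Sketch of the text-only BERT baseline: per-token boundary-level logits.
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class TextOnlyBaseline(nn.Module):
    def __init__(self, num_levels=5):
        super().__init__()
        # Assumed checkpoint; any Chinese BERT encoder would fit this sketch.
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.head = nn.Linear(self.bert.config.hidden_size, num_levels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)  # per-token boundary-level logits


tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
batch = tokenizer("没有很爱你", return_tensors="pt")
model = TextOnlyBaseline()
print(model(batch["input_ids"], batch["attention_mask"]).shape)
```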
Suppose the input audio and text are as follows.
Raw Audio:
Raw Text: 没有很爱你,只是在大街上拼命追逐,和你背影很像的人。 (English gloss: "I don't love you that much; I just desperately chase after people on the street whose silhouettes from behind resemble yours.")
Here are the example annotations from the human annotator, the BERT baseline, and our model:
Manual Annotation:
BERT:
Our model (with Conformer-PPG as the audio encoder):