Authors: Haohan Guo, FK Soong, Lei He, Lei Xie
Abstract: End-to-end (e2e) TTS, which predicts speech directly from a given sequence of graphemes or phonemes, has shown improved performance over conventional TTS. However, its predictive capability is still limited by the acoustic/phonetic coverage of the training data, which is usually constrained by the training set size. To further improve TTS quality in pronunciation, prosody and perceived naturalness, we propose to exploit the information embedded in a syntactic parse tree, where the inter-phrase/word information of a sentence is organized in a multilevel tree structure. Specifically, two key features are investigated: phrase structure and relations between adjacent words. Subjective listening tests on three test sets show that the proposed approach effectively improves the pronunciation clarity, prosody and naturalness of the speech synthesized by the baseline system.
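
To make the two kinds of features mentioned in the abstract more concrete, here is a minimal sketch of how they might be read off a constituency parse. It is an illustration under assumptions, not the feature definition used in the paper: the bracketed input format, the function names (`parse_brackets`, `word_paths`, `phrase_structure`, `adjacent_word_relations`) and the choice of "lowest common ancestor plus steps up/down" as the adjacent-word relation are all hypothetical.

```python
def parse_brackets(text):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples.

    Leaf words are plain strings. Assumes well-formed input in which words
    do not themselves contain parentheses.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()


def word_paths(tree, prefix=(), index=0):
    """Return [(word, path)], where path lists (label, child_index) from the root
    down to the word's part-of-speech node."""
    label, children = tree
    here = prefix + ((label, index),)
    out = []
    for i, child in enumerate(children):
        if isinstance(child, str):
            out.append((child, here))
        else:
            out.extend(word_paths(child, here, i))
    return out


def phrase_structure(tree):
    """Per word: the phrase labels that contain it, root first (POS tag excluded)."""
    return [(w, [lab for lab, _ in path[:-1]]) for w, path in word_paths(tree)]


def adjacent_word_relations(tree):
    """Per adjacent word pair: (left, right, LCA label, steps up, steps down).

    Steps are counted from each word's POS node to the lowest common ancestor.
    """
    paths = word_paths(tree)
    feats = []
    for (wa, pa), (wb, pb) in zip(paths, paths[1:]):
        shared = 0
        for a, b in zip(pa, pb):
            if a != b:
                break
            shared += 1
        feats.append((wa, wb, pa[shared - 1][0], len(pa) - shared, len(pb) - shared))
    return feats


if __name__ == "__main__":
    # Hypothetical example sentence, not taken from the test sets.
    example = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
    tree = parse_brackets(example)
    for row in phrase_structure(tree):
        print(row)
    for row in adjacent_word_relations(tree):
        print(row)
```

In this sketch, a pair such as "cat" / "sat" meets only at the sentence root, while "the" / "cat" meets inside the same noun phrase; a difference of this kind is the sort of cue that could plausibly inform phrase-break and prosody prediction.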

Note: These samples are synthesized from mel-spectrograms with the Griffin-Lim algorithm. Please focus on prosody when listening; this work does not address sound quality.

Common test set (Left: baseline, Mid: phrase structure based features, Right: word relation based features)

“Although they expect higher data speeds eventually , IT managers don't know which technologies will deliver .”

“Mobile operators say that new network technologies will increase mobile bandwidth over two hundred times , pushing speeds from today's nine .”

“At a meeting of Microsofts top one hundred and fifty executives , Nadella tried to get everyone back on course .”

“Twitter , Amazon , Google , and others have all removed their Apple Watch apps . ”

“Brins resume , which was last updated more than twenty years ago , is still available online .”

Complex test set (Left: baseline, Mid: phrase structure based features, Right: word relation based features)

“Satya Nadella: I think it is important for companies like ours to have a set of principles that governs some of the most important things like privacy and security and or immigration and take a stance . ”

“In addition , Microsoft also launched a new set of tools for developers who want to use its Visual Studio Code IDE for building models with CNTK , TensorFlow , Theano , Keras and Caffe two . ”

“But it will be the driver's responsibility to make sure that children under 14 do not ride in the front unless they are wearing a seat belt of some kind. ”

“They are trying to find out whether there is something about the way we teach language to children which in fact prevents children from learning sooner. ”

“Even the folk knowledge in social systems on which ordinary life is based in earning, spending, organizing, marrying, taking part in political activities, fighting and so on, is not very dissimilar from the more sophisticated images of the social system derived from the social sciences, even though it is built upon the very imperfect samples of personal experience. ”

“All the faith he had had had had no effect on the outcome of his life.”

Pathological test set (Left: baseline, Mid: phrase structure based features, Right: word relation based features)

“zero zero four cd three oh seven zero zero zero zero zero zero zero three zero zero zero zero zero zero zero one . ”

“added dot-net framework install in script backslash backslash m b x hyphen t o p o c e n t r a l backslash clientsetup backslash v m c l i e n t s e t u p dot c m d comma to fix issue of running dot-net binaries on w two k . ”