University of West Bohemia in Pilsen

Acoustic speech synthesis


Ing. Jindřich Matoušek, Ph.D. <zdkrnoul@kky.zcu.cz>

Acoustic speech synthesis produces the speech signal, i.e. the spoken speech itself. As an accompanying component of a talking-head model or a signed-language processing system, it can help, for example, people with partial hearing impairment, who can then exploit both the visual information obtained by lip-reading and the acoustic information carried by the synthesized speech signal at the same time. A speech synthesizer can also provide invaluable services to people with other impairments: mute people or people with voice disorders can use their own speech synthesis system to produce "their" speech, and people who are unable to speak after a stroke can benefit from the talking-head technology and use it to relearn speaking. Teaching the signed language to people with unimpaired hearing is another example of an important application of the talking head.

The aim of acoustic speech synthesis is to produce speech that comes close to the voice characteristics of a given person. The synthetic speech should mimic not only the voice quality of that person but also his/her speaking style. Speech synthesis is the most time-consuming part of the talking-head creation process. Text-to-speech (TTS) technology is employed to produce speech automatically. TTS is the most general and also the most difficult speech synthesis task, as its job is to convert an arbitrary text into the corresponding speech; thanks to this technology, the talking head can "pronounce" any text. A TTS system comprises a set of special modules and algorithms that ensure the automatic conversion of written text into spoken speech. They include text processing (e.g. analysis and normalization), conversion of text into its pronunciation form (i.e. phonetic transcription and generation of the contours of the prosodic features of speech), creation of the acoustic unit inventory, and the speech production itself.

Figure 1: The general scheme of a concatenative text-to-speech system.
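The pipeline described above can be sketched as a chain of stages. The following is a minimal, purely illustrative sketch; all function names and the toy rules inside them are assumptions for demonstration, not part of the actual system:

```python
import re

def normalize_text(text):
    """Text processing: expand a few abbreviations and collapse whitespace (toy rules)."""
    replacements = {"e.g.": "for example", "etc.": "et cetera"}
    for abbr, full in replacements.items():
        text = text.replace(abbr, full)
    return re.sub(r"\s+", " ", text).strip()

def phonetic_transcription(text):
    """Toy grapheme-to-phoneme step: one symbol per letter.
    Real systems use pronunciation lexicons and transcription rules."""
    return [ch for ch in text.lower() if ch.isalpha()]

def generate_prosody(phones):
    """Assign placeholder duration (ms) and pitch (Hz) values to each phone;
    a real system generates full prosodic contours."""
    return [{"phone": p, "duration_ms": 80, "f0_hz": 120} for p in phones]

def synthesize(prosodic_units):
    """Speech production stub: a real system would concatenate waveform
    segments drawn from the acoustic unit inventory."""
    return b"".join(u["phone"].encode() for u in prosodic_units)

text = normalize_text("Hello  world, e.g. a test.")
signal = synthesize(generate_prosody(phonetic_transcription(text)))
```

Each stage consumes the output of the previous one, mirroring the text processing → phonetic transcription → prosody generation → speech production order given in the text.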

We have developed a unique methodology for high-quality text-to-speech synthesis of Czech. The Czech TTS system is based on the principle of concatenative speech synthesis, currently the most successful and most widely used approach to speech synthesis worldwide. Simply put, the basic principle of this approach is to represent the significant acoustic events of human speech by so-called speech units, or speech segments; the resulting speech is then created by concatenating these units. Sub-word units rank among the most suitable unit types, e.g. phones (most often considered together with the context of the surrounding phones, so-called triphones) or diphones (which, roughly speaking, start in the middle of one phone and end in the middle of the following phone).

Figure 2: The examples of speech units.
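Because a diphone spans from the middle of one phone to the middle of the next, the diphone units needed for an utterance are simply the adjacent phone pairs, with silence padding at both ends. A small illustrative sketch (the function name and the `#` silence symbol are assumptions, not the system's actual notation):

```python
def phones_to_diphones(phones, sil="#"):
    """List the diphone units needed to synthesize a phone sequence.
    Each diphone runs from the middle of one phone to the middle of the
    next, so the sequence is padded with silence at both ends."""
    padded = [sil] + list(phones) + [sil]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]

# The Czech word "ahoj" (hello), phones roughly [a, h, o, j]:
print(phones_to_diphones(["a", "h", "o", "j"]))
# [('#', 'a'), ('a', 'h'), ('h', 'o'), ('o', 'j'), ('j', '#')]
```

Note that a sequence of n phones always requires n + 1 diphones, since the silence-to-phone and phone-to-silence transitions at the utterance boundaries must be covered as well.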

The key to successful speech synthesis is the careful preparation of the acoustic unit inventory, i.e. the segments of speech that the speech synthesizer employs. Since the quality of the resulting synthetic speech depends to a large degree both on the richness of the speech segments present in the inventory and on the accuracy with which they are extracted from the reference utterances, we utilize a methodology of automatic inventory creation based on a large number of natural speech utterances. This automation is an important aspect of our system, because it makes it possible to build very precise and acoustically and linguistically rich acoustic unit inventories (huge speech corpora, tens of hours of speech, can be employed), which contribute substantially to the high quality of the synthesized speech.

This approach is known as corpus-based concatenative speech synthesis, because the speech corpus (i.e. a set of natural speech utterances spoken by a single speaker, whose voice the synthesizer mimics, together with their representations in the orthographic, phonetic, spectral and prosodic domains) is the basic material for building the acoustic unit inventories. Our system is so far the first and only speech synthesizer in the Czech Republic that utilizes this new-generation technology.

A significant criterion of synthetic speech quality is naturalness. The naturalness of speech depends to a large degree on how well the prosodic characteristics of speech are modelled (the prosodic characteristics describe mainly the evolution of the sentence melody and the loudness and duration of the individual speech segments). In our system, we have developed a unique method for the modelling and selection of natural prosodic contours extracted from natural speech utterances.
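In corpus-based synthesis, a large corpus typically offers several candidate segments for each required unit, and the synthesizer must pick one candidate per unit. A standard way to do this (not necessarily the exact method of this system) is a Viterbi-style dynamic program minimizing a target cost (mismatch with the desired prosody) plus a join cost (discontinuity between adjacent segments). A hedged sketch with synthetic toy numbers:

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target, minimizing total target + join cost.
    targets: desired unit specs; candidates[i]: candidate segments for target i."""
    # best[i][j] = (cheapest total cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, c) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace the cheapest path back through the backpointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(len(targets))]

# Toy example: units characterized only by pitch (Hz); target cost is the
# pitch mismatch, join cost penalizes pitch jumps between adjacent segments.
targets = [100, 120, 110]
candidates = [[95, 130], [118, 90], [105, 140]]
selection = select_units(
    targets, candidates,
    target_cost=lambda t, c: abs(t - c),
    join_cost=lambda a, b: 0.1 * abs(a - b),
)
print(selection)  # [95, 118, 105]
```

Real systems score candidates on multi-dimensional features (duration, pitch contour, spectral shape at the joins), but the cost structure and the dynamic-programming search are the same.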

The acoustic part (in the form of a general TTS system) can also be applied, separately from the talking head, in various dialogue systems; it can be used for the automatic reading of SMS messages, e-mails, electronic documents, books, etc. (see http://voice.zcu.cz or http://www.kky.zcu.cz/en/research-fields/acoustic-speech-synthesis).