University of West Bohemia in Pilsen

Audio-visual speech recognition


Ing. Císař Petr <pcisar1@kky.zcu.cz>

Audio-visual speech recognition

The task of automatic speech recognition by a computer in which both the acoustic and the visual component of speech are used. Speech is produced by the vocal tract; the result is an acoustic signal that one can hear, but also a movement of the speech organs that one can see. Unfortunately, the only visible speech organs are the lips, teeth, tongue, and cheeks, so the visual component of speech carries less information than the acoustic component. The visual speech component is used by hearing-impaired people (for lip-reading), but it is also used unconsciously by everyone in everyday communication, especially in noisy environments.


Audio-visual speech recognition chart
Figure 1: Audio-visual speech recognition diagram. The whole recognition process consists of two branches, an acoustic one and a visual one, which can be joined at several points in the process. The visual branch consists of three main blocks. The first block finds the area of interest, i.e. it locates the speaker's head and determines the position of the lips. The second block performs the parameterization; its task is to describe visual speech so that the description captures information about the speech but not about the speaker. The third block combines the descriptions of the acoustic and visual parts and performs the recognition.
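The three blocks of the visual branch can be sketched as three functions chained together. This is only an illustrative skeleton with hypothetical names and placeholder bodies; a real system would use a face detector for the region of interest, a feature transform (e.g. a DCT of the mouth area) for the parameterization, and a statistical recognizer in the final block.

```python
# Illustrative skeleton of the visual branch from Figure 1.
# All function names and placeholder values are hypothetical.

def find_region_of_interest(frame):
    """Block 1: locate the speaker's head and the lip region."""
    # Placeholder: return a fixed mouth bounding box (x, y, w, h).
    # A real system would run a face/mouth detector on the frame.
    return (40, 60, 32, 16)

def parameterize(frame, roi):
    """Block 2: describe visual speech, ideally speaker-independently."""
    x, y, w, h = roi
    # Placeholder: flatten the raw pixel intensities of the ROI.
    # A real system would apply a transform that suppresses
    # speaker-specific appearance.
    return [frame[r][c] for r in range(y, y + h) for c in range(x, x + w)]

def recognize(acoustic_features, visual_features):
    """Block 3: combine both descriptions and recognize the utterance."""
    # Placeholder: a real recognizer would score trained models here.
    return "hello" if acoustic_features and visual_features else ""
```

The split mirrors the diagram: each block can be developed and evaluated separately, and the fusion point between the acoustic and visual branch is isolated in the last function.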

The problem of speech recognition in a noisy environment: acoustic noise affects only the acoustic component of speech. That is why visual recognition is used to support acoustic speech recognition in noisy environments.
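One common way to exploit this is decision-level fusion: each branch scores the candidate words, and the scores are combined with a weight that shifts toward the visual branch as the acoustic signal degrades. The sketch below assumes per-word log-scores and an SNR-based weighting rule that is purely illustrative, not taken from the text.

```python
# Hedged sketch of noise-dependent late fusion of acoustic and
# visual per-word log-scores. The linear SNR-to-weight mapping is
# an illustrative assumption.

def fuse_scores(acoustic, visual, snr_db, snr_clean=30.0):
    """Weighted sum of log-scores; alpha = 1.0 means trust audio fully."""
    alpha = max(0.0, min(1.0, snr_db / snr_clean))
    return {word: alpha * acoustic[word] + (1.0 - alpha) * visual[word]
            for word in acoustic}

def recognize_word(acoustic, visual, snr_db):
    """Pick the candidate word with the highest fused score."""
    fused = fuse_scores(acoustic, visual, snr_db)
    return max(fused, key=fused.get)
```

In clean conditions the acoustic branch dominates; as the SNR drops, the decision relies increasingly on the visual scores, which the noise does not affect.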

How to use visual information
Figure 2: How to use visual information


Audio-visual speech recognition is the recognition of speech by a human or a machine in which both the acoustic and the visual components of speech are used. The visual component of speech is the visible part of the vocal tract. Since the non-visible part of the vocal tract also contributes to the production of speech, the visual part of speech contains less speech information than the acoustic part. But for hearing-impaired or deaf people the visual speech component is an important source of information. Lip-reading (visual speech recognition) is used by people without disabilities, too: it aids understanding when the acoustic speech is less intelligible.

In computer speech recognition the visual component of speech is used to support acoustic speech recognition. The design of an audio-visual speech recognizer is based on the experience of human lip-reading experts. Hearing-impaired people achieve a recognition rate of 60-80%, depending on the lip-reading conditions. The most important conditions for good lip-reading are the quality of the speaker's visual speech (proper articulation), the angle of view, and the illumination. Sometimes a speaker who is well understood from the acoustic component cannot be lip-read well.

In lip-reading, visemes are used as the basic speech units. A viseme is a group of phonemes whose visual appearance is the same; for example, the phonemes p, b, and m form one viseme. In Czech, approximately 13 visemes are distinguished. The similarity of the phonemes within one viseme is caused by the information missing from the visual component of speech. A big problem in lip-reading is the influence of subsequently pronounced visemes within one word (the co-articulation effect). Visemes can be classified as influencing or influenced: the shape of a viseme can be altered by the surrounding visemes. This phenomenon complicates lip-reading.
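A phoneme-to-viseme mapping like the one described above is naturally represented as a many-to-one lookup table. Only the p/b/m group comes from the text; the other groups and the viseme labels below are common illustrative examples, and the full Czech inventory of roughly 13 visemes is not reproduced here.

```python
# Illustrative phoneme-to-viseme lookup. Only the p/b/m grouping
# is taken from the text; the remaining groups are assumptions.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
}

def to_visemes(phonemes):
    """Map a phoneme sequence to its (shorter alphabet) viseme sequence."""
    return [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]
```

Because the mapping is many-to-one, distinct words such as "pat" and "bat" collapse onto the same viseme sequence, which is exactly why the visual channel alone carries less information than the acoustic one.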

What viseme is spoken?
Figure 3: What viseme is spoken? Try to guess.