MATCH Technology Tutorial - Basic Principles of Speech Synthesis

Basic Principles of Speech Synthesis

There are two main ways for a computer to generate speech: Formant Synthesis and Concatenative Synthesis.

Formant Synthesis generates speech by using an acoustic model of how a human produces sound, rather than by playing back recorded speech. It creates a voice by varying the pitch of a generated waveform and shaping its resonant frequencies, which are known as formants. Formant Synthesis systems are available on most operating systems, such as Microsoft Windows and Mac OS X. However, this form of speech synthesis does not produce natural-sounding output. Instead, it creates the stereotypical robotic voice that we expect of computer-generated speech.
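As a rough illustration of the idea, the sketch below builds a vowel-like sound in Python by sending a train of pulses (standing in for the vocal cords) through a chain of resonant filters (standing in for the vocal tract). It is a minimal sketch under stated assumptions, not the method used by any particular synthesiser: the function names, the formant frequencies and the bandwidths are all illustrative choices that do not come from this tutorial.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonant filter centred on one formant frequency."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2 * math.pi * freq_hz / sample_rate
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    gain = 1 - a1 - a2  # scale the input so the filter has unit gain at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Return samples in [-1, 1] for a steady vowel-like sound.

    formants is a list of (frequency_hz, bandwidth_hz) pairs.
    """
    n = int(duration_s * sample_rate)
    period = int(sample_rate / pitch_hz)
    # Source: one pulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Filter: pass the source through one resonator per formant.
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]


# Illustrative (assumed) formant values for an "ah"-like vowel.
samples = synthesize_vowel([(700, 80), (1200, 90), (2600, 120)])
```

Changing `pitch_hz` alters the pitch of the voice, while changing the formant frequencies alters which vowel is heard. Because every sample is computed from a formula rather than taken from a recording, the result tends to sound buzzy and artificial.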

The University of Edinburgh is using a different technique called Concatenative Synthesis, which is based on piecing together segments of recorded speech to create a complete sentence. Early concatenative systems used a relatively small set of units, each containing a short sequence of speech, to cover all possible transitions between speech sounds. Edinburgh uses a further development of concatenative synthesis called Unit Selection. This generates speech by piecing together sequences taken from many texts recorded by a single speaker. These sequences can be very short, or as long as whole sentences. Unit Selection speech sounds more natural and can be understood more easily than Formant Synthesis speech.
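The selection step can be sketched in Python as a search for the cheapest sequence of recorded units. This is a simplified illustration, not Edinburgh's implementation: the `Unit` fields, the two cost functions and the tiny database are hypothetical, and pitch is used as the only feature to keep the example short. The sketch assumes that each candidate unit has a target cost (how well it matches what is wanted) and each pair of neighbouring units has a join cost (how audible the join would be), and that units which were adjacent in the original recording join at no cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str       # speech sound this unit contains
    pitch_hz: float  # average pitch of the recorded unit
    source: str      # which recording it was cut from
    position: int    # index of the unit within that recording


def target_cost(unit, wanted_pitch):
    """How far the unit is from the pitch we want."""
    return abs(unit.pitch_hz - wanted_pitch)


def join_cost(left, right):
    """Units that were adjacent in one recording join for free."""
    if left.source == right.source and right.position == left.position + 1:
        return 0.0
    return abs(left.pitch_hz - right.pitch_hz) + 10.0  # assumed fixed penalty


def select_units(phones, wanted_pitch, database):
    """Return (total_cost, units) for the cheapest sequence of units."""
    columns = [[u for u in database if u.phone == p] for p in phones]
    if any(not column for column in columns):
        raise ValueError("no recorded unit for at least one phone")
    # best holds one (cost, path) entry per candidate in the current column.
    best = [(target_cost(u, wanted_pitch), [u]) for u in columns[0]]
    for column in columns[1:]:
        new_best = []
        for u in column:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(u, wanted_pitch), path + [u]))
        best = new_best
    return min(best, key=lambda item: item[0])


database = [
    # "cat" recorded as one contiguous stretch at a low pitch
    Unit("k", 110.0, "rec1", 0),
    Unit("ae", 110.0, "rec1", 1),
    Unit("t", 110.0, "rec1", 2),
    # the same sounds taken from scattered places in another recording
    Unit("k", 120.0, "rec2", 0),
    Unit("ae", 125.0, "rec2", 5),
    Unit("t", 120.0, "rec2", 9),
]

cost, units = select_units(["k", "ae", "t"], wanted_pitch=120.0, database=database)
```

In this example the search picks the three contiguous units from `rec1`, even though the `rec2` units are closer to the wanted pitch, because joining scattered units costs more than the pitch mismatch. This reflects why longer recorded sequences tend to sound more natural: they contain fewer joins.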