Speech Recognition

What is Speech Recognition?

Automatic speech recognition is a process by which a computer takes a speech signal (recorded using a microphone) and converts it into words. Speech recognition is a hard problem for a number of reasons:

  • Many different words can be spoken. The average person uses thousands or tens of thousands of words.

  • In speech the boundaries between words are not obvious - one word runs on into another. So the problem is one of finding the words as well as identifying them. Recognizing such naturally flowing speech, which is the usual case, is called continuous speech recognition. Sometimes, to make the problem easier, systems demand that people leave pauses between words, which is an unnatural way of speaking. Recognizing such speech is referred to as isolated word recognition.

  • When people speak casually or conversationally, it is a lot more difficult for the computer to recognize what they are saying compared to when they are dictating or reading from a script. We say that the acoustic variability is much greater.

  • It is more straightforward for a system to be speaker-dependent - able to recognize the speech of a particular speaker. But it is much more useful if a system is speaker-independent - able to recognize anyone.

  • Speech recognition works much better in a quiet room with a nearby microphone. If this is not the case, then other sounds may also be recorded and it is much harder to recognize such noisy speech.
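
The word-boundary problem described above can be illustrated with a toy text analogy: given a stream of letters with the spaces removed, the words have to be found as well as identified. The sketch below is only an illustration under that assumption; real recognizers work on acoustic signals, not letters, and the function name `segment` and the tiny vocabulary are hypothetical.

```python
def segment(stream, vocabulary):
    """Split an unbroken character stream into vocabulary words, or return None."""
    # best[i] holds one valid segmentation of stream[:i], or None if none exists.
    best = [None] * (len(stream) + 1)
    best[0] = []
    for end in range(1, len(stream) + 1):
        for start in range(end):
            word = stream[start:end]
            # Extend an existing segmentation of stream[:start] with a known word.
            if best[start] is not None and word in vocabulary:
                best[end] = best[start] + [word]
                break
    return best[len(stream)]


# Hypothetical toy vocabulary; a real system knows tens of thousands of words.
VOCAB = {"book", "an", "appointment", "a", "point"}
print(segment("bookanappointment", VOCAB))
```

Isolated word recognition corresponds to being handed the words already separated, so only the identification step remains.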

Speech recognition systems work best in particular applications where a person is expected to be speaking about a particular subject (e.g. booking a doctor's appointment). In such cases, the speech recognition system can take advantage of the fact that a person will most likely talk about only a very limited number of things.
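
As a rough illustration of why a limited subject helps, the sketch below maps a noisy recognition result onto the closest phrase from a small list of things a caller is expected to say. It is a simplified stand-in, not how any particular recognizer works: the phrase list and the function name `closest_domain_phrase` are hypothetical, and plain string similarity from Python's standard `difflib` module takes the place of real acoustic and language models.

```python
from difflib import SequenceMatcher

# Hypothetical phrases a caller might say when booking a doctor's appointment.
DOMAIN_PHRASES = [
    "book an appointment",
    "cancel my appointment",
    "change my appointment",
    "speak to a doctor",
]


def closest_domain_phrase(hypothesis, phrases=DOMAIN_PHRASES):
    """Return the in-domain phrase most similar to a possibly misrecognized one."""
    def similarity(phrase):
        # Ratio in [0, 1]; higher means the two strings share more characters in order.
        return SequenceMatcher(None, hypothesis.lower(), phrase).ratio()
    return max(phrases, key=similarity)


# A misrecognized word is pulled back to the nearest expected phrase.
print(closest_domain_phrase("book an apartment"))
```

With only four candidate phrases, a single misheard word rarely changes which phrase is closest; with an unrestricted vocabulary there would be far more plausible alternatives to confuse.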

The University of Edinburgh has a large group of researchers working on speech recognition. The work ranges from very basic research (building mathematical models of how speech works), through research into recognizing the speech of elderly users, to recognizing speech recorded using distant microphones (e.g. on a table top).

