Speech recognition
Speech recognition refers to the technology behind the ability of some computers, software programs, and electronic devices to recognize, interpret, and respond to human speech. It is also referred to as speech recognition technology (SRT), automatic speech recognition (ASR), and speech processing. The term speech recognition also refers to the branch of computational linguistics that deals with the study and development of speech recognition equipment and programs.
[Image: Electrodes used in subvocal speech recognition research. By Dominic Hart, NASA (public domain), via Wikimedia Commons.]
Speech recognition is different from voice recognition, which refers to technology that specifically determines the unique patterns of individual voices. Voice recognition is often used as a means to identify specific individuals for security purposes, but it can also be part of the way a speech recognition program "learns" to improve its responses.
There are many uses for speech recognition, including automated phone systems used by businesses, talk-to-text software that enables people to speak instructions to a computer or to dictate text for conversion to a written format, and verbal control of devices such as television remotes and handheld electronics.
History
The earliest forms of speech recognition technology could only understand numbers. Bell Laboratories created the "Audrey" system in 1952; it could identify numbers spoken by a single voice. Researchers continued to refine the technology, and by 1962 a system called "Shoebox" was developed that could understand sixteen spoken words.
These initial forays into speech recognition were built around words limited to four vowels and nine consonants, but they were enough to attract interest and funding from the U.S. Department of Defense. By 1976, researchers at Carnegie Mellon University had developed "Harpy," which could understand about 1,100 words. That technological leap was made possible in part by improved search techniques that enabled speech recognition programs to locate spoken words in their vocabularies more quickly.
In the 1980s, the capabilities of speech recognition technology continued to grow because of the application of hidden Markov models. Technology applying this statistical method could estimate how likely it was that a sequence of sounds formed a particular word, rather than merely matching sounds to known patterns. This enabled the development of business applications and home dictation programs that could work with thousands of words. Because these programs dealt with words one at a time, requiring the speaker to pause between them, they were time-consuming to use; even so, they worked well enough that businesses began adopting voice-responsive automated answering services by the middle of the 1990s.
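The following sketch illustrates, in simplified form, how a hidden Markov model can pick the most likely sequence of phonemes for a series of observed sounds. It is a toy example: the states, observation labels, and all probabilities are invented for illustration and are not drawn from any real system.

```python
# Toy hidden Markov model: the hidden states are phonemes and the
# observations are coarse acoustic labels. All probabilities are invented.
states = ["s", "ey", "l", "z"]                      # phonemes of "sales," simplified
observations = ["hiss", "vowel", "liquid", "hiss"]  # what the microphone "heard"

start_p = {"s": 0.7, "ey": 0.1, "l": 0.1, "z": 0.1}
trans_p = {
    "s":  {"s": 0.1, "ey": 0.7, "l": 0.1, "z": 0.1},
    "ey": {"s": 0.1, "ey": 0.1, "l": 0.7, "z": 0.1},
    "l":  {"s": 0.1, "ey": 0.1, "l": 0.1, "z": 0.7},
    "z":  {"s": 0.25, "ey": 0.25, "l": 0.25, "z": 0.25},
}
emit_p = {
    "s":  {"hiss": 0.8, "vowel": 0.1, "liquid": 0.1},
    "ey": {"hiss": 0.1, "vowel": 0.8, "liquid": 0.1},
    "l":  {"hiss": 0.1, "vowel": 0.1, "liquid": 0.8},
    "z":  {"hiss": 0.8, "vowel": 0.1, "liquid": 0.1},
}

def viterbi(observations):
    """Return the most probable phoneme path for the observed sounds."""
    # Probability and best path for each state after the first observation.
    layer = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_layer = {}
        for s in states:
            # Pick the best previous state to transition from.
            prev = max(states, key=lambda p: layer[p][0] * trans_p[p][s])
            prob = layer[prev][0] * trans_p[prev][s] * emit_p[s][obs]
            new_layer[s] = (prob, layer[prev][1] + [s])
        layer = new_layer
    return max(layer.values(), key=lambda t: t[0])

probability, path = viterbi(observations)
print(path, probability)  # ['s', 'ey', 'l', 'z'] with its probability
```

Real recognizers of the era learned their probabilities from training data rather than using hand-set values, but the underlying search for the most probable interpretation worked in the same spirit.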
The ability to integrate speech recognition programs with powerful search engines such as Google enabled the technology to take a giant step forward in the early 2000s. Designers could draw on the search engine's database, including the millions of searches done by users, instead of a predetermined database. This gave speech recognition programming access to many more possibilities for predicting and determining spoken words. It also allowed the programming to incorporate a speaker's physical location into the interpretation process, based on information from searches the user had conducted. For example, the technology may more readily recognize the name of a specific local restaurant if the user has already searched for that place using the search engine. At the same time, the increased use of mobile devices created a market for applications and programs using speech recognition.
In the 2010s, Apple launched Siri, and Amazon launched Alexa, both virtual assistants. They performed tasks such as initiating phone calls, searching the Internet, and scheduling events. The virtual assistants communicated using speech synthesis, which generated a human-like voice. By the 2020s, speech recognition had become an important artificial intelligence (AI) technology. By implementing AI, the healthcare industry could use speech recognition to transcribe medical records and assist in diagnostics, while the automotive industry could use it to carry out drivers' commands to control multimedia in vehicles and assist in navigation, among other tasks.
How It Works
Speech recognition technology has two phases. In the first, the speech sounds are processed and turned into numeric values that represent different vocal sounds. In many cases, the technology also identifies and ignores sounds that are not determined to be part of the voice, such as background noise. These numeric values are then compared to databases to help determine what words have been spoken. These databases include an acoustic model, a lexicon, and a language model.
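As a rough illustration of the first phase, the sketch below splits an audio signal into short frames and converts each frame into a vector of numbers describing its frequency content. It is a simplified, assumed pipeline: the frame sizes are common choices, and real systems typically compute richer features, such as mel-frequency cepstral coefficients.

```python
import numpy as np

def frame_features(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split audio into short overlapping frames and turn each frame
    into a vector of numbers describing its frequency content."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)         # samples between frames
    features = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        frame = frame * np.hamming(frame_len)        # soften frame edges
        spectrum = np.abs(np.fft.rfft(frame))        # frequency magnitudes
        features.append(np.log(spectrum + 1e-10))    # compress dynamic range
    return np.array(features)

# One second of a synthetic 440 Hz tone stands in for recorded speech.
t = np.linspace(0, 1, 16000, endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)
print(frame_features(signal).shape)  # (number of frames, values per frame)
```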
The acoustic model includes all the sounds used in a language. Some applications of speech recognition include voice recognition and can be "trained" to recognize an individual's voice and the words used most frequently. This allows the technology to comprehend the unique way the person speaks, increasing the accuracy of its interpretations and responses.
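One simplified way to picture the acoustic model's role is as a classifier that labels each frame's numeric values with the most similar known sound. The sketch below uses an assumed nearest-centroid scheme with invented numbers; real acoustic models are statistical or neural and are trained on large amounts of recorded speech.

```python
import numpy as np

# Toy acoustic model: each phoneme is summarized by an average feature
# vector computed from training recordings (values invented).
phoneme_centroids = {
    "S":  np.array([0.9, 0.1, 0.2]),
    "EY": np.array([0.2, 0.8, 0.3]),
    "L":  np.array([0.3, 0.4, 0.9]),
}

def classify_frame(features):
    """Label a frame's feature vector with the closest known phoneme."""
    return min(phoneme_centroids,
               key=lambda p: np.linalg.norm(features - phoneme_centroids[p]))

print(classify_frame(np.array([0.85, 0.15, 0.25])))  # 'S'
```

In this picture, "training" a system to an individual amounts to re-estimating the stored vectors from that person's own recordings.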
The lexicon is a list of words and the way they correspond to the numeric values determined in the first step of the process. The lexicon also helps the speech recognition program determine how to pronounce the words.
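A minimal illustration of a lexicon follows, assuming a mapping from words to phoneme sequences written in ARPAbet-style symbols; the entries are hand-written examples, not drawn from any particular product.

```python
# Hypothetical lexicon: each word maps to the phoneme sequence
# a recognizer would expect to hear for it.
lexicon = {
    "service": ["S", "ER", "V", "AH", "S"],
    "sales":   ["S", "EY", "L", "Z"],
    "parts":   ["P", "AA", "R", "T", "S"],
}

def words_matching(phonemes):
    """Return lexicon words whose pronunciation matches a phoneme sequence."""
    return [w for w, pron in lexicon.items() if pron == phonemes]

print(words_matching(["S", "EY", "L", "Z"]))  # ['sales']
```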
The language model helps the speech recognition program combine words into grammatically correct sentences or phrases. Different forms of speech recognition technology have varying levels of language support. For instance, a phone system designed to direct a caller to one of a handful of departments in a small car dealership can have a restricted database that focuses on the words callers are most likely to use, such as "service" or "sales." Limiting a system to a set range of language in this way can help it respond more quickly and accurately to a caller's input.
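The sketch below shows one way such a restricted language model can score candidate transcriptions, assuming a toy bigram model built from a handful of invented phrases a dealership phone system might expect.

```python
from collections import defaultdict

# Invented training phrases for a small dealership phone system.
phrases = [
    "connect me to sales",
    "connect me to service",
    "i need service",
    "i need sales",
]

# Count word pairs (bigrams) so candidate phrases can be scored by
# how well their word pairs match what the model has seen before.
bigrams = defaultdict(int)
unigrams = defaultdict(int)
for phrase in phrases:
    words = ["<s>"] + phrase.split()
    for a, b in zip(words, words[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def score(candidate):
    """Probability of a candidate phrase under the bigram model."""
    words = ["<s>"] + candidate.split()
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
    return p

# The model prefers the in-domain transcription over an acoustically
# similar but unlikely one.
print(score("i need service"), score("i knead service"))  # 0.25 vs 0.0
```

Because every word pair in a candidate phrase must have appeared in the expected phrases, the model sharply favors in-domain transcriptions, which is what makes small, restricted systems fast and accurate.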