Talk to Me – Voice Computing

Technologists predict that one of the most consequential changes in our daily lives will soon come from being able to converse with computers. We are seeing the early stages of this today, as many of us now have personal assistants in our homes such as Amazon’s Alexa, Apple’s Siri, Microsoft’s Cortana, or Google Assistant. In the foreseeable future, we’ll be able to talk to computers in the same way we talk to each other, and that will usher in perhaps the most important change ever in the way that humans interact with technology.

In the book Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think, author James Vlahos looks at the history of voice computing and predicts how it will change our lives in the future. This is a well-written book that explains the underlying technologies in an understandable way. I found it to be a great introduction to the technology behind computer speech, an area I knew little about.

One of the first things the book makes clear is how difficult a technical challenge it is to converse with computers. Four distinct technologies are involved. First is Automatic Speech Recognition (ASR), where human speech is converted into digitized ‘words’. Second is Natural Language Understanding (NLU), the process a computer uses to interpret the meaning of the digitized words. Third is Natural Language Generation (NLG), which is how the computer formulates its response to a human request. Finally, Speech Synthesis is how the computer converts its answer into audible words.
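For readers who think in code, here is a minimal sketch of how those four stages hand off to one another. It is not from the book and it is not how any real assistant works: every function name is made up for illustration, the "audio" is just text bytes, and each stage is a toy stand-in for what is in reality a large machine-learning system.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """What the listener wants, as extracted by the (toy) NLU stage."""
    name: str
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    # Stage 1, speech recognition. ASSUMPTION: the "audio" is really
    # UTF-8 text, so we only decode and normalize it.
    return audio.decode("utf-8").lower().strip()


def understand(text: str) -> Intent:
    # Stage 2, language understanding. Toy keyword matching only.
    if "weather" in text:
        words = text.rstrip("?.!").split()
        # Treat the word following "in" as the city, if there is one.
        if "in" in words[:-1]:
            city = words[words.index("in") + 1]
        else:
            city = "here"
        return Intent("get_weather", {"city": city})
    return Intent("unknown")


def generate_reply(intent: Intent) -> str:
    # Stage 3, language generation. A fixed template per intent.
    if intent.name == "get_weather":
        return f"Here is the weather for {intent.slots['city'].title()}."
    return "Sorry, I did not understand that."


def synthesize(reply: str) -> bytes:
    # Stage 4, speech synthesis. ASSUMPTION: encoding the text to bytes
    # stands in for producing an audio waveform.
    return reply.encode("utf-8")


def converse(audio: bytes) -> bytes:
    # The full round trip: audio in, audio out.
    text = recognize_speech(audio)
    intent = understand(text)
    reply = generate_reply(intent)
    return synthesize(reply)


if __name__ == "__main__":
    print(converse(b"What is the weather in Boston?").decode("utf-8"))
```

The point of the sketch is only the shape of the pipeline: each stage consumes the previous stage's output, so a mistake in an early stage (mishearing a word) carries through everything that follows.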

There is much progress being made in each of these areas. For example, ASR developers are using machine learning to train computers on how humans talk, drawing on huge libraries of actual human speech and human interactions from social media sites. They are seeing progress as computers learn the many nuances of the ways that humans communicate. In our science fiction we’ve usually portrayed future computers that talk woodenly, like HAL from 2001: A Space Odyssey. It looks like our future will instead be personal assistants that speak to each of us using our own slang, idioms, and speaking style, and in realistic-sounding voices of our choosing. The goal for the industry is to make computer speech indistinguishable from human speech.

The book also includes some interesting history of the various voice assistants. One of the most interesting anecdotes is about how Apple blew its early lead in computer speech. Steve Jobs was deeply interested in the development of Siri and told the development team that Apple was going to give the product a high priority. However, Jobs died the day after Siri was announced to the public, and Apple management put the product on the back burner for a long time.

The book dives into some technologies related to computer speech and does so in an understandable way. For instance, the book looks at the current status of Artificial Intelligence and at how computers ‘learn’ and how that research might lead to better voice recognition and synthesis. The book looks at the fascinating attempts to create computer neural networks that mimic the functioning of the human brain.

Probably the most interesting part of the book is the last few chapters that talk about the likely impact of computer speech. When we can converse with computers as if they are people, we’ll no longer need a keyboard or mouse to interface with a computer. At that point, the computer is likely to disappear from our lives and computing will be everywhere in the background. The computing power needed to enable computer speech is going to have to be in the cloud, meaning that we just speak when we want to interface with the cloud.

Changing to a voice interface with the cloud also drastically changes our interface with the web. Today most of us use Google or some other search engine when searching for information, and we select one of the choices offered on the first or second page of the search results. In the future, the company that provides our voice interface will be making that choice for us. That puts a huge amount of power into the hands of that company – it essentially gets to choreograph our entire web experience. Today the leading candidates to be that voice interface are Google and Amazon, but somebody else might grab the lead. There are ethical issues associated with a choreographed web – the company doing our voice searches is deciding the ‘right’ answer to the questions we ask. It will be incredibly challenging for any company to do this without bias, and more likely they will do it to drive profits. Picture Amazon driving all buying decisions to its platform.

The transition to voice computing also drastically changes the business plans of a lot of existing technology companies. Makers of PCs and laptops are likely to lose most of their market. Search engines become obsolete. Social media will change drastically. Web advertising will take a huge hit when we don’t see ads – it’s hard to imagine that users will tolerate listening to many ads as part of the web interface experience.

The book makes it clear that this is not science fiction but is a technology that will be maturing during the next decade. I recently saw a video of teens trying to figure out how to use a rotary dial phone, but it might not be long before kids will grow up without ever having seen a mouse or a QWERTY keyboard. I will admit that a transition to voice is intimidating, and they might have to pry my keyboard from my cold, dead hands.