Automated Speech Recognition: On the Brink of Revolution

Keeping customer-facing interfaces simple is essential to delivering a good customer experience. Especially when harried, customers don’t want grief from the user interface, be that an old-school IVR or a newfangled virtual agent. Increasingly, contact center providers are turning to voice as the interface of choice for taking the pain out of a customer engagement.
 
Key here is automated speech recognition (ASR). Once only able to allow a yes/no or single-word response to an IVR prompt, ASR technology is becoming increasingly sophisticated – but, according to Five9 CTO Jonathan Rosenberg, has a way to go before reaching its full potential.
 
In this Q&A with No Jitter, Rosenberg shares his thoughts on the importance of ASR, the role of artificial intelligence (AI) in ASR, pricing, and where he hopes we’re headed with the technology.

Jonathan Rosenberg, CTO, Five9

 
 
With today’s heavy focus on the customer experience, why is it important that companies address ASR from a business perspective?
Rosenberg: ASR is just a technology. What is important is the outcome it — with other technologies — can produce. The outcome it enables is a natural, human-like interaction that can replace or augment the traditional IVR systems that are a mainstay of contact centers today. Users are used to products like Alexa and Siri that allow them to interact naturally. Ask a question, get an answer. Why should a contact center interaction be different? Rather than following a fixed, tree-like menu structure, users just provide information and answer questions. Rather than using phone keypads to enter information, users can speak naturally.
In five years, I hope we laugh at the idea of ‘press one for sales’ as an icon of a prior era.
 
How can businesses use legacy data to train prediction models?
Rosenberg: Call recordings are the digital gold of the contact center. Think about the information that is within them. They contain the types of questions customers have. They contain the issues they are complaining about. They contain the satisfaction levels of those customers. They contain the quality and performance of the agents. They contain information on what takes the most time and where money can be saved. They contain the lingo and vocabulary that customers use about your product. Every single one of these can be extracted by applying speech recognition, followed by intelligent tooling designed to obtain, and then use, this insight.
 
We believe that the hard part of AI in the contact center is exactly this problem — how to use legacy data to quickly train prediction models, and then measure the accuracy and ROI of those models in production.
 
What’s different about ASR for the contact center versus other applications?
Rosenberg: ASR in the contact center is literally the worst-case scenario for this technology. Often you hear about the amazing accuracy of ASR technology, in the 90-95% range, often exceeding that of human beings. That is true, but only for the perfect case of a human being reading written prose aloud, without background noise, with a perfect recording of wideband voice.
 
Compare that to the use case in the contact center of an agent assistant. In this case, we have poor quality audio to start with. It is from a mobile phone (and thus usually a lousy codec), with background noise and spotty coverage. The content of the speech is often rambling and confused. There are accents. The agent and the customer might talk over each other. They use lingo that is specific to the products of the enterprise.
 
We did a study about a year ago to examine the accuracy of ASR from different vendors when run against contact center recordings. One of the shocking lessons was that it was extremely difficult to get two human beings to listen to the same recording and come up with the same transcription.
 
We had a small team doing this in order to compute what is called “ground truth” in AI systems. Ground truth represents what the right answer to a prediction model is supposed to be. You compare the actual output to ground truth to determine its accuracy. Not only did we struggle to come to a single ground truth for many recordings, but we found that the manual transcriptions from different engineers on the team differed by percentages that were as large as the differences in accuracy between the ASR systems!
 
As a result, the accuracy of all ASR systems is much lower in this environment, making it extra challenging to achieve good results.
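
To make the ground-truth comparison described above concrete, here is a minimal Python sketch. It assumes word error rate (WER) as the accuracy measure; WER is a common metric for ASR, but the interview does not say which metric Five9's team used. The function name and the three sample transcripts are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts of the same utterance (invented for this sketch).
human_a = "i want to cancel my order"
human_b = "i wanna cancel my order"
asr_out = "i want to cancel the order"

# Treat the first human transcript as the ground truth.
human_disagreement = word_error_rate(human_a, human_b)
asr_error = word_error_rate(human_a, asr_out)
print(f"human vs. human: {human_disagreement:.1%}")
print(f"ASR vs. human:   {asr_error:.1%}")
```

In this invented example the two human transcribers disagree with each other more than the ASR output disagrees with the chosen reference, which is the kind of ambiguity Rosenberg describes when trying to settle on a single ground truth.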
 
How important is precision when it comes to ASR for the contact center?
Rosenberg: ASR alone is just a piece of the puzzle. Most contact center applications of ASR include other machine learning components — natural language processing, topic models, conversation classifiers, and so on. They also include older rules-based systems, such as regular expressions, workflows, and so on. What matters is the overall accuracy of the system once all of these are put together. In our experience, the accuracy of the ASR matters much less in some use cases (agent assistance being one such example), and a lot more in others (such as a voicebot doing order processing).
 
What do advances in AI mean for ASR and the technology five years from now?
Rosenberg: The main thing I’m looking forward to is a reduction in costs for the same level of accuracy. Right now, ASR is still relatively expensive.
 
Here’s an interesting bit of math. Let’s say you want to build a virtual agent to handle the same volume of calls as a human agent. A reasonable call volume for an agent is 4,000 minutes of voice per month. If you look online at public pricing from Google for its ASR product, you’ll see it’s $0.009 for 15 seconds of audio for the product needed for contact centers (the enhanced model, no logging). Multiply it out — and that is $144 per agent per month just in ASR costs!
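
The multiplication can be spelled out in a few lines of Python. This is only a sketch of the arithmetic using the figures quoted above; the constant and function names are illustrative, and it assumes the 4,000 minutes are billed as one continuous total, ignoring any per-call rounding up to the next 15-second increment.

```python
MINUTES_PER_AGENT_PER_MONTH = 4_000  # call volume quoted in the interview
PRICE_PER_INCREMENT_USD = 0.009      # public price quoted in the interview
INCREMENT_SECONDS = 15               # billing increment quoted in the interview


def monthly_asr_cost(minutes: float, price_per_increment: float,
                     increment_seconds: int) -> float:
    """Cost of transcribing `minutes` of audio billed in fixed increments."""
    increments = minutes * 60 / increment_seconds  # 4 increments per minute
    return increments * price_per_increment


cost = monthly_asr_cost(
    MINUTES_PER_AGENT_PER_MONTH,
    PRICE_PER_INCREMENT_USD,
    INCREMENT_SECONDS,
)
print(f"${cost:.2f} per agent per month")  # prints $144.00 per agent per month
```

Put differently, 4,000 minutes is 16,000 fifteen-second increments, and 16,000 × $0.009 = $144.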
 
Obviously, vendors negotiate prices and do a bunch of things to bring these costs down so that our products are affordable. My main point, though, is that the baseline public pricing is still high for pervasive usage in the contact center. Innovations in new models and computational optimizations will bring these prices down, and once that happens, we’re really going to see some revolution in the contact center.

To learn more about enterprise speech technologies, attend the Enterprise Connect Digital Conference & Expo 2020 taking place online Aug. 3 to 6. On Monday, Aug. 3, industry analyst Jon Arnold will be presenting the session, "Speech Technologies: Innovations and Use Cases," from 4:00 p.m. to 5:15 p.m. The session will conclude with a live chat Q&A; to participate, register now!