No Jitter is part of the Informa Tech Division of Informa PLC


AI Plus Speech Equals New Value

Google WaveNet -- Hearing Is Believing
You don’t need me to tell you about Google, but you might need me to tell you about WaveNet. Google strands run through every thread of the AI tapestry, and it’s not much different for voice. There’s a separate post to be written about that, but for now, all roads lead to WaveNet.
 
If you don’t know, Google made a savvy acquisition of U.K.-based DeepMind in 2014. This is the company that used AI to defeat the champs at Go, and if you thought IBM Watson was impressive for beating Ken Jennings, you need an AI refresh. The Go story is a topic for yet another post, and it’s an ominous sign of what’s to come as neural networks and deep learning find their way into everyday life.
 
I digress, but this takes us to WaveNet. I’m not an expert on how they do it, but WaveNet is a neural network that generates raw audio directly, producing speech that is more accurate and natural-sounding than other text-to-speech (TTS) models. That’s all I’m going to say, and I’ll now let your ears do the testing.
 
Below are two 30-second clips created by a Google team headed by Dan Aharon, product manager of cloud and speech products. During my research for my Enterprise Connect talk, we discussed ways to illustrate how good Google’s speech technology has become. While the Otter example is about speech-to-text (STT) and real-time transcription, WaveNet is about using AI to create text-to-speech output that sounds remarkably human. Aside from getting the language right, the bigger challenge is generating utterances with the natural flow, cadence, pacing, and tone of the human voice.
 
Dan asked me to write a narrative to explain this, and said he would generate two speech samples for comparison. The first sample below is what’s called Standard TTS, and it sounds OK, but rather stilted. Compare that to the second sample, which uses the same narrative but was generated with WaveNet. Just to be clear, pay more attention to the audio quality than to what’s being said.
 
The narrative is exactly the same for both samples, which may be confusing: it refers to the sound quality differing between the two samples, but within each sample you only hear one voice. The first sample represents one end-to-end approach to TTS, and the second represents the other. I realize it’s a bit awkward when listening to the narrative, but that’s just the way it turned out.
 
What really matters is the audio quality comparison. I’m not going to say the WaveNet version is a perfect emulation of human speech, but it is clearly warmer and more natural sounding, and from there it’s not a big leap to see how this quality of TTS will make conversational AI second nature before long.
 
Google TTS -- Standard Example (without WaveNet)
 
Google TTS -- WaveNet Example
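If you want to reproduce this comparison yourself, Google’s Cloud Text-to-Speech API exposes both voice families, and the request is identical except for the voice name. The sketch below builds the JSON bodies for the `v1/text:synthesize` REST endpoint; the specific voice names (`en-US-Standard-B`, `en-US-Wavenet-B`) are illustrative, so check Google’s current voice list before relying on them.

```python
# Sketch: request bodies for Google Cloud Text-to-Speech's
# v1 text:synthesize REST endpoint. The Standard and WaveNet
# requests differ only in the "name" field of the voice.
# Voice names here are illustrative, not guaranteed current.

def tts_request(text: str, voice_name: str) -> dict:
    """Build a synthesize request body for the given voice."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": "en-US", "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }

narrative = "This sample compares standard TTS with WaveNet."
standard = tts_request(narrative, "en-US-Standard-B")  # concatenative/parametric
wavenet = tts_request(narrative, "en-US-Wavenet-B")    # neural WaveNet voice

# Everything except the voice name is identical between the two requests.
assert standard["input"] == wavenet["input"]
assert standard["audioConfig"] == wavenet["audioConfig"]
assert standard["voice"]["name"] != wavenet["voice"]["name"]
```

With credentials in place, either body can be POSTed to the `text:synthesize` endpoint, and the response carries base64-encoded MP3 audio you can play back and compare by ear, just as the two clips above do.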
 
TTS has a different set of collaboration use cases from STT, and the starting point is producing audio that sounds like we do. That matters especially for longer-form content -- say, a podcast that AI compiles for you from excerpts of a long report you don’t have time to read, but can digest during your drive into work.
 
Once you get comfortable with TTS applications, it’s not such a big leap to conversational AI, along the lines of what Google showed with Duplex last year. Despite the Duplex demo passing the Turing Test -- which definitely cuts both ways -- Google has some challenges to work through there, and like it or not, Amazon Web Services’ Alexa for Business is going to find its legs.
 
At some point soon, conversational AI is going to drive new workplace value for voice that we couldn’t even imagine a few short years ago. Whether or not you believe what you’re seeing and hearing in this post, it’s here now, and I have no doubt the Venn diagram circles for AI and speech are going to move a lot closer together, and that’s going to be good news for collaboration.
 

BCStrategies is an industry resource for enterprises, vendors, system integrators, and anyone interested in the growing business communications arena. A supplier of objective information on business communications, BCStrategies is supported by an alliance of leading communication industry advisors, analysts, and consultants who have worked in the various segments of the dynamic business communications market.