Google Debuts Text-to-Speech for App Integration

Google has had text-to-speech synthesis technology in its own products for years -- think Google Assistant, Search, Maps, and Home -- as Dan Aharon, product manager for Cloud AI at Google, told me in a No Jitter briefing. Google Cloud customers have long been asking for access to that technology so developers can add text-to-speech capabilities to their own applications, he added. But the company has taken its time delivering because "we wanted to make sure that the voices we produce for Cloud are different than the voices we produce for our Google products" -- to eliminate any confusion among consumers about what is and isn't a Google product, he said.

Well, the wait is over. Today, via a Google Blog post, Aharon announced that the company is bringing its text-to-speech synthesis technology to the Google Cloud Platform with Cloud Text-to-Speech.

Cloud Text-to-Speech lets developers choose from 32 different voices in 12 language variants, and it supports a variety of audio formats including MP3 and WAV. Developers can also customize pitch, speaking rate, and volume gain.
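To make those knobs concrete, here is a minimal sketch of the JSON body a developer would POST to the service's `text:synthesize` REST method. The field names (`audioEncoding`, `speakingRate`, `pitch`, `volumeGainDb`) follow Google's API documentation, but the voice name and parameter ranges shown are illustrative assumptions -- check the current Cloud Text-to-Speech reference before relying on them.

```python
import json

def build_synthesis_request(text, voice_name="en-US-Wavenet-A",
                            language_code="en-US", encoding="MP3",
                            speaking_rate=1.0, pitch=0.0, volume_gain_db=0.0):
    """Assemble a request body for POST .../text:synthesize (sketch)."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": encoding,       # e.g. "MP3" or "LINEAR16" (WAV-style PCM)
            "speakingRate": speaking_rate,   # 1.0 = normal speed
            "pitch": pitch,                  # offset in semitones
            "volumeGainDb": volume_gain_db,  # relative gain in dB
        },
    }

body = build_synthesis_request("Hello from Cloud Text-to-Speech.")
print(json.dumps(body, indent=2))
```

The actual call also requires an API key or OAuth credentials, omitted here for brevity.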

Cloud Text-to-Speech includes a number of high-fidelity voices that were built using WaveNet, which is a neural network for raw audio that was created by DeepMind, a Google subsidiary focused on long-term research in machine learning and artificial intelligence (AI). WaveNet came out of a research paper published roughly a year and a half ago, Aharon told me.

"In late 2016, DeepMind introduced the first version of WaveNet -- a neural network trained with a large volume of speech samples that is able to create raw audio waveforms from scratch," Aharon wrote in the Google Blog post. "During training, the network extracts the underlying structure of the speech, for example which tones follow one another and what shape a realistic speech waveform should have. When given text input, the trained WaveNet model generates the corresponding speech waveforms, one sample at a time, achieving higher accuracy than alternative approaches."

In the year and a half since, the Google Speech team has been investing heavily and working closely with DeepMind to productize the WaveNet model. This resulted in improvements that allow the model to generate raw waveforms 1,000 times faster than the original model and to create waveforms at 24,000 samples per second. Additionally, Google has increased the resolution of each sample from 8 bits to 16 bits, which results in higher-quality audio and a more human-like sound, Aharon said.
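A quick back-of-the-envelope calculation (mine, not Google's) shows what those figures imply for the raw audio data rate:

```python
# Raw PCM data rate implied by the sample rate and bit depth above.
sample_rate_hz = 24_000    # samples per second
bits_per_sample = 16       # up from 8 bits in the original WaveNet
bitrate_bps = sample_rate_hz * bits_per_sample
print(bitrate_bps)         # 384,000 bits/s, i.e. 48 kB/s of uncompressed audio
```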

The company is touting the new WaveNet model as producing the most human-like, natural-sounding speech available today. As shown in the Google graphic below, testing groups gave the U.S. English WaveNet voices a mean opinion score (MOS) of 4.1 on a scale of 1 to 5, more than 20% better than the MOS for standard (non-WaveNet) voices -- "closing the gap to human speech by over 70%," Aharon said. And because the WaveNet model requires less recorded audio input, Google expects to continue improving the quality and variety of voices it makes available to Cloud customers over the next several months.


Mean Opinion Scores -- Graphic from Google

Google shared Cloud Text-to-Speech with alpha customers privately under NDA. "One of the things they like about the product is it's really good at pronunciation -- names, dates, times, etc.," Aharon said. "Other [text-to-speech] systems require customers to go reformat text to make it pronounce properly."

Two customers already using the service are Cisco and Dolphin ONE, the blog states.

"As the leading provider of collaboration solutions, Cisco has a long history of bringing the latest technology advances into the enterprise," said Tim Tuttle, CTO of Cognitive Collaboration at Cisco, in a prepared statement. "Google's Cloud Text-to-Speech has enabled us to achieve the natural sound quality that our customers desire."

To start, Google is targeting three main use cases with Cloud Text-to-Speech: intelligent IVRs in call centers, speech-enabling IoT devices, and converting text-based media into a spoken format. For call centers, enterprises can leverage Cloud Text-to-Speech to reduce or eliminate their reliance on pre-recorded human audio samples. In a customer service context, if a customer calls in for more information on toasters, for example, an IVR system using text-to-speech will be able to respond in natural language, Aharon said. "Imagine it's replacing an IVR."

With IoT devices, the use case is very similar to IVR, Aharon said. Users want to be able to talk to a device -- asking it to lower the volume, for example -- and have it respond in a natural voice. For the third use case, imagine clicking a button on your favorite news site and having the articles read aloud to you, Aharon explained.

Cloud Text-to-Speech is priced per 1 million characters of text processed. For standard (non-WaveNet) voices, the first 4 million characters are free, then it's $4 per 1 million characters; for WaveNet voices, the first 1 million characters are free, after which it's $16 per 1 million characters. To put this into perspective, each million characters equates to roughly 23 to 24 hours of audio, so 4 million characters would be around 90 to 100 hours of speech, Aharon said.
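The pricing model above (a free tier followed by a flat per-million-character rate) is simple enough to sketch in a few lines. The launch prices quoted in the article are hard-coded here; they are illustrative and would need updating against Google's current price list.

```python
def estimate_cost(characters, wavenet=False):
    """Estimate Cloud Text-to-Speech cost in USD at the launch prices
    quoted above: a free tier, then a flat rate per million characters."""
    free_chars = 1_000_000 if wavenet else 4_000_000
    rate_per_million = 16.0 if wavenet else 4.0
    billable = max(0, characters - free_chars)
    return billable / 1_000_000 * rate_per_million

# 10 million characters with standard voices: 6M billable at $4/M
print(estimate_cost(10_000_000))                 # 24.0
# 10 million characters with WaveNet voices: 9M billable at $16/M
print(estimate_cost(10_000_000, wavenet=True))   # 144.0
```

At roughly 23 to 24 hours of audio per million characters, those 10 million characters correspond to well over 200 hours of generated speech.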

For those who may have missed it, at Enterprise Connect 2018 earlier this month, Diane Chaleff, a Google Cloud Office of the CTO executive, gave an Industry Vision Address about how machine learning technologies will become core to the communications tools of the future. Watch her talk below to get caught up:

Follow Michelle Burbick and No Jitter on Twitter!
@nojitter
@MBurbick