No Jitter is part of the Informa Tech Division of Informa PLC

Decoding Dialogflow: Enabling Voice

Voice-enabled bots are revolutionizing the market for interactive voice response (IVR) systems. When compared to a traditional DTMF interface, voice can be a highly effective user interface replacement; plus, it becomes a shortcut to quickly get customers to the information they really want. Traditional IVR’s demise is destined to be slow, but it will be sure and steady.
 
We’re just a few years into a new era in which generic, speaker-independent speech-to-text solutions have become very good, with word error rates under 5%. That is roughly as well as most humans understand speech, and it has significant implications for how we will interface with our devices and with our customer support systems.
 
In this article, the seventh in a series, I’ll examine how to integrate voice with Google using Dialogflow and Contact Center AI (CCAI) for customer service interactions.
 
CCAI and Dialogflow: Not One and the Same
First, you must remember that CCAI and Dialogflow aren’t the same. CCAI is a wrapper around Dialogflow and other Google AI capabilities. While CCAI uses Dialogflow, it has a special set of APIs that provide capabilities available only to CCAI partners and their customers. In other words, Dialogflow alone doesn’t offer these capabilities through its APIs. CCAI also provides “call state,” which Dialogflow doesn’t have on its own. Call state enables some very useful functionality:
 
  1. Barge-in: In a plain Dialogflow voice interaction, the caller can’t “barge in.” Dialogflow must finish its work and deliver its entire response before another action can occur. For example, if the customer’s question matched an entry in a frequently asked questions database, Dialogflow would read back the entire FAQ entry, even if it missed the mark and the user wanted to move on. There’s no way to interrupt Dialogflow directly. CCAI, on the other hand, has a barge-in capability that lets customers interrupt Dialogflow with another query, and the bot will listen for a new and different intent within that query.
  2. Agent handoff: The CCAI API has a built-in agent handoff capability, so that if the user needs a live agent, the system immediately returns processing control to the contact center for routing to an agent. All bot interaction data transfers to the agent as well. This capability isn’t available in Dialogflow alone; it can be approximated, but only with significant custom programming.
 
Relying on a CCAI partner ensures that all the integration a contact center needs with Google’s CCAI components (Virtual Agent, Agent Assist, Sentiment Analysis, and Conversation Topic Modeler) is functional and ready to go, with no additional low-level programming required of the contact center owner.
 
Figure: CCAI management wrapper flow chart. CCAI eases the use of Google's capabilities for natural language processing, sentiment analysis, agent assist, and conversation topic modeling. You can do most of this on your own, but that effort would require a lot of custom code, and you wouldn’t get call state information.

 
gRPC: The Voice Interface into Dialogflow
Regardless of whether you use a CCAI partner’s implementation or program a voice connection into Dialogflow yourself, the connection must use the Google Remote Procedure Call protocol, gRPC. The CCAI partners have programmed this voice connectivity for you; you’ll have to do it yourself if you build your own code.
 
Why didn’t Google just use SIP, which is the favorite protocol in our industry? Well, gRPC does a lot more than transport voice. It lets client applications call methods or functions on remote servers as if they were local. Google uses gRPC extensively to enable its distributed applications and services.
 
If you plan to interface voice with CCAI/Dialogflow, be aware that Dialogflow’s speech-to-text engine supports only the six audio compression protocols (codecs) listed below. You’ll want to make sure that the voice interface you present to your customers uses one of these codecs natively. This is necessary to avoid transcoding between codecs, because transcoding leads to voice quality losses and reduced performance in the speech-to-text engine.
 
  • G.711 PCMU
  • Linear PCM
  • AMR/AMR-WB
  • OPUS
  • Speex Wideband
  • FLAC
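The codec check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any Google library: the codec labels are copied from the list above, and the function names `needs_transcoding` and `pick_native_codec` are hypothetical helpers invented for this example.

```python
# Codec labels copied from the article's list. AMR and AMR-WB are listed
# separately so each can be matched on its own.
SUPPORTED_CODECS = {
    "G.711 PCMU",
    "Linear PCM",
    "AMR",
    "AMR-WB",
    "OPUS",
    "Speex Wideband",
    "FLAC",
}


def needs_transcoding(codec: str) -> bool:
    """Return True when the codec is not on the supported list."""
    return codec not in SUPPORTED_CODECS


def pick_native_codec(offered):
    """Return the first offered codec that needs no transcoding, else None."""
    for codec in offered:
        if not needs_transcoding(codec):
            return codec
    return None


if __name__ == "__main__":
    # A caller's device offers G.729 first, then OPUS.
    print(pick_native_codec(["G.729", "OPUS"]))
```

A voice front end could run a check like this during call setup and prefer a codec that passes, so that no transcoding step sits between the caller and the speech-to-text engine.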

 

Audio travels to Google’s back end as bytes inside the request message. If you use the JSON form of the API, you’ll need to encode your compressed voice packets into Base64 format for transmission, and the Google cloud will decode the Base64 packets and decompress the audio; gRPC client libraries typically accept the raw bytes and handle serialization for you. Likewise, the text-to-speech audio from Google back to your application arrives inside a response object, Base64-encoded in the JSON form. You’ll need to extract that audio, repackage it into the normal SIP or WebRTC packet type your application is using, and then transmit it to the user’s device for decompression and playback.
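The Base64 round trip described above can be sketched with Python's standard library. This is only an illustration of the encoding step: the helper names `encode_audio` and `decode_audio` are hypothetical, and the four-byte frame stands in for a real compressed audio packet.

```python
import base64


def encode_audio(chunk: bytes) -> str:
    """Encode raw audio bytes as Base64 text for a text-based payload."""
    return base64.b64encode(chunk).decode("ascii")


def decode_audio(payload: str) -> bytes:
    """Decode Base64 text from a response back into raw audio bytes."""
    return base64.b64decode(payload)


if __name__ == "__main__":
    frame = bytes([0xFF, 0x7F, 0x00, 0x80])  # stand-in for a compressed audio frame
    encoded = encode_audio(frame)
    # The round trip must return exactly the bytes that went in.
    print(decode_audio(encoded) == frame)
```

Base64 expands the data by roughly a third (every 3 bytes become 4 characters), which is worth keeping in mind when sizing audio chunks.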

Streaming Voice vs. Batch Voice
Dialogflow can accept voice either in a continuous stream or as a single batch. If you’re using a CCAI partner to interface voice with Dialogflow, you’ll want to make sure the partner has implemented the streaming audio model.
 
The difference between them is that the streaming method sends voice information continuously to Dialogflow, reducing the latency between what a user says and the responses/fulfillments provided by Dialogflow. When using the streaming model, Dialogflow is continuously processing the audio input, making intent detection and entity recognition much faster than when using batch processing.
 
In the batch method, the transmission to Dialogflow happens only after a user completes an entire phrase. Dialogflow then processes that batch of audio and returns a response. For real-time conversations and interactions, like those in a contact center with a live customer calling in, the streaming voice method is a better option.
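The difference between the two models can be sketched as follows. This is a minimal illustration that makes no calls to Dialogflow: `batch_request` and `stream_requests` are hypothetical helpers, and the 3,200-byte chunk size assumes 16 kHz, 16-bit mono audio (about 100 ms per chunk).

```python
from typing import Iterator, List


def batch_request(utterance: bytes) -> List[bytes]:
    """Batch model: the whole utterance is sent as one payload after the caller finishes."""
    return [utterance]


def stream_requests(utterance: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Streaming model: yield small chunks so each can be sent as soon as it exists."""
    for start in range(0, len(utterance), chunk_size):
        yield utterance[start:start + chunk_size]


if __name__ == "__main__":
    two_seconds = bytes(64000)  # 2 s of silence at 16 kHz, 16-bit mono
    print(len(batch_request(two_seconds)))          # number of batch payloads
    print(len(list(stream_requests(two_seconds))))  # number of streamed chunks
```

In a live system the streaming generator would be fed from the incoming call audio rather than from a completed buffer, which is what lets the back end start working before the caller stops speaking.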
 
Languages Dialogflow Supports
Dialogflow supports 32 languages or language dialects for speech-to-text; most, but not all, of these can also be used for responses via text-to-speech. The telephony gateway and knowledge connectors (which roughly correspond to CCAI’s Agent Assist) are available only in English dialects at present, while sentiment analysis is available in a handful of languages.
 
Figure: Language support for Dialogflow speech and text capabilities (Google)
 
Three Ways for Connecting Voice
In the discussion above, we referenced only two ways for connecting voice into Dialogflow. It turns out that there are really three ways to do this:
 
1. The Google Dialogflow Phone Gateway: Google has created a “beta” version of a phone gateway that provides a simple way to telephony-enable a Dialogflow bot for testing, and it offers 30 days of free usage for this purpose. After going through the simple steps of getting a phone number from Google, you can voice-enable a bot in just minutes, because Google has already done the work of integrating this gateway with Dialogflow. Google doesn’t recommend the phone gateway for large production systems because it caps the number of phone numbers you can subscribe to and pay for at 60.
 
Figure: Three ways to get voice into Dialogflow: 1) CCAI partners, 2) custom programming, 3) the Dialogflow Phone Gateway.

 
 
2. Using a CCAI partner: This is the case referenced above, in which the contact center provider does the integration with Google and provides a relatively easy interface for connecting the contact center’s voice calls to Dialogflow and Agent Assist.
 
3. Developing the integration yourself: This is the other case referenced above, and it requires a lot of heavy lifting to get voice into Dialogflow. You might choose to integrate voice with Dialogflow and Knowledge Connector this way if your application doesn’t need the features of CCAI or the other integrated capabilities of the contact center.
 
Connecting Voice from Mobile Apps and Websites to Dialogflow
If you want to build a mobile app or a website that integrates with Dialogflow, you can either use the CCAI method, in which the contact center partner acts as a voice aggregator, or build a custom voice interface. If you’re going to run these apps through your contact center, the contact center provider will supply some type of API that allows your app to connect to the contact center via voice.
 
If you’re building the app from scratch, you’ll need to figure out how to transport the voice to your back end for processing, using a protocol such as SIP or WebRTC. You’ll also want to be sure that you use one of the audio codecs Google supports. Your back-end processing application will need to repackage the voice packets into gRPC messages before sending the stream to Google for processing.
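One piece of that back-end conversion can be sketched as follows. The sketch assumes the audio reaches your back end as unencrypted RTP packets, which is an assumption not stated above (SIP calls commonly carry media over RTP, while WebRTC media is encrypted and must be decrypted first). The function name `rtp_payload` is hypothetical; the header layout follows RFC 3550.

```python
import struct


def rtp_payload(packet: bytes):
    """Return (payload_type, payload) from one unencrypted RTP packet."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed 12-byte RTP header")
    first, second = packet[0], packet[1]
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    has_padding = bool(first & 0x20)
    has_extension = bool(first & 0x10)
    csrc_count = first & 0x0F
    payload_type = second & 0x7F  # 0 is G.711 PCMU in the standard audio profile

    # Skip the fixed header plus any 4-byte CSRC identifiers.
    offset = 12 + 4 * csrc_count
    if has_extension:
        # Extension header: 16-bit profile id, then 16-bit length in 32-bit words.
        (ext_words,) = struct.unpack_from("!H", packet, offset + 2)
        offset += 4 + 4 * ext_words

    end = len(packet)
    if has_padding:
        end -= packet[-1]  # last byte holds the padding length
    return payload_type, packet[offset:end]


if __name__ == "__main__":
    header = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234)
    payload_type, audio = rtp_payload(header + b"\xff" * 160)
    print(payload_type, len(audio))
```

The extracted payload bytes are what the back end would then place into the audio field of its outgoing requests, in the codec the call is already using.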
 
In Summary
  • There are three ways to get voice into Dialogflow: two easy and one hard. The CCAI partner method and the Dialogflow Phone Gateway are the easy ways; custom programming your own voice integration is the hard one.
  • Using CCAI is the only method that provides call state while interacting with Dialogflow, and it has the benefits of barge-in and live agent transfer.
  • Google Cloud Speech-to-Text is available in 32 languages/dialects while Text-to-Speech is available in 28.
  • Google’s speech-to-text engine only supports six audio codecs, so make sure your voice application uses one of these.
 
What’s Next
The next article in this series will appear in early October, with a focus on integrating Dialogflow with back-end systems.