Conversational Speech Interfaces in the Contact Center

Do you know whether that voice at the contact center is a live person or software conducting a conversational speech dialog? Sometimes I do, and other times I don't. But what I do know is this: The more that software can be tailored to understand and respond to natural speech, the less need we will have for agents.

In this regard, I think of Nuance's Dragon product, which includes natural speech recognition that I could use to avoid typing these blogs. Add a well-thought-out set of prerecorded speech responses and you have an automated agent or personal assistant.

When I read the IEEE Computer Society article "The Road to Natural Conversational Speech Interfaces," by Charles L. Ortiz of Nuance Communications, I immediately thought of the contact center and the ramifications that automated speech dialog would produce. This blog is based on a series of questions presented to Charles.

What is a conversational speech interface?
A Conversational User Interface (CUI) is one in which the meanings of previous user utterances are incorporated into the interpretation (that is, the determination of the meaning) of a new user utterance. This is in contrast to one-shot systems in which every utterance or query has to stand on its own (that is, carry all of the information necessary to correctly respond to it, as does, say, a Google query). Imagine, for example, a conversation with a system about travel plans:

  1. I want to plan a trip to NYC.
  2. When will you be leaving?
  3. November 1.
  4. How long will you be staying?
  5. One week.
  6. Do you need a hotel reservation?

Utterances 1, 3, and 5 are all utterances by the user. By themselves, however, they mean very little. If I say to you only the words "one week," you are likely to look at me very strangely. You can't make sense of it unless you have heard the prior parts of the conversation.

In addition, just like humans, a CUI must be able to support interruptions to a conversation; that is, a user changing the topic (in the middle of the above travel conversation, you might ask what the weather is like in Philadelphia). Even the system questions above leave out assumed information, which humans find quite natural. For example, "When will you be leaving?" by itself means little; what is meant is "When will you be leaving to go to NYC?"
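
To make the idea concrete, here is a minimal Python sketch of how a dialog context might track the question the system just asked, so that a terse reply like "one week" is interpretable and an interruption doesn't wipe out the trip context. All of the names here (DialogContext, the slot labels, the toy interruption test) are my own illustration, not a description of Nuance's implementation:

    class DialogContext:
        """Tracks the open slots of the current task so terse replies make sense."""

        def __init__(self, task):
            self.task = task      # e.g., "plan_trip"
            self.slots = {}       # filled slots: destination, depart_date, ...
            self.pending = None   # the slot the system just asked about

        def ask(self, slot, question):
            self.pending = slot
            print("System:", question)

        def hear(self, utterance):
            # An interruption: answer it, then fall back to the pending question.
            if utterance.lower().startswith("what is the weather"):
                print("System: (answers the weather question; trip context kept)")
                return
            if self.pending:
                # "One week" only means something because 'duration' is pending.
                self.slots[self.pending] = utterance
                self.pending = None

    ctx = DialogContext("plan_trip")
    ctx.slots["destination"] = "NYC"
    ctx.ask("depart_date", "When will you be leaving?")
    ctx.hear("November 1")
    ctx.ask("duration", "How long will you be staying?")
    ctx.hear("What is the weather like in Philadelphia?")  # topic change
    ctx.hear("One week")                                   # resolved via context
    print(ctx.slots)  # all three trip slots filled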

Where can it be applied?
Anywhere that language provides a natural medium for interaction: a virtual personal assistant, a database access system (say, for plane or restaurant reservations), and so on. Sometimes, other modes of interaction, such as point-and-click actions, may need to be integrated.

Can it be applied to the contact center?
Yes, and it would make interactions with a contact center more fluid. It would simplify those interactions, particularly given their mixed-initiative character (that is, whether the person or the system is guiding the conversation at any given moment). In most ordinary human conversations, for example, who is asking and who is answering the questions shifts back and forth.

What would be the impact on the customer interaction?
It would lead to less frustration and less repetition on the part of the user, because the user could guide the conversation in the direction that he or she wants it to take. The user's input to the system also becomes more "actionable" in the following sense: users often complain that they have to answer a fixed set of questions, or follow a particular flow of questions, before they can present the reasons behind their call or before the system has enough information to take useful action (e.g., tell them that a check has cleared).

How would it benefit the contact center?
Faster problem resolution
Yes, most call centers follow a sort of "tree" of question-answer pairs, which eventually gets to a "leaf" in the tree where the system has enough information to act. The flexibility of mixed initiative would help sidestep the need to go through the entire path of the tree (that is, all of the requisite question/answer pairs) before getting to the desired leaf. Because a human responder would not be needed as often, resolution should be arrived at more quickly. A toy sketch of this tree-skipping idea follows.
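
Here is that sketch: a rigid IVR asks every question on the path, while a mixed-initiative system skips any question whose slot the caller has already volunteered. The question list and slot names are my own toy example, not Charles's or Nuance's design:

    QUESTIONS = [
        ("account_type", "Checking or savings?"),
        ("action", "Balance, transfer, or check status?"),
        ("check_number", "Which check number?"),
    ]

    def run_dialog(prefilled):
        """Ask only the questions whose slots are still empty."""
        slots = dict(prefilled)
        for slot, question in QUESTIONS:
            if slot not in slots:
                print("System:", question)
                slots[slot] = input("Caller: ")
        return slots

    # Rigid path: run_dialog({}) walks the whole tree, question by question.
    # Mixed initiative: "Did check 1042 clear on my checking account?" fills
    # three slots at once, so the system can jump straight to the leaf:
    print(run_dialog({"account_type": "checking",
                      "action": "check status",
                      "check_number": "1042"}))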

Staff reduction
Probably, since it would reduce the number of times that a user would be motivated to ask to speak to an "operator," thereby sidestepping the automated functionality of the call center and seeking human support. We have not yet conducted formal experiments to justify this conclusion, however; it does represent one of the reasons that we are pursuing this line of work.

Reduce staff turnover
Charles did not express an opinion. I believe that by offloading the simpler agent tasks to an automated conversational speech interface, the remaining agents will deal with the more complex problems and not be burdened with the many routine interactions that do not require much intelligence to resolve. I think agents' value will increase, their frustration will decrease, and their pay may rise, all of which would reduce turnover.

Would you discuss some of the technical challenges that need to be addressed?
If you look at the example interaction given earlier, the CUI in that case must keep track of the context, which is that a trip to NYC is planned starting November 1. Without that context, each utterance, query, or request by the user would have to stand on its own. You would have to say to the system something like, "I want to find hotels in NYC for one week starting on November 1," which is rather long-winded and tedious to speak. I think we have all felt some trepidation just before issuing a natural language query to a personal assistant, hoping not to forget to mention some bit of information that would be important. It is especially frustrating if you forget some of the information (such as the duration of the hotel stay) and then have to start all over.

CUIs maintain a context of the prevailing conversation to leverage the efficiency of human language (the ability to convey more than what is said explicitly). CUIs also allow either the user or the system to take the initiative in the conversation (that is, to decide where the conversation goes or what the next topic is: "mixed initiative"). One additional challenge is to build systems that, given they can converse about one particular domain of application (say, restaurants), can switch or apply their knowledge to a new domain (such as entertainment).
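
One way to picture that last challenge (again, my own hypothetical sketch, not a statement about Nuance's architecture) is a dialog engine that stays generic while each domain is supplied purely as data. Moving from restaurants to entertainment then means writing a new schema, not new code:

    # Each domain is just a list of (slot, prompt) pairs; the engine is shared.
    RESTAURANTS = [("cuisine", "What kind of food?"),
                   ("party_size", "For how many people?")]
    ENTERTAINMENT = [("genre", "What kind of show?"),
                     ("date", "For what date?")]

    def next_prompt(schema, slots):
        """Return the first unanswered question in the domain schema."""
        for slot, prompt in schema:
            if slot not in slots:
                return prompt
        return "Done; ready to search."

    print(next_prompt(RESTAURANTS, {"cuisine": "Thai"}))  # For how many people?
    print(next_prompt(ENTERTAINMENT, {}))                 # What kind of show?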

How do you expect these challenges to be solved?
Some of these challenges have been addressed in isolation. One of the major remaining challenges is getting it all to work in a coordinated fashion: the speech recognition (going from speech to English text), the natural language understanding that interprets an individual utterance, the knowledge representation that maintains the context, and access to large databases of information all have to work together.
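
To see why coordination is the hard part, here is a deliberately simplified single-turn loop (all stubs, names, and the toy database are mine, for illustration only): each turn flows from speech recognition to language understanding to context maintenance, and finally to a database lookup once enough is known:

    def recognize(audio):
        # Speech recognition stage: speech -> English text.
        # Stubbed: we pretend the transcript is already text.
        return audio

    def understand(text, context):
        # Language understanding stage: interpret the utterance against
        # the prevailing context (here, just the question last asked).
        pending = context.pop("pending", None)
        return {pending: text} if pending else {}

    DATABASE = {("NYC", "November 1", "One week"): "42 hotels found"}

    def handle_turn(audio, context):
        context.update(understand(recognize(audio), context))  # keep context
        key = tuple(context.get(s) for s in ("destination", "depart", "duration"))
        if all(key):
            return "System: " + DATABASE.get(key, "no results")  # database access
        # Otherwise decide what to ask next and remember it as pending.
        for slot, question in [("destination", "A trip to where?"),
                               ("depart", "When will you be leaving?"),
                               ("duration", "How long will you be staying?")]:
            if slot not in context:
                context["pending"] = slot
                return "System: " + question

    ctx = {}
    for utterance in ["I want to plan a trip", "NYC", "November 1", "One week"]:
        print(handle_turn(utterance, ctx))  # ends with the hotel lookup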

When do you expect to see this applied to the contact center?
This will happen over the next five years, but not all at once. Greater functionality will be introduced incrementally.