Where's the AI in Cisco Spark Assistant?

As covered elsewhere on No Jitter, Cisco today introduced Cisco Spark Assistant, an AI-powered assistant aimed sharply at improving the efficiency of meetings. Rowan Trollope, SVP & GM of IoT and Applications at Cisco, and his team are on a relentless pursuit to make the meeting room experience so easy and convenient that the technology simply disappears, allowing participants to focus on the content of their meeting. Today's Spark Assistant announcement is another in a long line of innovations Cisco is bringing to the meeting experience, and to our industry overall.

Cisco is positioning Spark Assistant as an AI-powered voice assistant; let's decompose how Spark Assistant works so that we can identify where the artificial intelligence actually resides in this product.

At a high level, Spark Assistant can be diagrammed as a series of processes beginning with a user's voice command and ending with the assistant performing an action and informing the user of that action.

Figure 1. A block diagram of how Cisco Spark Assistant works

In this five-step process, there are three areas where Cisco has invoked the use of artificial intelligence: speech-to-text, natural language processing, and text-to-speech. Because speech-to-text and natural language processing are CPU-intensive, these functions are performed by processors up in the Spark cloud and not on the Spark Room endpoint.
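To make the division of labor concrete, here is a minimal sketch of the five steps as a data structure. The stage names follow the headings in this article; the `Stage` class and its fields are my own illustrative assumptions, and where text-to-speech runs is marked "unspecified" because Cisco's description does not say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    location: str  # "endpoint", "cloud", or "unspecified"
    uses_ai: bool


# Hypothetical model of the block diagram; not Cisco's actual architecture.
PIPELINE = [
    Stage("voice command capture", "endpoint", False),
    Stage("speech-to-text", "cloud", True),
    Stage("natural language processing", "cloud", True),
    Stage("action performed", "endpoint", False),
    Stage("text-to-speech", "unspecified", True),
]


def ai_stages(pipeline):
    """Names of the stages where AI is applied."""
    return [stage.name for stage in pipeline if stage.uses_ai]


def stages_at(pipeline, location):
    """Names of the stages that run at the given location."""
    return [stage.name for stage in pipeline if stage.location == location]


print(ai_stages(PIPELINE))
print(stages_at(PIPELINE, "cloud"))
```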

Speech to Text

As a user is speaking, an analog-to-digital converter in the newly announced Spark Room 70 (see "Hey Spark, How Is Cisco Partner Summit?") samples the sound wave at 44 kHz, and each measurement is assigned a number that reflects the amplitude of the sound wave at that point in the utterance.
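As a rough illustration of what sampling produces, the sketch below digitizes a pure tone rather than real speech. The 44 kHz rate comes from the description above; the 16-bit sample depth and the 440 Hz test tone are assumptions of mine, not details Cisco has published.

```python
import math

SAMPLE_RATE_HZ = 44_000  # sampling rate cited in the article
BIT_DEPTH = 16           # assumption: the article does not state the sample size


def sample_tone(freq_hz, duration_s, sample_rate_hz=SAMPLE_RATE_HZ, bit_depth=BIT_DEPTH):
    """Return integer amplitude samples of a sine wave, one per sampling instant."""
    max_amplitude = 2 ** (bit_depth - 1) - 1
    sample_count = int(round(duration_s * sample_rate_hz))
    return [
        round(max_amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz))
        for i in range(sample_count)
    ]


# Ten milliseconds of a 440 Hz tone becomes a string of 440 numbers.
samples = sample_tone(440.0, 0.01)
print(len(samples), samples[:5])
```

Real speech is far less regular than a sine wave, but the output has the same shape: one number per sampling instant, each reflecting the amplitude at that moment.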

This digital string of numbers is then sent to the Cisco Spark cloud, where additional filtering and normalization occurs. It is after this step that the artificial intelligence begins.

When speech is processed, it is broken up into phonemes, which are the elemental units of sound that make up spoken language. In the English language, there are approximately 40 phonemes. The algorithm then processes each phoneme by looking at what came before it and what came after it. The AI algorithm does this to put the phonemes into context so that it can determine what words, phrases, and sentences a person has spoken. It turns out that Cisco is using several third-party algorithms in Spark Assistant to convert the speech to text, and it will decide which one to use closer to when the product becomes generally available.
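The toy decoder below shows why context matters when turning phonemes into words. It is only a dictionary lookup, not the acoustic and language models a production recognizer uses, and it says nothing about the third-party algorithms Cisco is evaluating. The ARPAbet-style symbols and the five-word lexicon are hand-written assumptions for illustration.

```python
# Hypothetical lexicon: phoneme sequences (ARPAbet-style, no stress marks) to words.
LEXICON = {
    ("S", "T", "AA", "R", "T"): "start",
    ("M", "AY"): "my",
    ("M", "IY", "T", "IH", "NG"): "meeting",
    ("EH", "N", "D"): "end",
    ("K", "AO", "L"): "call",
}


def decode(phonemes, lexicon=LEXICON):
    """Split a phoneme sequence into known words, or return None if impossible."""
    n = len(phonemes)
    best = [None] * (n + 1)  # best[i] holds the words covering phonemes[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            chunk = tuple(phonemes[start:end])
            if best[start] is not None and chunk in lexicon:
                best[end] = best[start] + [lexicon[chunk]]
                break
    return best[n]


utterance = ["S", "T", "AA", "R", "T", "M", "AY", "M", "IY", "T", "IH", "NG"]
print(decode(utterance))
```

Note that the phoneme "M" begins both "my" and "meeting"; only the phonemes that follow it settle which word was spoken.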

This speech-to-text algorithm must be trained so that it understands words and phrases that are specific to a particular domain. For Cisco Spark Assistant, that domain is clearly meetings. Thus, the algorithm will be able to detect phrases like "Start my meeting," "Call Michael's meeting room," "Call Sidney," or "End my meeting."

The output from the speech-to-text functional block is a string of text that should be what the user has spoken. At this point in the process, the system has simply converted speech to text, but it has not determined the intent of what the user actually wants it to do. This happens in the natural language processing block.

Natural Language Processing

The goal of the natural language processing block is to determine the intent of the user, along with any entities or objects that this intent must act upon. This is another significant artificial intelligence processing step. This is the area where the MindMeld software acquired by Cisco earlier this year comes into play.

In its initial debut, Cisco Spark Assistant has four primary "intents," or actions, that it can invoke:

  1. Start a meeting
  2. Call into somebody else's meeting room
  3. Call another video endpoint or telephone
  4. End a call

Each of these intents necessarily involves other entities or objects that must be acted upon. For example, if a person says, "Start my meeting," the system can figure out that it's supposed to start a meeting, but it also has to figure out who "my" is. Consequently, there must be some type of identification of who is speaking to the system. Spark Assistant will try to identify who is in a room using Cisco's proximity mechanism, which involves the use of ultrasonic signaling between the video endpoint and a person's mobile device. If there are multiple people in the room, the system may need to ask which person is speaking so that it can properly identify who the word "my" is referring to. In future versions of Spark Assistant, the system will be able to determine who is speaking through facial recognition or through other means that may be added later, such as voice fingerprinting.
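To show what "intent plus entity" means in practice, here is a deliberately simple rule-based sketch. It is not how MindMeld works (MindMeld uses trained models rather than hand-written patterns), and the intent names, regular expressions, and the `resolve_owner` helper are all hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedCommand(NamedTuple):
    intent: str
    entity: Optional[str]


# Order matters: the meeting-room pattern must be tried before the generic call.
PATTERNS = [
    ("start_meeting", re.compile(r"^start (?P<entity>my) meeting$")),
    ("join_meeting_room", re.compile(r"^call (?P<entity>.+?)'s meeting room$")),
    ("end_call", re.compile(r"^end (?:my |the )?(?:meeting|call)$")),
    ("call_endpoint", re.compile(r"^call (?P<entity>.+)$")),
]


def parse(utterance):
    """Map recognized text to one of the four intents, or None if nothing matches."""
    text = utterance.strip().lower().rstrip(".!?")
    for intent, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return ParsedCommand(intent, match.groupdict().get("entity"))
    return None


def resolve_owner(paired_users):
    """Resolve "my": the single paired user, or None if the system must ask."""
    return paired_users[0] if len(paired_users) == 1 else None


print(parse("Call Michael's meeting room"))
print(resolve_owner(["sidney"]))
```

With two or more paired users in the room, `resolve_owner` returns `None`, which corresponds to the case where the assistant has to ask who is speaking.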

The system must also figure out which endpoints to call. For example, if a person says, "Call Michael's phone," the system must have some integration with the speaker's contact base, probably through Active Directory or another LDAP directory, so that it can look up which "Michaels" are in the person's contact list and what their contact parameters are.
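The lookup problem can be sketched with an in-memory list standing in for the directory. A real deployment would query Active Directory or another LDAP server; the `Contact` class, the sample names, and the placeholder phone numbers below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    phone: str


# Hypothetical contact list; the numbers are fictional placeholders.
DIRECTORY = [
    Contact("Michael Adams", "+1-555-0100"),
    Contact("Michael Baker", "+1-555-0101"),
    Contact("Sidney Clark", "+1-555-0102"),
]


def find_contacts(first_name, directory):
    """Return every contact whose first name matches, ignoring case."""
    wanted = first_name.casefold()
    return [c for c in directory if c.name.split()[0].casefold() == wanted]


def resolve_callee(first_name, directory):
    """Decide whether to call, ask the user to disambiguate, or report no match."""
    matches = find_contacts(first_name, directory)
    if len(matches) == 1:
        return ("call", matches[0])
    if not matches:
        return ("not_found", None)
    return ("ask_user", matches)


print(resolve_callee("sidney", DIRECTORY))
print(resolve_callee("michael", DIRECTORY)[0])
```

The "ask_user" outcome is the interesting one: with two Michaels in the list, the assistant cannot place the call without a follow-up question.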

As you can see, there is a lot of background processing that must occur to determine what the user really wants the system to do. The good news is that this first version of Spark Assistant focuses on the meeting domain, so the system logic is constrained to identify intents and entities that may be involved in some type of video meeting. Additional domains may be added at a future time, which would further increase the utility of Spark Assistant.

Once the Assistant's artificial intelligence software has determined the intent of what the user wants to do, along with the entities involved to accomplish that intent, it then must interface with some type of mechanism that will invoke an action.

Action Performed

Spark Assistant's artificial intelligence processing is performed in the cloud, but once the intent and the entities are identified, this information is sent back to the Spark Room 70 video endpoint. The endpoint accepts this intent along with the entity involved, and will use it to launch a call to a specified person, start a particular user's meeting, or end a call. These are functions that a Spark Room system already does, but heretofore, they have had to be done in a more manual fashion. Of course, Spark Room systems integrate with Spark's call control mechanism to launch calls and to end calls. This particular step of the Spark Assistant process does not involve artificial intelligence, but rather it uses the AI information from previous steps to automate existing functions that have typically been done manually.
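This hand-off is ordinary program logic, which a short sketch makes plain. The `FakeEndpoint` class, its method names, and the intent labels are hypothetical stand-ins, not the Spark Room API; the fake endpoint simply records what it was asked to do.

```python
class FakeEndpoint:
    """Stand-in for a room system; it records actions instead of placing calls."""

    def __init__(self):
        self.actions = []

    def start_meeting(self, owner):
        self.actions.append(("start_meeting", owner))

    def dial(self, target):
        self.actions.append(("dial", target))

    def hang_up(self):
        self.actions.append(("hang_up", None))


def perform(endpoint, intent, entity=None):
    """Translate an intent and entity from the cloud into an endpoint action."""
    if intent == "start_meeting":
        endpoint.start_meeting(entity)
    elif intent in ("join_meeting_room", "call_endpoint"):
        endpoint.dial(entity)
    elif intent == "end_call":
        endpoint.hang_up()
    else:
        raise ValueError(f"unsupported intent: {intent}")


room = FakeEndpoint()
perform(room, "call_endpoint", "sidney@example.com")
perform(room, "end_call")
print(room.actions)
```

Nothing here learns or infers anything: once the intent and entity arrive, the endpoint runs the same functions a user would otherwise trigger by hand.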

Text to Speech

Once Spark Assistant has identified what the user wants to do, and has invoked that action on the Spark Room 70 endpoint, it plays synthesized speech telling the user what action it is taking and, if necessary, who it is calling. This step of the process uses the intents that were found earlier along with the entities. The entities are important here because if the user said to call Michael, the text-to-speech output would need to synthesize the name Michael in its response. Although text-to-speech is still considered an artificial intelligence process, it is one of the easier AI processes because there is really no machine learning or deep learning required.
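Before any speech is synthesized, the system needs the sentence it is going to say. The sketch below covers only that text-building step, filling the entity into a template for each intent; the template wording and intent labels are my assumptions, and the actual speech synthesis is left out.

```python
# Hypothetical confirmation templates, one per intent.
TEMPLATES = {
    "start_meeting": "Starting {entity}'s meeting.",
    "join_meeting_room": "Calling {entity}'s meeting room.",
    "call_endpoint": "Calling {entity}.",
    "end_call": "Ending the call.",
}


def confirmation_text(intent, entity=None):
    """Build the sentence that would be handed to a speech synthesizer."""
    return TEMPLATES[intent].format(entity=entity)


print(confirmation_text("call_endpoint", "Michael"))
print(confirmation_text("end_call"))
```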

Conclusion

Cisco Spark Assistant is the latest in a series of enhancements Cisco is making to the meeting room experience. This particular assistant relies heavily on artificial intelligence processing in the Cisco Spark cloud. It also integrates with local processing capability on the endpoint to execute the command the user has spoken to the assistant.

We should expect to see a broadening of the capabilities found in Cisco Spark Assistant over time to cover additional intents that may be found within the overall meeting paradigm, such as scheduling, recording, summarizing, and tagging. These latter capabilities are simply speculation on my part, but they would make sense given where Cisco is trying to go with enhancing ease of use and the overall quality of the meeting room experience.

We should also expect to see other examples of business-focused AI systems from Cisco and others becoming pervasive throughout enterprises, impacting the efficiency with which we work and perform. We won't really see Spark Assistant in a production mode until sometime next year. Cisco explained that making the Assistant smart enough to do a simple demo is one thing, but making it a robust, secure business tool that works every time is a much more difficult task. Hence, the delivery date sometime in 2018.
