
AI & Speech Tech Prevail at Enterprise Connect 2022

Jon Arnold’s fifth installment of “State of the Speech and Voice Recognition Market” had one message for Enterprise Connect 2022 attendees: Conversational AI is a game-changer in customer engagement, for better and for worse.
 
Arnold kicked off his “Where Speech Tech Is Today—and Where It’s Heading” session by explaining why enterprises are riding the evolutionary wave from chatbots to virtual assistants to conversational AI.
 
Arnold’s upshot: when it comes to speech recognition technologies, “AI brings all kinds of new things…it’s a happening space…it isn’t going away, and it shouldn’t be.” He drew a contrast between chatbots and conversational AI-powered virtual assistants: chatbots are transactional, close-ended, structured, algorithmic, and replicate pre-existing conversations, while virtual assistants are conversational, open-ended, and unstructured.
 
While Arnold said not a lot has changed from last year, he noted that an enterprise speech tech ecosystem is emerging. The ecosystem includes pure plays (Deepgram, Dubber, LumenVox, Otter.ai, Rev.ai, Speechmatics, Verbit), major platforms (AWS-Amazon Lex, Google, IBM-Watson Assistant, Microsoft-Azure Cognitive Services), and UCaaS providers (Avaya, Cisco Webex, Dialpad AI, Microsoft Teams, Zoom).
 
“If you don’t know these companies, you should,” Arnold said, adding that translation and transcription are standard components in their offerings.
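To make “transcription as a standard component” concrete, here is a minimal, hedged sketch of a batch transcription job on one of the major platforms Arnold listed (AWS), using Amazon Transcribe through boto3. The bucket URI, job name, and region are placeholders, and the snippet assumes AWS credentials are already configured; it is an illustration of the pattern, not anything demonstrated in the session.

```python
# Minimal sketch: start a batch transcription job with Amazon Transcribe.
# The S3 URI, job name, and region are illustrative placeholders.
import time
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

job_name = "weekly-team-meeting"                       # hypothetical job name
audio_uri = "s3://example-bucket/meetings/weekly.wav"  # hypothetical recording

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": audio_uri},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Poll until the job finishes, then print where the transcript JSON lives.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```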
 
Where Speech Technology Lives in the Enterprise
Until now, most enterprise use cases have revolved around customer service, the contact center, and customer experience. Arnold presented data on the five leading use cases for speech technology, which go beyond customer-centric interactions: web-conferencing transcription; customer experience and analytics; subtitling and closed captioning; education, academic, and research transcription; and medical transcription. Other use cases showing significant commercial impact include consumer electronics, compliance, legal transcription, and media monitoring, to name a few.
 
Arnold also emphasized the following four core enterprise applications with a focus on collaboration and productivity: speech-to-text applications for meetings, virtual assistants, automatic speech recognition for conversational analytics, and real-time translation.
 
Arnold explained the significance of speech-to-text for meetings, transcriptions, and video captions by discussing how they can make a workplace more inclusive. “All of a sudden, that virtual desktop environment is powerful for anybody and everybody,” he said, noting that these capabilities can give disabled people access to the same tools and flow of information as everybody else.
 
A virtual assistant takes notes so you don’t have to, “and the digital assistant becomes your personal secretary.” On top of that, Arnold explained that search capabilities turn speech into a data stream. Once speech is searchable, it’s easier to sift through the speech data and automate tasks related to a meeting.
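To illustrate Arnold’s point about speech becoming a searchable data stream, the small sketch below works over a made-up meeting transcript in plain Python: it searches the transcript by keyword and flags likely action items. The transcript lines and phrase patterns are invented for illustration, not a description of any vendor’s product.

```python
# Illustrative only: once speech is text, it can be searched and mined like
# any other data. The transcript and action-item phrases are made-up examples.
import re

transcript = [
    ("09:02", "Dana", "Let's review the Q3 numbers before we start."),
    ("09:14", "Lee", "I will send the updated roadmap by Friday."),
    ("09:21", "Dana", "Can you follow up with the vendor about pricing?"),
    ("09:30", "Sam", "Nothing else from me."),
]

def search(term):
    """Return every line of the meeting that mentions the term."""
    return [line for line in transcript if term.lower() in line[2].lower()]

# Very rough action-item detector: commitments and requests.
ACTION_PATTERNS = re.compile(r"\b(i will|can you|follow up|send)\b", re.IGNORECASE)

def action_items():
    """Return lines that look like to-dos to route to the note-taker."""
    return [line for line in transcript if ACTION_PATTERNS.search(line[2])]

print(search("roadmap"))   # find where the roadmap came up
print(action_items())      # candidate follow-up tasks from the meeting
```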
 
Arnold said the greatest commercial impact is around web-conferencing transcription. “AI has improved the quality of speech recognition to the point where it’s 95% better in terms of replicating human speech.” He used Amazon’s Alexa and Apple’s Siri as examples. “Instead of barking commands [at the assistant], you can converse with that virtual assistant, who can do things for you, respond, and even prompt you when they think there's something you need to know, such as you're late for your meeting."
 
The next layer is automatic speech recognition (ASR), where the system recognizes speech automatically because it has been trained with machine learning to understand it and do something with it. On top of that sits ASR for conversational analytics, the layer where you try to make sense of what the speaker said. Arnold explained, “this is where we get to things like context, intent, and understanding what a person means when they say something.” He added that it’s not enough to capture the verbiage—you must know what a person is trying to say.
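The “context and intent” layer Arnold described is commonly framed as intent classification on top of the ASR transcript. The sketch below is an assumed, minimal example using scikit-learn (TF-IDF features plus logistic regression) on a handful of invented utterances and labels; real conversational-analytics systems are far richer, but the shape of the problem is the same: map what a person said to what they meant.

```python
# Minimal intent-classification sketch: map transcribed utterances to intents.
# The training utterances and intent labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "I want to check my order status",
    "where is my package",
    "cancel my subscription please",
    "I'd like to stop the service",
    "can I talk to a human",
    "connect me with an agent",
]
intents = [
    "order_status", "order_status",
    "cancel_service", "cancel_service",
    "escalate_to_agent", "escalate_to_agent",
]

# TF-IDF features + a linear classifier: enough to show the idea.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(utterances, intents)

# A new transcript line from ASR, mapped to the caller's likely intent.
print(model.predict(["please cancel my plan"]))  # expected to lean toward 'cancel_service'
```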
 
Where AI & Speech Tech Is Heading: Beyond Collaboration
Looking at where speech tech is heading, Arnold highlighted immersive models and the metaverse. While our workdays begin and end in the physical world, augmented reality is on the horizon. Arnold cited Cisco Webex Hologram as an example that may enable a feeling of co-presence by delivering photorealistic, real-time holograms of actual people. Holograms and virtual projections of people that put you in the room with your teammates are, as Arnold put it, “breakthrough stuff.” “The possibilities with AI are getting interesting because eyewear is bringing virtual elements into your workflows.”
 
Microsoft Mesh for Teams, or as Arnold refers to it—Microsoft's big push into the virtual world—is the other side of the immersive model. “This is Microsoft’s move into a post-PC world because they know at some point PCs will go away,” Arnold said. “What’s interesting about this is the mix of people and avatars…so again, the virtual and physical worlds are getting closer…it’s getting harder to care about the difference.”
 
Meta has gone fully virtual in the workplace collaboration space because “it’s another application with a use case for these technologies,” Arnold said. It’s a little gamified, with avatars that resemble puppets and are cut off at the waist, but Arnold thinks it’s a fun way to work and that people could work this way effectively. Meta is almost entirely in the virtual world, and Arnold told attendees that if you’re willing to give it a try, “you might be surprised with how much you can do at that point.”
 
Arnold explained that the idea of the metaverse is about benefits. “You’re betting heavily on where people are going to want to socialize, and from there on, where they’re going to want to work and do business.”
 
Arnold noted that NVIDIA has the upper hand on GPUs, which make computers process faster, “because that’s what AI, in particular, is all about,” he said. For AI in particular, “you need a lot of horsepower,” and the lack of it is one of the bottlenecks, Arnold explained. For example, “the metaverse can’t work until PCs can process data fast enough, at a scale to make it a good experience.” Arnold noted that the next generation of computers will be purpose-built to deliver that kind of horsepower, and whatever form the metaverse takes, “voice will be central to adoption.”
 
AI Adoption: Cautionary Notes to Keep in Mind
Arnold also addressed how AI can head in the right and wrong directions. He set up numerous opposing outcomes—intentional tracking vs. unintentional monitoring, tech that compromises privacy vs. tech that boosts productivity, technology that automates work vs. tech that inspires worker creativity, and technology that enhances user trust vs. technology that erodes trust.
 
"Those desktop devices running all day long, capturing side conversation, suddenly become top surveillance technology," he said, emphasizing that enterprises must be mindful of this. "Not because of what it can do, but how employees may perceive what you're trying to do." Arnold explained why enterprises must be transparent about using this technology for the right purposes because “you don’t want to compromise privacy.”
 
He advised enterprises to focus their efforts on identifying deepfakes rather than validating what’s real and authentic, because with innovation come both good and bad actors. “Technology is neutral, but AI bias complicates things.”
 
AI is now driving all forms of technology, including enterprise use cases such as collaboration. Arnold noted that immersive collaboration is coming, and “the big players are making it happen.” He added, “Where AI goes, speech tech will follow.” His final thought came as a mantra: “Where consumers go, enterprises will follow.” And the major players are betting that these new models will succeed—so get ready.