State of the Market Update: Speech and Voice Recognition

Image: Sergey Oplanchuk - Alamy Stock Vector

Enterprise Connect 2022 is a few short weeks away. Barring any last-minute pandemic flare-ups, we can finally shift from virtual to in-person. Aside from getting to see colleagues and clients again, there’s work to be done. I’ll be presenting the fifth installment of my Speech and Voice Recognition state of the market update, covering speech technology and artificial intelligence (AI), with a particular focus on the enterprise. AI is everywhere these days, and when applied to speech tech, some interesting things start to happen.

While most of the attention has been on customer experience, AI-driven speech tech has brought new forms of value to the workplace, especially for collaboration. That’s the ground I’ll be covering at Enterprise Connect, and here’s a preview of my session, Where Speech Tech Is Today—and Where It’s Heading. I hope you’ll join me to see the full presentation. As a heads-up, I’m in the kickoff slot Monday, March 21, at 8 a.m.

While you’re there, I encourage you to check out other sessions from my fellow BCStrategies colleagues: Blair Pleasant, Michael Finneran, Thomas Brannen, Kevin Kieller, Dave Michels, and others.

State of the Market Update

Analysts love to talk about this, and although AI is moving a mile a minute, there isn’t a lot that’s new for enterprise speech tech. That doesn’t mean you shouldn’t come see my session, as interesting things are happening; they’re just not as sexy or dynamic as what the contact center space is going through. More importantly, I’ll continue a recent theme of my talks, namely how speech tech is just part of a bigger story for how AI is transforming everything about work, including collaboration.

The first state of the market message is that we’re simply seeing more of the same, but better. Most of the innovation around enterprise speech tech came during the last two years, which I covered extensively in my previous talks: real-time transcription, translation, captioning, summary notes, voice biometrics, noise suppression, etc.

Not much to see here, as these have largely become mainstream UCaaS applications. More important, though, is how these applications keep improving, and that’s what makes this more about AI than speech tech. AI technologies are iterative by nature, meaning that the more we use them, and the larger the data sets become, the more accurate the applications become at mimicking human behavior.

That’s what the “learning” in machine learning and the “intelligence” in AI are all about, and for enterprise use cases, the performance has become good enough for everyday use. The best indicator of that is merger and acquisition activity, where speech tech start-ups are constantly being acquired. I’ll have an update on that during my talk.

At the risk of boring you to tears, there is one big change from last year in this space, and it’s where we all must pay attention. CAI, short for conversational AI, is the acronym du jour for speech tech. If you don’t believe me, consider this: Gartner now has a CAI Magic Quadrant, so it must be true, right? CAI represents a major evolution for chatbots. It also takes speech tech as a whole to a new level.

As AI keeps improving the capabilities of speech tech, we now have a two-way dialog with bots, greatly increasing their utility. Rather than engaging with bots to issue commands or respond to closed-ended prompts, the dialog becomes conversational, with bots trying to emulate human speech, even trying to inject empathy and emotion. These bots aren’t trying to dupe us into thinking we’re talking to another human. However, the thinking is that the more human-like the conversation, the more likely we’ll express our true feelings. Not only does this yield better outcomes, but as trust builds when using CAI, so does the potential for AI to automate more workflows, tasks, interactions, etc.

In the contact center, we call these chatbots or virtual agents. In the workplace, we call them digital assistants. In either case, the potential for task automation and personal productivity becomes much greater, allowing us to finally move on from the bad rap that first-generation chatbots carry with them.

Where Speech Tech Is Heading: All Bets Are On

If you like the current state of enterprise speech tech, you might love where it’s heading, although that probably depends on which side of the digital immigrant/native spectrum you fall. In short, the future belongs to gamers, and that’s a pretty good clue as to what’s coming.

A core theme of this year’s talk is that speech tech is just one of many AI applications, and at this point, we have solved the big problems around speech recognition. The story is similar to that of the first generation of unified communications, where the big challenge was getting all the disparate communications applications to interwork on a common platform. Once that was solved, thanks primarily to the cloud, anybody could offer UCaaS, and now you can embed real-time voice in just about anything. Nobody talks about the challenges of supporting telephony from the cloud anymore.

The same holds for enterprise speech tech, where voice is becoming more of a means than an end. Just as voice over Internet Protocol (VoIP) made telephony another data application, the broader digital transformation trend makes voice more valuable as a source of metadata around conversations, and as a way to interact with “machines,” than as a medium for person-to-person communications.

In this context, enterprise speech tech will become part of a bigger, AI-led transition from in-person work to virtual and immersive experiences. I’ll be citing how Webex Hologram, Mesh for Teams, and yes, the “metaverse” are leading examples of this brave new world. While these may be highly visual forms of collaboration, speech tech will absolutely be central to the experience. These new models may or may not succeed. But the major players are betting heavily on them, and where they go, speech tech will follow.

Robotic choreography mimics Charlie Watts on drums.

YouTube

As a music enthusiast, I couldn’t end this article without mentioning how music intertwines with speech technologies. But what on earth could the Rolling Stones have to do with enterprise speech tech? You’ll have to attend my session for details, but I’ll give you a clue. Recently, the iconic band partnered with the robotics company Boston Dynamics to commemorate 40 years of its album Tattoo You (see image above).

As always, my update will end with some questions about how AI can head in both the right and wrong directions; we need to consider both when making technology decisions. As Mick Jagger once sang, “what’s puzzling you is the nature of my game.” That’s a pretty good reflection of the cautions I’ll be stressing as key takeaways. See you at Enterprise Connect!

This post is written on behalf of BCStrategies, an industry resource for enterprises, vendors, system integrators, and anyone interested in the growing business communications arena. A supplier of objective information on business communications, BCStrategies is supported by an alliance of leading communication industry advisors, analysts, and consultants who have worked in the various segments of the dynamic business communications market.