In part one of our conversation with Javed Khan, Senior Vice President and General Manager of Cisco Collaboration, No Jitter talked with him about the need for AI to work seamlessly so it's unobtrusive and therefore more likely to become part of a person's workflow. In this section of the conversation, we discussed how Cisco's audio data is the foundation of its generative AI work now, the challenge of teaching an AI to detect different types of dogs, and the ongoing work of training AI to detect different accents and dialects.
No Jitter (NJ): Earlier in our conversation, you pointed out that good transcription happens as a result of good audio data. And good audio data happens as a result of the technologies that Cisco has built up. It all comes down to having really good data. Do you feel like Cisco has addressed the biggest data problems around AI or are there still big data problems left to solve?
Javed Khan (JK): So our bread and butter, where we do have an advantage, is with audio and video, we've always had a lot of data and that's why we had this head start on noise transcription. What we don't do is we train on customer data unless they want to use it to make their own experience.
NJ: When you say, "With audio and video we've always had a lot of data," how far back does "always" stretch?
JK: It really started with the Voicea and Babblelabs acquisitions. Those were two companies that we bought and they were really young companies, three, four years ago.
(Editor's note: Cisco completed its acquisition of transcription and voice search company Voicea in 2019 and voice-detection AI company Babblelabs in 2020.)
They are the underpinning for transcription and translation technology. Babblelabs had noise removal and noise cancellation technologies. The teams (at both companies) had been purchasing data but they also – there is this concept of artificial data where you can generate data, which you can then use for training.
NJ: You guys were definitely the first ones to realize that accurate transcription unlocks everything else?
JK: Absolutely. In some ways, most of the industry noticed LLMs in the last 18 months. So these things are beautiful and powerful. Yes, [they] can summarize. But what people forgot was, "Oh, but how do I get to the transcript?" And that was a problem we had been working on in previous years. And that's when you realize there are accents, dialects, background noise, and just network noise.
NJ: I'm so glad you mentioned accents and dialects. I have a Scottish friend who jokes that Siri never works for him because of his brogue. When Cisco is analyzing data for accents and dialects, how do you overlay that onto a word? For example, let's use the word 'roof.' Because I'm from the South, we used to say 'ruhf.' And then I left the South and noticed people saying 'roooof.' It's the same word. How do you train, or how do you log that data, in such a way where the same word, with the same contextual meaning, sounds different depending on who's saying it?
JK: So typically, we go through this very time-intensive process. [First,] data is acquired, [maybe] purchased or people have [allowed us] to use it. Then there is the human process of labeling it, and that [labeled data] becomes input to these models. And increasingly, technology is getting better at labeling [the data] too. But there is always a human aspect to validating because if I don't have a baseline of what the word really means, it won’t be as effective. So that is really the secret sauce here.
NJ: Which is that people are still listening in training–
JK: —which takes time. We have an 'early learn,' which is when we get customers' permission to find out what kind of additional trainings we need.
We even had to do this for our noise elimination. It's a funny story, but the background noise removal with dogs was the biggest kind of ask during the pandemic, and guess what? Different dogs have different kinds of barks. So we had to collect different samples. But that's an example of nuances in action.
NJ: Both voice and video data seem incomprehensibly complex. And then what about something like American Sign Language, for example, where you add another layer on top of that, since that's a very physical language. It's dependent on facial expression as well as body positioning.
JK: That would be video intelligence. That is actually, surprisingly, from a technology standpoint, an easier problem to solve because there's a relatively finite number of signs. Accents complicate things significantly.