Ask an Analyst: What is Voice AI?
Voice AI technologies provide a human-sounding veneer for enterprises. Here’s an overview of how they work.

At Enterprise Connect in 2023, Google Cloud demonstrated AI-generated voice in a contact center setting. The human caller, who was experiencing car trouble, interacted entirely via voice with an AI-powered virtual agent that, sounding nearly human, dispatched roadside assistance to the caller and scheduled a service appointment at the car dealer.
At the time, the Google Cloud demonstration sounded like science fiction. Two years later, generative AI-powered virtual voice agents have proliferated to the point that they are within reach of everyone from smaller organizations that need a human-sounding virtual receptionist to large enterprises that might want to replace their robotic-sounding IVR with a voice that’s easier on the ears.
Our interview with Adrian Lee, VP analyst with Gartner, provides an overview of what voice AI is and how it works. As Lee suggests, although voice AI sounds promising, technical and implementation hurdles remain, such as the need to minimize latency (the delay before the synthetic voice responds) and to integrate with existing systems.
Part two focuses on why organizations might want voice AI and who some of the key players are.
No Jitter (NJ): What is generative AI-generated voice?
Adrian Lee (Lee): Gen AI voice (commonly referred to as voice AI) combines advanced machine learning algorithms with large datasets of human speech to create synthetic voice outputs that mimic natural speech patterns.
The process typically involves components such as:
Text-to-Speech (TTS) Conversion: AI models take textual input and convert it to phonetic and prosodic representations that are then articulated as sound. Advanced TTS systems may incorporate accent, intonation, and emotional tone to produce more expressive voices.
Voice Cloning: With as little as a few seconds of a human voice sample, these systems can generate a synthetic version that replicates the speaker’s unique vocal traits. For example, Microsoft’s VALL-E model can imitate voice patterns with minimal input.
Generative Adversarial Networks (GANs) and Diffusion Models: These techniques support the creation of high-quality, realistic voice outputs by learning the underlying distribution of authentic speech. They have been instrumental in rapidly advancing the quality of synthetic data, including audio.
The overall workflow involves feeding prompt inputs (whether text, voice, or even video audio) into a generative model, which then processes the information to produce a digital, synthesized voice. This output can be tailored for real-time interaction, dubbing, translation, and other applications across diverse industries.
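The staged workflow Lee describes can be sketched in a few lines of Python. This is a toy illustration, not a real synthesis system: the phoneme table, prosody values, and placeholder "vocoder" are all hypothetical stand-ins for the trained neural models that production TTS pipelines use.

```python
# Toy sketch of the TTS stages: text -> phonemes -> prosody -> audio.
# All lookup tables and values here are illustrative assumptions;
# real systems learn these mappings from large speech datasets.

# Hypothetical grapheme-to-phoneme lookup (real systems use learned models).
PHONEME_TABLE = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_to_phonemes(text: str) -> list[str]:
    """Convert input text to a flat phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PHONEME_TABLE.get(word, ["?"]))
    return phonemes

def add_prosody(phonemes: list[str]) -> list[dict]:
    """Attach simple prosodic features (duration, pitch) to each phoneme."""
    return [{"phoneme": p, "duration_ms": 80, "pitch_hz": 120.0} for p in phonemes]

def synthesize(prosodic_units: list[dict]) -> bytes:
    """Stand-in for the vocoder stage that would render a waveform."""
    total_ms = sum(u["duration_ms"] for u in prosodic_units)
    return b"\x00" * total_ms  # placeholder "audio" buffer, one byte per ms

units = add_prosody(text_to_phonemes("hello world"))
audio = synthesize(units)
print(len(units), len(audio))  # 8 phonemes, 640-byte buffer
```

The point of the sketch is the separation of stages: the phonetic and prosodic representations Lee mentions are intermediate data structures that a final synthesis stage turns into audio.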
NJ: How does voice AI work?
Lee: Gen AI voice is a subset of multimodal generative AI, which is the ability to combine multiple types of data inputs and outputs in generative models, such as images, videos, audio or speech, text and numerical data.
Multimodality augments the usability of AI by allowing models to interact with and create outputs across various modalities. The impact of multimodality is not limited to specific industries or use cases and can be applied at any touchpoint between AI and humans. Today, text, voice and image are common modalities. Over the next few years, models will evolve to support additional modalities that are more emerging, domain-specific, or complex.
NJ: What are some of the key tech challenges still to be solved?
Lee: Despite significant advancements, these challenges remain for voice AI:
Accuracy and Naturalness: Achieving a level of natural intonation, nuance, and expressiveness that rivals human speech remains a persistent challenge. Ensuring that generated voices capture subtle emotional cues and context-appropriate modulations is still in development.
Bias and Ethical Concerns: Ensuring that synthetic voices do not reproduce or amplify gender, racial, or accent-based biases is a critical concern. Researchers also face the challenge of mitigating deepfake risks.
Security and Content Control: Uncontrolled generation of synthetic voice can lead to misuse, such as unauthorized voice cloning or generating misleading information. Robust mechanisms for monitoring and verifying authenticity are necessary.
Latency and Integration Complexity: For real-time applications such as interactive voice response (IVR) systems or conversational AI in call centers, minimizing latency and ensuring seamless integration with existing infrastructure are vital.
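The latency concern in the last point can be made concrete with a small measurement harness. This is a hedged sketch: the 500 ms budget is an illustrative assumption for a conversational turn, not a standard, and the mock pipeline stands in for the real ASR, language model, and TTS stages that dominate response time.

```python
import time

# Assumed acceptable turn-taking delay for a conversational agent
# (illustrative value, not an industry standard).
LATENCY_BUDGET_S = 0.5

def timed_response(pipeline_fn, *args):
    """Run one pipeline turn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = pipeline_fn(*args)
    return result, time.perf_counter() - start

def mock_pipeline(text: str) -> str:
    """Hypothetical stand-in for the ASR -> model -> TTS chain."""
    return f"synthesized: {text}"

reply, elapsed = timed_response(mock_pipeline, "I need roadside assistance")
print(reply)
print("within budget:", elapsed < LATENCY_BUDGET_S)
```

In a real IVR or contact-center integration, each stage (speech recognition, generation, synthesis, and network hops to existing systems) adds to this measured delay, which is why minimizing per-stage latency matters.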