Let's have a conversation – about Conversational AI! How can voice-enabled AI agents change the game for customer experience?

Written by
Megan Mariano
Introduction
In recent years, the landscape of Conversational AI has changed rapidly — even more so in the past few months. The term “Conversational AI” itself can be confusing or vague, leaving people wondering what a given solution actually does or how it works. In this post, we cover three different types of conversational AI systems and discuss what makes audio-native solutions stand out.
Traditional ASR & ML-based systems
Older conversational systems like Google Dialogflow and Amazon Lex rely on traditional machine learning approaches such as intent classification and entity recognition to navigate a tree of pre-defined paths and options. These systems are more flexible than touch-tone (DTMF) IVR systems, but they remain rigid: they adapt poorly to anything outside their pre-defined flows, and they are often frustrating to end-users.
ASR + LLM-based systems
More recently, text-based AI models known as LLMs (Large Language Models) unlocked an entirely new paradigm for building conversational AI solutions. By chaining automatic speech recognition (ASR) together with an LLM, it became possible to build conversational AI solutions that were much more fluid and adaptable than traditional systems. These pipelines usually include the following steps (a simplified sketch follows the list):
Voice activity detection (VAD) detects when the user has finished speaking and triggers transcription of the audio.
Transcription (ASR) converts the user’s spoken words into text that the LLM can understand.
LLM inference combines the user’s transcribed words with a prompt and generates output text.
Text-to-Speech (TTS) converts the text generated by the LLM into speech that is played back to the end-user.
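To make this concrete, here is a minimal Python sketch of one conversational turn through such a pipeline. The four stage functions are trivial stubs with hypothetical names (not any particular vendor’s API); the point is the flow of data between stages.

```python
# Minimal sketch of one turn through an ASR + LLM voice pipeline. The four
# stage functions are trivial stubs with hypothetical names; in a real
# system each would wrap an actual VAD, ASR, LLM, or TTS engine.

SYSTEM_PROMPT = "You are a helpful customer-support voice agent."

def detect_end_of_speech(audio_stream: bytes) -> bytes:
    # Stub VAD: a real one buffers audio frames until it detects silence.
    return audio_stream

def transcribe(audio: bytes) -> str:
    # Stub ASR: a real one converts the captured speech to text.
    return "Where is my order?"

def complete(system: str, messages: list[dict]) -> str:
    # Stub LLM call: a real one sends the prompt and history to a model.
    return "Let me look that up for you."

def synthesize(text: str) -> bytes:
    # Stub TTS: a real one renders the reply text as speech audio.
    return text.encode()

def handle_turn(audio_stream: bytes, history: list[dict]) -> bytes:
    """Run one user turn through the VAD -> ASR -> LLM -> TTS stages."""
    utterance = detect_end_of_speech(audio_stream)  # 1. wait for end of speech
    user_text = transcribe(utterance)               # 2. speech -> text
    history.append({"role": "user", "content": user_text})
    reply = complete(SYSTEM_PROMPT, history)        # 3. LLM inference
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)                        # 4. text -> speech
```

Note that every stage sits on the critical path: the end-user hears nothing until all four have finished, which is where the latency problems discussed below come from.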
While LLM-based systems are more flexible and adaptable than traditional systems, they still have significant issues, including high end-to-end latency, poor handling of user interruptions, and the loss of everything the audio carries beyond the words themselves. Since the LLM “brain” of the pipeline isn’t processing or generating audio natively, it can’t understand the user’s tone of voice in real time, modulate its own tone and prosody, or grasp those deeper layers of meaning. It’s at the mercy of the transcription and text-to-speech engines.
Audio-Native Conversational AI
Even more recently, advancements in the AI space now allow AI models to understand and generate audio natively! This means that the AI can actually “hear” you: it can pick up on your tone of voice, infer emotional cues, and detect tonal shifts. And it can even modulate its own tone, prosody, and pace to deliver responses that are more natural and fluent.
Why we’re betting on audio-native conversational AI
Lower latency means more natural conversations. One of the biggest wins for audio-native conversational AI is that it has lower end-to-end latency than the other approaches: less time passes between the moment the user stops speaking and the moment the system starts playing AI-generated speech back. The result is a more natural, fluid conversation with fewer pauses and less waiting for end-users. That’s a big win for customer experience!
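As a back-of-the-envelope comparison (the numbers below are purely illustrative assumptions, not benchmarks), consider how per-stage delays in a pipeline add up versus a single audio-native model:

```python
# Purely illustrative latency budgets in milliseconds; real figures vary
# widely by model, provider, and network. Only the additive structure of
# the pipeline is the point.
pipeline_stages_ms = {
    "VAD end-of-speech detection": 500,
    "ASR transcription": 300,
    "LLM time-to-first-token": 400,
    "TTS time-to-first-audio": 300,
}
pipeline_total = sum(pipeline_stages_ms.values())  # stages run in series
audio_native_total = 600                           # one model, audio in/out

print(f"ASR + LLM + TTS pipeline: ~{pipeline_total} ms")   # ~1500 ms
print(f"Audio-native model:       ~{audio_native_total} ms")
```

Because the pipeline’s stages run in series, every component’s delay lands on the user; an audio-native model collapses them into a single step.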
Multilinguality. Traditional systems are usually limited to one language, or require the language to be pre-configured for the transcription and TTS engines. Audio-native conversational AI solutions handle language switching far more fluently: they can even switch languages in the middle of a conversation!
Personalization. Since audio-native conversational AI systems understand the user’s tone of voice and emotional cues, they can adapt their tone, style, and responses to match the user’s emotional state, making the experience more personalized and engaging.
Problem resolution. Similarly, audio-native conversational agents can better detect user frustration, confusion, and dissatisfaction, allowing them to address pain points, express empathy or apologize, and hand off to a human when a conversation is escalating poorly.
Adaptability. Compared to traditional solutions, AI-powered ones are more capable and more successful at resolving user queries. They can take advantage of multiple tools and capabilities to solve the user’s problem, instead of following pre-defined scripts or trees. For example, an AI customer support agent could iteratively execute searches against a knowledge base until it decides it has found the information the user is looking for (a sketch of such a loop follows).
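Here is a minimal Python sketch of that kind of iterative search loop. The search_kb and ask_llm helpers are hypothetical stubs standing in for a real retrieval backend and LLM client; only the control flow is the point.

```python
# Sketch of the iterative knowledge-base search loop described above.
# search_kb() and ask_llm() are hypothetical stubs; swap in a real
# retrieval backend and LLM client.

MAX_SEARCHES = 3

def search_kb(query: str) -> list[str]:
    # Stub retrieval: a real one would query a knowledge base or vector store.
    return [f"support article mentioning {query!r}"]

def ask_llm(prompt: str) -> str:
    # Stub LLM call: a real one would send the prompt to a model.
    return "Your order ships within 2 business days."

def resolve_query(question: str) -> str:
    """Let the agent refine its search until it is satisfied, then answer."""
    query = question
    for _ in range(MAX_SEARCHES):
        results = search_kb(query)  # tool call against the knowledge base
        answer = ask_llm(
            f"Question: {question}\n"
            f"Search results: {results}\n"
            "Answer the question, or reply 'REFINE: <new query>' "
            "if the results are insufficient."
        )
        if not answer.startswith("REFINE:"):
            return answer  # the agent decided it found what it needed
        query = answer.removeprefix("REFINE:").strip()  # try a better query
    return "I couldn't find that; let me connect you with a human agent."
```

The same loop structure generalizes to any tool: the LLM decides whether to answer or call the tool again, which is what makes these agents so much more adaptable than a fixed script.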
Closing Thoughts
Modern, audio-native conversational AI solutions represent a monumental leap forward from traditional ML- and ASR-based pipelines, and even from pipelines powered by text-based LLMs. With their ability to understand not just words, but emotion, tone, and intention, these AI tools are re-shaping how humans communicate with machines.
They represent enormous opportunities to improve customer experience and customer satisfaction, to increase efficiency and improve automation, and to autonomously handle more tasks than ever before!
Curious about what modern conversational AI can do for your business? Let’s chat!
TAGS
Conversational AI
