The Voice Awakens: Is Expressive Speech AI the final frontier for customer experience?
A close look at Sesame's Conversational Speech Model and its potential impact on voice AI automation: design, experience and ethics.
For all the automation potential we have in the enterprise today, the voice channel has always been a sticking point. Voice is pure conversation. There’s no hiding from poor design or ineffective technology. Large language models are beginning to show promise in overcoming both of those limitations, but there’s one thing missing. The cherry on top. The only thing that users actually experience: the way it sounds.
Thankfully, there is some exciting progress happening in the voice synthesis space, even if it leaves us with both an efficacy question and an ethical one. We’ll explore both here, in an effort to help you figure out whether you need the most human-like, realistic-sounding voices to make your contact centre automation as effective as it can be.
It takes two to tango: the importance of design and voice selection
The biggest limitation in our ability to provide seamless voice experiences over the years has been a technical one. I don’t think there’s any doubt at this point that the NLU technologies underpinning most voice automation solutions leave room for improvement. I remember in 2018, the Google Assistant team found that there are over 7,000 ways that someone can phrase “Set an alarm”. With this level of breadth for one simple request, it’s no wonder voice systems (and NLU systems more broadly) often fail.
Understanding is the first and most important part of any conversational user interface. It’s arguably the definition of conversation: a shared pursuit of understanding. 90% of the design requirement in NLU systems is anticipating and scripting all of the different ways something can (and will) go wrong. If the system doesn’t understand you, it can’t be helpful, regardless of how it sounds.
And how it sounds matters because, in a voice user interface, especially one experienced over the phone, the voice is the interface. From an end-user’s perspective, the voice is all they have. There’s nothing else. You speak and it speaks back. That’s it.
From a designer’s perspective, it’s pure conversation. You don’t have any guardrails or anything to fall back on. You can’t use buttons or images or date pickers or anything that you can use in chat. You can’t display tables or lists, and you certainly can’t hit the user with a wall of text and leave them to take their time reading it. Every utterance is ephemeral. Once you’ve said it, it’s gone. If you didn’t communicate clearly, or the user wasn’t paying attention, you’re screwed. Attention spans are short, people are distracted and you’re in the world of navigating short-term memory.
This means you have to deal with scenarios in voice that you don’t have to in chat. All of the ‘what did you says’ and ‘say that agains’ and ‘hang on one minutes’ and ‘Jesus Christ, just let the dog out’ background noises that make life for a designer very difficult. And the only things that you have to guide the user through all this messiness are the words you say and how you say them.
Advancements in voice technologies that matter
There have been a number of fairly recent advancements in the voice space that have the potential to overcome some of the challenges listed above, namely large language models and speech synthesis.
How large language models impact voice user interfaces
I’m not going to go too deep into how large language models impact the voice experience, because the point of this piece is to address the ‘how it sounds’ element. For a deeper dive on that, check out this podcast with Alan Nichol, where we go deep into the specific use cases and benefits of large language models for conversational AI. Suffice it to say that large (and small) language models help us overcome the fundamental limitations of NLU systems, so understanding is less of a problem now.
There are still limitations on the voice channel: latency, memory issues I’ve witnessed in some systems, and dialogue that isn’t always managed effectively. But generally, I’m pretty happy with what’s possible today.
The voice that speaks back: your options
So if a conversation is our ability to find shared understanding, and large language models are helping us do this, then the end result still needs to be that something is spoken. It all still culminates in the voice that speaks back.
Typically, spoken words in the voice channel have come in two forms:
Pre-recorded using human actors
Synthesised using text-to-speech (TTS)
When it comes to voice automation, in the NLU world, it was possible to use both.
Pre-recorded human actors
Because you scripted every response, it was feasible to record a person speaking all of your lines, and you’d have a pretty natural-sounding experience. The challenge there is dealing with dynamic data, such as the weather or anything numerical like dates and times. You couldn’t possibly pre-record enough dialogue to handle data that changes regularly.
Past speech synthesis systems
Text-to-speech went one better than pre-recorded audio because it’s far more scalable. HMM (Hidden Markov Model)-based synthesis was used widely for most of the noughties in PA systems, IVR phone trees and even in the earlier version of Alexa. These voices, while scalable, sounded ropey, a little buzzy and clearly computerised.
Then, deep learning-based TTS, built on RNNs (Recurrent Neural Networks) and related neural architectures, started to bring more of the tonality and ‘naturalness’ of human voices. Google’s earlier WaveNet-era voices came from this generation and sounded much better, but still not quite ‘human’.
Contemporary speech synthesis systems
Today, most contemporary TTS systems, such as those from OpenAI, Play.ht and Eleven Labs, use the same architecture as large language models: Transformers. This enables a model to consider how speech should be rendered over a longer span of time, based on contextual data.
This means they can maintain proper rhythm and pitch across a long utterance, handle punctuation and pauses, and create emphasis far better than HMM- or RNN-based models. For example, knowing that a sentence ends with a question mark can influence how the start (or end) of the sentence is intoned.
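To make that a little more concrete, here’s a toy illustration of why the Transformer architecture helps. It has nothing to do with any vendor’s actual model; it simply shows that, with self-attention, the very first word in a sentence can ‘see’ the question mark at the end before any decision about intonation is made.

```python
# Toy self-attention sketch (illustration only, not any vendor's real model).
# Every position scores every other position, so the first word's context
# vector can be influenced by the final "?" token.
import numpy as np

tokens = ["are", "you", "free", "tomorrow", "?"]
d = 8
rng = np.random.default_rng(0)

emb = rng.normal(size=(len(tokens), d))   # stand-in token embeddings
w_q = rng.normal(size=(d, d))             # query projection
w_k = rng.normal(size=(d, d))             # key projection

scores = (emb @ w_q) @ (emb @ w_k).T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# How much the first word attends to the sentence-final question mark:
print(f'attention from "{tokens[0]}" to "{tokens[-1]}": {weights[0, -1]:.2f}')
```

In other words, the context that shapes prosody is available across the whole utterance at once, rather than being built up token by token.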
All that is to say that we now have speech synthesis that sounds a lot more human. But it can still go further.
Advancements in speech synthesis: the missing link?
You might have seen Sesame’s Conversational Speech Model (CSM), which was announced on February 27th 2025. This is a speech synthesis model that takes contextual data from incoming speech (what a user says and how they say it) to inform how the generated speech should sound. Not only is it a high-quality voice like the others mentioned, it’s also able to reflect the tone of the person speaking, mirroring their speaking style and responding to contextual nuances. Sesame calls this “voice presence”.
For example, let’s say I’m excited and I say “I’ve just got a new job”. Sesame’s model could respond with excitement and happiness as it says “Oh, well done!” Alternatively, if I’m not so happy, it can respond with more empathy and mirror my vibe.
How Sesame CSM works
How it works is pretty clever.
A separate automatic speech recognition (ASR) model transcribes the audio to text, as it would in any other voice user interface.
Then, the audio is ‘tokenised’. This converts the audio into semantic tokens (to capture meaning and phonemes) and acoustic tokens (to capture tone, style, emotion and prosody).
The text part is sent to your dialogue management system or LLM pipeline to generate a response, as would typically occur in any other voice user interface.
CSM takes both the transcription (text) and the audio tokens (how the user said it), along with your generated text response and the conversation history, and uses all of this to generate speech that responds appropriately: in context, with the right expressivity, timing and tone.
The idea is that CSM’s outputs are expressive and more natural-sounding, based on what was said, how it was said, and what the response should be.
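To make that flow a bit more concrete, here’s a rough sketch of the orchestration in Python. Every function, class and value in it is a hypothetical placeholder for illustration; it isn’t Sesame’s actual API, just the shape of the pipeline described above.

```python
# Rough sketch of the pipeline described above. All names are hypothetical
# placeholders, not Sesame's actual API.
from dataclasses import dataclass


@dataclass
class Turn:
    text: str                 # what was said
    audio_tokens: list[int]   # how it was said (semantic + acoustic tokens)


# --- trivial stand-ins for the real components ------------------------------
def asr_transcribe(audio: bytes) -> str:
    return "I've just got a new job"      # placeholder transcription

def tokenise_audio(audio: bytes) -> list[int]:
    return [101, 7, 42]                   # placeholder semantic + acoustic tokens

def generate_reply(text: str, history: list[str]) -> str:
    return "Oh, well done!"               # placeholder dialogue manager / LLM

def csm_generate_speech(text: str, context: list[Turn]) -> bytes:
    return b"\x00\x01"                    # placeholder expressive audio
# -----------------------------------------------------------------------------


def handle_turn(user_audio: bytes, history: list[Turn]) -> bytes:
    user_text = asr_transcribe(user_audio)       # 1. ASR: audio -> text
    user_tokens = tokenise_audio(user_audio)     # 2. semantic + acoustic tokens
    reply_text = generate_reply(                 # 3. your dialogue system / LLM
        user_text, [t.text for t in history]
    )
    history.append(Turn(user_text, user_tokens))
    # 4. Speech generation sees the reply text, the user's audio tokens and
    #    the conversation history, so it can match tone and context.
    return csm_generate_speech(reply_text, context=history)


if __name__ == "__main__":
    audio_out = handle_turn(b"fake-audio-bytes", history=[])
    print(len(audio_out), "bytes of placeholder audio")
```

The important design point is the last step: the speech generator isn’t handed words in isolation. It also gets the tokens describing how the user spoke, plus the conversation history, and that’s what lets the output land with the right expressivity, timing and tone.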
Why does advanced speech synthesis matter?
This could matter to anyone wanting to create more natural sounding interfaces because, in a real conversation, tone can change everything.
Consider this scene in The One with the Butt (Friends, Series 1, Episode 6), where Joey performs in a play. A terrible play. Afterwards, an agent leaves Joey her card. When he announces this to the group, Phoebe, shocked that anyone could possibly be handed an agent’s card after such a terrible performance in such an awful play, responds, confused: “Based on THIS play?”. Joey gives her a disapproving stare, which causes Phoebe to switch to a more positive, surprised and supportive tone: “Based on this play!”. The exact same words. Totally different meaning, based purely on tone of voice.
Check out the video below for an example of the difference.
This means that you can now, in theory, create more genuinely empathetic-sounding dialogue. Imagine the tone your human agents use when handling a call about a lost credit card compared to one about checking whether a large payment has arrived in the customer’s account. Or a conversation that involves giving bad news to a customer, such as a shipping delay or a refund refused because the return window has passed.
This means:
If the customer sounds stressed, the AI can respond calmly.
If the customer is excited, the AI can match their energy.
If the conversation calls for empathy, the AI can deliver with softness.
In theory, you can now create the same experiences using AI, turning those monotone, static, robotic voices that can only be made marginally more expressive through the manipulation of SSML into realistic, engaging, natural-sounding voices.
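For contrast, this is roughly what ‘manipulation of SSML’ means in practice: a minimal, hand-tuned sketch using standard SSML tags (exact support varies by TTS vendor). Everything here is decided in advance by a designer, and nothing in it reacts to how the caller actually sounds.

```python
# Minimal hand-tuned SSML sketch. <prosody>, <break> and <emphasis> are
# standard SSML tags, though vendor support varies. None of this adapts
# to the caller's tone at runtime.
ssml = """
<speak>
  <prosody rate="slow" pitch="low">I'm really sorry to hear that.</prosody>
  <break time="400ms"/>
  Let's get that card <emphasis level="moderate">blocked</emphasis> straight away.
</speak>
"""
print(ssml)
```

That gap, between static markup and speech that adapts to the person on the other end of the line, is exactly what models like CSM are aiming at.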
Instead of cold, neutral responses, voice agents could:
Apologise sincerely
Offer reassurance with warmth
Explain policies with confidence
Escalate issues with urgency
When I spoke to Rana Gujral, CEO of Behavioral Signals, on the VUX World podcast, he shared the results of a study that I think adds weight to the potential value of a solution like Sesame. The study found that pairing callers with contact centre agents who shared the same speaking style resulted in:
12-17% revenue improvement
+8% Call Success Ratio
+10% Customer Satisfaction
Imagine being able to do that at scale, for every interaction.
You could consider this the missing link that will finally enable us to create great voice experiences that understand users perfectly and can respond in a way that sounds just like a human, including mirroring tonality and emotion.
The ethical minefield we’re about to walk through
Now, that said, this does throw up some pretty big ethical questions that probably shouldn’t be left unanswered before you run off and start Sesame-ing everything. Namely:
Will users mistakenly believe they’re speaking to a human?
Does that matter, as long as their issue is solved successfully, or is it dishonest?
What does that mean for consent? Data is being captured and stored from the get-go, including a codification of how you’re speaking. Is that acceptable, or necessary, in order to provide the service?
Do we disclose that this is an AI application or just let the user crack on?
How do we avoid emotional manipulation and sinister use cases?
What level of realism aligns with your brand values?
What does this do for the user’s mental model and their expectations of the experience?
Would it create unrealistic expectations about the system’s intelligence?
Are we confident that we can design systems that are so successful that they warrant (or can live up to) this level of expectation?
I’m sure plenty of people can come up with a much longer list than this.
Now, I’m not saying you shouldn’t aim for the kind of experience Sesame seems to promise, but you should certainly consider the questions above and be able to rationalise its use.
The potential of purposefully synthetic voices
I think there’s also an argument here for using purposefully synthetic voices. Ones that are clearly artificial, but still expressive and charming. Not like Alexa, which is trying to be human-like, but more like C-3PO or something. Something that is quite clearly a robot and isn’t trying to hide the fact. These voices could set expectations clearly and could still feel trustworthy and delightful. They could still leverage a technology like Sesame and the realism of Eleven Labs, but embody the unmistakable tone of an actual robot.
Something to ponder.
Your AI Ultimatum
When creating voice automation solutions, which will be a requirement now or at some point in the future, you have a choice to make in how you go about it. Do you want something that is indistinguishable from your human colleagues? Or do you want something that’s clearly an AI system? How will you approach transparency and disclosure? Will you make it explicit up front, or leave users to work it out for themselves?
The tech is coming. The opportunity is real. The responsibility is yours.