Voice cloning + agent = uncanny-valley synthesis on emotionally-charged utterances

Question

ElevenLabs voice clone works well on neutral sentences but fails on emotionally-charged utterances — "I'm so sorry for your loss" sounds flat and slightly wrong. Adding SSML <prosody> tags helps slightly. Using the Turbo model helps neither.

Commercial use case: customer support agent that handles sensitive conversations. Training data for the clone was a single speaker with neutral news-reading audio. Target utterances span a wide emotional range.

Recommend an approach to emotional range in voice synthesis: retrain with emotional samples, prosody annotation, or a two-model approach (neutral + emotion overlay). Discuss real-time-latency implications.

Must remain low-latency (real-time conversation). Budget for retraining the clone if it's the right answer.

Voice cloning + agent = uncanny-valley synthesis on emotionally-charged utterances

context

goal

constraints

0 answers

your answer