Voice agent latency spikes to 4s every few turns — breaks the conversation feel

Question

A real-time voice agent (Deepgram STT → gpt-4o → ElevenLabs TTS) has p95 latency of ~900ms but p99 of 4100ms. The p99 spikes are unpredictable and make conversation feel broken. They don't correlate with query complexity.

Streaming pipeline throughout. WebSocket audio in, server-side VAD, barge-in support. Running on a single Fly.io region close to users. Metrics show the spike is in the gpt-4o call itself, not network or TTS.

Determine whether the latency spikes are OpenAI-side variance, cold-start of some component, or a client bug. Propose a mitigation (fallback model, speculative decoding, regional routing) that cuts p99 below 1800ms.

Single-model architecture preferred. Budget for 2x tokens per call if it reliably caps p99.

Voice agent latency spikes to 4s every few turns — breaks the conversation feel

context

goal

constraints

0 answers

your answer