Barge-in is the hardest part of a voice agent
Most voice-agent demos fail the moment a real caller interrupts. The agent keeps talking over them, loses the turn, or cuts itself off awkwardly. Barge-in — letting the user cut in mid-response without the system falling apart — is where production voice AI earns or loses trust.
What "barge-in" actually involves
Not just "stop talking when the user speaks." Production barge-in needs:
- Voice-activity detection with low false-positive rates on noise, laughter, and backchannels ("uh-huh", "okay"). False barge-ins destroy the rhythm of conversation.
- Streaming ASR that emits partials fast enough to cancel agent TTS within ~120ms of true speech onset. Cloud-provider round-trips kill this if you don't stream.
- Context preservation — when the user interrupts, the LLM needs to know what the agent was about to say so the next turn picks up coherently, not a hard reset.
- TTS fade-out, not hard cut. A 40–80ms fade with a short silence feels like a polite pause; a hard cut feels like a dropped call.
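The detection side of the list above can be sketched as a single gate over streaming ASR partials. Everything here is illustrative, not a specific vendor API — the backchannel list, the 120ms floor, and the function names are assumptions:

```python
# Illustrative barge-in gate: decide whether a streaming-ASR partial
# should cancel agent TTS. Thresholds and word list are placeholders
# to be tuned on real call logs.

BACKCHANNELS = {"uh-huh", "mm-hmm", "okay", "ok", "yeah", "right"}

def should_barge_in(partial_text: str, speech_ms: int,
                    min_speech_ms: int = 120) -> bool:
    """True if this user audio should cancel the agent's TTS."""
    text = partial_text.strip().lower()
    if speech_ms < min_speech_ms:
        return False   # too short: likely noise, not sustained speech
    if text in BACKCHANNELS:
        return False   # acknowledgement, not an interruption
    return True
```

In practice the backchannel set is per-language and per-domain, and the duration floor trades false positives against cancellation latency.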
Latency budget
For natural-feeling turn-taking (the informal benchmark from academic conversational-AI work — see Skantze's turn-taking survey), the round-trip from user-speech-end to agent-speech-start has to stay under ~1.2s at p95. That budget fills fast:
- ASR endpoint detection: 200–400ms (tune silence threshold per domain)
- LLM first-token latency: 250–600ms (model + prompt size dependent)
- TTS time-to-first-audio: 150–400ms
- Network + jitter: 50–150ms
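A quick sanity check on that budget, summing the worst case of each range above:

```python
# The worst cases of the stages listed above, against the ~1.2s p95
# target. Numbers are copied from the ranges in the list, not measured.
BUDGET_MS = 1200
stages_worst_case_ms = {
    "asr_endpoint": 400,
    "llm_first_token": 600,
    "tts_first_audio": 400,
    "network_jitter": 150,
}
total = sum(stages_worst_case_ms.values())  # 1550ms
over_budget = total > BUDGET_MS             # True: serial worst cases don't fit
```

Serial worst cases alone overshoot the budget by ~350ms, which is the arithmetic argument for overlapping the stages rather than running them back-to-back.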
Two decisions that pay back immediately: stream the LLM output token-by-token into TTS (don't wait for the full response), and co-locate the ASR/LLM/TTS stack in one region with the telephony provider. Deepgram's streaming docs and ElevenLabs' WebSocket API are written with this architecture in mind.
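The first decision — streaming LLM output into TTS — amounts to flushing at clause boundaries instead of waiting for the full response. A minimal sketch, where `synthesize` stands in for any streaming TTS send (a WebSocket write, for instance) and is not a real API:

```python
# Illustrative clause-level flush from an LLM token stream into TTS.
# `synthesize` is a placeholder for the actual TTS call.

def stream_to_tts(tokens, synthesize, boundary=".!?,;"):
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok and tok[-1] in boundary:
            synthesize("".join(buf))   # speak this clause now
            buf = []
    if buf:
        synthesize("".join(buf))       # flush any trailing text
```

Flushing at punctuation rather than per-token keeps TTS prosody natural while still letting audio start before the LLM finishes.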
Barge-in policy, not just barge-in detection
Two rules we apply on every build:
- Grace window on agent-initiated questions. If the agent has just asked a question, don't fire barge-in on the user's thinking-out-loud ("uh, so…"). Treat the first 600ms of user audio as the start of their turn, not an interruption.
- No barge-in during confirmations. When the agent is reading back a critical number ("your order total is ₹4,820"), suppress barge-in until the number is fully spoken. Callers interrupt confirmations out of habit; missing a digit is expensive.
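Both rules above reduce to one policy gate, assuming the agent tags each utterance with what it is doing. The flags and names here are illustrative:

```python
# Sketch of the two policies as a single check. `AgentUtterance` and its
# flags are hypothetical; in a real stack they come from dialogue state.
from dataclasses import dataclass

@dataclass
class AgentUtterance:
    just_asked_question: bool = False
    reading_back_critical: bool = False

def barge_in_allowed(utt: AgentUtterance, user_speech_ms: int,
                     grace_ms: int = 600) -> bool:
    if utt.reading_back_critical:
        return False    # confirmations are non-interruptible
    if utt.just_asked_question and user_speech_ms < grace_ms:
        return False    # thinking-pause, not an interruption
    return True
```

Keeping policy separate from detection means the VAD/ASR thresholds can be tuned once, while per-utterance behavior stays a dialogue-state decision.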
What breaks in production that demos hide
- Phone-network audio is 8kHz narrowband. ASR models trained primarily on 16kHz wideband data can see WER rise by 5–15% on it. Pick models tuned for telephony or add a bandwidth-extension step.
- Indian English, Hindi-English code-switching, and regional accents degrade commercial ASR significantly. We benchmark on real call transcripts, not vendor demo audio.
- DTMF tones, hold music from transfers, and IVR menus all show up mid-call and must be filtered before ASR.
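DTMF is the easiest of those three to filter, because the tone frequencies are fixed by spec: each keypress is one of four row frequencies plus one of four column frequencies. A sketch using the standard Goertzel algorithm on 8kHz frames (the 205-sample frame and 0.1 threshold are conventional-but-illustrative choices, not tuned values):

```python
import math

# Goertzel power at the eight DTMF frequencies; a frame with two strong
# tones is flagged as DTMF and can be dropped before it reaches ASR.
DTMF_FREQS = [697, 770, 852, 941, 1209, 1336, 1477, 1633]

def goertzel_power(samples, freq, rate=8000):
    n = len(samples)
    k = round(n * freq / rate)           # nearest DFT bin
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev2 * s_prev2 + s_prev * s_prev - coeff * s_prev * s_prev2
    return power / (n / 2) ** 2          # ~A^2 for a pure tone of amplitude A

def looks_like_dtmf(samples, rate=8000, threshold=0.1):
    powers = sorted(goertzel_power(samples, f, rate) for f in DTMF_FREQS)
    # DTMF is always two simultaneous tones, one row + one column
    return powers[-1] > threshold and powers[-2] > threshold
```

Hold music and IVR prompts are harder — they need energy/music classifiers or carrier-side signaling rather than a frequency test.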
The checklist
Before a voice agent goes live, we verify:
- p95 turn-taking latency < 1.4s on the target telephony stack
- Barge-in false-positive rate < 2% on a 500-call holdout
- Context recovery on interruption — agent resumes coherently on 95%+ of test cases
- Critical-number readback is non-interruptible
- Graceful handoff to human with full conversation context (transcript + LLM summary)
- Eval harness replays production calls weekly for regression checks
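The latency and false-positive gates in that checklist are mechanical enough to encode directly in the weekly harness. A sketch, assuming each replayed call yields per-turn latencies and a barge-in label (the function names and the p95 method are illustrative):

```python
import math

# Hypothetical regression gate over replayed production calls, checking
# the two numeric thresholds from the checklist above.

def p95(values):
    ordered = sorted(values)
    idx = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

def regression_ok(turn_latencies_ms, false_barge_ins, total_calls):
    return (p95(turn_latencies_ms) < 1400          # p95 turn-taking gate
            and false_barge_ins / total_calls < 0.02)  # barge-in FP gate
```

Failing this gate blocks the deploy; the qualitative checks (context recovery, handoff) still need human review of sampled calls.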
Voice agents that feel natural aren't smarter models. They're tighter loops with honest latency budgets and barge-in policies written by someone who listened to the call logs.
