Barge-in is the hardest part of a voice agent
Most voice-agent demos fail the moment a real caller interrupts. The agent keeps talking over them, loses the turn, or cuts itself off awkwardly. Barge-in — letting the user cut in mid-response without the system falling apart — is where production voice AI earns or loses trust.
What "barge-in" actually involves
Not just "stop talking when the user speaks." Production barge-in needs:
- Voice-activity detection with low false-positive rates on noise, laughter, and backchannels ("uh-huh", "okay"). False barge-ins destroy the rhythm of conversation.
- Streaming ASR that emits partials fast enough to cancel agent TTS within ~120ms of true speech onset. Cloud-provider round-trips kill this if you don't stream.
- Context preservation — when the user interrupts, the LLM needs to know what the agent was about to say so the next turn picks up coherently, not a hard reset.
- TTS fade-out, not hard cut. A 40–80ms fade with a short silence feels like a polite pause; a hard cut feels like a dropped call.
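The detection side of the list above can be sketched as a single gate over streaming ASR partials. Everything here is illustrative, not a specific vendor API — the backchannel list, the 120ms floor, and the function names are assumptions:

```python
# Illustrative barge-in gate: decide whether a streaming-ASR partial
# should cancel agent TTS. Thresholds and word list are placeholders
# to be tuned on real call logs.

BACKCHANNELS = {"uh-huh", "mm-hmm", "okay", "ok", "yeah", "right"}

def should_barge_in(partial_text: str, speech_ms: int,
                    min_speech_ms: int = 120) -> bool:
    """True if this user audio should cancel the agent's TTS."""
    text = partial_text.strip().lower()
    if speech_ms < min_speech_ms:
        return False   # too short: likely noise, not sustained speech
    if text in BACKCHANNELS:
        return False   # acknowledgement, not an interruption
    return True
```

In practice the backchannel set is per-language and per-domain, and the duration floor trades false positives against cancellation latency.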
Latency budget
For natural-feeling turn-taking (the informal benchmark from academic conversational-AI work — see Skantze's turn-taking survey), the round-trip from user-speech-end to agent-speech-start has to stay under ~1.2s at p95. That budget fills fast:
- ASR endpoint detection: 200–400ms (tune silence threshold per domain)
- LLM first-token latency: 250–600ms (model + prompt size dependent)
- TTS time-to-first-audio: 150–400ms
- Network + jitter: 50–150ms
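A quick sanity check on that budget, summing the worst case of each range above:

```python
# The worst cases of the stages listed above, against the ~1.2s p95
# target. Numbers are copied from the ranges in the list, not measured.
BUDGET_MS = 1200
stages_worst_case_ms = {
    "asr_endpoint": 400,
    "llm_first_token": 600,
    "tts_first_audio": 400,
    "network_jitter": 150,
}
total = sum(stages_worst_case_ms.values())  # 1550ms
over_budget = total > BUDGET_MS             # True: serial worst cases don't fit
```

Serial worst cases alone overshoot the budget by ~350ms, which is the arithmetic argument for overlapping the stages rather than running them back-to-back.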
Two decisions that pay back immediately: stream the LLM output token-by-token into TTS (don't wait for the full response), and co-locate the ASR/LLM/TTS stack in one region with the telephony provider. Deepgram's streaming docs and ElevenLabs' WebSocket API are written with this architecture in mind.
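The first decision — streaming LLM output into TTS — amounts to flushing at clause boundaries instead of waiting for the full response. A minimal sketch, where `synthesize` stands in for any streaming TTS send (a WebSocket write, for instance) and is not a real API:

```python
# Illustrative clause-level flush from an LLM token stream into TTS.
# `synthesize` is a placeholder for the actual TTS call.

def stream_to_tts(tokens, synthesize, boundary=".!?,;"):
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok and tok[-1] in boundary:
            synthesize("".join(buf))   # speak this clause now
            buf = []
    if buf:
        synthesize("".join(buf))       # flush any trailing text
```

Flushing at punctuation rather than per-token keeps TTS prosody natural while still letting audio start before the LLM finishes.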
Barge-in policy, not just barge-in detection
Two rules we apply on every build:
- Grace window on agent-initiated questions. If the agent has just asked a question, don't fire barge-in on the user's thinking-out-loud ("uh, so…"). Treat the first 600ms of user audio as the start of their turn, not an interruption.
- No barge-in during confirmations. When the agent is reading back a critical number ("your order total is ₹4,820"), suppress barge-in until the number is fully spoken. Callers interrupt confirmations out of habit; missing a digit is expensive.
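Both rules above reduce to one policy gate, assuming the agent tags each utterance with what it is doing. The flags and names here are illustrative:

```python
# Sketch of the two policies as a single check. `AgentUtterance` and its
# flags are hypothetical; in a real stack they come from dialogue state.
from dataclasses import dataclass

@dataclass
class AgentUtterance:
    just_asked_question: bool = False
    reading_back_critical: bool = False

def barge_in_allowed(utt: AgentUtterance, user_speech_ms: int,
                     grace_ms: int = 600) -> bool:
    if utt.reading_back_critical:
        return False    # confirmations are non-interruptible
    if utt.just_asked_question and user_speech_ms < grace_ms:
        return False    # thinking-pause, not an interruption
    return True
```

Keeping policy separate from detection means the VAD/ASR thresholds can be tuned once, while per-utterance behavior stays a dialogue-state decision.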
What breaks in production that demos hide
- Phone-network audio is 8kHz narrowband. ASR models trained primarily on 16kHz wideband data can see WER rise by 5–15% on it. Pick models tuned for telephony or add a bandwidth-extension step.
- Indian English, Hindi-English code-switching, and regional accents degrade commercial ASR significantly. We benchmark on real call transcripts, not vendor demo audio.
- DTMF tones, hold music from transfers, and IVR menus all show up mid-call and must be filtered before ASR.
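DTMF is the easiest of those three to filter, because the tone frequencies are fixed by spec: each keypress is one of four row frequencies plus one of four column frequencies. A sketch using the standard Goertzel algorithm on 8kHz frames (the 205-sample frame and 0.1 threshold are conventional-but-illustrative choices, not tuned values):

```python
import math

# Goertzel power at the eight DTMF frequencies; a frame with two strong
# tones is flagged as DTMF and can be dropped before it reaches ASR.
DTMF_FREQS = [697, 770, 852, 941, 1209, 1336, 1477, 1633]

def goertzel_power(samples, freq, rate=8000):
    n = len(samples)
    k = round(n * freq / rate)           # nearest DFT bin
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev2 * s_prev2 + s_prev * s_prev - coeff * s_prev * s_prev2
    return power / (n / 2) ** 2          # ~A^2 for a pure tone of amplitude A

def looks_like_dtmf(samples, rate=8000, threshold=0.1):
    powers = sorted(goertzel_power(samples, f, rate) for f in DTMF_FREQS)
    # DTMF is always two simultaneous tones, one row + one column
    return powers[-1] > threshold and powers[-2] > threshold
```

Hold music and IVR prompts are harder — they need energy/music classifiers or carrier-side signaling rather than a frequency test.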
The checklist
Before a voice agent goes live, we verify:
- p95 turn-taking latency < 1.4s on the target telephony stack
- Barge-in false-positive rate < 2% on a 500-call holdout
- Context recovery on interruption — agent resumes coherently on 95%+ of test cases
- Critical-number readback is non-interruptible
- Graceful handoff to human with full conversation context (transcript + LLM summary)
- Eval harness replays production calls weekly for regression checks
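The latency and false-positive gates in that checklist are mechanical enough to encode directly in the weekly harness. A sketch, assuming each replayed call yields per-turn latencies and a barge-in label (the function names and the p95 method are illustrative):

```python
import math

# Hypothetical regression gate over replayed production calls, checking
# the two numeric thresholds from the checklist above.

def p95(values):
    ordered = sorted(values)
    idx = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

def regression_ok(turn_latencies_ms, false_barge_ins, total_calls):
    return (p95(turn_latencies_ms) < 1400          # p95 turn-taking gate
            and false_barge_ins / total_calls < 0.02)  # barge-in FP gate
```

Failing this gate blocks the deploy; the qualitative checks (context recovery, handoff) still need human review of sampled calls.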
Voice agents that feel natural aren't smarter models. They're tighter loops with honest latency budgets and barge-in policies written by someone who listened to the call logs.
