Audio Preprocessing & Barge-In
When noise suppression and echo cancellation help speech-to-text — and when they hurt.
Audio preprocessing (noise suppression, echo cancellation, and related signal processing) is one of the most common topics customers ask about when building with Deepgram. The short answer: it depends on your use case. For voice agents, preprocessing can improve conversational flow. For transcription accuracy, it often makes things worse.
This guide explains the tradeoffs and provides concrete recommendations based on real-world experience across hundreds of production deployments.
Modern speech-to-text models like Nova and Flux are trained on diverse, real-world audio — including noisy environments, background chatter, and imperfect microphones. These models extract patterns from the full acoustic signal, including parts that humans might consider “noise.”
When you apply noise suppression before sending audio to Deepgram:
The result: enterprise customers consistently report lower transcription accuracy after applying noise suppression, particularly in domain-specific use cases where subtle speech cues matter (healthcare, legal, financial services).
For a deeper dive into the science behind this, read our blog post: The Noise Reduction Paradox: Why It May Hurt Speech-to-Text Accuracy.
Noise suppression and echo cancellation provide the most value in voice agent scenarios where audio quality affects conversational flow:
Noise suppression doesn’t always degrade transcription — but the risk is highest in these scenarios:
In all of these cases, the impact depends on the specific audio, environment, and suppression algorithm. The only way to know for certain is to A/B test with and without preprocessing on representative samples from your production audio.
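One way to run that A/B test is to transcribe the same recordings twice (raw and preprocessed) and compare word error rate (WER) against human-verified reference transcripts. The sketch below is a minimal illustration, not a Deepgram tool: `word_error_rate` and `compare_pipelines` are hypothetical helper names, text normalization is limited to lowercasing, and the result is a simple per-sample average.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Classic Levenshtein distance, computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def compare_pipelines(samples):
    """samples: iterable of (reference, raw_transcript, preprocessed_transcript).

    Returns (mean WER without preprocessing, mean WER with preprocessing).
    """
    samples = list(samples)
    raw = [word_error_rate(ref, raw_t) for ref, raw_t, _ in samples]
    pre = [word_error_rate(ref, pre_t) for ref, _, pre_t in samples]
    return sum(raw) / len(raw), sum(pre) / len(pre)
```

If the second number is not clearly lower than the first on representative production audio, the preprocessing step is not earning its place in the transcription path.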
Sample rate: Deepgram models are trained across the full range of audio quality. The sweet spot is 16 kHz; there is no accuracy gain above this. If your telephony audio is band-limited to 8 kHz, upsampling to a higher rate provides no benefit, because upsampling cannot restore frequency content that was never captured.
Always test without preprocessing first. Deepgram’s models are trained on real-world audio and handle noise natively. Many customers who A/B test find that removing noise suppression improves transcription accuracy. Only add preprocessing if you can measure a clear improvement on your own audio.
Based on extensive testing across voice agent and transcription deployments, here are our recommendations in priority order:
Use what the operating system or browser gives you. Platform-level echo cancellation has direct access to both the microphone and speaker output, so time alignment is handled automatically. This is the easiest and most effective first step.
Platform-native echo cancellation isn’t perfect, but it’s closest to the audio source and essentially free. Start here before adding anything else.
AEC algorithms are dynamic — they “learn” the acoustic environment when audio starts flowing. Having the agent speak an opening greeting helps the AEC calibrate before the customer responds, which reduces echo in the critical first seconds of a conversation.
If you’re building a voice agent and experiencing false barge-ins or degraded conversational flow, consider adding noise suppression with a tool like Krisp.
What matters most is the signal-to-noise ratio (SNR) reaching the microphone — which is affected by background noise, but also by mic distance, placement, room acoustics, and hardware quality. This is why no single suppression level works for every deployment.
Key guidelines:
For pure transcription use cases (pre-recorded or streaming), we recommend skipping noise suppression entirely. Send unaltered audio to Deepgram for the best accuracy.
Even with good echo cancellation, some echo will make it through to the speech-to-text engine. Rather than adding more aggressive audio processing, use Deepgram’s Keyterm feature to handle known patterns.
For example, if the AI agent’s greeting bleeds through and gets transcribed, you can map known phrases to empty strings or use keyterm boosting to ensure domain-specific vocabulary is transcribed correctly despite minor echo artifacts.
Another approach: compare STT output against the known TTS text your agent just spoke. If the transcript matches what the agent said, discard it — this is a lightweight, software-level echo cancellation technique that requires no audio pipeline changes.
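A minimal sketch of that transcript-matching idea follows. It assumes you keep the text of the agent's most recent TTS utterance; the function name `is_probable_echo` and the 0.8 similarity threshold are illustrative choices, not Deepgram defaults, and should be tuned on your own traffic.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def is_probable_echo(transcript: str, agent_text: str, threshold: float = 0.8) -> bool:
    """Return True if the transcript closely matches part of what the agent just said."""
    heard = _normalize(transcript)
    spoken = _normalize(agent_text)
    if not heard or not spoken:
        return False

    # Echo is often a fragment of the agent utterance, so slide a window of
    # the transcript's length across the agent text and keep the best match.
    window_size = len(heard)
    best = 0.0
    for start in range(max(1, len(spoken) - window_size + 1)):
        window = spoken[start:start + window_size]
        best = max(best, SequenceMatcher(None, heard, window).ratio())
    return best >= threshold
```

Note the limitation: a caller who genuinely repeats the agent's words back will also be filtered, so it is worth applying this check only while the agent is speaking or for a short period afterwards.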
Both approaches are fast to implement and can be iterated on as you discover patterns in production.
Server-side echo cancellation is powerful but non-trivial. Consider it only if you have:
If you’re on iOS or web and attempting server-side AEC, you need to capture TTS playback as a reference stream and time-align it with the microphone input — this is significantly more complex.
If you go this route:
We strongly recommend using Deepgram’s built-in speech detection rather than an external Voice Activity Detector (VAD) for triggering barge-in. Client-side VADs are sensitive to any audio energy — background noise, TV speech, door slams — which leads to frequent false positives (unintentional interruptions). Deepgram’s approach operates at the model level, so it understands speech content rather than just audio energy, resulting in significantly fewer false triggers.
The tradeoff: Deepgram’s model-level detection may have slightly higher latency than a client-side VAD, since it waits for enough audio to confirm real speech. In practice, this produces a better user experience because the agent interrupts less often by mistake.
Flux has integrated, semantically-aware turn detection built into the model itself. It produces StartOfTurn and EndOfTurn events as part of the decode process — not from a separate VAD pipeline.
To trigger barge-in with Flux, listen for a StartOfTurn message. Every StartOfTurn is guaranteed to contain a non-empty transcript, which means you won’t get false triggers from silence or noise. This is the most reliable approach for barge-in, especially in noisy environments.
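The handler below sketches how that might look in application code. The message shape is an assumption: it treats each Flux message as JSON with an `event` field set to `StartOfTurn`, so check the Flux message reference for the exact field names. `BargeInController` and the `stop_playback` callback are hypothetical names for your own playback logic.

```python
import json


def is_start_of_turn(raw_message: str) -> bool:
    """True if the message is a StartOfTurn event (field name `event` is assumed)."""
    try:
        message = json.loads(raw_message)
    except json.JSONDecodeError:
        return False
    return isinstance(message, dict) and message.get("event") == "StartOfTurn"


class BargeInController:
    """Stops agent playback when the user starts a turn while the agent is speaking."""

    def __init__(self, stop_playback):
        self._stop_playback = stop_playback  # your own function that halts TTS audio
        self.agent_is_speaking = False

    def on_message(self, raw_message: str) -> bool:
        """Feed every message received from the stream; returns True on barge-in."""
        if self.agent_is_speaking and is_start_of_turn(raw_message):
            self._stop_playback()
            self.agent_is_speaking = False
            return True
        return False
```

Set `agent_is_speaking` to `True` whenever TTS playback begins, so that a StartOfTurn received while the agent is silent is treated as a normal user turn rather than an interruption.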
For barge-in with Nova-3 streaming, listen for non-empty interim result transcripts. Our recommended approach is to require at least 2 words in an interim result before triggering, or wait for a final result with at least 1 word. This balances responsiveness with accuracy and avoids false triggers from single-word noise artifacts.
If you need more sensitivity, you can trigger on any interim result with at least 1 word — but expect a higher rate of false positives.
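The word-count rule above can be expressed as a small predicate. This sketch assumes each streaming result has already been parsed into a dict with an `is_final` flag and the transcript at `channel.alternatives[0].transcript`; verify those paths against the streaming response you actually receive. The function name `should_barge_in` is hypothetical.

```python
def should_barge_in(result: dict,
                    min_interim_words: int = 2,
                    min_final_words: int = 1) -> bool:
    """Apply the recommended thresholds: 2+ words for interim, 1+ for final results."""
    alternatives = result.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return False

    words = alternatives[0].get("transcript", "").split()
    required = min_final_words if result.get("is_final") else min_interim_words
    return len(words) >= required
```

Passing `min_interim_words=1` gives the more sensitive behavior described above, at the cost of more false positives.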
Use a standalone VAD (such as Silero VAD) only if you need the lowest possible barge-in latency and can tolerate a higher rate of false positives. This is a deliberate tradeoff — you gain responsiveness but lose precision. In most voice agent deployments, Deepgram’s model-level approach produces a better overall experience.