Utterance End
utterance_end_ms
string
The utterance end feature helps detect the end of speech while transcribing live streaming audio.
Utterance end analyzes your interim and final results to identify a gap of the configured length after the last finalized word, which is why it requires interim results. It is a convenient server-side implementation of gap detection. The same detection could instead be implemented client-side by analyzing the timing of transcription results, so developers can choose the approach that best fits their application architecture.
Enable Feature
To enable this feature, add utterance_end_ms=1000 to your request. Replace 1000 with the number of milliseconds you want Deepgram to wait before sending the UtteranceEnd message.
For example, if you set utterance_end_ms=1000, Deepgram will wait for a 1000 millisecond gap between transcribed words before sending the UtteranceEnd message.
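As a rough illustration of adding the parameter to a request, the sketch below builds a streaming URL with the query string set. The endpoint URL and the helper name `build_listen_url` are assumptions that do not come from this page; adjust them to match your own integration.

```python
from urllib.parse import urlencode

# Assumption: this streaming endpoint is not named on this page.
LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(utterance_end_ms: int = 1000, base_url: str = LISTEN_URL) -> str:
    """Return a streaming URL with utterance end and interim results enabled."""
    if utterance_end_ms <= 0:
        raise ValueError("utterance_end_ms must be a positive number of milliseconds")
    params = {
        "utterance_end_ms": str(utterance_end_ms),
        # interim_results=true is required whenever utterance_end_ms is set.
        "interim_results": "true",
    }
    return f"{base_url}?{urlencode(params)}"


print(build_listen_url(1000))
```

The helper always sets interim_results=true alongside utterance_end_ms, since the feature depends on interim results.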
How It Works: A Concrete Example
Here's how utterance end works with interim and finalized results:
- Speaker says: "Hello there" (pauses for 1.5 seconds) "How are you?"
- With utterance_end_ms=1000:
  - 0.5s: Interim result: "Hello"
  - 1.0s: Interim result: "Hello there"
  - 2.0s: Final result: "Hello there" with word timings:
    - "Hello": start=0.1s, end=0.6s
    - "there": start=0.7s, end=1.2s
  - Utterance end clock starts: At 1.2s (end time of last finalized word "there")
  - 2.2s: 1000ms gap reached → UtteranceEnd message sent (last_word_end=1.2)
  - 3.5s: New speech detected, interim result: "How are you?"
The utterance end "clock" only starts counting after receiving the end timestamp of the last finalized word, ensuring accurate gap detection.
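The timing in this example can be reproduced with a little arithmetic. The following is a minimal sketch and not Deepgram's implementation: the function name `utterance_end_time` and its arguments are hypothetical, and it models only audio timestamps, ignoring network and processing latency.

```python
from typing import Optional


def utterance_end_time(last_final_word_end: float,
                       next_speech_start: Optional[float],
                       utterance_end_ms: int) -> Optional[float]:
    """Return the audio time (seconds) at which the configured gap is reached,
    or None if speech resumes before the gap is long enough."""
    deadline = last_final_word_end + utterance_end_ms / 1000.0
    if next_speech_start is not None and next_speech_start < deadline:
        return None  # speech resumed first, so no gap of the configured length
    return deadline


# Timeline from the example: "there" ends at 1.2s, new speech arrives at 3.5s.
fired_at = utterance_end_time(1.2, 3.5, utterance_end_ms=1000)
print(f"{fired_at:.1f}")  # 2.2
```

If the speaker had resumed at 1.8s instead, the function would return None, because the 1000 ms gap would never have been reached.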
Technical Notes
While utterance end provides convenient server-side gap detection, there are some technical considerations to keep in mind:
Gap Detection Within Final Results
Utterance end only analyzes gaps that occur after finalized words. It does not detect gaps that are contained entirely within a single final result. This design extends beyond just internal word gaps: if a final result's last word ends at 7.5 seconds but the result itself doesn't end until 10.0 seconds, utterance end will wait for an additional interim result before considering the gap, because the entire gap is contained within that single final result.
This means you could potentially get faster gap detection by implementing client-side analysis that includes gaps within final transcripts.
For example, if a final result contains "Hello… there" with a 2-second pause represented in the word timings, utterance end would not fire based on that internal gap; it only analyzes gaps that occur after the final result is processed.
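A client-side check for gaps inside a final result might look like the sketch below. It assumes each word is a dictionary with `word`, `start`, and `end` keys (times in seconds); the function name `internal_gaps` and the sample word timings are made up for illustration.

```python
from typing import Dict, List, Tuple


def internal_gaps(words: List[Dict], min_gap_s: float) -> List[Tuple[str, str, float]]:
    """Return (previous_word, next_word, gap_seconds) for every pause between
    consecutive words that lasts at least min_gap_s."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap_s:
            gaps.append((prev["word"], nxt["word"], round(gap, 3)))
    return gaps


# Hypothetical final result: "Hello... there" with a 2-second pause inside it.
final_words = [
    {"word": "Hello", "start": 0.1, "end": 0.6},
    {"word": "there", "start": 2.6, "end": 3.1},
]
print(internal_gaps(final_words, min_gap_s=1.0))
```

Because this check runs on the word timings of a single final result, it can report a pause that server-side utterance end would not act on.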
Voice Agent Use Case Considerations
Utterance end fires based on detecting a gap even if it determines that speech is continuing after the gap. This can make it less ideal for voice agent applications where you want to wait for truly complete utterances.
Example with utterance_end_ms=2000: the last finalized word ends at 3.4s and new speech begins at 5.5s. Utterance end fires once the 2-second gap after 3.4s has elapsed, even though the speaker was actually continuing their thought. For voice agents that need to wait for truly complete utterances, a client-side implementation with additional logic may be more appropriate.
When to Use Server-Side vs Client-Side Implementation
Use Deepgram's utterance end when:
- You need simple, reliable gap detection after finalized words
- You want to minimize client-side processing complexity
- You're building transcription or note-taking applications
Consider client-side implementation when:
- You need to detect gaps within final results for faster response times
- You're building voice agents that require precise utterance boundary detection
- You want to add additional logic (e.g., analyzing speech patterns, semantic completeness)
- You need to customize gap detection behavior beyond what the server provides
Configuration Requirements:
Note for Self-Hosted and Deepgram Dedicated Users: If your endpoint has a modified step size configuration, the minimum value becomes that step size instead of 1,000 ms. For example:
- Step size configured for 0.2 (200 ms) → minimum utterance_end_ms value is 200
- Step size configured for 1.5 (1500 ms) → minimum utterance_end_ms value is 1500
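The rule above amounts to a unit conversion from seconds to milliseconds. The sketch below illustrates it; the helper name `min_utterance_end_ms` is hypothetical, and it assumes the step size is expressed in seconds as in the two examples.

```python
from typing import Optional

DEFAULT_MIN_MS = 1000  # minimum on endpoints without a modified step size


def min_utterance_end_ms(step_size_s: Optional[float] = None) -> int:
    """Return the minimum utterance_end_ms for an endpoint.

    step_size_s is the endpoint's configured step size in seconds, or None
    when the step size has not been modified.
    """
    if step_size_s is None:
        return DEFAULT_MIN_MS
    return round(step_size_s * 1000)


print(min_utterance_end_ms(0.2), min_utterance_end_ms(1.5), min_utterance_end_ms())
```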
To learn more about Deepgram Dedicated or Self-Hosted offerings, reach out to your Deepgram account representative or contact our sales team. For technical details on configuring custom endpoints, see our Custom Endpoints documentation.
UtteranceEnd relies on Deepgram's interim_results feature. Interim results are typically sent every second, so using a value of less than 1000 ms for utterance_end_ms will not offer you any benefits.
When using utterance_end_ms, setting interim_results=true is also required.
Results
The UtteranceEnd JSON message will look similar to this:
- The type field is always UtteranceEnd for this event.
- The channel field is interpreted as [A,B], where A is the channel index, and B is the total number of channels. The above example is channel 0 of single-channel audio.
- The last_word_end field is the time at which end of speech was detected.
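The three fields described above could be read on the client roughly as follows. This is a sketch under stated assumptions: the `UtteranceEnd` dataclass and `parse_utterance_end` helper are hypothetical names, and the sample payload uses made-up values shaped after the field descriptions rather than a captured server message.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class UtteranceEnd:
    channel_index: int    # A in [A, B]
    total_channels: int   # B in [A, B]
    last_word_end: float  # seconds


def parse_utterance_end(raw: str) -> Optional[UtteranceEnd]:
    """Return an UtteranceEnd for UtteranceEnd messages, or None for any other type."""
    message = json.loads(raw)
    if message.get("type") != "UtteranceEnd":
        return None
    channel_index, total_channels = message["channel"]
    return UtteranceEnd(channel_index, total_channels, float(message["last_word_end"]))


# Illustrative payload (values are made up): channel 0 of single-channel audio.
sample = '{"type": "UtteranceEnd", "channel": [0, 1], "last_word_end": 1.2}'
print(parse_utterance_end(sample))
```

Returning None for other message types lets the same handler sit in a receive loop that also sees Results messages.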
If you compare this to the Results response below, you will see that the last_word_end from the UtteranceEnd response matches the data in the alternatives[0].words[1].end field of the Results response. This is due to the gap identified after the final word.
In addition, you can see is_final=true, which is sent because of the interim_results feature.