Utterance End

Utterance End sends a message when the end of speech is detected in live streaming audio.

utterance_end_ms string

Streaming · All available languages

Utterance end helps you detect the end of speech while transcribing live streaming audio.

Utterance end analyzes your interim and final results to identify a gap of the configured length after the last finalized word; interim results are required for it to identify those gaps. It is a convenient server-side implementation of gap detection that you could alternatively implement client-side by analyzing the timing of transcription results, so you can choose the approach that best fits your application architecture.

Enable Feature

To enable this feature, add utterance_end_ms=1000 to your request, replacing 1000 with the number of milliseconds you want Deepgram to wait before sending the UtteranceEnd message. Once a gap of at least that length follows the last finalized word, Deepgram sends the UtteranceEnd message.

For example, if you set utterance_end_ms=1000, Deepgram will wait for a 1000 millisecond gap between transcribed words before sending the UtteranceEnd message.
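For instance, with a raw WebSocket connection the parameter goes in the query string of the request URL. A minimal sketch of building that URL (the model parameter here is illustrative; substitute your own options):

Python
from urllib.parse import urlencode

params = {
    "model": "nova-3",           # illustrative; use your own options
    "interim_results": "true",   # required when using utterance_end_ms
    "utterance_end_ms": "1000",  # gap length in milliseconds
}
url = f"wss://api.deepgram.com/v1/listen?{urlencode(params)}"
print(url)
# wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true&utterance_end_ms=1000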

How It Works: A Concrete Example

Here's how utterance end works with interim and finalized results:

  1. Speaker says: "Hello there" (pauses for 1.5 seconds) "How are you?"
  2. With utterance_end_ms=1000:
    • 0.5s: Interim result: "Hello"
    • 1.0s: Interim result: "Hello there"
    • 2.0s: Final result: "Hello there" with word timings:
      • "Hello": start=0.1s, end=0.6s
      • "there": start=0.7s, end=1.2s
    • 🕒 Utterance end clock starts: at 1.2s (end time of last finalized word "there")
    • 2.2s: 1000ms gap reached → UtteranceEnd message sent (last_word_end=1.2)
    • 3.5s: New speech detected, interim result: "How are you?"

The utterance end "clock" only starts counting after receiving the end timestamp of the last finalized word, ensuring accurate gap detection.
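The core comparison is whether the silence after the last finalized word has reached the configured length. An illustrative sketch using the numbers from the walkthrough above (the function is ours, not part of any Deepgram SDK):

Python
def gap_reached(last_word_end_s: float,
                current_time_s: float,
                utterance_end_ms: int) -> bool:
    """True once at least utterance_end_ms of silence has elapsed
    since the last finalized word ended."""
    return (current_time_s - last_word_end_s) * 1000 >= utterance_end_ms

# "there" ends at 1.2s; by 2.2s a full 1000 ms gap has elapsed,
# so the UtteranceEnd message is sent (last_word_end=1.2).
print(gap_reached(1.2, 2.1, 1000))  # False: still inside the gap window
print(gap_reached(1.2, 2.2, 1000))  # True: UtteranceEnd fires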

Technical Notes

While utterance end provides convenient server-side gap detection, there are some technical considerations to keep in mind:

Gap Detection Within Final Results

Utterance end only analyzes gaps that occur after finalized words. It does not detect gaps that are contained entirely within a single final result. This design extends beyond just internal word gaps: if a final result's last word ends at 7.5 seconds but the result itself doesn't end until 10.0 seconds, utterance end will wait for an additional interim result before considering the gap, because the entire gap is contained within that single final result.

This means you could potentially get faster gap detection by implementing client-side analysis that includes gaps within final transcripts.

For example, if a final result contains "Hello… there" with a 2-second pause represented in the word timings, utterance end would not fire based on that internal gap; it only analyzes gaps that occur after the final result is processed.
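If your application needs to react to pauses like that, one option is to scan the word timings inside each final result yourself. A minimal client-side sketch (the function name is illustrative; the word objects follow the shape of the Results response shown later on this page):

Python
def find_internal_gaps(words, min_gap_s=1.0):
    """Return (gap_start, gap_end) pairs for pauses between
    consecutive words within a single final result."""
    gaps = []
    for prev, curr in zip(words, words[1:]):
        if curr["start"] - prev["end"] >= min_gap_s:
            gaps.append((prev["end"], curr["start"]))
    return gaps

# A final result containing a 2-second pause between its words:
words = [
    {"word": "hello", "start": 0.1, "end": 0.6},
    {"word": "there", "start": 2.6, "end": 3.1},
]
print(find_internal_gaps(words))  # [(0.6, 2.6)]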

Voice Agent Use Case Considerations

Utterance end fires when it detects a gap of the configured length, even if speech continues after the gap. This can make it less ideal for voice agent applications where you want to wait for truly complete utterances.

Example with utterance_end_ms=2000:

[Interim] Hello there. This
[Interim] Hello there. This is a test.
[Final] Hello there. This is a test. (last_word_end = 3.4s)
[Interim] (result_end = 4.7s)
[Interim] I'm going to continue (first_word_start = 5.5s)
↑ UtteranceEnd fires here, even though speech continues

In this scenario, utterance end fires after detecting the 2-second gap (between when the last word ended at 3.4s and when new speech began at 5.5s), but the speaker was actually continuing their thought. For voice agents that need to wait for truly complete utterances, client-side implementation with additional logic may be more appropriate.
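One way to add that logic is to treat UtteranceEnd as tentative: start a short timer when it arrives, and cancel it if a new interim result containing speech shows up first. A rough sketch, assuming an asyncio application (the class, callbacks, and delay are illustrative, not part of the Deepgram API):

Python
import asyncio

class TentativeUtteranceEnd:
    """Delay acting on UtteranceEnd; cancel if new speech arrives first."""

    def __init__(self, on_confirmed, confirm_after_s=0.5):
        self.on_confirmed = on_confirmed      # called once we trust the boundary
        self.confirm_after_s = confirm_after_s
        self._pending = None

    def handle_utterance_end(self, message):
        # An UtteranceEnd arrived: start (or restart) the confirmation timer.
        self._cancel_pending()
        self._pending = asyncio.ensure_future(self._confirm(message))

    def handle_interim_with_speech(self):
        # New words arrived: the speaker was continuing, so discard
        # the tentative utterance end.
        self._cancel_pending()

    async def _confirm(self, message):
        await asyncio.sleep(self.confirm_after_s)
        self.on_confirmed(message)

    def _cancel_pending(self):
        if self._pending is not None:
            self._pending.cancel()
            self._pending = None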

When to Use Server-Side vs Client-Side Implementation

Use Deepgram's utterance end when:

  • You need simple, reliable gap detection after finalized words
  • You want to minimize client-side processing complexity
  • You're building transcription or note-taking applications

Consider client-side implementation when:

  • You need to detect gaps within final results for faster response times
  • You're building voice agents that require precise utterance boundary detection
  • You want to add additional logic (e.g., analyzing speech patterns, semantic completeness)
  • You need to customize gap detection behavior beyond what the server provides

Configuration Requirements

Parameter    Value
Minimum      1,000 ms (default)
Maximum      5,000 ms
Step size    Any integer value within range

Note for Self-Hosted and Deepgram Dedicated Users: If your endpoint has a modified step size configuration, the minimum value becomes that step size instead of 1,000 ms. For example:

  • Step size configured for 0.2 (200 ms) → minimum utterance_end_ms value is 200
  • Step size configured for 1.5 (1500 ms) → minimum utterance_end_ms value is 1500
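In other words, the effective minimum is simply the configured step size expressed in milliseconds. A one-line illustration (the helper is ours, not part of any SDK):

Python
def min_utterance_end_ms(step_size_s=1.0):
    # Default step size is 1.0 s, which gives the standard 1000 ms minimum.
    return int(step_size_s * 1000)

print(min_utterance_end_ms(0.2))  # 200
print(min_utterance_end_ms(1.5))  # 1500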

To learn more about Deepgram Dedicated or Self-Hosted offerings, reach out to your Deepgram account representative or contact our sales team. For technical details on configuring custom endpoints, see our Custom Endpoints documentation.

UtteranceEnd relies on Deepgram's interim_results feature. Interim results are typically sent every second, so using a value of less than 1000 ms for utterance_end_ms will not offer any benefit.

When using utterance_end_ms, setting interim_results=true is also required.

Python
# see https://github.com/deepgram/deepgram-python-sdk/blob/main/examples/streaming/async_microphone/main.py
# for complete example code

from deepgram import LiveOptions

options: LiveOptions = LiveOptions(
    model="nova-3",
    language="en-US",
    # Apply smart formatting to the output
    smart_format=True,
    # Raw audio format details
    encoding="linear16",
    channels=1,
    sample_rate=16000,
    # To get UtteranceEnd, the following must be set:
    interim_results=True,
    utterance_end_ms="1000",
    vad_events=True,
    # Time in milliseconds of silence to wait for before finalizing speech
    endpointing=300,
)
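
To actually receive the UtteranceEnd message with the SDK, register a handler for the corresponding event before starting the connection. A sketch in the style of the linked async microphone example (client accessor names vary between SDK versions, so treat the details as illustrative):

Python
from deepgram import DeepgramClient, LiveTranscriptionEvents

deepgram = DeepgramClient()  # reads DEEPGRAM_API_KEY from the environment
dg_connection = deepgram.listen.asynclive.v("1")

async def on_utterance_end(self, utterance_end, **kwargs):
    # Fires once the configured gap follows the last finalized word
    print(f"Utterance ended, last_word_end={utterance_end.last_word_end}")

dg_connection.on(LiveTranscriptionEvents.UtteranceEnd, on_utterance_end)

# inside an async main(), start streaming with the options defined above:
# await dg_connection.start(options)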

Results

The UtteranceEnd JSON message will look similar to this:

JSON
{
  "channel": [
    0,
    1
  ],
  "last_word_end": 2.395,
  "type": "UtteranceEnd"
}
  • The type field is always UtteranceEnd for this event.
  • The channel field is interpreted as [A,B], where A is the channel index, and B is the total number of channels. The above example is channel 0 of single-channel audio.
  • The last_word_end field is the end time, in seconds, of the last finalized word before the detected gap.
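
If you are consuming raw WebSocket messages rather than SDK events, these fields can be read directly. An illustrative parse of the message above:

Python
import json

raw = '{"channel": [0, 1], "last_word_end": 2.395, "type": "UtteranceEnd"}'
msg = json.loads(raw)

if msg["type"] == "UtteranceEnd":
    channel_index, channel_count = msg["channel"]
    print(f"Speech ended on channel {channel_index} of {channel_count} "
          f"at {msg['last_word_end']}s")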

If you compare this to the Results response below, you will see that the last_word_end from the UtteranceEnd response matches the data in the alternatives[0].words[1].end field of the Results response. This is due to the gap identified after the final word.

In addition, you can see is_final=true, which appears because the interim_results feature is enabled.

JSON
{
  "channel": {
    "alternatives": [
      {
        "confidence": 0.77905273,
        "transcript": "Testing. 123.",
        "words": [
          {
            "confidence": 0.69189453,
            "end": 1.57,
            "punctuated_word": "Testing.",
            "start": 1.07,
            "word": "testing"
          },
          {
            "confidence": 0.77905273,
            "end": 2.395,
            "punctuated_word": "123.",
            "start": 1.895,
            "word": "123"
          }
        ]
      }
    ]
  },
  "channel_index": [
    0,
    1
  ],
  "duration": 1.65,
  "is_final": true,
  "metadata": {
    ...
  },
  "type": "Results"
}