Utterance End

Utterance End sends a message when the end of speech is detected in live streaming audio.

Parameter: utterance_end_ms (string).

Deepgram's UtteranceEnd feature detects the end of speech while transcribing live streaming audio.

Rather than relying solely on the Voice Activity Detector (VAD), which analyzes tonal nuances to differentiate between silent and non-silent audio, UtteranceEnd examines the word timings of both finalized and interim transcripts. When it identifies a sufficiently long gap between transcribed words, it treats the utterance as finished and sends an UtteranceEnd message to signal that a speech endpoint was detected.
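
To make the idea concrete, here is a toy Python sketch of gap-based detection. It illustrates the concept only and is not Deepgram's implementation; the word timings are borrowed from the Results response shown later on this page.

# Toy illustration of gap-based utterance detection (concept only):
# flag the end of an utterance when the silence between consecutive
# transcribed words exceeds a threshold.

def find_utterance_ends(words, gap_ms=1000):
    """words: list of dicts with 'start' and 'end' times in seconds."""
    ends = []
    for prev, curr in zip(words, words[1:]):
        gap = (curr["start"] - prev["end"]) * 1000  # gap in milliseconds
        if gap >= gap_ms:
            ends.append(prev["end"])  # utterance ended at the previous word
    return ends

words = [
    {"start": 1.07, "end": 1.57},    # "testing"
    {"start": 1.895, "end": 2.395},  # "123"
    {"start": 3.9, "end": 4.4},      # next utterance, after a long gap
]
print(find_utterance_ends(words))  # [2.395]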

Enable Feature

To enable this feature, add utterance_end_ms=N to your request. Replace N with the number of milliseconds you want Deepgram to wait before sending the UtteranceEnd message:

utterance_end_ms=1000

For example, if you set utterance_end_ms=1000, Deepgram will wait for a 1000 millisecond gap between transcribed words before sending the UtteranceEnd message.

We recommend setting utterance_end_ms to 1000 ms or higher. This is because UtteranceEnd relies on Deepgram's Interim Results feature, and interim results are typically sent about once per second, so a value below one second offers no additional benefit.

📘 When using utterance_end_ms, setting interim_results=true is also required.
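
As a minimal sketch, here is one way to open a live streaming connection with both parameters set, using Python and the third-party websockets package. Sending audio is omitted for brevity, and the exact header argument name varies across websockets versions.

# Minimal sketch of enabling UtteranceEnd on a live streaming connection.
# Replace DEEPGRAM_API_KEY with your key; audio sending is omitted.
import asyncio
import json
import websockets

URL = (
    "wss://api.deepgram.com/v1/listen"
    "?utterance_end_ms=1000"   # gap to wait for before sending UtteranceEnd
    "&interim_results=true"    # required when using utterance_end_ms
)

async def listen():
    headers = {"Authorization": "Token DEEPGRAM_API_KEY"}
    # In websockets v14+ the argument is `additional_headers` instead.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "UtteranceEnd":
                print("End of speech at", msg["last_word_end"])

asyncio.run(listen())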

Results

The UtteranceEnd JSON message will look similar to this:

{
  "channel": [
    0,
    1
  ],
  "last_word_end": 2.395,
  "type": "UtteranceEnd"
}
  • The type field is always UtteranceEnd for this event.
  • The channel field is interpreted as [A,B], where A is the channel index, and B is the total number of channels. The above example is channel 0 of single-channel audio.
  • The last_word_end field is the end time, in seconds, of the last word spoken before the detected gap.
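
A handler for this event might look like the following sketch; handle_message is a hypothetical callback name, but the field names match the JSON above.

# Hedged sketch of a handler for the UtteranceEnd event.
import json

def handle_message(raw: str) -> None:
    msg = json.loads(raw)
    if msg.get("type") != "UtteranceEnd":
        return
    channel_index, channel_count = msg["channel"]  # [A, B] as described above
    print(
        f"Speech ended on channel {channel_index} of {channel_count} "
        f"at {msg['last_word_end']}s"
    )

handle_message('{"channel": [0, 1], "last_word_end": 2.395, "type": "UtteranceEnd"}')
# Speech ended on channel 0 of 1 at 2.395s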

If you compare this to the Results response below, you will see that the last_word_end from the UtteranceEnd response matches the data in the alternatives[0].words[1].end field of the Results response. This is due to the gap identified after the final word.

You can also see is_final=true, which is present because the interim_results feature is enabled.

{
  "channel": {
    "alternatives": [
      {
        "confidence": 0.77905273,
        "transcript": "Testing. 123.",
        "words": [
          {
            "confidence": 0.69189453,
            "end": 1.57,
            "punctuated_word": "Testing.",
            "start": 1.07,
            "word": "testing"
          },
          {
            "confidence": 0.77905273,
            "end": 2.395,
            "punctuated_word": "123.",
            "start": 1.895,
            "word": "123"
          }
        ]
      }
    ]
  },
  "channel_index": [
    0,
    1
  ],
  "duration": 1.65,
  "is_final": true,
  "metadata": {
   ...
  "type": "Results"
}
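
One way to correlate the two events is to remember the end time of the last transcribed word from each Results message, then compare it when the UtteranceEnd message arrives. The function names below are illustrative, not part of any SDK.

# Illustrative sketch: track the last word's end time from Results
# messages and confirm it matches last_word_end on UtteranceEnd.
last_word_end_seen = None

def on_results(msg):
    global last_word_end_seen
    words = msg["channel"]["alternatives"][0]["words"]
    if words:
        last_word_end_seen = words[-1]["end"]

def on_utterance_end(msg):
    # For the payloads above, both values are 2.395.
    assert msg["last_word_end"] == last_word_end_seen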

Comparing Utterance End and Endpointing

UtteranceEnd and Endpointing are both designed to address the challenge of detecting the end of speech during live streaming transcription, but they do so by different means.

Endpointing relies on Deepgram's Voice Activity Detector (VAD) to monitor audio streams and detect periods of silence. This approach is effective in many scenarios but can be susceptible to background noise interference.

On the other hand, UtteranceEnd leverages word timings within transcripts to identify significant gaps between transcribed words, signaling the conclusion of spoken utterances. By analyzing word-level data, UtteranceEnd provides a nuanced method for detecting speech endpoints, even in the presence of background noise.
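
The two features are not mutually exclusive and can be enabled on the same request. The sketch below assumes the endpointing query parameter (a silence duration in milliseconds) and the speech_final field on Results messages, both documented elsewhere, alongside utterance_end_ms.

# Sketch of combining VAD-based endpointing with UtteranceEnd.
from urllib.parse import urlencode

params = {
    "endpointing": 300,         # VAD-based endpoint after 300 ms of silence
    "utterance_end_ms": 1000,   # word-gap-based UtteranceEnd after 1000 ms
    "interim_results": "true",  # required for utterance_end_ms
}
url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)

def on_message(msg):
    if msg.get("type") == "Results" and msg.get("speech_final"):
        print("VAD endpoint reached")      # endpointing fired
    elif msg.get("type") == "UtteranceEnd":
        print("Word-timing gap detected")  # UtteranceEnd fired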