Utterance End

Utterance End sends a message when the end of speech is detected in live streaming audio.

Parameter: utterance_end_ms (string).

Deepgram's UtteranceEnd feature detects the end of speech while transcribing live streaming audio.

Rather than relying solely on the Voice Activity Detector (VAD), which analyzes tonal nuances to differentiate between silent and non-silent audio, UtteranceEnd examines the word timings of both finalized and interim transcripts. When it identifies a sufficiently long gap between transcribed words, it treats the utterance as finished and sends an UtteranceEnd message to signal that a speech endpoint was detected.
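
To make the idea concrete, here is a toy Python sketch of gap-based detection. It illustrates the concept only and is not Deepgram's implementation; the word timings are borrowed from the Results response shown later on this page.

# Toy illustration of gap-based utterance detection (concept only):
# flag the end of an utterance when the silence between consecutive
# transcribed words exceeds a threshold.

def find_utterance_ends(words, gap_ms=1000):
    """words: list of dicts with 'start' and 'end' times in seconds."""
    ends = []
    for prev, curr in zip(words, words[1:]):
        gap = (curr["start"] - prev["end"]) * 1000  # gap in milliseconds
        if gap >= gap_ms:
            ends.append(prev["end"])  # utterance ended at the previous word
    return ends

words = [
    {"start": 1.07, "end": 1.57},    # "testing"
    {"start": 1.895, "end": 2.395},  # "123"
    {"start": 3.9, "end": 4.4},      # next utterance, after a long gap
]
print(find_utterance_ends(words))  # [2.395]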

Enable Feature

To enable this feature, add utterance_end_ms=N to your request. Replace N with the number of milliseconds you want Deepgram to wait before sending the UtteranceEnd message:

utterance_end_ms=1000

For example, if you set utterance_end_ms=1000, Deepgram will wait for a 1000 millisecond gap between transcribed words before sending the UtteranceEnd message.

We recommend setting utterance_end_ms to 1000 ms or higher. This is because UtteranceEnd relies on Deepgram's Interim Results feature, and interim results are typically sent about once per second, so a value below one second offers no additional benefit.

📘 When using utterance_end_ms, setting interim_results=true is also required.
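
As a minimal sketch, here is one way to open a live streaming connection with both parameters set, using Python and the third-party websockets package. Sending audio is omitted for brevity, and the exact header argument name varies across websockets versions.

# Minimal sketch of enabling UtteranceEnd on a live streaming connection.
# Replace DEEPGRAM_API_KEY with your key; audio sending is omitted.
import asyncio
import json
import websockets

URL = (
    "wss://api.deepgram.com/v1/listen"
    "?utterance_end_ms=1000"   # gap to wait for before sending UtteranceEnd
    "&interim_results=true"    # required when using utterance_end_ms
)

async def listen():
    headers = {"Authorization": "Token DEEPGRAM_API_KEY"}
    # In websockets v14+ the argument is `additional_headers` instead.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "UtteranceEnd":
                print("End of speech at", msg["last_word_end"])

asyncio.run(listen())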

Results

The UtteranceEnd JSON message will look similar to this:

{
  "channel": [
    0,
    1
  ],
  "last_word_end": 2.395,
  "type": "UtteranceEnd"
}
  • The type field is always UtteranceEnd for this event.
  • The channel field is interpreted as [A,B], where A is the channel index, and B is the total number of channels. The above example is channel 0 of single-channel audio.
  • The last_word_end field is the end time, in seconds, of the last word spoken before the detected gap.
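
A handler for this event might look like the following sketch; handle_message is a hypothetical callback name, but the field names match the JSON above.

# Hedged sketch of a handler for the UtteranceEnd event.
import json

def handle_message(raw: str) -> None:
    msg = json.loads(raw)
    if msg.get("type") != "UtteranceEnd":
        return
    channel_index, channel_count = msg["channel"]  # [A, B] as described above
    print(
        f"Speech ended on channel {channel_index} of {channel_count} "
        f"at {msg['last_word_end']}s"
    )

handle_message('{"channel": [0, 1], "last_word_end": 2.395, "type": "UtteranceEnd"}')
# Speech ended on channel 0 of 1 at 2.395s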

If you compare this to the Results response below, you will see that the last_word_end from the UtteranceEnd response matches the data in the alternatives[0].words[1].end field of the Results response. This is due to the gap identified after the final word.

You can also see is_final=true, which is present because the interim_results feature is enabled.

{
  "channel": {
    "alternatives": [
      {
        "confidence": 0.77905273,
        "transcript": "Testing. 123.",
        "words": [
          {
            "confidence": 0.69189453,
            "end": 1.57,
            "punctuated_word": "Testing.",
            "start": 1.07,
            "word": "testing"
          },
          {
            "confidence": 0.77905273,
            "end": 2.395,
            "punctuated_word": "123.",
            "start": 1.895,
            "word": "123"
          }
        ]
      }
    ]
  },
  "channel_index": [
    0,
    1
  ],
  "duration": 1.65,
  "is_final": true,
  "metadata": {
   ...
  "type": "Results"
}
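
One way to correlate the two events is to remember the end time of the last transcribed word from each Results message, then compare it when the UtteranceEnd message arrives. The function names below are illustrative, not part of any SDK.

# Illustrative sketch: track the last word's end time from Results
# messages and confirm it matches last_word_end on UtteranceEnd.
last_word_end_seen = None

def on_results(msg):
    global last_word_end_seen
    words = msg["channel"]["alternatives"][0]["words"]
    if words:
        last_word_end_seen = words[-1]["end"]

def on_utterance_end(msg):
    # For the payloads above, both values are 2.395.
    assert msg["last_word_end"] == last_word_end_seen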

Comparing Utterance End and Endpointing

UtteranceEnd and Endpointing are both designed to address the challenge of detecting the end of speech during live streaming transcription, but they do so by different means.

Endpointing relies on Deepgram's Voice Activity Detector (VAD) to monitor audio streams and detect periods of silence. This approach is effective in many scenarios but can be susceptible to background noise interference.

On the other hand, UtteranceEnd leverages word timings within transcripts to identify significant gaps between transcribed words, signaling the conclusion of spoken utterances. By analyzing word-level data, UtteranceEnd provides a nuanced method for detecting speech endpoints, even in the presence of background noise.
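
The two features are not mutually exclusive and can be enabled on the same request. The sketch below assumes the endpointing query parameter (a silence duration in milliseconds) and the speech_final field on Results messages, both documented elsewhere, alongside utterance_end_ms.

# Sketch of combining VAD-based endpointing with UtteranceEnd.
from urllib.parse import urlencode

params = {
    "endpointing": 300,         # VAD-based endpoint after 300 ms of silence
    "utterance_end_ms": 1000,   # word-gap-based UtteranceEnd after 1000 ms
    "interim_results": "true",  # required for utterance_end_ms
}
url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)

def on_message(msg):
    if msg.get("type") == "Results" and msg.get("speech_final"):
        print("VAD endpoint reached")      # endpointing fired
    elif msg.get("type") == "UtteranceEnd":
        print("Word-timing gap detected")  # UtteranceEnd fired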