End of Speech Detection While Live Streaming

Learn how to use End of Speech when transcribing live streaming audio with Deepgram.

To pinpoint when a speaker has finished speaking, it is more effective to be notified as soon as speech stops than to infer it from the first transcribed word that follows. This is achieved through a Voice Activity Detector (VAD), which gauges the tonal nuances of human speech and can better differentiate between silent and non-silent audio.

Limitations of Endpointing

Deepgram's Endpointing and Interim Results features are designed to detect when a speaker finishes speaking.

Deepgram's Endpointing feature uses an audio-based Voice Activity Detector (VAD) to determine when a person is speaking and when there is silence. When the state of the audio goes from speech to a configurable duration of silence (set by the endpointing query parameter), Deepgram will chunk the audio and return a transcript with the speech_final flag set to true.
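As a sketch under some assumptions, the flow looks like this in Python: the wss://api.deepgram.com/v1/listen endpoint, the encoding, sample_rate, and endpointing values, and the Results message layout are taken from Deepgram's general streaming API rather than this guide, the DEEPGRAM_API_KEY environment variable is hypothetical, and the code that sends audio is omitted.

```python
# Minimal sketch: connect to Deepgram live streaming with endpointing enabled
# and print each transcript that arrives with speech_final=true.
# Requires the third-party `websockets` package; sending audio is omitted.
import asyncio
import json
import os

import websockets

URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=16000"
    "&endpointing=300"  # emit speech_final after ~300 ms of detected silence
)

async def listen():
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # Note: newer releases of `websockets` name this argument `additional_headers`.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # ... send audio chunks with `await ws.send(chunk)` from another task ...
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") != "Results":
                continue
            transcript = msg["channel"]["alternatives"][0]["transcript"]
            if msg.get("speech_final"):
                print("Endpoint detected:", transcript)

asyncio.run(listen())
```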

📘

For more information, see Understanding Endpointing and Interim Results When Transcribing Live Streaming Audio.

In a quiet room with little background noise, Deepgram's Endpointing feature works well. In environments with significant background noise, such as music playing, a phone ringing, or the ambient noise of a fast-food drive-thru, the noise can cause the VAD to trigger and prevent the detection of silent audio. Since endpointing only fires after a certain amount of silence has been detected, significant background noise may prevent the speech_final=true flag from being sent.

📘

In rare situations, such as when speaking a phone number, Deepgram may purposefully wait for additional audio from the speaker so it can properly format the transcript (this only occurs when using smart_format=true).

Using UtteranceEnd

To address the limitations described above, Deepgram offers the UtteranceEnd feature. The UtteranceEnd feature looks at the word timings of both finalized and interim results to determine whether a sufficiently long gap in words has occurred. If it has, Deepgram will send a JSON message over the websocket with the following shape:

{"type":"UtteranceEnd", "channel": [0,2], "last_word_end": 3.1}

Your app can wait for this message to be sent over the websocket to identify when the speaker has stopped talking, even if significant background noise is present.

The "channel" field is interpreted as [A,B], where A is the channel index, and B is the total number of channels. The above example is channel 0 of two-channel audio.

The "last_word_end" field is the end timestamp of the last word spoken before the utterance ended on the channel. This timestamp can be used to match against the earlier word-level transcript to identify which word was last spoken before the utterance end message was triggered.

To enable this feature, add the query parameter utterance_end_ms=1234 to your websocket URL and replace 1234 with the number of milliseconds you want Deepgram to wait before sending the UtteranceEnd message.

For example, if you set utterance_end_ms=1000, Deepgram will wait for a 1000 ms gap between transcribed words before sending the UtteranceEnd message. Since this feature relies on word timings in the transcript, it ignores non-speech audio such as a door knocking, a phone ringing, or street noise.

You should set the value of utterance_end_ms to be 1000 ms or higher. Deepgram's Interim Results are sent every 1 second, so using a value of less than 1 second will not offer any benefits.

📘

When using utterance_end_ms, setting interim_results=true is also required.
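As a small sketch, the websocket URL with both parameters might be built like this (the encoding and sample_rate values are illustrative assumptions):

```python
# Sketch: enable UtteranceEnd together with the interim results it depends on.
from urllib.parse import urlencode

params = {
    "encoding": "linear16",
    "sample_rate": 16000,
    "interim_results": "true",   # required when using utterance_end_ms
    "utterance_end_ms": 1000,    # gap in word timings before UtteranceEnd is sent
}
url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)
print(url)
```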

Using UtteranceEnd and Endpointing

You can use both the Endpointing and UtteranceEnd features. They operate independently of one another, so it is possible to use both at the same time. When using both features in your app, you may want to trigger your "speaker has finished speaking" logic using the following rules (sketched in code after the list):

  • trigger when a transcript with speech_final=true is received (any UtteranceEnd message that follows it can be ignored),
  • trigger if you receive an UtteranceEnd message with no preceding speech_final=true message, and send the last-received transcript for further processing.
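Here is a minimal sketch of those rules. The on_message and on_speaker_finished functions are hypothetical names for hooks in your own app, and the message fields follow the shapes described earlier in this guide.

```python
# Sketch: combine speech_final (rule 1) with UtteranceEnd as a fallback (rule 2).
last_transcript = ""
speech_final_seen = False

def on_message(msg):
    global last_transcript, speech_final_seen
    if msg.get("type") == "Results":
        text = msg["channel"]["alternatives"][0]["transcript"]
        if text:
            last_transcript = text
        if msg.get("speech_final"):
            speech_final_seen = True
            on_speaker_finished(last_transcript)      # rule 1
    elif msg.get("type") == "UtteranceEnd":
        if not speech_final_seen:
            on_speaker_finished(last_transcript)      # rule 2: no prior speech_final
        speech_final_seen = False                     # reset for the next utterance

def on_speaker_finished(text):
    print("Speaker finished:", text)
```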

Additional Considerations

Ultimately, any approach to determine when someone has finished speaking is a heuristic one and may fail in rare situations. Since humans can resume talking at any time for any reason, detecting when a speaker has finished speaking or completed their thought is very difficult. To mitigate these concerns for your product, you may need to determine what constitutes "end of thought" or "end of speech" for your customers. For example, a voice-journaling app may need to allow for long pauses before processing the text, but a food ordering app may need to process the audio every few words.
