End of Speech Detection While Live Streaming
Learn how to use End of Speech when transcribing live streaming audio with Deepgram.
To pinpoint the end of speech post-speaking more effectively, immediate notification of speech detection is preferred over relying on the initial transcribed word inference. This is achieved through a Voice Activity Detector (VAD), which gauges the tonal nuances of human speech and can better differentiate between silent and non-silent audio.
Limitations of Endpointing
Deepgram's Endpointing and Interim Results features are designed to detect when a speaker finishes speaking.
Deepgram's Endpointing feature uses an audio-based Voice Activity Detector (VAD) to determine when a person is speaking and when there is silence. When the state of the audio goes from speech to a configurable duration of silence (set by the endpointing
query parameter), Deepgram will chunk the audio and return a transcript with the speech_final
flag set to true
.
For more information, see Understanding Endpointing and Interim Results When Transcribing Live Streaming Audio).
In a quiet room with little background noise, Deepgram's Endpointing feature works well. In environments with significant background noise such as playing music, a ringing phone, or at a fast food drive thru, the background noise can cause the VAD to trigger and prevent the detection of silent audio. Since endpointing only fires after a certain amount of silence has been detected, a significant amount of background noise may prevent the speech_final=true
flag from being sent.
In rare situations, such as when speaking a phone number, Deepgram may purposefully wait for additional audio from the speaker so it can properly format the transcript (this only occurs when using
smart_format=true
).
Using UtteranceEnd
To address the limitations described above, Deepgram offers the UtteranceEnd feature. The UtteranceEnd feature looks at the word timings of both finalized and interim results to determine if a sufficiently long gap in words has occurred. If it has, Deepgram will send a JSON message over the websocket with following shape:
{"type":"UtteranceEnd", "channel": [0,2], "last_word_end": 3.1}
Your app can wait for this message to be sent over the websocket to identify when the speaker has stopped talking, even if significant background noise is present.
The "channel"
field is interpreted as [A,B]
, where A
is the channel index, and B
is the total number of channels. The above example is channel 0 of two-channel audio.
The "last_word_end"
field is the end timestamp of the last word spoken before the utterance ended on the channel. This timestamp can be used to match against the earlier word-level transcript to identify which word was last spoken before the utterance end message was triggered.
To enable this feature, add the query parameter utterance_end_ms=1234
to your websocket URL and replace 1234
with the number of milliseconds you want Deepgram to wait before sending the UtteranceEnd
message.
For example, if you set utterance_end_ms=1000
Deepgram will wait for a 1000 ms gaps between transcribed words before sending the UtteranceEnd
message. Since this feature relies on word timings in the message transcript, it ignores non-speech audio such as: door knocking, a phone ringing or street noise.
You should set the value of utterance_end_ms
to be 1000
ms or higher. Deepgram's Interim Results are sent every 1 second, so using a value of less than 1 second will not offer any benefits.
When using
utterance_end_ms
, settinginterim_results=true
is also required.
Using UtteranceEnd and Endpointing
You can use both the Endpointing and UtteranceEnd features. They operate completely independently from one another, so it is possible to use both at the same time. When using both features in your app, you may want to trigger your "speaker has finished speaking" logic using the following rules:
- trigger when a transcript with
speech_final=true
is received (which may be followed by anUtteranceEnd
message which can be ignored), - trigger if you receive an
UtteranceEnd
message with no precedingspeech_final=true
message and send the last-received transcript for further processing.
Additional Consideration
Ultimately, any approach to determine when someone has finished speaking is a heuristic one and may fail in rare situations. Since humans can resume talking at any time for any reason, detecting when a speaker has finished speaking or completed their thought is very difficult. To mitigate these concerns for your product, you may need to determine what constitutes "end of thought" or "end of speech" for your customers. For example, a voice-journaling app may need to allow for long pauses before processing the text, but a food ordering app may need to process the audio every few words.
Updated about 2 months ago