Understand Endpointing

Last updated 04/14/2021

Deepgram uses speech endpoint detection to infer when a user has finished speaking or paused for a significant amount of time, indicating the completion of an idea. When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes its results for the processed time range and returns the transcript with a speech_final parameter set to true.

Voice Activity Detection

Endpointing relies on voice activity detection (VAD), which monitors the incoming audio and triggers when a sufficiently long pause is detected.
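
To make the idea of "a sufficiently long pause" concrete, here is a toy sketch in Python. It is not Deepgram's implementation: the function name find_endpoints, the per-frame speech flags, and the frame_ms and min_pause_ms values are all hypothetical, chosen only to illustrate how a pause threshold turns a run of silence into an endpoint.

```python
def find_endpoints(frame_is_speech, frame_ms=20, min_pause_ms=300):
    """Return the times (in ms) at which a pause after speech first reaches min_pause_ms.

    Toy illustration only: frame_is_speech is a hypothetical list of booleans,
    one per fixed-length audio frame, marking whether speech was present.
    """
    endpoints = []
    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silence_ms = 0
            continue
        silence_ms += frame_ms
        if heard_speech and silence_ms >= min_pause_ms:
            # The pause is now long enough to count as an endpoint.
            endpoints.append((index + 1) * frame_ms)
            heard_speech = False  # wait for new speech before triggering again
    return endpoints
```

In this sketch, silence before any speech never produces an endpoint, and a pause shorter than min_pause_ms is ignored.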

By default, the pause length that voice activity detection waits for is 10 ms, but you can configure this value using the vad_turnoff parameter. To learn more, see Speech Engine API Reference: Real-time Streaming - vad_turnoff.
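
As a minimal sketch, the connection URL could be built as follows. This assumes that vad_turnoff is passed as a query string parameter and that its value is in milliseconds. The helper name build_stream_url and the value 500 are illustrative only, so check the API reference for the accepted range.

```python
from urllib.parse import urlencode

BASE_URL = "wss://brain.deepgram.com/v2/listen/stream"


def build_stream_url(vad_turnoff_ms=None):
    """Return the streaming URL, optionally with a vad_turnoff query parameter.

    Assumption: vad_turnoff is a query parameter expressed in milliseconds.
    """
    if vad_turnoff_ms is None:
        return BASE_URL
    return f"{BASE_URL}?{urlencode({'vad_turnoff': vad_turnoff_ms})}"


# Illustrative value only: wait for a longer pause before triggering.
url = build_stream_url(vad_turnoff_ms=500)
```

The resulting string can be passed to websockets.connect in place of the bare URL.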

Enable Endpointing

Endpointing is enabled by default, but the previous examples turned it off with the endpointing=false parameter.

Let’s run our script again with endpointing enabled. To enable endpointing, edit one of the following lines, depending on your environment:

On line 21 of deepgram-streaming-example.py, remove the endpointing=false parameter:

21 async with websockets.connect('wss://brain.deepgram.com/v2/listen/stream', extra_headers=extra_headers) as ws:

On line 3 of deepgram-streaming-example.js, remove the endpointing=false parameter:

3 const ws = new WebSocket('wss://brain.deepgram.com/v2/listen/stream',

Analyze the Transcript

When run, the script prints out the transcript for each response it receives and shows a new key called speech_final.

{
  "channel_index":[
    0,
    1
  ],
  "duration":1.039875,
  "start":0.0,
  "is_final":false,
  "speech_final":false,
  "channel":{
    "alternatives":[
      {
        "transcript":"another big",
        "confidence":0.9600255,
        "words":[
          {
            "word":"another",
            "start":0.2971154,
            "end":0.7971154,
            "confidence":0.9588303
          },
          {
            "word":"big",
            "start":0.85173076,
            "end":1.039875,
            "confidence":0.9600255
          }
        ]
      }
    ]
  }
}
...

When speech_final is set to true, Deepgram has detected an endpoint and immediately finalized its results for the processed time range.
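
One way to use this flag is to buffer transcripts until an endpoint arrives and then emit them as a single utterance. The sketch below assumes responses shaped like the example above. It also assumes that responses with is_final set to false are interim results that a later response supersedes, which this page does not state. The function name collect_utterances is hypothetical.

```python
import json


def collect_utterances(messages):
    """Group finalized transcripts into utterances that end at each endpoint.

    messages is an iterable of JSON strings shaped like the response above.
    Assumption: responses with is_final == false are interim and are skipped.
    """
    utterances = []
    pending = []
    for raw in messages:
        response = json.loads(raw)
        alternatives = response.get("channel", {}).get("alternatives", [])
        transcript = alternatives[0]["transcript"] if alternatives else ""
        # A response that marks an endpoint is treated as finalized as well.
        finalized = response.get("is_final") or response.get("speech_final")
        if finalized and transcript:
            pending.append(transcript)
        if response.get("speech_final"):
            if pending:
                utterances.append(" ".join(pending))
            pending = []
    return utterances
```

In a streaming script, the same logic could run inside the receive loop, handing each completed utterance to downstream code as soon as speech_final arrives.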