Endpointing

Endpointing returns transcripts when pauses in speech are detected.

Query parameter: endpointing (string).

Deepgram’s Endpointing feature monitors incoming streaming audio and detects sufficiently long pauses that are likely to represent an endpoint in speech. When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes its results for the processed time range and returns the transcript with a speech_final parameter set to true.

Endpointing relies on a Voice Activity Detector, which monitors the incoming audio and triggers when a sufficiently long pause is detected.

You can customize the length of time used to detect whether a speaker has finished speaking by setting the endpointing parameter to an integer value. By default, Deepgram uses 10 milliseconds.

ℹ️

Endpointing can be used with Deepgram's Interim Results feature. To compare and contrast these features, and to explore best practices for using them together, see Using Endpointing and Interim Results with Live Streaming Audio.

Enable Feature

By default, endpointing is enabled and will return transcripts after detecting 10 milliseconds of silence. When endpointing is enabled, once a speaker finishes speaking, no transcripts will be sent back until the speech resumes or the required amount of silence has been detected. Once either of those conditions is met, a transcript with speech_final=true will be sent back.

The period of silence required for endpointing may be configured. When you call Deepgram's API, set the endpointing parameter to an integer representing a number of milliseconds:

endpointing=500

With this setting, Deepgram waits until 500 milliseconds of silence has passed before finalizing and returning transcripts.

Endpointing may be disabled by setting endpointing=false. If endpointing is disabled, transcriptions will be returned at a cadence determined by Deepgram's chunking algorithms.
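
As a concrete illustration, the sketch below opens a streaming connection with endpointing configured, using Python and the third-party websockets package. The /v1/listen endpoint, the Token authorization scheme, and the endpointing parameter come from this guide; the audio format, chunk size, pacing, and file source are assumptions to adapt to your own setup.

# Minimal sketch: stream raw audio to Deepgram with endpointing=500.
# Assumes 16 kHz, 16-bit, single-channel linear PCM audio on disk.
import asyncio
import websockets

DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=16000"
    "&endpointing=500"  # finalize after 500 ms of silence
)

async def stream(audio_path: str, api_key: str) -> None:
    headers = {"Authorization": f"Token {api_key}"}
    # Note: newer releases of the websockets library name this
    # argument additional_headers instead of extra_headers.
    async with websockets.connect(DEEPGRAM_URL, extra_headers=headers) as ws:

        async def sender():
            with open(audio_path, "rb") as audio:
                while chunk := audio.read(4096):
                    await ws.send(chunk)
                    await asyncio.sleep(0.125)  # pace roughly in real time
            await ws.send(b"")  # empty frame signals the end of the audio

        async def receiver():
            async for message in ws:
                print(message)  # JSON responses, as shown under Results

        await asyncio.gather(sender(), receiver())

# asyncio.run(stream("audio.raw", "YOUR_DEEPGRAM_API_KEY"))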

📘

For an example of audio streaming, see Getting Started with Streaming Audio.

Results

When endpointing is enabled, each streaming response you receive includes a key called speech_final.

{
  "channel_index":[
    0,
    1
  ],
  "duration":1.039875,
  "start":0.0,
  "is_final":false,
  "speech_final":false,
  "channel":{
    "alternatives":[
      {
        "transcript":"another big",
        "confidence":0.9600255,
        "words":[
          {
            "word":"another",
            "start":0.2971154,
            "end":0.7971154,
            "confidence":0.9588303
          },
          {
            "word":"big",
            "start":0.85173076,
            "end":1.039875,
            "confidence":0.9600255
          }
        ]
      }
    ]
  }
}
...

When speech_final is set to true, Deepgram has detected an endpoint and immediately finalized its results for the processed time range.
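
A common client-side pattern, sketched below, is to accumulate the is_final pieces of an utterance and treat speech_final as the signal to emit the completed text. The response fields are the ones shown above; the accumulation strategy itself is an assumption, not a prescribed approach.

import json

utterance_parts: list[str] = []

def handle_message(raw: str) -> None:
    """Collect finalized transcript pieces; flush when an endpoint is detected."""
    response = json.loads(raw)
    alternatives = response.get("channel", {}).get("alternatives", [])
    transcript = alternatives[0].get("transcript", "") if alternatives else ""
    if response.get("is_final") and transcript:
        utterance_parts.append(transcript)
    if response.get("speech_final"):
        print("Utterance:", " ".join(utterance_parts))
        utterance_parts.clear()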

In Practice

Examples of tuning Endpointing:

  • Set a lower value to return finalized transcripts as soon as a break in speech is detected.
  • Set a higher value to wait longer before finalizing, signaling that the speaker has likely ended their thought.
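
For example, a hands-free command interface might favor a low value for snappy turnaround, while a dictation workflow might prefer a higher one. The values below are illustrative, not recommendations:

# Illustrative values only; tune for your own audio and latency needs.
FAST_TURNAROUND = "wss://api.deepgram.com/v1/listen?endpointing=100"
PATIENT_LISTENER = "wss://api.deepgram.com/v1/listen?endpointing=1000"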

ℹ️

By default, Deepgram applies its general AI model, which is a good general-purpose model for everyday situations. To learn more about the customization possible with Deepgram's API, check out the Deepgram API Reference.