Endpointing

Endpointing returns transcripts when pauses in speech are detected.

Query parameter: endpointing (string).

Deepgram’s Endpointing feature monitors incoming streaming audio and detects sufficiently long pauses that are likely to represent an endpoint in speech. When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes its results for the processed time range and returns the transcript with a speech_final parameter set to true.

Endpointing relies on a Voice Activity Detector, which monitors the incoming audio and triggers when a sufficiently long pause is detected.

You can customize the length of time used to detect whether a speaker has finished speaking by setting the endpointing parameter to an integer value. By default, Deepgram uses 10 milliseconds.

ℹ️

Endpointing can be used with Deepgram's Interim Results feature. To compare and contrast these features, and to explore best practices for using them together, see Using Endpointing and Interim Results with Live Streaming Audio.

Enable Feature

By default, endpointing is enabled and will return transcripts after detecting 10 milliseconds of silence. When endpointing is enabled, once a speaker finishes speaking, no transcripts will be sent back until the speech resumes or the required amount of silence has been detected. Once either of those conditions is met, a transcript with speech_final=true will be sent back.

The period of silence required for endpointing may be configured. When you call Deepgram's API, set the endpointing parameter to an integer representing a number of milliseconds:

endpointing=500

With this setting, Deepgram waits until 500 milliseconds of silence has passed before finalizing and returning transcripts.

Endpointing may be disabled by setting endpointing=false. If endpointing is disabled, transcriptions will be returned at a cadence determined by Deepgram's chunking algorithms.
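
As a concrete illustration, the sketch below opens a streaming connection with endpointing configured, using Python and the third-party websockets package. The /v1/listen endpoint, the Token authorization scheme, and the endpointing parameter come from this guide; the audio format, chunk size, pacing, and file source are assumptions to adapt to your own setup.

# Minimal sketch: stream raw audio to Deepgram with endpointing=500.
# Assumes 16 kHz, 16-bit, single-channel linear PCM audio on disk.
import asyncio
import websockets

DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=16000"
    "&endpointing=500"  # finalize after 500 ms of silence
)

async def stream(audio_path: str, api_key: str) -> None:
    headers = {"Authorization": f"Token {api_key}"}
    # Note: newer releases of the websockets library name this
    # argument additional_headers instead of extra_headers.
    async with websockets.connect(DEEPGRAM_URL, extra_headers=headers) as ws:

        async def sender():
            with open(audio_path, "rb") as audio:
                while chunk := audio.read(4096):
                    await ws.send(chunk)
                    await asyncio.sleep(0.125)  # pace roughly in real time
            await ws.send(b"")  # empty frame signals the end of the audio

        async def receiver():
            async for message in ws:
                print(message)  # JSON responses, as shown under Results

        await asyncio.gather(sender(), receiver())

# asyncio.run(stream("audio.raw", "YOUR_DEEPGRAM_API_KEY"))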

📘

For an example of audio streaming, see Getting Started with Streaming Audio.

Results

When endpointing is enabled, each streaming response you receive includes a key called speech_final.

{
  "channel_index":[
    0,
    1
  ],
  "duration":1.039875,
  "start":0.0,
  "is_final":false,
  "speech_final":false,
  "channel":{
    "alternatives":[
      {
        "transcript":"another big",
        "confidence":0.9600255,
        "words":[
          {
            "word":"another",
            "start":0.2971154,
            "end":0.7971154,
            "confidence":0.9588303
          },
          {
            "word":"big",
            "start":0.85173076,
            "end":1.039875,
            "confidence":0.9600255
          }
        ]
      }
    ]
  }
}
...

When speech_final is set to true, Deepgram has detected an endpoint and immediately finalized its results for the processed time range.
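
A common client-side pattern, sketched below, is to accumulate the is_final pieces of an utterance and treat speech_final as the signal to emit the completed text. The response fields are the ones shown above; the accumulation strategy itself is an assumption, not a prescribed approach.

import json

utterance_parts: list[str] = []

def handle_message(raw: str) -> None:
    """Collect finalized transcript pieces; flush when an endpoint is detected."""
    response = json.loads(raw)
    alternatives = response.get("channel", {}).get("alternatives", [])
    transcript = alternatives[0].get("transcript", "") if alternatives else ""
    if response.get("is_final") and transcript:
        utterance_parts.append(transcript)
    if response.get("speech_final"):
        print("Utterance:", " ".join(utterance_parts))
        utterance_parts.clear()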

In Practice

Examples of tuning Endpointing:

  • Set a lower value to return finalized transcripts as soon as a break in speech is detected.
  • Set a higher value to wait longer before finalizing, signaling that the speaker has likely ended their thought.
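
For example, a hands-free command interface might favor a low value for snappy turnaround, while a dictation workflow might prefer a higher one. The values below are illustrative, not recommendations:

# Illustrative values only; tune for your own audio and latency needs.
FAST_TURNAROUND = "wss://api.deepgram.com/v1/listen?endpointing=100"
PATIENT_LISTENER = "wss://api.deepgram.com/v1/listen?endpointing=1000"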

ℹ️

By default, Deepgram applies its general AI model, which is a good general-purpose model for everyday situations. To learn more about the customization possible with Deepgram's API, check out the Deepgram API Reference.