Endpointing & Interim Results With Live Streaming
When using Deepgram to transcribe live streaming audio, two features you can use are endpointing and interim results.

Both of these features monitor incoming live streaming audio, and each can signal that a stage of processing is complete, but they are used in very different ways:
Using Endpointing (speech_final)
Just like a human listener who evaluates pauses within a conversation to detect when it is appropriate to respond to another speaker, Deepgram's Endpointing feature looks for any deviation in the natural flow of speech and returns a finalized response. These responses can then be used as input for a downstream natural language processor (NLP) to indicate that a partial utterance may be ready for processing. Because speakers may pause for varying amounts of time, the system that processes Deepgram responses should be prepared to handle incomplete thoughts as well as a speaker suddenly resuming a thought.
When Deepgram identifies that an endpoint has been reached, the response is marked as speech_final: true. The only purpose of the speech_final marker is to tell you where an endpoint has been found, which indicates that a significantly long pause has been detected.
As an example, consider a speaker responding "yes" on a live call. Although Deepgram's default algorithm automatically sends final results every few seconds, you might want your system to react to the "yes" immediately. In this case, Endpointing would be useful because it would prompt Deepgram to detect the pause after "yes" and quickly send back a finalized result (marked with speech_final: true).
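As a concrete sketch of what reacting to speech_final might look like, here is a minimal Python client for Deepgram's live WebSocket endpoint. It assumes the third-party websockets package and a DEEPGRAM_API_KEY environment variable; the audio source is left abstract, and audio-format parameters are omitted for brevity.

```python
import asyncio
import json
import os

import websockets

# interim_results and endpointing are real Deepgram query parameters; audio
# format parameters (e.g. encoding, sample_rate) are omitted for brevity.
URL = "wss://api.deepgram.com/v1/listen?interim_results=true&endpointing=10"

async def transcribe(audio_chunks):
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # Note: this keyword is named additional_headers in websockets >= 13.
    async with websockets.connect(URL, extra_headers=headers) as ws:

        async def send_audio():
            for chunk in audio_chunks:  # raw audio bytes from your source
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def read_results():
            async for message in ws:
                response = json.loads(message)
                if response.get("type") != "Results":
                    continue  # skip metadata and other non-transcript messages
                transcript = response["channel"]["alternatives"][0]["transcript"]
                if response.get("speech_final"):
                    # Endpoint detected: react to the utterance immediately.
                    print("utterance ended:", transcript)

        await asyncio.gather(send_audio(), read_results())
```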
Controlling Endpointing
By default, Deepgram identifies an endpoint after 10 milliseconds (ms) of silence. You can adjust this length of time: to do so, set the endpointing parameter in your request to the Deepgram API to the number of milliseconds of silence you would like to use.
Some examples of use cases for which you might want to adjust endpointing include:
- A virtual agent or chatbot that expects users to speak in short utterances and needs to engage with them as soon as they stop speaking. In this case, you would likely want to choose a low value for endpointing or leave it as the default of 10 ms.
- An agent assist provider that needs to give suggestions to a human customer service agent while they are speaking with a customer. In this case, the agent assist provider is working with a population that tends to pause more frequently mid-thought. To account for short pauses in the middle of a thought, they may want to set endpointing to a high value, such as 500 ms.
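For instance, the two use cases above might differ only in the endpointing value on the request URL (a sketch; other required parameters are omitted):

```
wss://api.deepgram.com/v1/listen?endpointing=10    # chatbot: respond as soon as the user stops
wss://api.deepgram.com/v1/listen?endpointing=500   # agent assist: ride out short mid-thought pauses
```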
Endpointing is not intended to return complete utterances or to be a reliable indicator of a complete thought or sentence. It is intended to hint to low-latency NLP engines that intermediate processing may be useful. For utterance segmentation, you'll need to implement a simple timing heuristic appropriate to your domain. To learn more, see Best Practices for Utterance Segmentation.
Using Interim Results (is_final)
Deepgram's Interim Results feature provides you with preliminary results while Deepgram is working through live streaming audio, correcting and improving its transcription as it goes. When Deepgram eventually reaches a point at which it believes its transcript has reached maximum accuracy, it sends a corrected, finalized transcript for the preliminary results it has produced for that span of time.
When a response contains the finalized transcript for a piece of processed audio, it is marked as is_final: true. The only purpose of the is_final marker is to identify a corrected, finalized transcript for the processed piece of audio; it does not indicate that an endpoint has been found or that an utterance has completed.
The span of time contained within a finalized transcript is determined by the Deepgram algorithm. Because providing preliminary results while working with live streaming audio is an iterative process, each finalized transcript contains only audio processed since a previous finalized transcript (marked as is_final: true) was sent. To see this in action, check out the first example in the Using Endpointing with Interim Results section.
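A sketch of one way client code can consume this (field access matches the response shape shown in the next section; the display strategy is just an illustration): keep finalized text, and treat the latest interim transcript as a replaceable preview.

```python
confirmed = []  # transcripts from responses marked is_final: true
preview = ""    # latest interim transcript for the current audio span

def on_response(response: dict) -> str:
    """Return the best current rendering of the transcript so far."""
    global preview
    transcript = response["channel"]["alternatives"][0]["transcript"]
    if response["is_final"]:
        confirmed.append(transcript)  # this span will not be revised again
        preview = ""
    else:
        preview = transcript          # may still be corrected or extended
    return " ".join(confirmed + ([preview] if preview else []))
```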
Using Endpointing with Interim Results
When using endpointing with interim results, remember:
- When speech_final is true, then is_final will always also be true.
- When is_final is true, then speech_final may or may not be true.
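The relationship is easy to encode as a sanity check (a sketch over parsed response dicts):

```python
def check_flags(response: dict) -> None:
    # speech_final: true implies is_final: true; the reverse does not hold.
    if response.get("speech_final"):
        assert response.get("is_final"), "speech_final implies is_final"
```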
Sometimes speech_final and is_final won't match, as in this example, where a customer is giving their credit card number to a customer service agent:
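(The following is a condensed, hypothetical reconstruction of the response stream: one response per line, numbered top to bottom, with invented digits and timings and most fields omitted.)

```json
{"start": 0.0, "duration": 1.0, "is_final": false, "speech_final": false, "channel": {"alternatives": [{"transcript": "four two"}]}}
{"start": 0.0, "duration": 2.0, "is_final": false, "speech_final": false, "channel": {"alternatives": [{"transcript": "four two four two"}]}}
{"start": 0.0, "duration": 3.0, "is_final": false, "speech_final": false, "channel": {"alternatives": [{"transcript": "four two four two four"}]}}
{"start": 0.0, "duration": 4.0, "is_final": false, "speech_final": false, "channel": {"alternatives": [{"transcript": "four two four two four two"}]}}
{"start": 0.0, "duration": 3.9, "is_final": true, "speech_final": false, "channel": {"alternatives": [{"transcript": "four two four two four two"}]}}
{"start": 3.9, "duration": 1.0, "is_final": false, "speech_final": false, "channel": {"alternatives": [{"transcript": "four"}]}}
{"start": 3.9, "duration": 1.2, "is_final": true, "speech_final": true, "channel": {"alternatives": [{"transcript": "four two"}]}}
{"start": 5.1, "duration": 1.5, "is_final": false, "speech_final": false, "channel": {"alternatives": [{"transcript": "four two four two"}]}}
{"start": 5.1, "duration": 1.6, "is_final": true, "speech_final": true, "channel": {"alternatives": [{"transcript": "four two four two"}]}}
{"start": 6.7, "duration": 1.4, "is_final": false, "speech_final": false, "channel": {"alternatives": [{"transcript": "four two four two"}]}}
{"start": 6.7, "duration": 1.5, "is_final": true, "speech_final": true, "channel": {"alternatives": [{"transcript": "four two four two"}]}}
```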
Note that a credit card number normally consists of four utterances with natural pauses between each utterance. However, if you download the sample audio file, you will hear that this speaker has omitted some of the natural pauses.
In this response, you see that:
- On lines 1 through 4, the transcripts contain "is_final": false, indicating that they are interim results. As more data pours into Deepgram, you see the transcripts get longer.
- On line 5, is_final is set to true, indicating that this is a finalized transcript for this section of audio; Deepgram believes it has reached optimal accuracy for this section of audio and will not return any additional transcripts covering this span of time. (Note that the length of the processed piece of audio is corrected accordingly.) However, speech_final is set to false, indicating that Deepgram has not yet identified an endpoint. The speaker has not yet completed their utterance; Deepgram sees that more digits of the credit card number are coming.
- On line 6, the transcript contains "is_final": false again, indicating that this is an interim result for a new segment of audio that starts where the finalized transcript ended.
- On line 7, is_final is set to true again, indicating that Deepgram will not return any additional transcripts for this section of audio. But now speech_final is also set to true because Deepgram has identified an endpoint: the speaker has completed an utterance, which ends with the first eight digits of the credit card number.
- On lines 8 through 11, the transcripts contain matching is_final and speech_final values because the sizes of the pieces of audio being processed by Deepgram's interim results feature happen to match the placement of the speaker's natural pauses, which Deepgram detects as ending an utterance and indicating an endpoint.
Getting Final Transcripts
When both endpointing and interim results are enabled, do not use speech_final: true to pull a full, final transcript. If a speaker vocalizes a long utterance, Deepgram's algorithm may respond with some finalized transcripts of preliminary results (marked as is_final: true) before the end of the utterance (marked as speech_final: true) is detected.
In this case, to gather a full transcript, you must concatenate all responses marked is_final: true until you reach a response marked speech_final: true.
Let's work from the previous example, where a customer is giving their credit card number to a customer service agent.
On line 5, we see that is_final is set to true, indicating that this is a finalized transcript for the preliminary results in this section of audio. However, speech_final is set to false, indicating that Deepgram has not yet identified an endpoint. The speaker has not yet completed their utterance; Deepgram sees that more digits of the credit card number are coming. Deepgram's algorithm does not detect a sufficiently long pause to indicate an endpoint until line 7.
To gather a full transcript for this utterance, you would need to concatenate all responses marked is_final: true until you reached a response marked speech_final: true.
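A minimal sketch of that concatenation logic in Python, assuming responses arrive in order as parsed dicts:

```python
pieces = []  # finalized transcripts collected for the current utterance

def handle_response(response: dict) -> str | None:
    """Return the full utterance transcript once an endpoint closes it."""
    if not response["is_final"]:
        return None                   # interim result: a final version follows
    pieces.append(response["channel"]["alternatives"][0]["transcript"])
    if response["speech_final"]:
        utterance = " ".join(pieces)  # full transcript for this utterance
        pieces.clear()
        return utterance
    return None                       # finalized, but the utterance continues
```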
Best Practices for Utterance Segmentation
Remember that endpointing is not intended to return complete utterances or to be a reliable indicator of a complete thought or sentence. It is intended to hint to low-latency natural language processing (NLP) engines that intermediate processing may be useful.
For utterance segmentation, you'll need to implement a simple timing heuristic appropriate to your domain. Adding this type of timing code allows you to optimize to your particular domain, based on the kinds of pauses you expect in speech (for example, dictation versus two-way conversation) and background noise (for example, augmenting utterance heuristics with a confidence cut to remove background speakers).
If your downstream NLP processing requires complete utterances and cannot tolerate on-the-fly updates to the transcript, then the safest thing to do is to only consider is_final: true responses and break at punctuation terminators (if you are also passing punctuate: true to Deepgram as a parameter) or after a large enough gap in end-to-start times between adjacent words, whichever comes first. This works well, but can introduce a latency in your processing corresponding to the time you need to wait for is_final: true.
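A sketch of that safer heuristic in Python (the 0.8-second gap is an invented threshold to tune for your domain; the punctuated_word field is present on word objects when punctuate: true is set):

```python
TERMINATORS = (".", "?", "!")
MAX_WORD_GAP = 0.8  # seconds; invented threshold, tune for your domain

def split_into_utterances(response: dict) -> list[str]:
    """Segment an is_final response on punctuation or long inter-word gaps."""
    words = response["channel"]["alternatives"][0]["words"]
    utterances, current = [], []
    for i, word in enumerate(words):
        current.append(word["punctuated_word"])
        next_word = words[i + 1] if i + 1 < len(words) else None
        ends_sentence = word["punctuated_word"].endswith(TERMINATORS)
        long_gap = (next_word is not None
                    and next_word["start"] - word["end"] > MAX_WORD_GAP)
        if ends_sentence or long_gap or next_word is None:
            utterances.append(" ".join(current))
            current = []
    return utterances
```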
If your downstream processing requires complete utterances but you need lower latency, then you can consider all responses, waiting until a punctuation terminator (if you are also passing punctuate: true to Deepgram as a parameter), a gap in timings between adjacent words, or until the time from the end of the last word to the end of the response window exceeds a tolerance, whichever comes first. This reduces latency, but can produce false positives if an updated (interim) transcript that conflicts with this heuristic is returned.
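And a sketch of that lower-latency variant (the 0.5-second tolerance is invented; note that any cut made from an interim result is tentative and may be contradicted by a later response):

```python
TRAILING_SILENCE = 0.5  # seconds from the last word to the end of the window

def maybe_cut_utterance(response: dict) -> str | None:
    """Tentatively cut an utterance from any response, including interim ones."""
    alternative = response["channel"]["alternatives"][0]
    words = alternative["words"]
    if not words:
        return None
    window_end = response["start"] + response["duration"]
    last = words[-1]
    if (last["punctuated_word"].endswith((".", "?", "!"))
            or window_end - last["end"] > TRAILING_SILENCE):
        return alternative["transcript"]  # may be revised by a later interim result
    return None
```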