Understanding Endpointing and Interim Results When Transcribing Live Streaming Audio

When using Deepgram to transcribe live streaming audio, two of the features you can use include endpointing and interim results. Both of these features monitor incoming live streaming audio and can indicate the end of a type of processing, but they are used in very different ways:

EndpointingInterim Results
Identifies deviations in the natural flow of speech and returns a finalized response at these placesProvides preliminary results during live streaming, then sends a definitive transcript of the portion of audio that has been processed so far
Turned on by defaultTurned off by default
Returns speech_final: true parameter when Deepgram identifies a deviation in the natural flow of speech and the response contains the finalized transcriptReturns is_final: true parameter when Deepgram has finished processing audio for a time range and the response contains the finalized transcript for that time range

Using Endpointing (speech_final)

Just like a human listener who evaluates pauses within a conversation to detect when it is appropriate to respond to another speaker, Deepgram’s Endpointing feature looks for any deviation in the natural flow of speech and returns a finalized response. These responses can then be used as input for a downstream natural language processor (NLP) to indicate that a partial utterance may be ready for processing. Because speakers may pause for varying amounts of time, the system that processes Deepgram responses should be prepared to handle incomplete thoughts as well as a speaker suddenly resuming a thought.

When Deepgram identifies that an endpoint has been reached, the response is marked as speech_final: true. The only purpose of the speech_final marker is to tell you where an endpoint has been found, which indicates that a significantly long pause has been detected.

As an example, consider a speaker responding “yes” on a live call. Although Deepgram’s default algorithm automatically sends final results every few seconds, you might want your system to react to the “yes” immediately. In this case, Endpointing would be useful because it would prompt Deepgram to detect the pause after “yes” and quickly send back a finalized result (marked with speech_final: true).

Endpointing is not intended to return complete utterances or to be a reliable indicator of a complete thought or sentence. It is intended to hint to low-latency NLP engines that intermediate processing may be useful. For utterance segmentation, you’ll need to implement a simple timing heuristic appropriate to your domain. To learn more, see Best Practices for Utterance Segmentation.

Using Interim Results (is_final)

Deepgram’s Interim Results feature provides you with preliminary results while Deepgram is working through live streaming audio and correcting and improving its transcription. When Deepgram eventually reaches a point at which it believes its transcript has reached maximum accuracy, it will send a corrected, finalized transcript for the preliminary results it has received for a span of time.

When a response contains the finalized transcript for a piece of processed audio, it is marked as is_final: true. The only purpose of the is_final marker is to identify a corrected, finalized transcript for the processed piece of audio; it does not indicate that an endpoint has been found or that an utterance has completed.

The span of time contained within a finalized transcript is determined by the Deepgram algorithm. Because providing preliminary results while working with live streaming audio is an iterative process, each finalized transcript contains only audio processed since a previous finalized transcript (marked as is_final: true) was sent. To see this in action, check out the first example in the Using Endpointing with Interim Results section.

Using Endpointing with Interim Results

When using endpointing with interim results, remember:

  • When speech_final is true, then is_final will always also be true.
  • When is_final is true, then speech_final may or may not be true.

Sometimes speech_final and is_final won’t match, as in this example, where a customer is giving their credit card number to a customer service agent:

1 0.000-1.100 ["is_final": false] ["speech_final": false] yeah so
2 0.000-2.200 ["is_final": false] ["speech_final": false] yeah so my credit card number
3 0.000-3.200 ["is_final": false] ["speech_final": false] yeah so my credit card number is two two
4 0.000-4.300 ["is_final": false] ["speech_final": false] yeah so my credit card number is two two two two three
5 0.000-3.260 ["is_final": true ] ["speech_final": false] yeah so my credit card number is two two
6 3.260-5.100 ["is_final": false] ["speech_final": false] two two three three three three
7 3.260-5.500 ["is_final": true ] ["speech_final": true ] two two three three three three
8 5.500-6.600 ["is_final": false] ["speech_final": false] four four or four four
9 5.500-6.860 ["is_final": true ] ["speech_final": true ] four four four four
10 6.860-7.900 ["is_final": false] ["speech_final": false] five five five five
11 6.860-8.090 ["is_final": true ] ["speech_final": true ] five five five five

Note that a credit card number normally consists of four utterances with natural pauses between each utterance. However, if you download the sample audio file, you will hear that this speaker has omitted some of the natural pauses.

In this response, you see that:

  • On lines 1 through 4, the transcripts contain "is_final": false, indicating that they are interim results. As more data pours into Deepgram, you see the transcripts get longer.
  • On line 5, is_final is set to true, indicating that this is a finalized transcript for this section of audio; Deepgram believes it has reached optimal accuracy for this section of audio and will not return any additional transcripts covering this span of time. (Note that the length of the processed piece of audio is corrected accordingly.) However, speech_final is set to false, indicating that Deepgram has not yet identified an endpoint. The speaker has not yet completed their utterance; Deepgram sees that more credit card numbers are coming.
  • On line 6, the transcript contains "is_final": false again, indicating that this is an interim result for a new segment of audio that starts where the finalized transcript ended.
  • On line 7, is_final is set to true again, indicating that Deepgram will not return any additional transcripts for this section of audio. But now speech_final is also set to true because Deepgram has identified an endpoint—the speaker has completed an utterance, which ends with the first eight digits of the credit card number.
  • On line 8 through 11, the transcripts contain matching is_final and speech_final values because the size of the pieces of audio being processed by Deepgram’s interim results feature happen to match the placement of the speaker’s natural pauses, which Deepgram detects as ending an utterance and indicating an endpoint.

Getting Final Transcripts

When both endpointing and interim results are enabled, do not use speech_final: true to pull a full, final transcript. If a speaker vocalizes a long utterance, Deepgram’s algorithm may respond with some finalized transcripts of preliminary results (marked as is_final: true) before the end of the utterance (marked as speech_final: true) is detected.

In this case, to gather a full transcript, you must concatenate all responses marked is_final: true until you reach a response marked speech_final: true.

Let’s work off of the previous example where a customer is giving their credit card number to a customer service agent:

1 0.000-1.100 ["is_final": false] ["speech_final": false] yeah so
2 0.000-2.200 ["is_final": false] ["speech_final": false] yeah so my credit card number
3 0.000-3.200 ["is_final": false] ["speech_final": false] yeah so my credit card number is two two
4 0.000-4.300 ["is_final": false] ["speech_final": false] yeah so my credit card number is two two two two three
5 0.000-3.260 ["is_final": true ] ["speech_final": false] yeah so my credit card number is two two
6 3.260-5.100 ["is_final": false] ["speech_final": false] two two three three three three
7 3.260-5.500 ["is_final": true ] ["speech_final": true ] two two three three three three

On line 5, we see that is_final is set to true, indicating that this is a finalized transcript for the preliminary results in this section of audio. However, speech_final is set to false, indicating that Deepgram has not yet identified an endpoint. The speaker has not yet completed their utterance; Deepgram sees that more digits of the credit card number are coming. Deepgram’s algorithm does not detect a sufficiently long pause to indicate an endpoint until line 7.

To gather a full transcript for this utterance, you would need to concatenate all responses marked is_final: true until you reached a response marked speech_final: true.

Best Practices for Utterance Segmentation

Remember that endpointing is not intended to return complete utterances or to be a reliable indicator of a complete thought or sentence. It is intended to hint to low-latency natural language processing (NLP) engines that intermediate processing may be useful.

For utterance segmentation, you’ll need to implement a simple timing heuristic appropriate to your domain. Adding this type of timing code allows you to optimize to your particular domain, based on the kinds of pauses you expect in speech (for example, dictation versus two-way conversation) and background noise (for example, augmenting utterance heuristics with a confidence cut to remove background speakers).

If your downstream NLP processing requires complete utterances and cannot tolerate on-the-fly updates to the transcript, then the safest thing to do is to only consider is_final: true responses and break at punctuation terminators (if you are also passing punctuate: true to Deepgram as a parameter) or after a large enough gap in end-to-start times between adjacent words, whichever comes first. This works well, but can introduce a latency in your processing corresponding to the time you need to wait for is_final: true.

If your downstream processing requires complete utterances but you need lower latency, then you can consider all responses, waiting until a punctuation terminator (if you are also passing punctuate: true to Deepgram as a parameter), a gap in timings between adjacent words, or until the time between the end of the last word to the end of the utterance exceeds a tolerance, whichever comes first. This reduces latency, but can produce false positives if an updated (interim) transcript that conflicts with this heuristic is returned.

Share your feedback

Thank you! Can you tell us what you liked about it? (Optional)

Thank you. What could we have done better? (Optional)

We may also want to contact you with updates or questions related to your feedback and our product. If don't mind, you can optionally leave your email address along with your comments.

Thank you!

We appreciate your response.