Using Interim Results | Deepgram's Docs

Deepgram’s Interim Results monitors streaming audio and provides interim transcripts, which are preliminary results provided during the real-time streaming process which can help with speech detection.

Below you will learn more about how to use interim results.

for information refer to the Interim Results feature page.

Running The Example

Download our final Python example script and run the example code:

SHELL

1 python3 show-final.py -k 'YOUR_DEEPGRAM_API_KEY' /PATH/TO/AUDIO.wav

After execution, the script prints out the transcript for each response it receives and shows the is_final status for each message:

JSON

1 Channels = 2, Sample Rate = 48000 Hz, Sample width = 2 bytes, Size = 18540124 bytes
2   1 0.000-1.100 ["is_final": false] another big
3   2 0.000-2.100 ["is_final": false] another big problem
4   3 0.000-3.100 ["is_final": false] another big problem in the speech analyst
5   4 0.000-4.100 ["is_final": false] another big problem in the speech analytics space
6   5 0.000-5.100 ["is_final": false] another big problem in the speech analytics space when custom
7   6 0.000-6.100 ["is_final": false] another big problem in the speech analytics space when customers first bring the
8   7 0.000-7.100 ["is_final": false] another big problem in the speech analytics space when customers first bring the software were on
9   8 0.000-8.100 ["is_final": false] another big problem in the speech analytics space when customers first bring the software were on is that they
10   9 0.000-9.100 ["is_final": false] another big problem in the speech analytics space when customers first bring the software on is that they they
11  10 0.000-8.490 ["is_final": true ] another big problem in the speech analytics space when customers first bring the software were on is that they
12  11 8.490-10.100 ["is_final": false] they are
13  12 8.490-11.100 ["is_final": false] they are blown away by the
14  ...

In this response, we see that:

On lines 1 through 9, the transcripts contain "is_final": false, indicating that they are interim transcripts. As more data passes to Deepgram, you see the transcripts is getting longer.
Between lines 3 and 4, Deepgram corrects its prediction of the word “analyst,” turning it into “analytics”. This is an example of interim results in action.
Between lines 5 and 6, Deepgram corrects its prediction of the word “custom”, turning it into “customer”. Another example of interim results in action.
On line 10, is_final is set to true, indicating that Deepgram will not return any additional transcripts covering that span of time (from 0.000 to 8.490 seconds) because it believes it has reached optimal accuracy for this section of the transcript.
On line 9, the transcript covers a span of time from 0.000 to 9.100 seconds, which is longer than the completed transcript issued on line 10. If you listen to this moment in the example audio, you will hear the speaker repeat the word “they”. After processing the repeated word, Deepgram decided it had reached optimal accuracy for the first section of the transcript, and split the transcript between the repeated words. Notice one “they” stayed with the first section (line 10), but the other “they” moved into the next section (line 11), which starts at 8.490 seconds.

Tips for working with transcripts

When handling real-time streaming results, the most accurate transcripts are available in the final transcripts, but the final transcripts may split the message.

If you need the best transcript possible and can tolerate some delay, rely on final transcripts; they are most accurate and aren’t likely to change.
If you need the fastest transcript possible, ignore final transcripts; instead, track timings and confidences to determine whether to keep waiting before committing to the current interim transcript. This usually works well because most content does not change between consecutive interim transcripts.

Identify Completed Audio Processing

To identify whether the audio stream is completely processed, send an empty binary WebSocket message to the Deepgram server and then continue to process server responses until the server gracefully closes the connection.

Frequently Asked Questions

How do I measure latency with interim results?

In general terms, real-time streaming latency is the time delay between when a transfer of data begins and when a system begins processing it. In mathematical terms, it is the difference between the audio cursor (the number of seconds of audio you have currently submitted; we’ll call this X) and the latest transcript cursor (start + duration; we’ll call this Y). Latency is X-Y.

However, remember that to give you best accuracy, final transcripts may end early (see lines 9 and 10 in the example above), which means you’ve already received more data than what is reflected in the final transcript.

The final transcripts are meant for situations where you need the highest confidence levels, whereas the latest interim transcript has the lowest latency. It’s recommended to always ignore final transcripts when calculating latency.

To learn more, see Measuring Streaming Latency.

How do I measure word error rates (WER) with interim results?

To calculate WER, concatenate all final transcripts and compare to your base transcript. Because final transcripts are the most accurate, they should be preferred over interim transcripts, which prioritize speed over accuracy. And because a single final transcript does not guarantee that the audio stream is complete, you will need to be certain you have collected all final transcripts before performing your calculation.

Let’s look at an example. Download our WER Python example script, prepare an audio file (or use our sample WAV file), and run the example code:

SHELL

1 python3 concat-final.py -k 'YOUR_DEEPGRAM_API_KEY' /PATH/TO/audio.wav

When run, the script concatenates the final transcripts returned by Deepgram and prints the result:

JSON

1 Channels = 2, Sample Rate = 48000 Hz, Sample width = 2 bytes, Size = 18540124 bytes
2 another big problem in the speech analytics space when customers first bring the software where is that they they are blown away...

You can compare this result with your base transcript to calculate WER.