Deepgram’s Interim Results feature monitors streaming audio and returns interim transcripts: preliminary results delivered during real-time streaming that can help with speech detection.
Below you will learn more about how to use interim results.
For more information, refer to the Interim Results feature page.
Download our final Python example script and run the example code:
After execution, the script prints out the transcript for each response it receives and shows the is_final status for each message:
In this response, we see that:
"is_final": false, indicating that they are interim transcripts. As more data passes to Deepgram, you see the transcripts is getting longer.is_final is set to true, indicating that Deepgram will not return any additional transcripts covering that span of time (from 0.000 to 8.490 seconds) because it believes it has reached optimal accuracy for this section of the transcript.0.000 to 9.100 seconds, which is longer than the completed transcript issued on line 10. If you listen to this moment in the example audio, you will hear the speaker repeat the word “they”. After processing the repeated word, Deepgram decided it had reached optimal accuracy for the first section of the transcript, and split the transcript between the repeated words. Notice one “they” stayed with the first section (line 10), but the other “they” moved into the next section (line 11), which starts at 8.490 seconds.When handling real-time streaming results, the most accurate transcripts are available in the final transcripts, but the final transcripts may split the message.
If you need the best transcript possible and can tolerate some delay, rely on final transcripts; they are most accurate and aren’t likely to change.
If you need the fastest transcript possible, ignore final transcripts; instead, track timings and confidences to determine whether to keep waiting before committing to the current interim transcript. This usually works well because most content does not change between consecutive interim transcripts.
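As a rough illustration of how these two needs differ in code, the sketch below keeps finalized segments separate from the newest interim transcript. It assumes each message is a JSON object with a top-level is_final flag and the transcript text under channel.alternatives[0].transcript; the class and method names are hypothetical, and the live view shown here is a simple display strategy rather than the timing and confidence tracking described above.

```python
import json


def transcript_of(message: dict) -> str:
    # Assumed response layout: channel -> alternatives[0] -> transcript.
    return message["channel"]["alternatives"][0]["transcript"]


class TranscriptAssembler:
    """Hypothetical helper that separates final and interim transcripts."""

    def __init__(self) -> None:
        self.finals = []   # segments that will no longer change
        self.interim = ""  # newest interim transcript for the open span

    def feed(self, raw: str) -> None:
        message = json.loads(raw)
        text = transcript_of(message)
        if message["is_final"]:
            # A final transcript closes its span, so the pending interim
            # text for that span is discarded.
            if text:
                self.finals.append(text)
            self.interim = ""
        else:
            # Each interim transcript replaces the previous one.
            self.interim = text

    def accurate(self) -> str:
        """Final transcripts only: most accurate, but delayed."""
        return " ".join(self.finals)

    def live(self) -> str:
        """Final transcripts plus the newest interim: faster, may change."""
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

A caller would pass every text message received from the stream to feed() and read accurate() or live() depending on how much delay it can tolerate.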
To identify whether the audio stream is completely processed, send an empty binary WebSocket message to the Deepgram server and then continue to process server responses until the server gracefully closes the connection.
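The end-of-stream step could be sketched as follows. The function name is hypothetical, and ws stands for any connection object that offers an awaitable send() and asynchronous iteration that stops once the server closes the connection (connections from the third-party websockets package behave this way, but that choice is an assumption, not something this guide prescribes).

```python
import json


async def finish_stream(ws) -> list:
    """Signal end of audio, then collect every remaining server response.

    `ws` is assumed to expose `await ws.send(...)` and `async for raw in ws`,
    where the iteration ends when the server closes the connection.
    """
    # An empty binary message tells the server that no more audio will follow.
    await ws.send(b"")

    remaining = []
    # Keep processing responses until the server closes the connection.
    async for raw in ws:
        remaining.append(json.loads(raw))
    return remaining
```

Only after this function returns is it safe to treat the collected transcripts as complete.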
In general terms, real-time streaming latency is the time delay between when a transfer of data begins and when a system begins processing it. In mathematical terms, it is the difference between the audio cursor (the number of seconds of audio you have currently submitted; we’ll call this X) and the latest transcript cursor (start + duration; we’ll call this Y). Latency is X-Y.
However, remember that to give you the best accuracy, final transcripts may end early (see lines 9 and 10 in the example above), which means you’ve already received more data than what is reflected in the final transcript.
The final transcripts are meant for situations where you need the highest confidence levels, whereas the latest interim transcript has the lowest latency. It’s recommended to always ignore final transcripts when calculating latency.
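The X - Y calculation can be sketched as a small tracker. It assumes each message carries start and duration in seconds alongside is_final; the class name and the 10-second audio cursor in the usage note are hypothetical.

```python
class LatencyTracker:
    """Hypothetical tracker for streaming latency (X - Y)."""

    def __init__(self) -> None:
        self.audio_cursor = 0.0       # X: seconds of audio submitted so far
        self.transcript_cursor = 0.0  # Y: latest start + duration seen

    def audio_sent(self, seconds: float) -> None:
        self.audio_cursor += seconds

    def on_message(self, message: dict) -> None:
        # Final transcripts may end early, so they are skipped here.
        if message.get("is_final"):
            return
        cursor = message["start"] + message["duration"]
        self.transcript_cursor = max(self.transcript_cursor, cursor)

    def latency(self) -> float:
        return self.audio_cursor - self.transcript_cursor
```

For instance, if 10 seconds of audio have been submitted and the newest interim transcript covers 0.000 to 9.100 seconds, latency() returns roughly 0.9 seconds. A final transcript ending at 8.490 seconds that arrives afterwards does not change that value.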
To learn more, see Measuring Streaming Latency.
To calculate WER, concatenate all final transcripts and compare to your base transcript. Because final transcripts are the most accurate, they should be preferred over interim transcripts, which prioritize speed over accuracy. And because a single final transcript does not guarantee that the audio stream is complete, you will need to be certain you have collected all final transcripts before performing your calculation.
Let’s look at an example. Download our WER Python example script, prepare an audio file (or use our sample WAV file), and run the example code:
When run, the script concatenates the final transcripts returned by Deepgram and prints the result:
You can compare this result with your base transcript to calculate WER.
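For the comparison step, a minimal sketch follows. It is not the example script mentioned above: it assumes you already hold the final transcripts as a list of strings, and it computes WER as the word-level edit distance divided by the number of words in the base transcript. The function names are hypothetical, and normalization is limited to lowercasing and whitespace splitting, which is usually too simple for real evaluations (punctuation and number formatting would also need handling).

```python
def join_finals(finals: list) -> str:
    """Concatenate final transcripts into one hypothesis string."""
    return " ".join(segment.strip() for segment in finals if segment.strip())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Calling wer(base_transcript, join_finals(finals)) after the stream has fully closed gives a single score; a lower value means the final transcripts are closer to the base transcript.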