To balance the need for immediate results with the need for high accuracy, Deepgram provides interim transcripts: preliminary results delivered during the real-time streaming process. Periodically during this process, Deepgram identifies a point at which its transcript has reached maximum accuracy and sends a definitive, or final, transcript of all audio up to that point. It then continues to process audio.
When working with real-time audio, streams flow from your capture source (e.g., microphone, browser, telephony system) to Deepgram's servers in irregular pieces. The collected audio may end abruptly, perhaps even mid-word, which means that Deepgram's predictions, particularly for words near the tip of the audio stream, are more likely to be wrong. Because you still need the latest transcript even though predictions at this point are less accurate, Deepgram guesses at the words being spoken and sends these guesses to you as interim transcripts. As more audio enters the server, Deepgram corrects and improves its transcriptions, increasing its accuracy, until it reaches the end of the stream, at which point it sends one last, cumulative transcript.
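In code, this means a client should expect many messages per stream and treat each one according to its `is_final` flag. Here is a minimal, illustrative sketch (the handler name is hypothetical; the field names match the responses analyzed below):

```python
import json

def handle_message(message: str) -> None:
    """Route one streaming response by its is_final flag (illustrative)."""
    response = json.loads(message)
    transcript = response["channel"]["alternatives"][0]["transcript"]
    if response["is_final"]:
        print(f"final  : {transcript}")  # this span will not be revised again
    else:
        print(f"interim: {transcript}")  # a guess that later messages may correct
```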
Let’s look at some interim transcripts and analyze their content.
To recreate the transcripts we use in this example, download our simple Python example script, prepare an audio file (or use our sample WAV file), and run the example code:
```shell
$ python simple-example.py -u 'USER:PASS' /PATH/TO/AUDIO.wav
```
When run, the script sends the audio to Deepgram's real-time streaming endpoint and prints the output to the screen:
```json
Channels = 2, Sample Rate = 48000 Hz, Sample width = 2 bytes, Size = 18540124 bytes
{"channel_index":[0,1],"duration":1.099875,"start":0.0,"is_final":false,"channel":{"alternatives":[{"transcript":"another big","confidence":0.99640507,"words":[{"word":"another","start":0.29196426,"end":0.7919643,"confidence":0.99640507},{"word":"big","start":0.7980356,"end":1.099875,"confidence":0.99221236}]}]}}
{"channel_index":[0,1],"duration":2.099875,"start":0.0,"is_final":false,"channel":{"alternatives":[{"transcript":"another big problem","confidence":0.99709547,"words":[{"word":"another","start":0.29575473,"end":0.79575473,"confidence":0.99858046},{"word":"big","start":0.8083963,"end":1.3083963,"confidence":0.99709547},{"word":"problem","start":1.5576415,"end":2.0576415,"confidence":0.99631387}]}]}}
...
```
Our first interim result has the following content:
json{ "channel_index": [ 0, 1 ], "duration": 1.279875, "start": 0, "is_final": false, "channel": { "alternatives": [ { "transcript": "hello", "confidence": 0.9662911, "words": [ { "word": "hello", "start": 1.0914062, "end": 1.279875, "confidence": 0.9662911 } ] } ] } }
In this response, we see that:
- `start` (the number of seconds into the audio stream) is `0`, indicating that this is the very beginning of the real-time stream.
- `start` + `duration` (the entire length of this response) is `1.279875` seconds, and the word "hello" ends at `1.279875` seconds (which matches the `duration` value), indicating that the stream cuts off the word.
- `confidence` for the word "hello" is approximately 97%, indicating that even though the word is cut off, Deepgram is still pretty certain that its prediction is correct.
- `is_final` is `false`, indicating that Deepgram will continue waiting to see if more data will improve its predictions.

The next interim response has the following content:
json{ "channel_index": [ 0, 1 ], "duration": 1.535875, "start": 0, "is_final": false, "channel": { "alternatives": [ { "transcript": "hello this", "confidence": 0.99687374, "words": [ { "word": "hello", "start": 1.0788461, "end": 1.3926922, "confidence": 0.99687374 }, { "word": "this", "start": 1.3926922, "end": 1.535875, "confidence": 0.9928282 } ] } ] } }
In this response, we see that:
- `start` (the number of seconds into the audio stream) is `0`, indicating that this is the very beginning of the real-time stream.
- `start` + `duration` (the entire length of this response) is `1.535875` seconds, and the word "this" ends at `1.535875` seconds (which matches the `duration` value), indicating that the stream cuts off the word.
- `confidence` for the word "hello" has improved to almost 100%.
- The `end` timestamp for "hello" now indicates that the word has not been cut off.
- `confidence` for the word "this" is approximately 99%, so it can likely be trusted.
- `is_final` is `false`, indicating that Deepgram will continue waiting to see if more data will improve its predictions.

When Deepgram is satisfied that it has produced the best possible transcript of all audio up to a certain point, it sends a final transcript in which the `is_final` value is set to `true`.
"is_final": true
does not mean that the entire audio stream is done being processed, nor does it mean that the speaker has finished speaking, so avoid using it for endpointing or utterance estimation. It only means that the processed time range (from start
to start
+ duration
) will not be returned in future messages.
If you want to review a transcript, consider confidences or timings, or make any downstream changes, do it when `"is_final": true`; the next message will continue from where the current message ended.
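One way to apply this rule is to keep confirmed text separate from the in-flight guess. The sketch below is illustrative (the surrounding receive loop and variable names are assumptions): it accumulates final transcripts and holds only the latest interim one.

```python
import json

final_transcripts: list[str] = []  # spans Deepgram will never revise
latest_interim = ""                # best guess for the unconfirmed tail

def on_message(message: str) -> None:
    global latest_interim
    response = json.loads(message)
    transcript = response["channel"]["alternatives"][0]["transcript"]
    if response["is_final"]:
        # Safe point to review confidences/timings or trigger downstream work:
        # this span (start to start + duration) is done.
        final_transcripts.append(transcript)
        latest_interim = ""
    else:
        latest_interim = transcript
    # Confirmed text plus the in-flight guess gives the freshest view.
    print(" ".join(final_transcripts + [latest_interim]).strip())
```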
Let’s look at an example. Download our final Python example script, prepare an audio file (or use our sample WAV file), and run the example code:
```shell
$ python show-final.py -u 'USER:PASS' /PATH/TO/AUDIO.wav
```
When run, the script prints out the transcript for each response it receives and shows the `is_final` status for each message:
```text
Channels = 2, Sample Rate = 48000 Hz, Sample width = 2 bytes, Size = 18540124 bytes
1   0.000-1.100  ["is_final": false] another big
2   0.000-2.100  ["is_final": false] another big problem
3   0.000-3.100  ["is_final": false] another big problem in the speech analyst
4   0.000-4.100  ["is_final": false] another big problem in the speech analytics space
5   0.000-5.100  ["is_final": false] another big problem in the speech analytics space when custom
6   0.000-6.100  ["is_final": false] another big problem in the speech analytics space when customers first bring this
7   0.000-7.100  ["is_final": false] another big problem in the speech analytics space when customers first bring the software where is
8   0.000-8.100  ["is_final": false] another big problem in the speech analytics space when customers first bring the software on is that they
9   0.000-9.100  ["is_final": false] another big problem in the speech analytics space when customers first bring the software is that they they
10  0.000-8.485  ["is_final": true ] another big problem in the speech analytics space when customers first bring the software where is that they
11  8.485-10.100 ["is_final": false] they are
12  8.485-11.100 ["is_final": false] they are blown away by the fact
...
```
In this output, we see that:
"is_final": false
, indicating that they are interim transcripts. As more data pours into Deepgram, you see the transcripts get longer.is_final
is set to true
, indicating that Deepgram will not return any additional transcripts covering that span of time (from 0.000
to 8.485
seconds) because it believes it has reached optimal accuracy for this section of the transcript.0.000
to 9.100
seconds, which is longer than the completed transcript issued on line 10. If you listen to this moment in the example audio, you will hear the speaker repeat the word "they". After processing the repeated word, Deepgram decided it had reached optimal accuracy for the first section of the transcript, splitting the transcript between the repeated words. Notice the one "they" stayed with the first section (line 10), but the other "they" moved into the next section (line 11), which starts at 8.485
seconds.When handling real-time streaming results, the most accurate transcripts are available in the final transcripts, but the final transcripts may split the message. Some tips:
To identify whether the audio stream is completely processed, send an empty binary WebSocket message to the Deepgram server and then continue to process server responses until the server gracefully closes the connection.
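Here is a sketch of that shutdown sequence, assuming the `websockets` package and a `wss_url` that already contains your credentials and query parameters (both assumptions; any WebSocket client works):

```python
import asyncio
import json
import websockets  # pip install websockets (one possible client choice)

async def stream_file(wss_url: str, audio_path: str) -> None:
    async with websockets.connect(wss_url) as ws:

        async def send_audio() -> None:
            with open(audio_path, "rb") as audio:
                while chunk := audio.read(8000):
                    await ws.send(chunk)
                    await asyncio.sleep(0.05)  # pace roughly like a live source
            # An empty binary message tells the server the audio is finished.
            await ws.send(b"")

        async def receive_transcripts() -> None:
            # Iteration ends when the server gracefully closes the connection,
            # which signals that every final transcript has been delivered.
            async for message in ws:
                response = json.loads(message)
                print(response["is_final"],
                      response["channel"]["alternatives"][0]["transcript"])

        await asyncio.gather(send_audio(), receive_transcripts())

# asyncio.run(stream_file("wss://...", "/PATH/TO/AUDIO.wav"))
```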
In general terms, real-time streaming latency is the time delay between when a transfer of data begins and when a system begins processing it. In mathematical terms, it is the difference between the audio cursor (the number of seconds of audio you have currently submitted; we'll call this X) and the latest transcript cursor (`start` + `duration`; we'll call this Y). Latency is X - Y.
However, remember that to give you the best accuracy, final transcripts may end early (see lines 9 and 10 in the example above), which means you have already received transcripts covering more audio than the final transcript reflects. Final transcripts are meant for conservative pipelines that require the highest confidence levels, whereas the latest interim transcript has the lowest latency; always ignore final transcripts when calculating latency.
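As an illustrative sketch of that calculation (the byte-rate constant matches the sample file shown above; variable and function names are assumptions), latency can be derived from the audio you have sent and the most recent interim message:

```python
import json

# 48000 Hz * 2-byte samples * 2 channels, per the example file's header above
BYTES_PER_SECOND = 48000 * 2 * 2

def current_latency(message: str, bytes_sent: int) -> float | None:
    """Estimate latency from the latest interim transcript; skip finals."""
    response = json.loads(message)
    if response["is_final"]:
        return None  # finals may end early, so they would overstate latency
    audio_cursor = bytes_sent / BYTES_PER_SECOND                   # X
    transcript_cursor = response["start"] + response["duration"]   # Y
    return audio_cursor - transcript_cursor                        # X - Y
```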
To learn more, see Measure Latency.
To calculate word error rate (WER), concatenate all final transcripts and compare the result to your base transcript. Because final transcripts are the most accurate, prefer them over interim transcripts, which prioritize speed over accuracy. And because a single final transcript does not guarantee that the audio stream is complete, be certain you have collected all final transcripts before performing your calculation.
Let’s look at an example. Download our WER Python example script, prepare an audio file (or use our sample WAV file), and run the example code:
```shell
$ python concat-final.py -u 'USER:PASS' /PATH/TO/AUDIO.wav
```
When run, the script concatenates the final transcripts returned by Deepgram and prints the result:
```text
Channels = 2, Sample Rate = 48000 Hz, Sample width = 2 bytes, Size = 18540124 bytes
another big problem in the speech analytics space when customers first bring the software where is that they they are blown away...
```
You can compare this result with your base transcript to calculate WER.
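If you do not already have a WER tool, a minimal sketch is a word-level edit distance between the two strings (packages such as `jiwer` offer a ready-made alternative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over word sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# e.g., word_error_rate(base_transcript, concatenated_final_transcripts)
```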
To avoid receiving interim results, send `interim_results=false` as a query parameter when you call Deepgram's real-time streaming endpoint.
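For example, the parameter can be appended to the streaming URL (base URL shown for illustration; check the API reference for your endpoint):

```python
# interim_results=false suppresses interim messages; only finals are returned.
url = "wss://api.deepgram.com/v1/listen?interim_results=false"
```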