Understanding Word Confidence Scores
Use word-level confidence scores to detect transcription errors and assess transcript quality.
Every word in Deepgram’s transcription response includes a confidence value: a floating-point number between 0 and 1 representing the model’s estimated probability that the word was transcribed correctly. This per-word score appears in the words array within each alternatives object and is distinct from the transcript-level confidence field, which represents overall transcript reliability.
Deepgram’s word confidence is a calibrated probability. A confidence of 0.93 means the model estimates a 93% chance the word is correct.
“Calibrated” means the scores are statistically honest: if you took all words the model scored at 0.93, approximately 93% of them would actually be correct. The model’s probability outputs are naturally well-calibrated.
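One way to check this claim on your own data is a reliability table: bucket words by confidence and compare each bucket's mean confidence with its measured accuracy. The sketch below assumes you already have (confidence, is_correct) pairs, for example from aligning transcripts against human references; the function name and the bin count are illustrative choices, not part of any Deepgram API.

```python
from collections import defaultdict


def calibration_table(scored_words, n_bins=10):
    """Bucket (confidence, is_correct) pairs into equal-width bins.

    Returns rows of (bin_lower, bin_upper, mean_confidence, accuracy, count).
    For well-calibrated scores, mean_confidence and accuracy should be close
    in every bin that holds enough words to be meaningful.
    """
    bins = defaultdict(list)
    for confidence, is_correct in scored_words:
        # Clamp a confidence of exactly 1.0 into the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    rows = []
    for index in sorted(bins):
        members = bins[index]
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        rows.append((index / n_bins, (index + 1) / n_bins,
                     mean_conf, accuracy, len(members)))
    return rows
```

Bins with only a handful of words give noisy accuracy estimates, so read the table with the count column in mind.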
Confidence score definitions and calibration vary across STT providers. You cannot directly compare raw confidence distributions between providers. A provider whose scores are spread evenly across 0–1 is not necessarily providing “more informative” scores — they may simply be poorly calibrated, meaning their stated confidence does not match actual accuracy.
On typical audio, most words will have confidence scores above 0.90. This is expected and correct behavior — it reflects the model accurately predicting that it got most words right.
For example, if Deepgram achieves 95% word accuracy on your audio, you should expect the average confidence to be around 0.95, with most words clustering near 1.0. The roughly 5% of words the model is less sure about will have lower scores.
A flat or uniform distribution of confidence scores across 0–1 would actually indicate poor calibration — it would mean the model is equally uncertain about every word, which does not reflect reality for a high-accuracy model.
The concentration of scores near 1.0 does not mean the scores lack discriminatory power. Words with lower scores are meaningfully more likely to be errors. The signal is in the tail of the distribution, not the center.
The simplest approach: choose a confidence cutoff and flag all words below it as potential errors.
With Deepgram’s well-calibrated scores, a threshold around 0.65 works well as an error detector — words below this threshold are very likely to be genuine errors (high precision). The tradeoff: a low threshold catches only the most obvious errors. Raising it catches more errors but also flags some correct words.
Evaluate precision and recall at multiple thresholds on a sample of your own data to find the right balance for your use case.
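A minimal sketch of both steps, fixed-threshold flagging and a precision/recall check, might look like the following. It assumes word entries are dictionaries with a confidence key, as in the words array, and that your labeled sample is a list of (confidence, is_error) pairs; the function names are hypothetical.

```python
def flag_words(words, threshold=0.65):
    """Return the word entries whose confidence is below the threshold."""
    return [w for w in words if w["confidence"] < threshold]


def precision_recall(labeled_words, threshold):
    """Score 'confidence < threshold' as an error detector.

    labeled_words: iterable of (confidence, is_error) pairs from a sample
    you have checked by hand. Returns (precision, recall); either value is
    None when it is undefined (nothing flagged, or no errors in the sample).
    """
    labeled_words = list(labeled_words)
    flagged = [(c, e) for c, e in labeled_words if c < threshold]
    true_positives = sum(1 for _, e in flagged if e)
    total_errors = sum(1 for _, e in labeled_words if e)
    precision = true_positives / len(flagged) if flagged else None
    recall = true_positives / total_errors if total_errors else None
    return precision, recall
```

Running precision_recall over a range of thresholds (say 0.5 to 0.95) on the same sample shows where raising the cutoff starts costing more precision than it gains in recall.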
Adapt the threshold automatically based on audio difficulty. Because the scores are calibrated, the expected number of errors in a transcript is approximately:
errors ≈ (1 - mean_confidence) × total_words
Noisy audio produces lower mean confidence, which shifts the threshold accordingly.
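The sketch below shows one possible reading of this formula: compute the expected error count, then flag that many of the lowest-confidence words. Treating the confidence of the last flagged word as the effective threshold is an assumption made for illustration, and the function name is hypothetical.

```python
def adaptive_flags(words):
    """Flag the k lowest-confidence words, where k is the expected error count.

    k = round((1 - mean_confidence) * total_words). Returns (flagged_words,
    effective_threshold); the threshold is None when nothing is flagged.
    """
    if not words:
        return [], None
    mean_conf = sum(w["confidence"] for w in words) / len(words)
    expected_errors = round((1 - mean_conf) * len(words))
    ranked = sorted(words, key=lambda w: w["confidence"])
    flagged = ranked[:expected_errors]
    threshold = flagged[-1]["confidence"] if flagged else None
    return flagged, threshold
```

On clean audio the mean confidence is high, so few or no words are flagged; on noisy audio the expected error count grows and the effective cutoff rises with it.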
Flag utterances where any word drops below a threshold for human review. Useful for call centers, legal transcription, and medical documentation where accuracy is critical.
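As a rough sketch, utterance-level flagging can be a one-line filter. The input shape here is an assumption: a list of utterance dictionaries, each carrying its own words list with per-word confidence. The function name and the default cutoff are illustrative.

```python
def utterances_for_review(utterances, word_threshold=0.65):
    """Return utterances that contain at least one word below word_threshold."""
    return [
        u for u in utterances
        if any(w["confidence"] < word_threshold for w in u.get("words", []))
    ]
```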
Cross-check confidence on detected entities (names, numbers, addresses). Low confidence on an entity word is a stronger signal to escalate than low confidence on a filler word like “um” or “the.”
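One way to express this asymmetry is to apply a stricter cutoff to entity words than to everything else. In the sketch below, entity_indices is assumed to come from whatever entity detection step you already run, and both threshold values are placeholders to tune on your own data.

```python
def entity_escalations(words, entity_indices,
                       entity_threshold=0.85, default_threshold=0.65):
    """Return (index, word, reason) tuples for words that warrant review.

    Entity words are held to the stricter entity_threshold; all other words
    are only flagged below default_threshold.
    """
    entity_indices = set(entity_indices)
    flagged = []
    for i, w in enumerate(words):
        if i in entity_indices:
            if w["confidence"] < entity_threshold:
                flagged.append((i, w["word"], "low-confidence entity"))
        elif w["confidence"] < default_threshold:
            flagged.append((i, w["word"], "low-confidence word"))
    return flagged
```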
Use mean word confidence across a full transcript as a quality proxy. Automatically escalate low-quality transcripts to human review and accept high-quality ones without intervention.
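A minimal routing sketch, assuming a single accept cutoff on mean word confidence. The 0.90 default and the choice to send empty transcripts to review are illustrative assumptions rather than recommended values.

```python
def route_transcript(words, accept_at=0.90):
    """Return ("accept" | "review", mean_confidence) for one transcript.

    Empty transcripts are routed to review with a mean of None, since there
    is no evidence of quality either way.
    """
    if not words:
        return "review", None
    mean_conf = sum(w["confidence"] for w in words) / len(words)
    decision = "accept" if mean_conf >= accept_at else "review"
    return decision, mean_conf
```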
In real-time applications, track confidence trends across a stream. A sustained drop in confidence may indicate audio quality degradation (background noise, connection issues) that warrants alerting.
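A sustained drop can be detected with a rolling mean over the most recent final transcripts. The class below is a sketch under assumed parameters: the window size, the alert level, and the class name are all hypothetical, and it only alerts once the window is full so that a single bad segment does not trigger it.

```python
from collections import deque


class ConfidenceMonitor:
    """Track mean word confidence over the last `window` transcripts."""

    def __init__(self, window=5, alert_below=0.80):
        self.recent = deque(maxlen=window)
        self.alert_below = alert_below

    def update(self, words):
        """Record one transcript's words; return True if an alert is warranted.

        An alert fires only when the window is full and the rolling mean of
        per-transcript mean confidences is below alert_below.
        """
        if not words:
            return False
        self.recent.append(sum(w["confidence"] for w in words) / len(words))
        window_full = len(self.recent) == self.recent.maxlen
        rolling_mean = sum(self.recent) / len(self.recent)
        return window_full and rolling_mean < self.alert_below
```

Call update once per final transcript; because the deque discards old entries, the alert clears on its own once confidence recovers.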
In streaming mode with interim_results=true, interim transcripts may show lower confidence on words near the audio boundary (the “tip” of the stream). The model has less surrounding context for these words, so its predictions are less certain.
As more audio arrives, interim confidence values typically improve. Final transcripts (is_final: true) will have higher and more reliable confidence scores.
Use confidence values from final transcripts for any downstream decision-making. Use interim confidence only for low-latency display purposes where you expect corrections. See Interim Results for details on how interim and final transcripts work.
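In practice this means filtering streaming messages on is_final before they reach any decision logic. The sketch below assumes each message arrives as a JSON string with is_final at the top level and the words array under channel.alternatives; the function name is hypothetical.

```python
import json


def final_words(raw_message):
    """Return the words of a final streaming result, or [] for interim ones."""
    message = json.loads(raw_message)
    if not message.get("is_final"):
        # Interim result: fine for display, not for downstream decisions.
        return []
    alternatives = message.get("channel", {}).get("alternatives", [])
    return alternatives[0].get("words", []) if alternatives else []
```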
Word confidence appears in the words array within each alternatives object in the API response:
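As a minimal sketch, the nesting can be written as a Python dictionary shaped like a pre-recorded response. The values are invented for illustration, most response fields are omitted, and the helper name is hypothetical.

```python
# Trimmed, illustrative response: only the fields relevant to confidence.
response = {
    "results": {
        "channels": [
            {
                "alternatives": [
                    {
                        "transcript": "hello world",
                        "confidence": 0.98,  # transcript-level
                        "words": [
                            {"word": "hello", "start": 0.08, "end": 0.32,
                             "confidence": 0.99},  # word-level
                            {"word": "world", "start": 0.40, "end": 0.79,
                             "confidence": 0.97},
                        ],
                    }
                ]
            }
        ]
    }
}


def first_alternative(response, channel=0):
    """Return the top alternative for one channel of the response."""
    return response["results"]["channels"][channel]["alternatives"][0]


alternative = first_alternative(response)
transcript_confidence = alternative["confidence"]
word_confidences = [(w["word"], w["confidence"]) for w in alternative["words"]]
```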
There are two distinct confidence fields in the response:
confidence (in alternatives): Overall reliability of the full transcript.
confidence (in each words entry): Per-word probability of correctness.
For Nova-3 streaming results, the transcript-level confidence value in alternatives is calculated as the median of the word-level confidence scores for all words in that chunk’s final transcript. This means the transcript-level score is robust to individual outlier words: a single low-confidence word will not significantly affect the overall score, but a pattern of low-confidence words will pull it down.
If you threshold on transcript-level confidence for routing decisions (for example, escalating to human review), be aware that a transcript with one very low-confidence entity word might still pass a high transcript-level threshold because the median is unaffected by a single outlier. Supplement with word-level checks on critical entities.
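The effect is easy to reproduce with a small example. The sketch below builds an alternatives-shaped dictionary whose transcript-level confidence is the median of its word scores, then combines a transcript-level check with a word-level check on entity positions. The entity indices, both thresholds, and the function name are assumptions for illustration.

```python
from statistics import median


def needs_review(alternative, entity_indices,
                 transcript_threshold=0.90, entity_threshold=0.85):
    """Return (escalate, weak_entity_words) for one alternative.

    Escalates when the transcript-level confidence is low OR any entity
    word falls below entity_threshold.
    """
    words = alternative["words"]
    weak_entities = [
        words[i]["word"]
        for i in sorted(set(entity_indices))
        if words[i]["confidence"] < entity_threshold
    ]
    low_transcript = alternative["confidence"] < transcript_threshold
    return low_transcript or bool(weak_entities), weak_entities


words = [
    {"word": "transfer", "confidence": 0.99},
    {"word": "to", "confidence": 0.98},
    {"word": "smith", "confidence": 0.31},  # the one outlier, an entity
    {"word": "right", "confidence": 0.97},
    {"word": "away", "confidence": 0.99},
]
# Median of the word scores, mirroring the Nova-3 streaming behavior above.
alternative = {"confidence": median([w["confidence"] for w in words]),
               "words": words}
```

Here the median is 0.98, so a transcript-level threshold of 0.90 passes on its own; only the entity check on index 2 catches the weak word.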