Understanding Word Confidence Scores
Use word-level confidence scores to detect transcription errors and assess transcript quality.
Every word in Deepgram's transcription response includes a confidence value: a floating-point number between 0 and 1 representing the model's estimated probability that the word was transcribed correctly. This per-word score appears in the words array within each alternatives object and is distinct from the transcript-level confidence field, which represents overall transcript reliability.
What Confidence Means
Deepgram's word confidence is a calibrated probability. A confidence of 0.93 means the model estimates a 93% chance the word is correct.
"Calibrated" means the scores are statistically honest: if you took all words the model scored at 0.93, approximately 93% of them would actually be correct. The model's probability outputs are naturally well-calibrated.
Confidence score definitions and calibration vary across STT providers. You cannot directly compare raw confidence distributions between providers. A provider whose scores are spread evenly across 0–1 is not necessarily providing "more informative" scores; they may simply be poorly calibrated, meaning their stated confidence does not match actual accuracy.
Why Most Scores Are High
On typical audio, most words will have confidence scores above 0.90. This is expected and correct behavior; it reflects the model accurately predicting that it got most words right.
For example, if Deepgram achieves 95% word accuracy on your audio, you should expect the average confidence to be around 0.95, with most words clustering near 1.0. The roughly 5% of words the model is less sure about will have lower scores.
A flat or uniform distribution of confidence scores across 0–1 would actually indicate poor calibration: it would mean the model is equally uncertain about every word, which does not reflect reality for a high-accuracy model.
The concentration of scores near 1.0 does not mean the scores lack discriminatory power. Words with lower scores are meaningfully more likely to be errors. The signal is in the tail of the distribution, not the center.
Using Confidence for Error Detection
Fixed Threshold
The simplest approach: choose a confidence cutoff and flag all words below it as potential errors.
With Deepgram's well-calibrated scores, a threshold around 0.65 works well as an error detector: words below this threshold are very likely to be genuine errors (high precision). The tradeoff: a low threshold catches only the most obvious errors. Raising it catches more errors but also flags some correct words.
Evaluate precision and recall at multiple thresholds on a sample of your own data to find the right balance for your use case.
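A minimal sketch of the fixed-threshold approach. It assumes you have already extracted the words list from the response (each entry carries at least "word" and "confidence" fields); the threshold value is illustrative:

```python
def flag_low_confidence(words, threshold=0.65):
    """Return the words whose confidence falls below the threshold.

    `words` is the list of word objects from an alternatives entry,
    e.g. response["results"]["channels"][0]["alternatives"][0]["words"].
    """
    return [w for w in words if w["confidence"] < threshold]
```

Sweep the threshold over your own labeled sample to pick the precision/recall tradeoff that fits your review budget.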
Dynamic Threshold
Adapt the threshold automatically based on audio difficulty:
- Compute the mean confidence across all words in a transcript.
- Estimate the expected error count: errors ≈ (1 - mean_confidence) × total_words.
- Sort words by confidence (ascending) and flag that many words as potential errors.
Noisy audio produces lower mean confidence, which shifts the threshold accordingly.
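The three steps above can be sketched as a single function (field names assume the same word-object shape as the response's words array):

```python
def flag_dynamic(words):
    """Flag a number of words proportional to the transcript's
    estimated error count: errors ~= (1 - mean_confidence) * total_words.
    Returns the lowest-confidence words first."""
    if not words:
        return []
    confidences = [w["confidence"] for w in words]
    mean_conf = sum(confidences) / len(confidences)
    expected_errors = round((1 - mean_conf) * len(words))
    # Lowest-confidence words are the most likely errors.
    ranked = sorted(words, key=lambda w: w["confidence"])
    return ranked[:expected_errors]
```

Because the flag count scales with mean confidence, noisy audio (lower mean) flags more words automatically, with no hand-tuned cutoff.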
Common Use Cases
QA and Compliance Review
Flag utterances where any word drops below a threshold for human review. Useful for call centers, legal transcription, and medical documentation where accuracy is critical.
Entity Validation
Cross-check confidence on detected entities (names, numbers, addresses). Low confidence on an entity word is a stronger signal to escalate than low confidence on a filler word like "um" or "the."
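A sketch of this idea, with a deliberately simplified assumption: entity words are matched by surface text against a known list. In practice the entity spans would come from an NER step or an entity-detection feature, not string matching:

```python
def escalate_entity_words(words, entity_texts, threshold=0.8):
    """Return entity words whose confidence falls below a stricter
    threshold than you would apply to ordinary words.

    entity_texts: iterable of entity surface forms (simplification;
    a real pipeline would use entity spans from an NER step)."""
    entity_set = {e.lower() for e in entity_texts}
    return [
        w for w in words
        if w["word"].lower() in entity_set and w["confidence"] < threshold
    ]
```

Note the threshold here is intentionally higher than a general-purpose error cutoff: a slightly uncertain name or account number is worth a human look even when an equally uncertain filler word is not.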
Transcript Quality Scoring
Use mean word confidence across a full transcript as a quality proxy. Automatically escalate low-quality transcripts to human review and accept high-quality ones without intervention.
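A minimal routing sketch based on mean word confidence (the acceptance threshold is illustrative and should be tuned on your own data):

```python
def route_transcript(words, accept_above=0.93):
    """Route a transcript by mean word confidence.

    Returns ("accept", mean) or ("human_review", mean); the 0.93
    cutoff is an illustrative placeholder, not a recommendation."""
    mean_conf = sum(w["confidence"] for w in words) / len(words)
    decision = "accept" if mean_conf >= accept_above else "human_review"
    return decision, mean_conf
```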
Streaming Confidence Monitoring
In real-time applications, track confidence trends across a stream. A sustained drop in confidence may indicate audio quality degradation (background noise, connection issues) that warrants alerting.
Confidence in Streaming vs. Pre-recorded
In streaming mode with interim_results=true, interim transcripts may show lower confidence on words near the audio boundary (the "tip" of the stream). The model has less surrounding context for these words, so its predictions are less certain.
As more audio arrives, interim confidence values typically improve. Final transcripts (is_final: true) will have higher and more reliable confidence scores.
Use confidence values from final transcripts for any downstream decision-making. Use interim confidence only for low-latency display purposes where you expect corrections. See Interim Results for details on how interim and final transcripts work.
Limitations
- No alternatives: Confidence tells you how sure the model is about its top prediction, but does not surface the modelโs second-best guess. You cannot use it to get suggested corrections.
- Not a WER guarantee: High average confidence does not guarantee low word error rate on any specific transcript. It is a statistical property across many predictions.
- Model mismatch: If you use the wrong language model or the audio contains heavy accents not well-represented in training data, the model can be confidently wrong. Confidence reflects the modelโs internal estimate, which is only as good as the modelโs fit to the audio domain.
- Cross-provider comparison: Do not compare raw confidence score distributions between STT providers. Different providers may use different calibration methods, temperature scaling, or definitions of "confidence." A meaningful comparison requires evaluating precision and recall of error detection at various thresholds on the same evaluation dataset.
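The cross-provider evaluation described in the last point can be sketched as follows, assuming you have labeled each word as an error or not against a reference transcript:

```python
def error_detection_pr(scored_words, threshold):
    """Precision/recall of "confidence < threshold" as an error detector.

    scored_words: iterable of (confidence, is_error) pairs, where
    is_error is 1 if the word was wrong against the reference, else 0."""
    scored_words = list(scored_words)
    flagged = [(c, e) for c, e in scored_words if c < threshold]
    true_errors = sum(e for _, e in scored_words)
    tp = sum(e for _, e in flagged)
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / true_errors if true_errors else 1.0
    return precision, recall
```

Running this at several thresholds on the same labeled dataset for each provider gives comparable operating curves, whereas the raw score distributions do not.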
API Reference
Word confidence appears in the words array within each alternatives object in the API response:
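An abridged sketch of that shape and how to reach the word scores (field paths follow the pre-recorded response structure; all values are made up for illustration):

```python
# Abridged, illustrative response shape; values are made up.
response = {
    "results": {
        "channels": [{
            "alternatives": [{
                "transcript": "hello world",
                "confidence": 0.97,  # transcript-level confidence
                "words": [
                    {"word": "hello", "start": 0.08, "end": 0.32, "confidence": 0.99},
                    {"word": "world", "start": 0.40, "end": 0.71, "confidence": 0.95},
                ],
            }]
        }]
    }
}

alt = response["results"]["channels"][0]["alternatives"][0]
word_confidences = [w["confidence"] for w in alt["words"]]
```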
There are two distinct confidence fields in the response:
- Transcript-level confidence (in alternatives): Overall reliability of the full transcript.
- Word-level confidence (in each words entry): Per-word probability of correctness.
For Nova-3 streaming results, the transcript-level confidence value in alternatives is calculated as the median of the word-level confidence scores for all words in that chunk's final transcript. This means the transcript-level score is robust to individual outlier words: a single low-confidence word will not significantly affect the overall score, but a pattern of low-confidence words will pull it down.
If you threshold on transcript-level confidence for routing decisions (for example, escalating to human review), be aware that a transcript with one very low-confidence entity word might still pass a high transcript-level threshold because the median is unaffected by a single outlier. Supplement with word-level checks on critical entities.
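The outlier behavior is easy to see with made-up scores: the median barely moves for one bad word but drops when several words are low, which is exactly why a word-level check is needed alongside the transcript-level score.

```python
import statistics

# One low-confidence outlier barely moves the median...
one_outlier = [0.98, 0.97, 0.99, 0.96, 0.12]
# ...but a pattern of low scores pulls it down.
pattern_of_lows = [0.98, 0.55, 0.60, 0.96, 0.12]

median_one = statistics.median(one_outlier)       # still high
median_lows = statistics.median(pattern_of_lows)  # pulled down

# The word-level supplement: catch the outlier the median hides.
critical_low = [c for c in one_outlier if c < 0.5]
```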
Related Resources
- Interim Results – How interim and final transcripts work in streaming.
- Utterances – Segment speech into meaningful semantic units.
- Pre-recorded Audio Getting Started – Transcribe audio files with Deepgram.
- Streaming Audio Getting Started – Real-time transcription with WebSockets.