Understanding Word Confidence Scores

Use word-level confidence scores to detect transcription errors and assess transcript quality.

Every word in Deepgram's transcription response includes a confidence value: a floating-point number between 0 and 1 representing the model's estimated probability that the word was transcribed correctly. This per-word score appears in the words array within each alternatives object and is distinct from the transcript-level confidence field, which represents overall transcript reliability.

What Confidence Means

Deepgram's word confidence is a calibrated probability. A confidence of 0.93 means the model estimates a 93% chance the word is correct.

"Calibrated" means the scores are statistically honest: if you took all words the model scored at 0.93, approximately 93% of them would actually be correct. The model's probability outputs are naturally well-calibrated.

Confidence score definitions and calibration vary across STT providers. You cannot directly compare raw confidence distributions between providers. A provider whose scores are spread evenly across 0–1 is not necessarily providing "more informative" scores; they may simply be poorly calibrated, meaning their stated confidence does not match actual accuracy.
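
If you have human-checked reference transcripts, you can sanity-check calibration yourself: bucket (confidence, correct?) pairs into bins and compare each bin's mean confidence against its empirical accuracy. The sketch below shows the mechanics; the `scored_words` sample is synthetic and far too small to be statistically meaningful.

```python
def calibration_report(pairs, n_bins=10):
    """Bucket (confidence, is_correct) pairs into equal-width bins and
    compare each bin's mean confidence against its empirical accuracy."""
    bins = {}
    for conf, correct in pairs:
        b = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins.setdefault(b, []).append((conf, correct))
    report = {}
    for b, items in sorted(bins.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        report[b] = (mean_conf, accuracy)
    return report

# Illustrative labeled words; in practice, correctness comes from aligning
# the transcript against a human-checked reference.
scored_words = [
    (0.98, True), (0.95, True), (0.97, True), (0.96, False),
    (0.55, False), (0.52, False), (0.58, False), (0.60, True),
]

for b, (mean_conf, accuracy) in calibration_report(scored_words).items():
    print(f"bin {b}: mean confidence {mean_conf:.3f}, accuracy {accuracy:.3f}")
```

For a well-calibrated model, mean confidence and accuracy should track each other closely in every bin once you have enough labeled words per bin.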

Why Most Scores Are High

On typical audio, most words will have confidence scores above 0.90. This is expected and correct behavior: it reflects the model accurately predicting that it got most words right.

For example, if Deepgram achieves 95% word accuracy on your audio, you should expect the average confidence to be around 0.95, with most words clustering near 1.0. The roughly 5% of words the model is less sure about will have lower scores.

A flat or uniform distribution of confidence scores across 0–1 would actually indicate poor calibration: it would mean the model is equally uncertain about every word, which does not reflect reality for a high-accuracy model.

The concentration of scores near 1.0 does not mean the scores lack discriminatory power. Words with lower scores are meaningfully more likely to be errors. The signal is in the tail of the distribution, not the center.

Using Confidence for Error Detection

Fixed Threshold

The simplest approach: choose a confidence cutoff and flag all words below it as potential errors.

With Deepgram's well-calibrated scores, a threshold around 0.65 works well as an error detector: words below this threshold are very likely to be genuine errors (high precision). The tradeoff: a low threshold catches only the most obvious errors. Raising it catches more errors but also flags some correct words.

Evaluate precision and recall at multiple thresholds on a sample of your own data to find the right balance for your use case.

Python
# Flag words below a confidence threshold
threshold = 0.65
low_confidence_words = [
    word for word in response["results"]["channels"][0]["alternatives"][0]["words"]
    if word["confidence"] < threshold
]

for word in low_confidence_words:
    print(f"  '{word['word']}' (confidence: {word['confidence']:.3f}, "
          f"time: {word['start']:.2f}s - {word['end']:.2f}s)")
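
To evaluate thresholds as suggested above, a simple sweep over labeled (confidence, correct?) pairs works: treat each flagged word as a predicted error and measure precision and recall. A sketch, with a synthetic labeled sample:

```python
def error_detection_metrics(pairs, threshold):
    """Precision/recall of 'flag words below threshold' as an error detector.
    pairs: (confidence, is_correct) tuples; a flagged word is a predicted error."""
    flagged = [ok for conf, ok in pairs if conf < threshold]
    true_positives = sum(1 for ok in flagged if not ok)
    actual_errors = sum(1 for _, ok in pairs if not ok)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / actual_errors if actual_errors else 1.0
    return precision, recall

# Synthetic labeled sample for illustration only.
labeled = [(0.99, True), (0.97, True), (0.95, True), (0.92, True),
           (0.88, True), (0.85, False), (0.60, False), (0.55, False),
           (0.40, False)]

for t in (0.50, 0.65, 0.90):
    p, r = error_detection_metrics(labeled, t)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy sample, raising the threshold from 0.65 to 0.90 improves recall at the cost of precision, which is the tradeoff described above.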

Dynamic Threshold

Adapt the threshold automatically based on audio difficulty:

  1. Compute the mean confidence across all words in a transcript.
  2. Estimate expected error count: errors โ‰ˆ (1 - mean_confidence) ร— total_words.
  3. Sort words by confidence (ascending) and flag that many words as potential errors.

Noisy audio produces lower mean confidence, which shifts the threshold accordingly.

Python
import statistics

words = response["results"]["channels"][0]["alternatives"][0]["words"]
confidences = [w["confidence"] for w in words]

mean_conf = statistics.mean(confidences)
expected_errors = int((1 - mean_conf) * len(words))

# Flag the N lowest-confidence words as potential errors
sorted_words = sorted(words, key=lambda w: w["confidence"])
flagged = sorted_words[:expected_errors]

print(f"Mean confidence: {mean_conf:.3f}")
print(f"Expected errors: {expected_errors} out of {len(words)} words")
for word in flagged:
    print(f"  '{word['word']}' (confidence: {word['confidence']:.3f})")

Common Use Cases

QA and Compliance Review

Flag utterances where any word drops below a threshold for human review. Useful for call centers, legal transcription, and medical documentation where accuracy is critical.
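
A minimal version of that rule, assuming utterance-shaped input (as returned with utterances=true; the threshold and sample data here are illustrative):

```python
def needs_review(utterance, threshold=0.65):
    """Flag an utterance for human review if any word falls below threshold."""
    return any(w["confidence"] < threshold for w in utterance["words"])

# Illustrative utterances, shaped like an utterances=true response.
utterances = [
    {"transcript": "please confirm the dosage",
     "words": [{"word": "please", "confidence": 0.99},
               {"word": "confirm", "confidence": 0.97},
               {"word": "the", "confidence": 0.98},
               {"word": "dosage", "confidence": 0.58}]},
    {"transcript": "thank you for calling",
     "words": [{"word": "thank", "confidence": 0.99},
               {"word": "you", "confidence": 0.99},
               {"word": "for", "confidence": 0.98},
               {"word": "calling", "confidence": 0.97}]},
]

flagged = [u["transcript"] for u in utterances if needs_review(u)]
print(flagged)  # only the utterance with the low-confidence "dosage"
```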

Entity Validation

Cross-check confidence on detected entities (names, numbers, addresses). Low confidence on an entity word is a stronger signal to escalate than low confidence on a filler word like "um" or "the."
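
One way to wire this up is to take entity spans (start/end times) from a separate NER step and check the minimum word confidence inside each span. The span times and data shapes below are illustrative assumptions, not part of the API response:

```python
def entity_confidence(words, span_start, span_end):
    """Minimum word confidence among words inside [span_start, span_end]."""
    in_span = [w["confidence"] for w in words
               if w["start"] >= span_start and w["end"] <= span_end]
    return min(in_span) if in_span else None

# Illustrative word list; in practice this comes from the words array.
words = [
    {"word": "transfer", "start": 0.0, "end": 0.5, "confidence": 0.98},
    {"word": "nine", "start": 0.5, "end": 0.8, "confidence": 0.61},
    {"word": "hundred", "start": 0.8, "end": 1.2, "confidence": 0.95},
    {"word": "dollars", "start": 1.2, "end": 1.6, "confidence": 0.97},
]

# Hypothetical span of a detected money entity from an NER step.
amount_conf = entity_confidence(words, 0.5, 1.6)
if amount_conf is not None and amount_conf < 0.80:
    print("Escalate: low confidence inside a critical entity")
```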

Transcript Quality Scoring

Use mean word confidence across a full transcript as a quality proxy. Automatically escalate low-quality transcripts to human review and accept high-quality ones without intervention.
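
A sketch of that routing logic; the two threshold values are illustrative and should be tuned on your own data:

```python
def route_transcript(words, accept_threshold=0.93, review_threshold=0.85):
    """Route a transcript by mean word confidence.
    Thresholds here are illustrative; tune them on your own data."""
    mean_conf = sum(w["confidence"] for w in words) / len(words)
    if mean_conf >= accept_threshold:
        return "accept"
    if mean_conf >= review_threshold:
        return "spot_check"
    return "human_review"

print(route_transcript([{"confidence": c} for c in (0.99, 0.97, 0.95)]))  # accept
```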

Streaming Confidence Monitoring

In real-time applications, track confidence trends across a stream. A sustained drop in confidence may indicate audio quality degradation (background noise, connection issues) that warrants alerting.
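
One simple way to track the trend is a rolling mean over the last N words. The window size and alert threshold below are illustrative starting points, not recommended values:

```python
from collections import deque

class ConfidenceMonitor:
    """Rolling mean of word confidence over the last `window` words.
    Window size and alert threshold are illustrative; tune for your stream."""

    def __init__(self, window=50, alert_below=0.80):
        self.window = deque(maxlen=window)
        self.alert_below = alert_below

    def add(self, confidence):
        """Record one word's confidence; return True if the rolling mean
        has dropped below the alert threshold."""
        self.window.append(confidence)
        # Wait for a full window so a few early words can't false-alarm.
        if len(self.window) < self.window.maxlen:
            return False
        return sum(self.window) / len(self.window) < self.alert_below
```

Feed each final word's confidence into `add` as it arrives; a True return is a signal to alert on possible audio degradation.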

Confidence in Streaming vs. Pre-recorded

In streaming mode with interim_results=true, interim transcripts may show lower confidence on words near the audio boundary (the "tip" of the stream). The model has less surrounding context for these words, so its predictions are less certain.

As more audio arrives, interim confidence values typically improve. Final transcripts (is_final: true) will have higher and more reliable confidence scores.

Use confidence values from final transcripts for any downstream decision-making. Use interim confidence only for low-latency display purposes where you expect corrections. See Interim Results for details on how interim and final transcripts work.
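
In code, that guidance amounts to gating downstream logic on is_final. The message shape below assumes Deepgram's live-streaming results payload (is_final at the top level, with channel.alternatives inside); verify against your SDK's actual message format:

```python
def handle_message(message, on_final_words):
    """Act on confidence only for final transcripts; interim confidence
    is for display and may still be revised."""
    if not message.get("is_final"):
        return  # interim: display if you like, but don't act on confidence
    words = message["channel"]["alternatives"][0]["words"]
    on_final_words(words)

# Example: collect words from final results only.
final_words = []
handle_message({"is_final": False,
                "channel": {"alternatives": [{"words": [{"word": "hi", "confidence": 0.50}]}]}},
               final_words.extend)
handle_message({"is_final": True,
                "channel": {"alternatives": [{"words": [{"word": "hi", "confidence": 0.98}]}]}},
               final_words.extend)
print(final_words)  # only the final result's words
```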

Limitations

  • No alternatives: Confidence tells you how sure the model is about its top prediction, but does not surface the modelโ€™s second-best guess. You cannot use it to get suggested corrections.
  • Not a WER guarantee: High average confidence does not guarantee low word error rate on any specific transcript. It is a statistical property across many predictions.
  • Model mismatch: If you use the wrong language model or the audio contains heavy accents not well-represented in training data, the model can be confidently wrong. Confidence reflects the modelโ€™s internal estimate, which is only as good as the modelโ€™s fit to the audio domain.
  • Cross-provider comparison: Do not compare raw confidence score distributions between STT providers. Different providers may use different calibration methods, temperature scaling, or definitions of "confidence." A meaningful comparison requires evaluating precision and recall of error detection at various thresholds on the same evaluation dataset.

API Reference

Word confidence appears in the words array within each alternatives object in the API response:

JSON
{
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "the quick brown fox",
            "confidence": 0.9876,
            "words": [
              {
                "word": "the",
                "start": 0.08,
                "end": 0.32,
                "confidence": 0.998,
                "punctuated_word": "The"
              },
              {
                "word": "quick",
                "start": 0.32,
                "end": 0.64,
                "confidence": 0.965,
                "punctuated_word": "quick"
              },
              {
                "word": "brown",
                "start": 0.64,
                "end": 0.88,
                "confidence": 0.991,
                "punctuated_word": "brown"
              },
              {
                "word": "fox",
                "start": 0.88,
                "end": 1.12,
                "confidence": 0.943,
                "punctuated_word": "fox"
              }
            ]
          }
        ]
      }
    ]
  }
}

There are two distinct confidence fields in the response:

  • Transcript-level confidence (in alternatives): Overall reliability of the full transcript.
  • Word-level confidence (in each words entry): Per-word probability of correctness.

For Nova-3 streaming results, the transcript-level confidence value in alternatives is calculated as the median of the word-level confidence scores for all words in that chunk's final transcript. This means the transcript-level score is robust to individual outlier words: a single low-confidence word will not significantly affect the overall score, but a pattern of low-confidence words will pull it down.

If you threshold on transcript-level confidence for routing decisions (for example, escalating to human review), be aware that a transcript with one very low-confidence entity word might still pass a high transcript-level threshold because the median is unaffected by a single outlier. Supplement with word-level checks on critical entities.
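
The supplement described above can be sketched as a gate that combines both checks. Here `statistics.median` mirrors the median computation described for Nova-3 streaming, while the entity word indices are an illustrative assumption (in practice they would come from an NER step):

```python
import statistics

def passes_quality_gate(words, entity_indices, transcript_threshold=0.90,
                        entity_threshold=0.80):
    """Combine a transcript-level median check with word-level checks on
    critical entity words. Thresholds are illustrative."""
    median_conf = statistics.median(w["confidence"] for w in words)
    if median_conf < transcript_threshold:
        return False
    return all(words[i]["confidence"] >= entity_threshold for i in entity_indices)

words = [{"word": w, "confidence": c} for w, c in
         [("send", 0.98), ("to", 0.97), ("acme", 0.55), ("corp", 0.96)]]

# Median is 0.965: the single low-confidence word doesn't move it,
# but the word-level check on the entity words still catches it.
print(passes_quality_gate(words, entity_indices=[2, 3]))  # False
```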