Diarization

Last updated 09/14/2021
PRE-RECORDED
STREAMING

Deepgram’s Diarize feature recognizes speaker changes and assigns a speaker to each word in the transcript.

Use Cases

Diarization is useful in situations such as:

Customers whose audio contains multiple speakers and who want transcripts to appear in a more readable format.

Enable Feature

To enable diarization, add the diarize parameter, set to true, to the query string when you call Deepgram’s API:

diarize=true

To transcribe audio from a file on your computer, run the following cURL command in a terminal or your favorite API client.

Be sure to replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key. You can create an API Key in the Deepgram Console.

curl \
  --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'Content-Type: audio/wav' \
  --data-binary @youraudio.wav \
  --url 'https://api.deepgram.com/v1/listen?diarize=true'
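
For readers who prefer a script to cURL, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper name build_listen_request and its parameters are assumptions made for illustration, while the endpoint, the two headers, and the diarize=true parameter come from the command above. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_listen_request(audio, api_key, content_type="audio/wav", **features):
    """Build (but do not send) a POST request mirroring the cURL command.

    build_listen_request is a hypothetical helper, not part of any Deepgram SDK.
    """
    # Booleans are serialized as lowercase true/false, e.g. diarize=true.
    query = urlencode({name: str(value).lower() for name, value in features.items()})
    url = f"{LISTEN_URL}?{query}" if query else LISTEN_URL
    headers = {
        "Authorization": f"Token {api_key}",
        "Content-Type": content_type,
    }
    return Request(url, data=audio, headers=headers, method="POST")


# Placeholder bytes stand in for the contents of youraudio.wav.
request = build_listen_request(b"<audio bytes>", "YOUR_DEEPGRAM_API_KEY", diarize=True)
print(request.full_url)
```

To send the request, read the bytes of your audio file, pass them as audio, and hand the result to urllib.request.urlopen. The response body is JSON.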

Analyze Response

For this example, we use an MP3 audio file that contains the beginning of a customer call with Premier Phone Services. If you would like to follow along, you can download it.

When the file is finished processing (often after only a few seconds), you’ll receive a JSON response that has the following basic structure:

{
  "metadata": {
    "transaction_key": "string",
    "request_id": "string",
    "sha256": "string",
    "created": "string",
    "duration": 0,
    "channels": 0
  },
  "results": {
    "channels": [
      {
        "alternatives":[]
      }
    ]
  }
}
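
As a rough illustration of walking this structure in code, the following Python sketch pulls the first alternative out of a parsed response. The function name first_alternative and the sample dictionary are hypothetical; the sample is a trimmed-down stand-in that uses only field names appearing on this page.

```python
def first_alternative(response, channel=0):
    """Return the first alternative for a channel, or None if there is none."""
    channels = response.get("results", {}).get("channels", [])
    if channel >= len(channels):
        return None
    alternatives = channels[channel].get("alternatives", [])
    return alternatives[0] if alternatives else None


# Hypothetical, heavily shortened response used only for illustration.
sample = {
    "metadata": {"request_id": "string", "channels": 1},
    "results": {
        "channels": [
            {"alternatives": [{"transcript": "alright so see here"}]}
        ]
    },
}
print(first_alternative(sample)["transcript"])
```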

Let's look more closely at the alternatives array:

...
"alternatives":[
  {
    "transcript": "alright so see here hello and thank you for calling premier phone service please be aware that this call may be recorded for quality training purposes my name is beth and i will be assisting you today how are you doing not too bad how are you today i'm doing well thank you may i please have your name my name is blake...",
    "confidence": 0.9917229,
    "words": [
      {"word":"alright","start": 1.5759487,"end": 1.6956409,"confidence": 0.9019588,"speaker": 0},
      {"word":"so","start": 1.8951281,"end": 2.2143075,"confidence": 0.24105293,"speaker": 0},
      {"word":"see","start":2.2143075,"end": 2.3738973,"confidence": 0.56319493,"speaker": 0},
      {"word":"here","start": 2.3738973,"end": 2.5733845,"confidence": 0.6261203,"speaker": 0},
      {"word":"hello","start": 15.139425, "end": 15.298915, "confidence": 0.96515656,"speaker":1},
      {"word":"and","start": 15.378659,"end": 15.578021,"confidence": 0.99532,"speaker": 1},
      {"word": "thank","start": 15.578021,"end": 15.737511,"confidence": 0.9995727,"speaker": 1},
      {"word": "you","start": 15.737511,"end": 15.897,"confidence": 0.99323964,"speaker": 1},
      {"word": "for","start": 15.897,"end": 16.05649,"confidence": 0.99926585,"speaker": 1},
      {"word": "calling","start": 16.05649,"end": 16.295723,"confidence": 0.99977976,  "speaker": 1},
      {"word": "premier","start": 16.495085,"end": 16.654575,"confidence": 0.26034957,"speaker": 1},
      {"word": "phone","start": 16.654575,"end": 16.814064,"confidence": 0.87480307,"speaker": 1},
      {"word": "service","start": 16.933681,"end": 17.292532,"confidence": 0.9798285,"speaker": 1},
      {"word": "please","start": 17.531765,"end": 17.691256,"confidence": 0.99711716,"speaker": 1},
      {"word": "be","start": 17.691256,"end": 17.890617,"confidence": 0.9722504,"speaker": 1},
      {"word": "aware","start": 17.890617,"end": 18.08998,"confidence": 0.9976533,"speaker": 1},
      {"word": "that","start": 18.08998,"end": 18.209597,"confidence": 0.99919087,"speaker": 1},
      {"word": "this","start": 18.209597,"end": 18.369085,"confidence": 0.99952435,"speaker": 1},
      {"word": "call","start": 18.369085, "end": 18.568447,"confidence": 0.99982494,"speaker": 1},
      {"word": "may","start": 18.568447,"end": 18.727936,"confidence": 0.98355246,"speaker": 1},
      {"word": "be","start": 18.727936,"end": 18.96717,"confidence": 0.99631196,"speaker": 1},
      {"word": "recorded","start": 18.96717,"end": 19.246277,"confidence": 0.9996917,"speaker": 1},
      {"word": "for","start": 19.246277,"end": 19.405766,"confidence": 0.95565933,"speaker": 1},
      {"word": "quality","start": 19.405766, "end": 19.80449,"confidence": 0.9973219,"speaker": 1},
      {"word": "training","start": 19.80449,"end": 20.083595,"confidence": 0.99996436,"speaker": 1},
      {"word": "purposes","start": 20.083595,"end": 20.482319,"confidence": 0.9998878,"speaker": 1},
      {"word": "my","start": 21.00066,"end": 21.160149,"confidence": 0.99965894,"speaker": 1},
      {"word": "name","start": 21.160149,"end": 21.319637,"confidence": 0.9998049,"speaker": 1},
      {"word": "is","start": 21.319637,"end": 21.399384,"confidence": 0.99522287,"speaker": 1},
      {"word": "beth","start": 21.479128,"end": 21.638617,"confidence": 0.9997403,"speaker": 1},
      {"word": "and","start": 21.758234,"end": 21.877851,"confidence": 0.9899832,"speaker": 1},
      {"word": "i","start": 21.957596,"end": 22.077213,"confidence": 0.9965701,"speaker": 1},
      {"word": "will","start": 22.077213,"end": 22.19683,"confidence": 0.9863131,"speaker": 1},
      {"word": "be","start": 22.19683,"end": 22.436064,"confidence": 0.9958853,"speaker": 1},
      {"word": "assisting","start": 22.436064,"end": 22.755043,"confidence": 0.9999304,"speaker": 1},
      {"word": "you","start": 22.755043,"end": 22.954405,"confidence": 0.99982965,"speaker": 1},
      {"word": "today","start": 22.954405,"end": 23.113894,"confidence": 0.9921113,"speaker": 1},
      {"word": "how","start": 23.472744,"end": 23.55249,"confidence": 0.9921525,"speaker": 1},
      {"word": "are","start": 23.55249,"end": 23.672108,"confidence": 0.9914528,"speaker": 1},
      {"word": "you","start": 23.672108,"end": 23.791723,"confidence": 0.99728036,"speaker": 1},
      {"word": "doing","start": 23.791723,"end": 23.991085,"confidence": 0.9335857,"speaker": 1},
      {"word": "not","start": 25.283688,"end": 25.523354,"confidence": 0.9707693,"speaker": 2},
      {"word": "too","start": 25.523354,"end": 25.723074,"confidence": 0.9114953,"speaker": 2},
      {"word": "bad","start": 25.723074,"end": 26.042627,"confidence": 0.2638601,"speaker": 2},
      {"word": "how","start": 26.482012,"end": 26.561901,"confidence": 0.9993493,"speaker": 2},
      {"word": "are","start": 26.561901,"end": 26.681732,"confidence": 0.99502474,"speaker": 2},
      {"word": "you","start": 26.681732,"end": 26.961342,"confidence": 0.99820864,"speaker": 2},
      {"word": "today","start": 26.961342,"end": 27.161062,"confidence": 0.99570376,"speaker": 2},
      {"word": "i'm","start": 28.039833,"end": 28.239555,"confidence": 0.9935423,"speaker": 1},
      {"word": "doing","start": 28.239555,"end": 28.47922,"confidence": 0.9994615,"speaker": 1},
      {"word": "well","start": 28.47922,"end": 28.599052,"confidence": 0.9997105,"speaker": 1},
      {"word": "thank","start": 28.718884,"end": 28.918606,"confidence": 0.9982809,"speaker": 1},
      {"word": "you","start": 28.918606,"end": 29.038437,"confidence": 0.991257,"speaker": 1},
      {"word": "may","start": 29.6376,"end": 29.717487,"confidence": 0.9996562,"speaker": 1},
      {"word": "i","start": 29.797375,"end": 29.957153,"confidence": 0.99980444,"speaker": 1},
      {"word": "please","start": 29.957153,"end": 30.156872,"confidence": 0.9994448,"speaker": 1},
      {"word": "have","start": 30.156872,"end": 30.276705,"confidence": 0.9995801,"speaker": 1},
      {"word": "your","start": 30.276705,"end": 30.436481,"confidence": 0.99976915,"speaker": 1},
      {"word": "name","start": 30.436481,"end": 30.596258,"confidence": 0.9994388,"speaker": 1},
      {"word": "my","start": 31.68942,"end": 31.848904,"confidence": 0.99273324,"speaker": 2},
      {"word": "name","start": 31.848904,"end": 32.008385,"confidence": 0.9930663,"speaker": 2},
      {"word": "is","start": 32.008385,"end": 32.28748,"confidence": 0.9944564,"speaker": 2},
      {"word": "blake","start": 32.407093,"end": 32.88554,"confidence": 0.82569283,"speaker": 2},
    ...
    ]
  }
]

In this response, we see that each alternative contains:

  • transcript: Transcript for the audio being processed.
  • confidence: Floating point value between 0 and 1 that indicates overall transcript reliability. Larger values indicate higher confidence.
  • words: Array of objects, one per word in the transcript, each with its start time and end time (in seconds) from the beginning of the audio stream, a confidence value, and an integer speaker identifier.
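
Because every word carries a speaker identifier, the words array can be collapsed into speaker turns. The Python sketch below shows one way to do that, assuming the array has already been parsed from JSON. The function name speaker_turns is hypothetical, and the sample words are copied from the response above with the confidence values left out for brevity.

```python
from itertools import groupby


def speaker_turns(words):
    """Collapse consecutive words from the same speaker into (speaker, text) turns."""
    return [
        (speaker, " ".join(w["word"] for w in group))
        for speaker, group in groupby(words, key=lambda w: w["speaker"])
    ]


# A few entries from the example response, confidence omitted.
words = [
    {"word": "are", "start": 23.55249, "end": 23.672108, "speaker": 1},
    {"word": "you", "start": 23.672108, "end": 23.791723, "speaker": 1},
    {"word": "doing", "start": 23.791723, "end": 23.991085, "speaker": 1},
    {"word": "not", "start": 25.283688, "end": 25.523354, "speaker": 2},
    {"word": "too", "start": 25.523354, "end": 25.723074, "speaker": 2},
    {"word": "bad", "start": 25.723074, "end": 26.042627, "speaker": 2},
]

for speaker, text in speaker_turns(words):
    print(f"[Speaker:{speaker}] {text}")
```

groupby only merges adjacent items, which is the behavior wanted here: a speaker who talks, is interrupted, and talks again produces two separate turns rather than one merged block.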

By default, Deepgram applies its general AI model, a good general-purpose model for everyday situations. To learn more about the customization possible with Deepgram's API, check out the Deepgram API Reference.

To improve readability, you can use a JSON processor to parse the response. In this example, we use jq and further improve readability by turning on Deepgram’s punctuation and utterances features:

Be sure to replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key. You can create an API Key in the Deepgram Console.

curl \
  --request POST \
  --url 'https://api.deepgram.com/v1/listen?diarize=true&punctuate=true&utterances=true' \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'content-type: audio/mp3' \
  --data-binary @Premier_broken-phone_numbers.mp3 | jq -r '.results.utterances[] | "[Speaker:\(.speaker)] \(.transcript)"'

When the file is finished processing, jq prints the following output:

[Speaker:0] Alright. So see here.
[Speaker:1] Hello, and thank you for calling premier phone service. Please be aware that this call may be recorded for quality and training purposes.
[Speaker:1] My name is Beth, and I will be assisting you today. How are you doing?
[Speaker:2] Not too bad. How are you today?
[Speaker:1] I'm doing well. Thank you. May I please have your name?
[Speaker:2] My name is Blake.
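
If jq is not available, the same formatting can be approximated in Python once the response has been parsed. This is a sketch only: format_utterances is a hypothetical helper, and it assumes the response has a results.utterances array whose items carry the speaker and transcript fields that the jq filter above reads.

```python
def format_utterances(response):
    """Return one "[Speaker:N] text" line per utterance, like the jq filter."""
    return [
        f"[Speaker:{utterance['speaker']}] {utterance['transcript']}"
        for utterance in response["results"]["utterances"]
    ]


# Hypothetical, shortened response containing two of the utterances shown above.
sample = {
    "results": {
        "utterances": [
            {"speaker": 2, "transcript": "Not too bad. How are you today?"},
            {"speaker": 1, "transcript": "I'm doing well. Thank you. May I please have your name?"},
        ]
    }
}
print("\n".join(format_utterances(sample)))
```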