Diarization
Deepgram’s Diarize feature recognizes speaker changes and assigns a speaker to each word in the transcript.
To learn more about diarization and multichannel audio, and to learn when to use Deepgram's Diarization or Multichannel feature, see Understanding when to Use the Multichannel and Diarization Features.
Use Cases
An example of a use case for Diarization includes:
Customers who use audio with multiple speakers and want transcripts to appear in a more readable format.
Enable Feature
To enable Diarization, add a diarize parameter set to true in the query string when you call Deepgram’s API:
diarize=true
We've recently released an improved version of Diarization (PRE-RECORDED only). Its advanced speaker separation accurately identifies speakers in complex audio streams, reducing errors where two speakers are identified as one. Additionally, the new diarizer can identify and count speakers more accurately, reducing instances where one speaker is split between two labels. The result is more readable transcripts.
With the release of improved diarization, we will be deprecating the diarize_version parameter and will be retiring the old diarizer on April 3, 2023. Currently, you can call the old diarizer using the following URL:
https://api.deepgram.com/v1/listen?tier=enhanced&diarize=true&diarize_version=2021-07-14.0
To access the latest diarizer, all you need to do is add diarize=true to your URL:
https://api.deepgram.com/v1/listen?diarize=true
We encourage you to switch to the improved Diarizer as soon as possible to ensure that you are taking advantage of the latest advancements in our technology.
To transcribe audio from a file on your computer, run the following cURL command in a terminal or your favorite API client.
Be sure to replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key. You can create an API Key in the Deepgram Console.
curl \
--request POST \
--header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
--header 'Content-Type: audio/wav' \
--data-binary @youraudio.wav \
--url 'https://api.deepgram.com/v1/listen?diarize=true'
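The same request can be sketched in Python using only the standard library. This is a minimal illustration rather than an official client: the function name build_diarize_request and the constant DEEPGRAM_LISTEN_URL are hypothetical names chosen for this example, and the endpoint, headers, and diarize=true parameter are taken from the cURL command above.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the cURL example above.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_diarize_request(api_key, audio_bytes, content_type="audio/wav"):
    """Build (but do not send) a POST request with diarize=true in the query string.

    Hypothetical helper for illustration; it mirrors the headers and body
    used in the cURL command.
    """
    query = urllib.parse.urlencode({"diarize": "true"})
    return urllib.request.Request(
        url=f"{DEEPGRAM_LISTEN_URL}?{query}",
        data=audio_bytes,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
        method="POST",
    )


# Sending the request performs a real network call, so it is left commented out:
# import json
# with open("youraudio.wav", "rb") as audio_file:
#     request = build_diarize_request("YOUR_DEEPGRAM_API_KEY", audio_file.read())
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL and headers before any audio leaves your machine.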
Analyze Response
For this example, we use an MP3 audio file that contains the beginning of a customer call with Premier Phone Services. If you would like to follow along, you can download it.
When the file is finished processing (often after only a few seconds), you’ll receive a JSON response that has the following basic structure:
{
  "metadata": {
    "transaction_key": "string",
    "request_id": "string",
    "sha256": "string",
    "created": "string",
    "duration": 0,
    "channels": 0
  },
  "results": {
    "channels": [
      {
        "alternatives": []
      }
    ]
  }
}
Let's look more closely at the alternatives object:
...
"alternatives":[
{
"transcript": "hello and thank you for calling premier phone service please be aware that this call may be recorded for quality and training purposes my name is beth and i will be assisting you today how are you doing not too bad how are you today i'm doing well thank you may i please have your name my name is blake...",
"confidence": 0.9917229,
"words": [
{"word":"hello","start":15.259043,"end":15.338787,"confidence":0.9721591,"speaker":0,"speaker_confidence":0.5853265},
{"word":"and","start":15.418532,"end":15.617893,"confidence":0.99877876,"speaker":0,"speaker_confidence":0.5853265},
{"word":"thank","start":15.617893,"end":15.777383,"confidence":0.99722916,"speaker":0,"speaker_confidence":0.5853265},
{"word":"you","start":15.777383,"end":15.9368725,"confidence":0.9976786,"speaker":0,"speaker_confidence":0.5853265},
{"word":"for","start":15.9368725,"end":16.096361,"confidence":0.9996244,"speaker":0,"speaker_confidence":0.5853265},
{"word":"calling","start":16.096361,"end":16.495085,"confidence":0.9992551,"speaker":0,"speaker_confidence":0.5853265},
{"word":"premier","start":16.495085,"end":16.73432,"confidence":0.916555,"speaker":0,"speaker_confidence":0.5853265},
{"word":"phone","start":16.73432,"end":16.973553,"confidence":0.89162886,"speaker":0,"speaker_confidence":0.5853265},
{"word":"service","start":16.973553,"end":17.372276,"confidence":0.985968,"speaker":0,"speaker_confidence":0.5853265},
{"word":"please","start":17.571638,"end":17.731129,"confidence":0.99930406,"speaker":0,"speaker_confidence":0.5853265},
{"word":"be","start":17.731129,"end":17.930489,"confidence":0.9928549,"speaker":0,"speaker_confidence":0.5853265},
{"word":"aware","start":17.930489,"end":18.08998,"confidence":0.9999629,"speaker":0,"speaker_confidence":0.5853265},
{"word":"that","start":18.08998,"end":18.209597,"confidence":0.99647444,"speaker":0,"speaker_confidence":0.5853265},
{"word":"this","start":18.209597,"end":18.408958,"confidence":0.999246,"speaker":0,"speaker_confidence":0.5853265},
{"word":"call","start":18.408958,"end":18.608318,"confidence":0.999718,"speaker":0,"speaker_confidence":0.5853265},
{"word":"may","start":18.608318,"end":18.727936,"confidence":0.91868997,"speaker":0,"speaker_confidence":0.5853265},
{"word":"be","start":18.727936,"end":19.007042,"confidence":0.99454945,"speaker":0,"speaker_confidence":0.5853265},
{"word":"recorded","start":19.007042,"end":19.286148,"confidence":0.99981517,"speaker":0,"speaker_confidence":0.5853265},
{"word":"for","start":19.286148,"end":19.445639,"confidence":0.9999294,"speaker":0,"speaker_confidence":0.5853265},
{"word":"quality","start":19.445639,"end":19.684872,"confidence":0.9993426,"speaker":0,"speaker_confidence":0.5853265},
{"word":"and","start":19.684872,"end":19.844362,"confidence":0.88343453,"speaker":0,"speaker_confidence":0.5853265},
{"word":"training","start":19.844362,"end":20.083595,"confidence":0.9999572,"speaker":0,"speaker_confidence":0.5853265},
{"word":"purposes","start":20.083595,"end":20.562063,"confidence":0.9995696,"speaker":0,"speaker_confidence":0.5853265},
{"word":"my","start":21.040531,"end":21.160149,"confidence":0.9997398,"speaker":0,"speaker_confidence":0.5853265},
{"word":"name","start":21.160149,"end":21.319637,"confidence":0.9984106,"speaker":0,"speaker_confidence":0.5853265},
{"word":"is","start":21.319637,"end":21.399384,"confidence":0.99928075,"speaker":0,"speaker_confidence":0.5853265},
{"word":"beth","start":21.558872,"end":21.67849,"confidence":0.99869114,"speaker":0,"speaker_confidence":0.5853265},
{"word":"and","start":21.798107,"end":21.917725,"confidence":0.99258536,"speaker":0,"speaker_confidence":0.5853265},
{"word":"i","start":21.957596,"end":22.077213,"confidence":0.99329174,"speaker":0,"speaker_confidence":0.5853265},
{"word":"will","start":22.077213,"end":22.236702,"confidence":0.9709169,"speaker":0,"speaker_confidence":0.5853265},
{"word":"be","start":22.236702,"end":22.436064,"confidence":0.99613035,"speaker":0,"speaker_confidence":0.45310462},
{"word":"assisting","start":22.436064,"end":22.755043,"confidence":0.99984825,"speaker":0,"speaker_confidence":0.45310462},
{"word":"you","start":22.755043,"end":22.994278,"confidence":0.9999144,"speaker":0,"speaker_confidence":0.45310462},
{"word":"today","start":22.994278,"end":23.113894,"confidence":0.99716824,"speaker":0,"speaker_confidence":0.45310462},
{"word":"how","start":23.472744,"end":23.55249,"confidence":0.9945734,"speaker":0,"speaker_confidence":0.45310462},
{"word":"are","start":23.55249,"end":23.672108,"confidence":0.9951912,"speaker":0,"speaker_confidence":0.45310462},
{"word":"you","start":23.672108,"end":23.791723,"confidence":0.99860495,"speaker":0,"speaker_confidence":0.45310462},
{"word":"doing","start":23.791723,"end":24.030956,"confidence":0.9998969,"speaker":0,"speaker_confidence":0.45310462},
{"word":"not","start":25.283688,"end":25.563297,"confidence":0.6391793,"speaker":1,"speaker_confidence":0.57565314},
{"word":"too","start":25.563297,"end":25.842907,"confidence":0.66280407,"speaker":1,"speaker_confidence":0.57565314},
{"word":"bad","start":25.842907,"end":26.322235,"confidence":0.8838786,"speaker":1,"speaker_confidence":0.57565314},
{"word":"how","start":26.482012,"end":26.601845,"confidence":0.998323,"speaker":1,"speaker_confidence":0.57565314},
{"word":"are","start":26.601845,"end":26.721678,"confidence":0.9984762,"speaker":1,"speaker_confidence":0.57565314},
{"word":"you","start":26.721678,"end":27.04123,"confidence":0.99331033,"speaker":1,"speaker_confidence":0.57565314},
{"word":"today","start":27.04123,"end":27.121119,"confidence":0.998047,"speaker":1,"speaker_confidence":0.57565314},
{"word":"i'm","start":28.079777,"end":28.2795,"confidence":0.9888536,"speaker":0,"speaker_confidence":0.35385597},
{"word":"doing","start":28.2795,"end":28.519163,"confidence":0.99951184,"speaker":0,"speaker_confidence":0.35385597},
{"word":"well","start":28.519163,"end":28.599052,"confidence":0.99951184,"speaker":0,"speaker_confidence":0.35385597},
{"word":"thank","start":28.758827,"end":28.95855,"confidence":0.99407357,"speaker":0,"speaker_confidence":0.35385597},
{"word":"you","start":28.95855,"end":29.45855,"confidence":0.95705205,"speaker":0,"speaker_confidence":0.35385597},
{"word":"may","start":29.677544,"end":29.717487,"confidence":0.99993396,"speaker":0,"speaker_confidence":0.35385597},
{"word":"i","start":29.797375,"end":29.997097,"confidence":0.9842502,"speaker":0,"speaker_confidence":0.35385597},
{"word":"please","start":29.997097,"end":30.156872,"confidence":0.99816424,"speaker":0,"speaker_confidence":0.35385597},
{"word":"have","start":30.156872,"end":30.31665,"confidence":0.9994549,"speaker":0,"speaker_confidence":0.35385597},
{"word":"your","start":30.31665,"end":30.476425,"confidence":0.99891937,"speaker":0,"speaker_confidence":0.35385597},
{"word":"name","start":30.476425,"end":30.596258,"confidence":0.9955912,"speaker":0,"speaker_confidence":0.35385597},
{"word":"my","start":31.72929,"end":31.888773,"confidence":0.9984237,"speaker":0,"speaker_confidence":0.35385597},
{"word":"name","start":31.888773,"end":32.048256,"confidence":0.998847,"speaker":0,"speaker_confidence":0.35385597},
{"word":"is","start":32.048256,"end":32.28748,"confidence":0.996455,"speaker":0,"speaker_confidence":0.35385597},
{"word":"blake","start":32.446964,"end":32.686188,"confidence":0.9848967,"speaker":0,"speaker_confidence":0.35385597},
...
]
}
]
In this response, we see that each alternative contains:
transcript: Transcript for the audio being processed.
confidence: Floating point value between 0 and 1 that indicates overall transcript reliability. Larger values indicate higher confidence.
words: Array of objects, one per word in the transcript. Each object contains the word, its start time and end time (in seconds) from the beginning of the audio stream, a word confidence value, a speaker identifier, and a speaker confidence value.
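Because every word carries its own speaker value, consecutive words can be collapsed into speaker turns. The following Python sketch shows one way to do that. The function name speaker_turns is hypothetical, and the sample list is a five-word excerpt of the response above with the confidence fields omitted for brevity.

```python
def speaker_turns(words):
    """Collapse consecutive words that share a speaker label into turns.

    Hypothetical helper: `words` is a list shaped like the `words` array
    in the response above.
    """
    turns = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + word["word"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed (or this is the first word): start a new turn.
            turns.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "text": word["word"],
            })
    return turns


# Excerpt of the example response (confidence fields omitted).
sample_words = [
    {"word": "you", "start": 23.672108, "end": 23.791723, "speaker": 0},
    {"word": "doing", "start": 23.791723, "end": 24.030956, "speaker": 0},
    {"word": "not", "start": 25.283688, "end": 25.563297, "speaker": 1},
    {"word": "too", "start": 25.563297, "end": 25.842907, "speaker": 1},
    {"word": "bad", "start": 25.842907, "end": 26.322235, "speaker": 1},
]

for turn in speaker_turns(sample_words):
    print(f"[Speaker:{turn['speaker']}] {turn['start']:.2f}-{turn['end']:.2f} {turn['text']}")
```

With a full response parsed into a dictionary named response, the input list would be response["results"]["channels"][0]["alternatives"][0]["words"].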
By default, Deepgram applies its general AI model, which is a good general-purpose model for everyday situations. To learn more about the customization possible with Deepgram's API, check out the Deepgram API Reference.
Format Response
To improve readability, you can use a JSON processor to parse the JSON. In this example, we use jq and further improve readability by turning on Deepgram’s punctuation and utterances features.
Be sure to replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key. You can create an API Key in the Deepgram Console.
curl \
--request POST \
--url 'https://api.deepgram.com/v1/listen?diarize=true&punctuate=true&utterances=true' \
--header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
--header 'content-type: audio/mp3' \
--data-binary @Premier_broken-phone_numbers.mp3 | jq -r '.results.utterances[] | "[Speaker:\(.speaker)] \(.transcript)"'
When the file is finished processing, you’ll see the following output:
[Speaker:0] Hello, and thank you for calling premier phone service. Please be aware that this call may be recorded for quality and training purposes.
[Speaker:0] My name is Beth, and I will be assisting you today. How are you doing?
[Speaker:1] Not too bad. How are you today?
[Speaker:0] I'm doing well. Thank you. May I please have your name?
[Speaker:1] My name is Blake...
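If jq is not available, the same formatting can be approximated in Python. This sketch assumes only what the jq filter above relies on: that the parsed response has a results.utterances array whose items include speaker and transcript fields. The function name format_utterances is hypothetical, and the sample dictionary is a trimmed stand-in for a real response, which contains more fields than shown.

```python
def format_utterances(response):
    """Return one "[Speaker:N] text" line per utterance.

    Hypothetical helper mirroring the jq filter
    '.results.utterances[] | "[Speaker:\\(.speaker)] \\(.transcript)"'.
    """
    return [
        f"[Speaker:{utterance['speaker']}] {utterance['transcript']}"
        for utterance in response["results"]["utterances"]
    ]


# Trimmed stand-in for a parsed response; only the fields used here are included.
sample_response = {
    "results": {
        "utterances": [
            {"speaker": 1, "transcript": "Not too bad. How are you today?"},
            {"speaker": 0, "transcript": "I'm doing well. Thank you. May I please have your name?"},
        ]
    }
}

for line in format_utterances(sample_response):
    print(line)
```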