When To Use Multichannel and Diarization
Compare Deepgram’s Multichannel and Diarization features to better understand when to use each feature.
When using Deepgram’s API, you have access to our Multichannel and Diarization features, which are useful in different scenarios.
Comparing Multichannel and Diarization
Both features help you attribute speech in a transcript, but they work differently: Multichannel separates the transcript by audio channel, while Diarization separates it by speaker.
Multichannel Audio
Multichannel audio is audio that has multiple separate audio channels, and the audio in each channel is distinct.
You may have heard of stereo sound, which is produced from two audio channels (one for the left and one for the right) and which sounds wider and deeper than mono sound. Stereo sound is multichannel sound when the left and right channels contain different audio. For example, one channel might carry voices and the other sound effects; each channel might carry one person’s voice (as in a telemedicine visit between a patient and their doctor); or each channel might carry a group of speakers (as in a podcast where the interviewers are on one channel and the guests are on the other).
Multichannel sound can also have more than two channels. When recording multiple people speaking (for example, on a company-wide conference call), separating different speakers’ voices into individual audio channels can make it easier to focus on one speaker when reviewing the audio file.
Diarization
Diarization is the process of separating an audio stream into segments according to speaker identity, regardless of channel. Your audio may have two speakers on one audio channel, one speaker on each of two audio channels, or multiple speakers on one audio channel and a single speaker on each of several other channels. In every case, diarization identifies the speakers regardless of audio channel.
In short, diarization focuses on giving information about different speakers, while multichannel focuses on identifying different audio channels.
Deepgram’s Multichannel Feature
You can use Deepgram’s Multichannel feature by sending multichannel=true in a request via the API or an SDK. Doing so tells Deepgram to transcribe each audio channel independently, and the response contains a separate entry for each channel in the audio.
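As a minimal sketch, the request URL and the per-channel transcripts could be handled as shown here. The endpoint URL and the response shape (results.channels[i].alternatives[0].transcript) are assumptions about Deepgram's pre-recorded audio API, and the helper names and sample response are hypothetical, so check them against the API reference before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint for pre-recorded audio; confirm it in the Deepgram API reference.
LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_listen_url(multichannel=False, diarize=False):
    """Return the request URL with the selected feature flags as query parameters."""
    params = {}
    if multichannel:
        params["multichannel"] = "true"
    if diarize:
        params["diarize"] = "true"
    query = urlencode(params)
    return f"{LISTEN_URL}?{query}" if query else LISTEN_URL


def transcripts_by_channel(response):
    """Return one transcript string per audio channel, in channel order."""
    channels = response["results"]["channels"]
    return [channel["alternatives"][0]["transcript"] for channel in channels]


# Hypothetical, trimmed-down response for a two-channel file (assumed shape).
sample_response = {
    "results": {
        "channels": [
            {"alternatives": [{"transcript": "hello how can i help"}]},
            {"alternatives": [{"transcript": "i would like to order flowers"}]},
        ]
    }
}

print(build_listen_url(multichannel=True))
for index, text in enumerate(transcripts_by_channel(sample_response)):
    print(f"channel {index}: {text}")
```

The helper only builds the URL; sending the audio and the authorization header is left to whichever HTTP client or SDK you already use.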
Deepgram’s Diarization Feature
You can use Deepgram’s Diarization feature by sending diarize=true in a request via the API or an SDK. Doing so tells Deepgram that you want to know which person spoke each word in the transcript. Deepgram returns a response that labels each word with a speaker property identifying who spoke it: speaker: 0, speaker: 1, and so on.
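One way to read a diarized response is to collapse consecutive words from the same speaker into turns. The sketch below assumes each word object carries word, start, end, and speaker fields; the sample words and the function name are hypothetical.

```python
from itertools import groupby


def speaker_turns(words):
    """Collapse consecutive words from the same speaker into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns


# Hypothetical word list, shaped like the words of one diarized transcript.
sample_words = [
    {"word": "good", "start": 0.0, "end": 0.3, "speaker": 0},
    {"word": "evening", "start": 0.3, "end": 0.8, "speaker": 0},
    {"word": "thanks", "start": 0.9, "end": 1.2, "speaker": 1},
    {"word": "dana", "start": 1.2, "end": 1.5, "speaker": 1},
    {"word": "welcome", "start": 1.6, "end": 2.0, "speaker": 0},
]

for speaker, text in speaker_turns(sample_words):
    print(f"speaker {speaker}: {text}")
```

Because groupby only merges adjacent items, a speaker who talks, is interrupted, and talks again produces two separate turns, which matches how a conversation reads.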
Combining Multichannel with Diarization
Combining Deepgram’s Multichannel and Diarization features can provide very specific, useful information about the people speaking in multiple audio channels. For example, if your audio contains two audio channels with several people speaking on one channel and several other people speaking on the second channel, using Multichannel will allow you to split the audio by channel, while Diarization will allow you to identify the different people speaking on each channel.
Before you combine Multichannel and Diarization, it’s important to understand how each feature works individually. Otherwise, you may have difficulty understanding your returned transcript.
For example, if your audio has two different people speaking, each on a different audio channel, using both Multichannel and Diarization will return a distinct transcript for each channel, with both speakers identified as the first speaker. This may seem unusual, but it is correct: because only one person is speaking on each audio channel, each person is the one speaker (speaker: 0) on their channel.
Another example: You may have an audio file that you believe is multichannel, so you expect Deepgram to return multiple different transcripts, but you receive a response that contains separate channels with identical transcripts. In this case, you may have encountered a joint stereo audio file. Sometimes, to save file space when creating or converting an audio file, multichannel audio will undergo a process that mixes the channels into one main channel. Deepgram will still identify that the audio contains two channels, but the returned transcript for each channel will be the same (all speaking parts, regardless of how many speakers the audio contains, will be combined as one transcript).
Use Cases
To really understand when to use Multichannel and when to use Diarization, let’s explore some possible scenarios.
Two audio channels with the same person speaking on each channel
A person is doing a sound check to see whether sound is coming from two different inputs
In this scenario, because the same person is speaking on both audio channels, Diarization would not be useful. However, it could be useful to break the transcript into separate audio channels using Deepgram’s Multichannel feature. If you do so, the response contains a separate transcript for each channel.
Two audio channels with one person on each channel
A florist is taking an order from a customer
In this scenario, because only one individual is on each channel, Diarization would not be useful (each speaker would be returned as speaker: 0 since they are on separate channels). However, it could be useful to break the transcript into separate audio channels using Deepgram’s Multichannel feature. If you do so, the response contains a separate transcript for each channel.
One audio channel with two people
A news broadcast has multiple presenters
In this scenario, because only one audio channel exists, Multichannel will probably not provide you with enough information. However, Diarization could provide information to help you identify each person speaking. In particular, analyzing the start and end properties alongside the speaker information can help you find sections of audio where people talk over each other, which commonly occurs in natural conversation. If you use Diarization, each word in the returned transcript is labelled with its speaker.
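The overlap analysis described above could be sketched as follows. It assumes each word object has start, end, and speaker fields, with times in seconds; the function name and sample words are hypothetical, and whether overlapping timestamps actually appear depends on the audio.

```python
def overlapping_speech(words):
    """Return (earlier, later) word pairs from different speakers whose times overlap."""
    ordered = sorted(words, key=lambda w: w["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start, so no later word can overlap `first`
            if second["speaker"] != first["speaker"]:
                overlaps.append((first, second))
    return overlaps


# Hypothetical words from a single-channel broadcast with two presenters.
sample_words = [
    {"word": "tonight", "start": 0.0, "end": 0.6, "speaker": 0},
    {"word": "yes", "start": 0.5, "end": 0.7, "speaker": 1},
    {"word": "weather", "start": 0.8, "end": 1.2, "speaker": 0},
]

for first, second in overlapping_speech(sample_words):
    print(f"speaker {second['speaker']} started '{second['word']}' "
          f"before speaker {first['speaker']} finished '{first['word']}'")
```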
Two channels with three people on one channel and one person on the other channel
In this scenario, you could combine Multichannel and Diarization to provide useful information. Here, Multichannel would separate the transcript by audio input channels, and Diarization would help you identify which person was speaking on the first channel.
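When both features are enabled, speaker numbers restart on every channel, so a speaker is only unique in combination with the channel index. A minimal sketch of grouping words by that pair, assuming the response nests words under results.channels[i].alternatives[0].words (the sample response and function name are hypothetical):

```python
from collections import defaultdict


def words_by_channel_and_speaker(response):
    """Map (channel_index, speaker) to the words spoken, preserving word order."""
    grouped = defaultdict(list)
    for channel_index, channel in enumerate(response["results"]["channels"]):
        for word in channel["alternatives"][0]["words"]:
            grouped[(channel_index, word["speaker"])].append(word["word"])
    return dict(grouped)


# Hypothetical response: three people on channel 0, one person on channel 1.
sample_response = {
    "results": {
        "channels": [
            {"alternatives": [{"words": [
                {"word": "welcome", "speaker": 0},
                {"word": "thanks", "speaker": 1},
                {"word": "hello", "speaker": 2},
                {"word": "everyone", "speaker": 0},
            ]}]},
            {"alternatives": [{"words": [
                {"word": "hi", "speaker": 0},
                {"word": "all", "speaker": 0},
            ]}]},
        ]
    }
}

for (channel, speaker), spoken in words_by_channel_and_speaker(sample_response).items():
    print(f"channel {channel}, speaker {speaker}: {' '.join(spoken)}")
```

Keying on the pair keeps speaker 0 on the first channel distinct from speaker 0 on the second channel, even though both carry the same label.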
Troubleshooting
Read on for explanations of some common results that may seem unusual.
When using both the Multichannel and Diarization features with two people, both people are marked as the same speaker
If your audio has two different people speaking, each on a different audio channel, using both Multichannel and Diarization will return a distinct transcript for each channel, with both speakers identified as the first speaker. This may seem unusual, but it is correct: because only one person is speaking on each audio channel, each person is the one speaker (speaker: 0) on their specific channel.
When using the Multichannel feature, Deepgram returns the same transcript on each channel
Sometimes you believe an audio file is multichannel and expect Deepgram to return a different transcript for each channel, but you receive a response that contains separate channels with identical transcripts. In this case, you may have encountered a joint stereo audio file. Sometimes, to save file space when creating or converting an audio file, multichannel audio will undergo a process that mixes the channels into one main channel. Deepgram will still identify that the audio contains two channels, but the returned transcript for each channel will be the same (all speaking parts, regardless of how many speakers the audio contains, will be combined as one transcript).
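A quick way to spot this situation is to compare the transcripts returned for each channel. This is a sketch only: it assumes the response shape results.channels[i].alternatives[0].transcript, and the function name and sample responses are hypothetical.

```python
def channels_look_identical(response):
    """Return True when there are at least two channels and all transcripts match."""
    transcripts = [
        channel["alternatives"][0]["transcript"].strip().lower()
        for channel in response["results"]["channels"]
    ]
    return len(transcripts) > 1 and len(set(transcripts)) == 1


# Hypothetical response from a file whose channels were mixed together.
mixed_response = {
    "results": {
        "channels": [
            {"alternatives": [{"transcript": "hi i need a bouquet sure what kind"}]},
            {"alternatives": [{"transcript": "hi i need a bouquet sure what kind"}]},
        ]
    }
}

# Hypothetical response from a file with genuinely separate channels.
separate_response = {
    "results": {
        "channels": [
            {"alternatives": [{"transcript": "hi i need a bouquet"}]},
            {"alternatives": [{"transcript": "sure what kind"}]},
        ]
    }
}

print(channels_look_identical(mixed_response))
print(channels_look_identical(separate_response))
```

Identical transcripts are a hint rather than proof, since the same person may simply be recorded on both channels, as in the sound check scenario.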