When To Use Multichannel and Diarization

When using Deepgram’s API, you have access to our Multichannel and Diarization features, which are useful in different scenarios.

Comparing Multichannel and Diarization

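Multichannel and Diarization both help you work out who said what in a transcript, but they approach the problem differently: Multichannel separates the transcript by audio channel, while Diarization separates it by speaker.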

Multichannel Audio

Multichannel audio is audio that has multiple separate audio channels, and the audio in each channel is distinct.

You may have heard of stereo sound, which is sound produced from two different audio channels (one for the left and one for the right) and which sounds wider and deeper than mono sound. Stereo sound can be multichannel sound if the left and right channels contain different audio. This could consist of one channel for voices and one for sound effects, one channel for each person’s voice (for example, in a telemedicine visit between a patient and their doctor), or one channel for one group of speakers and another channel for a second group (for example, in a podcast where multiple interviewers are on one channel and multiple guests are on a second channel).

Multichannel sound can also have more than two channels. When recording multiple people speaking (for example, on a company-wide conference call), separating different speakers’ voices into individual audio channels can make it easier to focus on one speaker when reviewing the audio file.
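To check how many channels a recording actually has before deciding which feature to use, something like the following minimal sketch may help. It assumes the audio is a WAV file and uses only Python's standard `wave` module; `count_channels` and `make_silent_wav` are hypothetical helper names introduced here for illustration.

```python
import io
import wave


def count_channels(wav_file) -> int:
    """Return the number of audio channels in a WAV file (path or file object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnchannels()


def make_silent_wav(channels: int, frames: int = 8000) -> io.BytesIO:
    """Build a short silent 16-bit WAV in memory, purely for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(8000)
        wav.writeframes(b"\x00\x00" * channels * frames)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    print(count_channels(make_silent_wav(2)))
```

Note that the channel count alone does not tell you whether the channels contain different audio; it only tells you how many channels the file declares.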

Diarization

Diarization is the process of separating an audio stream into segments according to speaker identity, regardless of channel. Your audio may have two speakers on one audio channel, one speaker on each of two audio channels, or multiple speakers on one audio channel and one speaker on each of several other channels. In every case, diarization identifies the speakers regardless of audio channel.

In short, diarization focuses on giving information about different speakers, while multichannel focuses on identifying different audio channels.

Deepgram’s Multichannel Feature

You can use Deepgram’s Multichannel feature by sending multichannel=true in a request via the API or an SDK. This tells Deepgram to transcribe each audio channel independently, and Deepgram returns a response whose channels array contains one entry for each audio channel:

JSON

"channels": [
  {
    alternatives: [
      {
        transcript: "parker scarves how may i help you",
        confidence: ...,
        words: ...
      }
    ]
  },
  {
    alternatives: [
      {
        transcript: "i got a scarf online for my wife",
        confidence: ...,
        words: ...
      }
    ]
  }
]
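As a rough sketch of how this might look from client code, the Python below builds a request URL carrying the feature flags and pulls one transcript per channel out of a parsed response. The endpoint URL is an assumption that does not appear in this guide, so confirm it against the API reference. `build_listen_url` and `transcripts_by_channel` are hypothetical helpers, and the `channels` list is assumed to be shaped like the example above once parsed into Python dictionaries.

```python
from urllib.parse import urlencode

# Assumed endpoint for pre-recorded audio; verify against the API reference.
LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_listen_url(multichannel: bool = False, diarize: bool = False) -> str:
    """Return the request URL with the selected features as query parameters."""
    params = {}
    if multichannel:
        params["multichannel"] = "true"
    if diarize:
        params["diarize"] = "true"
    query = urlencode(params)
    return f"{LISTEN_URL}?{query}" if query else LISTEN_URL


def transcripts_by_channel(channels: list) -> list:
    """Return the first alternative's transcript for each channel, in order."""
    return [channel["alternatives"][0]["transcript"] for channel in channels]


if __name__ == "__main__":
    print(build_listen_url(multichannel=True))
```

Sending the request itself (authentication, audio upload) is left out on purpose, since those details depend on the client or SDK you use.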

Deepgram’s Diarization Feature

You can use Deepgram’s Diarization feature by sending diarize=true in a request via the API or an SDK. This tells Deepgram that you want to know which person spoke each word in the transcript, and Deepgram returns a response in which each word carries a speaker property identifying who spoke it: speaker: 0, speaker: 1, and so on.

JSON

[
  {
    alternatives: [
      {
        transcript: "parker scarves how may i help you i got a scarf online for my wife",
        confidence: 0.94873047,
        words: [
          {
            word: 'parker',
            start: 1.1770647,
            end: 1.4563681,
            confidence: 0.7792969,
            speaker: 0
          },
          {
            word: 'scarves',
            start: 1.6558706,
            end: 1.8553731,
            confidence: 0.5029297,
            speaker: 0
          },
          {
            word: 'how',
            start: 2.0548756,
            end: 2.174577,
            confidence: 0.99902344,
            speaker: 0
          },
          {
            word: 'may',
            start: 2.174577,
            end: 2.254378,
            confidence: 0.9995117,
            speaker: 0
          },
          {
            word: 'i',
            start: 2.3341792,
            end: 2.4538805,
            confidence: 0.9980469,
            speaker: 0
          },
          {
            word: 'help',
            start: 2.4538805,
            end: 2.733184,
            confidence: 1,
            speaker: 0
          },
          {
            word: 'you',
            start: 2.733184,
            end: 2.892786,
            confidence: 0.9838867,
            speaker: 0
          },
          {
            word: 'i',
            start: 4.089801,
            end: 4.209502,
            confidence: 0.54589844,
            speaker: 1
          },
          {
            word: 'got',
            start: 4.209502,
            end: 4.329204,
            confidence: 0.6279297,
            speaker: 1
          },
          {
            word: 'a',
            start: 4.329204,
            end: 4.6883082,
            confidence: 0.9580078,
            speaker: 1
          },
          {
            word: 'scarf',
            start: 4.6883082,
            end: 5.1883082,
            confidence: 0.9760742,
            speaker: 1
          },
          {
            word: 'online',
            start: 5.2469153,
            end: 5.526219,
            confidence: 0.6933594,
            speaker: 1
          },
          {
            word: 'for',
            start: 5.526219,
            end: 5.6459203,
            confidence: 0.7602539,
            speaker: 1
          },
          {
            word: 'my',
            start: 5.6459203,
            end: 5.8454227,
            confidence: 0.98876953,
            speaker: 1
          },
          {
            word: 'wife',
            start: 5.8454227,
            end: 6.044925,
            confidence: 0.7709961,
            speaker: 1
          }
        ]
      }
    ]
  }
]
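Word-level speaker labels are usually easier to read once consecutive words from the same speaker are merged into turns. The following is a minimal sketch of that post-processing step, assuming the `words` array has been parsed into a list of Python dictionaries with `word` and `speaker` keys as in the example above; `speaker_turns` is a hypothetical helper, not part of any SDK.

```python
from itertools import groupby


def speaker_turns(words: list) -> list:
    """Collapse consecutive words from the same speaker into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns


if __name__ == "__main__":
    sample = [
        {"word": "help", "speaker": 0},
        {"word": "you", "speaker": 0},
        {"word": "i", "speaker": 1},
        {"word": "got", "speaker": 1},
    ]
    for speaker, text in speaker_turns(sample):
        print(f"speaker {speaker}: {text}")
```

Because `groupby` only merges adjacent items, a speaker who talks, is interrupted, and then resumes produces two separate turns, which matches how a conversation reads.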

Combining Multichannel with Diarization

Combining Deepgram’s Multichannel and Diarization features can provide very specific, useful information about the people speaking in multiple audio channels. For example, if your audio contains two audio channels with several people speaking on one channel and several other people speaking on the second channel, using Multichannel will allow you to split the audio by channel, while Diarization will allow you to identify the different people speaking on each channel.

Before you combine Multichannel and Diarization, it’s important to understand how each feature works individually. Otherwise, you may have difficulty understanding your returned transcript.

For example, if your audio has two different people speaking, each on a different audio channel, using both Multichannel and Diarization will return two distinct transcripts, one for each channel, with both speakers identified as the first speaker. Having both speakers identified as the first speaker may seem unusual, but it is correct: because only one person is speaking on each distinct audio channel, each person is the one speaker (speaker: 0) on their channel.

Another example: You may have an audio file that you believe is multichannel, so you expect Deepgram to return multiple different transcripts, but you receive a response that contains separate channels with identical transcripts. In this case, you may have encountered a joint stereo audio file. To save file space when creating or converting an audio file, multichannel audio sometimes undergoes a process that mixes the channels into one main channel. Deepgram will still identify that the audio contains two channels, but the returned transcript for each channel will be the same (all speaking parts, regardless of how many speakers the audio contains, will be combined as one transcript).

Use Cases

To really understand when to use Multichannel and when to use Diarization, let’s explore some possible scenarios.

Two audio channels with the same person speaking on each channel

A person is doing a sound check to see whether sound is coming from two different inputs

JSON

transcript: "hello and welcome to the sound test we're starting from the left channel then follows right channel left channel right channel left channel right channel and once again let channel know alright thank you so much listening to me and have a nice day"

In this scenario, because the same person is speaking on both audio channels, Diarization would not be useful. However, it could be useful to break the transcript into separate audio channels using Deepgram’s Multichannel feature. If you do so, you should see the following transcript returned:

JSON

"channels": [
  {
    alternatives: [
      {
        transcript: "hello and welcome to sound test we're starting from the left channel and follows left channel left channel and once again let channel know thank you so much for listening to me and have a nice day",
        confidence: 0.9472656,
        words: [Array]
      }
    ]
  },
  {
    alternatives: [
      {
        transcript: "hello and welcome to the sound test we're starting from there then follows right channel right channel right channel and once again right channel thank you so much and have a nice day",
        confidence: 0.9326172,
        words: [Array]
      }
    ]
  }
]

Two audio channels with one person on each channel

A florist is taking an order from a customer

JSON

transcript: "thank you for calling marcus flowers hello i'd like to order flowers and i think you have what i'm looking for i'd be happy to take care of your order may i have your name please"

In this scenario, because only one individual is on each channel, Diarization would not be useful (each speaker would be returned as speaker: 0 since they are on separate channels). However, it could be useful to break the transcript into separate audio channels using Deepgram’s Multichannel feature. If you do so, you should see the following transcript returned:

JSON

"channels": [
  {
    alternatives: [
      {
        transcript: "thank you for calling marcus flowers i'd be happy to take care of your order may i have your name please",
        confidence: 0.9819336,
        words: [{
            word: 'thank',
            start: 0.94,
            end: 1.06,
            confidence: 0.99658203,
            speaker: 0
          },
          ...
          ]
      }
    ]
  },
  {
    alternatives: [
      {
        transcript: "hello i'd like to order flowers and i think you have what i'm looking for",
        confidence: 0.9916992,
        words: [{
            word: 'hello',
            start: 4.0095854,
            end: 4.049482,
            confidence: 0.9897461,
            speaker: 0
          },
          ...
          ]
      }
    ]
  }
]
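When each person has their own channel, the channel index can stand in for the speaker, and the word timestamps can be used to rebuild the order of the conversation. The sketch below assumes the `channels` array has been parsed into Python dictionaries shaped like the response above. `interleave_channels` is a hypothetical helper, and the timings in the demonstration are illustrative rather than taken from a real response.

```python
from itertools import groupby


def interleave_channels(channels: list) -> list:
    """Merge per-channel words into time-ordered (channel_index, text) turns."""
    tagged = [
        (word["start"], index, word["word"])
        for index, channel in enumerate(channels)
        for word in channel["alternatives"][0]["words"]
    ]
    tagged.sort()  # order by start time, then by channel index on ties
    return [
        (index, " ".join(text for _, _, text in group))
        for index, group in groupby(tagged, key=lambda item: item[1])
    ]


if __name__ == "__main__":
    # Illustrative timings only.
    demo = [
        {"alternatives": [{"words": [
            {"word": "thank", "start": 0.94},
            {"word": "you", "start": 1.10},
            {"word": "certainly", "start": 6.00},
        ]}]},
        {"alternatives": [{"words": [
            {"word": "hello", "start": 4.01},
        ]}]},
    ]
    for channel_index, text in interleave_channels(demo):
        print(f"channel {channel_index}: {text}")
```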

One audio channel with two people

A news broadcast has multiple presenters

JSON

transcript: "from npr news this is all things considered i'm robert siegel and i'm michelle norris"

In this scenario, because only one audio channel exists, Multichannel will probably not provide you with enough information. However, Diarization could provide information to help you identify each person speaking. In particular, analyzing both start and end properties alongside the speaker information can help you find sections of audio where people talk over each other, which commonly occurs in natural conversation. If you use Diarization, you should see the following transcript returned:

JSON

"channels": [
  {
    alternatives: [
      {
        transcript: "from npr news this is all things considered i'm robert siegel and i'm michelle norris",
        confidence: 0.9794922,
        words: [
          {
            word: 'from',
            start: 0.81824785,
            end: 0.8980769,
            confidence: 0.99658203,
            speaker: 0
          },
          {
            word: 'npr',
            start: 1.2573076,
            end: 1.3770512,
            confidence: 0.95947266,
            speaker: 0
          },
          {
            word: 'news',
            start: 1.4967948,
            end: 1.736282,
            confidence: 0.99609375,
            speaker: 0
          },
          {
            word: 'this',
            start: 1.9358547,
            end: 2.0555983,
            confidence: 0.9897461,
            speaker: 0
          },
          {
            word: 'is',
            start: 2.0555983,
            end: 2.2152565,
            confidence: 0.9814453,
            speaker: 0
          },
          {
            word: 'all',
            start: 2.2152565,
            end: 2.414829,
            confidence: 0.9902344,
            speaker: 0
          },
          {
            word: 'things',
            start: 2.414829,
            end: 2.853889,
            confidence: 0.9941406,
            speaker: 0
          },
          {
            word: 'considered',
            start: 2.853889,
            end: 3.2929487,
            confidence: 0.9785156,
            speaker: 0
          },
          {
            word: "i'm",
            start: 3.452607,
            end: 3.532436,
            confidence: 0.9863281,
            speaker: 0
          },
          {
            word: 'robert',
            start: 3.6521795,
            end: 3.8916667,
            confidence: 0.98876953,
            speaker: 0
          },
          {
            word: 'siegel',
            start: 4.01141,
            end: 4.210983,
            confidence: 0.49243164,
            speaker: 0
          },
          {
            word: 'and',
            start: 4.370641,
            end: 4.45047,
            confidence: 0.9794922,
            speaker: 1
          },
          {
            word: "i'm",
            start: 4.570214,
            end: 5.049188,
            confidence: 0.4260254,
            speaker: 1
          },
          {
            word: 'michelle',
            start: 5.049188,
            end: 5.208846,
            confidence: 0.69384766,
            speaker: 1
          },
          {
            word: 'norris',
            start: 5.32859,
            end: 5.82859,
            confidence: 0.9379883,
            speaker: 1
          }
        ]
      }
    ]
  }
]
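The idea of using start, end, and speaker together to spot people talking over each other can be sketched as follows. This is an illustrative approach rather than a built-in feature: `find_overlaps` is a hypothetical helper, and it assumes the `words` array has been parsed into Python dictionaries with `word`, `start`, `end`, and `speaker` keys.

```python
def find_overlaps(words: list) -> list:
    """Return pairs of words from different speakers whose time ranges overlap."""
    ordered = sorted(words, key=lambda w: w["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # every later word starts later still, so stop here
            if second["speaker"] != first["speaker"]:
                overlaps.append((first["word"], second["word"]))
    return overlaps


if __name__ == "__main__":
    # Illustrative timings only.
    sample = [
        {"word": "so", "start": 0.0, "end": 0.5, "speaker": 0},
        {"word": "wait", "start": 0.3, "end": 0.7, "speaker": 1},
        {"word": "yes", "start": 0.7, "end": 0.9, "speaker": 0},
    ]
    print(find_overlaps(sample))
```

Words that merely touch (one ends exactly when the next starts) are not counted as overlapping, which avoids flagging ordinary back-to-back speech.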

Two channels with three people on one channel and one person on the other channel

In this scenario, you could combine Multichannel and Diarization to provide useful information. Here, Multichannel would separate the transcript by audio input channels, and Diarization would help you identify which person was speaking on the first channel.
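With both features enabled, speaker labels are reported within each channel, so speaker: 0 can appear on more than one channel. A minimal sketch for summarizing which speakers appear on which channel is shown below. It assumes the `channels` array has been parsed into Python dictionaries whose words carry a `speaker` key, and `speakers_per_channel` is a hypothetical helper name.

```python
def speakers_per_channel(channels: list) -> dict:
    """Map each channel index to the sorted speaker labels found in its words."""
    return {
        index: sorted(
            {word["speaker"] for word in channel["alternatives"][0]["words"]}
        )
        for index, channel in enumerate(channels)
    }


if __name__ == "__main__":
    # Illustrative data: three people on the first channel, one on the second.
    demo = [
        {"alternatives": [{"words": [
            {"word": "hi", "speaker": 0},
            {"word": "hello", "speaker": 1},
            {"word": "hey", "speaker": 2},
        ]}]},
        {"alternatives": [{"words": [
            {"word": "welcome", "speaker": 0},
        ]}]},
    ]
    print(speakers_per_channel(demo))
```

To refer to one specific person across the whole recording, pair the channel index with the speaker label, for example (0, 2) for the third speaker on the first channel.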

Troubleshooting

Read on for explanations of some common scenarios that may seem unusual.

When using both the Multichannel and Diarization features with two people, both people are marked as the same speaker

If your audio has two different people speaking, each on a different audio channel, using both Multichannel and Diarization will return two distinct transcripts, one for each channel, with both speakers identified as the first speaker. Having both speakers identified as the first speaker may seem unusual, but it is correct: because only one person is speaking on each distinct audio channel, each person is the one speaker (speaker: 0) on their specific channel:

JSON

"channels": [
  {
    alternatives: [
      {
        transcript: "thank you for calling marcus flowers i'd be happy to take care of your order may i have your name please",
        confidence: 0.9819336,
        words: [{
            word: 'thank',
            start: 0.94,
            end: 1.06,
            confidence: 0.99658203,
            speaker: 0
          },
          ...
          ]
      }
    ]
  },
  {
    alternatives: [
      {
        transcript: "hello i'd like to order flowers and i think you have what i'm looking for",
        confidence: 0.9916992,
        words: [{
            word: 'hello',
            start: 4.0095854,
            end: 4.049482,
            confidence: 0.9897461,
            speaker: 0
          },
          ...
          ]
      }
    ]
  }
]

When using the Multichannel feature, Deepgram returns the same transcript on each channel

You may believe an audio file is multichannel and expect Deepgram to return multiple different transcripts, but receive a response that contains separate channels with identical transcripts. In this case, you may have encountered a joint stereo audio file. To save file space when creating or converting an audio file, multichannel audio sometimes undergoes a process that mixes the channels into one main channel. Deepgram will still identify that the audio contains two channels, but the returned transcript for each channel will be the same (all speaking parts, regardless of how many speakers the audio contains, will be combined as one transcript):

JSON

"channels": [
  {
    alternatives: [
      {
        transcript:
          'parker scarves how may i help you i got a scarf online for my wife',
        confidence: 0.9453125,
        words: [Array],
      },
    ],
  },
  {
    alternatives: [
      {
        transcript:
          'parker scarves how may i help you i got a scarf online for my wife',
        confidence: 0.9453125,
        words: [Array],
      },
    ],
  },
]
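A quick way to spot this situation in code is to compare the transcripts returned for each channel. The sketch below assumes the `channels` array has been parsed into Python dictionaries shaped like the response above; `channels_look_identical` is a hypothetical helper name.

```python
def channels_look_identical(channels: list) -> bool:
    """Return True when there are several channels and all transcripts match."""
    transcripts = [c["alternatives"][0]["transcript"] for c in channels]
    return len(transcripts) > 1 and len(set(transcripts)) == 1


if __name__ == "__main__":
    text = "parker scarves how may i help you i got a scarf online for my wife"
    mixed_down = [
        {"alternatives": [{"transcript": text}]},
        {"alternatives": [{"transcript": text}]},
    ]
    print(channels_look_identical(mixed_down))
```

An exact string comparison is a deliberately strict check. If the channels differ slightly in level or noise, the transcripts may differ by a word or two even though the audio was mixed down, so a similarity threshold might be more appropriate in practice.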