Multichannel

multichannel boolean Default: false

Pre-recorded Streaming All available languages

When set to true, you will receive separate transcripts for each channel.

Enable Feature

To enable Multichannel, add the multichannel parameter, set to true, to the query string when you call Deepgram’s API:

multichannel=true

You can use a maximum of 20 channels.

To transcribe audio from a file on your computer, run the following curl command in a terminal or your favorite API client.

cURL

$ curl \
>   --request POST \
>   --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
>   --header 'Content-Type: audio/mp3' \
>   --data-binary @youraudio.mp3 \
>   --url 'https://api.deepgram.com/v1/listen?multichannel=true'

Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.
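
For readers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal illustration rather than an official client: the helper name build_request is hypothetical, and the endpoint, headers, and file name are copied from the curl command above.

```python
import urllib.request

# Endpoint and query string taken from the curl example above.
DEEPGRAM_URL = "https://api.deepgram.com/v1/listen?multichannel=true"


def build_request(api_key: str, audio: bytes,
                  content_type: str = "audio/mp3") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes.

    Hypothetical helper for illustration; it mirrors the curl flags:
    --request POST, the two --header values, and --data-binary.
    """
    return urllib.request.Request(
        DEEPGRAM_URL,
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )


# Example usage (performs a network call, so it is left commented out):
#
#   import json
#   with open("youraudio.mp3", "rb") as f:
#       request = build_request("YOUR_DEEPGRAM_API_KEY", f.read())
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

Separating request construction from sending keeps the sketch easy to inspect without an API key or network access.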

Analyze Response

For this example, we use an MP3 split stereo audio file that contains the first 10 seconds of a customer call with a florist. If you would like to follow along, you can download it.

When the file is finished processing (often after only a few seconds), you’ll receive a JSON response.

Note that the response structure differs depending on whether audio is submitted to our pre-recorded endpoint or our streaming endpoint.

Pre-Recorded Response

JSON

{
  "metadata": {
    ...
    "channels": 0
  },
  "results": {
    "channels": [
      {
        "alternatives": []
      }
    ]
  }
}

For this response, the channels property under metadata will be set to 2 because our sample audio track has two channels.

Let’s look more closely at the channels object under results:

JSON

...
"channels":[
  {
    "alternatives":[
      {
        "transcript":"thank you for calling marcus flowers how history i'd be happy to take care of your order may i have your name please",
        "confidence":0.98221713,
        "words":[
          {"word":"thank","start":0.94,"end":1.06,"confidence":0.9963781},
          ...
        ]
      }
    ]
  },
  {
    "alternatives":[
      {
        "transcript":"hello i'd like to order flowers and i think you have what i'm looking for",
        "confidence":0.99148595,
        "words":[
          {"word":"hello","start":4.0095854,"end":4.049482,"confidence":0.9823143},
          ...
        ]
      }
    ]
  }
]
...

In this response, the channels array contains two objects, one for each channel identified in the audio. Within each channel, each object in alternatives includes:

transcript: Transcript for the audio being processed.
confidence: Floating point value between 0 and 1 that indicates overall transcript reliability. Larger values indicate higher confidence.
words: Array of objects, one for each word in the transcript, each with the word's start time and end time (in seconds) from the beginning of the audio stream and a confidence value.
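
As a rough sketch of how these fields might be read in Python, the function below pulls one transcript per channel from a parsed pre-recorded response. The function name channel_transcripts is hypothetical, and taking only the first alternative in each channel is an assumption made for simplicity.

```python
def channel_transcripts(response: dict) -> list[str]:
    """Return one transcript string per channel, in channel order.

    Assumes the pre-recorded layout shown above:
    response["results"]["channels"][i]["alternatives"][j]["transcript"].
    Only the first alternative of each channel is used (an assumption);
    a channel with no alternatives yields an empty string.
    """
    transcripts = []
    for channel in response["results"]["channels"]:
        alternatives = channel.get("alternatives", [])
        transcripts.append(alternatives[0]["transcript"] if alternatives else "")
    return transcripts
```

The position of each string in the returned list matches the position of its channel in the channels array.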

In the first channel object, notice that the word history has an end time of 3.5, while the word i'd has a start time of 7.93305; this is a considerable gap in audio within this channel. Now, notice that in the second channel object, the first word has a start time of 4.0095854 and the last word has an end time of 7.3209844. This time frame falls directly in the middle of the gap in the first channel object.

This makes sense because our sample audio file is a split stereo file with speakers separated on different channels. We can see that one speaker greets another in the first audio channel, waits for a response from the speaker recorded in the second audio channel, and then responds in the first audio channel.
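
The turn-taking described above can be recovered programmatically by merging the words of all channels on their start times. The following is a minimal sketch under stated assumptions: interleave_turns is a hypothetical name, only the first alternative per channel is used, and consecutive words from the same channel are treated as one turn.

```python
def interleave_turns(response: dict) -> list[tuple[int, str]]:
    """Merge per-channel words into a time-ordered list of (channel, text) turns."""
    timed_words = []
    for index, channel in enumerate(response["results"]["channels"]):
        alternatives = channel.get("alternatives") or []
        if not alternatives:
            continue
        for word in alternatives[0].get("words", []):
            timed_words.append((word["start"], index, word["word"]))

    # Sort by start time (ties broken by channel index).
    timed_words.sort()

    # Start a new turn whenever the speaking channel changes.
    turns: list[tuple[int, list[str]]] = []
    for _start, index, text in timed_words:
        if turns and turns[-1][0] == index:
            turns[-1][1].append(text)
        else:
            turns.append((index, [text]))
    return [(index, " ".join(words)) for index, words in turns]
```

On a split stereo call like the sample, this yields alternating turns: channel 0, then channel 1, then channel 0 again.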

Streaming Response

JSON

{
  "metadata": {
    ...
  },
  "channel_index": [
    0,
    2
  ],
  "channel": {
    "alternatives": []
  }
}

For this response, the channel_index property will be set to [0, 2]. The first number is the zero-based index of the channel that the transcript in the message belongs to, and the second number is the total number of channels (in this case, 2). In our example, you will see messages with [0, 2] for transcription of the first channel and [1, 2] for transcription of the second channel.
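
A small Python sketch of reading this property from a parsed streaming message follows. The helper name parse_channel_index is hypothetical, and the range check is an added safeguard rather than something the API is documented to require.

```python
def parse_channel_index(message: dict) -> tuple[int, int]:
    """Unpack channel_index into (channel, total_channels).

    The first element is the zero-based channel the message belongs to;
    the second is the total number of channels in the stream.
    """
    channel, total = message["channel_index"]
    if not 0 <= channel < total:
        raise ValueError(f"channel {channel} is out of range for {total} channels")
    return channel, total
```

For the two messages in this example, the helper returns (0, 2) and (1, 2) respectively.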

Let’s look more closely at the results of the stream. Deepgram will send back separate messages for each channel, like so:

JSON

{
  "metadata": {
    ...
  },
  "channel_index": [
    0,
    2
  ],
  "duration": 3.58,
  "start": 0,
  "is_final": true,
  "speech_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "thank you for calling martha flores how may i assist you",
        "confidence": 0.99853516,
        "words": [
          {
            "word": "thank",
            "start": 0.64,
            "end": 0.79999995,
            "confidence": 0.99853516
          },
          {
            "word": "you",
            "start": 0.79999995,
            "end": 1.04,
            "confidence": 1
          },
          {
            "word": "for",
            "start": 1.04,
            "end": 1.1999999,
            "confidence": 0.9995117
          },
          {
            "word": "calling",
            "start": 1.1999999,
            "end": 1.5999999,
            "confidence": 1
          },
          {
            "word": "martha",
            "start": 1.5999999,
            "end": 2,
            "confidence": 0.97216797
          },
          {
            "word": "flores",
            "start": 2,
            "end": 2.48,
            "confidence": 0.73010254
          },
          {
            "word": "how",
            "start": 2.48,
            "end": 2.6399999,
            "confidence": 0.9926758
          },
          {
            "word": "may",
            "start": 2.6399999,
            "end": 2.8,
            "confidence": 0.99853516
          },
          {
            "word": "i",
            "start": 2.8,
            "end": 2.8799999,
            "confidence": 0.9995117
          },
          {
            "word": "assist",
            "start": 2.8799999,
            "end": 3.1999998,
            "confidence": 1
          },
          {
            "word": "you",
            "start": 3.1999998,
            "end": 3.58,
            "confidence": 0.9963379
          }
        ]
      }
    ]
  }
}

This first message represents the transcription of the first channel. It will be followed by another message representing the transcription of the second channel, marked with channel_index: [1, 2]:

JSON

{
  "metadata": {
    ...
  },
  "channel_index": [
    1,
    2
  ],
  "duration": 4.43,
  "start": 3.02,
  "is_final": true,
  "speech_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "hello i'd like to order flowers and i think you have what i'm looking for",
        "confidence": 0.99902344,
        "words": [
          {
            "word": "hello",
            "start": 3.6599998,
            "end": 4.06,
            "confidence": 0.9436035
          },
          {
            "word": "i'd",
            "start": 4.06,
            "end": 4.3,
            "confidence": 0.99902344
          },
          {
            "word": "like",
            "start": 4.3,
            "end": 4.46,
            "confidence": 0.9995117
          },
          {
            "word": "to",
            "start": 4.46,
            "end": 4.7,
            "confidence": 0.9975586
          },
          {
            "word": "order",
            "start": 4.7,
            "end": 5.02,
            "confidence": 1
          },
          {
            "word": "flowers",
            "start": 5.02,
            "end": 5.5,
            "confidence": 0.8449707
          },
          {
            "word": "and",
            "start": 5.5,
            "end": 5.74,
            "confidence": 0.9995117
          },
          {
            "word": "i",
            "start": 5.74,
            "end": 5.8199997,
            "confidence": 0.99902344
          },
          {
            "word": "think",
            "start": 5.8199997,
            "end": 6.06,
            "confidence": 0.9980469
          },
          {
            "word": "you",
            "start": 6.06,
            "end": 6.22,
            "confidence": 0.9916992
          },
          {
            "word": "have",
            "start": 6.22,
            "end": 6.46,
            "confidence": 0.9995117
          },
          {
            "word": "what",
            "start": 6.46,
            "end": 6.62,
            "confidence": 0.9995117
          },
          {
            "word": "i'm",
            "start": 6.62,
            "end": 6.7799997,
            "confidence": 0.9992676
          },
          {
            "word": "looking",
            "start": 6.7799997,
            "end": 7.18,
            "confidence": 0.9995117
          },
          {
            "word": "for",
            "start": 7.18,
            "end": 7.45,
            "confidence": 0.9416504
          }
        ]
      }
    ]
  }
}
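
Because each streaming message carries a single channel, a client typically has to regroup messages by channel itself. The sketch below shows one possible approach in Python; collect_final_transcripts is a hypothetical name, the input is assumed to be an iterable of raw JSON message strings, and skipping messages where is_final is false is a design choice made for this illustration.

```python
import json
from collections import defaultdict


def collect_final_transcripts(raw_messages) -> dict[int, str]:
    """Group final transcripts by channel index.

    raw_messages: iterable of JSON strings shaped like the streaming
    messages shown above. Messages without channel_index, without
    is_final set to true, or with an empty transcript are skipped.
    """
    by_channel = defaultdict(list)
    for raw in raw_messages:
        message = json.loads(raw)
        if "channel_index" not in message or not message.get("is_final"):
            continue
        alternatives = message["channel"]["alternatives"]
        if not alternatives or not alternatives[0]["transcript"]:
            continue
        channel = message["channel_index"][0]
        by_channel[channel].append(alternatives[0]["transcript"])
    # Join the pieces for each channel in arrival order.
    return {ch: " ".join(parts) for ch, parts in sorted(by_channel.items())}
```

Applied to the two messages above, this would return a dictionary with the first channel's transcript under key 0 and the second channel's under key 1.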