1. Documentation
  2. Features
  3. Multichannel

Multichannel

PRE-RECORDED
STREAMING

Deepgram’s Multichannel feature recognizes multiple audio channels and transcribes each channel independently. When set to true, you will receive one transcript for each channel.

When processing pre-recorded audio, you can apply a different model to each channel using the model parameter. For example, set model=general:phonecall, which applies the general model to channel 0 and the phonecall model to channel 1 (multichannel=true&model=general:phonecall).

Use Cases

An example of a use case for Multichannel includes:

Customers who use audio with multiple speakers on independent channels and want transcripts to identify each speaker individually.

Enable Feature

To enable Multichannel, when you call Deepgram’s API, add a multichannel parameter set to true in the query string:

multichannel=true

To transcribe audio from a file on your computer, run the following cURL command in a terminal or your favorite API client.

Be sure to replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key. You can create an API Key in the Deepgram Console.

curl \
  --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'Content-Type: audio/mp3' \
  --data-binary @youraudio.mp3 \
  --url 'https://api.deepgram.com/v1/listen?multichannel=true'

Analyze Response

For this example, we use an MP3 split stereo audio file that contains the first 10 seconds of a customer call with a florist. If you would like to follow along, you can download it.

When the file is finished processing (often after only a few seconds), you’ll receive a JSON response that has the following basic structure:

{
  "metadata": {
    "transaction_key": "string",
    "request_id": "string",
    "sha256": "string",
    "created": "string",
    "duration": 0,
    "channels": 0
  },
  "results": {
    "channels": [
      {
        "alternatives":[]
      },
    ]
  }
}

For this response, the channels property under metadata will be set to 2 because our sample audio track has two channels.

Let's look more closely at the channels object under results:

...
"channels":[
  {
    "alternatives":[
      {
        "transcript":"thank you for calling marcus flowers how history i'd be happy to take care of your order may i have your name please",
        "confidence":0.98221713,
        "words":[
          {"word":"thank","start":0.94,"end":1.06,"confidence":0.9963781},
          {"word":"you","start":1.06,"end":1.18,"confidence":0.99431735},
          {"word":"for","start":1.18,"end":1.3399999,"confidence":0.9995141},
          {"word":"calling","start":1.3399999,"end":1.54,"confidence":0.9642348},
          {"word":"marcus","start":1.66,"end":2.02,"confidence":0.49227273},
          {"word":"flowers","start":2.22,"end":2.4199998,"confidence":0.49624875},
          {"word":"how","start":2.6599998,"end":2.98,"confidence":0.28497696},
          {"word":"history","start":3.1,"end":3.5,"confidence":0.57538134},
          {"word":"i'd","start":7.93305,"end":8.092364,"confidence":0.9472991},
          {"word":"be","start":8.092364,"end":8.21185,"confidence":0.991615},
          {"word":"happy","start":8.21185,"end":8.410994,"confidence":0.9985863},
          {"word":"to","start":8.410994,"end":8.610136,"confidence":0.9588121},
          {"word":"take","start":8.610136,"end":8.8092785,"confidence":0.99981946},
          {"word":"care","start":8.8092785,"end":8.928764,"confidence":0.9996985},
          {"word":"of","start":8.928764,"end":9.04825,"confidence":0.99723876},
          {"word":"your","start":9.04825,"end":9.247393,"confidence":0.9970649},
          {"word":"order","start":9.247393,"end":9.406708,"confidence":0.99819595},
          {"word":"may","start":9.526194,"end":9.566021,"confidence":0.9967169},
          {"word":"i","start":9.60585,"end":9.725336,"confidence":0.95288867},
          {"word":"have","start":9.725336,"end":9.88465,"confidence":0.9997284},
          {"word":"your","start":9.88465,"end":10.123621,"confidence":0.9989493},
          {"word":"name","start":10.123621,"end":10.243107,"confidence":0.9987111},
          {"word":"please","start":10.482079,"end":10.681221,"confidence":0.9997795}
        ]
      }
    ]
  },
  {
    "alternatives":[
      {
        "transcript":"hello i'd like to order flowers and i think you have what i'm looking for",
        "confidence":0.99148595,
        "words":[
          {"word":"hello","start":4.0095854,"end":4.049482,"confidence":0.9823143},
          {"word":"i'd","start":4.169171,"end":4.4085493,"confidence":0.9855652},
          {"word":"like","start":4.4085493,"end":4.6080313,"confidence":0.9956638},
          {"word":"to","start":4.6080313,"end":4.7676167,"confidence":0.9293824},
          {"word":"order","start":4.7676167,"end":5.2676167,"confidence":0.9842541},
          {"word":"flowers","start":5.286269,"end":5.60544,"confidence":0.99148595},
          {"word":"and","start":5.7251296,"end":5.804922,"confidence":0.9973412},
          {"word":"i","start":5.8448186,"end":6.0443006,"confidence":0.99624014},
          {"word":"think","start":6.0443006,"end":6.203886,"confidence":0.90700394},
          {"word":"you","start":6.203886,"end":6.323575,"confidence":0.9772934},
          {"word":"have","start":6.323575,"end":6.5629535,"confidence":0.96995157},
          {"word":"what","start":6.5629535,"end":6.642746,"confidence":0.99273515},
          {"word":"i'm","start":6.6826425,"end":6.8821244,"confidence":0.9940373},
          {"word":"looking","start":6.8821244,"end":7.2012954,"confidence":0.9997274},
          {"word":"for","start":7.2012954,"end":7.3209844,"confidence":0.9995869}
        ]
      }
    ]
  }
]
...

In this response, we see that the channels object contains two sub-objects, one for each channel identified in the audio. Within each channel, each alternative contains multiple objects, each of which includes:

  • transcript: Transcript for the audio being processed.
  • confidence: Floating point value between 0 and 1 that indicates overall transcript reliability. Larger values indicate higher confidence.
  • words: Object containing each word in the transcript, along with its start time and end time (in seconds) from the beginning of the audio stream, and a confidence value.

In the first channel object, notice that the word history has an end time of 3.5, while the word i'd has a start time of 7.93305; this is a considerable gap in audio within this channel. Now, notice that in the second channel object, the first word has a start time of 4.0095854 and the last word has an end time of 7.3209844. This time frame falls directly in the middle of the gap in the first channel object.

This makes sense because our sample audio file is a split stereo file with speakers separated on different channels. We can see that one speaker greets another in the first audio channel, waits for a response from the speaker recorded in the second audio channel, and then responds in the first audio channel.

By default, Deepgram applies its general AI model, which is a good, general purpose model for everyday situations. To learn more about the customization possible with Deepgram's API, check out the Deepgram API Reference.