Use Deepgram's speech-to-text API to transcribe live-streaming audio.

Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the tried-and-true WebSocket protocol, which makes integration with customer pipelines simple due to the wide array of client libraries available.

To use this endpoint, connect to wss://api.deepgram.com/v1/listen. TLS encryption will protect your connection and data. We support a minimum of TLS 1.2.

All audio data is sent to the streaming endpoint as binary-type WebSocket messages containing payloads that are the raw audio data. Because the protocol is full-duplex, you can stream in real-time and still receive transcription responses while uploading data. Streaming buffer sizes should be between 20 milliseconds and 250 milliseconds of audio.
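To make the 20-250 millisecond guidance concrete, here is a minimal sketch of sizing audio chunks, assuming 16-bit linear PCM (encoding=linear16); the function name and constants are illustrative, not part of the Deepgram API:

```python
BYTES_PER_SAMPLE = 2  # 16-bit linear PCM (assumed encoding=linear16)

def chunk_size_bytes(duration_ms: int, sample_rate: int, channels: int = 1) -> int:
    """Bytes of raw linear16 audio covering duration_ms milliseconds."""
    if not 20 <= duration_ms <= 250:
        raise ValueError("chunk duration should be between 20 ms and 250 ms of audio")
    samples = sample_rate * duration_ms // 1000
    return samples * channels * BYTES_PER_SAMPLE

# A 100 ms mono chunk at 16 kHz is 3200 bytes.
print(chunk_size_bytes(100, 16000))
```

Each chunk is then sent as one binary-type WebSocket message.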

When you are finished, send a JSON message to the server: { "type": "CloseStream" }. The server will interpret it as a shutdown command, which means it will finish processing whatever data it still has cached, send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.

To learn more about working with real-time streaming data and results, see Get Started with Streaming Audio.

Deepgram does not store transcriptions. Make sure to save output or return transcriptions to a callback URL for custom processing.

Query Params

callback string

Callback URL to provide if you would like your submitted audio to be processed asynchronously. Learn More

callback_method boolean

Enable a callback method. Default: false. Learn More

channels int32

Number of independent audio channels contained in submitted streaming audio. Only read when a value is provided for encoding. Learn More

diarize boolean

Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0. Learn More

diarize_version string

Indicates the version of the diarization feature to use. Only used when the diarization feature is enabled (diarize=true is passed to the API). Learn More

encoding string

Expected encoding of the submitted streaming audio. If this parameter is set, sample_rate must also be specified. Learn More

endpointing int32 | boolean

Indicates how long Deepgram will wait (in milliseconds) to detect whether a speaker has finished speaking (or paused for a significant period of time, indicating the completion of an idea). When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes the result for the processed time range and returns the transcript with a speech_final parameter set to true. Endpointing may be disabled by setting endpointing=false. Learn More

extra string

Arbitrary key-value pairs to add to the query string; they are passed through and included in the response. Learn More

filler_words boolean

Indicates whether to include filler words like "uh" and "um" in transcript output. When set to true, these words will be included. Defaults to false. Learn More

interim_results boolean

Indicates whether the streaming endpoint should send you updates to its transcription as more audio becomes available. When set to true, the streaming endpoint returns regular updates, which means transcription results will likely change for a period of time. Defaults to false. Learn More

keywords string

Uncommon proper nouns or other words to transcribe that are not part of the model's vocabulary. Can send multiple instances in the query string (for example, keywords=snuffalupagus:10&keywords=systrom:5.5). Learn More

language string

The BCP-47 language tag that hints at the primary spoken language. Learn More

model string

AI model used to process submitted audio. Learn More

multichannel boolean

Indicates whether to transcribe each audio channel independently. Learn More

numerals boolean

Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1). Learn More

profanity_filter boolean

Indicates whether to remove profanity from the transcript. Learn More

punctuate boolean

Indicates whether to add punctuation and capitalization to the transcript. Learn More

redact string

Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Can send multiple instances in the query string (for example, redact=pci&redact=numbers). Learn More

replace string

Terms or phrases to search for in the submitted audio and replace. Can send multiple instances in the query string (for example, replace=this:that&replace=thisalso:thatalso). Learn More

sample_rate int32

Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding. Learn More

search string

Terms or phrases to search for in the submitted audio. Can send multiple instances in the query string (for example, search=speech&search=Friday). Learn More

smart_format boolean

Indicates whether to apply formatting to transcript output. When set to true, additional formatting will be applied to transcripts to improve readability. Learn More

tag string

Tag to associate with the request. Learn More

utterance_end_ms string

Indicates how long Deepgram will wait (in milliseconds) to send a {"type": "UtteranceEnd"} message after a word has been transcribed. Learn More

vad_events boolean

Indicates whether to send a {"type": "SpeechStarted"} message when voice activity is first detected in the incoming audio. Learn More

version string

Version of the model to use. Learn More
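These parameters are passed in the query string of the wss:// URL. A minimal sketch of assembling the connection URL with Python's standard library; the parameter values here are examples only, and repeatable parameters such as keywords are sent as multiple key-value pairs:

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.deepgram.com/v1/listen"

# Example parameter set; values are illustrative.
params = [
    ("encoding", "linear16"),
    ("sample_rate", "16000"),
    ("channels", "1"),
    ("interim_results", "true"),
    # Repeatable parameters appear once per value.
    ("keywords", "snuffalupagus:10"),
    ("keywords", "systrom:5.5"),
]

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

Passing the parameters as a list of pairs (rather than a dict) is what lets a repeatable key like keywords appear more than once.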

Responses

Status: 200 Success. Audio submitted for transcription.

Response Schema

{
  "metadata": {
    "transaction_key": "string",
    "request_id": "uuid",
    "sha256": "string",
    "created": "string",
    "duration": 0,
    "channels": 0,
    "models": [
      "string"
    ]
  },
  "type": "Results",
  "channel_index": [
    0,
    0
  ],
  "duration": 0.0,
  "start": 0.0,
  "is_final": boolean,
  "speech_final": boolean,
  "channel": {
    "alternatives": [
      {
        "transcript": "string",
        "confidence": 0,
        "words": [
          {
            "word": "string",
            "start": 0,
            "end": 0,
            "confidence": 0
          }
        ]
      }
    ],
    "search": [
      {
        "query": "string",
        "hits": [
          {
            "confidence": 0,
            "start": 0,
            "end": 0,
            "snippet": "string"
          }
        ]
      }
    ]
  }
}
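As a sketch of consuming these Results messages on the client side, the helper below (illustrative, not part of any Deepgram SDK) pulls the first transcript alternative out of a message and reports whether it is final:

```python
import json

def extract_transcript(message: str):
    """Return (transcript, is_final, speech_final) from a Results message,
    or None for other message types such as Metadata."""
    data = json.loads(message)
    if data.get("type") != "Results":
        return None
    best = data["channel"]["alternatives"][0]  # first alternative
    return (best["transcript"], data.get("is_final", False), data.get("speech_final", False))

# Illustrative message shaped like the response schema above.
sample = json.dumps({
    "type": "Results",
    "channel_index": [0, 0],
    "is_final": True,
    "speech_final": True,
    "channel": {"alternatives": [{"transcript": "hello world", "confidence": 0.99, "words": []}]},
})
print(extract_transcript(sample))
```

With interim_results enabled, a client would typically buffer transcripts until is_final is true before committing them.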

Stream KeepAlive

By default, the Deepgram streaming connection will time out with a NET-0001 error code if no audio is sent by the client for 12 seconds. (See Error Handling below for more information.)

To keep the websocket open without sending audio data, send the following JSON string:

{ "type": "KeepAlive" }

This will keep the streaming connection open for an additional 12 seconds. If no audio or additional KeepAlive messages are sent within the 12 second window, the streaming connection will close with a NET-0001. To avoid this error and keep the connection open, continue sending KeepAlive messages 3-5 seconds before the 12 second timeout window expires until you are ready to resume sending audio.
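One way to reason about the timing: track when the last frame (audio or KeepAlive) was sent and schedule the next KeepAlive 3-5 seconds before the 12-second window expires. A sketch, where the constants mirror the values above and the function names are illustrative:

```python
TIMEOUT_S = 12.0        # connection closes with NET-0001 after this much silence
SAFETY_MARGIN_S = 4.0   # send the next KeepAlive 3-5 s before the window expires

def next_keepalive_due(last_frame_sent_at: float) -> float:
    """Time (in seconds) at which the next KeepAlive should be sent,
    given the timestamp of the last audio or KeepAlive frame."""
    return last_frame_sent_at + TIMEOUT_S - SAFETY_MARGIN_S

def keepalive_needed(now: float, last_frame_sent_at: float) -> bool:
    """True when it is time to send a KeepAlive to avoid NET-0001."""
    return now >= next_keepalive_due(last_frame_sent_at)
```

Each KeepAlive (or audio frame) resets the window, so the sender simply updates last_frame_sent_at after every send.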

📘 Read more about KeepAlive in this comprehensive guide.

Close Stream

To gracefully close a streaming connection, send the following JSON string:

{ "type": "CloseStream" }

This tells Deepgram that no more audio will be sent. Deepgram will close the connection once all audio has finished processing.
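Both KeepAlive and CloseStream are plain JSON sent as text frames; a minimal sketch of serializing them (the helper name is illustrative):

```python
import json

VALID_CONTROL_TYPES = {"KeepAlive", "CloseStream"}

def control_frame(msg_type: str) -> str:
    """Serialize a control message to send as a WebSocket text frame."""
    if msg_type not in VALID_CONTROL_TYPES:
        raise ValueError(f"unknown control message type: {msg_type}")
    return json.dumps({"type": msg_type})

print(control_frame("CloseStream"))
```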

Error Handling

If Deepgram encounters an error during real-time streaming, we will return a WebSocket Close frame (WebSocket Protocol specification, section 5.5.1).

The body of the Close frame will indicate the reason for closing using one of the specification’s pre-defined status codes followed by a UTF-8-encoded payload that represents the reason for the error. Current codes and payloads in use include:

Code Payload Description
1008 DATA-0000 The payload cannot be decoded as audio. Either the encoding is incorrectly specified, the payload is not audio data, or the audio is in a format unsupported by Deepgram.
1011 NET-0000 The service has not transmitted a Text frame to the client within the timeout window. This may indicate an issue internally in Deepgram's systems or could be due to Deepgram not receiving enough audio data to transcribe a frame.
1011 NET-0001 The service has not received a Binary or Text frame from the client within the timeout window. This may indicate an internal issue in Deepgram's systems, the client's systems, or the network connecting them.
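A sketch of turning a received Close frame into something actionable on the client side; the codes and payloads come from the table above, while the handler itself is illustrative:

```python
# Map close-frame payloads to short, human-readable explanations
# (codes and payloads from the table above).
CLOSE_REASONS = {
    "DATA-0000": "Payload could not be decoded as audio (bad encoding or non-audio data).",
    "NET-0000": "Service sent no Text frame within the timeout window.",
    "NET-0001": "Service received no Binary or Text frame from the client within the timeout window.",
}

def describe_close(code: int, payload: str) -> str:
    """Human-readable description of a Close frame's status code and payload."""
    reason = CLOSE_REASONS.get(payload, "Unknown error payload.")
    return f"WebSocket closed with code {code} ({payload}): {reason}"

print(describe_close(1011, "NET-0001"))
```

A NET-0001 seen here usually means the client stopped sending audio without KeepAlive messages (see Stream KeepAlive above).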

To learn about debugging WebSocket errors, see Troubleshooting WebSocket DATA and NET Errors When Live Streaming Audio.

After sending a Close message, the endpoint considers the WebSocket connection closed and will close the underlying TCP connection.