Use Deepgram's speech-to-text API to transcribe live-streaming audio.

Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the WebSocket protocol.

To learn more about working with real-time streaming data and results, see Get Started with Streaming Audio.

Endpoint

Production WebSocket server for Deepgram's real-time transcription with streaming audio. TLS encryption will protect your connection and data. We support a minimum of TLS 1.2.

DetailDescription
Pathwss://api.deepgram.com/v1/listen

Accepts

TypeDescription
Raw Audio File DataUnprocessed or uncompressed binary audio data (such as PCM)
MessagesJSON formatted operations.

Headers

HeaderValueDescription
Sec-WebSocket-Protocoltoken, <DEEPGRAM_API_KEY>Used to establish a WebSocket connection with a specific protocol, include your Deepgram API key for authentication.

Body Params

ParameterTypeDescription
callbackstringCallback URL to provide if you would like your submitted audio to be processed asynchronously. Learn More.
callback_methodstringEnable a callback method. Use put or post. Learn More.
channelsint32Number of independent audio channels contained in submitted streaming audio. Only read when a value is provided for encoding. Learn More.
dictationbooleanDictation automatically formats spoken commands for punctuation into their respective punctuation marks. Learn More.
diarizebooleanIndicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0. Learn More.
diarize_versionstringIndicates the version of the diarization feature to use. Only available when the diarization feature is enabled. Learn More.
encodingstringExpected encoding of the submitted streaming audio. If this parameter is set, sample_rate must also be specified. Learn More.
endpointingbooleanIndicates how long Deepgram will wait to detect whether a speaker has finished speaking or pauses for a significant period of time. When set to true, the streaming endpoint immediately finalizes the transcription for the processed time range and returns the transcript with a speech_final parameter set to true. Learn More.
extrastringAdd any extra key-value pairs to the query string to customize the response. Learn More.
filler_wordsbooleanIndicates whether to include filler words like "uh" and "um" in transcript output. When set to true, these words will be included. Learn More.
interim_resultsbooleanSpecifies whether the streaming endpoint should provide ongoing transcription updates as more audio is received. When set to true, the endpoint sends continuous updates, meaning transcription results may evolve over time. Learn More.
keywordsstringUnique proper nouns or specialized terms you want the model to include in its predictions, which aren't part of the model's default vocabulary. Learn More.
languagestringThe BCP-47 language tag that hints at the primary spoken language. Learn More.
modelstringThe AI model used to process submitted audio. Learn More.
multichannelbooleanIndicates whether to transcribe each audio channel independently. Learn More.
numeralsbooleanIndicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1). Learn More.
profanity_filterbooleanIndicates whether to remove profanity from the transcript. Learn More.
punctuatebooleanIndicates whether to add punctuation and capitalization to the transcript Learn More.
redactstringIndicates whether to redact sensitive information, replacing redacted content with asterisks *. Learn More.
replacestringTerms or phrases to search for in the submitted audio and replace. Learn More.
sample_rateint32Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding. Learn More.
searchstringTerms or phrases to search for in the submitted audio. Learn More.
smart_formatbooleanIndicates whether to apply formatting to transcript output. When set to true, additional formatting will be applied to transcripts to improve readability. Learn More.
tagstringSet a tag to associate with the request. Learn More.
utterance_end_msstringIndicates how long Deepgram will wait to send a Utterance End message after a word has been transcribed. Learn More.
vad_eventsbooleanIndicates that speech has started. You'll begin receiving Speech Started messages upon speech starting. Learn More.
versionstringVersion of the model to use. Learn More.

Sending Audio Data

All audio data is transmitted to the streaming endpoint as binary WebSocket messages, with payloads containing the raw audio data. The full-duplex protocol allows for real-time streaming, enabling you to receive transcription responses simultaneously as you upload data. For optimal performance, each streaming buffer should represent between 20 and 250 milliseconds of audio.

Messages

Keep Alive

Optional

Periodically send KeepAlive messages while streaming can ensure uninterrupted communication and minimizing costs. Learn More .

{ 
  "type": "KeepAlive"
}

Finalize

Optional

Finalize message can be used to handle specific scenarios where you need to force the server to process all unprocessed audio data and immediately return the final results. Learn More.

{ 
  "type": "Finalize" 
}

Close Stream

Optional

The CloseStream message can be sent to the Deepgram server, instructing it close the connection. Learn More.

{ 
  "type": "CloseStream"
}

Responses

Refer to API Errors for more information.

StatusDescription
200Audio submitted for transcription.
400Bad Request.
401Invalid Authorization.
402Payment Required, insufficient credits
403Insufficient permissions.
503Internal server error if the server is temporarily unable to serve requests.

Response Schema

{
  "metadata": {
    "transaction_key": "string",
    "request_id": "uuid",
    "sha256": "string",
    "created": "string",
    "duration": 0,
    "channels": 0,
    "models": [
      "string"
    ],
  },
  "type": "Results",
  "channel_index": [
    0,
    0
  ],
  "duration": 0.0,
  "start": 0.0,
  "is_final": boolean,
  "speech_final": boolean,
  "channel": {
    "alternatives": [
      {
        "transcript": "string",
        "confidence": 0,
        "words": [
          {
            "word": "string",
            "start": 0,
            "end": 0,
            "confidence": 0
          }
        ]
      }
    ],
    "search": [
      {
        "query": "string",
        "hits": [
          {
            "confidence": 0,
            "start": 0,
            "end": 0,
            "snippet": "string"
          }
        ]
      }
    ]
  }
}

Errors & Warnings

If Deepgram encounters an error during real-time streaming, we will return a WebSocket Close frame. The body of the Close frame will indicate the reason for closing using one of the specification’s pre-defined status codes followed by a UTF-8-encoded payload that represents the reason for the error.

Current codes and payloads in use include:

CodePayloadDescription
1000N/ANormal Closure
1008DATA-0000The payload cannot be decoded as audio. Either the encoding is incorrectly specified, the payload is not audio data, or the audio is in a format unsupported by Deepgram.
1011NET-0000The service has not transmitted a Text frame to the client within the timeout window. This may indicate an issue internally in Deepgram's systems or could be due to Deepgram not receiving enough audio data to transcribe a frame.
1011NET-0001The service has not received a Binary or Text frame from the client within the timeout window. This may indicate an internal issue in Deepgram's systems, the client's systems, or the network connecting them.