Transcribe Live Streaming Audio
Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the tried-and-true WebSocket protocol, which makes integration with customer pipelines simple due to the wide array of client libraries available.
To use this endpoint, connect to `wss://api.deepgram.com/v1/listen`. TLS encryption will protect your connection and data. We support a minimum of TLS 1.2.
All audio data is sent to the streaming endpoint as binary-type WebSocket messages containing payloads that are the raw audio data. Because the protocol is full-duplex, you can stream in real-time and still receive transcription responses while uploading data. Streaming buffer sizes should be between 20 milliseconds and 250 milliseconds of audio.
When you are finished, send a JSON message to the server: `{ "type": "CloseStream" }`. The server will interpret it as a shutdown command: it will finish processing whatever data it still has cached, send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.
To learn more about working with real-time streaming data and results, see Get Started with Streaming Audio.
Deepgram does not store transcriptions. Make sure to save output or return transcriptions to a callback URL for custom processing.
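As a minimal end-to-end sketch, the Python client below streams a raw audio file and prints responses as they arrive. It uses the third-party `websockets` package; the `Authorization` header format, chunk size, and pacing are illustrative assumptions, not requirements of the API.

```python
import asyncio
import json

import websockets  # pip install websockets


async def transcribe(path: str, api_key: str) -> None:
    # Connect over TLS. Authenticating with an Authorization header is assumed
    # here. (Older versions of the websockets package take extra_headers;
    # newer ones use additional_headers.)
    async with websockets.connect(
        "wss://api.deepgram.com/v1/listen",
        extra_headers={"Authorization": f"Token {api_key}"},
    ) as ws:

        async def sender() -> None:
            # Stream binary-type WebSocket messages of raw audio; each chunk
            # should hold roughly 20-250 ms of audio, per the guidance above.
            with open(path, "rb") as audio:
                while chunk := audio.read(8000):
                    await ws.send(chunk)        # binary WebSocket message
                    await asyncio.sleep(0.05)   # pace roughly in real time
            # Signal that no more audio will be sent.
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver() -> None:
            # Full duplex: transcription responses arrive while audio uploads.
            async for message in ws:
                print(json.loads(message))

        await asyncio.gather(sender(), receiver())


asyncio.run(transcribe("example.raw", "YOUR_DEEPGRAM_API_KEY"))
```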
Query Params
| Parameter | Description |
|---|---|
| `model` | AI model used to process submitted audio. Learn More |
| `tier` | Level of model you would like to use in your request. Learn More |
| `version` | Version of the model to use. Learn More |
| `language` | The BCP-47 language tag that hints at the primary spoken language. Learn More |
| `punctuate` | Indicates whether to add punctuation and capitalization to the transcript. Learn More |
| `profanity_filter` | Indicates whether to remove profanity from the transcript. Learn More |
| `redact` | Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Can send multiple instances in query string (for example, `redact=pci&redact=numbers`). Learn More |
| `diarize` | Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0. Learn More |
| `diarize_version` | Indicates the version of the diarization feature to use. Only used when the diarization feature is enabled (`diarize=true` is passed to the API). Learn More |
| `smart_format` | Indicates whether to apply formatting to transcript output. When set to true, additional formatting will be applied to transcripts to improve readability. Learn More |
| `filler_words` | Indicates whether to include filler words like "uh" and "um" in transcript output. When set to true, these words will be included. Defaults to false. Learn More |
| `multichannel` | Indicates whether to transcribe each audio channel independently. Learn More |
| `alternatives` | Maximum number of transcript alternatives to return. |
| `numerals` | Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1). Learn More |
| `search` | Terms or phrases to search for in the submitted audio. Can send multiple instances in query string (for example, `search=speech&search=Friday`). Learn More |
| `replace` | Terms or phrases to search for in the submitted audio and replace. Can send multiple instances in query string (for example, `replace=this:that&replace=thisalso:thatalso`). Learn More |
| `callback` | Callback URL to provide if you would like your submitted audio to be processed asynchronously. Learn More |
| `keywords` | Uncommon proper nouns or other words to transcribe that are not a part of the model's vocabulary. Can send multiple instances in query string (for example, `keywords=snuffalupagus:10&keywords=systrom:5.5`). Learn More |
| `interim_results` | Indicates whether the streaming endpoint should send you updates to its transcription as more audio becomes available. When set to true, the streaming endpoint returns regular updates, which means transcription results will likely change for a period of time. By default, this flag is set to false. Learn More |
| `endpointing` | Indicates how long Deepgram will wait to detect whether a speaker has finished speaking (or paused for a significant period of time, indicating the completion of an idea). When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes the result for the processed time range and returns the transcript with a `speech_final` parameter set to true. Endpointing may be disabled by setting `endpointing=false`. Learn More |
| `encoding` | Expected encoding of the submitted streaming audio. If this parameter is set, `sample_rate` must also be specified. Learn More |
| `channels` | Number of independent audio channels contained in submitted streaming audio. Only read when a value is provided for `encoding`. Learn More |
| `sample_rate` | Sample rate of submitted streaming audio. Required (and only read) when a value is provided for `encoding`. Learn More |
| `tag` | Tag to associate with the request. Learn More |
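Parameters are passed in the query string of the connection URL. As a hypothetical example combining several of the options above (the values shown are placeholders, not recommendations):

```
wss://api.deepgram.com/v1/listen?punctuate=true&interim_results=true&encoding=linear16&sample_rate=16000&channels=1
```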
Responses
| Status | Description |
|---|---|
| 200 Success | Audio submitted for transcription. |
Response Schema
{
"metadata": {
"transaction_key": "string",
"request_id": "uuid",
"sha256": "string",
"created": "string",
"duration": 0,
"channels": 0,
"models": [
"string"
]
},
"channel": [
{
"alternatives": [
{
"transcript": "string",
"confidence": 0,
"words": [
{
"word": "string",
"start": 0,
"end": 0,
"confidence": 0
}
]
}
],
"search": [
{
"query": "string",
"hits": [
{
"confidence": 0,
"start": 0,
"end": 0,
"snippet": "string"
}
]
}
]
}
],
"channel_index": [
0,
0
],
"duration": 0.0,
"start": 0.0,
"is_final": boolean,
"speech_final": boolean
}
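As a sketch of consuming this schema, the handler below (a hypothetical helper, following the field layout printed above) pulls the top-ranked transcript out of each result and skips non-transcript messages such as the summary metadata object:

```python
import json


def handle_message(raw: str) -> None:
    """Extract the top transcript from one streaming response message."""
    response = json.loads(raw)
    # Metadata/summary messages carry no "channel" key; skip them.
    if "channel" not in response:
        return
    alternative = response["channel"][0]["alternatives"][0]
    # Finalized results will not change in later messages.
    if response.get("is_final") and alternative["transcript"]:
        print(alternative["transcript"])
```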
Stream KeepAlive
By default, the Deepgram streaming connection will time out with a NET-0001 error code if no audio is sent by the client for 12 seconds. (See Error Handling below for more information.)
To keep the WebSocket open without sending audio data, send the following JSON string:
{ "type": "KeepAlive" }
This will keep the streaming connection open for an additional 12 seconds. If no audio or additional KeepAlive messages are sent within the 12-second window, the streaming connection will close with a NET-0001 error. To avoid this error and keep the connection open, continue sending KeepAlive messages 3-5 seconds before the 12-second timeout window expires, until you are ready to resume sending audio.
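A background task like the following sketch (Python with the `websockets` package, continuing the earlier example) can send KeepAlive messages on a fixed interval; cancel it when you resume sending audio:

```python
import asyncio
import json


async def keep_alive(ws, interval: float = 5.0) -> None:
    # Ping well inside the 12-second timeout window until cancelled.
    while True:
        await ws.send(json.dumps({"type": "KeepAlive"}))
        await asyncio.sleep(interval)
```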
Close Stream
To gracefully close a streaming connection, send the following JSON string:
{ "type": "CloseStream" }
This tells Deepgram that no more audio will be sent. Deepgram will close the connection once all audio has finished processing.
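Continuing the same sketch, a graceful shutdown sends CloseStream and then drains the remaining messages (final transcripts, then the summary metadata) until the server closes the connection:

```python
import json


async def close_stream(ws) -> None:
    # Signal end-of-audio, then read until the server terminates the socket.
    await ws.send(json.dumps({"type": "CloseStream"}))
    async for message in ws:
        print(message)  # remaining results, then summary metadata
```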
Error Handling
If Deepgram encounters an error during real-time streaming, we will return a WebSocket Close frame (see the WebSocket Protocol specification, section 5.5.1).
The body of the Close frame will indicate the reason for closing using one of the specification’s pre-defined status codes followed by a UTF-8-encoded payload that represents the reason for the error. Current codes and payloads in use include:
| Code | Payload | Description |
|---|---|---|
| 1008 | DATA-0000 | The payload cannot be decoded as audio. Either the encoding is incorrectly specified, the payload is not audio data, or the audio is in a format unsupported by Deepgram. |
| 1011 | NET-0000 | The service has not transmitted a Text frame to the client within the timeout window. This may indicate an internal issue in Deepgram's systems or could be due to Deepgram not receiving enough audio data to transcribe a frame. |
| 1011 | NET-0001 | The service has not received a Binary or Text frame from the client within the timeout window. This may indicate an internal issue in Deepgram's systems, the client's systems, or the network connecting them. |
To learn about debugging WebSocket errors, see Troubleshooting WebSocket DATA and NET Errors When Live Streaming Audio.
After sending a Close message, the endpoint considers the WebSocket connection closed and will close the underlying TCP connection.
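With the `websockets` package used in the sketches above, a server-initiated Close frame surfaces as a ConnectionClosed exception whose attributes (as exposed by that package) carry the status code and payload from the table:

```python
import websockets


async def run_safely(stream) -> None:
    try:
        await stream
    except websockets.exceptions.ConnectionClosedError as err:
        # err.code holds the Close frame status code (e.g. 1008 or 1011);
        # err.reason holds the payload (e.g. "DATA-0000" or "NET-0001").
        print(f"Stream closed: code={err.code}, reason={err.reason}")
```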