Speech Recognition API (1.0.0)

Deepgram's Speech Recognition API gives you streamlined access to automatic transcription from Deepgram's off-the-shelf and trained speech recognition models. The API is fast, understands nearly every audio format available, and is customizable: you can tailor your transcript using various query parameters and apply general-purpose or custom-trained AI models.

Server

URL
https://brain.deepgram.com/v2

Authentication

Basic

All requests to the API should include a Basic Authorization header containing the Base64-encoded username (or email address you used to sign up) and password of your Deepgram account.

For example, for user gandalf with password mellon, the base64-encoded value of gandalf:mellon is Z2FuZGFsZjptZWxsb24=. So Gandalf's requests to the Deepgram API should all include the following header: Authorization: Basic Z2FuZGFsZjptZWxsb24=.

Security Scheme Type: HTTP
HTTP Authorization Scheme: basic
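
For reference, a minimal sketch (Node.js) of constructing this header; gandalf/mellon are the example credentials above, not real ones:

// Build a Basic Authorization header from a username and password.
const credentials = Buffer.from('gandalf:mellon').toString('base64');
const headers = { Authorization: `Basic ${credentials}` };
console.log(headers); // { Authorization: 'Basic Z2FuZGFsZjptZWxsb24=' }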

Transcription

High-speed transcription of audio.

Transcribe pre-recorded audio

Transcribes the specified audio file.

Query Parameters

Name | Description
model
string

AI model used to process submitted audio.

Off-the-shelf Deepgram models include:

  • general: Optimized for everyday audio processing; if you aren't sure which model to select, start here.
  • meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.
  • phonecall: Optimized for low-bandwidth audio phone calls.
  • conversationalai (Labs): Optimized to allow artificial intelligence technologies, such as chatbots, to interact with people in a human-like way.

You may also use a custom model associated with your account by including its custom_id.

Default: general
Possible Values: general, phonecall, meeting, conversationalai OR <custom-id>
language
string

BCP-47 language tag that hints at the primary spoken language. Language support is optimized for the following language/model combinations:

Language | Region | Model(s)
English | en-GB | general, phonecall
English | en-IN | general, phonecall
English | en-NZ | general
English | en-US | general, meeting, phonecall
French (Labs) | fr | general
Hindi (Labs) | hi | general
Korean | ko | general
Portuguese (Labs) | pt | general
Portuguese (Labs) | pt-BR | general
Russian (Labs) | ru | general
Spanish | es | general
Turkish | tr | general
Default: en-US
Possible Values: en-GB, en-IN, en-NZ, en-US, es, fr, hi, ko, pt, pt-BR, ru, tr OR null
punctuate
boolean

Indicates whether to add punctuation and capitalization to the transcript.

profanity_filter
boolean

Indicates whether to remove profanity from the transcript.

redact
any

Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include:

  • pci: Redacts sensitive credit card information, including credit card number, expiration date, and CVV
  • numbers (or true): Aggressively redacts strings of numerals
  • ssn (beta): Redacts social security numbers
Possible Values: pci, numbers, ssn, true OR null
diarize
boolean

Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0.

multichannel
boolean

Indicates whether to transcribe each audio channel independently. When set to true, you will receive one transcript for each channel, which means you can apply a different model to each channel using the model parameter (e.g., set model to general:phonecall, which applies the general model to channel 0 and the phonecall model to channel 1).

alternatives
integer

Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears.

Default: 1
Example:

alternatives=1

numerals
boolean

Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1). Deepgram can format numbers up to 999,999.

search
any

Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in audio rather than text patterns in transcripts because we have noticed that acoustic pattern matching is more performant.

callback
string

Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id. When it has finished analyzing the audio, it will send a POST request to the provided URL with an appropriate HTTP status code.

For streaming audio, callback can be used to redirect streaming responses to a different server:
  • If the callback URL begins with http:// or https://, then POST requests are sent to the callback server for each streaming response.
  • If the callback URL begins with ws:// or wss://, then a WebSocket connection is established with the callback server and WebSocket text messages are sent containing the streaming responses.
  • If a WebSocket callback connection is disconnected at any point, the entire real-time transcription stream is killed; this maintains the strong guarantee of a one-to-one relationship between incoming real-time connections and outgoing WebSocket callback connections.

keywords
any

Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation.

To learn more about the most effective way to use keywords and recognize context in your transcript, see our Keyword Boosting guide.

utterances
boolean

(beta) Indicates whether Deepgram will segment speech into meaningful semantic units, which allows the model to interact more naturally and effectively with speakers' spontaneous speech patterns. For example, when humans speak to each other conversationally, they often pause mid-sentence to reformulate their thoughts, or stop and restart a badly-worded sentence. When utterances is set to true, these utterances are identified and returned in the transcript results.

By default, when utterances is enabled, it starts a new utterance after 0.8 s of silence. You can customize the length of time used to determine where to split utterances by submitting the utt_split parameter.

utt_split
number

(beta) Length of time in seconds of silence between words that Deepgram will use when determining where to split utterances. Used when utterances is enabled. Defaults to 0.8 s.

Default: 0.8
Example:

utt_split=1.5
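
The following is a minimal sketch (Node.js 18+, using the global fetch API) that combines several of the query parameters above in a single pre-recorded request; the parameter values and the audio URL are illustrative, not required:

// Sketch: transcribe a remote file with several query parameters set.
// Assumes DEEPGRAM_AUTH holds your Base64-encoded username:password.
async function transcribe() {
  const params = new URLSearchParams({
    model: 'general:phonecall', // per-channel models; pairs with multichannel
    multichannel: 'true',
    punctuate: 'true',
    diarize: 'true',
  });
  const response = await fetch(`https://brain.deepgram.com/v2/listen?${params}`, {
    method: 'POST',
    headers: {
      Authorization: `Basic ${process.env.DEEPGRAM_AUTH}`,
      'content-type': 'application/json',
    },
    body: JSON.stringify({ url: 'https://example.com/audio.wav' }),
  });
  return response.json();
}

transcribe().then(({ metadata, results }) => console.log(metadata, results));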

Request body schema

Request body when submitting pre-recorded audio. Accepts either:

  • raw binary audio data. In this case, include a Content-Type header set to the audio MIME type.
  • JSON object with a single field from which the audio can be retrieved. In this case, include a Content-Type header set to application/json.
Name | Description
url
string (required)

Location of audio file to transcribe.
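
For the raw-binary case, a hedged sketch (Node.js 18+) that reads a local file and submits it with the matching Content-Type header; the file path and MIME type are illustrative:

// Sketch: submit raw binary audio instead of a JSON body with a URL.
const fs = require('fs');

async function transcribeLocalFile(path) {
  const response = await fetch('https://brain.deepgram.com/v2/listen?punctuate=true', {
    method: 'POST',
    headers: {
      Authorization: `Basic ${process.env.DEEPGRAM_AUTH}`,
      'Content-Type': 'audio/wav', // match the MIME type of your audio
    },
    body: fs.readFileSync(path),
  });
  return response.json();
}

transcribeLocalFile('/path/to/audio/file.wav').then(console.log);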

Responses

Status | Description
200 success: Audio submitted for transcription.
Response Schema
Name | Description
metadata
object

JSON-formatted ListenMetadata object.

results
object

JSON-formatted ListenResults object.

cURL
curl --request POST \
  --url 'https://brain.deepgram.com/v2/listen' \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json' \
  --data '{"url":"string"}'
Response
{
  "metadata": {},
  "results": {}
}

Transcribe streaming audio

Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the tried-and-true WebSocket protocol, which makes integration with customer pipelines simple due to the wide array of client libraries available.

To use this endpoint, connect to wss://brain.deepgram.com/v2/listen/stream. TLS encryption will protect your connection and data.

All data is sent to the streaming endpoint as binary-type WebSocket messages containing payloads that are the raw audio data. Because the protocol is full-duplex, you can stream in real-time and still receive transcription responses while uploading data.

When you are finished, send an empty (length zero) binary message to the server. The server will interpret it as a shutdown command, which means it will finish processing whatever data it still has cached, send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.

To learn more about working with real-time streaming data and results, see Streaming Audio in Real-Time.
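
A minimal sketch of this flow, assuming the ws package (npm i ws) and passing the Authorization header directly; a fuller example appears at the end of this endpoint's reference:

// Sketch: stream a local file as binary messages, then send the empty
// message that tells the server to finish up and close the connection.
const fs = require('fs');
const WebSocket = require('ws');

const socket = new WebSocket('wss://brain.deepgram.com/v2/listen/stream', {
  headers: { Authorization: `Basic ${process.env.DEEPGRAM_AUTH}` },
});

socket.on('open', () => {
  socket.send(fs.readFileSync('/path/to/audio/file.wav')); // raw audio data
  socket.send(new Uint8Array(0)); // empty binary message = shutdown command
});

socket.on('message', (data) => console.log(JSON.parse(data.toString())));
socket.on('close', () => console.log('Connection closed by server.'));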

Query Parameters

Name | Description
model
string

AI model used to process submitted audio.

Off-the-shelf Deepgram models include:

  • general: Optimized for everyday audio processing; if you aren't sure which model to select, start here.
  • meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.
  • phonecall: Optimized for low-bandwidth audio phone calls.
  • conversationalai (Labs): Optimized to allow artificial intelligence technologies, such as chatbots, to interact with people in a human-like way.

You may also use a custom model associated with your account by including its custom_id.

Default: general
Possible Values: general, phonecall, meeting, conversationalai OR <custom-id>
language
string

BCP-47 language tag that hints at the primary spoken language. Language support is optimized for the following language/model combinations:

Language | Region | Model(s)
English | en-GB | general, phonecall
English | en-IN | general, phonecall
English | en-NZ | general
English | en-US | general, meeting, phonecall
French (Labs) | fr | general
Hindi (Labs) | hi | general
Korean | ko | general
Portuguese (Labs) | pt | general
Portuguese (Labs) | pt-BR | general
Russian (Labs) | ru | general
Spanish | es | general
Turkish | tr | general
Default: en-US
Possible Values: en-GB, en-IN, en-NZ, en-US, es, fr, hi, ko, pt, pt-BR, ru, tr OR null
punctuate
boolean

Indicates whether to add punctuation and capitalization to the transcript.

profanity_filter
boolean

Indicates whether to remove profanity from the transcript.

redact
any

Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include:

  • pci: Redacts sensitive credit card information, including credit card number, expiration date, and CVV
  • numbers (or true): Aggressively redacts strings of numerals
  • ssn (beta): Redacts social security numbers
Possible Values: pci, numbers, ssn, true OR null
diarize
boolean

Indicates whether to recognize speaker changes. When set to true, each word in the transcript will be assigned a speaker number starting at 0.

multichannel
boolean

Indicates whether to transcribe each audio channel independently. When set to true, you will receive one transcript for each channel, which means you can apply a different model to each channel using the model parameter (e.g., set model to general:phonecall, which applies the general model to channel 0 and the phonecall model to channel 1).

alternatives
integer

Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears.

Default: 1
Example:

alternatives=1

numerals
boolean

Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1). Deepgram can format numbers up to 999,999.

search
any

Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in audio rather than text patterns in transcripts because we have noticed that acoustic pattern matching is more performant.

callback
string

Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id. When it has finished analyzing the audio, it will send a POST request to the provided URL with an appropriate HTTP status code.

For streaming audio, callback can be used to redirect streaming responses to a different server:
  • If the callback URL begins with http:// or https://, then POST requests are sent to the callback server for each streaming response.
  • If the callback URL begins with ws:// or wss://, then a WebSocket connection is established with the callback server and WebSocket text messages are sent containing the streaming responses.
  • If a WebSocket callback connection is disconnected at any point, the entire real-time transcription stream is killed; this maintains the strong guarantee of a one-to-one relationship between incoming real-time connections and outgoing WebSocket callback connections.

keywords
any

Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation.

To learn more about the most effective way to use keywords and recognize context in your transcript, see our Keyword Boosting guide.

interim_results
boolean

Indicates whether the streaming endpoint should send you updates to its transcription as more audio becomes available. By default, the streaming endpoint returns regular updates, which means transcription results will likely change for a period of time. You can avoid receiving these updates by setting this flag to false.

endpointing
boolean

Indicates whether Deepgram should detect when a speaker has finished speaking (or paused for a significant period of time, indicating the completion of an idea). When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes the result for the processed time range and returns the transcript with a speech_final parameter set to true.

For example, if you are working with a 15-second audio clip, but someone is speaking for only the first 3 seconds, endpointing allows you to get a finalized result after the first 3 seconds.

By default, endpointing is enabled and finalizes a transcript after 10 ms of silence. You can customize the length of time used to detect whether a speaker has finished speaking by submitting the vad_turnoff parameter.

Default: true
vad_turnoff
integer

Length of time in milliseconds of silence that voice activation detection (VAD) will use to detect that a speaker has finished speaking. Used when endpointing is enabled. Defaults to 10 ms. Deepgram customers may configure a value between 10 ms and 500 ms; on-premise customers may remove this restriction.

Default: 10
Example:

vad_turnoff=30

encoding
string

Expected encoding of the submitted streaming audio.

Options include:

  • linear16: 16-bit, little endian, signed PCM WAV data
  • flac: FLAC-encoded data
  • mulaw: mu-law encoded WAV data
  • amr-nb: adaptive multi-rate narrowband codec (sample rate must be 8000)
  • amr-wb: adaptive multi-rate wideband codec (sample rate must be 16000)
  • opus: Ogg Opus
  • speex: Ogg Speex
Possible Values: amr-nb, amr-wb, flac, linear16, mulaw, opus OR speex
Example:

encoding=flac

channels
integer

Number of independent audio channels contained in submitted streaming audio. Only read when a value is provided for encoding.

Default: 1
Example:

channels=2

sample_rate
integer

Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding.

Example:

sample_rate=16000
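
A hedged sketch of building a streaming URL for raw PCM audio, combining the encoding, sample_rate, and channels parameters above; the values shown are illustrative:

// Sketch: query string for raw 16-bit PCM at 16 kHz, mono.
// channels and sample_rate are only read when encoding is provided.
const params = new URLSearchParams({
  encoding: 'linear16',
  sample_rate: '16000',
  channels: '1',
  interim_results: 'true',
});
const url = `wss://brain.deepgram.com/v2/listen/stream?${params}`;
console.log(url);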

Responses

Status | Description
201 success: Audio submitted for transcription.
Response Schema
Name | Description
channel_index
array
duration
number

Duration in seconds.

start
string

Offset in seconds.

is_final
boolean

Indicates whether the transcript for the processed time range is final.

channel
object
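
Based on the schema above, a minimal sketch of separating interim updates from finalized results using is_final; the message shape is assumed from the response fields listed here:

// Sketch: handle a streaming response message.
function handleMessage(raw) {
  const message = JSON.parse(raw);
  if (!message.channel) return; // e.g. the summary metadata object
  const transcript = message.channel.alternatives[0].transcript;
  console.log(message.is_final ? 'final:' : 'interim:', transcript);
}

// Usage with a ws socket: socket.on('message', (data) => handleMessage(data));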
JavaScript
// Connect to the streaming endpoint.
var establishConnection = function() {
  console.log("Establishing connection.");
  // Configure the websocket connection.
  // This requires ws installed using 'npm i ws'.
  const WebSocket = require('ws');
  socket = new WebSocket(
    'wss://brain.deepgram.com/v2/listen/stream',
    // if your base64 encoded username:password has padding ('=' signs at the end), you must strip them
    ['Basic', 'MY_BASE64_ENCODED_USERNAME:PASSWORD']
  );
  socket.onopen = (m) => {
    console.log("Socket opened!");
    // Grab an audio file.
    var fs = require('fs');
    var contents = fs.readFileSync('/path/to/audio/file.wav');
    // Send the audio to the brain api all at once (works if audio is relatively short).
    // socket.send(contents);
    // Send the audio to the brain api in chunks of 1000 bytes.
    chunk_size = 1000;
    for (i = 0; i < contents.length; i += chunk_size) {
      slice = contents.slice(i, i + chunk_size);
      socket.send(slice);
    }
    // Send the message to close the connection.
    socket.send(new Uint8Array(0));
  };
  socket.onclose = (m) => {
    console.log("Socket closed.");
  };
  socket.onmessage = (m) => {
    m = JSON.parse(m.data);
    // Log the received message.
    console.log(m);
    // Log just the words from the received message.
    if (m.hasOwnProperty('channel')) {
      let words = m.channel.alternatives[0].words;
      console.log(words);
    }
  };
};

var socket = null;
establishConnection();
Response
{
  "channel_index": [],
  "duration": 0,
  "start": "string",
  "is_final": true,
  "channel": {}
}

API Keys

Generate API keys.

Create a new key

Don't want to reuse your username and password in your requests? Don't want to share credentials within your team? Want to have separate credentials for your staging and production systems? No problem: generate all the API keys you need and use them just as you would your username and password.

Request body schema

Request body when creating an API key.

Name | Description
label
string

User-friendly name of the API Key.

Example:

My API Key

Responses

Status | Description
201 success: API Key generated
Response Schema
Name | Description
key
string

Your new API key. This should replace your username in authentication requests.

secret
string

Your new secret. This should replace your password in authentication requests.

label
string

The user-friendly name of the API key that you submitted in the body of the request.

cURL
curl --request POST \
  --url https://brain.deepgram.com/v2/keys \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json' \
  --data '{"label":"My API Key"}'
Response
{
  "key": "string",
  "secret": "string",
  "label": "string"
}

Get all keys

Returns the list of keys associated with your account.

Responses

Status | Description
200 success: API keys found
Response Schema
Name | Description
keys
array
cURL
curl --request GET \
  --url https://brain.deepgram.com/v2/keys \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json'
Response
{
  "keys": []
}

Delete a key

Deletes the specified key.

Request body schema

Request body when deleting an API key.

Name | Description
key
string

The API key you wish to delete.

Example:

x020gx00g0s0

Responses

Status | Description
204 success: The API key was deleted.
cURL
curl --request DELETE \
  --url https://brain.deepgram.com/v2/keys \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json' \
  --data '{"key":"x020gx00g0s0"}'