Deepgram's Speech Recognition API gives you streamlined access to automatic transcription from Deepgram's off-the-shelf and trained speech recognition models. This feature is very fast, can understand nearly every audio format available, and is customizable. You can customize your transcript using various query parameters and apply general purpose and custom-trained AI models.
URL |
---|
https://brain.deepgram.com/v2 |
All requests to the API should include a basic `Authorization` header that references the Base64-encoded username (or the email address you used to sign up) and password of your Deepgram account. For example, for user `gandalf` with password `mellon`, the Base64-encoded value of `gandalf:mellon` is `Z2FuZGFsZjptZWxsb24=`. So Gandalf's requests to the Deepgram API should all include the following header: `Authorization: Basic Z2FuZGFsZjptZWxsb24=`.
Security Scheme Type | HTTP |
---|---|
HTTP Authorization Scheme | basic |
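As a quick sketch, the header value can be computed in Node.js (using the gandalf:mellon example above):

```javascript
// Compute the Basic Authorization header value for the example credentials.
const credentials = 'gandalf:mellon';
const encoded = Buffer.from(credentials).toString('base64');

console.log(`Authorization: Basic ${encoded}`);
// => Authorization: Basic Z2FuZGFsZjptZWxsb24=
```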
Transcribes the specified audio file.
Name | Type | Description |
---|---|---|
model | string | AI model used to process submitted audio. Off-the-shelf Deepgram models include general, phonecall, and meeting. You may also use a custom model associated with your account by including its ID. Default: general. Possible Values: general, phonecall, meeting, or <custom-id> |
language | string | BCP-47 language tag that hints at the primary spoken language. Language support is optimized for certain language/model combinations. Default: en-US. Possible Values: en-GB, en-IN, en-NZ, en-US, es, fr, ko, pt, pt-BR, ru, tr, or null |
punctuate | boolean | Indicates whether to add punctuation and capitalization to the transcript. |
profanity_filter | boolean | Indicates whether to remove profanity from the transcript. |
redact | any | Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include pci (credit card information), numbers (digit sequences; aliased by true), and ssn (social security numbers). Can send multiple instances in the query string (for example, `redact=pci&redact=numbers`; see the query-string sketch after this table). Possible Values: pci, numbers, ssn, true, or null |
diarize | boolean | Indicates whether to recognize speaker changes. When set to true, each word in the transcript is assigned a speaker number, starting at 0. |
multichannel | boolean | Indicates whether to transcribe each audio channel independently. When set to true, you receive one transcript for each channel. |
alternatives | integer | Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears. Default: 1. Example: alternatives=1 |
numerals | boolean | Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1). Deepgram can format numbers up to 999,999. Converted numbers do not include punctuation; for example, 999,999 would be transcribed as 999999. |
search | any | Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in the audio rather than text patterns in the transcript because acoustic pattern matching performs better. Can send multiple instances in the query string (for example, `search=speech&search=Friday`). |
callback | string | Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id and POST the completed results to the callback URL when processing finishes. For streaming audio, callback can be used to redirect streaming responses to a different server. Example: callback=https://example.com/callback |
keywords | any | beta. Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation. To learn more about the most effective way to use keywords and recognize context in your transcript, see our Keyword Boosting guide. Can send multiple instances in the query string. |
utterances | boolean | beta. Indicates whether Deepgram will segment speech into meaningful semantic units, which allows the model to interact more naturally and effectively with speakers' spontaneous speech patterns. For example, when humans speak to each other conversationally, they often pause mid-sentence to reformulate their thoughts, or stop and restart a badly-worded sentence. By default, when utterances is enabled, a new utterance starts after 0.8 s of silence. You can customize the length of time used to determine where to split utterances by submitting the utt_split parameter. |
utt_split | number | beta. Length of time in seconds of silence between words that Deepgram will use when determining where to split utterances. Used when utterances is enabled. Default: 0.8. Example: utt_split=1.5 |
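To make the query-string mechanics concrete, here is a small sketch (the parameter values are arbitrary) showing how several of these options, including repeated parameters such as redact and search, combine into a single /v2/listen URL:

```javascript
// Build a /v2/listen URL that combines several query parameters.
// Repeated parameters (redact, search) are sent as multiple instances.
const params = new URLSearchParams([
  ['model', 'phonecall'],
  ['punctuate', 'true'],
  ['redact', 'pci'],
  ['redact', 'numbers'],
  ['search', 'speech'],
  ['search', 'Friday'],
]);

console.log(`https://brain.deepgram.com/v2/listen?${params}`);
```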
Request body when submitting pre-recorded audio. Accepts either:
- Raw binary audio data, with the `Content-Type` header set to the audio MIME type.
- A JSON object containing a `url` to the audio, with the `Content-Type` header set to `application/json`.
```bash
curl --request POST \
  --url 'https://brain.deepgram.com/v2/listen' \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json' \
  --data '{"url":"string"}'
```
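For comparison, here is a minimal Node.js sketch of the raw-binary variant (Node 18+ for the global fetch; the file path, MIME type, and punctuate parameter are placeholder choices):

```javascript
// Submit a local audio file as raw binary data.
const fs = require('fs');

async function transcribeFile(path) {
  const audio = fs.readFileSync(path);

  const response = await fetch('https://brain.deepgram.com/v2/listen?punctuate=true', {
    method: 'POST',
    headers: {
      'Authorization': 'Basic REPLACE_BASIC_AUTH',
      'Content-Type': 'audio/wav', // MIME type of the submitted audio
    },
    body: audio,
  });

  return response.json();
}

transcribeFile('/path/to/audio/file.wav').then(console.log);
```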
{- "metadata": {
- "request_id": "string",
- "transaction_key": "string",
- "sha256": "string",
- "created": "string",
- "duration": 0,
- "channels": 0
}, - "results": {
- "channels": [
- {
- "search": [
- {
- "query": "string",
- "hits": [
- {
- "confidence": 0,
- "start": 0,
- "end": 0,
- "snippet": "string"
}
]
}
], - "alternatives": [
- {
- "transcript": "string",
- "confidence": 0,
- "words": [
- {
- "word": "string",
- "start": 0,
- "end": 0,
- "confidence": 0
}
]
}
]
}
]
}
}
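For illustration, a small helper (hypothetical, not part of the API) that pulls the most likely transcript and any search hits out of a parsed response shaped like the object above:

```javascript
// Summarize a parsed /v2/listen response.
function summarize(response) {
  const firstChannel = response.results.channels[0];
  const best = firstChannel.alternatives[0];

  console.log(best.transcript);  // full transcript text
  console.log(best.confidence);  // overall confidence, 0-1

  // Inspect search hits, present when the request included search terms.
  for (const { query, hits } of firstChannel.search || []) {
    for (const hit of hits) {
      console.log(`${query}: ${hit.snippet} (${hit.start}s-${hit.end}s)`);
    }
  }
}
```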
Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the tried-and-true WebSocket protocol, which makes integration with customer pipelines simple due to the wide array of client libraries available.
To use this endpoint, connect to `wss://brain.deepgram.com/v2`. TLS encryption will protect your connection and data.
All data is sent to the streaming endpoint as binary-type WebSocket messages containing payloads that are the raw audio data. Because the protocol is full-duplex, you can stream in real-time and still receive transcription responses while uploading data.
When you are finished, send an empty (length-zero) binary message to the server. The server will interpret it as a shutdown command, which means it will finish processing whatever data it still has cached, send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.
To learn more about working with real-time streaming data and results, see Streaming Audio in Real-Time.
Name | Type | Description |
---|---|---|
model | string | AI model used to process submitted audio. Off-the-shelf Deepgram models include general, phonecall, and meeting. You may also use a custom model associated with your account by including its ID. Default: general. Possible Values: general, phonecall, meeting, or <custom-id> |
language | string | BCP-47 language tag that hints at the primary spoken language. Language support is optimized for certain language/model combinations. Default: en-US. Possible Values: en-GB, en-IN, en-NZ, en-US, es, fr, ko, pt, pt-BR, ru, tr, or null |
punctuate | boolean | Indicates whether to add punctuation and capitalization to the transcript. |
profanity_filter | boolean | Indicates whether to remove profanity from the transcript. |
redact | any | Indicates whether to redact sensitive information, replacing redacted content with asterisks (*). Options include pci (credit card information), numbers (digit sequences; aliased by true), and ssn (social security numbers). Can send multiple instances in the query string (for example, `redact=pci&redact=numbers`). Possible Values: pci, numbers, ssn, true, or null |
diarize | boolean | Indicates whether to recognize speaker changes. When set to true, each word in the transcript is assigned a speaker number, starting at 0. |
multichannel | boolean | Indicates whether to transcribe each audio channel independently. When set to true, you receive one transcript for each channel. |
alternatives | integer | Maximum number of transcript alternatives to return. Just like a human listener, Deepgram can provide multiple possible interpretations of what it hears. Default: 1. Example: alternatives=1 |
numerals | boolean | Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1). Deepgram can format numbers up to 999,999. Converted numbers do not include punctuation; for example, 999,999 would be transcribed as 999999. |
search | any | Terms or phrases to search for in the submitted audio. Deepgram searches for acoustic patterns in the audio rather than text patterns in the transcript because acoustic pattern matching performs better. Can send multiple instances in the query string (for example, `search=speech&search=Friday`). |
callback | string | Callback URL to provide if you would like your submitted audio to be processed asynchronously. When passed, Deepgram will immediately respond with a request_id and POST the completed results to the callback URL when processing finishes. For streaming audio, callback can be used to redirect streaming responses to a different server. Example: callback=https://example.com/callback |
keywords | any | beta. Keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. Just like a human listener, Deepgram can better understand mumbled, distorted, or otherwise hard-to-decipher speech when it knows the context of the conversation. To learn more about the most effective way to use keywords and recognize context in your transcript, see our Keyword Boosting guide. Can send multiple instances in the query string. |
interim_results | boolean | Indicates whether the streaming endpoint should send you updates to its transcription as more audio becomes available. By default, the streaming endpoint returns regular updates, which means transcription results will likely change for a period of time. You can avoid receiving these updates by setting this flag to false, in which case only finalized transcription results are returned. |
endpointing | boolean | Indicates whether Deepgram will detect whether a speaker has finished speaking (or paused for a significant period of time, indicating the completion of an idea). When Deepgram detects an endpoint, it assumes that no additional data will improve its prediction, so it immediately finalizes the result for the processed time range and returns the transcript with its is_final flag set to true. For example, if you are working with a 15-second audio clip, but someone is speaking for only the first 3 seconds, endpointing allows you to get a finalized result after the first 3 seconds. By default, endpointing is enabled and finalizes a transcript after 10 ms of silence. You can customize the length of time used to detect whether a speaker has finished speaking by submitting the vad_turnoff parameter. Default: true |
vad_turnoff | integer | Length of time in milliseconds of silence that voice activity detection (VAD) will use to detect that a speaker has finished speaking. Used when endpointing is enabled. Deepgram customers may configure a value between 10 ms and 500 ms; on-premise customers may remove this restriction. Default: 10. Example: vad_turnoff=30 |
encoding | string | Expected encoding of the submitted streaming audio. Only required when raw, headerless audio packets are sent to the streaming service (see the connection sketch after this table); for pre-recorded audio or audio submitted to the standard /v2/listen endpoint, the encoding is detected automatically. Possible Values: amr-nb, amr-wb, flac, linear16, mulaw, opus, or speex. Example: encoding=flac |
channels | integer | Number of independent audio channels contained in submitted streaming audio. Only read when a value is provided for encoding. Default: 1. Example: channels=2 |
sample_rate | integer | Sample rate of submitted streaming audio. Required (and only read) when a value is provided for encoding. Example: sample_rate=16000 |
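As a sketch of these streaming-only parameters in use, a client sending raw linear16 PCM (the values below are placeholders) must declare the audio format in the query string:

```javascript
// Declare the format of raw, headerless audio in the query string.
const params = new URLSearchParams({
  encoding: 'linear16',  // required for raw audio packets
  sample_rate: '16000',  // required whenever encoding is set
  channels: '1',         // only read when encoding is set
  interim_results: 'false',
});

const url = `wss://brain.deepgram.com/v2/listen/stream?${params}`;
```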
```javascript
// Connect to the streaming endpoint.
var establishConnection = function() {
  console.log("Establishing connection.");

  // Configure the websocket connection.
  // This requires ws, installed using 'npm i ws'.
  const WebSocket = require('ws');
  socket = new WebSocket(
    'wss://brain.deepgram.com/v2/listen/stream',
    // If your base64-encoded username:password has padding ('=' signs at
    // the end), you must strip it.
    ['Basic', 'MY_BASE64_ENCODED_USERNAME:PASSWORD']
  );

  socket.onopen = (m) => {
    console.log("Socket opened!");

    // Grab an audio file.
    var fs = require('fs');
    var contents = fs.readFileSync('/path/to/audio/file.wav');

    // Send the audio to the Deepgram API all at once
    // (works if the audio is relatively short).
    // socket.send(contents);

    // Send the audio to the Deepgram API in chunks of 1000 bytes.
    var chunk_size = 1000;
    for (var i = 0; i < contents.length; i += chunk_size) {
      var slice = contents.slice(i, i + chunk_size);
      socket.send(slice);
    }

    // Send the empty message to close the connection.
    socket.send(new Uint8Array(0));
  };

  socket.onclose = (m) => {
    console.log("Socket closed.");
  };

  socket.onmessage = (m) => {
    m = JSON.parse(m.data);

    // Log the received message.
    console.log(m);

    // Log just the words from the received message.
    if (m.hasOwnProperty('channel')) {
      let words = m.channel.alternatives[0].words;
      console.log(words);
    }
  };
};

var socket = null;
establishConnection();
```
{- "channel_index": [
- {
- "channel": "string",
- "num_channels": 0
}
], - "duration": 0,
- "start": "string",
- "is_final": true,
- "channel": {
- "alternatives": [
- {
- "transcript": "string",
- "confidence": 0,
- "words": [
- {
- "word": "string",
- "start": 0,
- "end": 0,
- "confidence": 0
}
]
}
]
}
}
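One way to consume these messages, sketched below, is to keep only finalized results by checking the is_final flag (socket here is a WebSocket connected as in the example above):

```javascript
// Accumulate finalized transcripts from streaming responses.
const finalTranscripts = [];

socket.onmessage = (m) => {
  const message = JSON.parse(m.data);

  // Transcription messages carry a channel; the closing summary
  // metadata object does not.
  if (message.channel && message.is_final) {
    finalTranscripts.push(message.channel.alternatives[0].transcript);
  }
};
```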
Don't want to reuse your username and password in your requests? Don't want to share credentials within your team? Want to have separate credentials for your staging and production systems? No problem: generate all the API keys you need and use them just as you would your username and password.
Request body when creating an API key. Provide a `label` for the new key in a JSON object body.
Status | Description |
---|---|
201 success | API Key generated |
```bash
curl --request POST \
  --url https://brain.deepgram.com/v2/keys \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json' \
  --data '{"label":"My API Key"}'
```
{- "key": "string",
- "secret": "string",
- "label": "string"
}
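Assuming the returned key and secret pair stands in for username:password (as the introduction above suggests), building the resulting header might look like this:

```javascript
// Build a Basic Authorization header from a generated API key.
// These placeholder values stand in for the key-creation response above.
const key = 'REPLACE_KEY';
const secret = 'REPLACE_SECRET';

const header = 'Basic ' + Buffer.from(`${key}:${secret}`).toString('base64');
console.log(header);
```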
```bash
curl --request GET \
  --url https://brain.deepgram.com/v2/keys \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json'
```
{- "keys": [
- {
- "key": "string",
- "label": "string"
}
]
}
Deletes the specified key. Provide a `key` to delete in a JSON object body.
Status | Description |
---|---|
204 success | The API key was deleted. |
```bash
curl --request DELETE \
  --url https://brain.deepgram.com/v2/keys \
  --header 'Authorization: Basic REPLACE_BASIC_AUTH' \
  --header 'content-type: application/json' \
  --data '{"key":"REPLACE_KEY"}'
```