Get Started
To start a conversation with Deepgram's Voice Agent API:
- Open a websocket connection to wss://agent.deepgram.com/agent.
- Include the HTTP header Authorization: Token YOUR_DEEPGRAM_API_KEY. If you don't have an API key, you can create one for free through the console.
- Send a SettingsConfiguration message over the websocket with your desired settings.
- Start streaming audio from your device microphone or audio source of your choice. Audio should be sent as binary messages over the websocket.
- Play the audio stream that the server sends back, which will be binary messages containing the agent's speech.
- Whenever you receive a UserStartedSpeaking message from the server, discard any queued agent speech. This enables the user to barge in on the agent, as shown in the sketch below.
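Putting these steps together, here is a minimal sketch of a client in Python using the third-party websockets package (an assumption; any websocket client will do). The microphone helper and playback queue are hypothetical placeholders for your audio I/O of choice:

```python
import asyncio
import collections
import json

import websockets  # assumption: pip install websockets

URL = "wss://agent.deepgram.com/agent"
API_KEY = "YOUR_DEEPGRAM_API_KEY"

SETTINGS = {  # see the SettingsConfiguration section below for all options
    "type": "SettingsConfiguration",
    "audio": {
        "input": {"encoding": "linear16", "sample_rate": 16000},
        "output": {"encoding": "linear16", "sample_rate": 24000},
    },
    "agent": {
        "listen": {"model": "nova-2"},
        "think": {"provider": {"type": "open_ai"}, "model": "gpt-4o-mini"},
        "speak": {"model": "aura-asteria-en"},
    },
}

playback_queue = collections.deque()  # agent audio waiting to be played

async def send_microphone(ws):
    while True:
        chunk = read_mic_chunk()  # hypothetical helper returning raw PCM bytes
        await ws.send(chunk)      # audio is sent as binary websocket messages

async def receive(ws):
    async for message in ws:
        if isinstance(message, bytes):
            playback_queue.append(message)  # agent speech for playback
        elif json.loads(message)["type"] == "UserStartedSpeaking":
            playback_queue.clear()  # barge-in: discard queued agent speech

async def main():
    headers = {"Authorization": f"Token {API_KEY}"}
    # note: the keyword is `additional_headers` in websockets >= 13
    async with websockets.connect(URL, extra_headers=headers) as ws:
        await ws.send(json.dumps(SETTINGS))  # configure before streaming audio
        await asyncio.gather(send_microphone(ws), receive(ws))

asyncio.run(main())
```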
Client Messages
This section lists the types of JSON messages that the client can send over the websocket. Every message must contain a type
field identifying its type.
SettingsConfiguration
The client should send a SettingsConfiguration
message immediately after opening the websocket and before sending any audio. This message configures the voice agent and sets the input and output audio formats.
```json
{
  "type": "SettingsConfiguration",
  "audio": {
    "input": { // optional. defaults to 16kHz linear16
      "encoding": "",
      "sample_rate": 24000
    },
    "output": { // optional. see table below for defaults and allowed combinations
      "encoding": "",
      "sample_rate": 24000, // defaults to 24000
      "bitrate": 0,
      "container": ""
    }
  },
  "agent": {
    "listen": {
      "model": "" // optional. defaults to 'nova-2'
    },
    "think": {
      "provider": {
        "type": "" // see `LLM providers and models` table below
      },
      "model": "", // see `LLM providers and models` table below
      "instructions": "", // optional (this is the LLM system prompt)
      "functions": [
        {
          "name": "", // function name
          "description": "", // tells the agent what the function does, and how and when to use it
          "parameters": {
            "type": "object",
            "properties": {
              "item": { // the name of the input property
                "type": "string", // the type of the input
                "description": "" // the description of the input so the agent understands what it is
              }
            },
            "required": ["item"] // the list of required input properties for this function to be called
          },
          "url": "", // optional, only for server-side functions. The endpoint where your function will be called
          "headers": [ // optional, only for server-side functions. HTTP headers to pass when calling the function. Only supports 'authorization'
            {
              "key": "authorization",
              "value": ""
            }
          ],
          "method": "post" // optional, only for server-side functions. The HTTP method to use when calling the function
        }
      ]
    },
    "speak": {
      "model": "" // defaults to 'aura-asteria-en'
    }
  },
  "context": {
    "messages": [], // LLM message history (e.g. to restore an existing conversation if the websocket connection breaks)
    "replay": false // whether to replay the last message, if it is an assistant message
  }
}
```
| Param Name | Description |
|---|---|
| agent.listen.model | The Deepgram speech-to-text model to be used. Use nova-2 for best results. |
| agent.speak.model | The Deepgram text-to-speech model to be used. For example, aura-asteria-en. See all options here. |
| agent.think.model | Together with agent.think.provider.type, defines the LLM to be used. See the provider + model combinations in the table below. |
| agent.think.functions | Functions to pass to the agent, which will be called throughout the conversation if/when needed as described per function. This is supported for OpenAI, Groq, and Anthropic LLMs. |
| audio.output | See the table here for valid combinations. |
Supported LLM providers and models
| think.provider.type values | think.model values | think.provider.url values | think.provider.key values |
|---|---|---|---|
| open_ai | gpt-4o-mini | none | none |
| anthropic | claude-3-haiku-20240307 | none | none |
| groq | (coming soon) | none | none |
| custom | YOUR_MODEL_NAME | YOUR_LLM_COMPLETIONS_URL | SECRET_KEY_FOR_CUSTOM_URL |
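For instance, the custom row above would be configured with a think block along these lines, written here as a Python dict (all values are placeholders for your own endpoint):

```python
# Hypothetical `think` block for the "custom" provider row above.
think = {
    "provider": {
        "type": "custom",
        "url": "YOUR_LLM_COMPLETIONS_URL",   # your completions endpoint
        "key": "SECRET_KEY_FOR_CUSTOM_URL",  # secret used when calling that endpoint
    },
    "model": "YOUR_MODEL_NAME",
    "instructions": "You are a concise, friendly assistant.",
}
```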
UpdateInstructions
The client can send an UpdateInstructions
message to give additional instructions to the Think model in the middle of a conversation.
```json
{
  "type": "UpdateInstructions",
  "instructions": "" // The new instructions to give
}
```
UpdateSpeak
The client can send an UpdateSpeak
message to change the Speak model in the middle of a conversation.
```json
{
  "type": "UpdateSpeak",
  "model": "" // The new Speak model
}
```
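Both update messages are ordinary JSON sends over the open websocket. A sketch of switching the voice mid-conversation (the model name is just an example from the Speak options):

```python
import json

async def switch_voice(ws, model="aura-asteria-en"):
    # UpdateInstructions works the same way, with an "instructions" field instead.
    await ws.send(json.dumps({"type": "UpdateSpeak", "model": model}))
```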
InjectAgentMessage
The client can send an InjectAgentMessage
to immediately trigger an agent statement. If the injection request arrives while the user is speaking, or while the server is in the middle of sending audio for an agent response, the request will be ignored and the server will reply with an InjectionRefused message. The client should therefore only try to inject agent messages during silent portions of the conversation.
Some example uses are:
- Informing the user that the agent is working on a lengthy function call ("Hold on while I look that up for you")
- Prompting the user to continue if the user hasn't spoken for a while ("Are you still on the line?")
```json
{
  "type": "InjectAgentMessage",
  "message": "" // The statement the agent should say
}
```
FunctionCallResponse
The client should reply with a FunctionCallResponse
every time it receives a FunctionCallRequest.
```json
{
  "type": "FunctionCallResponse",
  "function_call_id": "", // The same ID received in the request
  "output": "" // A string containing the result of the function call
}
```
KeepAlive
The client can send a KeepAlive
message to ensure that the server does not close the connection. This message should only be used during a period of time when the client is not sending audio. During such a period, the client can send a KeepAlive
every 8 seconds to keep the connection open.
Note that KeepAlive messages are not needed for most conversations, because you will typically be sending continuous microphone audio to the agent in case the user has something to say.
```json
{
  "type": "KeepAlive"
}
```
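A minimal sketch of such a loop, to be run only while the client's audio stream is paused:

```python
import asyncio
import json

async def keep_alive(ws):
    # Ping the agent every 8 seconds so the idle connection stays open.
    while True:
        await ws.send(json.dumps({"type": "KeepAlive"}))
        await asyncio.sleep(8)
```

Cancel the task once you resume sending microphone audio.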
Server Messages
This section lists the types of JSON messages that the server can send over the websocket. Every message will contain a type
field identifying its type.
Welcome
The server will send a Welcome
message as soon as the websocket opens, without waiting for a SettingsConfiguration from the client.
```json
{
  "type": "Welcome",
  "session_id": "" // A UUID that can be shared with Deepgram to aid in debugging
}
```
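It is worth capturing the session ID for support purposes; for example:

```python
import json

async def await_welcome(ws):
    # The Welcome message is the first thing the server sends.
    event = json.loads(await ws.recv())
    if event["type"] == "Welcome":
        print("session_id:", event["session_id"])  # share with Deepgram when debugging
```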
ConversationText
The server will send a ConversationText
message every time the agent hears the user say something, and every time the agent speaks something itself. The client can use these messages to display the conversation in real time.
```json
{
  "type": "ConversationText",
  "role": "", // The speaker of this statement, either "user" or "assistant"
  "content": "" // The statement that was spoken
}
```
UserStartedSpeaking
The server will send a UserStartedSpeaking
message every time the user begins a new utterance. If the client is playing agent audio when this message is received, it should stop playback immediately and discard all of its buffered agent audio.
```json
{
  "type": "UserStartedSpeaking"
}
```
AgentThinking
The server will send an AgentThinking
message to inform the client of a non-verbalized agent thought. When functions are available, some LLMs use these thoughts to decide which functions to call.
```json
{
  "type": "AgentThinking",
  "content": "" // The text of the agent's thought
}
```
FunctionCallRequest
If a function is client-side, meaning no url
was provided for that function in the SettingsConfiguration, then the server will request to call the function by sending the client a FunctionCallRequest
message. Upon receiving this message, the client should perform the requested function call and reply with a FunctionCallResponse containing the function's output.
```json
{
  "type": "FunctionCallRequest",
  "function_name": "", // The `name` you gave in the function definition
  "function_call_id": "", // ID to be passed back in the `FunctionCallResponse`
  "input": {...} // A JSON value containing the `parameters` you defined for this function
}
```
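A sketch of the client side of this exchange, dispatching by function name to local Python implementations (the order_item function and the registry are assumptions):

```python
import json

def order_item(item: str) -> str:
    # Hypothetical client-side function matching the definition sketched
    # in the SettingsConfiguration example above.
    return f"Added {item} to the order."

FUNCTIONS = {"order_item": order_item}

async def handle_function_call_request(ws, event):
    func = FUNCTIONS[event["function_name"]]
    result = func(**event["input"])  # `input` carries your defined `parameters`
    await ws.send(json.dumps({
        "type": "FunctionCallResponse",
        "function_call_id": event["function_call_id"],  # echo the same ID back
        "output": result,  # the output must be a string
    }))
```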
FunctionCalling
The server will sometimes send FunctionCalling
messages when making function calls to help the client developer debug function calling workflows. The format of this message (and whether it is sent at all) depends on the LLM provider being used.
```json
{
  "type": "FunctionCalling",
  ...
}
```
AgentStartedSpeaking
The server will send an AgentStartedSpeaking
message when it begins streaming an agent audio response to the client for playback.
```json
{
  "type": "AgentStartedSpeaking",
  "total_latency": 0.0, // Seconds from receiving the user's utterance to producing the agent's reply
  "tts_latency": 0.0, // The portion of total latency attributable to text-to-speech
  "ttt_latency": 0.0 // The portion of total latency attributable to text-to-text (usually an LLM)
}
```
AgentAudioDone
The server will send an AgentAudioDone
message immediately after it sends the last audio message in a piece of agent speech. Note that even though no more audio messages are coming for now, the agent may still be speaking, because previously sent audio may still be queued for playback.
```json
{
  "type": "AgentAudioDone"
}
```
Error
The server will send an Error
message to notify the client that something has gone wrong.
```json
{
  "type": "Error",
  "message": "" // A description of what went wrong
}
```
Conversation Context and Welcome Messages
Upon connection, any previous conversation context and any potential "welcome message" (i.e. when the assistant/agent speaks first) are supplied via the same mechanism, demonstrated here:
"context": {
"messages": [
{"role": "user", "content": "Hello?"},
{"role": "assistant", "content": "Hello, how may I help you today?"}
],
"replay": true
}
This JSON is supplied at the top level of a SettingsConfiguration message. If replay is true and the last message is an assistant message, the agent will speak that message back upon connection.
Rate Limits
Early Access (EA) projects start at 1 concurrent connection for testing. This limit can be raised upon request or with a contract.
Usage Tracking
Usage is calculated based on websocket connection time. 1 hour of websocket connection time = 1 hour of API usage.
Pricing
- Pay-as-you-go pricing for the standard tier of the Voice Agent API is $4.50 per hour.
- If you bring your own LLM, standard tier pricing is $3.90 per hour.
HIPAA Notice
While many of Deepgram's APIs can be made HIPAA compliant by putting a BAA in place with us, the Voice Agent API is not yet fully HIPAA compliant when used with third-party sub-processors. Please contact us for more information.