Get Started

To start a conversation with Deepgram's Voice Agent API:

  1. Open a websocket connection to wss://agent.deepgram.com/agent.
    1. Include the HTTP header Authorization: Token YOUR_DEEPGRAM_API_KEY. If you don't have an API key, you can create one for free through the console.
  2. Send a SettingsConfiguration message over the websocket with your desired settings.
  3. Start streaming audio from your device microphone or audio source of your choice. Audio should be sent as binary messages over the websocket.
  4. Play the audio stream that the server sends back, which will be binary messages containing the agent's speech.
    1. Whenever you receive a UserStartedSpeaking message from the server, discard any queued agent speech. This allows the user to barge in on the agent.
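
Below is the same flow as a minimal sketch in Python, using the third-party websockets package. The configuration values are examples only, and play_audio and discard_queued_audio are hypothetical stand-ins for your own audio playback code.

import asyncio
import json

import websockets  # third-party: pip install websockets


def play_audio(chunk: bytes) -> None:
    """Hypothetical playback helper; replace with your audio output."""


def discard_queued_audio() -> None:
    """Hypothetical helper that drops any agent audio still waiting to play."""


async def run_agent(api_key: str) -> None:
    # Step 1: open the websocket with your API key in the Authorization header.
    # (Older websockets releases use extra_headers= instead of additional_headers=.)
    async with websockets.connect(
        "wss://agent.deepgram.com/agent",
        additional_headers={"Authorization": f"Token {api_key}"},
    ) as ws:
        # Step 2: send SettingsConfiguration before any audio.
        await ws.send(json.dumps({
            "type": "SettingsConfiguration",
            "audio": {"input": {"encoding": "linear16", "sample_rate": 16000}},
            "agent": {
                "listen": {"model": "nova-2"},
                "think": {
                    "provider": {"type": "open_ai"},
                    "model": "gpt-4o-mini",
                    "instructions": "You are a friendly assistant.",
                },
                "speak": {"model": "aura-asteria-en"},
            },
        }))

        # Step 3 runs concurrently in a real client: read microphone chunks
        # and `await ws.send(chunk)` as binary websocket messages.

        # Step 4: play agent audio as it arrives, discarding it on barge-in.
        async for message in ws:
            if isinstance(message, bytes):
                play_audio(message)
            else:
                event = json.loads(message)
                if event["type"] == "UserStartedSpeaking":
                    discard_queued_audio()


asyncio.run(run_agent("YOUR_DEEPGRAM_API_KEY"))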

Client Messages

This section lists the types of JSON messages that the client can send over the websocket. Every message must contain a type field identifying its type.

SettingsConfiguration

The client should send a SettingsConfiguration message immediately after opening the websocket and before sending any audio. This message configures the voice agent and sets the input and output audio formats.

{
  "type": "SettingsConfiguration",
  "audio": {
    "input": { // optional. defaults to 16kHz linear16
      "encoding": "",
      "sample_rate": 24000 // defaults to 24k 
    },
    "output": { // optional. see table below for defaults and allowed combinations
      "encoding": "",
      "sample_rate": 24000, // defaults to 24k
      "bitrate": 0,
      "container": ""
    }
  },
  "agent": {
    "listen": {
      "model": "" // optional. default 'nova-2'
    },
    "think": {
      "provider": {
        "type": "" // see `LLM providers and models` table below
      },
      "model": "", // see `LLM providers and models` table below
      "instructions": "", // optional (this is the LLM System prompt)
      "functions": [
        {
          "name": "", // function name
          "description": "", // tells the agent what the function does, and how and when to use it
          "parameters": {
            "type": "object",
            "properties": {
              "item": { // the name of the input property
                "type": "string", // the type of the input
                "description":"" // the description of the input so the agent understands what it is
              }
            },
            "required": ["item"] // the list of required input properties for this function to be called
          },
          "url": "", // optional, only for serverside functions. The endpoint where your function will be called
          "headers": [ // optional, only for serverside functions. Http headers to pass when calling the function. Only supports 'authorization'
            {
              "key": "authorization",
              "value": ""
            }
          ],
          "method": "post", // optional, only for serverside functions. The http method to use when calling the function
        }
      ]
    },
    "speak": {
      "model": "" // default 'aura-asteria-en'
    }
  },
  "context": {
    "messages": [], // LLM message history (e.g. to restore existing conversation if websocket connection breaks)
    "replay": false // whether to replay the last message, if it is an assistant message
  }
}
Param Name             Description
agent.listen.model     The Deepgram speech-to-text model to be used. Use nova-2 for best results.
agent.speak.model      The Deepgram text-to-speech model to be used. For example, aura-asteria-en. See all options here.
agent.think.model      Together with agent.think.provider.type, defines the LLM to be used. Provider and model combinations are listed in the table below.
agent.think.functions  Pass functions to the agent that will be called throughout the conversation if/when needed, as described per function. This is supported for OpenAI, Groq, and Anthropic LLMs.
audio.output           See the table here for valid combinations.

Supported LLM providers and models

think.provider.type values    think.model values           think.provider.url values    think.provider.key values
open_ai                       gpt-4o-mini                  none                         none
anthropic                     claude-3-haiku-20240307      none                         none
groq                          (coming soon)                none                         none
custom                        YOUR_MODEL_NAME              YOUR_LLM_COMPLETIONS_URL     SECRET_KEY_FOR_CUSTOM_URL
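
For the custom row, the URL and key are supplied under think.provider. A sketch of just the agent.think portion of a SettingsConfiguration, written as a Python dict to match the earlier client example, with placeholder values:

custom_think = {
    "provider": {
        "type": "custom",
        "url": "https://example.com/v1/chat/completions",  # YOUR_LLM_COMPLETIONS_URL (placeholder)
        "key": "YOUR_SECRET_KEY",                          # SECRET_KEY_FOR_CUSTOM_URL (placeholder)
    },
    "model": "YOUR_MODEL_NAME",
    "instructions": "You are a helpful assistant.",
}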

UpdateInstructions

The client can send an UpdateInstructions message to give additional instructions to the Think model in the middle of a conversation.

{
  "type": "UpdateInstructions",
  "instructions": "" // The new instructions to give  
}

UpdateSpeak

The client can send an UpdateSpeak message to change the Speak model in the middle of a conversation.

{
  "type": "UpdateSpeak",
  "model": "" // The new Speak model  
}

InjectAgentMessage

The client can send an InjectAgentMessage to immediately trigger an agent statement. If the injection request arrives while the user is speaking, or while the server is in the middle of sending audio for an agent response, then the request will be ignored and the server will reply with an InjectionRefused. Therefore, the client should usually only try to inject agent messages during silent portions of the conversation.

Some example uses are:

  • Informing the user that the agent is working on a lengthy function call ("Hold on while I look that up for you")
  • Prompting the user to continue if the user hasn't spoken for a while ("Are you still on the line?")

{
  "type": "InjectAgentMessage",
  "message": "" // The statement the agent should say
}
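
For example, a client that is about to start a slow server-side lookup might attempt an injection and simply tolerate a refusal. A sketch, assuming ws is the open websocket from the earlier example:

import json

async def announce_delay(ws) -> None:
    # Ask the agent to fill the silence before a lengthy function call.
    # If the user is speaking or agent audio is in flight, the server
    # replies with InjectionRefused instead, which the client can ignore.
    await ws.send(json.dumps({
        "type": "InjectAgentMessage",
        "message": "Hold on while I look that up for you.",
    }))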

FunctionCallResponse

The client should reply with a FunctionCallResponse every time it receives a FunctionCallRequest.

{
  "type": "FunctionCallResponse",
  "function_call_id": "", // The same ID received in the request
  "output": "" // A string containing the result of the function call
}

KeepAlive

The client can send a KeepAlive message to ensure that the server does not close the connection. It is only needed while the client is not sending audio; during such a pause, the client can send a KeepAlive every 8 seconds to keep the connection open.

Note that KeepAlives are not needed for most conversations, because you will typically be sending continuous microphone audio to the agent in case the user has something to say.

{
  "type": "KeepAlive" 
}
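
If you do pause the audio stream, a small background task can keep the connection open. A sketch, assuming ws is the open websocket from the earlier example:

import asyncio
import json

async def keep_alive(ws) -> None:
    # Send a KeepAlive every 8 seconds while no audio is being streamed.
    while True:
        await ws.send(json.dumps({"type": "KeepAlive"}))
        await asyncio.sleep(8)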

Server Messages

This section lists the types of JSON messages that the server can send over the websocket. Every message will contain a type field identifying its type.

Welcome

The server will send a Welcome message as soon as the websocket opens. It sends this message immediately, without waiting for a SettingsConfiguration from the client.

{
  "type": "Welcome",
  "session_id": "" // A UUID that can be shared with Deepgram to aid in debugging
}

ConversationText

The server will send a ConversationText message every time the agent hears the user say something, and every time the agent speaks something itself. These can be used on the clientside to display the conversation messages as they happen in real-time.

{
  "type": "ConversationText",
  "role": "", // The speaker of this statement, either "user" or "assistant"
  "content": "" // The statement that was spoken
}

UserStartedSpeaking

The server will send a UserStartedSpeaking message every time the user begins a new utterance. If the client is playing agent audio when this message is received, it should stop playback immediately and discard all of its buffered agent audio.

{
  "type": "UserStartedSpeaking"  
}
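
One way to honor this on the client is to route agent audio through a queue that the playback loop drains, so that barge-in amounts to clearing that queue. A sketch (actually halting the audio device is left to your playback library):

import asyncio
import json

playback_queue: asyncio.Queue = asyncio.Queue()

async def handle_server_message(message) -> None:
    if isinstance(message, bytes):
        await playback_queue.put(message)  # buffered agent speech awaiting playback
        return
    event = json.loads(message)
    if event["type"] == "UserStartedSpeaking":
        # The user barged in: throw away everything still waiting to be played.
        while not playback_queue.empty():
            playback_queue.get_nowait()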

AgentThinking

The server will send an AgentThinking message to inform the client of a non-verbalized agent thought. When functions are available, some LLMs use these thoughts to decide which functions to call.

{
  "type": "AgentThinking",
  "content": "" // The text of the agent's thought
}

FunctionCallRequest

If a function is clientside, meaning no url was provided for that function in the SettingsConfiguration, then the server will request to call the function by sending the client a FunctionCallRequest message. Upon receiving this message, the client should perform the requested function call and reply with a FunctionCallResponse containing the function's output.

{
  "type": "FunctionCallRequest",
  "function_name": "", // The `name` you gave in the function definition
  "function_call_id": "", // ID to be passed back in the `FunctionCallResponse`
  "input": {...} // A JSON value containing the `parameters` you defined for this function
}
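
A sketch of the round trip for a client-side function, where get_weather and its location parameter are hypothetical stand-ins for whatever you declared in the functions list of your SettingsConfiguration:

import json

def get_weather(location: str) -> str:
    # Hypothetical client-side function; replace with a real implementation.
    return f"It is sunny in {location}."

async def handle_function_call(ws, event: dict) -> None:
    # `event` is a parsed FunctionCallRequest message.
    if event["function_name"] == "get_weather":
        output = get_weather(event["input"]["location"])
    else:
        output = "Unknown function"
    await ws.send(json.dumps({
        "type": "FunctionCallResponse",
        "function_call_id": event["function_call_id"],
        "output": output,
    }))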

FunctionCalling

The server will sometimes send FunctionCalling messages when making function calls to help the client developer debug function calling workflows. The format of this message (and whether it is sent at all) depends on the LLM provider being used.

{
  "type": "FunctionCalling",
  ...  
}

AgentStartedSpeaking

The server will send an AgentStartedSpeaking message when it begins streaming an agent audio response to the client for playback.

{
  "type": "AgentStartedSpeaking",
  "total_latency": 0.0, // Seconds from receiving the user's utterance to producing the agent's reply
  "tts_latency": 0.0, // The portion of total latency attributable to text-to-speech
  "ttt_latency": 0.0 // The portion of total latency attributable to text-to-text (usually an LLM)
}

AgentAudioDone

The server will send an AgentAudioDone message immediately after it sends the last audio message in a piece of agent speech. Note that even though no more audio messages are coming for now, the agent may still be speaking, because previously sent audio may still be queued for playback.

{
  "type": "AgentAudioDone"
}
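
If the client needs to know when the agent has actually finished being audible (for example, before attempting an InjectAgentMessage), it can combine AgentAudioDone with its own playback state. A sketch, reusing the playback queue from the UserStartedSpeaking example:

agent_audio_done = False  # set when AgentAudioDone arrives; reset when new agent audio begins

def on_agent_audio_done() -> None:
    global agent_audio_done
    agent_audio_done = True

def agent_finished_speaking(playback_queue) -> bool:
    # The agent is only silent once the server has sent its last audio message
    # AND the locally buffered audio has drained.
    return agent_audio_done and playback_queue.empty()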

Error

The server will send an Error message to notify the client that something has gone wrong.

{
  "type": "Error",
  "message": "" // A description of what went wrong
}

Conversation Context and Welcome Messages

Upon connection, any previous conversation context and any optional "welcome message" (i.e. having the assistant/agent speak first) are supplied through the same mechanism, demonstrated here:

"context": {
	"messages": [
		{"role": "user", "content": "Hello?"},
		{"role": "assistant", "content": "Hello, how may I help you today?"}
	],
	"replay": true
}

This JSON is supplied at the top level of a SettingsConfiguration message. If replay is true and the last message is an assistant message, the assistant will speak that message back.
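
For example, to have the agent greet the caller first, supply a context whose only message is the assistant's greeting and set replay to true. A sketch of the dictionary that would be merged into the settings sent in the earlier Python example:

welcome_context = {
    "messages": [
        {"role": "assistant", "content": "Hello, how may I help you today?"}
    ],
    "replay": True,  # speak the greeting as soon as the conversation starts
}

# settings["context"] = welcome_context  # merged before json.dumps(settings) is sent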

Rate Limits

Early Access (EA) projects start at 1 concurrent connection for testing. This limit can be raised upon request or with a contract.

Usage Tracking

Usage is calculated based on websocket connection time. 1 hour of websocket connection time = 1 hour of API usage.

Pricing

  • Pay-as-you-go pricing for the standard tier of the Voice Agent API is $4.50 per hour.
  • If you bring your own LLM, standard tier pricing is $3.90 per hour.


HIPAA Notice

While many of Deepgram's APIs can be made HIPAA compliant by putting a BAA in place with us, the Voice Agent API is not yet fully HIPAA compliant when used with third-party sub-processors. Please contact us for more information.