Twilio and Deepgram Voice Agent

Deepgram Voice Agent can integrate with the Twilio streaming API to enable dynamic interactions between callers and voice agents or bots. This guide walks you through setting up a Twilio phone number that interacts with Deepgram’s Voice Agent API, allowing callers to engage with a voice agent in real time.

Before you Begin

Before you can use Deepgram, you’ll need to create a Deepgram account. Signup is free and includes $200 in free credit and access to all of Deepgram’s features!

Before you start, you’ll need to follow the steps in the Make Your First API Request guide to obtain a Deepgram API key, and configure your environment if you are choosing to use a Deepgram SDK.

Prerequisites

For the complete code used in this guide, please check out this repository.

You will need:

A free Twilio account with a Twilio phone number.
ngrok to let Twilio access a local server OR your own hosted server.
Understanding of Python and using Python virtual environments.

TwiML Bin Setup

First, you will need to set up a TwiML Bin. You can refer to the docs on how to do that in the Twilio Console.

XML

<?xml version="1.0" encoding="UTF-8"?>

<Response>
    <Say language="en">"This call may be monitored or recorded."</Say>
    <Connect>
        <Stream url="wss://a127-75-172-116-97.ngrok-free.app/twilio" />
    </Connect>
</Response>

Replace the url value with the address where you deploy the server built in this guide, and make sure /twilio is at the end of the URL.
In the TwiML Bin example above, ngrok is used to expose the server running locally.
Use the forwarding URL that ngrok provides as your WSS endpoint: in your TwiML Bin configuration, replace the http:// or https:// scheme with wss://.
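
The scheme swap described above can also be scripted. The following is a minimal sketch rather than part of the repository: to_stream_url is a hypothetical helper name, and it assumes the forwarding URL printed by ngrok starts with http:// or https://.

```python
from urllib.parse import urlsplit, urlunsplit


def to_stream_url(forwarding_url: str, path: str = "/twilio") -> str:
    """Convert an ngrok forwarding URL into a WebSocket URL for <Stream>.

    Hypothetical helper for illustration; not part of the guide's server code.
    """
    parts = urlsplit(forwarding_url)
    if parts.scheme not in ("http", "https", "ws", "wss"):
        raise ValueError(f"unexpected URL scheme: {parts.scheme!r}")
    # Keep the host, force the wss:// scheme, and point at the /twilio route.
    return urlunsplit(("wss", parts.netloc, path, "", ""))


print(to_stream_url("https://a127-75-172-116-97.ngrok-free.app"))
```

Paste the printed value into the url attribute of the Stream element in your TwiML Bin.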

Using ngrok

ngrok is recommended for quick development and testing but shouldn’t be used for production instances. To use ngrok, see its documentation.

Start ngrok on port 5000 so that it matches the port used by the server code in this guide:

ngrok http 5000

If you restart your ngrok server, your URL will change, which will require you to update your TwiML Bin.

Connecting a Twilio phone number

Your TwiML Bin must then be connected to one of your Twilio phone numbers so that it gets executed whenever someone calls that number. If you need to set up a new phone number and connect it to your TwiML Bin, refer to the Twilio Docs.

In your TwiML Bin, the <Connect> verb is required for bi-directional communication: to send audio from the Deepgram Agent back to Twilio, you must use this verb.
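
If your ngrok URL changes often, retyping the TwiML by hand gets tedious. Here is a minimal sketch that builds the same document with Python’s standard library. build_twiml is a hypothetical helper name, and the generated text still has to be pasted into the TwiML Bin yourself.

```python
import xml.etree.ElementTree as ET


def build_twiml(stream_url: str, notice: str) -> str:
    """Return TwiML that speaks a notice, then opens a bi-directional stream.

    Hypothetical helper for illustration; the structure mirrors the TwiML Bin above.
    """
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", {"language": "en"})
    say.text = notice
    # <Connect> (rather than a one-way stream) lets the server send audio back.
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": stream_url})
    return ET.tostring(response, encoding="unicode")


twiml = build_twiml(
    "wss://a127-75-172-116-97.ngrok-free.app/twilio",
    "This call may be monitored or recorded.",
)
print(twiml)
```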

Building the Server

Copy the server code from the repository and save it locally as server.py. We will use this file in the steps below.

At this point you’ll want to start up a virtual environment for Python. Please refer to documentation for how to do that based on your personal Python preferences.

Depending on your situation you may also need to install specific packages used in this code. You can install the packages you need manually or use the requirements.txt file.

Shell

$ pip install -r requirements.txt

Set your Deepgram API key as an environment variable, which the sts_connect function reads when the server starts, by running the following command in your terminal:

Bash

$ export DEEPGRAM_API_KEY="your_deepgram_api_key"

If your TwiML Bin is set up correctly, navigate to the directory containing server.py in your terminal and run the server with the following command:

Shell

$ python server.py

If your system invokes Python 3 as python3, run this instead:

Shell

$ python3 server.py

Make a test call

You can now call the phone number connected to your TwiML Bin. Without any further code modifications, you should hear Deepgram Aura speak the greeting set in the agent configuration: “Hello! How can I help you today?”

Code Tour

Let’s dive into the code used in the server.py file.

First, we have some import statements:

Python

import asyncio
import base64
import json
import os
import ssl
import sys

import websockets

We are using asyncio and websockets to build an asynchronous WebSocket server.
We will use base64 to decode audio received from Twilio and to encode audio from Aura before passing it to Twilio.
We will use json to parse text messages from Twilio and Deepgram.
We will use os to read the Deepgram API key from the environment.
We will use sys for access to variables and functions maintained by the Python interpreter, such as sys.exit.
We will use ssl (optional) to create secure, encrypted connections between client and server.

The next block of code, sts_connect, defines a function that establishes a WebSocket connection to Deepgram’s agent service.

Python

def sts_connect():
    # You can run export DEEPGRAM_API_KEY="your key" in your terminal to set your API key.
    api_key = os.getenv("DEEPGRAM_API_KEY")
    if not api_key:
        raise ValueError("DEEPGRAM_API_KEY environment variable is not set")

    sts_ws = websockets.connect(
        "wss://agent.deepgram.com/v1/agent/converse",
        subprotocols=["token", api_key],
    )
    return sts_ws

Let’s break it down:

Connection Setup:

Creates a secure WebSocket connection to wss://agent.deepgram.com/v1/agent/converse
Includes authentication via subprotocols

Authentication Method:

Uses Deepgram’s token-based authentication
Reads the key from the DEEPGRAM_API_KEY environment variable and raises a ValueError if it is not set

The next block of code, twilio_handler, does several things:

In this first code block, we set up an asynchronous function to handle WebSocket messages from Twilio. We define additional asynchronous functions to manage messages received from Twilio, messages sent to Deepgram, and responses from Deepgram. To facilitate data sharing between tasks, we use two queues: one for audio from Twilio and another for Twilio’s stream SID (a unique identifier).

Python

async def twilio_handler(twilio_ws):
    audio_queue = asyncio.Queue()
    streamsid_queue = asyncio.Queue()

Also included in twilio_handler is the Settings configuration for our agent. The most important thing to note here is the audio format we are using: 8000 Hz, raw, un-containerized mulaw. This is the format Twilio sends, and the format we need to send back to Twilio, with base64 decoding on the way in and base64 encoding on the way out.

To learn more about supported media inputs and outputs for the Voice Agent, review the documentation.

Python

    async with sts_connect() as sts_ws:
        config_message = {
            "type": "Settings",
            "audio": {
                "input": {
                    "encoding": "mulaw",
                    "sample_rate": 8000,
                },
                "output": {
                    "encoding": "mulaw",
                    "sample_rate": 8000,
                    "container": "none",
                },
            },
            "agent": {
                "language": "en",
                "listen": {
                    "provider": {
                        "type": "deepgram",
                        "model": "nova-3",
                        "keyterms": ["hello", "goodbye"]
                    }
                },
                "think": {
                    "provider": {
                        "type": "open_ai",
                        "model": "gpt-4o-mini",
                        "temperature": 0.7
                    },
                    "prompt": "You are a helpful AI assistant focused on customer service."
                },
                "speak": {
                    "provider": {
                        "type": "deepgram",
                        "model": "aura-2-thalia-en"
                    }
                },
                "greeting": "Hello! How can I help you today?"
            }
        }

        # send the settings to the agent as a JSON text message before any audio
        await sts_ws.send(json.dumps(config_message))
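
Because Twilio’s audio format is fixed, the audio portion of the Settings message rarely changes, while the prompt and greeting often do. Here is a minimal sketch of a helper that rebuilds the message shown above. build_settings is a hypothetical name, the field layout simply mirrors this guide’s configuration, and keyterms is left out for brevity.

```python
import json


def build_settings(prompt: str, greeting: str) -> dict:
    """Return a Settings message using Twilio's 8000 Hz mulaw audio format.

    Hypothetical helper; the structure is copied from the config_message above.
    """
    audio_format = {"encoding": "mulaw", "sample_rate": 8000}
    return {
        "type": "Settings",
        "audio": {
            "input": dict(audio_format),
            # Output is raw, un-containerized audio, hence "container": "none".
            "output": {**audio_format, "container": "none"},
        },
        "agent": {
            "language": "en",
            "listen": {"provider": {"type": "deepgram", "model": "nova-3"}},
            "think": {
                "provider": {"type": "open_ai", "model": "gpt-4o-mini", "temperature": 0.7},
                "prompt": prompt,
            },
            "speak": {"provider": {"type": "deepgram", "model": "aura-2-thalia-en"}},
            "greeting": greeting,
        },
    }


settings = build_settings(
    prompt="You are a helpful AI assistant focused on customer service.",
    greeting="Hello! How can I help you today?",
)
# The message travels over the WebSocket as JSON text.
payload = json.dumps(settings)
print(payload)
```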

The next block of code is sts_sender. This function waits for audio from Twilio to arrive on the audio queue and forwards each chunk to the Deepgram Voice Agent API as soon as it is read.

Python

        async def sts_sender(sts_ws):
            print("sts_sender started")
            while True:
                chunk = await audio_queue.get()
                await sts_ws.send(chunk)
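
The queue hand-off can be exercised without a live connection. The sketch below is a stand-in built on assumptions, not the repository’s code: FakeSocket is a hypothetical object that records what would be sent, and the sender stops on a None sentinel, which the original loop does not do.

```python
import asyncio


class FakeSocket:
    """Hypothetical stand-in for the Deepgram WebSocket; records sent chunks."""

    def __init__(self):
        self.sent = []

    async def send(self, chunk):
        self.sent.append(chunk)


async def sender(ws, audio_queue):
    # Same pattern as sts_sender, plus a None sentinel so the demo can finish.
    while True:
        chunk = await audio_queue.get()
        if chunk is None:
            break
        await ws.send(chunk)


async def demo():
    audio_queue = asyncio.Queue()
    ws = FakeSocket()
    # Two 3200-byte audio chunks followed by the stop sentinel.
    for chunk in (b"\x7f" * 3200, b"\xff" * 3200, None):
        audio_queue.put_nowait(chunk)
    await sender(ws, audio_queue)
    return ws.sent


sent_chunks = asyncio.run(demo())
print(len(sent_chunks), "chunks forwarded")
```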

Next is sts_receiver, which waits until it has received a stream SID from Twilio and then loops over messages received from the Deepgram Voice Agent API. If we receive a text message, we check whether it reports that the user has started speaking. If it does, we treat this as barge-in and have Twilio clear the agent audio on the call using the stream SID.

All other messages should be binary messages containing the text-to-speech (TTS) output of the Deepgram Voice Agent API. We pack each one into a valid Twilio message (using the stream SID again) and send it to Twilio to be played back on the phone for the caller to hear.

For more information about streaming audio to Twilio, see the Twilio documentation.

Python

        async def sts_receiver(sts_ws):
            print("sts_receiver started")
            # we will wait until the twilio ws connection figures out the streamsid
            streamsid = await streamsid_queue.get()
            # for each sts result received, forward it on to the call
            async for message in sts_ws:
                if isinstance(message, str):
                    print(message)
                    # handle barge-in
                    decoded = json.loads(message)
                    if decoded["type"] == "UserStartedSpeaking":
                        clear_message = {
                            "event": "clear",
                            "streamSid": streamsid,
                        }
                        await twilio_ws.send(json.dumps(clear_message))

                    continue

                print(type(message))
                raw_mulaw = message

                # construct a Twilio media message with the raw mulaw (see https://www.twilio.com/docs/voice/twiml/stream#websocket-messages---to-twilio)
                media_message = {
                    "event": "media",
                    "streamSid": streamsid,
                    "media": {"payload": base64.b64encode(raw_mulaw).decode("ascii")},
                }

                # send the TTS audio to the attached phonecall
                await twilio_ws.send(json.dumps(media_message))
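
To see the two Twilio message shapes in isolation, here is a small sketch that builds them as standalone functions. The function names and the stream SID value are made up for the example. The message fields are the ones used in sts_receiver above.

```python
import base64
import json


def media_message(stream_sid: str, raw_mulaw: bytes) -> str:
    """Wrap raw mulaw audio in a Twilio "media" message (JSON text)."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        # JSON cannot carry raw bytes, so the audio is base64-encoded.
        "media": {"payload": base64.b64encode(raw_mulaw).decode("ascii")},
    })


def clear_message(stream_sid: str) -> str:
    """Build the "clear" message used for barge-in."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})


# "MZ-example" is a made-up stream SID used only for this demo.
audio = b"\xff" * 160
msg = json.loads(media_message("MZ-example", audio))
round_trip = base64.b64decode(msg["media"]["payload"])
print(msg["event"], len(round_trip), "bytes")
```

Decoding the payload returns the original bytes, which is the same base64 round trip Twilio performs on its side.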

The next block of code is twilio_receiver, which loops over the messages Twilio sends to our server. If we receive a “start” message, we extract the stream SID and pass it to the async task that needs it. If we receive a “media” message, we decode its audio and append it to a running buffer. Once the buffer reaches a reasonable size, we hand the audio to the async task that forwards it to Deepgram.

Sending many tiny chunks can cause throughput issues, which is why we buffer the audio first.

Python

        async def twilio_receiver(twilio_ws):
            print("twilio_receiver started")
            # twilio sends audio data as 160 byte messages containing 20ms of audio each
            # we will buffer 20 twilio messages corresponding to 0.4 seconds of audio to improve throughput performance
            BUFFER_SIZE = 20 * 160

            inbuffer = bytearray(b"")
            async for message in twilio_ws:
                try:
                    data = json.loads(message)
                    if data["event"] == "start":
                        print("got our streamsid")
                        start = data["start"]
                        streamsid = start["streamSid"]
                        streamsid_queue.put_nowait(streamsid)
                    if data["event"] == "connected":
                        continue
                    if data["event"] == "media":
                        media = data["media"]
                        chunk = base64.b64decode(media["payload"])
                        if media["track"] == "inbound":
                            inbuffer.extend(chunk)
                    if data["event"] == "stop":
                        break

                    # check if our buffer is ready to send to our audio_queue (and, thus, then to sts)
                    while len(inbuffer) >= BUFFER_SIZE:
                        chunk = inbuffer[:BUFFER_SIZE]
                        audio_queue.put_nowait(chunk)
                        inbuffer = inbuffer[BUFFER_SIZE:]
                except Exception as e:
                    # stop on a malformed message, but report the error instead of hiding it
                    print(f"twilio_receiver error: {e}")
                    break
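
The buffering arithmetic is easy to check in isolation. At 8000 Hz with one byte per mulaw sample, 160 bytes is 20 ms of audio and 3200 bytes is 0.4 seconds. The sketch below pulls the draining loop out into a hypothetical drain function and feeds it simulated Twilio messages. It illustrates the technique and is not the repository’s code.

```python
SAMPLE_RATE = 8000          # samples per second; mulaw uses one byte per sample
TWILIO_CHUNK = 160          # bytes per Twilio media message (20 ms of audio)
BUFFER_SIZE = 20 * TWILIO_CHUNK


def drain(inbuffer: bytearray, size: int = BUFFER_SIZE) -> list:
    """Remove and return every complete `size`-byte chunk from inbuffer."""
    chunks = []
    while len(inbuffer) >= size:
        chunks.append(bytes(inbuffer[:size]))
        del inbuffer[:size]
    return chunks


inbuffer = bytearray()
forwarded = []
for _ in range(45):                      # 45 simulated Twilio messages = 0.9 s of audio
    inbuffer.extend(b"\xff" * TWILIO_CHUNK)
    forwarded.extend(drain(inbuffer))

print(len(forwarded), "chunks of", BUFFER_SIZE / SAMPLE_RATE, "seconds each")
print(len(inbuffer), "bytes still buffered")
```

After 45 messages (7200 bytes), two full 3200-byte chunks have been forwarded and 800 bytes remain in the buffer until more audio arrives.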

The next block of code runs the asynchronous tasks defined in twilio_handler concurrently, then closes the Twilio WebSocket once they have finished.

Python

        await asyncio.wait(
            [
                asyncio.ensure_future(sts_sender(sts_ws)),
                asyncio.ensure_future(sts_receiver(sts_ws)),
                asyncio.ensure_future(twilio_receiver(twilio_ws)),
            ]
        )
        await twilio_ws.close()
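
By default, asyncio.wait returns only after every task has finished. Since sts_sender loops forever, the call above may not return once the caller hangs up. One possible alternative, shown here as a sketch and not as the repository’s approach, is to return when the first task exits and cancel the rest. The coroutines short_task and endless_task are placeholders for the real handlers.

```python
import asyncio


async def run_until_first_exit(*coros):
    """Run coroutines concurrently; when one finishes, cancel the others."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for the cancellations to be processed before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def short_task():
    await asyncio.sleep(0.01)   # stands in for twilio_receiver ending on "stop"


async def endless_task():
    while True:                 # stands in for sts_sender's infinite loop
        await asyncio.sleep(0.01)


finished, cancelled = asyncio.run(run_until_first_exit(short_task(), endless_task()))
print(finished, "finished;", cancelled, "cancelled")
```

The trade-off is that any task still doing useful work is cut short, so this fits best when the end of the Twilio stream should end the whole session.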

Finally, the last block of code sets up and runs the server, routing incoming WebSocket connections on the /twilio path to twilio_handler.

Python

# Note: the two-argument handler signature (websocket, path) is the one used by
# older releases of the websockets library; newer releases pass only the
# connection, so install the version pinned in requirements.txt.
async def router(websocket, path):
    print(f"Incoming connection on path: {path}")
    if path == "/twilio":
        print("Starting Twilio handler")
        await twilio_handler(websocket)

def main():
    # use this if using ssl
    # ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # ssl_context.load_cert_chain('cert.pem', 'key.pem')
    # server = websockets.serve(router, '0.0.0.0', 443, ssl=ssl_context)

    # use this if not using ssl
    server = websockets.serve(router, "localhost", 5000)
    print("Server starting on ws://localhost:5000")

    asyncio.get_event_loop().run_until_complete(server)
    asyncio.get_event_loop().run_forever()

if __name__ == "__main__":
    sys.exit(main() or 0)