An introduction to using Deepgram’s Aura Streaming Text-to-Speech WebSocket API to convert streaming text into audio.

Text to Speech WS

This guide walks you through how to turn streaming text into speech with Deepgram’s text-to-speech WebSocket API.

Before you start, you’ll need to follow the steps in the Make Your First API Request guide to obtain a Deepgram API key, and configure your environment if you are choosing to use a Deepgram SDK.

Text-to-Speech Implementations

Deepgram has several SDKs that can make the API easier to use. Follow these steps to use the SDK of your choice to make a Deepgram TTS request.

Add Dependencies

# Install the dependencies
pip install pyaudio==0.2.14 websockets

Make the Request with the SDK

import asyncio
import json
import os
import queue
import threading

from websockets.sync.client import connect

import pyaudio

TIMEOUT = 0.050
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 48000
CHUNK = 8000

DEFAULT_URL = f"wss://api.deepgram.com/v1/speak?encoding=linear16&sample_rate={RATE}"
DEFAULT_TOKEN = os.environ.get("DEEPGRAM_API_KEY", None)


def main():
    print(f"Connecting to {DEFAULT_URL}")

    _socket = connect(
        DEFAULT_URL, additional_headers={"Authorization": f"Token {DEFAULT_TOKEN}"}
    )
    _exit = threading.Event()

    _story = [
        "The sun had just begun to rise over the sleepy town of Millfield.",
        "Emily, a young woman in her mid-twenties, was already awake and bustling about.",
    ]

    async def receiver():
        speaker = Speaker()
        speaker.start()
        try:
            while True:
                if _socket is None or _exit.is_set():
                    break

                message = _socket.recv()
                if message is None:
                    continue

                # Text frames are JSON control messages; binary frames are audio.
                if type(message) is str:
                    print(message)
                elif type(message) is bytes:
                    speaker.play(message)
        except Exception as e:
            print(f"receiver: {e}")
        finally:
            speaker.stop()

    _receiver_thread = threading.Thread(target=asyncio.run, args=(receiver(),))
    _receiver_thread.start()

    for text_input in _story:
        print(f"Sending: {text_input}")
        _socket.send(json.dumps({"type": "Speak", "text": text_input}))

    print("Flushing...")
    _socket.send(json.dumps({"type": "Flush"}))

    input("Press Enter to exit...")
    _exit.set()
    _socket.send(json.dumps({"type": "Close"}))
    _socket.close()

    _receiver_thread.join()


class Speaker:
    _audio: pyaudio.PyAudio
    _chunk: int
    _rate: int
    _format: int
    _channels: int
    _output_device_index: int

    _stream: pyaudio.Stream
    _thread: threading.Thread
    _queue: queue.Queue
    _exit: threading.Event

    def __init__(
        self,
        rate: int = RATE,
        chunk: int = CHUNK,
        channels: int = CHANNELS,
        output_device_index: int = None,
    ):
        self._exit = threading.Event()
        self._queue = queue.Queue()

        self._audio = pyaudio.PyAudio()
        self._chunk = chunk
        self._rate = rate
        self._format = FORMAT
        self._channels = channels
        self._output_device_index = output_device_index

    def start(self) -> bool:
        self._stream = self._audio.open(
            format=self._format,
            channels=self._channels,
            rate=self._rate,
            input=False,
            output=True,
            frames_per_buffer=self._chunk,
            output_device_index=self._output_device_index,
        )

        self._exit.clear()

        self._thread = threading.Thread(
            target=_play, args=(self._queue, self._stream, self._exit), daemon=True
        )
        self._thread.start()

        self._stream.start_stream()

        return True

    def stop(self):
        self._exit.set()

        if self._stream is not None:
            self._stream.stop_stream()
            self._stream.close()
            self._stream = None

        self._thread.join()
        self._thread = None

        self._queue = None

    def play(self, data):
        self._queue.put(data)


def _play(audio_out: queue.Queue, stream, stop):
    # Drain queued audio chunks and write them to the output stream until
    # the stop event is set.
    while not stop.is_set():
        try:
            data = audio_out.get(True, TIMEOUT)
            stream.write(data)
        except queue.Empty:
            pass
        except Exception as e:
            print(f"_play: {e}")


if __name__ == "__main__":
    main()

To learn more, check out our WebSocket audio format tips in the TTS Chunking for Optimization guide, as well as the Audio Format Combinations we offer.

Text-to-Speech Workflow

Below is a high-level workflow for obtaining an audio stream from user-provided text.

Establish a WebSocket Connection

To establish a connection, you must provide a few parameters on the URL to describe the type of audio you want: the audio model (which controls the voice), the encoding, and the sample rate. See the API Ref (TODO: Link) for the available values.
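As a sketch of this step, the helper below assembles the connection URL from those query parameters. The model name "aura-asteria-en" is an illustrative value and the helper's name is our own; check the API reference for the current parameter values.

```python
from urllib.parse import urlencode


def build_speak_url(model: str, encoding: str, sample_rate: int) -> str:
    # The query string selects the voice model and the audio output format.
    # These must be chosen before connecting; they cannot change mid-stream.
    params = urlencode(
        {"model": model, "encoding": encoding, "sample_rate": sample_rate}
    )
    return f"wss://api.deepgram.com/v1/speak?{params}"


# "aura-asteria-en" is used here only as an example model name.
print(build_speak_url("aura-asteria-en", "linear16", 48000))
```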

Sending Text and Retrieving Audio

Send the text you want converted to audio using the WebSocket message below:

JSON
{
  "type": "Speak",
  "text": "Your text to transform to speech"
}
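For illustration, a Speak command can be serialized like this before being sent as a text frame over the open socket (the `speak_message` helper is our own naming, not part of any SDK):

```python
import json


def speak_message(text: str) -> str:
    # The server expects each command as a JSON-encoded text frame.
    return json.dumps({"type": "Speak", "text": text})


# With an open connection such as `_socket` from the full example above,
# you would send it with: _socket.send(speak_message("Hello from Deepgram!"))
print(speak_message("Hello from Deepgram!"))
```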

When you have queued enough text, you can obtain the corresponding audio by sending a Flush command.

JSON
{
  "type": "Flush"
}

Upon successfully sending the Flush, you will receive an audio byte stream over the WebSocket connection containing the synthesized speech. The audio format matches the encoding values you provided when establishing the connection.
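Because audio bytes and control messages arrive interleaved on the same socket, a receive loop has to branch on the frame type. A minimal demultiplexer might look like this (the function name and return convention are ours, for illustration):

```python
import json


def handle_message(message, audio_chunks: list) -> bool:
    # Binary frames carry raw audio in the encoding negotiated at connect
    # time; text frames are JSON control messages from the server.
    # Returns True when the frame contained audio.
    if isinstance(message, bytes):
        audio_chunks.append(message)
        return True
    event = json.loads(message)
    print(f"control message: {event.get('type')}")
    return False
```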

Closing the Connection

When you are finished with the WebSocket, you can close the connection by sending the following Close command.

JSON
{
  "type": "Close"
}

Limits

Keep these limits in mind when making a Deepgram text-to-speech request.

Use One WebSocket per Conversation

If you are building for conversational AI use cases where a human is talking to a TTS agent, you must use a single WebSocket per conversation. After you establish a connection, you cannot change the voice or media output settings.

Character Limits

Each Speak message is limited to 2000 characters of input text. If the text payload is 2001 characters or more, you will receive an error and no audio will be created.
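One way to stay under the cap is to split long input into whole-word chunks before sending each piece as its own Speak message. This is a client-side sketch; the helper name and splitting strategy are our own:

```python
MAX_SPEAK_CHARS = 2000  # per-message input limit


def chunk_text(text: str, limit: int = MAX_SPEAK_CHARS) -> list:
    # Greedily pack whole words into chunks no longer than `limit`,
    # so each Speak payload stays under the character cap.
    chunks, current, length = [], [], 0
    for word in text.split():
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if length + extra > limit and current:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


pieces = chunk_text(("word " * 1000).strip())
print([len(p) for p in pieces])
```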

Character Throughput Limits

The throughput limit is 12,000 characters per 2 minutes, measured by the number of characters sent over the WebSocket.
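A client can track its own usage with a sliding-window counter so it knows when to pause before sending more text. Here is a sketch (the class name and API are our own, and the server remains the source of truth for enforcement):

```python
import time
from collections import deque


class CharBudget:
    """Sliding-window counter for the 12,000 chars / 2 min throughput cap."""

    def __init__(self, max_chars=12_000, window_s=120.0):
        self.max_chars = max_chars
        self.window_s = window_s
        self._sent = deque()  # (timestamp, char_count) pairs

    def _trim(self, now):
        # Drop entries that have aged out of the window.
        while self._sent and now - self._sent[0][0] > self.window_s:
            self._sent.popleft()

    def can_send(self, text, now=None):
        now = time.monotonic() if now is None else now
        self._trim(now)
        used = sum(n for _, n in self._sent)
        return used + len(text) <= self.max_chars

    def record(self, text, now=None):
        now = time.monotonic() if now is None else now
        self._sent.append((now, len(text)))
```

The same pattern fits the Flush limit described below by counting messages in a 60-second window instead of characters.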

Timeout Limits

An active WebSocket has a 60-minute timeout period from the initial connection. This timeout applies even to connections that are actively being used. If you need a connection for longer than 60 minutes, create a new WebSocket connection to Deepgram.
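To avoid being cut off mid-stream, a client can track the connection's age and reconnect shortly before the 60-minute limit. A minimal sketch (class and parameter names are our own):

```python
import time

MAX_CONNECTION_AGE_S = 60 * 60  # 60-minute WebSocket lifetime


class ConnectionTimer:
    def __init__(self, now=None):
        # Record when the WebSocket was opened.
        self.opened_at = time.monotonic() if now is None else now

    def should_reconnect(self, margin_s=60.0, now=None):
        # True once the connection is within `margin_s` seconds of the cap,
        # leaving time to open a replacement connection.
        now = time.monotonic() if now is None else now
        return now - self.opened_at >= MAX_CONNECTION_AGE_S - margin_s
```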

Flush Message Limits

You can send the Flush message at most 20 times every 60 seconds. Beyond that, you will receive a warning message stating that we cannot process any more flush messages until the 60-second window has passed.

Rate Limits

The current per-project rate limit is 40 concurrent connections for Pay As You Go plans and 80 concurrent connections for Growth plans. Learn more about API Rate Limits.

What’s Next?

Now that you’ve transformed text into speech with Deepgram’s API, enhance your knowledge by exploring the following areas.

Read the Feature Guides

Deepgram’s features help you customize your request to produce the best output for your use case. Here are a few guides that can help:

Build your own End-to-End Deepgram Conversational Demo with Twilio

You can get started building a simple conversational demo using Twilio with Deepgram streaming transcription and text-to-speech WebSockets by checking out our Twilio Example with STT + TTS Streaming WS.

We’d love to get your feedback on Deepgram’s Aura text-to-speech. Within two weeks of filling out the form, you will receive $50 in additional console credits, and you may be invited to join a group of users with access to the latest private releases. To fill out the form, Click Here.
