Text To Speech Streaming — Deepgram

Aura-2 is currently available for the TTS REST API only. Websocket support is coming soon.

The Deepgram Speak clients allow you to generate human-like audio from text.

This SDK supports both the Threaded and Async/Await Clients as described in the Threaded and Async IO Task Support section. The code blocks contain a tab for Threaded and Async to show examples for websocket versus asyncwebsocket, respectively. The difference between Threaded and Async is subtle.

Installing the SDK

Python

# Install the Deepgram Python SDK
# https://github.com/deepgram/deepgram-python-sdk

pip install deepgram-sdk==3.*

Make a Deepgram Text-to-Speech Request

The Deepgram Speak clients allow you to generate audio from provided text and stream the audio bytes through your AudioData callback function. You can also subscribe to other important events, such as:

Metadata - obtain information about the audio stream.
Flushed - receive an acknowledgment when all the text has been received by Deepgram to convert to audio. This is the result of sending the Flush message via the flush() function.
Cleared - receive an acknowledgment that the text buffer has been cleared on the Deepgram server. This is the result of sending the Clear message via the clear() function.

from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    SpeakWebSocketEvents,
    SpeakWSOptions,
)

TTS_TEXT = "Hello, this is a text to speech example using Deepgram."

# Used to print the binary-data notice only once
warning_notice = True


def main():
    try:
        # Create a Deepgram client
        config: DeepgramClientOptions = DeepgramClientOptions(
            options={"speaker_playback": "true"},
        )
        deepgram: DeepgramClient = DeepgramClient("", config)

        # Create a websocket connection to Deepgram
        dg_connection = deepgram.speak.websocket.v("1")

        def on_open(self, open, **kwargs):
            print(f"\n\n{open}\n\n")

        def on_binary_data(self, data, **kwargs):
            global warning_notice
            if warning_notice:
                print("Received binary data")
                print("You can do something with the binary data here")
                print("OR")
                print(
                    "If you want to simply play the audio, set speaker_playback to true in the options for DeepgramClientOptions"
                )
                warning_notice = False

        def on_metadata(self, metadata, **kwargs):
            print(f"\n\n{metadata}\n\n")

        def on_flush(self, flushed, **kwargs):
            print(f"\n\n{flushed}\n\n")

        def on_clear(self, clear, **kwargs):
            print(f"\n\n{clear}\n\n")

        def on_close(self, close, **kwargs):
            print(f"\n\n{close}\n\n")

        def on_warning(self, warning, **kwargs):
            print(f"\n\n{warning}\n\n")

        def on_error(self, error, **kwargs):
            print(f"\n\n{error}\n\n")

        def on_unhandled(self, unhandled, **kwargs):
            print(f"\n\n{unhandled}\n\n")

        # Register a handler for each event
        dg_connection.on(SpeakWebSocketEvents.Open, on_open)
        dg_connection.on(SpeakWebSocketEvents.AudioData, on_binary_data)
        dg_connection.on(SpeakWebSocketEvents.Metadata, on_metadata)
        dg_connection.on(SpeakWebSocketEvents.Flushed, on_flush)
        dg_connection.on(SpeakWebSocketEvents.Cleared, on_clear)
        dg_connection.on(SpeakWebSocketEvents.Close, on_close)
        dg_connection.on(SpeakWebSocketEvents.Error, on_error)
        dg_connection.on(SpeakWebSocketEvents.Warning, on_warning)
        dg_connection.on(SpeakWebSocketEvents.Unhandled, on_unhandled)

        # Connect to the websocket
        options = SpeakWSOptions(
            model="aura-asteria-en",
            encoding="linear16",
            sample_rate=16000,
        )

        if dg_connection.start(options) is False:
            print("Failed to start connection")
            return

        # Send the text to Deepgram
        dg_connection.send_text(TTS_TEXT)

        # If auto_flush_speak_delta is not used, you must flush the connection by calling flush()
        dg_connection.flush()

        # Wait for the audio to finish
        dg_connection.wait_for_complete()

        print("\n\nPress Enter to stop...\n\n")
        input()

        # Close the connection
        dg_connection.finish()

        print("Finished")

    except Exception as e:
        print(f"An unexpected error occurred: {e}")


if __name__ == "__main__":
    main()
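The AudioData handler in the example above only prints a notice. If you want to keep the audio rather than play it, one possible pattern is to buffer the chunks and group them each time a Flushed event arrives. The following is a minimal sketch, not part of the SDK: make_audio_collector and the names it returns are hypothetical helpers, the handler signatures are copied from the example above, and it assumes that the audio for the flushed text has been delivered by the time Flushed is received.

```python
from typing import Callable, List, Tuple


def make_audio_collector() -> Tuple[Callable, Callable, List[bytes]]:
    """Build AudioData/Flushed handlers that group audio chunks per flush.

    Hypothetical helper for illustration; not part of the Deepgram SDK.
    """
    pending = bytearray()         # audio received since the last flush
    utterances: List[bytes] = []  # one entry per Flushed event

    def on_binary_data(self, data, **kwargs):
        # Chunks arrive in varying sizes, so append them in arrival order.
        pending.extend(data)

    def on_flush(self, flushed, **kwargs):
        # Assumption: all audio for the flushed text has arrived by now.
        utterances.append(bytes(pending))
        pending.clear()

    return on_binary_data, on_flush, utterances
```

The two returned handlers can be registered in place of on_binary_data and on_flush in the example above, using the same dg_connection.on(...) calls for the AudioData and Flushed events.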

Audio Output Streaming

The audio bytes for the converted text are streamed to the client through the AudioData event described above and delivered to your callback function.

Note that these audio bytes are:

Container-less audio. Depending on the encoding value chosen, only the raw audio data is sent. For example, if you choose linear16 as your encoding, a WAV header will not be sent. Please see the Tips and Tricks for more information.
Not of a standard size or length when received by the client. The text is broken down into sounds representing the speech, and the sounds chained together to form fragments of spoken words differ in length and content.
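Because linear16 audio arrives without a WAV header, the collected bytes need a container before most players will open them. The sketch below uses Python's standard wave module to add one. pcm_to_wav is a hypothetical helper, and it assumes mono, 16-bit audio at the 16000 Hz sample_rate requested in the example above; adjust the parameters if your options differ.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw linear16 PCM bytes in a WAV container (hypothetical helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)     # assumption: mono audio
        wav_file.setsampwidth(2)            # linear16 = 2 bytes per sample
        wav_file.setframerate(sample_rate)  # must match the requested sample_rate
        wav_file.writeframes(pcm)           # header sizes are filled in on close
    return buffer.getvalue()
```

Since the chunks vary in size, concatenate everything received for an utterance first and call pcm_to_wav once on the combined bytes, rather than wrapping each chunk separately.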

Depending on what the use case is for the generated audio bytes, please visit one of these guides to better help utilize these audio bytes for your use case:

Where to Find Additional Examples

The SDK repository has a good collection of text-to-speech examples, and the README contains links to them. Each example below demonstrates a different option for generating audio from text.

Some Examples:

Threaded Client speaking “Hello World” - examples/text-to-speech/websocket/complete

If the Async Client suits your use case better:

Async Client speaking “Hello World” - examples/text-to-speech/websocket/async_complete
