Getting Started

An introduction to using Deepgram’s Aura streaming text-to-speech WebSocket API to convert streaming text into audio.

This guide will walk you through how to turn streaming text into speech with Deepgram’s text-to-speech WebSocket API.

Before you start, you’ll need to follow the steps in the Make Your First API Request guide to obtain a Deepgram API key, and configure your environment if you choose to use a Deepgram SDK.

Text-to-Speech Implementations

Deepgram has several SDKs that can make the API easier to use. Follow these steps to use the SDK of your choice to make a Deepgram TTS request.

Add Dependencies

# Install the SDK
npm install @deepgram/sdk

# Add the dependencies
npm install dotenv

Make the Request with the SDK

JavaScript
const fs = require("fs");
const { createClient, LiveTTSEvents } = require("@deepgram/sdk");
require("dotenv").config();

// Add a WAV audio container header to the file if you want to play the audio
// using the AudioContext API or a media player like VLC or Apple Music.
// Without this header, the audio will not play in the Chrome browser.
// prettier-ignore
const wavHeader = [
  0x52, 0x49, 0x46, 0x46, // "RIFF"
  0x00, 0x00, 0x00, 0x00, // Placeholder for file size
  0x57, 0x41, 0x56, 0x45, // "WAVE"
  0x66, 0x6D, 0x74, 0x20, // "fmt "
  0x10, 0x00, 0x00, 0x00, // Chunk size (16)
  0x01, 0x00,             // Audio format (1 for PCM)
  0x01, 0x00,             // Number of channels (1)
  0x80, 0xBB, 0x00, 0x00, // Sample rate (48000)
  0x00, 0x77, 0x01, 0x00, // Byte rate (48000 * 2 = 96000)
  0x02, 0x00,             // Block align (2)
  0x10, 0x00,             // Bits per sample (16)
  0x64, 0x61, 0x74, 0x61, // "data"
  0x00, 0x00, 0x00, 0x00  // Placeholder for data size
];

const live = async () => {
  const text = "Hello, how can I help you today?";

  const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

  const dgConnection = deepgram.speak.live({
    model: "aura-2-thalia-en",
    encoding: "linear16",
    sample_rate: 48000,
  });

  let audioBuffer = Buffer.from(wavHeader);

  dgConnection.on(LiveTTSEvents.Open, () => {
    console.log("Connection opened");

    // Send text data for TTS synthesis
    dgConnection.sendText(text);

    // Send a Flush message to the server after sending the text
    dgConnection.flush();

    dgConnection.on(LiveTTSEvents.Close, () => {
      console.log("Connection closed");
    });

    dgConnection.on(LiveTTSEvents.Metadata, (data) => {
      console.dir(data, { depth: null });
    });

    dgConnection.on(LiveTTSEvents.Audio, (data) => {
      console.log("Deepgram audio data received");
      // Concatenate the audio chunks into a single buffer
      const buffer = Buffer.from(data);
      audioBuffer = Buffer.concat([audioBuffer, buffer]);
    });

    dgConnection.on(LiveTTSEvents.Flushed, () => {
      console.log("Deepgram Flushed");
      // Write the buffered audio data to a file when the Flushed event is received
      writeFile();
    });

    dgConnection.on(LiveTTSEvents.Error, (err) => {
      console.error(err);
    });
  });

  const writeFile = () => {
    if (audioBuffer.length > 0) {
      fs.writeFile("output.wav", audioBuffer, (err) => {
        if (err) {
          console.error("Error writing audio file:", err);
        } else {
          console.log("Audio file saved as output.wav");
        }
      });
      audioBuffer = Buffer.from(wavHeader); // Reset the buffer after writing
    }
  };
};

live();

To learn more, check out the audio format tips for WebSockets in our TTS Chunking for Optimization Guide, as well as the Audio Format Combinations we offer.

Text-to-Speech Workflow

Below is a high-level workflow for obtaining an audio stream from user-provided text.

Establish a WebSocket Connection

To establish a connection, you must provide a few parameters on the URL to describe the type of audio you want. See the API Reference for how to set the audio model (which controls the voice), the encoding, and the sample rate of the audio.
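As a sketch of what this looks like without an SDK, the query parameters from the example above can be assembled into a connection URL. The endpoint path and parameter names here are assumptions based on this guide; check the API Reference for the authoritative list.

```javascript
// Build the WebSocket URL with the audio parameters described above.
const params = new URLSearchParams({
  model: "aura-2-thalia-en",
  encoding: "linear16",
  sample_rate: "48000",
});
const url = `wss://api.deepgram.com/v1/speak?${params}`;

// When opening the socket, pass your API key in the Authorization header,
// e.g. with the "ws" package (an assumption; any WebSocket client works):
//   new WebSocket(url, { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } });
console.log(url);
```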

Sending Text and Retrieving Audio

Send the desired text to transform to audio using the WebSocket message below:

JSON
{
  "type": "Speak",
  "text": "Your text to transform to speech"
}

When you have queued enough text, you can obtain the corresponding audio by sending a Flush command.

JSON
{
  "type": "Flush"
}

Upon successfully sending the Flush, you will receive an audio byte stream from the WebSocket connection containing the synthesized speech. The format will be based on the encoding values provided when establishing the connection.
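The Speak and Flush messages above (and the Close message in the next section) are all plain JSON sent as text frames. As an illustration only (this helper is not part of any SDK), they can be serialized with one small function:

```javascript
// Build a control message as the JSON text frame the WebSocket expects.
function controlMessage(type, fields = {}) {
  return JSON.stringify({ type, ...fields });
}

// Over a raw socket you would send, for example:
//   ws.send(controlMessage("Speak", { text: "Hello, how can I help you today?" }));
//   ws.send(controlMessage("Flush"));
```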

Closing the Connection

When you are finished with the WebSocket, you can close the connection by sending the following Close command.

JSON
{
  "type": "Close"
}

Limits

Keep these limits in mind when making a Deepgram text-to-speech request.

Use One WebSocket per Conversation

If you are building for conversational AI use cases where a human is talking to a TTS agent, a single WebSocket connection per conversation is required. After you establish a connection, you will not be able to change the voice or media output settings.

Character Limits

The text payload of each Speak message is currently limited to 2,000 characters. If you send a string of 2,001 characters or more, you will receive an error, and the audio will not be created.
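One way to stay under this limit is to split long input into multiple Speak messages before sending. The sketch below is illustrative, assuming you prefer to break on sentence or word boundaries; adapt the splitting rule to your content.

```javascript
// Split text into chunks that each fit in one Speak message.
const MAX_CHARS = 2000;

function chunkText(text, maxChars = MAX_CHARS) {
  const chunks = [];
  let remaining = text;
  while (remaining.length > maxChars) {
    // Prefer to break at the last sentence end (or space) inside the limit.
    const slice = remaining.slice(0, maxChars);
    let cut = Math.max(slice.lastIndexOf(". "), slice.lastIndexOf(" "));
    if (cut <= 0) cut = maxChars; // no natural break point; hard split
    chunks.push(remaining.slice(0, cut).trim());
    remaining = remaining.slice(cut).trim();
  }
  if (remaining.length > 0) chunks.push(remaining);
  return chunks;
}
```

Each returned chunk can then be sent as its own Speak message.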

Character Throughput Limits

The throughput limit is 2,400 characters per minute, measured by the number of characters sent to the WebSocket.
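If you want to pace your client rather than rely on server errors, a sliding-window counter can approximate this budget. This is a sketch under the assumption that a simple per-minute window matches how the limit is measured; the server enforces the real limit.

```javascript
// Track characters sent in the last 60 seconds against a per-minute budget.
class CharBudget {
  constructor(limit = 2400, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.sent = []; // [timestampMs, charCount] pairs
  }
  // True if sending `text` now would stay within the budget.
  canSend(text, now = Date.now()) {
    this.sent = this.sent.filter(([t]) => now - t < this.windowMs);
    const used = this.sent.reduce((sum, [, n]) => sum + n, 0);
    return used + text.length <= this.limit;
  }
  // Call after actually sending `text`.
  record(text, now = Date.now()) {
    this.sent.push([now, text.length]);
  }
}
```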

Timeout Limits

An active WebSocket connection has a 60-minute timeout period from the initial connection. This timeout applies even to connections that are actively being used. If you need a connection for longer than 60 minutes, create a new WebSocket connection to Deepgram.

Flush Message Limits

You can send the Flush message at most 20 times every 60 seconds. Beyond that, you will receive a warning message stating that we cannot process any more Flush messages until the 60-second time window has passed.

Rate Limits

For information on Deepgram’s Concurrency Rate Limits, refer to our API Rate Limits Documentation.

Handling Rate Limits

If the number of in-progress requests for a project meets or exceeds the rate limit, new requests will receive a 429: Too Many Requests error.
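A common way to handle 429 responses is to retry with exponential backoff. The sketch below is a generic illustration; `makeRequest` is a hypothetical function standing in for your request code, assumed to throw an error carrying a `status` property.

```javascript
// Retry a request on 429 errors, doubling the delay between attempts.
async function withBackoff(makeRequest, maxRetries = 5, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await makeRequest();
    } catch (err) {
      if (err.status !== 429 || attempt >= maxRetries) throw err;
      const delay = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Capping the number of retries, as shown, avoids retrying forever when a project stays at its concurrency limit.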

For suggestions on handling Concurrency Rate Limits, refer to our Working with Concurrency Rate Limits Documentation guide.

What’s Next?

Now that you’ve transformed text into speech with Deepgram’s API, enhance your knowledge by exploring the following areas.

Read the Feature Guides

Deepgram’s features help you customize your request to produce the best output for your use case. Here are a few guides that can help: