Text to Speech Latency

Learn some tips and strategies for minimizing latency in text-to-speech requests.

In the context of text-to-speech (TTS) technology, latency refers to the delay between the input text being received and the audio output being generated and returned.

How to Measure Latency

To comprehensively assess latency, we recommend to measure the total end-to-end latency and to break it down into several components in order to measure and minimize each latency component separately.

Total Latency

Total latency is an aggregation of several separate sources, including time to first byte (TTFB) and audio-synthesis latency. You can think of this as the following equation:

total_latency = network + ttfb + audio_synthesis

Run the following CURL command to measure the total latency it takes to make a complete request to Deepgram’s TTS endpoint. This will run the calculation for you, so you are given a total latency value.

time curl -H "Authorization: token $DEEPGRAM_API_KEY" \
 -o "out.mp3" \
-w "dns_resolution: %{time_namelookup}, tcp_established: %{time_connect}, ssl_handshake_done: %{time_appconnect}, TTFB: %{time_starttransfer}\n" \
-X POST \
-d "Welcome to Deepgram! We offer essential building blocks for voice AI." \
"https://api.deepgram.com/v1/speak?model=alpha-asteria-en"

In the output you can see information about the total latency, as well as its subcomponents. In this example, the total latency is 745 milliseconds.

dns_resolution: 0.118451, tcp_established: 0.244566, ssl_handshake_done: 0.339993, TTFB: 0.616216  
curl -H "Authorization: token $DEEPGRAM_API_KEY" -o "out.mp3" -w  -X POST -d   0.02s user 0.02s system 5% cpu 0.745 total

Network Latency

Network latency is the amount of time it takes for your request to leave your computer, through the internet, and be received by Deepgram. Depending on where your computer is located, as well as any additional firewalls, proxies, or VPNs, you will incur some network latency in reaching Deepgram.

You can measure the time it takes to connect to Deepgram’s servers:

curl -sSf -w "latency: %{time_connect}\\n" -so /dev/null https://api.deepgram.com  
# latency: 0.203186

📘

Deepgram’s servers are exclusively in the United States. If you are making API requests from another country, this will incur relatively higher network latency than requests from the USA.

Time to First Byte (TTFB)

Time to first byte (TTFB) is the time it takes from initiating an API request to receiving the first byte of audio from the response.

In the Total Latency example, we saw that time-to-first-byte (TTFB) occurred after 616 milliseconds. The first-byte latency can be calculated by subtracting the time it takes to complete the SSL handshake, i.e. 616 - 339 = 277 milliseconds.

dns_resolution: 0.118451, tcp_established: 0.244566, ssl_handshake_done: 0.339993, TTFB: 0.616216  
curl -H "Authorization: token $DEEPGRAM_API_KEY" -o "out.mp3" -w  -X POST -d   0.02s user 0.02s system 5% cpu 0.745 total

Another relevant piece of information is the speed-up factor, which is the ratio of the length of audio generated compared to the length of processing time. For instance, a TTS request that generates 100 seconds of audio in 5 seconds demonstrates a speed-up factor of 20x (100 ÷ 5 = 20).

After Deepgram receives your TTS API request, the process of synthesizing audio begins. Deepgram streams this output back to you to prevent waiting until the entire audio output is available. Streaming of results begins once the first byte of audio is synthesized.

Stream Audio Output

Leverage the streaming output by playing audio immediately upon receiving the first byte returned from Deepgram, as done in the following Python snippet:

with requests.post("https://api.deepgram.com/v1/speak", stream=True, headers=headers, json=json={"text": "Hello world!"}) as r:  
    for chunk in r.iter_content(chunk_size=1024):  
        if chunk:  
            stream.write(chunk)

You can begin streaming after receiving the first byte of audio because Deepgram's speed-up factor is high, far greater than the speed-up factor of 1x required to stream the output in real time.

📘

Tip: Stream the audio output after receiving the first byte. Read the guide Streaming Audio Outputs to learn more.

Audio Synthesis Latency

Finally, you can measure the time it takes to generate the complete audio response from Deepgram. This is the Audio Synthesis Latency.

If you wait for the entire audio file to be returned to you before playing it, your code will look similar to the following Python snippet:

response = requests.post("https://api.deepgram.com/v1/speak", headers=headers, json={"text": "Hello world!"})

if response.status_code == 200:  
    data = response.content  
    with open(“out.wav”, "wb") as f:  
        f.write(data)  
    play_obj = simpleaudio.play_buffer(wav_file.read())

From the Total Latency example, we can take the values for the SSL handshake and the total latency. We subtract the SSL handshake from the total latency, i.e. 745 - 339 = 406 milliseconds. This means that once Deepgram has created a secure connection, it takes 406 milliseconds to generate the TTS audio.

Minimizing Latency

Latency in text-to-speech (TTS) technology can be influenced by various factors. Here are some common elements that can affect latency, along with actionable tips to mitigate their impact.

Server Setup

The configuration and location of servers can greatly impact latency. Optimal server placement and hardware specifications can minimize processing delays.

Consider optimizing server deployment by strategically locating them closer to your target audience, investing in high-performance hardware tailored to workload demands, and implementing load balancing mechanisms to evenly distribute incoming requests, thereby minimizing latency and ensuring consistent performance.

📘

Tip: Maximize server performance and reduce latency by strategically placing servers near your target audience, investing in high-performance hardware, and implementing efficient load balancing techniques.

Length of Text Input

Deepgram TTS accepts up to 2000 characters of text input. There is a baseline level of latency that exists with any request. Beyond that, latency scales approximately linearly with increasing character input. Think y = mx + b from algebra class. The y-intercept m includes network latency, which is relatively constant at ~600 ms for total latency without streaming. Then the slope m is ~40 ms per 100 characters.

total_latency = constant_latency + latency_per_100_characters * num_characters

Imagine a 300 character input. The latency could be estimated as: 600 + 40 * 3 = 720 ms.

Then imagine tripling that to a 900-character input. The latency estimate is 600 + 40 * 9 = 960 ms.

Note that while the 900-character request is slower than the 300-character request, it takes 1.33x time to process 3x characters. Keep that in mind for longer-form content: for lowest cumulative latency across multiple requests, include as close to the max of 2000 characters in each request if you are chunking your text to fit Deepgram's input character limit.

📘

Tip: For optimal latency across multiple requests, aim to include as close to the maximum of 2000 characters in each request when chunking your text to fit input character limits.

You can also run timing experiments yourself via CURL by modifying the number of characters in your text input:

Experiment 1

# 300 characters - 756 ms latency

time curl -H "Authorization: token $DEEPGRAM_API_KEY" \
-o "out.mp3" \
-H 'content-type: text/plain' \
-X POST \
-d "Frolic, a sprightly squirrel, stumbles onto Splash, a unique flying fish tangled on the riverbank. As diverse fellows, they traverse each other’s worlds. Splash, on land, savors forest fruits while Frolic, underwater, discovers vivid coral reefs. Their journey forms a profound bond." \
"https://api.deepgram.com/v1/speak"

Output:

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 97147    0 96862  100   285   128k    385 --:--:-- --:--:-- --:--:--  129k

curl -H "Authorization: token $DEEPGRAM_API_KEY" -o "out.mp3" -H  -X POST -d   0.02s user 0.02s system 4% cpu 0.756 total

Experiment 2

# 900 characters - 879 ms latency

time curl \
-H "Authorization: token $DEEPGRAM_API_KEY" \
-o "out.mp3" \
-H 'content-type: text/plain' \
-X POST \
-d "In a serene woodland, Frolic, an adventurous squirrel, crosses paths with Splash, a flying fish marooned by a misjudged wave on the riverbank. Using his wits, Frolic helps his new friend back into the river. As a token of thanks, Splash offers Frolic an unusual proposition – a tour of each other's world. First, Splash rides on Frolic's back, savoring forest fruits and learning about rustling leaves. Then, with the help of a magical Splash-provided air bubble, Frolic dives underwater, opening gateways to vibrant coral reefs, mysterious underwater cave systems, and friendly aquatic creatures. This extraordinary journey shapes a deep-seated comradeship between the unlikeliest of friends, illuminating the joy of shared experiences and breaking down their comfort zones' barriers." \
"https://api.deepgram.com/v1/speak"

Output:

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  249k    0  249k  100   787   289k    914 --:--:-- --:--:-- --:--:--  309k

curl -H "Authorization: token $DEEPGRAM_API_KEY" -o "out.mp3" -H  -X POST -d   0.02s user 0.02s system 5% cpu 0.879 total

You will notice variation in latency; these numbers are an illustration of general trends and concepts, not a fixed result.

📘

Tip: On long content, provide the maximum number of characters per request.

Audio Output Formats

Deepgram offers many options for you to customize your output encoding, bit rate, and container, and overall audio format combinations. Select audio formats based on your specific application requirements and use cases.

Consider linear16 encoded wav audio when prioritizing high-fidelity audio quality and compatibility with a wide range of playback systems. Deepgram supports output sample rates of 8000, 16000, 24000, 32000, and 48000, and sample rate does not affect latency. You can generate audio with any of these sample rates at the same speed.

Choose compressed formats like mp3 or opus when file size and bandwidth efficiency are critical factors. These formats offer reduced file sizes while maintaining acceptable audio quality, potentially reducing latency in streaming and transmission.

📘

Tip: Choose the audio format that best suits your specific use case and requirements, considering factors such as audio quality, compatibility, file size, and efficiency, rather than solely focusing on minimizing latency.

More Tips to Minimize Latency

Text Chunking

When chunking long outputs as input into TTS, you'll want a balance of low latency and voice naturalness.

To accelerate generating a several-sentence reply, you may split by punctuation and submit a TTS request with the first sentence, streaming the sentence back from the first byte received. Then, submit another request with the subsequent audio to be streamed to your end-user once you've finished streaming that initial sentence.

Avoid splitting a single sentence into multiple requests, as the output speech will sound choppy due to normal variation in pitch and expression not being cohesive across speech snippets.

📘

Strategies and code examples for text chunking can be found in the guide Text Chunking for TTS Optimization.

LLM Latency

While this guide focuses on TTS latency, in the full context of a voice bot application that joins STT + LLM + TTS, an LLM is typically the primary source of latency. To lower voice bot LLM latency, you may consider the following aspects:

  1. If self-hosted, can a smaller model be used, or can more powerful hardware be used to serve requests?
  2. If using an external API, does the provider offer their own set of latency recommendations? Do they offer certain parameters or product tiers that enable lower latency?
  3. Can you reduce the number of input or output tokens to produce a shorter reply? For instance, add a system prompt to reply in fewer than a certain number of words or sentences (e.g. "Reply in three sentences or less").

Open-Source Resources and Demos

Try out our browser-based AI Agent Conversational Demo, and then jump start your own voice bot with its open-source implementation on GitHub at deepgram-conversational-demo.