Self-Hosted Text to Speech

Self-hosted Text to Speech deployments often have unique latency and availability requirements that are worth optimizing for.

Deepgram Aura is a natural-sounding, high-throughput text-to-speech (TTS) model for real-time voicebots and conversational AI applications. As with all of our other product lines, we offer our text-to-speech API as a self-hosted product, in addition to our hosted offering. Self-hosting text-to-speech services can significantly enhance the performance of your application and may help you meet other business requirements.

Text-to-speech capabilities are automatically included with the Deepgram API and Engine, in addition to speech-to-text and audio intelligence. If you are interested in learning more about self-hosting Deepgram text-to-speech, please see our Text to Speech guides, and reach out to Support for information on our self-hosted offering.

There are a number of considerations to keep in mind when deploying Deepgram TTS in a self-hosted environment.

Latency and Throughput

First, let's define some helpful terms.

Latency is the time delay between when a request is made and when the system begins to respond. In the context of self-hosted TTS, this can be generalized as the time between submitting a TTS API request and receiving the first byte of audio data, which you can then begin playing back to the end user.
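
As a rough illustration, the sketch below measures that time to first byte against a self-hosted Deepgram API over a streaming HTTP response. The host, port, and model name used here are assumptions; substitute the values for your own deployment.

```python
import time

import requests

# Assumed endpoint for a self-hosted Deepgram API container; adjust the host,
# port, and model to match your deployment. Depending on your configuration,
# you may also need to supply an Authorization header.
URL = "http://localhost:8080/v1/speak?model=aura-asteria-en"

start = time.monotonic()
with requests.post(
    URL,
    json={"text": "Hello! How can I help you today?"},
    stream=True,
    timeout=30,
) as response:
    response.raise_for_status()
    # Time to first byte: the delay before audio playback can begin.
    first_chunk = next(response.iter_content(chunk_size=1024))
    ttfb_ms = (time.monotonic() - start) * 1000

print(f"First audio byte after {ttfb_ms:.0f} ms ({len(first_chunk)} bytes)")
```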

Throughput refers to the number of requests or transactions that can be processed successfully within a given time period. In the context of self-hosted TTS, this may refer to the number of individual TTS requests that can be processed per second or per minute.

There is a technical tradeoff between these two metrics. If you submit a lower number of concurrent TTS requests to a single GPU, each request will complete with lower latency, at the expense of lower overall throughput. If you raise the number of concurrent requests, the latency of each request will rise, but your overall system throughput will rise as well.
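
A simple way to observe this tradeoff is to sweep the number of concurrent requests and record both per-request latency and aggregate throughput. The sketch below does this with a thread pool, using the same assumed endpoint as above; the concurrency levels and request counts are arbitrary starting points, not recommendations.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/speak?model=aura-asteria-en"  # assumed endpoint
PAYLOAD = {"text": "Hello! How can I help you today?"}
REQUESTS_PER_LEVEL = 32

def timed_request() -> float:
    """Return the wall-clock seconds for one complete TTS request."""
    start = time.monotonic()
    response = requests.post(URL, json=PAYLOAD, timeout=60)
    response.raise_for_status()
    return time.monotonic() - start

for concurrency in (1, 2, 4, 8, 16):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed_request(), range(REQUESTS_PER_LEVEL)))
    elapsed = time.monotonic() - start
    print(
        f"concurrency={concurrency:2d}  "
        f"median latency={statistics.median(latencies):.2f}s  "
        f"throughput={REQUESTS_PER_LEVEL / elapsed:.1f} req/s"
    )
```

As concurrency rises, you should see median latency climb while requests per second also climbs, until the GPU saturates and additional concurrency only adds latency.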

Finding the correct balance between latency and throughput is essential to ensure the success of your application and limit the hardware cost for your self-hosted environment. Your Deepgram Account Representative will assist you with optimizing this balance.

Dedicated Servers

Real-time text-to-speech applications, such as voicebots, often have strict latency requirements to ensure a natural experience for the end user. Other application domains that are not real-time, such as producing pre-recorded speech snippets for accessibility, will instead want to optimize for overall throughput to make the best use of hardware resources.

In both cases, it is beneficial to designate dedicated hardware for text-to-speech traffic. You may already be running Deepgram self-hosted speech-to-text in your production environments. If this is the case, you should dedicate separate GPUs to handle text-to-speech requests, and route traffic accordingly.
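
At the application layer, this routing can be as simple as keeping separate base URLs for your speech-to-text and text-to-speech deployments. The sketch below assumes two hypothetical internal hostnames, each fronting its own pool of dedicated GPUs; in practice you might implement the same split in a load balancer or reverse proxy instead.

```python
import requests

# Hypothetical hostnames for two separate self-hosted deployments, each backed
# by its own dedicated GPUs; substitute your own addresses and ports.
STT_BASE_URL = "http://stt.internal.example.com:8080"
TTS_BASE_URL = "http://tts.internal.example.com:8080"

def transcribe(audio: bytes) -> dict:
    # Speech-to-text requests are routed to the STT pool.
    response = requests.post(
        f"{STT_BASE_URL}/v1/listen",
        data=audio,
        headers={"Content-Type": "audio/wav"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

def speak(text: str) -> bytes:
    # Text-to-speech requests are routed to the dedicated TTS pool.
    response = requests.post(
        f"{TTS_BASE_URL}/v1/speak?model=aura-asteria-en",
        json={"text": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.content
```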

Hardware Selection

Latency is often the critical performance metric for real-time applications, as opposed to overall throughput. Deepgram self-hosted products run on NVIDIA GPUs, and certain NVIDIA GPU families will have better performance for text-to-speech in particular. Please consult your Deepgram Account Representative for recommendations on selecting hardware for text-to-speech services in your self-hosted environment.