Audio Output Streaming
Start streaming your text-to-speech REST audio as soon as the first byte arrives.
Upon successful processing of a Deepgram text-to-speech request, you will receive an audio file containing the synthesized speech.
Fortunately, you do not have to wait for the entire audio file to be available before playing it. Playback can begin as soon as the first byte arrives, and subsequent bytes can continue streaming in while the audio is already playing.
This guide provides tips and examples to help you start streaming the audio as soon as you receive the first byte.
Implementation Examples
The following two examples demonstrate how to play the audio as soon as the first byte is returned. The first example takes a single text source and sends it to Deepgram for processing, while the second example chunks the text source by sentence boundaries and then consecutively sends each chunk to Deepgram for processing.
To see similar code examples that use Deepgram’s SDKs, check out the output streaming branch of the text-to-speech Starter Apps.
Single Text Source Payload
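Below is a minimal Python sketch of the single-payload approach, assuming the `/v1/speak` REST endpoint with a JSON `text` body, an `aura-asteria-en` model, a `DEEPGRAM_API_KEY` environment variable, and `ffplay` (from FFmpeg) installed locally. Adapt these details to your own setup; for production code, prefer the SDK examples linked above.

```python
import os
import subprocess

import requests

# Assumed values -- substitute your own model and API key.
DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
headers = {
    "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
    "Content-Type": "application/json",
}

text = "Hello! This audio starts playing while the rest is still downloading."

# ffplay reads audio from stdin and begins playback immediately.
player = subprocess.Popen(
    ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
)

# stream=True keeps requests from buffering the whole body before returning.
with requests.post(DEEPGRAM_URL, headers=headers, json={"text": text}, stream=True) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            player.stdin.write(chunk)  # hand each chunk to the player as it arrives

player.stdin.close()
player.wait()
```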
Chunked Text Source Payload
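The sketch below makes the same assumptions as the previous example. It splits the text on sentence boundaries with a naive regex, sends each sentence as its own request, and feeds all of the returned audio into a single player process, so the first sentence can play while later ones are still being synthesized.

```python
import os
import re
import subprocess

import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
headers = {
    "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
    "Content-Type": "application/json",
}

text = (
    "Deepgram provides speech-to-text and text-to-speech APIs. "
    "This example streams each sentence as it is synthesized. "
    "Playback begins with the very first sentence."
)

# Naive sentence segmentation on ., !, and ? boundaries.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

player = subprocess.Popen(
    ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
)

# Send each sentence as its own request and pipe the audio into the
# same player process as soon as bytes come back.
for sentence in sentences:
    with requests.post(
        DEEPGRAM_URL, headers=headers, json={"text": sentence}, stream=True
    ) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                player.stdin.write(chunk)

player.stdin.close()
player.wait()
```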
Read more about text chunking as an optimization strategy in the guide Text Chunking for TTS REST Optimization.
Tips
Chunked Transfer Encoding
Use a reasonable chunk size to strike a balance between efficiency and responsiveness. This allows for efficient processing of the response content without overwhelming memory resources.
The optimal chunk size can vary with the audio format, but for real-time applications where low latency is crucial, smaller chunks are generally preferred: they allow for faster data transmission and processing, reducing overall latency in the audio playback pipeline.
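In the requests-based examples above, the chunk size is simply the `chunk_size` argument to `iter_content`. A short sketch, assuming the same endpoint and API key as before:

```python
import os

import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}

# Smaller chunks lower time-to-first-audio; larger chunks amortize per-chunk
# overhead. A few KB is a reasonable starting point for real-time audio.
LOW_LATENCY_CHUNK_SIZE = 2048

audio = bytearray()
with requests.post(
    DEEPGRAM_URL, headers=headers, json={"text": "Testing chunk sizes."}, stream=True
) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=LOW_LATENCY_CHUNK_SIZE):
        audio.extend(chunk)  # in a real app, hand each chunk to your buffer/player
```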
Dynamic Buffering
Implement dynamic buffering techniques to adapt to fluctuations in network conditions and audio playback requirements. Instead of using a fixed buffer size, dynamically adjust the buffer size based on factors such as network latency, available bandwidth, and audio playback latency.
This adaptive buffering approach helps optimize audio playback performance, ensuring smooth and uninterrupted streaming even under varying network conditions. Techniques such as buffer prediction algorithms or rate-based buffering can help dynamically adjust buffer sizes to maintain optimal audio streaming quality.
Buffer Prediction Algorithm Example:
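Here is an illustrative sketch of one simple prediction heuristic: track an exponential moving average of the gaps between chunk arrivals, then grow the target buffer when chunks arrive slowly (to absorb jitter) and shrink it when they arrive quickly (to reduce latency). The class name and thresholds are illustrative starting points, not a production-tuned algorithm.

```python
import time


class AdaptiveBuffer:
    """Adjusts its target size based on observed inter-chunk arrival times."""

    def __init__(self, min_size=4096, max_size=65536):
        self.min_size = min_size
        self.max_size = max_size
        self.target_size = min_size
        self.pending = bytearray()   # bytes waiting to be flushed to the player
        self._last_arrival = None
        self._avg_gap = 0.0          # moving average of inter-chunk gaps (seconds)

    def add_chunk(self, chunk):
        now = time.monotonic()
        if self._last_arrival is not None:
            gap = now - self._last_arrival
            self._avg_gap = 0.8 * self._avg_gap + 0.2 * gap
            if self._avg_gap > 0.05:      # chunks arriving slowly: buffer more
                self.target_size = min(self.target_size * 2, self.max_size)
            elif self._avg_gap < 0.01:    # chunks arriving quickly: buffer less
                self.target_size = max(self.target_size // 2, self.min_size)
        self._last_arrival = now
        self.pending.extend(chunk)

    def take_ready(self):
        """Return buffered audio once enough has accumulated, else nothing."""
        if len(self.pending) >= self.target_size:
            ready = bytes(self.pending)
            self.pending.clear()
            return ready
        return b""
```

In the streaming loops above, you would call `add_chunk(chunk)` for each received chunk and write whatever `take_ready()` returns to the player, so the flush threshold adapts as network conditions change.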
Optimizing Streaming Performance
Select the audio format and configuration that best suits your streaming requirements and playback environment. Consider factors such as compression efficiency, network bandwidth, and device compatibility.
Efficient Streaming with Lower Bandwidth
Opus, AAC, and MP3 are considered efficient for streaming over networks with lower available bandwidth. These formats typically offer good compression without significant loss of audio quality, making them suitable for streaming applications where conserving bandwidth is crucial. They are optimized for efficient transmission and decoding, allowing for smoother playback even under bandwidth constraints.
High-Quality Streaming with Higher Bandwidth
FLAC and Linear PCM (linear16), along with AAC at high bitrates, are better suited for streaming over networks with higher available bandwidth. These formats prioritize audio quality over compression efficiency, resulting in higher-fidelity audio reproduction. While they require more bandwidth than the more efficient codecs above, they deliver superior audio quality, making them ideal for scenarios where fidelity is paramount, such as music streaming or professional audio applications.
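With Deepgram's REST text-to-speech, the output format is selected with query parameters on the request URL. The parameter names below (`encoding`, `bit_rate`, `sample_rate`) reflect the Speak API, but confirm the supported values against the API reference for your chosen model. A sketch of the two profiles described above:

```python
# Assumed model and parameter values; confirm against the Speak API reference.
BASE_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"

# Lower bandwidth: a compressed codec at a modest bit rate.
low_bandwidth_url = BASE_URL + "&encoding=opus&bit_rate=32000"

# Higher bandwidth: uncompressed PCM at a high sample rate.
high_fidelity_url = BASE_URL + "&encoding=linear16&sample_rate=48000"
```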