Audio Output Streaming
Start streaming your text-to-speech REST audio as soon as the first byte arrives.
Upon successful processing of a Deepgram text-to-speech request, you will receive an audio file containing the synthesized speech.
Fortunately, you do not have to wait for the entire audio file to be available before playing it. Playback can begin as soon as the first byte arrives, and subsequent bytes can continue streaming in while the audio is already playing.
This guide provides tips and examples to help you start streaming the audio as soon as you receive the first byte.
Implementation Examples
The following two examples demonstrate how to play the audio as soon as the first byte is returned. The first example takes a single text source and sends it to Deepgram for processing, while the second example chunks the text source by sentence boundaries and then consecutively sends each chunk to Deepgram for processing.
To see similar code examples that use Deepgram’s SDKs, check out the output streaming branch of the text-to-speech Starter Apps.
Single Text Source Payload
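Below is a minimal Python sketch of the single-payload approach, assuming the `/v1/speak` REST endpoint with a JSON `text` body, an `aura-asteria-en` model, a `DEEPGRAM_API_KEY` environment variable, and `ffplay` (from FFmpeg) installed locally. Adapt these details to your own setup; for production code, prefer the SDK examples linked above.

```python
import os
import subprocess

import requests

# Assumed values -- substitute your own model and API key.
DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
headers = {
    "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
    "Content-Type": "application/json",
}

text = "Hello! This audio starts playing while the rest is still downloading."

# ffplay reads audio from stdin and begins playback immediately.
player = subprocess.Popen(
    ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
)

# stream=True keeps requests from buffering the whole body before returning.
with requests.post(DEEPGRAM_URL, headers=headers, json={"text": text}, stream=True) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            player.stdin.write(chunk)  # hand each chunk to the player as it arrives

player.stdin.close()
player.wait()
```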
Chunked Text Source Payload
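The sketch below makes the same assumptions as the previous example. It splits the text on sentence boundaries with a naive regex, sends each sentence as its own request, and feeds all of the returned audio into a single player process, so the first sentence can play while later ones are still being synthesized.

```python
import os
import re
import subprocess

import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
headers = {
    "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
    "Content-Type": "application/json",
}

text = (
    "Deepgram provides speech-to-text and text-to-speech APIs. "
    "This example streams each sentence as it is synthesized. "
    "Playback begins with the very first sentence."
)

# Naive sentence segmentation on ., !, and ? boundaries.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

player = subprocess.Popen(
    ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
)

# Send each sentence as its own request and pipe the audio into the
# same player process as soon as bytes come back.
for sentence in sentences:
    with requests.post(
        DEEPGRAM_URL, headers=headers, json={"text": sentence}, stream=True
    ) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                player.stdin.write(chunk)

player.stdin.close()
player.wait()
```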
Read more about text chunking as an optimization strategy in the guide Text Chunking for TTS REST Optimization.
Tips
Chunked Transfer Encoding
Use a reasonable chunk size to strike a balance between efficiency and responsiveness. This allows for efficient processing of the response content without overwhelming memory resources.
The optimal chunk size can vary with the audio format, but for real-time applications where low latency is crucial, smaller chunks are generally preferred: they allow for faster data transmission and processing, reducing overall latency in the audio playback pipeline.
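In the requests-based examples above, the chunk size is simply the `chunk_size` argument to `iter_content`. A short sketch, assuming the same endpoint and API key as before:

```python
import os

import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}

# Smaller chunks lower time-to-first-audio; larger chunks amortize per-chunk
# overhead. A few KB is a reasonable starting point for real-time audio.
LOW_LATENCY_CHUNK_SIZE = 2048

audio = bytearray()
with requests.post(
    DEEPGRAM_URL, headers=headers, json={"text": "Testing chunk sizes."}, stream=True
) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=LOW_LATENCY_CHUNK_SIZE):
        audio.extend(chunk)  # in a real app, hand each chunk to your buffer/player
```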
Dynamic Buffering
Implement dynamic buffering techniques to adapt to fluctuations in network conditions and audio playback requirements. Instead of using a fixed buffer size, dynamically adjust the buffer size based on factors such as network latency, available bandwidth, and audio playback latency.
This adaptive buffering approach helps optimize audio playback performance, ensuring smooth and uninterrupted streaming even under varying network conditions. Techniques such as buffer prediction algorithms or rate-based buffering can help dynamically adjust buffer sizes to maintain optimal audio streaming quality.
Buffer Prediction Algorithm Example:
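Here is an illustrative sketch of one simple prediction heuristic: track an exponential moving average of the gaps between chunk arrivals, then grow the target buffer when chunks arrive slowly (to absorb jitter) and shrink it when they arrive quickly (to reduce latency). The class name and thresholds are illustrative starting points, not a production-tuned algorithm.

```python
import time


class AdaptiveBuffer:
    """Adjusts its target size based on observed inter-chunk arrival times."""

    def __init__(self, min_size=4096, max_size=65536):
        self.min_size = min_size
        self.max_size = max_size
        self.target_size = min_size
        self.pending = bytearray()   # bytes waiting to be flushed to the player
        self._last_arrival = None
        self._avg_gap = 0.0          # moving average of inter-chunk gaps (seconds)

    def add_chunk(self, chunk):
        now = time.monotonic()
        if self._last_arrival is not None:
            gap = now - self._last_arrival
            self._avg_gap = 0.8 * self._avg_gap + 0.2 * gap
            if self._avg_gap > 0.05:      # chunks arriving slowly: buffer more
                self.target_size = min(self.target_size * 2, self.max_size)
            elif self._avg_gap < 0.01:    # chunks arriving quickly: buffer less
                self.target_size = max(self.target_size // 2, self.min_size)
        self._last_arrival = now
        self.pending.extend(chunk)

    def take_ready(self):
        """Return buffered audio once enough has accumulated, else nothing."""
        if len(self.pending) >= self.target_size:
            ready = bytes(self.pending)
            self.pending.clear()
            return ready
        return b""
```

In the streaming loops above, you would call `add_chunk(chunk)` for each received chunk and write whatever `take_ready()` returns to the player, so the flush threshold adapts as network conditions change.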
Optimizing Streaming Performance
Select the audio format and configuration that best suits your streaming requirements and playback environment. Consider factors such as compression efficiency, network bandwidth, and device compatibility.
Efficient Streaming with Lower Bandwidth
Opus, AAC, and MP3 are considered efficient for streaming over networks with lower available bandwidth. These formats typically offer good compression without significant loss of audio quality, making them suitable for streaming applications where conserving bandwidth is crucial. They are optimized for efficient transmission and decoding, allowing for smoother playback even under bandwidth constraints.
High-Quality Streaming with Higher Bandwidth
FLAC and Linear PCM (linear16), along with AAC at high bitrates, are better suited for streaming over networks with higher available bandwidth. These formats prioritize audio quality over compression efficiency, resulting in higher-fidelity audio reproduction. While they require more bandwidth than the more efficient codecs above, they deliver superior audio quality, making them ideal for scenarios where fidelity is paramount, such as music streaming or professional audio applications.
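With Deepgram's REST text-to-speech, the output format is selected with query parameters on the request URL. The parameter names below (`encoding`, `bit_rate`, `sample_rate`) reflect the Speak API, but confirm the supported values against the API reference for your chosen model. A sketch of the two profiles described above:

```python
# Assumed model and parameter values; confirm against the Speak API reference.
BASE_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"

# Lower bandwidth: a compressed codec at a modest bit rate.
low_bandwidth_url = BASE_URL + "&encoding=opus&bit_rate=32000"

# Higher bandwidth: uncompressed PCM at a high sample rate.
high_fidelity_url = BASE_URL + "&encoding=linear16&sample_rate=48000"
```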