Text Chunking for Streaming TTS Optimization

Text chunking is the process of breaking text input into smaller, manageable chunks before it is processed.
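As a rough illustration of the idea, the sketch below splits text at sentence boundaries and groups the sentences into chunks. The function name `chunk_text` and the `max_chars` limit are hypothetical choices for this example, not part of any Deepgram API, and the sentence-splitting rule is deliberately simple.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence.
    """
    # Split after ., ! or ? when followed by whitespace (a simple heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Calling `chunk_text("Hello there. How are you? Fine.", max_chars=15)` yields one chunk per sentence, while the default limit keeps the whole text in a single chunk.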

Code Examples for TTS WebSockets

The idea behind using WebSockets for Text-to-Speech is to stream audio data continuously to the playback device instead of first writing it to disk. This presents some challenges, particularly around the latency of the audio stream, the audio format, and the playback device itself.

Simple WebSocket Example using Audio

The example below takes a TTS audio stream and sends it directly to a playback device such as a speaker. Most hardware devices today support linear16 audio, so this approach is sufficient for laptops (Linux, macOS, Windows), IoT/Edge devices, and many other hardware platforms.

import sounddevice as sd
import numpy as np
import time

from deepgram import (
    DeepgramClient,
    SpeakWebSocketEvents,
    SpeakOptions,
)

TTS_TEXT = "Hello, this is a text to speech example using Deepgram."

def main():
    try:
        # Create a Deepgram client using the API key from environment variables
        deepgram: DeepgramClient = DeepgramClient()

        # Create a websocket connection to Deepgram
        dg_connection = deepgram.speak.websocket.v("1")

        def on_open(self, open, **kwargs):
            print(f"\n\n{open}\n\n")

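        def on_binary_data(self, data, **kwargs):
            print("Received binary data")
            # linear16 is 16-bit signed PCM, so interpret the bytes as int16 samples
            array = np.frombuffer(data, dtype=np.int16)
            # Play at the same sample rate requested in SpeakOptions below
            sd.play(array, 48000)
            # Block until this chunk has finished playing
            sd.wait()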

        def on_close(self, close, **kwargs):
            print(f"\n\n{close}\n\n")

        dg_connection.on(SpeakWebSocketEvents.Open, on_open)
        dg_connection.on(SpeakWebSocketEvents.AudioData, on_binary_data)
        dg_connection.on(SpeakWebSocketEvents.Close, on_close)

        # Configure the audio format. The sample rate must match the rate
        # passed to sd.play() in on_binary_data.
        options = SpeakOptions(
            model="aura-asteria-en",
            encoding="linear16",
            sample_rate=48000,
        )

        # Open the websocket connection
        if dg_connection.start(options) is False:
            print("Failed to start connection")
            return

        # send the text to Deepgram
        dg_connection.send_text(TTS_TEXT)
        dg_connection.flush()

        # Give the audio time to arrive and start playing, then wait for the
        # user before closing the connection
        time.sleep(5)
        print("\n\nPress Enter to stop...\n\n")
        input()
        dg_connection.finish()

        print("Finished")
        
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


if __name__ == "__main__":
    main()
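The example above waits a fixed five seconds for playback. As a possible alternative, the playing time of received audio can be estimated from its size, because linear16 stores each sample as 2 bytes. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical, and it assumes single-channel (mono) audio at the 48000 Hz sample rate requested above.

```python
def linear16_duration_seconds(
    num_bytes: int, sample_rate: int = 48000, channels: int = 1
) -> float:
    """Estimate the playing time of raw linear16 audio from its byte count.

    Assumes 16-bit samples (2 bytes each); channels=1 is an assumption
    made for this sketch.
    """
    bytes_per_sample = 2
    return num_bytes / (bytes_per_sample * channels * sample_rate)
```

For instance, 96000 bytes of mono audio at 48000 Hz corresponds to one second of speech, so summing the length of each `data` payload in `on_binary_data` would give a running estimate of how much audio has arrived.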

WebSocket Considerations

When using text chunking to minimize latency, keep the following factors in mind:

  • User expectations: Consider users' preferences and needs, such as their tolerance for latency, the desired quality of synthesized speech, and their overall satisfaction with the application's performance.
  • By default, the WebSocket only supports sending partially chunked audio.

For WebSockets

  • Optimize Latency: We handle all of the chunking for you to balance latency against the naturalness of speech. To further reduce time-to-first-byte when synthesizing LLM output, try experimenting with the chunk length of the first sentence of your response, for example by splitting it at the first “,” or “.”.
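One way to experiment with the first chunk is sketched below: it buffers streamed LLM tokens, emits the text up to the first "," or "." as soon as that character arrives, and emits the remainder once the stream ends. The function name `first_clause_then_rest` and the token list are hypothetical, and the split rule is intentionally naive (it would also split a number such as "3.5").

```python
from typing import Iterable, Iterator

BREAKS = (",", ".")


def first_clause_then_rest(tokens: Iterable[str]) -> Iterator[str]:
    """Yield the first clause early, then the rest of the text at the end."""
    buffer = ""
    first_sent = False
    for token in tokens:
        buffer += token
        if not first_sent:
            # Find the earliest break character seen so far, if any.
            positions = [buffer.find(ch) for ch in BREAKS if ch in buffer]
            if positions:
                cut = min(positions) + 1
                yield buffer[:cut].strip()
                buffer = buffer[cut:]
                first_sent = True
    # Emit whatever text remains once the token stream is exhausted.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Sure", ", I can", " help", " with that.", " What", " next?"]
    for chunk in first_clause_then_rest(llm_tokens):
        print(chunk)
```

Each yielded chunk could then be passed to `dg_connection.send_text()` as in the earlier example, so that synthesis of the opening clause can begin before the LLM has finished generating the full response.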