Text Chunking for Streaming TTS Optimization
Text chunking is the process of breaking down text inputs into smaller, manageable chunks before processing.
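As a rough illustration of the idea, the sketch below groups sentences into chunks of a bounded size. The function name `chunk_text`, the `max_chars` default, and the punctuation-based sentence split are assumptions made for this example, not part of any Deepgram API.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of roughly max_chars characters (hypothetical helper)."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    # This is a sketch, not a full sentence tokenizer (abbreviations will mis-split).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, which favors natural-sounding speech over a strict size limit.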
Code Examples for TTS WebSockets
The idea behind using WebSockets for Text-to-Speech is to avoid writing audio to disk before playback and instead send a continuous stream of audio data to the playback device. This presents some challenges, particularly around the audio stream's latency, the audio format, and the playback device itself.
Simple WebSocket Example using Audio
The example below shows a simple way to take a TTS audio stream and send it directly to a playback device such as a speaker. Most hardware devices today support linear16 audio, which makes this approach suitable for your laptop (Linux, macOS, Windows), IoT/Edge devices, and many other hardware platforms.
import time

import numpy as np
import sounddevice as sd
from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    SpeakWebSocketEvents,
    SpeakOptions,
)

TTS_TEXT = "Hello, this is a text to speech example using Deepgram."


def main():
    try:
        # Create a Deepgram client using the API key from environment variables
        deepgram: DeepgramClient = DeepgramClient()

        # Create a websocket connection to Deepgram
        dg_connection = deepgram.speak.websocket.v("1")

        def on_open(self, open, **kwargs):
            print(f"\n\n{open}\n\n")

        def on_binary_data(self, data, **kwargs):
            # Interpret the raw linear16 bytes as 16-bit samples and play them
            print("Received binary data")
            array = np.frombuffer(data, dtype=np.int16)
            sd.play(array, 48000)
            sd.wait()

        def on_close(self, close, **kwargs):
            print(f"\n\n{close}\n\n")

        # Register the event handlers
        dg_connection.on(SpeakWebSocketEvents.Open, on_open)
        dg_connection.on(SpeakWebSocketEvents.AudioData, on_binary_data)
        dg_connection.on(SpeakWebSocketEvents.Close, on_close)

        # Request linear16 audio at the same sample rate used for playback
        options = SpeakOptions(
            model="aura-asteria-en",
            encoding="linear16",
            sample_rate=48000,
        )

        # Connect to the websocket
        if dg_connection.start(options) is False:
            print("Failed to start connection")
            return

        # Send the text to Deepgram
        dg_connection.send_text(TTS_TEXT)
        dg_connection.flush()

        # Give the audio time to arrive and play before prompting
        time.sleep(5)
        print("\n\nPress Enter to stop...\n\n")
        input()

        # Indicate that we've finished
        dg_connection.finish()
        print("Finished")

    except Exception as e:
        print(f"An unexpected error occurred: {e}")


if __name__ == "__main__":
    main()
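The example above requests linear16 audio at 48000 Hz, so the playback time carried by each binary message can be estimated from its size. The sketch below shows the arithmetic; it assumes single-channel (mono) audio, and `audio_duration_seconds` is a hypothetical helper, not an SDK function.

```python
BYTES_PER_SAMPLE = 2  # linear16 stores each sample as a 16-bit (2-byte) integer


def audio_duration_seconds(num_bytes: int, sample_rate: int = 48000, channels: int = 1) -> float:
    """Estimate playback time for a buffer of raw linear16 audio (hypothetical helper).

    Assumes mono audio unless channels says otherwise.
    """
    bytes_per_second = BYTES_PER_SAMPLE * channels * sample_rate
    return num_bytes / bytes_per_second
```

Under these assumptions, one second of audio is 96,000 bytes, which can help when deciding how long to wait for playback instead of relying on a fixed sleep.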
WebSocket Considerations
When using text chunking to minimize latency, keep the following factors in mind:
- User expectations: Consider users' preferences and needs, such as their tolerance for latency, the desired quality of synthesized speech, and their overall satisfaction with the application's performance.
- By default, the WebSocket only supports sending partially chunked audio.
For WebSockets
- Optimize Latency: We handle all of the chunking for you to balance latency against naturalness of speech. To further reduce time-to-first-byte when the text comes from an LLM, try experimenting with different chunk lengths for the first sentence of your response, for example by chunking at the first “,” or “.”.
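One way to experiment with the first chunk is to cut the LLM output at its first comma or period and send that short piece ahead of the rest. The sketch below is a minimal illustration; `split_first_clause` is a hypothetical helper, and the simple punctuation match is an assumption that will also split inside decimals such as "3.5".

```python
import re


def split_first_clause(text: str) -> tuple[str, str]:
    """Return (first_clause, remainder), cutting just after the first ',' or '.'.

    Hypothetical helper for experimentation; it does not special-case
    decimals or abbreviations.
    """
    match = re.search(r"[,.]", text)
    if match is None:
        # No comma or period found: treat the whole text as the first chunk.
        return text.strip(), ""
    cut = match.end()
    return text[:cut].strip(), text[cut:].strip()
```

The first clause could then be sent with `send_text` followed by `flush`, as in the example earlier on this page, with the remainder sent afterwards. Whether a shorter first chunk helps in practice depends on the voice and the text, so it is worth measuring.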