Text Chunking for TTS

Basic techniques for breaking text into chunks to reduce latency in Text-to-Speech applications.

Why Text Chunking Matters

Text chunking significantly reduces perceived latency in TTS applications by allowing audio playback to begin sooner. This is especially important for conversational AI and voice agents where responsiveness is critical.

Instead of waiting for audio for the entire response to be generated, chunking lets you:

  • Begin audio playback much faster
  • Create more responsive voice experiences
  • Maintain natural-sounding speech

Basic Sentence Chunking

The simplest and most effective approach is to split text at sentence boundaries. This preserves natural speech patterns while enabling faster time-to-first-byte:

import re

def chunk_by_sentence(text):
    # Split text at sentence boundaries (periods, question marks, exclamation points)
    # while preserving the punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text)

    # Remove any empty chunks
    return [sentence for sentence in sentences if sentence]

# Example usage
text = "Hello, welcome to Deepgram. This is an example of text chunking. How does it sound?"
chunks = chunk_by_sentence(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")

# Output:
# Chunk 1: Hello, welcome to Deepgram.
# Chunk 2: This is an example of text chunking.
# Chunk 3: How does it sound?

Processing Streaming Text with WebSockets

When working with streaming text (like from an LLM), you need to collect tokens until you have complete sentences. Here's a simplified approach for processing text that arrives in paragraph-sized chunks:

import re
import asyncio
from deepgram import DeepgramClient, SpeakOptions

def chunk_by_sentence(text):
    # Split text at sentence boundaries (periods, question marks, exclamation points)
    # while preserving the punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text)

    # Remove any empty chunks
    return [sentence for sentence in sentences if sentence]

class SimpleTextChunker:
    def __init__(self, tts_client):
        self.queue = []  # Queue to store incoming paragraph chunks
        self.processed_sentences = set()
        self.tts_client = tts_client

    async def process_text_stream(self, paragraph):
        """Queue an incoming paragraph chunk (typically 1-2 sentences) and synthesize its sentences"""

        # Queue the paragraph as it arrives (simulating fast reception)
        self.queue.append(paragraph)
        print(f"Received and queued paragraph: {paragraph}")

        # You could preprocess paragraphs here and split them by more than just sentence boundaries

        # Process the queue
        while self.queue:
            # Get the next paragraph from the queue
            paragraph = self.queue.pop(0)

            # Split the paragraph into sentences using our chunk_by_sentence function
            sentences = chunk_by_sentence(paragraph)

            # Process each sentence
            for sentence in sentences:
                if sentence and sentence not in self.processed_sentences:
                    # Send the sentence to TTS
                    print(f"Sending sentence to TTS: {sentence}")
                    audio_response = await self.tts_client.sync_speak({
                        "text": sentence,
                        "model": "aura-2-thalia-en",
                        "sample_rate": 24000
                    })
                    # In a real app, you would play this audio immediately
                    self.processed_sentences.add(sentence)

# Example usage with an array of paragraph chunks
async def main():
    # This simulates text coming in as paragraph chunks from an LLM
    paragraph_chunks = [
        "Deepgram's TTS API offers low latency. It works great for voice agents.",
        "This approach simulates receiving chunks as paragraphs. Each paragraph may contain one or two sentences.",
        "Try it today! You'll be impressed with the results."
    ]

    # Set up the TTS client
    deepgram = DeepgramClient()
    tts_client = deepgram.speak

    # In a full implementation, you would also set up listeners for TTS events
    # to handle audio data and connection status

    chunker = SimpleTextChunker(tts_client)
    # Process each paragraph sequentially
    for paragraph in paragraph_chunks:
        await chunker.process_text_stream(paragraph)

# Run the example
if __name__ == "__main__":
    asyncio.run(main())
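
If tokens arrive one at a time instead of as paragraph chunks, you can buffer them until a sentence boundary appears and only then hand the sentence to TTS. Below is a minimal sketch of that idea; the SentenceBuffer class and its on_sentence callback are illustrative assumptions, not part of the Deepgram SDK:

import re

class SentenceBuffer:
    """Buffer streamed tokens and emit complete sentences as they form."""

    def __init__(self, on_sentence):
        self.buffer = ""
        self.on_sentence = on_sentence  # called with each complete sentence

    def add_token(self, token):
        self.buffer += token
        # Emit every complete sentence currently in the buffer
        while True:
            match = re.search(r'[.!?]\s', self.buffer)
            if not match:
                break
            sentence = self.buffer[:match.end()].strip()
            self.buffer = self.buffer[match.end():]
            if sentence:
                self.on_sentence(sentence)

    def flush(self):
        # Emit whatever remains when the token stream ends
        if self.buffer.strip():
            self.on_sentence(self.buffer.strip())
        self.buffer = ""

# Example usage with tokens arriving one at a time
buffer = SentenceBuffer(lambda s: print(f"Complete sentence: {s}"))
for token in ["Hello, ", "welcome ", "to ", "Deepgram. ", "How ", "does ", "it ", "sound?"]:
    buffer.add_token(token)
buffer.flush()

# Output:
# Complete sentence: Hello, welcome to Deepgram.
# Complete sentence: How does it sound?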

For complete details on implementing the TTS WebSocket connection, see our guide on Real-Time TTS with WebSockets.

Processing Chunked Text

After creating chunks, you have two main options for processing them:

Sequential Processing

Process each chunk in sequence, prioritizing the first chunk:

async def process_chunks_sequential(chunks, tts_function):
    results = []
    for i, chunk in enumerate(chunks):
        # You might prioritize the first chunk (i == 0) for faster response
        result = await tts_function(chunk)
        results.append(result)
    return results
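
Concurrent Processing

Alternatively, you can start synthesis for several chunks at once and play the results back in their original order. A minimal sketch of this idea (tts_function is the same caller-supplied coroutine as above, and play_audio is a hypothetical playback coroutine):

import asyncio

async def process_chunks_concurrent(chunks, tts_function, play_audio):
    # Start TTS for every chunk at once
    tasks = [asyncio.create_task(tts_function(chunk)) for chunk in chunks]

    results = []
    # Await the tasks in their original order so audio plays back in sequence,
    # while later chunks are already being synthesized in the background
    for task in tasks:
        audio = await task
        await play_audio(audio)  # hypothetical playback coroutine
        results.append(audio)
    return results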

Setting Chunk Size

For most applications, sentences work well as chunks. If you need finer control (a size-capped variant is sketched after this list):

  • Voice assistants: Aim for 50-100 character chunks
  • Call center bots: Use complete sentences (most natural)
  • Long-form content: Larger chunks (200-400 characters) preserve intonation
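
One way to enforce a character budget like those above is to pack whole sentences into chunks up to a maximum length. A minimal sketch that reuses the chunk_by_sentence function from earlier (the max_chars values are only illustrative):

def chunk_by_size(text, max_chars=100):
    # Pack whole sentences into chunks no longer than max_chars;
    # a single sentence longer than max_chars becomes its own chunk
    chunks = []
    current = ""
    for sentence in chunk_by_sentence(text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Example usage
text = "Hello, welcome to Deepgram. This is an example of text chunking. How does it sound?"
for chunk in chunk_by_size(text, max_chars=60):
    print(chunk)

# Output:
# Hello, welcome to Deepgram.
# This is an example of text chunking. How does it sound?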

Other Chunking Strategies

If you need more advanced chunking methods, consider these techniques:

  • Clause-based chunking: Splits long sentences at commas and semicolons (sketched after this list)
  • NLP-based chunking: Uses natural language processing to find semantic boundaries
  • Adaptive chunking: Adjusts chunk size based on content complexity
  • First-chunk optimization: Specially optimizes the first chunk for minimal latency
  • SSML chunking: Handles Speech Synthesis Markup Language tags when chunking
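
As an example of the first technique, clause-based chunking can be layered on top of the chunk_by_sentence function from earlier by splitting overly long sentences at commas and semicolons. A minimal sketch (the max_sentence_chars threshold is an illustrative assumption):

import re

def chunk_by_clause(text, max_sentence_chars=100):
    # Start from sentence chunks, then split any long sentence
    # at commas and semicolons while keeping the punctuation
    chunks = []
    for sentence in chunk_by_sentence(text):
        if len(sentence) <= max_sentence_chars:
            chunks.append(sentence)
        else:
            clauses = re.split(r'(?<=[,;])\s+', sentence)
            chunks.extend(clause for clause in clauses if clause)
    return chunks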

For WebSocket implementation details to stream the chunked audio, see our guide on Real-Time TTS with WebSockets.