Text Chunking for TTS REST Optimization

When dealing with lengthy text inputs, latency can become an issue because processing time increases with the length of the text. One effective strategy for addressing this is text chunking: breaking text inputs down into smaller, manageable chunks before processing, so that audio for the first chunk is available sooner.

Code Examples

The first three code examples below increase in complexity, each building on the previous one, to present strategies for using text chunking to optimize your Text-to-Speech applications. A fourth example combines chunking with streaming audio.

These examples rely on Deepgram’s Python SDK. Learn how to get started with Aura and Deepgram’s SDKs by reading the Aura Text-to-Speech guide.

Chunk By Maximum Number of Characters

This example breaks down lengthy text inputs into chunks determined by a maximum number of characters.

This straightforward approach splits only at word boundaries and does not take into account the structure of the text, such as clause and sentence boundaries, so a chunk may end mid-sentence. For some types of text, this is acceptable and does not noticeably affect the quality of speech.

SDK - Python

from deepgram import (
    DeepgramClient,
    SpeakOptions,
)

# Define the maximum chunk size (in characters) for text chunking
MAX_CHUNK_SIZE = 200

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

def chunk_text(text, chunk_size):
    # Greedily pack whole words into chunks of at most chunk_size characters.
    # A single word longer than chunk_size becomes its own (oversized) chunk.
    chunks = []
    current_chunk = ''
    for word in text.split():
        # Count the joining space when the chunk already has content
        candidate = f"{current_chunk} {word}" if current_chunk else word
        if len(candidate) <= chunk_size:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = word
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

def main():
    try:
        # Create a Deepgram client using the API key (replace the placeholder)
        deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")

        # Choose a model to use for synthesis
        options = SpeakOptions(
            model="aura-2-thalia-en",
        )

        # Chunk the text into smaller parts
        text_chunks = chunk_text(input_text, MAX_CHUNK_SIZE)

        # Synthesize audio for each chunk and save it to its own file
        for i, chunk in enumerate(text_chunks):
            print(f"\nProcessing chunk {i + 1}...{chunk}\n")
            filename = f"chunk_{i + 1}.mp3"

            speak_options = {"text": chunk}

            response = deepgram.speak.v("1").save(filename, speak_options, options)
            print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")

if __name__ == "__main__":
    main()
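
For comparison, Python's standard library offers textwrap.wrap, which performs a similar word-boundary split by maximum width. The sketch below is an alternative to the hand-written chunk_text function and is not part of the Deepgram SDK; the chunk_by_width name and the sample text are illustrative assumptions.

```python
import textwrap

def chunk_by_width(text, max_chars):
    """Split text into chunks of at most max_chars characters at word boundaries."""
    return textwrap.wrap(
        text,
        width=max_chars,
        break_long_words=False,  # never cut a word in half
        break_on_hyphens=False,  # keep hyphenated words together
    )

# Illustrative sample text (not the story used in the examples above)
sample = "Our story begins in a peaceful woodland kingdom. " * 5
chunks = chunk_by_width(sample, 60)
for chunk in chunks:
    print(len(chunk), chunk)
```

With break_long_words=False, a single word longer than max_chars is kept whole and produces an oversized chunk, which matches the behavior of the chunk_text function above.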

Chunk By Clauses and Sentence Boundaries

In this example, the aim is to preserve naturalness of speech by chunking the text based on clause and sentence boundaries. When people speak, they tend to pause at the end of a clause or a sentence, so this strategy is helpful when working with texts that contain complex sentences in a narrative style.

The regular expression in the code tells the program to break the text into chunks based on the following:

The punctuation marks period ., question mark ?, exclamation mark !, and semicolon ;
A comma , followed by a single space and one of the coordinating conjunctions and, but, or, nor, for, yet, so

These are two grammatical rules for identifying clauses, but you may decide to include more.
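
As one possible extension, the sketch below adds two more boundary rules: a colon, and a dash surrounded by spaces. The EXTENDED_BOUNDARIES name, the extra rules, and the sample sentence are assumptions chosen for illustration; which rules suit your text is a judgment call.

```python
import re

# The two original rules plus two illustrative additions: a colon, and a dash
# with a space on each side (as in "Splash - a silver-scaled flying fish").
EXTENDED_BOUNDARIES = r'[.?!;:]|(?<= )-(?= )|, (?:and|but|or|nor|for|yet|so)\b'

def chunk_text_by_clause(text, pattern=EXTENDED_BOUNDARIES):
    chunks = []
    start = 0
    for match in re.finditer(pattern, text):
        # Cut just after the boundary character so it stays with its clause
        end = match.start() + 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    # Keep any text after the last boundary
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks

sample = "He packed three things: a map, a rope, and a lamp. Then he left - quietly."
for chunk in chunk_text_by_clause(sample):
    print(repr(chunk))
```

The trailing \b keeps the conjunction rule from firing on words that merely start with a conjunction, such as "forest" or "sound".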

SDK - Python

import re
from deepgram import (
    DeepgramClient,
    SpeakOptions,
)

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

# Sentence-ending punctuation, semicolons, or a comma followed by a
# coordinating conjunction. The trailing \b stops ", for" from matching
# the start of a longer word such as "forest".
CLAUSE_BOUNDARIES = r'\.|\?|!|;|, (?:and|but|or|nor|for|yet|so)\b'

def chunk_text_by_clause(text):
    # Find clause boundaries using regular expression
    clause_boundaries = re.finditer(CLAUSE_BOUNDARIES, text)
    boundaries_indices = [boundary.start() for boundary in clause_boundaries]

    chunks = []
    start = 0
    for boundary_index in boundaries_indices:
        # Keep the punctuation mark at the end of its chunk
        chunk = text[start:boundary_index + 1].strip()
        if chunk:
            chunks.append(chunk)
        start = boundary_index + 1
    # Append the remaining part of the text, skipping it if it is empty
    remaining_text = text[start:].strip()
    if remaining_text:
        chunks.append(remaining_text)

    return chunks

def main():
    try:
        # Create a Deepgram client using the API key (replace the placeholder)
        deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")

        # Choose a model to use for synthesis
        options = SpeakOptions(
            model="aura-2-thalia-en",
        )

        # Chunk the text into smaller parts
        text_chunks = chunk_text_by_clause(input_text)

        # Synthesize audio for each chunk and save it to its own file
        for i, chunk in enumerate(text_chunks):
            print(f"\nProcessing chunk {i + 1}...{chunk}\n")
            filename = f"chunk_{i + 1}.mp3"

            speak_options = {"text": chunk}

            response = deepgram.speak.v("1").save(filename, speak_options, options)
            print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")

if __name__ == "__main__":
    main()

Dynamic Chunking

The goal of dynamic chunking is to adjust the chunk sizes dynamically based on various factors, which may include adaptive rules or algorithms to determine how to split the text into chunks.

This next example implements a more flexible chunking strategy that adjusts chunk sizes based on the length and structure of the input text. It retains the rule from the previous example, chunking on clause and sentence boundaries, and then checks the character count of each chunk. If the count exceeds a maximum character length, the chunk is split further into subchunks at commas, where a subchunk must contain at least three words.

SDK - Python

import re
from deepgram import (
    DeepgramClient,
    SpeakOptions,
)

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

CLAUSE_BOUNDARIES = r'\.|\?|!|;|, (?:and|but|or|nor|for|yet|so)\b'
MAX_CHUNK_LENGTH = 100
MIN_SUBCHUNK_WORDS = 3

def split_long_chunk(chunk):
    # Split an over-long chunk after each comma and pack the pieces into
    # subchunks of up to MAX_CHUNK_LENGTH characters. A subchunk with fewer
    # than MIN_SUBCHUNK_WORDS words is merged with its neighbor rather than
    # dropped, so no text is lost. A chunk without commas is returned whole.
    subchunks = []
    temp_chunk = ''
    # Zero-width split (Python 3.7+): each piece keeps its trailing comma
    for piece in re.split(r'(?<=,)', chunk):
        fits = len(temp_chunk) + len(piece) <= MAX_CHUNK_LENGTH
        long_enough = len(temp_chunk.split()) >= MIN_SUBCHUNK_WORDS
        if fits or not long_enough:
            temp_chunk += piece
        else:
            subchunks.append(temp_chunk.strip())
            temp_chunk = piece
    tail = temp_chunk.strip()
    if tail:
        if subchunks and len(tail.split()) < MIN_SUBCHUNK_WORDS:
            # Too short to stand alone: attach it to the previous subchunk
            subchunks[-1] += ' ' + tail
        else:
            subchunks.append(tail)
    return subchunks

def chunk_text_dynamically(text):
    # Find clause boundaries using regular expression
    clause_boundaries = re.finditer(CLAUSE_BOUNDARIES, text)
    boundaries_indices = [boundary.start() for boundary in clause_boundaries]

    chunks = []
    start = 0
    # Add chunks until the last clause boundary
    for boundary_index in boundaries_indices:
        chunk = text[start:boundary_index + 1].strip()
        if len(chunk) <= MAX_CHUNK_LENGTH:
            if chunk:
                chunks.append(chunk)
        else:
            chunks.extend(split_long_chunk(chunk))
        start = boundary_index + 1

    # Handle any text after the last boundary, splitting it at commas if needed
    remaining_text = text[start:].strip()
    if remaining_text:
        if len(remaining_text) <= MAX_CHUNK_LENGTH:
            chunks.append(remaining_text)
        else:
            chunks.extend(split_long_chunk(remaining_text))

    return chunks

def main():
    try:
        # Create a Deepgram client using the API key (replace the placeholder)
        deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")

        # Choose a model to use for synthesis
        options = SpeakOptions(
            model="aura-2-thalia-en",
        )

        # Chunk the text into smaller parts
        text_chunks = chunk_text_dynamically(input_text)

        # Synthesize audio for each chunk and save it to its own file
        for i, chunk in enumerate(text_chunks):
            print(f"\nProcessing chunk {i + 1}...{chunk}\n")
            filename = f"chunk_{i + 1}.mp3"

            speak_options = {"text": chunk}

            response = deepgram.speak.v("1").save(filename, speak_options, options)
            print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")

if __name__ == "__main__":
    main()

Chunking + Streaming Audio

Instead of turning each chunk into an audio file such as an MP3 file, you might prefer to stream the audio as it comes in. It is possible to start streaming your text-to-speech audio as soon as the first byte arrives.

In this example, the text is chunked by sentence boundaries. Each chunk is then sent to Deepgram to be processed into audio, and the audio for a chunk is played as soon as it arrives, without first being written to a file. The chunks are played consecutively, in the order in which they appear in the text.

SDK - Python

import re
from deepgram import (
    DeepgramClient,
    SpeakOptions,
)
# pydub relies on ffmpeg being installed to decode MP3 audio
from pydub import AudioSegment
from pydub.playback import play

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

def chunk_text_by_sentence(text):
    # Split on whitespace that follows sentence-ending punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Drop empty strings so that no empty request is sent
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def synthesize_audio(deepgram, options, text):
    speak_options = {"text": text}
    # Synthesize audio and stream the response
    response = deepgram.speak.v("1").stream(speak_options, options)
    # Get the in-memory audio buffer from the response
    audio_buffer = response.stream
    audio_buffer.seek(0)
    # Load audio from buffer using pydub
    audio = AudioSegment.from_mp3(audio_buffer)

    return audio

def main():
    # Create a Deepgram client using the API key (replace the placeholder)
    deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")
    # Choose a model to use for synthesis
    options = SpeakOptions(
        model="aura-2-thalia-en",
    )

    # Chunk the text into smaller parts
    chunks = chunk_text_by_sentence(input_text)

    # Synthesize each chunk into audio and play the audio
    for chunk_text in chunks:
        audio = synthesize_audio(deepgram, options, chunk_text)
        play(audio)

if __name__ == "__main__":
    main()
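
Because each chunk is requested only after the previous one has finished playing, there may be an audible gap between chunks. One way to reduce it, sketched below under assumptions, is to request the next chunk in a background thread while the current one plays. Here synthesize and play are hypothetical callables standing in for a function like synthesize_audio and for pydub's play; the stand-ins only simulate work so the sketch runs without an API key or audio device.

```python
from concurrent.futures import ThreadPoolExecutor

def play_with_prefetch(chunks, synthesize, play):
    """Play chunks in order, synthesizing the next chunk during playback."""
    if not chunks:
        return
    with ThreadPoolExecutor(max_workers=1) as executor:
        # Start synthesizing the first chunk right away
        pending = executor.submit(synthesize, chunks[0])
        for next_chunk in list(chunks[1:]) + [None]:
            audio = pending.result()  # wait until the current chunk is ready
            if next_chunk is not None:
                # Request the next chunk before playback blocks this thread
                pending = executor.submit(synthesize, next_chunk)
            play(audio)

# Hypothetical stand-ins for synthesis and playback
played = []

def fake_synthesize(text):
    return f"<audio for {text!r}>"

play_with_prefetch(["One.", "Two.", "Three."], fake_synthesize, played.append)
print(played)
```

Only one request is in flight at a time, so chunks are still played strictly in text order.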

Read more about streaming the TTS audio output in the guide Streaming Audio Outputs.

Considerations

When using text chunking as a strategy to minimize latency, some factors to keep in mind are the following:

Preserving naturalness of speech - Maintain proper pronunciation, intonation, and rhythm to enhance the user experience.
Contextual understanding - Analyze the structure and meaning of the text to identify natural breakpoints, such as sentence or clause boundaries, for dividing the text.
Dynamic chunking - Implement a flexible chunking strategy that adjusts chunk sizes dynamically based on the length and structure of the input text.
User expectations - Consider the preferences and needs of users, such as their tolerance for latency, the desired quality of synthesized speech, and their overall satisfaction with the application’s performance.
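
To ground decisions about latency tolerance, it can help to measure how long a listener waits before the first audio is ready. The following is a minimal sketch, assuming a hypothetical synthesize callable that wraps a text-to-speech request. The fake_synthesize stand-in sleeps in proportion to text length purely for demonstration and says nothing about real Deepgram latency.

```python
import time

def time_to_first_audio(chunks, synthesize):
    """Return (seconds until the first chunk is ready, seconds for all chunks)."""
    start = time.perf_counter()
    first_ready = None
    for chunk in chunks:
        synthesize(chunk)
        if first_ready is None:
            first_ready = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_ready, total

def fake_synthesize(text):
    # Hypothetical stand-in: pretend longer text takes longer to process
    time.sleep(0.001 * len(text))

text = "First sentence here. Second sentence follows. Third one ends it."
chunks = ["First sentence here.", "Second sentence follows.", "Third one ends it."]

first, total = time_to_first_audio(chunks, fake_synthesize)
print(f"chunked: first audio after {first:.3f}s, all audio after {total:.3f}s")

whole_first, _ = time_to_first_audio([text], fake_synthesize)
print(f"unchunked: first audio after {whole_first:.3f}s")
```

Swapping fake_synthesize for a function that calls your TTS client lets you compare chunking strategies on your own text.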