Text Chunking for TTS REST Optimization
When dealing with lengthy text inputs, latency can become an issue because processing time increases with the length of the text. One effective strategy for addressing this challenge is text chunking: breaking a long text input into smaller, manageable chunks before processing.
Code Examples
The following three code examples build on one another in complexity to demonstrate strategies for using text chunking to optimize your Text-to-Speech applications.
These examples rely on Deepgram’s Python SDK. Learn how to get started with Aura and Deepgram’s SDKs by reading the Aura Text-to-Speech guide.
Chunk By Maximum Number of Characters
This example breaks down lengthy text inputs into chunks determined by a maximum number of characters.
This is a straightforward approach that does not take characteristics of the text structure, such as clause and sentence boundaries, into consideration. For some types of text this is acceptable and will not have a negative effect on the quality of the speech.
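Here is a minimal sketch of this strategy. To keep it self-contained, it calls the TTS REST endpoint directly with the `requests` library rather than the Python SDK; the `aura-asteria-en` model, the character limit, the placeholder API key, and the `.mp3` output extension are illustrative assumptions you should adapt to your own setup.

```python
import requests

# Assumptions for illustration: endpoint/model choice and API key placeholder.
DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"


def chunk_by_max_chars(text: str, max_chars: int = 200) -> list[str]:
    """Split text into fixed-size chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def synthesize_chunk(chunk: str, filename: str) -> None:
    """Send one chunk to the TTS REST endpoint and save the returned audio."""
    response = requests.post(
        DEEPGRAM_URL,
        headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
        json={"text": chunk},
    )
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)


if __name__ == "__main__":
    input_text = (
        "Text-to-speech latency grows with the length of the input, so long "
        "passages benefit from being split into several smaller requests."
    )
    for i, chunk in enumerate(chunk_by_max_chars(input_text, max_chars=80)):
        # The .mp3 extension assumes the endpoint's default audio container.
        synthesize_chunk(chunk, f"chunk_{i}.mp3")
```

Because the split is purely positional, a chunk boundary can fall in the middle of a word or phrase; the next example avoids that by splitting on grammatical boundaries instead.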
Chunk By Clauses and Sentence Boundaries
In this example, the aim is to preserve naturalness of speech by chunking the text based on clause and sentence boundaries. When people speak, they tend to pause at the end of a clause or a sentence, so this strategy is helpful when working with texts that contain complex sentences in a narrative style.
The regular expression in the code tells the program to break the text into chunks based on the following:
- The punctuation marks period `.`, question mark `?`, exclamation mark `!`, or semicolon `;`
- A comma `,` followed by a single whitespace and a coordinating conjunction: `and`, `but`, `or`, `nor`, `for`, `yet`, `so`
These are two grammatical rules for identifying clauses, but you may decide to include more.
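The sketch below shows one way to express these two rules in Python. The exact regular expression, the `chunk_by_clause` helper, and the sample text are illustrative assumptions; each resulting chunk would then be sent to the API as in the previous example.

```python
import re

# Coordinating conjunctions used to detect a clause boundary after a comma.
CONJUNCTIONS = r"and|but|or|nor|for|yet|so"

# Split after . ? ! ; or at a comma followed by a single space
# and one of the coordinating conjunctions above.
CLAUSE_PATTERN = re.compile(rf"(?<=[.?!;])\s+|, (?=(?:{CONJUNCTIONS})\b)")


def chunk_by_clause(text: str) -> list[str]:
    """Split text on clause and sentence boundaries."""
    return [chunk.strip() for chunk in CLAUSE_PATTERN.split(text) if chunk.strip()]


if __name__ == "__main__":
    sample = (
        "The kingdom slept under a blanket of snow, and the river froze solid. "
        "Would spring ever return? The old miller said it would; nobody believed "
        "him, but they hoped all the same."
    )
    for i, chunk in enumerate(chunk_by_clause(sample), start=1):
        print(f"{i}: {chunk}")
```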
Dynamic Chunking
The goal of dynamic chunking is to adjust the chunk sizes dynamically based on various factors, which may include adaptive rules or algorithms to determine how to split the text into chunks.
This next example implements a more flexible chunking strategy that adjusts chunk sizes dynamically based on the length and structure of the input text. It retains the rule from the previous example, chunking on clause and sentence boundaries, but then checks the character count of each chunk. If the count exceeds a maximum character length, the chunk is split further into subchunks, where subchunks are delimited by commas and must be at least three characters long.
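A rough sketch of this strategy is shown below, reusing the clause-boundary pattern from the previous example. The 120-character default threshold and the helper names are assumptions; tune the threshold for your own latency and quality targets.

```python
import re

CONJUNCTIONS = r"and|but|or|nor|for|yet|so"
CLAUSE_PATTERN = re.compile(rf"(?<=[.?!;])\s+|, (?=(?:{CONJUNCTIONS})\b)")

MAX_CHARS = 120          # assumed threshold; tune for your latency target
MIN_SUBCHUNK_CHARS = 3   # subchunks shorter than this are dropped


def chunk_dynamically(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Chunk on clause/sentence boundaries, then split oversized chunks at commas."""
    chunks: list[str] = []
    for clause in CLAUSE_PATTERN.split(text):
        clause = clause.strip()
        if not clause:
            continue
        if len(clause) <= max_chars:
            chunks.append(clause)
            continue
        # The clause is too long: fall back to comma-delimited subchunks,
        # keeping only those of at least MIN_SUBCHUNK_CHARS characters.
        subchunks = (part.strip() for part in clause.split(","))
        chunks.extend(part for part in subchunks if len(part) >= MIN_SUBCHUNK_CHARS)
    return chunks


if __name__ == "__main__":
    long_text = (
        "Deep in the valley, past the orchards, the mill wheel turned day and night, "
        "grinding grain for every village within a week's ride; travelers said you "
        "could hear it from the mountain pass."
    )
    for i, chunk in enumerate(chunk_dynamically(long_text, max_chars=80), start=1):
        print(f"{i} ({len(chunk)} chars): {chunk}")
```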
Chunking + Streaming Audio
Instead of turning each chunk into an audio file such as an MP3 file, you might prefer to stream the audio as it comes in. It is possible to start streaming your text-to-speech audio as soon as the first byte arrives.
In this example, the text is chunked by sentence boundaries. Each chunk is then sent to Deepgram to be processed into audio, and as soon as the first byte of audio arrives back, playback begins. The audio streams are played back consecutively, in the same order as the text chunks.
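The sketch below illustrates the idea, assuming the `requests` library for streaming the REST response and `ffplay` (part of FFmpeg) for playback. The model name, the placeholder API key, the 1024-byte read size, and the sentence-splitting regex are assumptions rather than the guide's exact implementation.

```python
import re
import shutil
import subprocess

import requests

# Assumptions for illustration: endpoint/model choice and API key placeholder.
DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"


def chunk_by_sentence(text: str) -> list[str]:
    """Split text into chunks at sentence boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def stream_chunk(chunk: str) -> None:
    """Request TTS audio for one chunk and pipe the bytes to a player as they arrive."""
    player = subprocess.Popen(
        ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
        stdin=subprocess.PIPE,
    )
    with requests.post(
        DEEPGRAM_URL,
        headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
        json={"text": chunk},
        stream=True,  # begin reading as soon as the first bytes arrive
    ) as response:
        response.raise_for_status()
        for data in response.iter_content(chunk_size=1024):
            player.stdin.write(data)
    player.stdin.close()
    player.wait()  # finish playing this chunk before starting the next one


if __name__ == "__main__":
    if shutil.which("ffplay") is None:
        raise SystemExit("This sketch uses ffplay (part of FFmpeg) for playback.")
    text = (
        "Latency grows with the length of the input text. Splitting the text into "
        "sentences lets playback begin while later sentences are still being synthesized."
    )
    for sentence in chunk_by_sentence(text):
        stream_chunk(sentence)
```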
Read more about streaming the TTS audio output in the guide Streaming Audio Outputs.
Considerations
When using text chunking as a strategy to minimize latency, some factors to keep in mind are the following:
- preserving naturalness of speech - Maintain proper pronunciation, intonation, and rhythm to enhance the user experience.
- contextual understanding - Analyze the structure and meaning of the text to identify natural breakpoints, such as sentence or clause boundaries, for dividing the text.
- dynamic chunking - Implement a flexible chunking strategy that adjusts chunk sizes dynamically based on the length and structure of the input text.
- user expectations - Consider the preferences and needs of users, such as their tolerance for latency, the desired quality of synthesized speech, and their overall satisfaction with the application’s performance.