Text Chunking for TTS Optimization

Text chunking is the process of breaking down text inputs into smaller, manageable chunks before processing.

When dealing with lengthy text inputs, latency can become an issue as the processing time increases with the length of the text. To address this challenge, one effective strategy is text chunking. Text chunking is the process of breaking down text inputs into smaller, manageable chunks before processing.

Code Examples

The following three code examples build upon each previous example in complexity to present strategies for using text chunking to optimize your Text-to-Speech applications.

📘

These examples rely on Deepgram's Python SDK. Learn how to get started with Aura and Deepgram's SDKs by reading the Aura Text-to-Speech guide.

Chunk By Maximum Number of Characters

This example breaks down lengthy text inputs into chunks determined by a maximum number of characters.

This is a straightforward example which does not take into consideration characteristics of the text structure, such as clause and sentence boundaries. For some types of text, this is acceptable and will not have a negative effect on the quality of speech.

from deepgram import (
    DeepgramClient,
    SpeakOptions,
)


# Define the maximum chunk size (in characters) for text chunking
MAX_CHUNK_SIZE = 200

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

def chunk_text(text, chunk_size):
    chunks = []
    words = text.split()
    current_chunk = ''
    for word in words:
        if len(current_chunk) + len(word) <= chunk_size:
            current_chunk += ' ' + word
        else:
            chunks.append(current_chunk.strip())
            current_chunk = word
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks


def main():
    try:
        # Create a Deepgram client using the API key
        deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")

        # Choose a model to use for synthesis
        options = SpeakOptions(
            model="aura-helios-en",
        )

        # Chunk the text into smaller parts
        text_chunks = chunk_text(input_text, MAX_CHUNK_SIZE)
        
        # Synthesize audio for each chunk
        for i, chunk in enumerate(text_chunks):
            print(f"\nProcessing chunk {i + 1}...{chunk}\n")
            filename = f"chunk_{i + 1}.mp3"

            SPEAK_OPTIONS = {"text": chunk}
        
            response = deepgram.speak.v("1").save(filename, SPEAK_OPTIONS, options)
            print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")

if __name__ == "__main__":
    main()

Chunk By Clauses and Sentence Boundaries

In this example, the aim is to preserve naturalness of speech by chunking the text based on clause and sentence boundaries. When people speak, they tend to pause at the end of a clause or a sentence, so this strategy is helpful when working with texts that contain complex sentences in a narrative style.

The regular expression in the code tells the program to break the text into chunks based on the following:

  • The punctuation marks of period ., question mark ?, explanation mark !, or semicolon ;
  • A comma , + single whitespace + coordinating conjunctions and, but, or, nor, for, yet, so

These are two grammatical rules for identifying clauses, but you may decide to include more.

import re
from deepgram import (
    DeepgramClient,
    SpeakOptions,
)

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

CLAUSE_BOUNDARIES = r'\.|\?|!|;|, (and|but|or|nor|for|yet|so)'

def chunk_text_by_clause(text):
    # Find clause boundaries using regular expression
    clause_boundaries = re.finditer(CLAUSE_BOUNDARIES, text)
    boundaries_indices = [boundary.start() for boundary in clause_boundaries]

    chunks = []
    start = 0
    for boundary_index in boundaries_indices:
        chunks.append(text[start:boundary_index + 1].strip())
        start = boundary_index + 1
    # Append the remaining part of the text
    chunks.append(text[start:].strip())

    return chunks


def main():
    try:
        # Create a Deepgram client using the API key
        deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")

        # Choose a model to use for synthesis
        options = SpeakOptions(
            model="aura-helios-en",
        )

        # Chunk the text into smaller parts
        text_chunks = chunk_text_by_clause(input_text)
        
        # Synthesize audio for each chunk
        for i, chunk in enumerate(text_chunks):
            print(f"\nProcessing chunk {i + 1}...{chunk}\n")
            filename = f"chunk_{i + 1}.mp3"

            SPEAK_OPTIONS = {"text": chunk}
        
            response = deepgram.speak.v("1").save(filename, SPEAK_OPTIONS, options)
            print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")

if __name__ == "__main__":
    main()

Dynamic Chunking

The goal of dynamic chunking is to adjust the chunk sizes dynamically based on various factors, which may include adaptive rules or algorithms to determine how to split the text into chunks.

This next example implements a more flexible chunking strategy that adjusts chunk sizes dynamically based on the length and structure of the input text. It retains the rule from the previous example - to chunk based on clause/sentence boundaries - but it then looks at each chunk and determines the character count of the chunk. If the count exceeds a maximum character length, it chunks further into subchunks, where subchunks are defined by a comma but cannot be less than three characters.

import re
from deepgram import (
    DeepgramClient,
    SpeakOptions,
)

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

CLAUSE_BOUNDARIES = r'\.|\?|!|;|, (and|but|or|nor|for|yet|so)'
MAX_CHUNK_LENGTH = 100

def chunk_text_dynamically(text):
    # Find clause boundaries using regular expression
    clause_boundaries = re.finditer(CLAUSE_BOUNDARIES, text)
    boundaries_indices = [boundary.start() for boundary in clause_boundaries]

    chunks = []
    start = 0
    # Add chunks until the last clause boundary
    for boundary_index in boundaries_indices:
        chunk = text[start:boundary_index + 1].strip()
        if len(chunk) <= MAX_CHUNK_LENGTH:
            chunks.append(chunk)
        else:
            # Split by comma if it doesn't create subchunks less than three words
            subchunks = chunk.split(',')
            temp_chunk = ''
            for subchunk in subchunks:
                if len(temp_chunk) + len(subchunk) <= MAX_CHUNK_LENGTH:
                    temp_chunk += subchunk + ','
                else:
                    if len(temp_chunk.split()) >= 3:
                        chunks.append(temp_chunk.strip())
                    temp_chunk = subchunk + ','
            if temp_chunk:
                if len(temp_chunk.split()) >= 3:
                    chunks.append(temp_chunk.strip())
        start = boundary_index + 1

    # Split remaining text into subchunks if needed
    remaining_text = text[start:].strip()
    if remaining_text:
        remaining_subchunks = [remaining_text[i:i+MAX_CHUNK_LENGTH] for i in range(0, len(remaining_text), MAX_CHUNK_LENGTH)]
        chunks.extend(remaining_subchunks)

    return chunks


def main():
    try:
        # Create a Deepgram client using the API key
        deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")

        # Choose a model to use for synthesis
        options = SpeakOptions(
            model="aura-helios-en",
        )

        # Chunk the text into smaller parts
        text_chunks = chunk_text_dynamically(input_text)
        
        # Synthesize audio for each chunk
        for i, chunk in enumerate(text_chunks):
            print(f"\nProcessing chunk {i + 1}...{chunk}\n")
            filename = f"chunk_{i + 1}.mp3"

            SPEAK_OPTIONS = {"text": chunk}
        
            response = deepgram.speak.v("1").save(filename, SPEAK_OPTIONS, options)
            # print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")

if __name__ == "__main__":
    main()

Chunking + Streaming Audio

Instead of turning each chunk into an audio file such as an MP3 file, you might prefer to stream the audio as it comes in. It is possible to start streaming your text-to-speech audio as soon as the first byte arrives.

In this example, the text is chunked by sentence boundaries. Then each chunk is sent to Deepgram to be processed into audio, but when the first byte of audio arrives back to you, it is played immediately in a stream. Each audio stream is played consecutively in the order that the text splits them into chunks.

import re
from deepgram import (
    DeepgramClient,
    SpeakOptions,
)
from pydub import AudioSegment
from pydub.playback import play

input_text = "Our story begins in a peaceful woodland kingdom where a lively squirrel named Frolic made his abode high up within a cedar tree's embrace. He was not a usual woodland creature, for he was blessed with an insatiable curiosity and a heart for adventure. Nearby, a glistening river snaked through the landscape, home to a wonder named Splash - a silver-scaled flying fish whose ability to break free from his water-haven intrigued the woodland onlookers. This magical world moved on a rhythm of its own until an unforeseen circumstance brought Frolic and Splash together. One radiant morning, while Frolic was on his regular excursion, and Splash was making his aerial tours, an unpredictable wave playfully tossed and misplaced Splash onto the riverbank. Despite his initial astonishment, Frolic hurriedly and kindly assisted his new friend back to his watery abode. Touched by Frolic's compassion, Splash expressed his gratitude by inviting his friend to share his world. As Splash perched on Frolic's back, he tasted of the forest's bounty, felt the sun’s rays filter through the colors of the trees, experienced the conversations amidst the woods, and while at it, taught the woodland how to blur the lines between earth and water."

def chunk_text_by_sentence(text):
    # Find sentence boundaries using regular expression
    sentence_boundaries = re.finditer(r'(?<=[.!?])\s+', text)
    boundaries_indices = [boundary.start() for boundary in sentence_boundaries]

    chunks = []
    start = 0
    # Split the text into chunks based on sentence boundaries
    for boundary_index in boundaries_indices:
        chunks.append(text[start:boundary_index + 1].strip())
        start = boundary_index + 1
    chunks.append(text[start:].strip())

    return chunks

def synthesize_audio(text):
    # Create a Deepgram client using the API key
    deepgram = DeepgramClient(api_key="DEEPGRAM_API_KEY")
    # Choose a model to use for synthesis
    options = SpeakOptions(
            model="aura-helios-en",
        )
    speak_options = {"text": text}
    # Synthesize audio and stream the response
    response =  deepgram.speak.v("1").stream(speak_options, options)
    # Get the audio stream from the response
    audio_buffer = response.stream
    audio_buffer.seek(0)
    # Load audio from buffer using pydub
    audio = AudioSegment.from_mp3(audio_buffer)

    return audio

def main():
    # Chunk the text into smaller parts
    chunks = chunk_text_by_sentence(input_text)

    # Synthesize each chunk into audio and play the audio
    for chunk_text in chunks:
        audio = synthesize_audio(chunk_text)
        play(audio)

if __name__ == "__main__":
    main()

📘

Read more about about streaming the TTS audio output in the guide Streaming Audio Outputs.

Considerations

When using text chunking as a strategy to minimize latency, some factors to keep in mind are the following:

  • preserving naturalness of speech - Maintain proper pronunciation, intonation, and rhythm to enhance the user experience.
  • contextual understanding - Analyze the structure and meaning of the text to identify natural breakpoints, such as sentence or clause boundaries, for dividing the text.
  • dynamic chunking - Implement a flexible chunking strategy that adjusts chunk sizes dynamically based on the length and structure of the input text.
  • user expectations - Consider the preferences and needs of users, such as their tolerance for latency, the desired quality of synthesized speech, and their overall satisfaction with the application's performance.