Text To Speech

An overview of the Deepgram Python SDK and Deepgram text-to-speech.

The Deepgram Speak Clients allow you to generate human-like audio from text.

This SDK supports both the Threaded and Async/Await Clients, as described in the Threaded and Async IO Task Support section. Each code example is shown in a Threaded and an Async version, which use speak and asyncspeak, respectively. The difference between the two is subtle: the Async version awaits the call and runs inside an asyncio event loop.

Make a Deepgram Text-to-Speech Request

The Speak Clients create audio from the text you provide, a process also known as text-to-speech.

Two functions are provided to convert text to speech:

  • stream() - This takes the resulting audio and holds it in memory via io.BytesIO.
  • save() or file() - This saves the audio stream to a user-provided filename.

Threaded:

from deepgram import DeepgramClient, SpeakOptions

SPEAK_OPTIONS = {"text": "Hello, how can I help you today?"}
filename = "output.mp3"

def main():
    try:
        # STEP 1: Create a Deepgram client.
        # By default, the DEEPGRAM_API_KEY environment variable will be used for the API Key
        deepgram = DeepgramClient()

        # STEP 2: Configure the options (such as model choice, audio configuration, etc.)
        options = SpeakOptions(
            model="aura-asteria-en",
        )

        # STEP 3: Call the save method on the speak property
        response = deepgram.speak.v("1").save(filename, SPEAK_OPTIONS, options)
        print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")

if __name__ == "__main__":
    main()

Async:

import asyncio
from deepgram import DeepgramClient, SpeakOptions

SPEAK_OPTIONS = {"text": "Hello, how can I help you today?"}
filename = "output.mp3"


async def main():
    try:
        # STEP 1: Create a Deepgram client.
        # By default, the DEEPGRAM_API_KEY environment variable will be used for the API Key
        deepgram = DeepgramClient()

        # STEP 2: Configure the options (such as model choice, audio configuration, etc.)
        options = SpeakOptions(
            model="aura-asteria-en",
        )

        # STEP 3: Call the save method on the speak property
        response = await deepgram.asyncspeak.v("1").save(filename, SPEAK_OPTIONS, options)
        print(response.to_json(indent=4))

    except Exception as e:
        print(f"Exception: {e}")


if __name__ == "__main__":
    asyncio.run(main())
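
The examples above use save(). To make the contrast with stream() concrete, here is a minimal sketch of what holding audio in memory via io.BytesIO looks like, and how such a buffer could later be written to disk. The byte string is a placeholder rather than real audio, and persist_buffer is a hypothetical helper, not part of the SDK.

```python
import io
import os
import tempfile


def persist_buffer(buffer: io.BytesIO, path: str) -> int:
    """Write the full contents of an in-memory buffer to `path`.

    Returns the number of bytes written.
    """
    with open(path, "wb") as f:
        return f.write(buffer.getvalue())


# Placeholder bytes standing in for audio returned by a text-to-speech request.
fake_audio = io.BytesIO(b"RIFF....WAVEfmt ")

# Write to the system temp directory so the sketch leaves the working directory untouched.
out_path = os.path.join(tempfile.gettempdir(), "tts_sketch_output.bin")
written = persist_buffer(fake_audio, out_path)
print(f"Wrote {written} bytes to {out_path}")
```

Keeping the audio in memory is convenient when it is handed straight to another component (a player or an HTTP response); saving to a file is simpler when the audio is needed later.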

Audio Output Streaming

Deepgram's TTS API allows you to start playing the audio once the first byte is received. This section provides examples to help you stream the audio output efficiently.
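
The idea of acting on the first byte can be sketched independently of any network call. The example below assumes the audio arrives as an iterable of byte chunks, for instance from an HTTP client that supports streamed responses. The names fake_audio_chunks and consume_stream are hypothetical and used only for illustration, and playback is simulated by writing to an in-memory sink.

```python
import io
import time
from typing import BinaryIO, Iterable, Iterator, Optional, Tuple


def fake_audio_chunks(total: int = 4096, chunk_size: int = 1024) -> Iterator[bytes]:
    """Stand-in for a streamed response body: yields chunks of zero bytes."""
    for offset in range(0, total, chunk_size):
        yield b"\x00" * min(chunk_size, total - offset)


def consume_stream(chunks: Iterable[bytes], sink: BinaryIO) -> Tuple[int, Optional[float]]:
    """Forward each chunk to `sink` as soon as it arrives.

    Returns (total bytes forwarded, seconds until the first chunk).
    The second value is None if no data arrived.
    """
    start = time.monotonic()
    first_chunk_delay = None
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        if first_chunk_delay is None:
            first_chunk_delay = time.monotonic() - start
        sink.write(chunk)  # a real application would hand this to an audio player
        total += len(chunk)
    return total, first_chunk_delay


sink = io.BytesIO()
total_bytes, first_delay = consume_stream(fake_audio_chunks(), sink)
print(f"Received {total_bytes} bytes; first chunk after {first_delay:.6f}s")
```

The point of the design is that the sink starts receiving data before the full response exists, so perceived latency depends on the first chunk rather than on the whole payload.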

Single Text Source Payload

The following example demonstrates how to stream the audio as soon as the first byte arrives for a single text source.

import os
import requests

# Deepgram REST endpoint for text-to-speech.
DEEPGRAM_URL = "https://api.deepgram.com/v1/speak"
# Read the API key from the environment; the fallback is a placeholder.
DEEPGRAM_API_KEY = os.getenv("DEEPGRAM_API_KEY", "YOUR_DEEPGRAM_API_KEY")

input_text = "Your long text goes here..."
filename = "output.wav"

def main():
    headers = {
        "Authorization": f"Token {DEEPGRAM_API_KEY}",
        "Content-Type": "application/json",
    }
    # Model and audio format are passed as query parameters.
    params = {"model": "aura-helios-en", "encoding": "linear16", "container": "wav"}
    payload = {"text": input_text}

    try:
        # stream=True returns as soon as the response headers arrive,
        # so the body can be consumed while it is still being generated.
        with requests.post(DEEPGRAM_URL, headers=headers, params=params,
                           json=payload, stream=True, timeout=30) as response:
            response.raise_for_status()
            with open(filename, "wb") as f:
                for chunk in response.iter_content(chunk_size=1024):
                    if chunk:
                        # Write each chunk (or hand it to an audio player) as it arrives.
                        f.write(chunk)
        print("Audio stream finished.")
    except requests.RequestException as error:
        print(f"Error synthesizing audio: {error}")

if __name__ == "__main__":
    main()

Chunked Text Source Payload

This example shows how to chunk the text source by sentence boundaries and stream the audio for each chunk consecutively.

import re
from deepgram import DeepgramClient, SpeakOptions

# By default, the DEEPGRAM_API_KEY environment variable will be used for the API Key
deepgram = DeepgramClient()

input_text = "Your long text goes here..."

def segment_text_by_sentence(text):
    # Split on sentence-ending punctuation (., !, ?), keeping the punctuation.
    return re.findall(r"[^.!?]+[.!?]", text)

def synthesize_audio(text, filename):
    options = SpeakOptions(
        model="aura-helios-en",
        encoding="linear16",
        container="wav",
    )
    # save() writes the generated audio for this chunk to the given filename.
    return deepgram.speak.v("1").save(filename, {"text": text}, options)

def main():
    segments = segment_text_by_sentence(input_text)

    for i, segment in enumerate(segments):
        try:
            synthesize_audio(segment.strip(), f"output_{i}.wav")
            print(f"Audio stream finished for segment: {segment}")
        except Exception as error:
            print(f"Error synthesizing audio: {error}")

    print("Audio file creation completed.")

if __name__ == "__main__":
    main()
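
One caveat about the sentence-splitting pattern used above: `[^.!?]+[.!?]` only matches text that ends in `.`, `!`, or `?`, so a trailing fragment without terminal punctuation is silently dropped, and each match keeps its leading whitespace. A minimal alternative is sketched below; segment_text is a hypothetical helper, not part of the SDK, and like the original pattern it does not handle abbreviations or decimal numbers.

```python
import re
from typing import List

# Either a run of non-terminators followed by one or more terminators,
# or a final run of non-terminators with no closing punctuation.
_SENTENCE_RE = re.compile(r"[^.!?]+[.!?]+|[^.!?]+$")


def segment_text(text: str) -> List[str]:
    """Split text into sentence-like chunks, keeping any unterminated tail."""
    parts = (match.strip() for match in _SENTENCE_RE.findall(text))
    # Drop chunks that were only whitespace.
    return [part for part in parts if part]


segments = segment_text("Hello there. How are you? Thanks")
print(segments)
```

Each returned chunk can then be sent as its own text-to-speech request, in order, exactly as the loop in the example above does.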

Where to Find Additional Examples

The SDK repository has a good collection of text-to-speech examples, and its README links to them. Each example below demonstrates a different option for generating audio from text.

Some Examples:

If the Async Client suits your use case better: