Text to Speech Prompting
Prompting text-to-speech with natural pauses or filler words can help to make your audio sound more natural.
This guide introduces specific techniques for directing Deepgram Aura to produce audio output that sounds more like natural speech. Use punctuation and filler words to include intentional pauses in speech and to adjust the rhythm of speech for better pacing and engagement with users.
Pauses
Natural Pauses
Natural pauses are generated non-deterministically by our model. Depending on the context of the text input, a natural pause may come out as a long silence, a breath, or an elongation of the preceding or following word. To insert a longer pause in your audio, use an ellipsis (...). For even longer pauses, add more dots in groups of three (e.g., 6 dots). Dots that are not in groups of three (e.g., 5 dots) will not be treated as a pause.
A comma (,) or a period (.) in your text will be treated as a very short pause.
"Hello, how can I help you today? ... Are you there ... Hello?"
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"Hello, how can I help you today? ... Are you there ... Hello?"}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
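The dot-grouping rule above can be applied in a text-preparation step before the request is sent. The following is a minimal sketch, not part of any Deepgram SDK: the helper names `natural_pause` and `join_with_natural_pauses` are hypothetical, and the only behavior assumed is what this guide states (dots count as a pause only in groups of three).

```python
def natural_pause(groups: int = 1) -> str:
    """Return a pause marker made of `groups` groups of three dots."""
    if groups < 1:
        raise ValueError("groups must be at least 1")
    # Dots that are not in groups of three are not treated as a pause,
    # so the marker length is always a multiple of three.
    return "..." * groups


def join_with_natural_pauses(phrases: list[str], groups: int = 1) -> str:
    """Join phrases with a natural pause marker between each pair."""
    pause = natural_pause(groups)
    return f" {pause} ".join(phrase.strip() for phrase in phrases)


text = join_with_natural_pauses(
    ["Hello, how can I help you today?", "Are you there", "Hello?"]
)
print(text)  # Hello, how can I help you today? ... Are you there ... Hello?
```

The resulting string can be used as the value of the "text" field in the request body. Passing `groups=2` produces six dots for a longer pause.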
Use ellipses when you'd like to insert a natural pause that imitates thinking. They can be used together with filler words:
"Deepgram is great for real-time conversational experiences... and also, you can build apps for things like... customer support, logistics, and more. Um... what do you think about the voices?"
Use a comma or a period when you'd like to include a short pause in dense information. If they are used too densely, the speech may sound awkward. For example, to add pauses to a phone number, we suggest adding a period after every 3-4 characters instead of after every character:
"Your phone number is 203.912.3456."
Silent Pauses
Silent pauses are only available as a prompting workaround for our current Aura model and may not be available in future models. If used in extreme cases (e.g., too frequently, or with too many dots), they may not work as intended.
Silent pauses have a higher probability of staying silent for the duration of the pause, with a lower chance of the breaths or word elongation seen in natural pauses. You can create a silent pause with a series of dots separated by spaces (. . .). Note that this is not the same as an ellipsis (...). To increase the pause duration, include more dots separated by spaces (. . . .).
"To confirm, is your registration number BY. . 3984. . 0297?"
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"To confirm, is your registration number BY. . 3984. . 0297?"}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
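Silent pause markers are easy to mistype, since the spacing between the dots matters. The sketch below builds them programmatically and reproduces the registration-number example above. The function names `silent_pause` and `join_with_silent_pauses` are hypothetical helpers, and the default of two dots simply mirrors that example.

```python
def silent_pause(dots: int = 3) -> str:
    """Return `dots` dots separated by single spaces, e.g. '. . .'."""
    if dots < 2:
        raise ValueError("a silent pause needs at least two spaced dots")
    return " ".join(["."] * dots)


def join_with_silent_pauses(groups: list[str], dots: int = 2) -> str:
    """Join character groups with a silent pause marker after each group."""
    # The marker attaches directly to the preceding group and is
    # followed by one space, matching the example in this guide.
    separator = silent_pause(dots) + " "
    return separator.join(groups)


spoken = join_with_silent_pauses(["BY", "3984", "0297"])
text = f"To confirm, is your registration number {spoken}?"
print(text)  # To confirm, is your registration number BY. . 3984. . 0297?
```

Because silent pauses may not work as intended when overused, it is probably wise to keep `dots` small and apply the helper only where a clean gap matters.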
Filler words
Filler words such as "um" and "uh" can also be used to produce more natural-sounding audio output.
"Hello, how can I help you today? um Are you there uh Hello?"
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"Hello, how can I help you today? um Are you there uh Hello?"}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
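If the text is assembled in code, filler words can be interleaved between phrases automatically. This is a minimal sketch under that assumption; `add_fillers` is a hypothetical helper name, and the alternating "um" and "uh" pattern is taken from the example above rather than from any documented recommendation.

```python
from itertools import cycle


def add_fillers(phrases: list[str], fillers: tuple[str, ...] = ("um", "uh")) -> str:
    """Join phrases, placing the next filler word between each pair."""
    if not phrases:
        return ""
    filler_cycle = cycle(fillers)  # repeats the fillers in order
    parts = [phrases[0]]
    for phrase in phrases[1:]:
        parts.append(next(filler_cycle))
        parts.append(phrase)
    return " ".join(parts)


text = add_fillers(["Hello, how can I help you today?", "Are you there", "Hello?"])
print(text)  # Hello, how can I help you today? um Are you there uh Hello?
```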
Pronunciation
Pronunciation control
While we do not offer pronunciation control as part of our API, you can spell words out the way they are spoken and include them in the LLM prompt or apply them during text normalization. For example, "Thalia" can be written as "Taylia".
"Can I confirm that your name, spelled Teee Aitch Eigh Elle Eye Eigh, is pronounced as Taylia?"
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"Can I confirm that your name, spelled Teee Aitch Eigh Elle Eye Eigh, is pronounced as Taylia?"}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
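One way to apply spelled-out pronunciations during text normalization is a small lookup table that rewrites known words before the text is sent. The sketch below assumes such a table exists in your application. `PRONUNCIATIONS` and `apply_pronunciations` are hypothetical names, and the single entry comes from the example above.

```python
import re

# Hypothetical lookup table: written form -> spelled-as-spoken form.
PRONUNCIATIONS = {"Thalia": "Taylia"}


def apply_pronunciations(text: str, overrides: dict[str, str]) -> str:
    """Replace whole-word matches of each key with its spoken spelling."""
    if not overrides:
        return text
    # Longest keys first so longer entries win over shorter ones
    # that share a prefix.
    words = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda match: overrides[match.group(0)], text)


print(apply_pronunciations("Welcome back, Thalia.", PRONUNCIATIONS))
# Welcome back, Taylia.
```

Matching is case-sensitive and limited to whole words, so a longer word that merely contains a key is left unchanged.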
Alphabets
While we are working on improving the pronunciation of letters (A-Z), if you see pronunciation issues in your use case, we suggest using the following spelled-out words as your text input. You can include them in your LLM prompt or apply them through text normalization.
"The alphabets are Eigh, Beee, Sea, Deee, Eeeee, Eff, Geee, Aitch, Eye, Jay, Kay, Elle, Emm, En, Owe, Peee, Queue, Ar, Ess, Teee, Yue, Veee, Double Yue, Eks, Why, Zeee."
You can insert natural pauses between groups of 2-4 letters.
"To confirm, is your referral code Queue Why. Eigh Beee?"
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"To confirm, is your referral code Queue Why. Eigh Beee?"}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
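The spelled-out letter list above can be turned into a text normalization step. The following sketch maps each letter to the word suggested in this guide and inserts a period between groups of letters. `LETTER_WORDS` and `spell_letters` are hypothetical names, and the default group size of two is only chosen to reproduce the referral-code example.

```python
# Spelled-out words for A-Z, as listed in this guide.
LETTER_WORDS = {
    "A": "Eigh", "B": "Beee", "C": "Sea", "D": "Deee", "E": "Eeeee",
    "F": "Eff", "G": "Geee", "H": "Aitch", "I": "Eye", "J": "Jay",
    "K": "Kay", "L": "Elle", "M": "Emm", "N": "En", "O": "Owe",
    "P": "Peee", "Q": "Queue", "R": "Ar", "S": "Ess", "T": "Teee",
    "U": "Yue", "V": "Veee", "W": "Double Yue", "X": "Eks", "Y": "Why",
    "Z": "Zeee",
}


def spell_letters(code: str, group_size: int = 2) -> str:
    """Spell out the letters of `code`, with a period between groups."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Characters without an entry (digits, punctuation) are skipped.
    words = [LETTER_WORDS[ch] for ch in code.upper() if ch in LETTER_WORDS]
    groups = [
        " ".join(words[i:i + group_size])
        for i in range(0, len(words), group_size)
    ]
    return ". ".join(groups)


text = f"To confirm, is your referral code {spell_letters('QYAB')}?"
print(text)  # To confirm, is your referral code Queue Why. Eigh Beee?
```

Setting `group_size` to the length of the input yields an unbroken spelling, for example `spell_letters("Thalia", group_size=6)` gives "Teee Aitch Eigh Elle Eye Eigh".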
Acronyms
In most cases, acronyms can be handled by just providing the letters of the acronym. Because some acronyms are pronounced as a word (e.g., NASA) while others are spelled out letter by letter (e.g., NBA), Aura will attempt to pronounce each acronym correctly in your audio output.
"I love watching NBA Basketball."
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"I love watching NBA Basketball."}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
Numbers
Depending on how you want numbers to be spoken in your audio output, consider using the following prompts for number pronunciation.
Explicitly add the word "and" to tell the model to pronounce the entire phrase as "twelve hundred and thirty-five". Otherwise, the model will pronounce it as "twelve thirty-five".
"The total is 1235, or twelve hundred and thirty-five."
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"The total is 1235, or twelve hundred and thirty-five."}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
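If numbers arrive as digits, the spelled-out form can be generated in a normalization step rather than typed by hand. The sketch below covers only four-digit numbers read in the "hundreds" style used above. `hundreds_style` is a hypothetical helper, and whether this reading suits your use case is an assumption to check against your own audio output.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (f"-{ONES[ones]}" if ones else "")


def hundreds_style(n: int) -> str:
    """Spell a four-digit number as '<x> hundred and <y>'."""
    hundreds, rest = divmod(n, 100)
    # Reject values such as 2000 that would read as "twenty hundred".
    if not 1000 <= n <= 9999 or hundreds % 10 == 0:
        raise ValueError("not naturally read in the hundreds style")
    words = f"{two_digits(hundreds)} hundred"
    if rest:
        words += f" and {two_digits(rest)}"
    return words


text = f"The total is 1235, or {hundreds_style(1235)}."
print(text)  # The total is 1235, or twelve hundred and thirty-five.
```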
Words in different languages
If you are having trouble with words rooted in different languages in your text, you can try spelling the word out phonetically and Aura will attempt to pronounce the word correctly in your audio output. However, in some cases Aura can pronounce these words correctly without the phonetic spelling.
"I want to rahn-day-voo with you."
"I want to rendezvous with you."
curl --request POST \
--header "Content-Type: application/json" \
--header "Authorization: Token DEEPGRAM_API_KEY" \
--output your_output_file.mp3 \
--data '{"text":"I want to rahn-day-voo with you."}' \
--url "https://api.deepgram.com/v1/speak?model=aura-asteria-en"