Media Output Settings

Upon successful processing of a Deepgram text-to-speech request, you will receive an audio file containing the synthesized text-to-speech output.

Media Output Settings

Feature	API / Protocol
Encoding	`REST`, `Streaming`
Bit Rate	`REST`
Container	`REST`
Sample Rate	`REST`, `Streaming`

Supported Audio Formats

📘
If you don't provide any values for encoding, container or sample rate in your request Deepgram will provide the following defaults depending on which API / Protocol you use.
Streaming Defaults encoding: Linear16container: n/a sample_rate: 24000
REST Defaults encoding: Linear16 container: wav sample_rate: 24000

Audio Format	Description	Use Cases	API / Protocol
Linear16	Uncompressed, high-fidelity audio format	- Professional audio recordings - Broadcast applications - Quality-centric text-to-speech applications	`REST`, `Streaming`
MuLaw	Compressed audio format	- Telecommunication systems - Voice transmission in bandwidth-constrained environments	`REST`, `Streaming`
ALaw	Compressed audio format	- Telecommunication systems - Voice transmission where bandwidth efficiency is crucial	`REST`, `Streaming`
MP3	Lossy audio compression format. Capable of significantly reducing file sizes while maintaining good audio quality	- Audio streaming - Digital music distribution - Podcasts	`REST`
Opus	Versatile audio codec optimized for interactive speech and music transmission. Offers low latency and superior compression efficiency.	- VoIP (Voice over Internet Protocol) communications - Online gaming - Real-time interactive speech and music transmission	`REST`
FLAC	Lossless audio compression format that maintains original audio quality with efficient storage.	- Archival audio storage - High-quality audio applications	`REST`
AAC	Advanced audio coding format providing superior audio quality compared to MP3 at similar bitrates.	- Mobile applications - Streaming platforms - Text-to-speech (TTS) applications	`REST`

Audio Format Combinations

While Deepgram offers flexibility in choosing encoding options, there are predefined combinations of audio formats with specific configurations that you must adhere to. These combinations determine the encoding parameters, container format, sample rate, bitrate, and content-type header attached to the response.

📘
Refer to the above Table for which formats are supported for REST and Streaming.

Encoding	Container	Sample Rate (Hz)	Bitrate (bps)
`linear16`	`wav`(default), `none`	`8000`, `16000`, `24000`,(default), `32000`, `48000`	Not Applicable
`mulaw`	`wav`(default), `none`	`8000`(default), `16000`	Not Applicable
`alaw`	`wav`(default), `none`	`8000`(default), `16000`	Not Applicable
`mp3`(default)	Not Applicable	Not Configurable ( set to`22050`)	`32000`, `48000` (default)
`opus`	`ogg`(default)	Not Configurable (set to `48000`)	Default: `12000` Range: >= `4000`and <= `650000`
`flac`	Not Applicable	`8000`, `16000`, `22050`, `32000`, `48000`	Not Applicable
`aac`	Not Applicable	Not Configurable (set to `22050`)	Default:`48000`, Range: >=`4000`and <= `192000`

Audio Format Outputs

Depending on which Encoding and Container you choose to use, you will receive the following audio output.

Encoding	Container	Content-Type
`linear16`	`wav`	`audio/wav`
`linear16`	`none`	`audio/l16;rate=24000`
`mulaw`	`wav`	`audio/wav`
`mulaw`	`none`	`audio/mulaw;rate=8000`
`alaw`	`wav`	`audio/wav`
`alaw`	`none`	`audio/alaw;rate=8000`
`mp3`	`n/a`	`audio/mpeg`
`opus`	`ogg`	`audio/ogg;codecs=opus`
`flac`	`n/a`	`audio/flac`
`aac`	`n/a`	`audio/aac`

FAQ & Troubleshooting

Why does Streaming Text to Speech only support a limited set of audio formats?

Currently we only support linear16, mulaw , and alawfor our Streaming TTS Websocket. In the future we may offer additional support for other audio formats.

Why do I hear clicks in my audio?

If you encounter a problem of hearing "clicks" in the audio, we recommend that you add container=none to the request.

These clicks can occur when integrating with telephony providers, i.e. services that handle real-time communication such as VoIP (Voice over Internet Protocol). When using these services, you probably want to use raw audio without a container. If you don't specify container=none, the default container of wav will be requested, which will result in header information being misinterpreted by the service, leading to anomalies such as clicks or static sounds.

Be sure to ask Deepgram to provide raw audio by setting container=none; otherwise, you will likely hear static or clicking at the beginning of the audio playback as your speaker tries to interpret the accidentally-included container as actual audio.

🚧
When using VoIP (Voice over Internet Protocol), we recommend adding container=none to your request to prevent request header information being misinterpreted as audio, which can result in static or click sounds.

Why do certain audio formats have a default container and some don't?

Many encodings require a certain container format as part of their specification (e.g., MP3, FLAC, AAC), so Deepgram can't provide container options for those.

Other encodings can be container-less (e.g., raw audio). This is why wav and none are options for those.

With raw audio you no longer have metadata to give you information like the sample rate or encoding.

Why do certain audio formats have sample rate options and some don't?

Some formats specify their sample rates, so we cannot provide sample_rate options for them.

Why do certain audio formats have bit rates options and some don't?

Uncompressed audio formats don't have configurable bit rates, so we cannot provide bit_rate options for them.