Handling Audio Issues in Text To Speech
Learn how to better handle audio issues when processing text-to-speech.
Text-to-speech (TTS) technology involves challenges like producing natural-sounding speech and eliminating audio errors. Achieving high-quality output requires careful text pre-processing, accurate phonetic and prosodic modeling, and post-processing steps like noise reduction and compression.
Deepgram handles much of this for you, but if you are experiencing issues with your text-to-speech output, this guide will help you troubleshoot those problems.
Common problems
Attempting to Play Audio in a Web Browser
When you try to play generated audio in a web browser such as Chrome, you may not hear anything from your speakers because the stream is not containerized audio (i.e., audio with a defining header, such as a wav header).
Currently, the Text-to-Speech WebSocket implementation does not support containerized audio formats. To support playing audio through a web browser, it is recommended that you create and prepend a containerized audio header for your audio output bytes each time they are generated.
In the linear16 audio encoding case, you will need to prepend a WAV container header to each audio segment you plan to play through the speakers. This is required for many media device implementations within a browser. In Python, when you are writing to a file, the following may be sufficient to create a WAV header:
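import wave

AUDIO_FILE = "output.wav"  # hypothetical output path; replace with your own

# Generate a generic WAV container header
header = wave.open(AUDIO_FILE, "wb")
header.setnchannels(1)      # Mono audio
header.setsampwidth(2)      # 16-bit audio
header.setframerate(16000)  # Sample rate of 16000 Hz; must match the audio you requested
header.close()              # Closing writes the header with an empty data chunk
# Then continue saving the rest of the audio stream to the file (reopen it in append mode, "ab")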
In many cases, file-based playback is not desired, and you may want to play the audio directly by streaming to the media device. For those cases, you may need to manipulate the audio stream and create the header bytes directly in front of the audio stream segments. The following examples show this in JavaScript (Node.js), Python, Go, and C#, in that order:
// Add a wav audio container header to the file if you want to play the audio
// using the AudioContext or media player like VLC, Media Player, or Apple Music
// Without this header in the Chrome browser case, the audio will not play.
const wavHeader = Buffer.from([
0x52, 0x49, 0x46, 0x46, // "RIFF"
0x00, 0x00, 0x00, 0x00, // Placeholder for file size
0x57, 0x41, 0x56, 0x45, // "WAVE"
0x66, 0x6D, 0x74, 0x20, // "fmt "
0x10, 0x00, 0x00, 0x00, // Chunk size (16)
0x01, 0x00, // Audio format (1 for PCM)
0x01, 0x00, // Number of channels (1)
0x80, 0xBB, 0x00, 0x00, // Sample rate (48000)
0x00, 0x77, 0x01, 0x00, // Byte rate (48000 * 2 = 96000)
0x02, 0x00, // Block align (2)
0x10, 0x00, // Bits per sample (16)
0x64, 0x61, 0x74, 0x61, // "data"
0x00, 0x00, 0x00, 0x00 // Placeholder for data size
]);
// Concatenate the header to your audio buffer
const audio = Buffer.concat([wavHeader, audioBuffer]);
# Add a wav audio container header to the file if you want to play the audio
# using the AudioContext or media player like VLC, Media Player, or Apple Music
# Without this header in the Chrome browser case, the audio will not play.
header = bytes(
    [
        0x52, 0x49, 0x46, 0x46,  # "RIFF"
        0x00, 0x00, 0x00, 0x00,  # Placeholder for file size
        0x57, 0x41, 0x56, 0x45,  # "WAVE"
        0x66, 0x6D, 0x74, 0x20,  # "fmt "
        0x10, 0x00, 0x00, 0x00,  # Chunk size (16)
        0x01, 0x00,              # Audio format (1 for PCM)
        0x01, 0x00,              # Number of channels (1)
        0x80, 0xBB, 0x00, 0x00,  # Sample rate (48000)
        0x00, 0x77, 0x01, 0x00,  # Byte rate (48000 * 2 = 96000)
        0x02, 0x00,              # Block align (2)
        0x10, 0x00,              # Bits per sample (16)
        0x64, 0x61, 0x74, 0x61,  # "data"
        0x00, 0x00, 0x00, 0x00,  # Placeholder for data size
    ]
)
// Add a wav audio container header to the file if you want to play the audio
// using the AudioContext or media player like VLC, Media Player, or Apple Music
// Without this header in the Chrome browser case, the audio will not play.
header := []byte{
0x52, 0x49, 0x46, 0x46, // "RIFF"
0x00, 0x00, 0x00, 0x00, // Placeholder for file size
0x57, 0x41, 0x56, 0x45, // "WAVE"
0x66, 0x6d, 0x74, 0x20, // "fmt "
0x10, 0x00, 0x00, 0x00, // Chunk size (16)
0x01, 0x00, // Audio format (1 for PCM)
0x01, 0x00, // Number of channels (1)
0x80, 0xbb, 0x00, 0x00, // Sample rate (48000)
0x00, 0x77, 0x01, 0x00, // Byte rate (48000 * 2 = 96000)
0x02, 0x00, // Block align (2)
0x10, 0x00, // Bits per sample (16)
0x64, 0x61, 0x74, 0x61, // "data"
0x00, 0x00, 0x00, 0x00, // Placeholder for data size
}
// Add a wav audio container header to the file if you want to play the audio
// using the AudioContext or media player like VLC, Media Player, or Apple Music
// Without this header in the Chrome browser case, the audio will not play.
byte[] header = new byte[]
{
0x52, 0x49, 0x46, 0x46, // "RIFF"
0x00, 0x00, 0x00, 0x00, // Placeholder for file size
0x57, 0x41, 0x56, 0x45, // "WAVE"
0x66, 0x6d, 0x74, 0x20, // "fmt "
0x10, 0x00, 0x00, 0x00, // Chunk size (16)
0x01, 0x00, // Audio format (1 for PCM)
0x01, 0x00, // Number of channels (1)
0x80, 0xbb, 0x00, 0x00, // Sample rate (48000)
0x00, 0x77, 0x01, 0x00, // Byte rate (48000 * 2 = 96000)
0x02, 0x00, // Block align (2)
0x10, 0x00, // Bits per sample (16)
0x64, 0x61, 0x74, 0x61, // "data"
0x00, 0x00, 0x00, 0x00, // Placeholder for data size
};
There may be variations in creating this header based on the goals of your application.
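One such variation is building the header from parameters instead of hard-coding the bytes. The following is a minimal sketch, assuming 16-bit PCM audio; the function name build_wav_header and its defaults are hypothetical and chosen only for illustration. It produces the same 44-byte layout as the hand-written headers and derives the byte rate and block align from the other fields.

```python
import struct


def build_wav_header(sample_rate=48000, channels=1, bits_per_sample=16, data_size=0):
    """Return a 44-byte PCM WAV header as bytes.

    data_size=0 leaves both size fields as zero placeholders, which is a
    common choice when the total length of a stream is not known up front.
    """
    block_align = channels * bits_per_sample // 8   # bytes per sample frame
    byte_rate = sample_rate * block_align           # bytes per second
    riff_size = 36 + data_size if data_size else 0  # file size minus 8 bytes
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",  # little-endian, no padding
        b"RIFF", riff_size, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,
    )


# Prepend the header to a segment of raw linear16 audio
audio_segment = b"\x00\x00" * 480  # placeholder: 10 ms of silence at 48 kHz
playable = build_wav_header(sample_rate=48000) + audio_segment
```

The sample rate passed in should match the rate of the audio you requested; otherwise playback speed and pitch will be wrong.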
DC Offset
DC (Direct Current) offset refers to a mean amplitude displacement from zero in an audio signal. Essentially, it means the audio waveform is not centered on the zero axis. This can cause problems such as reduced headroom, distortion, and other audio inefficiencies.
Identifying DC Offset
- Visual Inspection: In audio editing software, the waveform appears shifted above or below the zero axis.
- Measurement: Calculate the average amplitude of the signal over time. A value that stays noticeably away from zero indicates DC offset.
Figures: a waveform with no DC offset, and a waveform with DC offset.
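The measurement approach can be sketched in a few lines of Python. This is an illustrative example rather than a definitive tool: it assumes the audio is raw 16-bit little-endian mono PCM (linear16 without a container), and the function name dc_offset is hypothetical.

```python
import struct


def dc_offset(pcm_bytes):
    """Return the mean sample value of 16-bit little-endian mono PCM.

    A result near 0 means the waveform is centered on the zero axis.
    """
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, pcm_bytes[: count * 2])
    return sum(samples) / count


centered = struct.pack("<4h", 1000, -1000, 500, -500)
shifted = struct.pack("<4h", 1200, -800, 700, -300)
print(dc_offset(centered))  # 0.0
print(dc_offset(shifted))   # 200.0
```

On real speech the mean is rarely exactly zero, so compare it against the signal's overall amplitude before deciding that a correction is needed.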
Correcting DC Offset
- DC Offset Removal Tool: Most audio editing software, such as Audacity, provides a DC offset correction tool.
- High-pass Filtering: Applying a high-pass filter with a very low cutoff frequency can also remove DC offset.
Correcting DC offset is important to ensure the audio signal's integrity and to prevent unwanted noise and distortion.
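Both corrections can also be applied in code. The sketch below assumes raw 16-bit little-endian mono PCM; the function names and the filter coefficient r=0.995 are illustrative assumptions, not recommended settings. remove_dc_offset subtracts the mean from a complete buffer, and dc_blocker is a simple one-pole high-pass filter with a very low cutoff.

```python
import struct


def remove_dc_offset(pcm_bytes):
    """Subtract the mean from a whole buffer of 16-bit little-endian mono PCM."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return b""
    samples = struct.unpack("<%dh" % count, pcm_bytes[: count * 2])
    mean = round(sum(samples) / count)
    # Clamp so the shifted samples still fit in a signed 16-bit integer
    centered = [max(-32768, min(32767, s - mean)) for s in samples]
    return struct.pack("<%dh" % count, *centered)


def dc_blocker(samples, r=0.995):
    """High-pass filter: y[n] = x[n] - x[n-1] + r * y[n-1].

    Takes and returns a list of numeric sample values. Values of r closer
    to 1 give a lower cutoff frequency.
    """
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = x - prev_x + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

Mean subtraction suits complete files. For streamed audio, a filter is usually the better fit, but this version resets its state on every call, so you would need to carry prev_x and prev_y across chunks.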
To learn more about how to use Audacity, see the Audacity Docs.
Audio clicking noises
Using container=none
When playing generated audio directly to an output device (for example, using PortAudio) for streaming audio playback, we recommend adding container=none to your request to prevent container header information from being misinterpreted as audio, which can result in static or click sounds. If you've changed the encoding and still hear clicks, the cause may be a different problem.
Depending on your use case, you may require containerized audio. To understand when to use one or the other, check out the Getting Started Guide.
To learn more about the different text-to-speech encoding options, see this guide.
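As a rough illustration of the container=none recommendation, the sketch below builds a request URL with the option set as a query parameter. The base URL is an assumption made for this example and should be checked against the API reference; only the encoding and container parameters come from this guide, and the helper name speak_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: illustrative base URL; confirm the real endpoint in the API reference.
BASE_URL = "https://api.deepgram.com/v1/speak"


def speak_url(encoding="linear16", container="none"):
    """Build a text-to-speech request URL that asks for raw, headerless audio."""
    params = {"encoding": encoding, "container": container}
    return BASE_URL + "?" + urlencode(params)


print(speak_url())
# https://api.deepgram.com/v1/speak?encoding=linear16&container=none
```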
Analyzing audio with FFmpeg
FFmpeg is a powerful tool that you can use to analyze audio files with issues such as clicking noises.
For example, you could take an audio file that has clicking noises and run:
ffmpeg -f mulaw -codec:a pcm_mulaw -ar 8000 -ac 1 -i your_audio_file -f s16le -codec:a pcm_s16le -ar 8000 -ac 1 pipe:1 | aucat -c 0:0 -e s16le -f snd/1 -h raw -r 8000 -i
Then, import the raw audio into Audacity for further analysis.
To learn more about how to use FFmpeg, see the FFmpeg Docs.
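Once the audio has been decoded to raw 16-bit PCM (the s16le output of the command above), a quick scripted check can point to where the clicks occur before you open the file in an editor. This is a heuristic sketch only: the function name find_clicks and the threshold of 20000 are assumptions for illustration, and loud but legitimate audio can also trigger it.

```python
import struct


def find_clicks(pcm_bytes, threshold=20000):
    """Return sample indexes where the value jumps sharply from the previous sample.

    Expects 16-bit little-endian mono PCM. A click often shows up as an
    abrupt discontinuity between neighboring samples.
    """
    count = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % count, pcm_bytes[: count * 2])
    return [
        i for i in range(1, count)
        if abs(samples[i] - samples[i - 1]) > threshold
    ]


smooth_with_spike = struct.pack("<7h", 0, 100, 200, 300, 30000, 400, 500)
print(find_clicks(smooth_with_spike))  # [4, 5]
```

Dividing an index by the sample rate gives the position in seconds, which you can then jump to in Audacity.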
Audio header issues
In some cases, additional headers might be added to the response, which could cause audio playback issues. For example, if you request container=wav, you should expect only one wav header at the beginning of the response.
You can inspect your audio headers by running hexdump on the audio file in your terminal:
hexdump -C your_audio_file
The terminal shows output like the following. In this case, the wav header is duplicated: there are two RIFF headers in the audio.
This is the source of the clicking.
00000000 52 49 46 46 56 1d 00 00 57 41 56 45 66 6d 74 20 |RIFFV...WAVEfmt |
00000010 10 00 00 00 01 00 01 00 40 1f 00 00 80 3e 00 00 |........@....>..|
00000020 02 00 10 00 64 61 74 61 32 1d 00 00 52 49 46 46 |....data2...RIFF|
00000030 24 00 ff 7f 57 41 56 45 66 6d 74 20 10 00 00 00 |$...WAVEfmt ....|
00000040 01 00 01 00 40 1f 00 00 80 3e 00 00 02 00 10 00 |....@....>......|
00000050 64 61 74 61 00 00 ff 7f 0e f9 69 fb a6 ff 28 02 |data......i...(.|
00000060 15 04 3e 05 fd 02 81 00 9f fa 90 f5 d9 ee 02 eb |..>.............|
00000070 7d fb 1d 03 16 0b f8 15 c4 17 05 15 3b 0f a3 ff |}...........;...|
00000080 e9 f4 2f f1 42 ef 9f f4 59 fa cd fe 2d 08 7b 0f |../.B...Y...-.{.|
00000090 ae 10 cc 0d 18 07 f1 ff d0 fa ea f4 d4 f0 89 f3 |................|
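The same check can be scripted. The sketch below, with a hypothetical helper named find_wav_headers, scans a byte string for every position where a RIFF tag is followed eight bytes later by a WAVE tag. The sample data is the first 64 bytes of the dump above.

```python
def find_wav_headers(data):
    """Return the byte offsets at which a RIFF....WAVE header starts."""
    offsets = []
    pos = data.find(b"RIFF")
    while pos != -1:
        # A WAV header has "WAVE" 8 bytes after "RIFF" (4-byte tag + 4-byte size)
        if data[pos + 8 : pos + 12] == b"WAVE":
            offsets.append(pos)
        pos = data.find(b"RIFF", pos + 1)
    return offsets


sample = bytes.fromhex(
    "52 49 46 46 56 1d 00 00 57 41 56 45 66 6d 74 20 "
    "10 00 00 00 01 00 01 00 40 1f 00 00 80 3e 00 00 "
    "02 00 10 00 64 61 74 61 32 1d 00 00 52 49 46 46 "
    "24 00 ff 7f 57 41 56 45 66 6d 74 20 10 00 00 00 "
)
print(find_wav_headers(sample))  # [0, 44]
```

A healthy file should report a single offset of 0. Any further offsets mark extra headers that will be played back as audio.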
Hexdump is available on macOS and Linux. If you use Windows, try HxD.
Audio includes other types of data
Ensure your audio output contains no data other than the audio response. For example, a common issue is code that mistakenly writes JSON into the audio response. This JSON is sometimes inserted after Deepgram returns a Flush JSON message indicating that a WebSocket stream has closed.
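One way to guard against this is to separate messages by type before writing anything to the audio output. The following is a minimal sketch, assuming your WebSocket client delivers binary frames as bytes and text frames as str (as the Python websockets library does). The function name split_messages and the sample JSON payload are hypothetical.

```python
import json


def split_messages(messages):
    """Separate binary audio from JSON text messages.

    Returns (audio_bytes, control_messages). Only bytes-like messages are
    treated as audio; text messages are parsed as JSON and kept apart.
    """
    audio = bytearray()
    control = []
    for message in messages:
        if isinstance(message, (bytes, bytearray)):
            audio.extend(message)
        else:
            control.append(json.loads(message))
    return bytes(audio), control


# Illustrative sequence of received messages (payload contents are made up)
received = [b"\x01\x02", '{"type": "Flushed"}', b"\x03"]
audio, control = split_messages(received)
print(audio)    # b'\x01\x02\x03'
print(control)  # [{'type': 'Flushed'}]
```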
Download Audacity
Audacity is a free, open-source, cross-platform audio editing software. It allows users to record, edit, and mix audio tracks, and supports various file formats like WAV, MP3, and OGG. Key features include multi-track editing, noise reduction, click removal, equalization, compression, and various effects and plugins. Audacity is widely used for podcasting, music production, and general audio editing tasks due to its powerful features and user-friendly interface. It is available for Windows, macOS, and Linux.
Download FFmpeg
FFmpeg is a free and powerful open-source software suite for handling multimedia data. It can record, convert, stream, and play audio and video files in various formats.