Handling Audio Issues in Text To Speech
Learn how to better handle audio issues when processing text-to-speech.
Text-to-speech (TTS) technology involves challenges like producing natural-sounding speech and eliminating audio errors. Achieving high-quality output requires careful text pre-processing, accurate phonetic and prosodic modeling, and post-processing steps like noise reduction and compression.
Deepgram handles much of this for you, but if you are experiencing issues with your text-to-speech output, this guide will help you troubleshoot those problems.
Common problems
Attempting to Play Audio in a Web Browser
When attempting to play generated audio within a web browser, like Chrome, you may not hear any audio from your speakers because the stream is not containerized audio (i.e., audio with a defining header, such as a wav header).
Currently, the Text-to-Speech WebSocket implementation does not support containerized audio formats. To support playing audio through a web browser, it is recommended that you create and prepend a containerized audio header for your audio output bytes each time they are generated.
In the linear16 audio encoding case, you will need to prepend a WAV container header to each audio segment you plan to play through the speakers. This is required for many media device implementations within a browser. In Python, when the audio is written to a file, generating a generic WAV header with a standard WAV writer is often sufficient.
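For the file-based case, a minimal sketch using Python's standard-library wave module might look like the following. The function name write_wav and its parameters are hypothetical, and the sample rate, channel count, and sample width are assumptions that must match what you requested for the linear16 audio.

```python
import wave


def write_wav(path, pcm_bytes, sample_rate, channels=1, sample_width=2):
    """Write raw linear16 PCM bytes to `path`, preceded by a WAV header.

    Assumption: `pcm_bytes` holds 16-bit PCM samples, and `sample_rate`
    and `channels` match the values used when the audio was requested.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        # The wave module writes the header and fills in the sizes on close.
        wav_file.writeframes(pcm_bytes)
```

Calling write_wav("output.wav", audio_bytes, sample_rate=24000) would produce a file that browsers and most media players can open, provided the stated parameters match the audio.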
In many cases, file-based playback is not desired, and you may want to play the audio directly by streaming it to the media device. For those cases, you may need to manipulate the audio stream and create the header bytes directly in front of each audio stream segment. This can be done in Go, for example.
There may be variations in creating this header based on the goals of your application.
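As an illustration of building the header bytes by hand, here is a sketch in Python rather than Go. It assumes uncompressed 16-bit PCM and the canonical 44-byte WAV header layout. The helper names wav_header and containerize are hypothetical.

```python
import struct


def wav_header(data_size, sample_rate, channels=1, bits_per_sample=16):
    """Build a canonical 44-byte WAV header for uncompressed PCM audio."""
    block_align = channels * bits_per_sample // 8
    byte_rate = sample_rate * block_align
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size, b"WAVE",  # RIFF chunk descriptor
        b"fmt ", 16, 1,                    # fmt chunk, 1 = PCM format tag
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,                # data chunk header
    )


def containerize(pcm_segment, sample_rate):
    """Prepend a WAV header to one segment of raw linear16 audio."""
    return wav_header(len(pcm_segment), sample_rate) + pcm_segment
```

Each segment passed through containerize becomes a complete, self-describing WAV payload. Whether to wrap every segment or one accumulated buffer depends on how your player consumes audio.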
DC Offset
DC (Direct Current) offset refers to a mean amplitude displacement from zero in an audio signal. Essentially, it means the audio waveform is not centered on the zero axis. This can cause problems such as reduced headroom, distortion, and other audio inefficiencies.
Identifying DC Offset
- Visual Inspection: In audio editing software, the waveform appears shifted above or below the zero axis.
- Measurement: Calculate the average amplitude of the signal over time. A value that is not close to zero indicates a DC offset.
Figure: No DC Offset. This audio doesn't contain a DC offset.
Figure: DC Offset. This audio does contain a DC offset.
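The measurement approach can be sketched in Python as follows. This assumes headerless 16-bit little-endian PCM (linear16). The function name dc_offset is hypothetical.

```python
import array
import sys


def dc_offset(pcm_bytes):
    """Return the mean sample value of linear16 PCM, scaled to [-1.0, 1.0].

    Assumption: `pcm_bytes` is raw 16-bit little-endian PCM with no header.
    """
    usable = len(pcm_bytes) - (len(pcm_bytes) % 2)  # drop a trailing odd byte
    samples = array.array("h")
    samples.frombytes(pcm_bytes[:usable])
    if sys.byteorder == "big":
        samples.byteswap()  # the data is little-endian regardless of host
    if not samples:
        return 0.0
    return sum(samples) / len(samples) / 32768.0
```

A result near 0.0 suggests the waveform is centered. A value noticeably above or below zero suggests the signal is shifted in that direction.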
Correcting DC Offset
- DC Offset Removal Tool: Most audio editing software, such as Audacity, provides a DC offset correction tool.
- High-pass Filtering: Applying a high-pass filter with a very low cutoff frequency can also remove DC offset.
Correcting DC offset is important to ensure the audio signal's integrity and to prevent unwanted noise and distortion.
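Both corrections can be approximated in a few lines of Python. The sketch below works on a list of float samples. remove_mean subtracts the average, which suits a complete clip, and dc_block is a simple one-pole high-pass filter (a common "DC blocker"), which suits streaming. Both function names and the coefficient r=0.995 are illustrative assumptions, not settings taken from any particular tool.

```python
def remove_mean(samples):
    """Center a complete clip by subtracting its average value."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def dc_block(samples, r=0.995):
    """One-pole high-pass filter: y[n] = x[n] - x[n-1] + r * y[n-1].

    Values of `r` closer to 1.0 give a lower cutoff frequency but a longer
    settling time at the start of the signal.
    """
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = x - prev_x + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

The filter needs a short time to settle, so the first samples of its output still carry part of the offset.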
To learn more about how to use Audacity, see the Audacity Docs.
Audio clicking noises
Using container=none
When playing generated audio directly to an output device (for example, using PortAudio) for streaming audio playback, we recommend adding container=none to your request to prevent container header information from being misinterpreted as audio, which can result in static or click sounds. If you've made this change and still hear clicks, another problem may be the cause.
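As a sketch of where the parameter goes, the helper below builds a request URL with container=none in the query string. The helper name speak_url is hypothetical, and the endpoint and model name in the usage note are assumptions for illustration. Check them against the API reference before use.

```python
from urllib.parse import urlencode


def speak_url(base_url, model, encoding="linear16", container="none"):
    """Build a text-to-speech request URL that asks for headerless audio.

    Assumption: the service accepts `model`, `encoding`, and `container`
    as query parameters on `base_url`.
    """
    query = urlencode(
        {"model": model, "encoding": encoding, "container": container}
    )
    return f"{base_url}?{query}"
```

For example, speak_url("https://api.deepgram.com/v1/speak", "aura-asteria-en") yields a URL whose query string ends in container=none.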
Depending on your use case, you may require containerized audio. To understand the differences and when to use one or the other, check out the Getting Started Guide.
To learn more about the different text-to-speech encoding options, see this guide.
Analyzing audio with FFmpeg
FFmpeg is a powerful tool that you can use to analyze audio files that have issues such as clicking noises.
For example, you could take an audio file that has clicking noises, convert it to raw audio with FFmpeg, and then import the raw audio into Audacity for further analysis.
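One possible way to run that conversion from Python is sketched below. It assumes the ffmpeg binary is on your PATH and that 16-bit little-endian PCM is the desired raw format. The function names are hypothetical. The flags used (-y, -i, -f s16le, -acodec pcm_s16le, -ar) are standard FFmpeg options.

```python
import shutil
import subprocess


def ffmpeg_to_raw_args(input_path, output_path, sample_rate=None):
    """Build an FFmpeg command that decodes audio to headerless 16-bit PCM."""
    args = ["ffmpeg", "-y", "-i", input_path, "-f", "s16le", "-acodec", "pcm_s16le"]
    if sample_rate is not None:
        args += ["-ar", str(sample_rate)]  # optional resampling
    args.append(output_path)
    return args


def convert_to_raw(input_path, output_path, sample_rate=None):
    """Run the conversion, failing early if FFmpeg is not installed."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(
        ffmpeg_to_raw_args(input_path, output_path, sample_rate), check=True
    )
```

The resulting file has no header, so when importing it as raw data in Audacity you need to supply the encoding, byte order, channel count, and sample rate yourself.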
To learn more about how to use FFmpeg, see the FFmpeg Docs.
Audio header issues
In some cases, additional headers might be added to the response, which could cause audio playback issues. For example, if you request container=wav, you should expect only one wav header at the beginning of the response.
You can inspect your audio headers by running hexdump on the audio file in your terminal before playing it.
In the output, look for RIFF markers. If two RIFF headers appear, the audio contains duplicate WAV headers, and this is the source of the clicking.
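The same check can be scripted. The sketch below scans a byte buffer for WAV headers by looking for RIFF followed eight bytes later by WAVE, which reduces false matches from audio samples that happen to spell RIFF. The function name find_wav_headers is hypothetical.

```python
def find_wav_headers(data):
    """Return the byte offsets of every RIFF/WAVE header found in `data`."""
    offsets = []
    start = 0
    while True:
        index = data.find(b"RIFF", start)
        if index == -1:
            break
        # A WAV header has "WAVE" right after the 4-byte RIFF size field.
        if data[index + 8 : index + 12] == b"WAVE":
            offsets.append(index)
        start = index + 1
    return offsets
```

For a healthy container=wav response, the result should be [0]. More than one offset means extra headers are embedded in the audio.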
Hexdump is available on macOS and Linux. If you use Windows, try HxD.
Audio includes other types of data
Ensure your audio output contains no data other than the audio response. For example, a common issue is code that mistakenly writes JSON into the audio output. This JSON can sometimes be inserted after Deepgram has returned a Flush JSON message indicating that a WebSocket stream has closed.
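One way to keep JSON out of the audio is to route messages by type as they arrive. The sketch below assumes a WebSocket client that delivers binary frames as bytes and text frames as str, as the Python websockets library does. The function name handle_message and the example payload in the usage note are hypothetical.

```python
import json


def handle_message(message, audio_buffer, events):
    """Append binary frames to the audio buffer and parse text frames as JSON.

    `audio_buffer` is a bytearray that collects audio only.
    `events` is a list that collects decoded JSON control messages.
    """
    if isinstance(message, (bytes, bytearray)):
        audio_buffer.extend(message)
    else:
        # Text frames are control or status messages, never audio.
        events.append(json.loads(message))
```

With this split, a status message such as '{"type": "Flushed"}' ends up in events, and only audio bytes reach the player or the output file.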
Download Audacity
Audacity is free, open-source, cross-platform audio editing software. It allows users to record, edit, and mix audio tracks, and supports various file formats like WAV, MP3, and OGG. Key features include multi-track editing, noise reduction, click removal, equalization, compression, and various effects and plugins. Audacity is widely used for podcasting, music production, and general audio editing tasks. It is available for Windows, macOS, and Linux.
Download FFmpeg
FFmpeg is a free and powerful open-source software suite for handling multimedia data. It can record, convert, stream, and play audio and video files in various formats.