Pipecat and Deepgram
Build a real-time voice AI agent using Pipecat with Deepgram speech-to-text and text-to-speech.
This guide walks you through building a voice AI agent that uses Pipecat for pipeline orchestration and Deepgram for speech-to-text (STT) and text-to-speech (TTS). By the end, you will have a working voice agent that listens to a user, generates a response with an LLM, and speaks back in real time.
Pipecat is an open-source Python framework for building voice and multimodal AI agents. It connects STT, LLM, and TTS services into a real-time pipeline and handles audio transport, turn-taking, and interruption detection.
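To make the STT, LLM, and TTS flow concrete, here is a minimal sketch of a three-stage pipeline using only the Python standard library. It is a toy model, not Pipecat's API: `ToyPipeline`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical names invented for illustration, and real services stream audio frames rather than passing strings.

```python
from dataclasses import dataclass
from typing import Callable, List

# A stage takes one frame of data and returns the transformed frame.
Stage = Callable[[str], str]


def fake_stt(audio: str) -> str:
    # Stand-in for speech-to-text: treat the "audio" payload as a transcript.
    return audio.removeprefix("audio:")


def fake_llm(transcript: str) -> str:
    # Stand-in for the LLM: wrap the transcript in a canned reply.
    return f"You said: {transcript}"


def fake_tts(reply: str) -> str:
    # Stand-in for text-to-speech: tag the reply as synthesized speech.
    return f"speech:{reply}"


@dataclass
class ToyPipeline:
    stages: List[Stage]

    def run(self, frame: str) -> str:
        # Pass the frame through each stage in order.
        for stage in self.stages:
            frame = stage(frame)
        return frame


pipeline = ToyPipeline([fake_stt, fake_llm, fake_tts])
result = pipeline.run("audio:hello")
print(result)  # speech:You said: hello
```

The ordering of the stages is the point of the sketch: each service consumes the previous one's output, which is why a service can be swapped for another without touching the rest of the chain.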
Before you can use Deepgram, you need to create a Deepgram account. Signup is free and includes $200 in credit.
You need:
Install the Pipecat CLI, then scaffold a new project:
Navigate to the server directory inside your new project, create a virtual environment, and install the dependencies:
Copy the example environment file and fill in your API keys:
Replace the placeholder values with your API keys:
The remaining values are defaults you can change later.
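Before starting the bot, it can help to confirm that the keys you just filled in are actually present. The sketch below parses a `.env` file with the standard library and reports missing or empty entries. The key name `DEEPGRAM_API_KEY` is an assumption, as is the simple `KEY=VALUE` file format; check your project's example environment file for the real names and add your LLM provider's key to the list.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Assumed key name; replace with the names from your own environment file.
REQUIRED = ["DEEPGRAM_API_KEY"]

if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    for key in missing_keys(env, REQUIRED):
        print(f"Missing or empty: {key}")
```

Run it from the directory that contains your `.env` file. No output means every key in `REQUIRED` has a value.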
You need two terminals: one for the server and one for the client.
In the first terminal, start the bot from the server directory:
The first run takes about 20 seconds to download the Silero VAD model. Subsequent starts are faster.
In a second terminal, install the client dependencies and start the dev server:
Open the local URL shown in your terminal after npm run dev, allow microphone access, and start talking to your agent. Try these interactions:
If you prefer not to use the CLI, clone the quickstart repo. The quickstart uses Deepgram for STT but Cartesia for TTS. To switch TTS to Deepgram, open bot.py and find the Cartesia TTS setup:
Replace it with:
You can also remove CARTESIA_API_KEY from your .env file since it is no longer needed. No other changes are required. The STT service already uses Deepgram and the rest of the pipeline stays the same.
Flux is Deepgram’s conversational STT model with built-in turn detection. It uses acoustic and semantic cues to determine when a speaker has finished their turn, resulting in more natural conversations.
To use Flux, replace DeepgramSTTService with DeepgramFluxSTTService in your bot.py:
Even when using Flux for turn detection, keep the Silero VAD analyzer in your pipeline. Flux handles turn detection, but VAD is required for interruption handling. Without it, the agent cannot detect when a user speaks over the agent’s response.
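The split between turn detection and interruption handling can be illustrated with a small state machine. This is a conceptual sketch only, not how Pipecat or Flux is implemented: `ToyAgent` and its method names are hypothetical. It shows why an end-of-turn signal alone is not enough, since it arrives only after the user has finished, while voice activity is what reveals that the user has started talking over the agent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyAgent:
    bot_speaking: bool = False
    events: List[str] = field(default_factory=list)

    def start_reply(self) -> None:
        # The agent begins speaking its response.
        self.bot_speaking = True
        self.events.append("bot_started")

    def on_vad(self, user_is_speaking: bool) -> None:
        # Voice activity arrives before any end-of-turn signal, so it is
        # the only cue available while the agent is still talking.
        if user_is_speaking and self.bot_speaking:
            self.bot_speaking = False
            self.events.append("bot_interrupted")

    def on_end_of_turn(self, transcript: str) -> None:
        # End-of-turn says the user has finished; the agent then replies.
        self.events.append(f"user_turn:{transcript}")
        self.start_reply()


agent = ToyAgent()
agent.on_end_of_turn("what's the weather")
agent.on_vad(user_is_speaking=True)  # user talks over the reply
agent.on_end_of_turn("never mind")
print(agent.events)
```

If `on_vad` were removed from this model, the first reply would keep playing until the second end-of-turn arrived, which is the failure the guide warns about.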
To change the agent's voice, set DEEPGRAM_VOICE_ID in your .env file.
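One way a bot might read that setting is shown below, as a sketch under stated assumptions. How the scaffolded bot.py actually consumes `DEEPGRAM_VOICE_ID` is not shown in this guide, so `resolve_voice` is a hypothetical helper, and the fallback voice name is only an illustrative value. Values in a `.env` file reach `os.environ` only if something loads the file first.

```python
import os
from typing import Mapping, Optional

# Illustrative fallback only; use whichever voice your project expects.
FALLBACK_VOICE = "aura-2-thalia-en"


def resolve_voice(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured voice ID, or the fallback if unset or blank."""
    source = os.environ if env is None else env
    voice = source.get("DEEPGRAM_VOICE_ID", "").strip()
    return voice or FALLBACK_VOICE


print(resolve_voice({"DEEPGRAM_VOICE_ID": "my-voice-id"}))  # my-voice-id
```

Treating a blank value the same as a missing one avoids passing an empty voice name to the TTS service when the line is present in `.env` but not filled in.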