Pipecat and Deepgram
Build a real-time voice AI agent using Pipecat with Deepgram speech-to-text and text-to-speech.
This guide walks you through building a voice AI agent that uses Pipecat for pipeline orchestration and Deepgram for speech-to-text (STT) and text-to-speech (TTS). By the end, you'll have a working voice agent that listens to a user, generates a response with an LLM, and speaks back in real time.
Pipecat is an open-source Python framework for building voice and multimodal AI agents. It connects STT, LLM, and TTS services into a real-time pipeline and handles audio transport, turn-taking, and interruption detection.
Before you begin
Before you can use Deepgram, you need to create a Deepgram account. Signup is free and includes $200 in credit.
You need:
- A Deepgram API key
- An LLM API key. This guide uses OpenAI, but Pipecat supports other providers including Anthropic, Google, and Groq
- Python 3.11+
- Node.js 18+
- uv (for installing the Pipecat CLI)
Step 1: Create the project
Install the Pipecat CLI, then scaffold a new project:
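A sketch of the commands, assuming the CLI tool name and `init` subcommand from recent Pipecat docs (verify against the current CLI reference for your version):

```shell
# Install the Pipecat CLI as a uv tool
uv tool install pipecat-ai-cli

# Scaffold a new project (creates server/ and client/ directories)
pipecat init my-voice-agent
```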
Step 2: Install dependencies
Navigate to the server directory inside your new project, create a virtual environment, and install the dependencies:
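Assuming the standard scaffold layout with a server/ directory and a requirements.txt:

```shell
cd my-voice-agent/server

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate

# Install the server dependencies
pip install -r requirements.txt
```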
Step 3: Configure your environment
Copy the example environment file and fill in your API keys:
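Assuming the scaffold ships an example env file named env.example (the exact filename may differ in your project):

```shell
cp env.example .env
```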
Replace the placeholder values with your API keys:
- DEEPGRAM_API_KEY: from your Deepgram Console
- OPENAI_API_KEY: from your OpenAI dashboard
- DAILY_API_KEY: from your Daily dashboard. Daily is the WebRTC transport layer that handles audio between the browser and your agent. See the Pipecat Daily transport guide for more.
The remaining values are defaults you can change later.
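The resulting .env looks roughly like this (placeholder values shown; the scaffold may include additional defaults):

```
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
DAILY_API_KEY=your_daily_api_key
```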
Step 4: Run the agent
You need two terminals: one for the server and one for the client.
In the first terminal, start the bot from the server directory:
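With the virtual environment from Step 2 activated:

```shell
cd server
python bot.py
```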
The first run takes about 20 seconds to download the Silero VAD model. Subsequent starts are faster.
In a second terminal, install the client dependencies and start the dev server:
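Assuming the scaffold's client/ directory uses npm scripts:

```shell
cd client
npm install
npm run dev
```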
Step 5: Test the conversation
Open the local URL shown in your terminal after npm run dev, allow microphone access, and start talking to your agent. Try these interactions:
- Ask a question and confirm the agent responds with speech.
- Start speaking while the agent is talking. It should stop and listen.
- Pause after speaking. The agent should detect the end of your turn and respond.
Using the quickstart repo instead
If you prefer not to use the CLI, clone the quickstart repo. The quickstart uses Deepgram for STT but Cartesia for TTS. To switch TTS to Deepgram, open bot.py and find the Cartesia TTS setup:
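The Cartesia setup in the quickstart's bot.py looks roughly like this (the import path and voice ID are illustrative; check the file for the exact values in your copy):

```python
import os

from pipecat.services.cartesia.tts import CartesiaTTSService

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="...",  # the quickstart's Cartesia voice ID
)
```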
Replace it with:
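A sketch of the Deepgram equivalent, assuming the service module path used by recent Pipecat versions and an Aura-2 voice ID:

```python
import os

from pipecat.services.deepgram.tts import DeepgramTTSService

tts = DeepgramTTSService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    voice="aura-2-thalia-en",  # any Deepgram Aura voice ID works here
)
```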
You can also remove CARTESIA_API_KEY from your .env file since it is no longer needed. No other changes are required. The STT service already uses Deepgram and the rest of the pipeline stays the same.
Use Flux for turn detection
Flux is Deepgram's conversational STT model with built-in turn detection. It uses acoustic and semantic cues to determine when a speaker has finished their turn, resulting in more natural conversations.
To use Flux, replace DeepgramSTTService with DeepgramFluxSTTService in your bot.py:
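A sketch of the swap; the Flux import path is an assumption and may differ across Pipecat versions, so check your version's service reference:

```python
import os

# Before: standard Deepgram STT
# from pipecat.services.deepgram.stt import DeepgramSTTService
# stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

# After: Flux, with built-in turn detection
from pipecat.services.deepgram.flux.stt import DeepgramFluxSTTService

stt = DeepgramFluxSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
```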
Even when using Flux for turn detection, keep the Silero VAD analyzer in your pipeline. Flux handles turn detection, but VAD is required for interruption handling. Without it, the agent cannot detect when a user speaks over the agent's response.
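The VAD analyzer lives on the transport, not the STT service, so it stays in place when you switch to Flux. A sketch, assuming the Daily transport from this guide:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    vad_analyzer=SileroVADAnalyzer(),  # keep this even with Flux
)
```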
Go further with Deepgram
- Voices: Deepgram offers 60+ voices across seven languages. Browse the voice library and update DEEPGRAM_VOICE_ID in your .env file.
- Keyterm prompting: Improve recognition of domain-specific vocabulary by passing keyterms to Nova-3 via the STT service settings.
- Speaker diarization: Assign a speaker identifier to each word in the transcript using diarization via the STT service settings.
- Dynamic STT settings: Pipecat supports updating Deepgram STT settings without reconnecting. See the Pipecat Deepgram STT guide for details.
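Keyterm prompting and diarization are both passed through Deepgram's LiveOptions on the STT service. A sketch, assuming the Deepgram Python SDK's LiveOptions and hypothetical keyterm values:

```python
import os

from deepgram import LiveOptions
from pipecat.services.deepgram.stt import DeepgramSTTService

stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=LiveOptions(
        model="nova-3",
        keyterm=["Pipecat", "Deepgram"],  # your domain terms (examples only)
        diarize=True,                     # speaker label on each word
    ),
)
```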