Build a Voice Agent with Pipecat and Deepgram
Learn how to build a real-time voice agent using Deepgram for speech-to-text and text-to-speech, with Pipecat for pipeline orchestration.
If you already use Pipecat for voice AI pipelines, you can add Deepgram's speech-to-text and text-to-speech models to your Pipecat agent. Pipecat's pipeline architecture connects separate STT, LLM, and TTS services, so this guide pairs Deepgram's audio models with OpenAI for language understanding, though any Pipecat-compatible LLM works.
For a standalone voice agent without Pipecat or an external LLM, see the Deepgram Voice Agent API, which bundles STT, LLM routing, and TTS in a single WebSocket connection.
For a CLI-scaffolded approach using pipecat init, see the Pipecat and Deepgram integration guide.
Before You Begin
This guide assumes you are familiar with Python and have a basic understanding of how voice agents work.
You'll need a Deepgram account and an API key. Signup is free and includes $200 in credit.
Get OpenAI Credentials
This tutorial uses OpenAI for its LLM. You'll need to sign up for an OpenAI account and obtain an API key.
Requirements
- Python 3.11+
Set Up Your Project
This implementation is a starting reference for building your own voice agent with Pipecat and Deepgram. It is not designed for production deployments.
Create a new directory, set up a virtual environment, and install Pipecat with the Deepgram, OpenAI, and runner extras:
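A minimal setup sketch is below. The directory name is arbitrary, and the exact extras names are an assumption based on the guide's description (Deepgram, OpenAI, and runner extras); depending on your Pipecat version you may need additional extras such as `silero` or `webrtc`:

```shell
# Create a project directory and isolate dependencies in a virtual environment
mkdir voice-agent && cd voice-agent
python3 -m venv venv
source venv/bin/activate

# Install Pipecat with the Deepgram, OpenAI, and runner extras
pip install "pipecat-ai[deepgram,openai,runner]"
```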
The runner extra includes a built-in WebRTC transport with a browser-based test client, so no external account is needed.
Set Environment Variables
Create a .env file in your project root with the credentials you collected earlier. The agent reads these at startup to authenticate with each service:
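A sketch of the `.env` file is below. The variable names are the conventional ones the agent code reads via `os.getenv`; replace the placeholder values with your actual keys:

```
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
```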
Build the Agent
The agent creates a pipeline that connects audio input to Deepgram for transcription, OpenAI for a response, and Deepgram again for speech synthesis. Pipecat handles the audio transport, turn-taking, and interruption detection.
The key components are:
- Pipeline – connects frame processors in sequence: transport input, STT, context aggregation, LLM, TTS, and transport output.
- LLMContextAggregatorPair – manages conversation context and uses Silero VAD to detect when the user starts and stops speaking.
- LLMRunFrame – triggers the LLM to generate a response. Used here to make the agent greet the user on connect.
Create bot.py:
How the pipeline works
Audio flows through the pipeline from left to right:
- transport.input() captures microphone audio from the browser.
- stt (Deepgram Nova-3) transcribes audio into text in real time.
- user_aggregator collects transcription frames and uses Silero VAD to detect when the user finishes speaking, then adds the complete utterance to the conversation context.
- llm (OpenAI GPT-4o) generates a text response based on the conversation context.
- tts (Deepgram Aura) converts the response text to speech.
- transport.output() sends the audio back to the browser.
- assistant_aggregator records the assistant's response in the conversation context for future turns.
When a user speaks over the agent, Silero VAD detects the interruption and cancels the current TTS output so the agent stops and listens.
Run the Agent
Start the agent with the WebRTC transport. The -t webrtc flag launches a built-in browser client for testing:
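A sketch of the launch command, assuming Pipecat's development runner is driving the script (the `-t webrtc` flag is passed through to the runner):

```shell
python bot.py -t webrtc
```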
The first run takes about 20 seconds to download the Silero VAD model. Subsequent starts are faster.
Test the Agent
Once the agent is running, open http://localhost:7860/client in your browser.
- Allow microphone access when prompted
- Click to connect; the agent greets you automatically
- Start talking; the agent responds in real time
Try interrupting the agent mid-sentence. Silero VAD detects your speech and cancels the current response so the agent listens to you instead.
Further Reading
- Deepgram Voice Agent API – build voice agents without an external LLM or transport layer
- Pipecat and Deepgram Integration Guide – scaffold a project with the Pipecat CLI
- Pipecat Documentation – Pipecat's framework reference