Build a Voice Agent with Pipecat and Deepgram
Learn how to build a real-time voice agent using Deepgram for speech-to-text and text-to-speech, with Pipecat for pipeline orchestration.
Learn how to build a real-time voice agent using Deepgram for speech-to-text and text-to-speech, with Pipecat for pipeline orchestration.
If you already use Pipecat for voice AI pipelines, you can add Deepgram’s speech-to-text and text-to-speech models to your Pipecat agent. Pipecat’s pipeline architecture connects separate STT, LLM, and TTS services, so this guide pairs Deepgram’s audio models with OpenAI for language understanding — though any Pipecat-compatible LLM works.
For a standalone voice agent without Pipecat or an external LLM, see the Deepgram Voice Agent API, which bundles STT, LLM routing, and TTS in a single WebSocket connection.
For a CLI-scaffolded approach using pipecat init, see the Pipecat and Deepgram integration guide.
This guide assumes you are familiar with Python and have a basic understanding of how voice agents work.
You’ll need a Deepgram account and an API key. Signup is free and includes $200 in credit.
This tutorial uses OpenAI for its LLM. You’ll need to sign up for an OpenAI account and obtain an API key.
This implementation is a starting reference for building your own voice agent with Pipecat and Deepgram. It is not designed for production deployments.
Create a new directory, set up a virtual environment, and install Pipecat with the Deepgram, OpenAI, and runner extras:
The runner extra includes a built-in WebRTC transport with a browser-based test client — no external account needed.
Create a .env file in your project root with the credentials you collected earlier. The agent reads these at startup to authenticate with each service:
The agent creates a pipeline that connects audio input to Deepgram for transcription, OpenAI for a response, and Deepgram again for speech synthesis. Pipecat handles the audio transport, turn-taking, and interruption detection.
The key components are:
Pipeline — connects frame processors in sequence: transport input, STT, context aggregation, LLM, TTS, and transport output.LLMContextAggregatorPair — manages conversation context and uses Silero VAD to detect when the user starts and stops speaking.LLMRunFrame — triggers the LLM to generate a response. Used here to make the agent greet the user on connect.Create bot.py:
Audio flows through the pipeline from left to right:
transport.input() captures microphone audio from the browser.stt (Deepgram Nova-3) transcribes audio into text in real-time.user_aggregator collects transcription frames and uses Silero VAD to detect when the user finishes speaking, then adds the complete utterance to the conversation context.llm (OpenAI GPT-4o) generates a text response based on the conversation context.tts (Deepgram Aura) converts the response text to speech.transport.output() sends the audio back to the browser.assistant_aggregator records the assistant’s response in the conversation context for future turns.When a user speaks over the agent, Silero VAD detects the interruption and cancels the current TTS output so the agent stops and listens.
Start the agent with the WebRTC transport. The -t webrtc flag launches a built-in browser client for testing:
The first run takes about 20 seconds to download the Silero VAD model. Subsequent starts are faster.
Once the agent is running, open http://localhost:7860/client in your browser.
Try interrupting the agent mid-sentence. Silero VAD detects your speech and cancels the current response so the agent listens to you instead.