Build a Voice Agent with Pipecat and Deepgram

Learn how to build a real-time voice agent using Deepgram for speech-to-text and text-to-speech, with Pipecat for pipeline orchestration.

If you already use Pipecat for voice AI pipelines, you can add Deepgram's speech-to-text and text-to-speech models to your Pipecat agent. Pipecat's pipeline architecture connects separate STT, LLM, and TTS services, so this guide pairs Deepgram's audio models with OpenAI for language understanding, though any Pipecat-compatible LLM works.

For a standalone voice agent without Pipecat or an external LLM, see the Deepgram Voice Agent API, which bundles STT, LLM routing, and TTS in a single WebSocket connection.

For a CLI-scaffolded approach using pipecat init, see the Pipecat and Deepgram integration guide.

Before You Begin

This guide assumes you are familiar with Python and have a basic understanding of how voice agents work.

You'll need a Deepgram account and an API key. Signup is free and includes $200 in credit.

Get OpenAI Credentials

This tutorial uses OpenAI as its LLM provider. You'll need to sign up for an OpenAI account and obtain an API key.

Requirements

  • Python 3.11+

Set Up Your Project

This implementation is a starting reference for building your own voice agent with Pipecat and Deepgram. It is not designed for production deployments.

Create a new directory, set up a virtual environment, and install Pipecat with the Deepgram, OpenAI, Silero, and WebRTC extras:

$ mkdir deepgram-pipecat-agent
$ cd deepgram-pipecat-agent
$ python -m venv venv
$ source venv/bin/activate  # On Windows: venv\Scripts\activate
$ pip install "pipecat-ai[deepgram,openai,silero,webrtc]" python-dotenv

The webrtc extra includes a built-in WebRTC transport with a browser-based test client, so no external account is needed.

Set Environment Variables

Create a .env file in your project root with the credentials you collected earlier. The agent reads these at startup to authenticate with each service:

DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
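
As a quick sanity check, you can confirm the file parses and both keys are filled in before starting the agent. The loader below is a minimal standard-library sketch (the agent itself uses python-dotenv); `load_env_file` and `missing_credentials` are illustrative names, not part of any library:

```python
REQUIRED_KEYS = ("DEEPGRAM_API_KEY", "OPENAI_API_KEY")


def load_env_file(path: str = ".env") -> dict[str, str]:
    # Minimal KEY=value parser; skips blank lines and # comments.
    env: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env


def missing_credentials(env: dict[str, str]) -> list[str]:
    # Report any required key that is absent or left empty.
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

If `missing_credentials` returns a non-empty list, fix your .env before running the agent.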

Build the Agent

The agent creates a pipeline that connects audio input to Deepgram for transcription, OpenAI for a response, and Deepgram again for speech synthesis. Pipecat handles the audio transport, turn-taking, and interruption detection.

The key components are:

  • Pipeline: connects frame processors in sequence (transport input, STT, context aggregation, LLM, TTS, and transport output).
  • LLMContextAggregatorPair: manages conversation context and uses Silero VAD to detect when the user starts and stops speaking.
  • LLMRunFrame: triggers the LLM to generate a response; used here to make the agent greet the user on connect.

Create bot.py:

# bot.py

import os

from dotenv import load_dotenv

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.deepgram.tts import DeepgramTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams

load_dotenv(override=True)

transport_params = {
    "webrtc": lambda: TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
}


async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        settings=DeepgramSTTService.Settings(
            model="nova-3-general",
            language="en",
            punctuate=True,
            smart_format=True,
        ),
    )

    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        settings=OpenAILLMService.Settings(
            model="gpt-4o",
            system_instruction=(
                "You are a friendly, helpful voice assistant. "
                "Keep your responses concise: aim for 1-3 sentences "
                "unless the user asks for detail. "
                "Your responses will be spoken aloud, so avoid emojis, "
                "bullet points, or other formatting that can't be spoken."
            ),
        ),
    )

    tts = DeepgramTTSService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        settings=DeepgramTTSService.Settings(
            voice="aura-2-thalia-en",
        ),
    )

    context = LLMContext()

    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    pipeline = Pipeline([
        transport.input(),
        stt,
        user_aggregator,
        llm,
        tts,
        transport.output(),
        assistant_aggregator,
    ])

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            enable_metrics=True,
            enable_usage_metrics=True,
        ),
    )

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        context.add_message(
            {"role": "developer", "content": "Greet the user and ask how you can help."}
        )
        await task.queue_frames([LLMRunFrame()])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        await task.cancel()

    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
    await runner.run(task)


async def bot(runner_args: RunnerArguments):
    transport = await create_transport(runner_args, transport_params)
    await run_bot(transport, runner_args)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()

How the pipeline works

Audio flows through the pipeline from left to right:

  1. transport.input() captures microphone audio from the browser.
  2. stt (Deepgram Nova-3) transcribes audio into text in real time.
  3. user_aggregator collects transcription frames and uses Silero VAD to detect when the user finishes speaking, then adds the complete utterance to the conversation context.
  4. llm (OpenAI GPT-4o) generates a text response based on the conversation context.
  5. tts (Deepgram Aura) converts the response text to speech.
  6. transport.output() sends the audio back to the browser.
  7. assistant_aggregator records the assistant's response in the conversation context for future turns.
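
The left-to-right flow above can be sketched as plain function composition. The toy stages below are placeholders that stand in for Pipecat's real frame processors, just to make the ordering concrete:

```python
from functools import reduce


# Each toy stage transforms a frame dict and hands it to the next stage,
# mirroring the pipeline ordering: STT, then LLM, then TTS.
def stt(frame):
    frame["transcript"] = f"transcribed({frame['audio']})"
    return frame


def llm(frame):
    frame["response"] = f"reply-to({frame['transcript']})"
    return frame


def tts(frame):
    frame["audio_out"] = f"speech({frame['response']})"
    return frame


def run_pipeline(stages, frame):
    # Fold the frame through every stage in order.
    return reduce(lambda f, stage: stage(f), stages, frame)


result = run_pipeline([stt, llm, tts], {"audio": "mic-input"})
print(result["audio_out"])  # speech(reply-to(transcribed(mic-input)))
```

In the real agent, the transport input/output and the two context aggregators wrap this same chain, and the "frames" carry audio buffers and transcription events rather than strings.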

When a user speaks over the agent, Silero VAD detects the interruption and cancels the current TTS output so the agent stops and listens.
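
Interruption handling is typically configurable on the pipeline task. The fragment below is a hedged sketch: many Pipecat releases expose an allow_interruptions flag on PipelineParams, but check the PipelineParams definition in your installed version for the exact field:

```python
# Sketch only: verify allow_interruptions exists in your Pipecat version.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        allow_interruptions=True,  # let user speech cancel in-flight TTS
        enable_metrics=True,
        enable_usage_metrics=True,
    ),
)
```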

Run the Agent

Start the agent with the WebRTC transport. The -t webrtc flag launches a built-in browser client for testing:

$ python bot.py -t webrtc

The first run takes about 20 seconds to download the Silero VAD model. Subsequent starts are faster.

Test the Agent

Once the agent is running, open http://localhost:7860/client in your browser.

  1. Allow microphone access when prompted
  2. Click to connect โ€” the agent greets you automatically
  3. Start talking โ€” the agent responds in real time

Try interrupting the agent mid-sentence. Silero VAD detects your speech and cancels the current response so the agent listens to you instead.

Further Reading