Getting Started with Flux
Flux is the first conversational speech recognition model built specifically for voice agents. Unlike traditional STT that just transcribes words, Flux understands conversational flow and automatically handles turn-taking.
Flux tackles the most critical challenges for voice agents today: knowing when to listen, when to think, and when to speak. The model features first-of-its-kind model-integrated end-of-turn detection, configurable turn-taking dynamics, and ultra-low latency optimized for voice agent pipelines, all with Nova-3 level accuracy.
Flux is perfect for: turn-based voice agents, customer service bots, phone assistants, and real-time conversation tools.
Multilingual support: Flux Multilingual (flux-general-multi) extends Flux to 10 languages with optional language_hint biasing. See the Language Prompting guide for details.
Key Benefits:
EagerEndOfTurn events for faster replies
For more information on how Flux manages turns, see the Flux State Machine Guide.
Flux requires the /v2/listen endpoint. Using /v1/listen will not work with Flux.
When connecting to Flux, you must use:
/v2/listen (not /v1/listen)
flux-general-en for English or flux-general-multi for multilingual workloads
WebSocket URL Format:
When using the Deepgram SDK, use client.listen.v2.connect() to access the v2 endpoint. For direct WebSocket connections, ensure you’re using /v2/listen in your URL.
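For direct WebSocket connections, the endpoint and model requirements above can be sketched as a small URL builder. This is an illustration only: the host api.deepgram.com and the helper name build_flux_url are assumptions, not taken from this guide, and the encoding and sample_rate values are example settings for raw linear16 audio.

```python
from urllib.parse import urlencode

# Assumption: the public Deepgram API host; adjust if your deployment differs.
DEEPGRAM_HOST = "api.deepgram.com"

# Model names listed in this guide.
FLUX_MODELS = ("flux-general-en", "flux-general-multi")


def build_flux_url(model: str = "flux-general-en", **params: str) -> str:
    """Build a /v2/listen WebSocket URL for a Flux model (hypothetical helper)."""
    if model not in FLUX_MODELS:
        raise ValueError(f"unknown Flux model: {model!r}")
    # The model goes first, followed by any extra query parameters.
    query = urlencode({"model": model, **params})
    return f"wss://{DEEPGRAM_HOST}/v2/listen?{query}"


if __name__ == "__main__":
    print(build_flux_url(encoding="linear16", sample_rate="16000"))
```

Rejecting unknown model names up front catches the model=flux mistake before a connection attempt is made.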
Flux provides three key parameters to control end-of-turn detection behavior and optimize your voice agent’s conversational flow:
For most use cases, the default eot_threshold=0.7 works well. You only need to configure these parameters to do one of the following:
Set eager_eot_threshold to enable EagerEndOfTurn events and start LLM processing before the user fully finishes speaking
Increase eot_timeout_ms to avoid cutting off turns prematurely
Raise eot_threshold to reduce false positives (at the cost of slightly higher latency)
Lower eot_threshold to trigger turns earlier
Important: Setting eager_eot_threshold enables EagerEndOfTurn and TurnResumed events. These events allow you to start preparing LLM responses early, reducing end-to-end latency by hundreds of milliseconds. See the Eager End-of-Turn Optimization Guide for implementation strategies.
Cost Consideration: Using EagerEndOfTurn can increase LLM API calls by 50-70% due to speculative response generation. The TurnResumed event signals when to cancel a draft response because the user continued speaking.
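The interplay between EagerEndOfTurn, TurnResumed, and the final end of turn can be sketched as a small draft manager. This is a minimal sketch, not the SDK's API: DraftManager, handle, and fake_llm are hypothetical names, events are passed in as plain strings, and the final event is assumed to be named EndOfTurn. It also assumes the final transcript matches the one that triggered the draft.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class DraftManager:
    """Tracks one speculative LLM draft per turn (hypothetical helper)."""

    def __init__(self, generate: Callable[[str], Awaitable[str]]) -> None:
        self._generate = generate
        self._draft: Optional[asyncio.Task] = None
        self.llm_calls = 0  # counts every LLM request, including cancelled drafts

    def _start(self, transcript: str) -> None:
        # Replace any stale draft with one based on the latest transcript.
        if self._draft is not None:
            self._draft.cancel()
        self.llm_calls += 1
        self._draft = asyncio.create_task(self._generate(transcript))

    async def handle(self, event: str, transcript: str = "") -> Optional[str]:
        """Return the reply on EndOfTurn, otherwise None."""
        if event == "EagerEndOfTurn":
            self._start(transcript)  # begin drafting early
        elif event == "TurnResumed" and self._draft is not None:
            self._draft.cancel()  # the user kept talking, so drop the draft
            self._draft = None
        elif event == "EndOfTurn":
            if self._draft is None:
                self._start(transcript)  # no usable draft, generate now
            draft, self._draft = self._draft, None
            return await draft
        return None


async def demo() -> tuple:
    async def fake_llm(text: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for a real LLM call
        return f"reply to: {text}"

    manager = DraftManager(fake_llm)
    await manager.handle("EagerEndOfTurn", "I'd like to book")
    await manager.handle("TurnResumed")
    await manager.handle("EagerEndOfTurn", "I'd like to book a flight")
    reply = await manager.handle("EndOfTurn", "I'd like to book a flight")
    return reply, manager.llm_calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, one turn costs two LLM calls because the first draft was discarded, which is the kind of overhead the cost note describes.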
For comprehensive parameter documentation and tuning guidance, see the End-of-Turn Configuration guide.
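As a rough illustration, the three parameters can be collected into a query-string fragment that includes only the values you chose to override. The helper name turn_query is hypothetical, passing these settings as URL query parameters is an assumption based on the eot_threshold=0.7 notation above, and the values in the usage lines are illustrative rather than recommended.

```python
from typing import Optional
from urllib.parse import urlencode


def turn_query(
    eot_threshold: Optional[float] = None,
    eager_eot_threshold: Optional[float] = None,
    eot_timeout_ms: Optional[int] = None,
) -> str:
    """Return a query-string fragment with only the parameters that were set."""
    candidates = {
        "eot_threshold": eot_threshold,
        "eager_eot_threshold": eager_eot_threshold,
        "eot_timeout_ms": eot_timeout_ms,
    }
    # Unset parameters are omitted so the service defaults apply.
    return urlencode({k: v for k, v in candidates.items() if v is not None})


if __name__ == "__main__":
    print(turn_query(eager_eot_threshold=0.5))
    print(turn_query(eot_threshold=0.8, eot_timeout_ms=7000))
```

Leaving every argument unset yields an empty string, which matches the advice that the defaults work well for most use cases.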
Dynamic Configuration: You can update these parameters mid-stream using the Configure control message without disconnecting and reconnecting. This is useful for adapting to changing conversation context or user behavior.
Common Mistakes to Avoid:
Using /v1/listen instead of /v2/listen
Using model=flux instead of model=flux-general-en or model=flux-general-multi
Adding a language=en parameter (use the model name to select language support; use language_hint with flux-general-multi for language biasing)
Passing language_hint to flux-general-en (only flux-general-multi supports it)
Specifying encoding or sample_rate when sending containerized audio (omit these for containerized formats)
This guide walks you through building a basic streaming transcription application powered by Deepgram Flux and the Deepgram SDK.
By the end of this guide, you’ll have:
Audio Stream
To handle the audio stream, we will use the following conversion approach:
Install the additional dependencies:
FFmpeg on your machine
You will need the actual FFmpeg binary installed to run this demo:
macOS: brew install ffmpeg
Debian/Ubuntu: sudo apt install ffmpeg
Other platforms: download from https://ffmpeg.org/
.env file
Create a .env file in your project root with your Deepgram API key:
Replace your_deepgram_api_key with your actual Deepgram API key.
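A small startup check can catch a missing key or a placeholder that was never replaced. The sketch below assumes the .env file contains a single line of the form DEEPGRAM_API_KEY=your_deepgram_api_key; the variable name and the require_api_key helper are assumptions, not taken from this guide. With python-dotenv installed, calling load_dotenv() first copies the .env entries into os.environ.

```python
import os
from typing import Mapping

# Assumption: the .env file holds a line like
#   DEEPGRAM_API_KEY=your_deepgram_api_key
ENV_VAR = "DEEPGRAM_API_KEY"
PLACEHOLDER = "your_deepgram_api_key"


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key, or raise if it is missing or still the placeholder."""
    key = env.get(ENV_VAR, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set {ENV_VAR} in your .env file to a real API key")
    return key
```

Failing at startup gives a clearer error than a rejected connection later in the stream.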
Core Dependencies:
asyncio - Handles concurrent audio streaming and Deepgram connection
subprocess - Manages FFmpeg process for audio conversion
dotenv - Loads Deepgram API key from .env file
Deepgram SDK:
AsyncDeepgramClient - Main client for Flux API connection
EventType - WebSocket event constants (OPEN, MESSAGE, CLOSE, ERROR)
ListenV2TurnInfo - Type hints for incoming transcription messages
Configuration:
STREAM_URL - BBC World Service streaming audio endpoint
Visual Feedback System:
Colors class - ANSI terminal color codes for confidence visualization
get_confidence_color() - Maps confidence scores to colors:
Purpose: Sets up the foundation for real-time streaming transcription with visual quality indicators, making it easy to spot transcription accuracy at a glance.
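The visual feedback pieces described above might look roughly like the following. The Colors and get_confidence_color names come from this guide, but the specific colors, the 0.90 and 0.70 cutoffs, and the colorize helper are assumptions chosen for illustration.

```python
class Colors:
    """ANSI terminal color codes (standard bright foreground escapes)."""

    GREEN = "\033[92m"
    YELLOW = "\033[93m"
    RED = "\033[91m"
    RESET = "\033[0m"


def get_confidence_color(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a color.

    The cutoffs are illustrative assumptions, not values from the guide.
    """
    if confidence >= 0.90:
        return Colors.GREEN
    if confidence >= 0.70:
        return Colors.YELLOW
    return Colors.RED


def colorize(word: str, confidence: float) -> str:
    """Wrap a word in its confidence color and reset afterwards (hypothetical helper)."""
    return f"{get_confidence_color(confidence)}{word}{Colors.RESET}"


if __name__ == "__main__":
    sample = [("hello", 0.98), ("world", 0.75), ("flux", 0.40)]
    print(" ".join(colorize(word, score) for word, score in sample))
```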
The main function orchestrates real-time transcription of streaming audio URLs:
Creates an AsyncDeepgramClient and connects to Flux with the required linear16 format
Uses FFmpeg to convert the incoming audio stream to linear16 PCM format
The function handles both the audio conversion requirement (Flux only accepts linear16) and real-time streaming coordination between multiple async processes.
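The conversion half of that flow can be sketched with FFmpeg and asyncio alone. This is a sketch under stated assumptions: STREAM_URL below is a hypothetical placeholder rather than the real stream address, the 16 kHz mono settings and the helper names are illustrative, and sending the chunks to Flux is left out.

```python
import asyncio
from typing import AsyncIterator, List

# Hypothetical placeholder; substitute the streaming audio URL you want to use.
STREAM_URL = "https://example.com/live-radio-stream"


def build_ffmpeg_command(url: str, sample_rate: int = 16000) -> List[str]:
    """Build an FFmpeg command that writes raw linear16 PCM to stdout."""
    return [
        "ffmpeg",
        "-loglevel", "quiet",     # keep FFmpeg logs out of the terminal
        "-i", url,                # input: the streaming audio URL
        "-f", "s16le",            # output container: raw signed 16-bit little-endian
        "-acodec", "pcm_s16le",   # output codec: linear16 PCM
        "-ar", str(sample_rate),  # resample to the target rate
        "-ac", "1",               # downmix to mono
        "-",                      # write to stdout
    ]


async def pcm_chunks(command: List[str], chunk_size: int = 2560) -> AsyncIterator[bytes]:
    """Yield fixed-size chunks from the command's stdout until it closes.

    2560 bytes is 80 ms of 16 kHz, 16-bit mono audio.
    """
    process = await asyncio.create_subprocess_exec(
        *command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        while True:
            chunk = await process.stdout.read(chunk_size)
            if not chunk:
                break  # FFmpeg exited or the stream ended
            yield chunk
    finally:
        # Stop FFmpeg if the consumer stops reading early.
        if process.returncode is None:
            process.kill()
        await process.wait()
```

A consumer would iterate with async for chunk in pcm_chunks(build_ffmpeg_command(STREAM_URL)) and forward each chunk to the open Flux connection.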
Here’s the complete working example that combines all the steps. You can also find this code on GitHub.
For additional demos showcasing Flux, check out the following repositories:
Are you ready to build a voice agent with Flux? See our Build a Flux-enabled Voice Agent Guide to get started.