Optimize Voice Agent Latency with Eager End of Turn
Reduce end-to-end latency by preparing responses early with Eager End of Turn events.
Eager end of turn processing is the practice of starting LLM processing on medium-confidence transcripts (EagerEndOfTurn
events) before waiting for a high-confidence EndOfTurn
. By overlapping LLM generation with user speech, you can cut hundreds of milliseconds from your agentβs response time.
Important Note
EagerEndOfTurn
and TurnResumed
events are ONLY triggered if you have configured the eager_eot_threshold
in your connection string.
How Eager End of Turn Processing Works
-
Receive
EagerEndOfTurn
- Flux is moderately confident the user has finished speaking.
- Send the transcript downstream to your LLM and begin preparing a reply.
-
If
TurnResumed
occurs- The user wasnβt finished after all.
- Cancel the in-progress response and wait for the next
EagerEndOfTurn
orEndOfTurn
.
-
If
EndOfTurn
occurs- The user is done speaking with high confidence.
- Finalize and deliver the response youβve already started preparing.
EndOfTurn
transcript will exactly match theEagerEndOfTurn
transcript, ensuring consistent transcription throughout the turn lifecycle.
Implementation Strategies
1. Simple (Recommended Starting Point)
- Use only
EndOfTurn
events. - Simplest implementation and minimal LLM calls
- Ideal for majority of developers
2. Optimized (With Eager End of Turn Processing)
- Use both
EagerEndOfTurn
andEndOfTurn
events. - Reduce latency by preparing replies early with speculative response generation.
- Expect more LLM calls and slightly more complexity.
- Recommended once youβre confident in your pipeline and want production-grade performance.
Tips & Tricks for Eager End of Turn Processing
Tune Confidence Thresholds
eager_eot_threshold
: Lower values β earlier triggers, but more false starts.eot_threshold
: Higher values β more reliableEndOfTurn
, but may increase latency.- Experiment with values to balance speed vs. stability.
Handle TurnResumed
Gracefully
- Treat
TurnResumed
as a cancellation signal. - Be ready to discard or revise any LLM replies in progress.
- Consider a retry strategy if this happens often in your use case.
Keep Responses Flexible
- Avoid committing to a reply until
EndOfTurn
. - Use
EagerEndOfTurn
outputs to draft, not finalize. - Build resilience for when transcripts shift slightly.
Optimize LLM Cost
- Eager end of turn processing means more LLM requests.
- To reduce spend:
- Use smaller/faster models for EagerEndOfTurn drafts.
- Only call the full LLM on
EndOfTurn
. - Cache prepared responses and reuse on
EndOfTurn
(transcript guaranteed to match).
Monitor and Log Events
- Track how often
EagerEndOfTurn
βTurnResumed
vs.EagerEndOfTurn
βEndOfTurn
. - Use this data to refine thresholds and tune your pipeline.
Example Code
This code demonstrates how to handle Flux message events to implement eager end-of-turn processing. The examples show message parsing and event handling for the three critical events: -
EagerEndOfTurn
(start preparing response)TurnResumed
(cancel draft response)EndOfTurn
(finalize and deliver response)
Summary
When to Use Eager End of Turn
- High-interruption environments (e.g., call centers, IVRs).
- Conversational agents where natural back-and-forth timing matters.
- Latency-sensitive apps where response speed is critical to user experience.
- If your LLM configuration is complex and has high latency issues. e.g., Good for trimming that last 100-200ms of end-to-end latency at the cost of 50-70% more LLM calls.
- If Your LLM configuration has complex RAG (Retrieval-Augmented Generation) or Function Calling involved.
When to use End of Turn only
- Most developers will find
EndOfTurn
detection sufficiently fast enough to support natural conversation, but not all voice AI workflows are the same. - For more complex and expensive voice AI workflows, it might be worthwhile to use
EagerEndOfTurn
to call LLMs speculatively, i.e., in preparation for an upcoming turn end, in order to minimize response latency.