Aura-2 Voice Controls — Speed and Pronunciation
Aura-2 TTS voices now support runtime speed and pronunciation controls in English and Spanish, available on both batch and streaming WebSocket endpoints. Together, these controls give developers finer-grained tools for improving naturalness: tuning pacing and correcting pronunciation to match their needs.
Speed control adjusts the speaking rate of generated audio while maintaining natural prosody, with a supported range of 0.7x–1.5x. For Spanish voices, the recommended range is 0.9x–1.5x; values below 0.9x may introduce disfluencies.
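The ranges above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any SDK: the function name `check_speed`, the constants, and the use of language codes beginning with "es" to identify Spanish voices are all assumptions made for illustration.

```python
# Hypothetical client-side helper; names and language-code convention are assumptions.
SUPPORTED_RANGE = (0.7, 1.5)      # supported speed range from the release note
SPANISH_RECOMMENDED_MIN = 0.9     # recommended lower bound for Spanish voices


def check_speed(speed: float, language: str = "en") -> list[str]:
    """Validate a speed multiplier and return any advisory warnings.

    Raises ValueError when the value is outside the supported range.
    Returns a list of warning strings (empty when there is nothing to flag).
    """
    low, high = SUPPORTED_RANGE
    if not low <= speed <= high:
        raise ValueError(f"speed {speed} is outside the supported range {low}x-{high}x")

    warnings = []
    # Assumption: Spanish voices are identified by a language code starting with "es".
    if language.lower().startswith("es") and speed < SPANISH_RECOMMENDED_MIN:
        warnings.append(
            f"speed {speed} is below the recommended {SPANISH_RECOMMENDED_MIN}x "
            "for Spanish voices and may introduce disfluencies"
        )
    return warnings
```

A caller might log the returned warnings rather than block the request, since values between 0.7x and 0.9x are still within the supported range for Spanish.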
Pronunciation control overrides the default pronunciation of specific words using inline IPA notation, e.g. The patient was prescribed {"word": "dupilumab", "pronounce": "duːˈpɪljuːmæb"}.
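One way to produce the inline form shown above is to generate it from a word-to-IPA mapping. This is a sketch under stated assumptions: the helper names `pronounce_tag` and `apply_pronunciations` are hypothetical, and matching is exact and case-sensitive on whole words.

```python
import json
import re


def pronounce_tag(word: str, ipa: str) -> str:
    """Build one inline override in the form shown in the example above."""
    # ensure_ascii=False keeps the IPA characters readable instead of \u escapes.
    return json.dumps({"word": word, "pronounce": ipa}, ensure_ascii=False)


def apply_pronunciations(text: str, overrides: dict[str, str]) -> str:
    """Replace each whole-word occurrence of an override key with its inline tag.

    Hypothetical helper: matching is case-sensitive and done in a single pass,
    so text inside an already inserted tag is never rewritten again.
    """
    if not overrides:
        return text
    # Longest words first so a longer entry wins over a shorter prefix.
    words = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(
        lambda m: pronounce_tag(m.group(0), overrides[m.group(0)]), text
    )


sentence = apply_pronunciations(
    "The patient was prescribed dupilumab.",
    {"dupilumab": "duːˈpɪljuːmæb"},
)
```

Here `sentence` reproduces the example sentence from the paragraph above, with the override embedded in place of the plain word.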
See Voice Controls for details.
Voice Agent
These controls are also supported when using Aura-2 inside the Voice Agent API: set speed once at the session level, and have the LLM emit pronunciation overrides inline. See Voice Agent TTS Controls for the recommended setup.
Self-Hosted
These controls are powered by an updated Aura-2 voice-pack model. Self-hosted deployments using a voice-pack that predates the April 2026 release will return 400 Bad Request for requests that include speed or pronounce parameters. See the April 2026 self-hosted release notes for voice-pack version requirements and upgrade instructions.
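Until a self-hosted deployment is upgraded, a client may want to send plain text instead. The sketch below shows one possible fallback that reverts inline overrides to the original word. The helper name `strip_pronunciations` is hypothetical, and it assumes overrides appear exactly in the inline form shown in the pronunciation example, with "word" before "pronounce".

```python
import json
import re

# Matches an inline override such as {"word": "...", "pronounce": "..."}.
# Assumption: keys appear in this order, as in the release-note example.
_OVERRIDE = re.compile(
    r'\{\s*"word"\s*:\s*"(?:[^"\\]|\\.)*"\s*,'
    r'\s*"pronounce"\s*:\s*"(?:[^"\\]|\\.)*"\s*\}'
)


def strip_pronunciations(text: str) -> str:
    """Replace each inline override with its plain word (hypothetical fallback)."""
    return _OVERRIDE.sub(lambda m: json.loads(m.group(0))["word"], text)


plain = strip_pronunciations(
    'The patient was prescribed {"word": "dupilumab", "pronounce": "duːˈpɪljuːmæb"}.'
)
```

With this fallback the word is spoken with the voice's default pronunciation, so it is a stopgap rather than a substitute for upgrading the voice-pack.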