
Speech Started

Speech Started sends a message when the start of speech is detected in live streaming audio.

Parameter: vad_events (boolean)

Pre-recorded | Streaming: Nova | All available languages

Deepgram’s Speech Started feature detects the start of speech while transcribing live streaming audio.

SpeechStarted works alongside Voice Activity Detection (VAD) to promptly detect when speech begins after a period of silence. By gauging tonal nuances in human speech, the VAD distinguishes silent from non-silent audio segments and sends a notification as soon as speech is detected.

Enable Feature

To enable the SpeechStarted event, include the parameter vad_events=true in your request:

vad_events=true
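
If you connect over a raw WebSocket rather than through an SDK, the parameter goes in the query string. The following is a minimal sketch, assuming the hosted streaming endpoint wss://api.deepgram.com/v1/listen. The helper name build_listen_url is hypothetical, and the model parameter is included only for illustration.

```python
from urllib.parse import urlencode

# Assumed hosted streaming endpoint; a self-hosted deployment would differ.
BASE_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(**params):
    """Return the streaming URL with the given query parameters.

    Booleans are lowercased so that True becomes "true", the form
    used in the query string. (Hypothetical helper, not part of any SDK.)
    """
    query = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{BASE_URL}?{urlencode(query)}"


url = build_listen_url(model="nova-3", vad_events=True)
print(url)  # wss://api.deepgram.com/v1/listen?model=nova-3&vad_events=true
```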

You’ll then receive a message whenever the start of speech is detected.

# For more Python SDK migration guides, visit:
# https://github.com/deepgram/deepgram-python-sdk/tree/main/docs

from deepgram import DeepgramClient

# Assumption: the API key is read from the DEEPGRAM_API_KEY environment variable
client = DeepgramClient()

with client.listen.v1.connect(
    model="nova-3",
    language="en-US",
    # Apply smart formatting to the output
    smart_format=True,
    # Raw audio format details
    encoding="linear16",
    channels=1,
    sample_rate=16000,
    # To get UtteranceEnd, the following must be set:
    interim_results=True,
    utterance_end_ms="1000",
    vad_events=True,  # also enables the SpeechStarted event
    # Time in milliseconds of silence to wait for before finalizing speech
    endpointing=300,
) as connection:
    # Register message handlers and send audio here
    ...

Results

The JSON message sent when the start of speech is detected looks similar to this:

{
  "type": "SpeechStarted",
  "channel": [
    0,
    1
  ],
  "timestamp": 9.54
}

  • The type field is always SpeechStarted for this event.
  • The channel field is interpreted as [A,B], where A is the channel index, and B is the total number of channels. The above example is channel 0 of single-channel audio.
  • The timestamp field is the time at which speech was first detected.
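
The field descriptions above can be turned into a small parser. This is a sketch that works on the raw JSON text of a message; the SpeechStarted dataclass and the parse_speech_started function are hypothetical names, not part of any Deepgram SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class SpeechStarted:
    channel_index: int   # A in [A,B]
    channel_count: int   # B in [A,B]
    timestamp: float     # time at which speech was first detected


def parse_speech_started(raw):
    """Return a SpeechStarted for a matching message, otherwise None."""
    message = json.loads(raw)
    if message.get("type") != "SpeechStarted":
        return None  # some other message type, such as a transcript
    index, count = message["channel"]
    return SpeechStarted(index, count, float(message["timestamp"]))


raw = '{"type": "SpeechStarted", "channel": [0, 1], "timestamp": 9.54}'
event = parse_speech_started(raw)
print(event)
```

Returning None for other message types lets the same function sit in front of a general message handler without raising on transcripts.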

The timestamp doesn’t always match the start time of the first word in the next transcript because the systems for transcribing and timing words work independently of the speech detection system.
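
If an application needs to relate the two values, one option is to measure the gap rather than assume it is zero. The sketch below assumes that transcript messages carry word timings under channel.alternatives[0].words, each with a start field, and that both values are measured in seconds from the start of the stream. The function name and the sample values are made up for illustration.

```python
def speech_to_first_word_gap(speech_started, results):
    """Return first-word start minus SpeechStarted timestamp, or None.

    None is returned when the transcript message contains no words.
    The message layout used here is an assumption; adjust the lookups
    to match the messages you actually receive.
    """
    alternatives = results.get("channel", {}).get("alternatives", [])
    words = alternatives[0].get("words", []) if alternatives else []
    if not words:
        return None
    return words[0]["start"] - speech_started["timestamp"]


# Illustrative messages with made-up timings
speech_started = {"type": "SpeechStarted", "channel": [0, 1], "timestamp": 9.54}
results = {
    "type": "Results",
    "channel": {
        "alternatives": [
            {
                "transcript": "hello",
                "words": [{"word": "hello", "start": 9.75, "end": 10.0}],
            }
        ]
    },
}

gap = speech_to_first_word_gap(speech_started, results)
print(f"{gap:.2f}")  # prints 0.21
```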