Speech Started | Deepgram's Docs

vad_events boolean.

Pre-recorded Streaming All available languages

Deepgram’s Speech Started feature detects the start of speech while transcribing live streaming audio.

SpeechStarted complements Voice Activity Detection (VAD) to promptly detect the start of speech after a period of silence. By gauging tonal nuances in human speech, the VAD differentiates between silent and non-silent audio segments and sends a notification as soon as speech is detected.

Enable Feature

To enable the SpeechStarted event, include the parameter vad_events=true in your request:

vad_events=true

You’ll then receive a SpeechStarted message each time the start of speech is detected.
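As a minimal sketch of what "include the parameter in your request" can look like without the SDK, the snippet below builds a streaming URL with vad_events=true in the query string. The endpoint wss://api.deepgram.com/v1/listen and the helper name build_listen_url are assumptions made for illustration; substitute the URL and parameters your integration actually uses.

```python
from urllib.parse import urlencode

# Assumption: streaming endpoint used for illustration; adjust for your deployment.
DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(base_url: str = DEEPGRAM_LISTEN_URL, **params) -> str:
    """Hypothetical helper: append query parameters, rendering booleans as true/false."""
    rendered = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{base_url}?{urlencode(rendered)}"


url = build_listen_url(vad_events=True, encoding="linear16", sample_rate=16000)
print(url)
```

Python's True would otherwise be serialized as "True", so the helper lowercases booleans to match the vad_events=true form shown above.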

Python

# See https://github.com/deepgram/deepgram-python-sdk/blob/main/examples/streaming/async_microphone/main.py
# for complete example code.

from deepgram import LiveOptions

options: LiveOptions = LiveOptions(
    model="nova-3",
    language="en-US",
    # Apply smart formatting to the output
    smart_format=True,
    # Raw audio format details
    encoding="linear16",
    channels=1,
    sample_rate=16000,
    # To get UtteranceEnd, the following must be set:
    interim_results=True,
    utterance_end_ms="1000",
    # Enable SpeechStarted events
    vad_events=True,
    # Time in milliseconds of silence to wait for before finalizing speech
    endpointing=300,
)

Results

The JSON message sent when the start of speech is detected looks similar to this:

JSON

{
  "type": "SpeechStarted",
  "channel": [
    0,
    1
  ],
  "timestamp": 9.54
}

The type field is always SpeechStarted for this event.
The channel field is interpreted as [A,B], where A is the channel index, and B is the total number of channels. The above example is channel 0 of single-channel audio.
The timestamp field is the time at which speech was first detected.
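The field descriptions above can be turned into a small parser. This is a sketch only: the SpeechStarted dataclass and the parse_speech_started function are hypothetical names, not part of any Deepgram SDK, and the code assumes each incoming message arrives as a JSON string.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechStarted:
    channel_index: int   # A in [A, B]
    channel_count: int   # B in [A, B]
    timestamp: float     # time at which speech was first detected


def parse_speech_started(raw: str) -> Optional[SpeechStarted]:
    """Return a SpeechStarted record, or None for any other message type."""
    message = json.loads(raw)
    if message.get("type") != "SpeechStarted":
        return None
    channel_index, channel_count = message["channel"]
    return SpeechStarted(channel_index, channel_count, float(message["timestamp"]))


example = '{"type": "SpeechStarted", "channel": [0, 1], "timestamp": 9.54}'
event = parse_speech_started(example)
print(event)
```

Returning None for other message types lets the same function sit in a receive loop that also sees transcript messages.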

The timestamp doesn’t always match the start time of the first word in the next transcript because the systems for transcribing and timing words work independently of the speech detection system.
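To see how far the two systems disagree in practice, you could compare the SpeechStarted timestamp with the start time of the first word in the following transcript message. The sketch below assumes the transcript message exposes word timings at channel.alternatives[0].words[i].start; that layout and the function name first_word_offset are assumptions, so check them against the messages you actually receive.

```python
from typing import Optional


def first_word_offset(speech_started: dict, results: dict) -> Optional[float]:
    """Return (first word start) minus (SpeechStarted timestamp), or None if there are no words.

    Assumes word timings live at results["channel"]["alternatives"][0]["words"][i]["start"].
    """
    alternatives = results.get("channel", {}).get("alternatives", [])
    words = alternatives[0].get("words", []) if alternatives else []
    if not words:
        return None
    return words[0]["start"] - speech_started["timestamp"]


# Illustrative values only; the transcript message here is made up.
started = {"type": "SpeechStarted", "channel": [0, 1], "timestamp": 9.54}
transcript = {"channel": {"alternatives": [{"words": [{"word": "hello", "start": 9.62}]}]}}
offset = first_word_offset(started, transcript)
```

A positive offset means the first word starts after the detected speech onset; a negative one means the word timing precedes it. Either can occur, since the two systems run independently.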