Utterances

Utterances segments speech into meaningful semantic units.

utterances boolean Default: false

Pre-recorded | Streaming: Nova | Streaming: Flux | All available languages

Deepgram’s Utterances feature segments the transcript into meaningful semantic units, which lets the chosen model handle speakers’ spontaneous speech patterns more naturally. For example, when humans speak to each other conversationally, they often pause mid-sentence to reformulate their thoughts, or stop and restart a badly worded sentence.

Enable Feature

To enable utterances, add the following parameter to the query string when you call Deepgram’s /listen endpoint:

utterances=true

To transcribe audio from a file on your computer, run the following curl command in a terminal or your favorite API client.

cURL
curl \
  --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'Content-Type: audio/wav' \
  --data-binary @youraudio.wav \
  --url 'https://api.deepgram.com/v1/listen?utterances=true'

Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.
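If you prefer to build the request from code, the following is a minimal Python sketch using only the standard library. It assumes the same endpoint, headers, and `utterances=true` parameter as the cURL command above; the helper name `build_listen_request` is hypothetical and is not part of any Deepgram SDK. Boolean query parameters are sent as the lowercase strings `true`/`false`, so the sketch converts Python booleans explicitly (passing `True` straight to `urlencode` would produce `True`).

```python
import urllib.parse
import urllib.request

LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_listen_request(api_key: str, audio: bytes, content_type: str = "audio/wav",
                         **params: bool) -> urllib.request.Request:
    """Build (but do not send) a POST request to the /listen endpoint.

    Hypothetical helper that mirrors the cURL example; not an official SDK call.
    """
    # Convert Python booleans to lowercase "true"/"false" query values.
    query = urllib.parse.urlencode(
        {name: str(value).lower() for name, value in params.items()}
    )
    return urllib.request.Request(
        url=f"{LISTEN_URL}?{query}",
        data=audio,  # raw audio bytes, like curl --data-binary
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )


# Example usage (requires a valid API key and network access):
#   with open("youraudio.wav", "rb") as f:
#       req = build_listen_request("YOUR_DEEPGRAM_API_KEY", f.read(), utterances=True)
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```

Building the request separately from sending it keeps the query-string handling easy to inspect before any audio is uploaded.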

Analyze Response

When the file is finished processing, you’ll receive a JSON response. Let’s look more closely at the utterances object:

JSON
...
"utterances": [
  {
    "start": 0.41915998,
    "end": 5.43012,
    "confidence": 0.88172823,
    "channel": 0,
    "transcript": "four score and seven years ago our fathers brought forth on this continent a new nation",
    "words": [
      {"word": "four", "start": 0.41915998, "end": 0.85827994, "confidence": 0.57893836},
      ...
    ],
    "id": "2d8211a4-3a5b-4053-8939-edf2b2b389fa"
  },
  ...
]
...

In this response, we see that each utterance contains:

  • start: Start time (in seconds) from the beginning of the audio stream.
  • end: End time (in seconds) from the beginning of the audio stream.
  • confidence: Floating-point value between 0 and 1 that indicates overall transcript reliability. Larger values indicate higher confidence.
  • channel: Audio channel to which the utterance belongs. When using multichannel audio, utterances are chronologically ordered by channel.
  • transcript: Transcript for the audio segment being processed.
  • words: Array of word objects, each containing the word, its start and end time (in seconds) from the beginning of the audio stream, and a confidence value.
  • id: Unique identifier for the utterance.
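To work with these fields programmatically, you can load the response with Python’s `json` module and walk the `results.utterances` array (the same path the jq example later on this page uses). The minimal sketch below assumes the response is already available as a string; the helper name `summarize_utterances` is hypothetical, and the sample payload is trimmed to the fields shown in the example response.

```python
from __future__ import annotations

import json
from typing import Any


def summarize_utterances(response_text: str) -> list[tuple[int, float, float, str]]:
    """Return (channel, start, end, transcript) for each utterance.

    Hypothetical helper; assumes utterances=true was set on the request,
    so the response contains results.utterances.
    """
    response: dict[str, Any] = json.loads(response_text)
    summary = []
    for utt in response["results"]["utterances"]:
        summary.append((utt["channel"], utt["start"], utt["end"], utt["transcript"]))
    return summary


# Trimmed sample payload based on the example response.
SAMPLE = json.dumps({
    "results": {
        "utterances": [
            {
                "start": 0.41915998,
                "end": 5.43012,
                "confidence": 0.88172823,
                "channel": 0,
                "transcript": "four score and seven years ago our fathers brought forth on this continent a new nation",
                "words": [
                    {"word": "four", "start": 0.41915998, "end": 0.85827994, "confidence": 0.57893836}
                ],
                "id": "2d8211a4-3a5b-4053-8939-edf2b2b389fa",
            }
        ]
    }
})

for channel, start, end, text in summarize_utterances(SAMPLE):
    # Both times are measured from the beginning of the audio, so end - start is the duration.
    print(f"ch{channel} {start:.2f}-{end:.2f}s ({end - start:.2f}s): {text}")
```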

Advanced Processing

You may also want to enable speaker diarization, which detects and identifies the speaker of each utterance, and punctuation, which adds punctuation to the transcript.

To transcribe audio from a file on your computer, run the following curl command in a terminal or your favorite API client.

cURL
curl \
  --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'Content-Type: audio/wav' \
  --data-binary @gettysburg.wav \
  --url 'https://api.deepgram.com/v1/listen?utterances=true&diarize=true&punctuate=true'

Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.

When the file is finished processing, you’ll receive a JSON response that has the same basic structure as before. Let’s take a closer look at the new utterances object:

JSON
...
"utterances": [
  {
    "start": 0.41874,
    "end": 5.42518,
    "confidence": 0.88211584,
    "channel": 0,
    "transcript": "four score and seven years ago, our fathers brought forth on this continent a new nation",
    "words": [
      {"word": "four", "start": 0.41874, "end": 0.85742, "confidence": 0.5821198, "speaker": 0, "punctuated_word": "four"},
      ...
    ],
    "speaker": 0,
    "id": "ec11ce4b-2d5c-4b95-9183-ba102bea1d62"
  },
  ...
]
...

In this response, notice that the transcript of each utterance is now punctuated, each utterance includes a speaker field, and each word object in the words array contains two new fields:

  • speaker: Integer indicating the speaker who is saying the word being processed.
  • punctuated_word: Word being processed with added punctuation, if any.
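Because each word object now carries `speaker` and `punctuated_word`, you can rebuild punctuated text per speaker at the word level. The following Python sketch assumes a single utterance dictionary shaped like the example above; the function `words_by_speaker` is hypothetical, and the sample words are an illustrative, trimmed excerpt (timing and confidence fields omitted), not actual API output.

```python
from __future__ import annotations

from collections import defaultdict


def words_by_speaker(utterance: dict) -> dict[int, str]:
    """Join punctuated words per speaker within one utterance.

    Hypothetical helper; assumes diarize=true and punctuate=true were set,
    so each word has "speaker" and (usually) "punctuated_word" keys.
    """
    grouped: dict[int, list[str]] = defaultdict(list)
    for word in utterance["words"]:
        # Fall back to the raw word if punctuated_word is absent.
        grouped[word["speaker"]].append(word.get("punctuated_word", word["word"]))
    return {speaker: " ".join(tokens) for speaker, tokens in grouped.items()}


# Illustrative sample (timing and confidence fields omitted for brevity).
utterance = {
    "speaker": 0,
    "words": [
        {"word": "four", "speaker": 0, "punctuated_word": "four"},
        {"word": "score", "speaker": 0, "punctuated_word": "score"},
        {"word": "and", "speaker": 0, "punctuated_word": "and"},
        {"word": "seven", "speaker": 0, "punctuated_word": "seven"},
        {"word": "years", "speaker": 0, "punctuated_word": "years"},
        {"word": "ago", "speaker": 0, "punctuated_word": "ago,"},
    ],
}

print(words_by_speaker(utterance))  # {0: 'four score and seven years ago,'}
```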

To make the output easier to read, you can pipe the response through a command-line JSON processor. This example uses jq to print each utterance’s speaker and transcript.

cURL
curl \
  --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'Content-Type: audio/wav' \
  --data-binary @gettysburg.wav \
  --url 'https://api.deepgram.com/v1/listen?utterances=true&diarize=true&punctuate=true' | jq -r '.results.utterances[] | "[Speaker:\(.speaker)] \(.transcript)"'

Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.

When the file is finished processing, jq prints output like the following:

[Speaker:0] four score and seven years ago, our fathers brought forth on this continent a new nation
[Speaker:0] conceived liberty and dedicated to the proposition that all men are created equal.
[Speaker:0] Now we are engaged in a great civil war testing whether that nation or any nation so conceived and so dedicated can long endure.
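If jq is not available, the same formatting can be done in Python. This minimal sketch reproduces the jq filter above, reading `results.utterances` from an already-parsed response; the function name `format_utterances` is hypothetical. Like jq, it renders a missing `speaker` field (for example, when `diarize` is not enabled) as `null`.

```python
from __future__ import annotations


def format_utterances(response: dict) -> list[str]:
    """Mirror the jq filter: one "[Speaker:N] transcript" line per utterance."""
    lines = []
    for utt in response["results"]["utterances"]:
        speaker = utt.get("speaker")
        # jq interpolates a missing field as "null"; do the same here.
        label = "null" if speaker is None else speaker
        lines.append(f"[Speaker:{label}] {utt['transcript']}")
    return lines
```

For example, after saving the response with curl’s `--output response.json` option, you could print the lines with `print("\n".join(format_utterances(json.load(open("response.json")))))`, where the file name is only illustrative.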