Model Options
The model option allows you to supply a model to use for speech-to-text.
model
string. Default: base-general
Deepgram’s Model feature allows you to supply a model to use when processing submitted audio. To learn more about the pricing for our different models, see Deepgram Pricing & Plans.
Models & Model Options
Below is a list of all models and model options that can be used with the Deepgram API.
The concept of Tiers is now deprecated but still available in the Deepgram API. Please see our documentation on Tiers for how they can still be used.
Nova-2
Examples
https://api.deepgram.com/v1/listen?model=nova-2
https://api.deepgram.com/v1/listen?model=nova-2-phonecall
Nova-2 expands on Nova-1's advancements with speech-specific optimizations to the underlying Transformer architecture, advanced data curation techniques, and a multi-stage training methodology. These changes yield reduced word error rate (WER) and enhancements to entity recognition (i.e. proper nouns, alphanumerics, etc.), punctuation, and capitalization.
Nova-2 has the following model options which can be called by using the following syntax:
model=nova-2-{option}
- general: Optimized for everyday audio processing.
- meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.
- phonecall: Optimized for low-bandwidth audio phone calls.
- voicemail: Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
- finance: Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance oriented.
- conversationalai: Optimized for use cases in which a human is talking to an automated bot, such as IVR, a voice assistant, or an automated kiosk.
- video: Optimized for audio sourced from videos.
- medical: Optimized for audio with medical oriented vocabulary.
- drivethru: Optimized for audio sourced from drive-thrus.
- automotive: Optimized for audio with automotive oriented vocabulary.
- atc: Optimized for audio from air traffic control.
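As a sketch of how an option combines with the model name in the request URL, the snippet below builds a /v1/listen URL for a chosen Nova-2 option. The helper function and the option set it validates against are hypothetical conveniences, not part of the Deepgram API itself; the option names come from the list above.

```python
# Hypothetical helper: build a Deepgram /v1/listen URL for a Nova-2 model option.
NOVA2_OPTIONS = {
    "general", "meeting", "phonecall", "voicemail", "finance",
    "conversationalai", "video", "medical", "drivethru", "automotive", "atc",
}

def nova2_listen_url(option: str = "general") -> str:
    """Return the listen endpoint URL for model=nova-2-{option}."""
    if option not in NOVA2_OPTIONS:
        raise ValueError(f"unknown Nova-2 option: {option!r}")
    return f"https://api.deepgram.com/v1/listen?model=nova-2-{option}"

print(nova2_listen_url("phonecall"))
# https://api.deepgram.com/v1/listen?model=nova-2-phonecall
```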
Nova
Examples
https://api.deepgram.com/v1/listen?model=nova
https://api.deepgram.com/v1/listen?model=nova-phonecall
Nova is the predecessor to Nova-2. Training on this model spans over 100 domains and 47 billion tokens, making it the deepest-trained automatic speech recognition (ASR) model to date. Nova doesn't just excel in one specific domain — it is ideal for a wide array of voice applications that require high accuracy in diverse contexts.
Nova has the following model options which can be called by using the following syntax:
model=nova-{option}
- general: Optimized for everyday audio processing. Likely to be more accurate than any region-specific Base model for the language for which it is enabled. If you aren't sure which model to select, start here.
- phonecall: Optimized for low-bandwidth audio phone calls.
Enhanced
Examples
https://api.deepgram.com/v1/listen?model=enhanced
https://api.deepgram.com/v1/listen?model=enhanced-phonecall
Enhanced models are still some of our most powerful speech-to-text models; they generally have higher accuracy and better word recognition than our base models, and they handle uncommon words significantly better.
Enhanced has the following model options which can be called by using the following syntax:
model=enhanced-{option}
- general: Optimized for everyday audio processing. Likely to be more accurate than any region-specific Base model for the language for which it is enabled. If you aren't sure which model to select, start here.
- meeting (beta): Optimized for conference room settings, which include multiple speakers with a single microphone.
- phonecall: Optimized for low-bandwidth audio phone calls.
- finance (beta): Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance oriented.
Base
Examples
https://api.deepgram.com/v1/listen?model=base
https://api.deepgram.com/v1/listen?model=base-phonecall
Base models are built on our signature end-to-end deep learning speech-to-text model architecture. They offer a solid combination of accuracy and cost-effectiveness.
Base has the following model options which can be called by using the following syntax:
model=base-{option}
- general: (Default) Optimized for everyday audio processing.
- meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.
- phonecall: Optimized for low-bandwidth audio phone calls.
- voicemail: Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
- finance: Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance oriented.
- conversationalai: Optimized for use cases in which a human is talking to an automated bot, such as IVR, a voice assistant, or an automated kiosk.
- video: Optimized for audio sourced from videos.
Custom
You may also use a custom, trained model associated with your account by including its custom_id.
Custom models are only available to Enterprise customers. See Deepgram Pricing & Plans for more details.
Whisper
Examples
https://api.deepgram.com/v1/listen?model=whisper
https://api.deepgram.com/v1/listen?model=whisper-SIZE
Whisper models are less scalable than all other Deepgram models due to their inherent model architecture. All non-Whisper models will return results faster and scale to higher load.
Deepgram's Whisper Cloud is a fully managed API that gives you access to Deepgram's version of OpenAI’s Whisper model. Read our guide Deepgram Whisper Cloud for a deeper dive into this offering.
Deepgram's Whisper models have the following size options:
- tiny: Contains 39 M parameters. The smallest model available.
- base: Contains 74 M parameters.
- small: Contains 244 M parameters.
- medium: Contains 769 M parameters. The default model if you don't specify a size.
- large: Contains 1550 M parameters. The largest model available. Defaults to OpenAI's Whisper large-v2.
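To illustrate how the size options map to the model parameter, here is a small sketch. The helper function is hypothetical; the size names and the medium default come from the list above.

```python
# Hypothetical helper: resolve a Whisper size to the model query parameter.
# Per the list above, "medium" is the default when no size is specified.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")

def whisper_model(size=None):
    """Return the model parameter value for Deepgram's Whisper Cloud."""
    if size is None:
        return "whisper-medium"  # default size
    if size not in WHISPER_SIZES:
        raise ValueError(f"unknown Whisper size: {size!r}")
    return f"whisper-{size}"

print(whisper_model())         # whisper-medium
print(whisper_model("large"))  # whisper-large
```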
Additional rate limits apply to Whisper due to poor scalability. Requests to Whisper are limited to 15 concurrent requests with a paid plan and 5 concurrent requests with the pay-as-you-go plan. Long audio files are supported up to a maximum of 20 minutes of processing time (the maximum length of the audio depends on the size of the Whisper model).
Try it out
To transcribe audio from a file on your computer using a particular model, run the following curl command in a terminal or your favorite API client.
curl \
--request POST \
--header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
--header 'Content-Type: audio/wav' \
--data-binary @youraudio.wav \
--url 'https://api.deepgram.com/v1/listen?model=OPTION'
Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.
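The same request can be issued from Python. The sketch below constructs the request with the standard library's urllib, mirroring the curl command's method, headers, and model query parameter. The API key and audio payload are placeholders; the request is only built here, not sent.

```python
import urllib.parse
import urllib.request

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder: substitute your real key
MODEL = "nova-2"                   # any model option from this page

url = "https://api.deepgram.com/v1/listen?" + urllib.parse.urlencode({"model": MODEL})
req = urllib.request.Request(
    url,
    method="POST",
    headers={
        "Authorization": f"Token {API_KEY}",
        "Content-Type": "audio/wav",
    },
    data=b"",  # replace with the bytes of your audio file
)

print(req.full_url)  # https://api.deepgram.com/v1/listen?model=nova-2
# To actually send it (requires a valid API key and audio data):
# response = urllib.request.urlopen(req)
```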