Entity Detection
`detect_entities` *boolean*. Default: `false`
When Entity Detection is enabled, the Punctuation feature will be enabled by default.
Model Support
Entity Detection is available for both pre-recorded and streaming speech-to-text.
Streaming: Entity Detection for streaming is supported on Nova, Nova-2, Nova-3, and Enhanced models. It is not available for Base models or Flux.
Pre-recorded: Entity Detection for pre-recorded audio is available on all models.
Enable Feature
To enable Entity Detection, add a `detect_entities` parameter set to `true` to the query string when you call Deepgram's API:
detect_entities=true
Pre-recorded Audio
To transcribe audio from a file on your computer, run the following curl command in a terminal or your favorite API client.
Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.
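A representative request is sketched below; the file name `youraudio.wav` and the `audio/wav` content type are placeholders for your own audio:

```bash
curl \
  --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'Content-Type: audio/wav' \
  --data-binary @youraudio.wav \
  --url 'https://api.deepgram.com/v1/listen?detect_entities=true'
```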
Streaming Audio
To enable Entity Detection for streaming audio, establish a WebSocket connection with the detect_entities=true parameter. Remember that streaming Entity Detection is supported on Nova, Nova-2, Nova-3, and Enhanced models.
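For example, a connection URL might look like the following (the model shown here is just one of the supported options):

```
wss://api.deepgram.com/v1/listen?detect_entities=true&model=nova-3
```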
Analyze Response
The response structure differs between pre-recorded and streaming transcription.
Pre-recorded Response
When the file is finished processing (often after only a few seconds), you’ll receive a JSON response that has the following basic structure:
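A trimmed sketch of that shape, with nested fields elided for brevity:

```json
{
  "metadata": { ... },
  "results": {
    "channels": [
      {
        "alternatives": [ ... ]
      }
    ]
  }
}
```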
Let’s look more closely at the alternatives object:
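Each alternative carries the transcript, word timings, and an `entities` array. The transcript and values below are illustrative, not actual output:

```json
"alternatives": [
  {
    "transcript": "My name is Jane Smith.",
    "confidence": 0.98,
    "words": [ ... ],
    "entities": [
      {
        "label": "NAME",
        "value": "Jane Smith",
        "confidence": 0.97,
        "start_word": 3,
        "end_word": 4
      }
    ]
  }
]
```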
Streaming Response
For streaming transcription, entities are included in final results only (when is_final: true). Interim results do not contain the entities array.
Here’s an example of a streaming response with Entity Detection enabled:
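The sketch below is illustrative, not actual output; note the `raw_value` field, which appears because a formatting feature is enabled, and the trimmed envelope, which omits metadata fields:

```json
{
  "type": "Results",
  "is_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "You can reach me at 555-0123.",
        "confidence": 0.98,
        "entities": [
          {
            "label": "PHONE_NUMBER",
            "value": "555-0123",
            "raw_value": "five five five zero one two three",
            "confidence": 0.96,
            "start_word": 5,
            "end_word": 11
          }
        ]
      }
    ]
  }
}
```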
Streaming Behavior:
- The `entities` array is only present in final results (`is_final: true`).
- If `detect_entities` is enabled but no entities are detected, an empty array is returned: `"entities": []`.
- To ensure complete entities are detected, the system may wait for entity completion before finalizing. See Streaming Finalization Behavior below.
Streaming Finalization Behavior
When using Entity Detection with streaming audio, Deepgram will attempt to detect and format entities as they are spoken. For an entity that appears to be incomplete, the system will hold off on finalizing the transcript until one of the following occurs:
- The speaker continues on to non-entity speech, OR
- 3 seconds of silence elapse, OR
- A Finalize control message is received.
At that point, only the entities completed in the audio available so far are returned.
This approach ensures transcripts are returned promptly while maintaining entity detection precision.
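For reference, Finalize is Deepgram's standard streaming control message, sent as a JSON text message over the open WebSocket connection:

```json
{ "type": "Finalize" }
```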
Using No Delay
Setting no_delay=true forces immediate finalization of streaming transcripts without waiting for entity completion.
This will result in entities being missed or incomplete in many cases. Only use no_delay=true if low latency is more important than entity detection accuracy.
To use no_delay with Entity Detection:
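detect_entities=true&no_delay=true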
Understanding Entity Fields
Each entity object in the entities array contains the following fields:
- `label`: The type of entity identified (e.g., NAME, PHONE_NUMBER, EMAIL, ADDRESS).
- `value`: The formatted text of the entity. When Smart Formatting is enabled, this field reflects the formatted output.
- `raw_value`: (Streaming only, when formatting is enabled) The original, unformatted text as spoken. This field is only included when formatting features are enabled.
- `confidence`: Floating-point value between 0 and 1 indicating the model's confidence in this entity. Larger values indicate higher confidence.
- `start_word`: Index of the first word of the entity in the transcript.
- `end_word`: Index of the last word of the entity in the transcript.
Key Differences Between Pre-recorded and Streaming:
- `raw_value` is only returned for streaming, and only when a formatting feature is enabled.
- In streaming, entities appear only on final results (`is_final: true`); pre-recorded responses include entities directly in each alternative.
- Streaming may delay finalization to capture complete entities (see Streaming Finalization Behavior above); pre-recorded transcription processes the full file at once.
Identifiable Entities
View all options here: Supported Entity Types
Use Cases
Some examples of uses for Entity Detection include:
- Customers who want to improve conversational AI and voice assistants by triggering particular workflows and responses based on identified names, addresses, locations, and other key entities.
- Customers who want to enhance customer service and user experience by extracting meaningful and relevant information about key entities such as people, organizations, email addresses, and phone numbers.
- Customers who want to derive meaningful and actionable insights from audio data based on the entities identified in conversations.