Entity Detection

Entity Detection identifies and extracts key entities from content in submitted audio.

detect_entities: boolean. Default: false

Pre-recorded, Streaming · Nova · English (all available regions)

When Entity Detection is enabled, the Punctuation feature will be enabled by default.

Model Support

Entity Detection is available for both pre-recorded and streaming speech-to-text.

Streaming: Entity Detection for streaming is supported on Nova, Nova-2, Nova-3, and Enhanced models. It is not available for Base models or Flux.

Pre-recorded: Entity Detection for pre-recorded audio is available on all models.

Enable Feature

To enable Entity Detection, add a detect_entities parameter set to true in the query string when you call Deepgram's API:

detect_entities=true

Pre-recorded Audio

To transcribe audio from a file on your computer, run the following curl command in a terminal or your favorite API client.

cURL
curl \
  --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --header 'Content-Type: audio/wav' \
  --data-binary @youraudio.wav \
  --url 'https://api.deepgram.com/v1/listen?detect_entities=true'

Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.
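
If you prefer the JavaScript SDK over raw HTTP, here is a minimal sketch of the same request using the @deepgram/sdk package shown in the streaming example below. The file name youraudio.wav and the nova-3 model choice are assumptions; adjust them for your project.

JavaScript
// Sketch: pre-recorded transcription with Entity Detection via the JS SDK
const { createClient } = require("@deepgram/sdk");
const fs = require("fs");

const transcribeFile = async () => {
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

  const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
    fs.readFileSync("youraudio.wav"), // assumed local file name
    {
      model: "nova-3",
      detect_entities: true,
    }
  );
  if (error) throw error;

  // Entities are returned alongside the transcript in each alternative
  console.dir(result.results.channels[0].alternatives[0].entities, { depth: null });
};

transcribeFile();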

Streaming Audio

To enable Entity Detection for streaming audio, establish a WebSocket connection with the detect_entities=true parameter. Remember that streaming Entity Detection is supported on Nova, Nova-2, Nova-3, and Enhanced models.

JavaScript
// Example filename: index.js

const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
const fetch = require("cross-fetch");
const dotenv = require("dotenv");
dotenv.config();

// URL for the realtime streaming audio you would like to transcribe
const url = "http://stream.live.vc.bbcmedia.co.uk/bbc_world_service";

const live = async () => {
  // STEP 1: Create a Deepgram client using the API key
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

  // STEP 2: Create a live transcription connection with entity detection enabled
  const connection = deepgram.listen.live({
    model: "nova-3",
    language: "en-US",
    smart_format: true,
    detect_entities: true,
  });

  // STEP 3: Listen for events from the live transcription connection
  connection.on(LiveTranscriptionEvents.Open, () => {
    console.log("Connection opened.");

    connection.on(LiveTranscriptionEvents.Close, () => {
      console.log("Connection closed.");
    });

    connection.on(LiveTranscriptionEvents.Transcript, (data) => {
      const transcript = data.channel.alternatives[0].transcript;

      // Only process final results, which contain entities
      if (data.is_final) {
        const entities = data.channel.alternatives[0].entities;

        if (entities && entities.length > 0) {
          console.log("\nTranscript:", transcript);
          console.log("Entities detected:");
          entities.forEach((entity) => {
            console.log(`  - ${entity.label}: ${entity.value} (confidence: ${entity.confidence})`);
            // raw_value is only present when formatting features (like smart_format) are enabled
            if (entity.raw_value) {
              console.log(`    Raw value: ${entity.raw_value}`);
            }
          });
        }
      }
    });

    connection.on(LiveTranscriptionEvents.Error, (err) => {
      console.error("Error:", err);
    });

    // STEP 4: Fetch the audio stream and send it to the live transcription connection
    fetch(url)
      .then((r) => r.body)
      .then((res) => {
        res.on("readable", () => {
          const chunk = res.read();
          if (chunk) connection.send(chunk); // read() returns null when no data is buffered
        });
      });
  });
};

live();

Analyze Response

The response structure differs between pre-recorded and streaming transcription.

Pre-recorded Response

When the file is finished processing (often after only a few seconds), you’ll receive a JSON response that has the following basic structure:

JSON
{
  "metadata": {
    "transaction_key": "string",
    "request_id": "string",
    "sha256": "string",
    "created": "string",
    "duration": 0,
    "channels": 0
  },
  "results": {
    "channels": [
      {
        "alternatives": []
      }
    ]
  }
}

Let’s look more closely at the alternatives object:

JSON
1"alternatives":[
2 {
3 "transcript":"Welcome to the Ai show. I'm Scott Stephenson, cofounder of Deepgram...",
4 "confidence":0.9816771,
5 "words": [...],
6 "entities":[
7 {
8 "label":"NAME",
9 "value":" Scott Stephenson",
10 "confidence":0.9999924,
11 "start_word":6,
12 "end_word":8
13 },
14 {
15 "label":"ORGANIZATION",
16 "value":" Deepgram",
17 "confidence":0.9999757,
18 "start_word":10,
19 "end_word":11
20 },
21 {
22 "label": "CARDINAL",
23 "value": "one",
24 "confidence": 1,
25 "start_word": 186,
26 "end_word": 187
27 },
28 ...
29 ]
30 }
31]
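
To consume this shape in code, a short sketch follows; response is assumed to be the parsed JSON body shown above, and the treatment of end_word as exclusive is an inference from the example indices on this page.

JavaScript
// Sketch: extract entity spans from a parsed pre-recorded response.
const alternative = response.results.channels[0].alternatives[0];

for (const entity of alternative.entities ?? []) {
  // In the examples above, start_word is inclusive and end_word exclusive,
  // indexing into this alternative's words array.
  const words = alternative.words.slice(entity.start_word, entity.end_word);
  console.log(`${entity.label}: ${entity.value} (${words.length} word(s), confidence ${entity.confidence})`);
}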

Streaming Response

For streaming transcription, entities are included in final results only (when is_final: true). Interim results do not contain the entities array.

Here’s an example of a streaming response with Entity Detection enabled:

JSON - Final Result with Entities
{
  "type": "Results",
  "channel_index": [0, 1],
  "duration": 4.64,
  "start": 0.0,
  "is_final": true,
  "speech_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "Hi, I'm calling to update my account. My name is Jane Doe and my phone number is (555) 123-4567. You can reach me at jane.doe@email.com.",
        "confidence": 0.99,
        "words": [...],
        "entities": [
          {
            "label": "NAME",
            "value": "Jane Doe",
            "raw_value": "jane doe",
            "confidence": 0.9999,
            "start_word": 9,
            "end_word": 11
          },
          {
            "label": "PHONE_NUMBER",
            "value": "(555) 123-4567",
            "raw_value": "five five five one two three four five six seven",
            "confidence": 0.9998,
            "start_word": 15,
            "end_word": 16
          },
          {
            "label": "EMAIL_ADDRESS",
            "value": "jane.doe@email.com",
            "raw_value": "jane dot doe at email dot com",
            "confidence": 0.9999,
            "start_word": 21,
            "end_word": 22
          }
        ]
      }
    ]
  }
}

Streaming Behavior:

  • The entities array is only present in final results (is_final: true).
  • If detect_entities is enabled but no entities are detected, an empty array is returned: "entities": [].
  • To ensure complete entities are detected, the system may wait for entity completion before finalizing. See Streaming Finalization Behavior below.

Streaming Finalization Behavior

When using Entity Detection with streaming audio, Deepgram will attempt to detect and format entities as they are spoken. For entities that seem like they may be incomplete, our system will:

  • Wait until the speaker continues to non-entity speech, OR
  • Finalize the transcript after 3 seconds of silence, OR
  • Finalize upon receiving a Finalize control message.

Whichever happens first, only entities completed in the audio available at that point are returned.

This approach ensures transcripts are returned promptly while maintaining entity detection precision.
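
If your application knows an utterance has ended (for example, a caller pressed a key), you can trigger the third path yourself. A minimal sketch, assuming the connection object from the streaming example above and that your SDK version forwards string payloads as WebSocket text frames:

JavaScript
// Sketch: force finalization from the client side.
// The documented control message is a {"type": "Finalize"} text frame.
connection.send(JSON.stringify({ type: "Finalize" }));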

Using No Delay

Setting no_delay=true forces immediate finalization of streaming transcripts without waiting for entity completion.

This will result in entities being missed or incomplete in many cases. Only use no_delay=true if low latency is more important than entity detection accuracy.

To use no_delay with Entity Detection:

JavaScript
const connection = deepgram.listen.live({
  model: "nova-3",
  language: "en-US",
  detect_entities: true,
  no_delay: true, // Forces immediate finalization, may miss entities
});

Understanding Entity Fields

Each entity object in the entities array contains the following fields:

  • label: Type of entity identified (e.g., NAME, PHONE_NUMBER, EMAIL_ADDRESS, ORGANIZATION).
  • value: The formatted text of the entity. When Smart Formatting is enabled, this field reflects the formatted output.
  • raw_value: (Streaming only, when formatting is enabled) The original, non-formatted text as spoken. This field is only included when formatting features are enabled.
  • confidence: Floating-point value between 0 and 1 indicating the model's confidence in the detected entity. Larger values indicate higher confidence.
  • start_word: Index of the first word of the entity in the transcript.
  • end_word: Index of the last word of the entity in the transcript.
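
Putting these fields together, here is a minimal helper sketch; the field names are exactly those documented above, and raw_value is treated as optional since it is streaming-only:

JavaScript
// Sketch: one-line summary of an entity object from either mode.
function describeEntity(entity) {
  // raw_value only appears on formatted streaming results; fall back to value
  const spoken = entity.raw_value ?? entity.value;
  return `${entity.label}: ${entity.value} (spoken "${spoken}", words ${entity.start_word}-${entity.end_word}, confidence ${entity.confidence})`;
}

console.log(describeEntity({
  label: "PHONE_NUMBER",
  value: "(555) 123-4567",
  raw_value: "five five five one two three four five six seven",
  confidence: 0.9998,
  start_word: 15,
  end_word: 16,
}));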

Key Differences Between Pre-recorded and Streaming:

Field        | Pre-recorded    | Streaming
value        | Always included | Always included
raw_value    | Not included    | Included when formatting is enabled
Availability | Always          | Only in is_final: true messages

Identifiable Entities

View all options here: Supported Entity Types
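
As a small illustration of acting on those types, the sketch below filters a final result's entities by label; the sample array mirrors the streaming response above.

JavaScript
// Sketch: pick one entity type out of a final result's entities array.
const entities = [
  { label: "NAME", value: "Jane Doe" },
  { label: "PHONE_NUMBER", value: "(555) 123-4567" },
];

const phoneNumbers = entities
  .filter((entity) => entity.label === "PHONE_NUMBER")
  .map((entity) => entity.value);

console.log(phoneNumbers); // ["(555) 123-4567"]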

Use Cases

Some examples of uses for Entity Detection include:

  • Customers who want to improve conversational AI and voice assistants by triggering particular workflows and responses based on identified names, addresses, locations, and other key entities.
  • Customers who want to enhance customer service and user experience by extracting meaningful, relevant information about key entities such as people, organizations, email addresses, and phone numbers.
  • Customers who want to derive meaningful, actionable insights from audio data based on the entities identified in conversations.