Get and Configure Deepgram Products

Last updated 09/14/2021

In this guide, you will learn how to obtain, configure, and run isolated instances of Deepgram products.

To learn about running a coordinated set of Deepgram services using Docker Compose or Docker Swarm, see Deployment.

Get Deepgram Products

Deepgram makes all of its products available through Docker Hub. To download the latest products:

  1. Log in to your Docker Hub account from one of your servers.

  2. Run the following command:

    $ docker pull [PRODUCT-IMAGE]
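
For example, to pull the Speech Engine and Metrics Server images referenced later in this guide (the products and tags available to you depend on your contract, so confirm image names with Deepgram before pulling):

$ docker pull deepgram/onprem-engine:latest
$ docker pull deepgram/metrics-server:latest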
    

Configure Products

Choose the appropriate Deepgram product to learn how to configure it.

Speech Engine performs the computationally intensive task of speech analytics. It typically manages GPU devices and responds to requests from the API layer. Because it is decoupled from the API, you can scale it independently.

Configure Speech Engine

To configure Speech Engine, you will need:

  • Trained model artifacts. You will need at least one model to start, but a single logical instance of Speech Engine can host many models simultaneously. You can get models directly from Deepgram or as part of an on-premise training pipeline. They are typically named MODEL.dg.
  • Speech Engine configuration file. This file is written in TOML to promote easy human editing.
  • Ancillary artifacts, such as diarization, punctuation, or phoneme (g2p) models (see the example layout after this list).
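
For illustration, a host directory holding these artifacts might look like the following. The file names here are hypothetical; use the names of the artifacts Deepgram delivers to you.

$ ls /path/to/engine
diarizer.dg  g2p.dg  general.dg  punctuator.dg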

Example configuration file

# Keep in mind that all paths are in-container paths and do not need to exist
# on the host machine.

# Configure license validation.
[license]
  # Your license key
  key = "36c0e479-2ab9-471e-a217-5dd809a236bc"

# Enable ancillary models
# To disable any of these features, remove or comment out the respective
# feature section.
[features]
  [[features.punctuator]]
    weights = "/engine/punctuator.dg"
    # specify the version as "beta" if using a v2 punctuator; otherwise no need to specify version
    version = "beta"
  [[features.diarizer]]
    weights = "/engine/diarizer.dg"
    type = "v2"
  [features.g2p]
    path = "/engine/g2p.dg"

# Configure the server to listen for requests from the API.
[server]
  # The base URL (prefix) for requests from the API.
  base_url = "/v2"
  # The IP address to listen on. Since this is likely running in a Docker
  # container, you will probably want to listen on all interfaces.
  host = "0.0.0.0"
  # The port to listen on
  port = 8080

# Speech models. Each model will have its own section. You can specify
# multiple models.
[[products]]
  # The name of the model (if no model is specified, the API will try to find
  # the "general" model).
  name = "general"
  # Generation and version strings.
  generation = "alpha"
  version = "v1"
  # Path to the weights on disk.
  path = "/engine/general.dg"

Launch Speech Engine

Assuming your model files are in a directory at /path/to/engine and your configuration file is at /path/to/config.toml, launch Speech Engine by running:

$ docker run \
      -d \
      --runtime=nvidia \
      -v "/path/to/engine:/engine:ro" \
      -v "/path/to/config.toml:/config.toml:ro" \
      -p 127.0.0.1:8080:8080 \
      deepgram/onprem-engine:latest \
      -v serve /config.toml
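
To confirm that the container started successfully and loaded its models, you can check its status and logs with standard Docker commands (the filter below assumes the image tag used above):

$ docker ps --filter "ancestor=deepgram/onprem-engine:latest"
$ docker logs [CONTAINER-ID]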

Configure Speech API

To configure Speech API, you will need a configuration file, which is written in TOML to promote easy human editing.

Example configuration file

# Keep in mind that all paths are in-container paths and do not need to exist
# on the host machine.

# Configure license validation.
[license]
  # Your license key.
  key = "36c0e479-2ab9-471e-a217-5dd809a236bc"

# Configure how the API will listen for your requests
[server]
  # The base URL (prefix) for requests to the API.
  base_url = "/v2"
  # The IP address to listen on. Since this is likely running in a Docker
  # container, you will probably want to listen on all interfaces.
  host = "0.0.0.0"
  # The port to listen on
  port = 8080

  # How long to wait for a connection to a callback URL (in seconds)
  callback_conn_timeout = 1
  # How long to wait for a response to a callback URL (in seconds)
  callback_timeout = 10

  # How long to wait for a connection to a fetch URL (in seconds)
  fetch_conn_timeout = 1
  # How long to wait for a response to a fetch URL (in seconds)
  fetch_timeout = 60

# Configure the DNS resolver, overriding the system default.
# Typically not needed, although we document it here for completeness.
# [resolver]
#   # List of nameservers to use to resolve DNS queries.
#   nameservers = ["127.0.0.11 53 udp"]
#   # Override the TTL in the DNS response (in seconds).
#   max_ttl = 10

# Configure the backend pool of speech engines (generically referred to as
# "drivers" here). There are two pools: "standard" and "failover". The API will
# load-balance among drivers in the standard pool; if a standard driver fails,
# the next one will be tried. If all drivers in the standard pool fail, then
# the API will load-balance among drivers in the failover pool; if a failover
# driver fails, the next one will be tried.
#
# Each driver URL will have its hostname resolved to an IP address. If a domain
# name resolves to multiple IP addresses, the API will load-balance across each
# IP address.
#
# This behavior is provided for convenience, and in a production environment
# other tools can be used, such as HAProxy.

# A new Speech Engine ("driver") in the "standard" pool.
[[driver_pool.standard]]
  # Host to connect to. Here, we use "tasks.engine", which is the Docker Swarm
  # method for resolving the IP addresses of all "engine" services. If you are
  # using Docker Compose, then this should just be "engine" instead of
  # "tasks.engine". If you rename the "engine" service in the Docker Compose
  # file, then change it accordingly here. Additionally, the port and prefix
  # should match those defined in the Engine configuration file.
  # NOTE: This must be HTTPS.
  url = "https://tasks.engine:8080/v2"

  # How long to wait for a connection to be established (in seconds).
  conn_timeout = 5
  # Once a connection is established, how many seconds to wait for a response.
  timeout = 400
  # Factor to increase the timeout by for each additional retry (for
  # exponential backoff).
  timeout_backoff = 1.2

  # If you fail to get a valid response (timeout or unexpected error), then
  # how many attempts should be made in total, including the initial attempt?
  # This is applied *per IP address* that the domain name in the URL resolves
  # to. If your domain resolves to multiple IPs, then "1" may be sufficient.
  retries_per_ip = 1

  # Before attempting a retry, sleep for this long (in seconds)
  retry_sleep = 2
  # Factor to increase the retry sleep by for each additional retry (for
  # exponential backoff).
  retry_backoff = 1.6

  # Maximum response to deserialize from Driver (in bytes)
  max_response_size = 1073741824

# Additional speech engines ("drivers") can be defined here, either in the
# standard pool using [[driver_pool.standard]], or in the failover pool by
# using [[driver_pool.failover]].
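
Launch Speech API

This guide does not prescribe a launch command for Speech API; the following is a minimal sketch that mirrors the Speech Engine launch above, assuming the API image is published as deepgram/onprem-api (confirm the exact image name and arguments with Deepgram) and that your configuration file is at /path/to/api.toml. If the Speech Engine container from the earlier example is running on the same host and already bound to host port 8080, publish the API on a different host port.

$ docker run \
      -d \
      -v "/path/to/api.toml:/config.toml:ro" \
      -p 8080:8080 \
      deepgram/onprem-api:latest \
      -v serve /config.toml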

Metrics Server is an optional component that provides aggregate metrics about system performance. We recommend running a single instance of the metrics server in your deployment.

Configure Metrics Server

You can configure the metrics server using a configuration file, which is written in TOML to promote easy human editing, or via environment variables.

The metrics server accepts the path to the configuration file as a command line argument:

$ docker run \
    -d \
    -v /path/to/config.toml:/config.toml:ro \
    -p 8000:8000 \
    deepgram/metrics-server:latest /config.toml

Example configuration file

# config.toml
server_address = "0.0.0.0:8000"

If the metrics server is run with no command-line arguments, it may be configured via environment variables:

$ docker run \
    -d \
    -e SERVER_ADDRESS=0.0.0.0:8000 \
    -p 8000:8000 \
    deepgram/metrics-server:latest

Collect Metrics

When running a metrics server, you must configure the speech engine and speech API to send metrics to it. To do so, add the following to the configuration files for both the speech engine and speech API:

# speech-engine/config.toml
# -- AND --
# speech-api/config.toml

[metrics]
  url = "http://tasks.metrics-server:8000"

Status Response

When the metrics server is configured and available, the /status endpoint of the speech API will be augmented with a status field, which will contain aggregate metrics for several time periods.
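
You can query this endpoint directly with curl. The host and port below are an assumption (the API published on localhost:8080, as in the launch sketch above); adjust them for your deployment.

$ curl http://localhost:8080/v2/status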

An example response:

// GET http://deepgram-api/v2/status
{
  "system_health": "Healthy",
  "active_batch_requests": 3,
  "active_stream_requests": 0,
  "status": [
    {
      // Average GPU utilization as a percentage of available
      // GPU compute over the past M minutes
      "average_gpu_utilization": 0.3333333333333333,

      // Average latency measured from when the request is received
      // until it completes transcription (in seconds) over the past
      // M minutes. Only valid for pre-recorded requests
      "average_latency": 0,

      // Average queue size (i.e., transcription requests to be processed)
      // over the past M minutes
      "average_request_queue_size": 0.6666666666666666,

      // Average throughput in audio time per processing time over the past M minutes
      "average_throughput": 0,

      // The time period over which these metrics were measured
      "minutes": 1,

      // Models that have been used in the last M minutes
      "models_in_use": [
        "\"general.latest\""
      ]
    },
    {
      "average_gpu_utilization": 0.1282051282051282,
      "average_latency": 14.574824810028076,
      "average_request_queue_size": 0.23333333333333336,
      "average_throughput": 0.13333332061767578,
      "minutes": 5,
      "models_in_use": [
        "\"general.latest\""
      ]
    },
    {
      "average_gpu_utilization": 0.09195402298850576,
      "average_latency": 14.358976364135742,
      "average_request_queue_size": 0.17592592592592593,
      "average_throughput": 0.08888888041178386,
      "minutes": 15,
      "models_in_use": [
        "\"general.latest\""
      ]
    }
  ]
}

Interpret Metrics

Depending on what you are optimizing for, you may focus on different metrics from the above response.

If you are optimizing for high throughput, then as a general rule you should strive to keep average_gpu_utilization close to 1. A simple way to accomplish this is to scale the number of concurrent requests (or the number of speech engine instances) to keep average_latency at a modestly high value. This ensures that the speech engine always has active requests to work on.

If you are optimizing for low latency, then you should scale the number of concurrent requests (or the number of speech engine instances) to keep average_latency inside your desired range. If you are trying to keep latency as low as possible, note that an average_gpu_utilization close to 1 means the GPUs are never idle and requests are likely waiting on GPU resources, so adding capacity (for example, more speech engine instances) would likely reduce latency further.
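
As a minimal sketch of how these metrics might feed a scaling decision, the following extracts the 5-minute GPU utilization and request queue size from the /status response using curl and jq (the URL is an assumption; point it at your Speech API instance):

$ curl -s http://localhost:8080/v2/status | \
    jq '.status[] | select(.minutes == 5) | {average_gpu_utilization, average_request_queue_size}'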

Configure Hotpepper

Hotpepper requires paths to four resources:

  • Directory to store the Hotpepper database
  • Directory for storing input datasets (collections of audio files to transcribe)
  • Directory for outputting packaged datasets
  • Configuration file, which is written in TOML to promote easy human editing

If you will be using Docker Swarm, these resources will need to be stored on a distributed filesystem.

Example configuration file

# The container path to the Hotpepper database. (Note: Naming conventions 
# may mention Dashscript, a previous version of Hotpepper; they are the same tool.)
db = "/db/dashscript.db"

# Path to the directory containing input datasets (collections of audio
# files to transcribe). New datasets are created by adding subdirectories
# to this folder and placing audio data there. This directory should be
# structured like so:
# /datasets/
#   |_ dataset1/
#      |_ audio1.mp3
#      |_ audio2.mp3
#   ...
datasets = "/datasets"

# A directory for outputting packaged datasets.
packaged_dataset_location = "/packaged"

[server]
  port = 80

# For customers who have ASR enabled, the following configuration places
# a "Get ASR" button on the L1 transcription page to pre-populate the
# transcript field with ASR output.
# Ensure that the endpoint points to your on-premise API instance!
[asr]
  endpoint = "http://tasks.api:8080/v2/listen?punctuate=true"  # Eg. docker swarm
  # endpoint = "http://api:8080/v2/listen?punctuate=true"  # Eg. docker compose

Allowing Automatic Transcription

Hotpepper can be configured to allow users labeling at level L1 to submit assigned files to an on-premise Deepgram Speech Engine for automatic speech recognition (ASR) and transcription. When ASR is used, the Hotpepper server sends the assigned audio file to the configured Speech API endpoint, parses a transcript from the results, and automatically populates the Transcript textarea of the labeling view with the returned transcript. In our experience, users value this feature highly when labeling.

In Hotpepper’s config.toml file, notice the following lines, which enable ASR:

# Ensure that the endpoint points to your Speech API instance

[asr]
  # docker swarm
  endpoint = "http://tasks.api:8080/v2/listen?punctuate=true"

  # docker compose
  # endpoint = "http://api:8080/v2/listen?punctuate=true"

When Hotpepper is properly configured for ASR, the File Details area of the Hotpepper labeling view will include a Get ASR button. To learn more, see Hotpepper User Guide: Labeling Data.