RHEL 8

Cloud-hosted on-premises (on-prem) deployments are the most common type of deployment performed by customers who want to leverage Deepgram within their own infrastructure. Specific configurations vary, but the Cloud Service Providers that are typically used include Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP). This guide details steps to set up a deployment to the cloud on an AWS instance running Red Hat Enterprise Linux (RHEL) 8.

In this guide, you will learn how to expose and access your application in a secure and stable manner. You will need to perform some of these steps in the AWS Management Console and some in your local terminal.

Prerequisites

Before you begin, you will need a Deepgram Account Representative. To reach one, contact us!

Your Deepgram Account Representative will guide you through the process of setting up:

- a Deepgram product contract
- a Deepgram Console account, so we can connect your license to your projects
- an account on Docker Hub, so you can conveniently get the latest Deepgram software and updates

Ahead of your planned on-prem deployment, your Deepgram Account Representative will need:

- your Deepgram Console account email address
- your Deepgram Console Project ID
- your Docker Hub Account ID

Providing this information will allow Deepgram to supply:

- download links to your customized configuration files, which will include your license ID
- download links to required models, including at least one pre-trained AI model for testing purposes

In addition, you will need to make sure you have a valid RHEL subscription with Red Hat.

Accessing Your Cloud Environment

AWS uses public-key cryptography to secure login information for your instance. In a Linux instance, password authentication is disabled; you use a key pair to log in to your instance securely.

Create an Amazon EC2 Key Pair

If you don’t already have an Amazon EC2 key pair, you will need to create one in order to access the AWS EC2 Virtual Machine. To learn how, read Create a key pair using Amazon EC2 in Amazon’s documentation.

ℹ️
Keys must be created in each AWS region in which you will deploy Deepgram on-premises.

At the end of this process, your browser should download a private-key.pem file for your key pair. Move this file to a secure and memorable location.
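SSH clients typically refuse to use a private key whose file permissions allow other users to read it. The following is a minimal sketch of moving the key and restricting its permissions; the helper name `secure_key` and the `~/.ssh` destination are illustrative assumptions, not part of Amazon's instructions.

```shell
# Hypothetical helper: move a downloaded key into a private directory
# and make it readable only by its owner.
secure_key() {
  key_src="$1"                   # path to the downloaded .pem file
  key_dir="${2:-$HOME/.ssh}"     # assumed destination directory
  mkdir -p "$key_dir" && chmod 700 "$key_dir" || return 1
  mv "$key_src" "$key_dir/" || return 1
  chmod 400 "$key_dir/$(basename "$key_src")"
}

# Example usage (the path is a placeholder):
# secure_key ~/Downloads/private-key.pem
```

If you store the key somewhere else, pass that directory as the second argument and use the resulting path with `ssh -i` later in this guide.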

Create an Amazon EC2 Instance

To begin your on-prem installation, you must create an Amazon EC2 instance.

New AWS Launch Experience

ℹ️
If you're not yet using the new AWS launch experience, skip to instructions for creating an Amazon EC2 Instance using the old AWS launch experience.

If you're using the new AWS launch experience:

Navigate to the EC2 Dashboard and confirm that the proper AWS Region is configured, then choose Launch Instance and select Launch Instance in the dropdown to open the wizard.
Under Name and Tags, for Name, enter "Deepgram On-Premises (RHEL 8)".
For Applications and OS Images (Amazon Machine Image), enter "Red Hat Enterprise Linux 8" in the search bar, and search.
Next to the Red Hat Enterprise Linux 8 image by Amazon Web Services with Ver Red Hat Enterprise Linux 8.6 (HVM), click Select. In the modal that appears, confirm the latest version of the image is Red Hat Enterprise Linux 8.6 (HVM) or greater, and select Continue.
For Instance Type, in the search bar dropdown, locate and select "g4dn.2xlarge". Ensure it has 8 vCPU and 32GiB memory.
For Key pair (login), select the key pair you created in the Create an Amazon EC2 Key Pair section of this guide. If you didn't previously create a key pair, select Create a new key pair and follow the instructions provided in the modal.

ℹ️
Keys must be created in each AWS region in which you will deploy Deepgram on-premises.
For Network Settings, select Create security group and configure it (or select an existing security group that supports the following rules):
1. Ensure Allow SSH traffic from is selected by default and that its value is Anywhere.
2. Ensure Allow HTTP traffic from the internet is selected by default.
3. If you intend to use an SSL proxy in conjunction with Deepgram on-premises, select Allow HTTPS traffic from the internet.
For Configure storage, increase the gp2 Root Volume value to at least 32 GiB. We recommend more if you intend to deploy multiple models, starting at 100 GiB for a DGTools model training environment.
For Summary, ensure the Number of instances value is at least 1 and confirm that the rest of your instance settings are correct. They should consist of the following values:
- Software Image (AMI): Red Hat Enterprise Linux 8 ami-08970fb2e5767e3b8
- Virtual server type (instance type): g4dn.2xlarge
- Firewall (security group): either New Security Group or the name of your selected security group
- Storage (volumes): 2 volume(s)
After you have confirmed these values, select Launch instance.
Once the instance successfully launches, review its Launch log and other "Next Steps", then select View all instances.
From the list of instances, select your new Deepgram On-Premises (RHEL 8), and make note of or copy its Public IPv4 DNS entry.

Old AWS Launch Experience

If you're using the old AWS launch experience:

Navigate to the EC2 Dashboard and confirm that the proper AWS Region is configured, then choose Launch Instance to open the wizard.
For the Choose an Amazon Machine Image (AMI) wizard step, choose a basic configuration to serve as a template for your instance:
1. Search for Red Hat Enterprise Linux 8.
2. If you want the latest version of the RHEL 8 image, ensure the version is labeled Ver Red Hat Enterprise Linux 8.6 (HVM).
3. Ensure the Architecture version is x86_64.
4. Choose Select.
For the Choose an Instance Type wizard step, select the hardware configuration of your instance:
1. Filter by the g4dn family type.
2. Select the g4dn.2xlarge instance.
3. Select Next: Configure Instance Details.
  
  Avoid selecting Review and Launch. If you do so, the wizard will complete the other configuration settings for you and will miss some important settings.
For the Configure Instance Details wizard step, review the default instance detail selections, then select Next: Add Storage.
For the Add Storage wizard step, increase the size of the Root volume to 32 GB, and then select Next: Add Tags.
For the Add Tags wizard step, select “Add Tag”:
1. For the key, type Name.
2. For the value, type On-Premises.
3. Select Next: Configure Security Group.
For the Configure Security Group wizard step, select Create a new security group.
1. For the name, type On-Premises.
2. For the description, type On-Premises Security Group.
3. Ensure the first rule (of type SSH) is listed, and provide the description SSH for Administration.
  
  Open SSH to only necessary IP addresses.
4. Select Add Rule, then select HTTPS from the dropdown, and provide the description HTTPS for Deepgram API.
5. Select Review and Launch.
For the Review Instance Launch wizard step, ensure all of the instance details match what you entered in the previous steps, and then select Launch.
In the Select an existing key pair or create a new key pair modal, ensure that Choose an existing key pair is selected, then choose the key pair you created in the Create an Amazon EC2 Key Pair section. Check the acknowledgement box, and select Launch Instances.
Once the instance successfully launches, review its details in the EC2 instances list, and make note of its Public IPv4 DNS entry.
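If the AWS CLI is installed and configured on your local machine (an assumption; the steps above use only the AWS Management Console), you can also look up the Public IPv4 DNS entry from the terminal. This is a minimal sketch; the helper name `instance_public_dns` is illustrative, and the Name tag value is whatever you entered when you created the instance.

```shell
# Assumes the AWS CLI is installed and configured for the region
# in which the instance was launched.
instance_public_dns() {
  name_tag="$1"   # value of the instance's Name tag
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=$name_tag" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].PublicDnsName' \
    --output text
}

# Example usage:
# instance_public_dns "Deepgram On-Premises (RHEL 8)"
```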

Log in to the AWS EC2 instance

To complete the rest of the installation, including configuring your environment and transferring files between your local computer and your AWS instance, you must connect to the AWS EC2 instance that you launched.

Open the terminal application for your computer.
Connect to your AWS instance:
```
ssh -i private-key.pem ec2-user@PUBLIC_IPV4_DNS_VALUE
```
⚠️
Be sure to replace the PUBLIC_IPV4_DNS_VALUE placeholder value with the DNS value for your instance. Also check that the path to your private-key.pem file is correct.
If you receive a message that indicates that the authenticity of the host can’t be established, type yes, then press the Enter key on your keyboard.
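The same key pair can be used to copy files from your local computer to the instance. The sketch below assumes `scp` is available locally; the helper name `copy_to_instance` is hypothetical, and files are placed in the `ec2-user` home directory.

```shell
# Hypothetical helper: copy one or more local files to the instance's
# home directory using the same key pair as the ssh command.
copy_to_instance() {
  key_file="$1"   # path to private-key.pem
  host="$2"       # Public IPv4 DNS value of the instance
  shift 2
  scp -i "$key_file" "$@" "ec2-user@$host:~/"
}

# Example usage (values are placeholders):
# copy_to_instance private-key.pem PUBLIC_IPV4_DNS_VALUE api.toml engine.toml
```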

Configuring Your Cloud Environment

Once you have successfully logged in to your instance, you must remove incompatible kernel drivers and install all of the necessary utilities to run GPU-accelerated containers with Podman.

Update Red Hat with DNF Package Management Tool

Update the system to RHEL 8.7.

ℹ️
We recommend using the RHEL 8.6 image from the AWS image base because it is more recently updated.

Update all installed packages on your AWS instance, along with their dependencies, to the latest available versions:
```
sudo dnf update -y
```
Verify that the update was successful:
```
cat /etc/redhat-release
```
You should expect the command to return output similar to the following:
```
Red Hat Enterprise Linux release 8.7 (Ootpa)
```
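If you want to check the release in a script rather than by eye, `/etc/os-release` exposes the same information as shell variables. The following is a minimal sketch; the helper names `rhel_version` and `is_rhel8` are illustrative.

```shell
# Print "<ID> <VERSION_ID>" from an os-release file
# (defaults to /etc/os-release).
rhel_version() {
  os_release="${1:-/etc/os-release}"
  # Source the file in a subshell so its variables do not leak
  # into the current shell.
  ( . "$os_release" && printf '%s %s\n' "$ID" "$VERSION_ID" )
}

# Succeed only when the file describes a RHEL 8.x system.
is_rhel8() {
  case "$(rhel_version "$1")" in
    "rhel 8."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage:
# is_rhel8 && echo "RHEL 8 detected: $(rhel_version)"
```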

Install Podman as the container engine for RHEL 8

Install Podman, the native container engine for RHEL 8:
```
sudo dnf -y install podman
```

Verify that the Podman installation was successful:
```
podman version
```
You should expect the command to return output similar to the following:
```
Client:       Podman Engine
Version:      4.2.0
API Version:  4.2.0
Go Version:   go1.18.4
Built:        Fri Jan 13 13:45:32 2023
OS/Arch:      linux/amd64
```

Install RHEL 8 Kernel Development Tools

Install the Linux kernel development tools for RHEL 8 to support installing the NVIDIA drivers.

Install the RHEL 8 kernel development tools and associated headers:
```
sudo dnf -y install "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"
```
Install the RHEL 8 Extra Packages for Enterprise Linux (EPEL) repository:
```
sudo dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
```
Install the Dynamic Kernel Module Support (DKMS) framework:
```
sudo dnf -y install dkms
```
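DKMS builds the NVIDIA kernel module against the kernel that is currently running, so the installed `kernel-devel` package needs to match the output of `uname -r`. The sketch below checks for that match; the helper name `kernel_devel_matches` is illustrative. If the earlier system update installed a newer kernel, the instance may need a reboot before the versions line up, which is an assumption about the state of your instance.

```shell
# Succeed only if a kernel-devel package matching the running
# kernel is installed.
kernel_devel_matches() {
  running="$(uname -r)"
  rpm -q "kernel-devel-$running" >/dev/null 2>&1
}

# Example usage:
# kernel_devel_matches || echo "kernel-devel does not match $(uname -r)" >&2
```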

Install NVIDIA Drivers

Install the latest available NVIDIA drivers for the g4dn instance which has the Tesla T4 GPU:

Remove the "nouveau" module to avoid a conflict with the NVIDIA drivers:
```
sudo modprobe -r nouveau
```
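To confirm that the module is no longer loaded, you can inspect the kernel's module list. This is a minimal sketch that reads `/proc/modules`; the helper name `nouveau_loaded` is illustrative.

```shell
# Succeed if the nouveau module appears in the kernel's module list.
# Each line of /proc/modules starts with the module name and a space.
nouveau_loaded() {
  modules_file="${1:-/proc/modules}"
  grep -q '^nouveau ' "$modules_file"
}

# Example usage:
# if nouveau_loaded; then echo "nouveau is still loaded" >&2; fi
```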
Install GNU Wget (wget), so you can retrieve the latest NVIDIA driver for your GPU:
```
sudo dnf -y install wget
```
Identify the latest driver for the GPU you are using and retrieve its download URL:
1. Go to NVIDIA Official Drivers.
2. Select the correct product and a corresponding CUDA toolkit 11 or greater.
3. Select Search and check that the correct driver is displayed, then select Download.
4. Right-click Agree & Download, then copy the link to save the download URL to your clipboard.

Download the latest driver for your GPU:

```
# Example: wget https://developer.download.nvidia.com/compute/cuda/12.0.1/local_installers/cuda-repo-rhel8-12-0-local-12.0.1_525.85.12-1.x86_64.rpm
wget LINK_TO_LATEST_NVIDIA_GPU_DRIVER
```

⚠️
Be sure to replace the LINK_TO_LATEST_NVIDIA_GPU_DRIVER placeholder value with the URL to the latest driver for the GPU you are using.

Install the downloaded driver repository package with RPM:
```
# Example: sudo rpm -i cuda-repo-rhel8-12-0-local-12.0.1_525.85.12-1.x86_64.rpm
sudo rpm -i DOWNLOADED_FILE_NAME
```

Clean the DNF cache
```
sudo dnf clean all
```

Install the NVIDIA drivers

```
sudo dnf -y module install nvidia-driver:latest-dkms
```

Install the CUDA environment
```
sudo dnf -y install cuda
```
Load the NVIDIA kernel module and then load the NVIDIA unified memory kernel module
```
nvidia-modprobe && nvidia-modprobe -u
```
Test that the NVIDIA drivers and CUDA environment were successfully installed:

```
nvidia-smi
```
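For a more compact check than the full `nvidia-smi` table, you can query specific fields. The following is a sketch using the `--query-gpu` and `--format` options of `nvidia-smi`; the helper names `gpu_summary` and `has_t4` are illustrative, and the T4 check assumes you launched a g4dn instance as described earlier.

```shell
# Print one line per GPU: name, driver version, and total memory.
gpu_summary() {
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
}

# Succeed if any detected GPU reports a name containing "T4".
has_t4() {
  gpu_summary | grep -q 'T4'
}

# Example usage:
# gpu_summary
# has_t4 || echo "No T4 GPU detected" >&2
```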

Setting up the NVIDIA Container Toolkit

ℹ️
This is from the NVIDIA Installation Guide https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#podman

Install the NVIDIA Container Toolkit

Set up the package repository and GPG key:
```
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
 && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
```

Install the NVIDIA Container Toolkit:

```
sudo dnf clean expire-cache \
 && sudo dnf install -y nvidia-container-toolkit
```

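Download the NVIDIA container SELinux policy module from NVIDIA's dgx-selinux repository (the file is published under the repository's RHEL7 directory):
```
wget https://raw.githubusercontent.com/NVIDIA/dgx-selinux/master/bin/RHEL7/nvidia-container.pp
```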

Install the NVIDIA container policy module into SELinux:
```
sudo semodule -i nvidia-container.pp
```
List the NVIDIA components reported by nvidia-container-cli and restore their SELinux security context:
```
nvidia-container-cli -k list | sudo restorecon -v -f -
```
Restore the security context of the contents of /dev:
```
sudo restorecon -Rv /dev
```

Test that the NVIDIA Container Toolkit was successfully installed

```
sudo podman run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL \
--security-opt label=type:nvidia_container_t  \
docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
```

Run podman as non-root

Modify the NVIDIA container runtime to enable running podman as non-root (a.k.a. "rootless")

Edit the /etc/nvidia-container-runtime/config.toml file
```
sudo vi /etc/nvidia-container-runtime/config.toml
```

Modify the file contents to match:

```
[nvidia-container-cli]
#no-cgroups = false
no-cgroups = true
# ...
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
debug = "~/.local/nvidia-container-runtime.log"
```

Test rootless podman and the NVIDIA container toolkit/CUDA environment

```
podman run --security-opt=no-new-privileges --cap-drop=ALL --security-opt \
label=type:nvidia_container_t --hooks-dir=/usr/share/containers/oci/hooks.d/ \
docker.io/nvidia/cuda:10.2-base nvidia-smi
```

Log in to Docker Hub with Podman

Once you are satisfied that Podman, the NVIDIA Container Toolkit, and the CUDA environment are installed and configured correctly, log in to Docker Hub to cache your credentials locally. Once your credentials are cached, you should not have to log in again until you log out with podman logout.

⚠️
We recommend that you use an access token when logging in to Docker Hub. Access tokens are used like passwords, but limit administrative access to your account. To learn how to create and use access tokens with Docker, read Docker’s documentation on Access Tokens.
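The login itself can be sketched as follows. This example assumes your access token is stored in a file that only you can read; the helper name `dockerhub_login` and the token file location are illustrative. Reading the token from standard input with `--password-stdin` keeps it out of your shell history.

```shell
# Hypothetical helper: log in to Docker Hub with Podman using an
# access token read from a file.
dockerhub_login() {
  user="$1"         # your Docker Hub Account ID
  token_file="$2"   # file containing only the access token (assumed)
  podman login docker.io --username "$user" --password-stdin < "$token_file"
}

# Example usage (values are placeholders):
# dockerhub_login DOCKER_HUB_ACCOUNT_ID ~/.dockerhub-token
```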

Setting up the Deepgram Engine and API Containers

Now that your environment is set up, you can download and set up your Deepgram products. For ease of use, Deepgram provides its products in Docker containers.

Get Deepgram Products

Download the latest Deepgram Engine image from Docker:
```
podman pull docker.io/deepgram/onprem-engine:3.45.3
```
Download the latest Deepgram API image from Docker:
```
podman pull docker.io/deepgram/onprem-api:1.88.0
```
Download the latest Deepgram License Proxy Server image from Docker:
```
podman pull docker.io/deepgram/onprem-license-proxy:1.4.0
```
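The three pulls can also be scripted so that a failed download is noticed immediately. This is a minimal sketch; the image tags are copied from the commands above, the helper name `pull_all` is illustrative, and `podman image exists` is used to confirm each image is in local storage.

```shell
# Image references taken from the pull commands in this guide.
IMAGES="docker.io/deepgram/onprem-engine:3.45.3
docker.io/deepgram/onprem-api:1.88.0
docker.io/deepgram/onprem-license-proxy:1.4.0"

# Pull each image and confirm it is present locally.
pull_all() {
  for image in $IMAGES; do
    podman pull "$image" || return 1
    podman image exists "$image" || { echo "missing: $image" >&2; return 1; }
  done
}

# Example usage:
# pull_all && echo "All Deepgram images are available locally"
```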

Import Your Container Configuration and Model Files

Before you can run your on-prem deployment, you must configure the required components. To do this, you will need to create a directory for and customize your configuration files, as well as create a directory to house models that have been encrypted for use in your requests.

For your deployment, we provide models and configuration files to you via Amazon S3 buckets, so you can download directly to your AWS RHEL instance. If you don’t have customized configuration files, you can create configuration files using the examples included in Customize Your Configuration.

To house your configuration files, create a directory named config:
```
mkdir config
```
ℹ️
You can name this directory whatever you like, as long as you update the podman commands accordingly.
In your new directory, download each TOML configuration file using the links provided by Deepgram:
```
cd config
wget LINK_TO_TOML_CONFIGURATION_FILE_PROVIDED_BY_DEEPGRAM
```
⚠️
Be sure to replace each LINK_TO_TOML_CONFIGURATION_FILE_PROVIDED_BY_DEEPGRAM placeholder value with the URL to the TOML configuration file provided by your Deepgram Account Representative.
To house models that have been encrypted for use in your requests, return to your home directory and create a directory named models:
```
cd ~
mkdir models
```
ℹ️
You can name this directory whatever you like, as long as you update the podman command and engine.toml file accordingly.
In your new directory, download each model using the links provided by Deepgram:
```
cd models
wget LINK_TO_MODEL_PROVIDED_BY_DEEPGRAM
```
⚠️
Be sure to replace each LINK_TO_MODEL_PROVIDED_BY_DEEPGRAM placeholder value with the URL to the model provided by your Deepgram Account Representative.
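Before moving on, you may want to confirm that the files landed where the later podman commands expect them. The sketch below assumes the config and models directories sit side by side under one base directory (your home directory by default) and that the configuration files are named api.toml and engine.toml; the helper name `check_layout` is illustrative.

```shell
# Verify that both configuration files exist and that the models
# directory contains at least one file.
check_layout() {
  base="${1:-$HOME}"
  status=0
  for f in "$base/config/api.toml" "$base/config/engine.toml"; do
    [ -f "$f" ] || { echo "missing file: $f" >&2; status=1; }
  done
  if [ -z "$(ls -A "$base/models" 2>/dev/null)" ]; then
    echo "no models found in $base/models" >&2
    status=1
  fi
  return "$status"
}

# Example usage:
# check_layout && echo "Configuration and model files are in place"
```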

Customize Your Configuration

Once your configuration files have been created in your AWS instance, you can update your configuration to customize it for your use case.

Configuration Files

⚠️
Your Deepgram Account Representative should provide you with download links to customized configuration files. Unless further customized, the example files included in this section are basic and should be used only to spin up a standard proof of concept (POC) deployment or to test your system.

`api.toml`

The API configuration file determines how the API container built from the Docker image will be run. It includes important settings for exposing API endpoints and for enabling features and post-processing plugins for requests.

```
# Keep in mind that all paths are in-container paths, and do not need to exist
# on the host machine.

# Configure license validation.
[license]
  # Your license key.
  key = "LICENSE_KEY"
  server_url = "https://license-proxy:8443"

# Configure how the API will listen for your requests
[server]
  # The base URL (prefix) for requests to the API.
  base_url = "/v1"
  # The IP address to listen on. Since this is likely running in a Docker
  # container, you will probably want to listen on all interfaces.
  host = "0.0.0.0"
  # The port to listen on
  port = 8080

  # How long to wait for a connection to a callback URL (in seconds)
  callback_conn_timeout = 1
  # How long to wait for a response to a callback URL (in seconds)
  callback_timeout = 10

  # How long to wait for a connection to a fetch URL (in seconds)
  fetch_conn_timeout = 1
  # How long to wait for a response to a fetch URL (in seconds)
  fetch_timeout = 60

[features]
  endpointing_on_by_default = true

  # Enable topic detection by setting this to true, set to false to disable
  topic_detection = true # or false

  # Enable summarization by setting this to true, set to false to disable
  summarization = true # or false

# Configure the DNS resolver, overriding the system default.
# Typically not needed, although we document it here for completeness.
# [resolver]
#   # List of nameservers to use to resolve DNS queries.
#   nameservers = ["127.0.0.11 53 udp"]
#   # Override the TTL in the DNS response (in seconds).
#   max_ttl = 10

# Configure the backend pool of speech engines (generically referred to as
# "drivers" here). The API will load-balance among drivers in the standard
# pool; if one standard driver fails, the next one will be tried.
#
# Each driver URL will have its hostname resolved to an IP address. If a domain
# name resolves to multiple IP addresses, the API will load-balance across each
# IP address.
#
# This behavior is provided for convenience, and in a production environment
# other tools can be used, such as HAProxy.

# A new Speech Engine ("driver") in the "standard" pool.
[[driver_pool.standard]]
  # Host to connect to. Here, we use "tasks.engine", which is the Docker Swarm
  # method for resolving the IP addresses of all "engine" services. If you are
  # using Docker Compose, then this should just be "engine" instead of
  # "tasks.engine". If you rename the "engine" service in the Docker Compose
  # file, then change it accordingly here. Additionally, the port and prefix
  # should match those defined in the Engine configuration file.
  # NOTE: This must be HTTPS.
  url = "https://engine:8081/v2"

  # How long to wait for a connection to be established (in seconds).
  conn_timeout = 5
  # Once a connection is established, how many seconds to wait for a response.
  timeout = 400
  # Factor to increase the timeout by for each additional retry (for
  # exponential backoff).
  timeout_backoff = 1.2

  # If you fail to get a valid response (timeout or unexpected error), then
  # how many attempts should be made in total, including the initial attempt?
  # This is applied *per IP address* that the domain name in the URL resolves
  # to. If your domain resolves to multiple IPs, then "1" may be sufficient.
  retries_per_ip = 1

  # Before attempting a retry, sleep for this long (in seconds)
  retry_sleep = 2
  # Factor to increase the retry sleep by for each additional retry (for
  # exponential backoff).
  retry_backoff = 1.6
  max_response_size = 1073741824

# Additional speech engines ("drivers") can be defined here, they too should
# be in the standard pool using [[driver_pool.standard]]
```

`engine.toml`

The Deepgram Engine configuration file determines how the Deepgram Engine container built from the Docker image will be run. It includes important file paths and settings related to models that you will be using.

```
# Keep in mind that all paths are in-container paths and do not need to exist
# on the host machine.

# Configure license validation.
[license]
  # Your license key
  key = "LICENSE_KEY"
  server_url = "https://license-proxy:8443"

# Configure the server to listen for requests from the API.
[server]
  # The IP address to listen on. Since this is likely running in a Docker
  # container, you will probably want to listen on all interfaces.
  host = "0.0.0.0"
  # The port to listen on. This must match the port in the driver URL in
  # the API configuration file.
  port = 8081

# Speech models. You can place these in one or multiple directories.
[model_manager]
 search_paths = ["/models"]

# Enable ancillary features
# To disable any of these features, remove or comment out the respective
# feature section.
[features]
 # Enable Language Detection by setting this to true, set to false to disable
 language_detection = true # or false

# Size of audio chunks to send in seconds. Min/max default to 7/10s,
# which is ideal for batch requests. If streaming, we generally
# recommend 0.5-3s chunks for lowest latency
# [chunking]
#  [chunking.streaming]
#   min_duration = 0.5
#   max_duration = 3.0
#
# Engine will automatically enable half precision operations if your GPU supports
# them. You can explicitly enable or disable this behavior with the state parameter
# which supports enabled, disabled, and auto (the default).
[half_precision]
   state = "enabled" # or "disabled" or "auto"
```
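The comments in api.toml state that the port in each driver URL should match the port defined in the Engine configuration file. The following is a rough sketch of checking that with standard text tools; it is not a TOML parser, the helper names are illustrative, and it assumes the files follow the layout of the examples in this section.

```shell
# Print the port from the [server] table of a TOML file.
server_port() {
  awk '
    /^[ \t]*\[/ { in_server = ($0 ~ /^[ \t]*\[server\]/) }
    in_server && /^[ \t]*port[ \t]*=/ {
      sub(/^[^=]*=[ \t]*/, ""); sub(/[ \t#].*$/, ""); print; exit
    }
  ' "$1"
}

# Print the port of every uncommented driver url in api.toml.
driver_ports() {
  sed -n 's|^[[:space:]]*url[[:space:]]*=[[:space:]]*"[a-z]*://[^:/"]*:\([0-9][0-9]*\).*|\1|p' "$1"
}

# Succeed only if every driver port equals the Engine server port.
ports_match() {
  want="$(server_port "$2")"
  [ -n "$want" ] || return 1
  for p in $(driver_ports "$1"); do
    [ "$p" = "$want" ] || { echo "driver port $p != engine port $want" >&2; return 1; }
  done
}

# Example usage:
# ports_match config/api.toml config/engine.toml
```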

Starting and Using Your Container

To make sure your Deepgram on-prem deployment is properly configured and running, you will want to start the containers with Podman, and make a sample request to Deepgram.

Start Podman

Now that you have your configuration files set up and in the correct location to be used by the container, use Podman to run the containers:

Navigate to your home directory where you created ./config and ./models

```
cd ~
```

Start License Proxy

ℹ️
Be sure to replace the LICENSE_KEY placeholder in this command with your Deepgram provided license key.

```
podman run --rm -d --name deepgram-onprem-license-proxy --network=host \
  docker.io/deepgram/onprem-license-proxy:1.4.0 -v serve \
  --license-key "LICENSE_KEY" --host 0.0.0.0 --port 8443 --status-port 8089
```

Verify that License Proxy started properly

```
podman container logs deepgram-onprem-license-proxy
```

Start Engine

```
podman run --security-opt=no-new-privileges --cap-drop=ALL --security-opt \
  label=type:nvidia_container_t --hooks-dir=/usr/share/containers/oci/hooks.d/ \
  --rm -d --name deepgram-onprem-engine --network=host \
  -v ./config:/config:z -v ./models:/models:z \
  docker.io/deepgram/onprem-engine:3.45.3 -v serve /config/engine.toml
```

Verify that Engine started properly

```
podman container logs deepgram-onprem-engine
```

Start API

```
podman run --rm -d --name deepgram-onprem-api --network=host -v ./config:/config:z \
  docker.io/deepgram/onprem-api:1.88.0 \
  -v serve /config/api.toml
```

Verify that API started properly

```
podman container logs deepgram-onprem-api
```
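Because the containers are started with `--rm`, a container that exits is removed and no longer appears in `podman ps`. The sketch below uses that to check that all three containers are still running; the container names are the ones passed to `--name` above, and the helper name `check_running` is illustrative.

```shell
# Succeed only if all three Deepgram containers are listed as running.
check_running() {
  running="$(podman ps --format '{{.Names}}')"
  status=0
  for name in deepgram-onprem-license-proxy deepgram-onprem-engine deepgram-onprem-api; do
    printf '%s\n' "$running" | grep -qx "$name" \
      || { echo "not running: $name" >&2; status=1; }
  done
  return "$status"
}

# Example usage:
# check_running && echo "All Deepgram containers are running"
```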

Test Your Deepgram Setup with a Sample Request

Test your environment and container setup with a local file:

```
curl -X POST -T FILE_PATH http://localhost:8080/v1/listen
```

Finally, test that you can view the model that was automatically loaded by Deepgram Engine. To do this, download the American English Harvard Sentences file from the Open Speech Repository and run it through the model:

```
wget https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0019_8k.wav

curl -X POST --data-binary @OSR_us_000_0019_8k.wav "http://localhost:8080/v1/listen?model=general-dQw4w9WgXcQ&version=2021-08-18"; echo
```
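If you want a scripted pass/fail check instead of reading the response body, you can ask curl to report only the HTTP status code. This is a minimal sketch; the helper names are illustrative, the default URL assumes the `/v1` base URL and port 8080 from the example api.toml, and treating 200 as success is an assumption about how your deployment responds.

```shell
# Send an audio file to the API and print only the HTTP status code.
api_status() {
  audio_file="$1"
  url="${2:-http://localhost:8080/v1/listen}"
  curl -s -o /dev/null -w '%{http_code}' -X POST --data-binary "@$audio_file" "$url"
}

# Succeed only when the request returns HTTP 200.
request_ok() {
  [ "$(api_status "$@")" = "200" ]
}

# Example usage:
# request_ok OSR_us_000_0019_8k.wav && echo "Deepgram API responded successfully"
```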

Adding the License Proxy Server

For customers deploying Deepgram’s on-premises solution in production, Deepgram recommends the License Proxy Server, which is a caching proxy that communicates with the Deepgram-hosted license server to ensure uptime and simplify network security.

⚠️
If you aren't certain which products your contract includes or if you are interested in adding the License Proxy Server to your on-premises deployment, please consult a Deepgram Account Representative. To reach one, contact us!

There are many benefits of using the License Proxy Server:

- For production deployments, the License Proxy Server allows your deployed on-premises services to continue to run even if your deployment loses connectivity to the Deepgram license server.
- If network security requirements dictate that traffic over the web is allowed from only certain hosts, the License Proxy Server can be deployed statically while ASR services scale elastically.
- If your customers will deploy Deepgram software as part of your on-premises solution and their traffic must flow back to your environment, you may deploy the License Proxy Server to relay traffic on to the Deepgram license server.

To see an architecture overview or learn about monitoring the License Proxy Server, see License Proxy Server.

Installing

Deepgram makes all of its products available through Docker Hub.

Run the following command:

```
podman pull docker.io/deepgram/onprem-license-proxy:1.4.0
```

Deploying

Before you can run your on-prem deployment with the License Proxy Server, you must configure the License Proxy Server. To do this, update your container configuration files, then restart the containers with Podman.

Update Your Container Configuration Files

In your api.toml and engine.toml files, add a line that specifies the URL to your deployed proxy immediately below your license key configuration:

```
[license]
key = "YOUR_LICENSE_KEY"
server_url = "https://localhost:8443"
```

Start Podman

Start Podman to begin directing licensing requests through the proxy:

```
podman run --rm -d --name deepgram-onprem-license-proxy --network=host \
  docker.io/deepgram/onprem-license-proxy:1.4.0 -v serve \
  --license-key "LICENSE_KEY" --host 0.0.0.0 --port 8443 --status-port 8089
```

After restarting, test that an ASR request is processed as expected.