
DGTools

For customers with Deepgram’s on-premises solution who want to train their own custom models, DGTools offers a path forward. Users of DGTools can take a general domain model and train it with production-representative data, or trainable data, to produce a targeted domain model.

The quality of your trainable data matters. Errors in transcription will lead to suboptimal outcomes.

Use Cases

  • Reducing word error rate (WER) by teaching an uncommon word to a general domain model, producing a targeted domain model.
  • Reducing WER by improving targeted domain model confidence with additional trainable data.

Deploying DGTools

Before You Begin

Prepare Your Linux Environment

This guide assumes you are running Ubuntu 20.04 LTS and that the environment has been prepared to run Deepgram on-premises. If you are unfamiliar with the on-prem environment requirements, please review Deploying Base Components before proceeding with this guide.

Log in to Docker

Docker Hub

DGTools is provided as a Docker container image stored at the Docker Hub repository deepgram/onprem-dgtools. You will need to be logged into Docker in the installation environment with the same Docker Hub account used to access the Deepgram on-premises base component repositories deepgram/onprem-api and deepgram/onprem-engine.

Docker As a Non-Root User

DGTools must be run via Docker as a non-root user. For instructions, please read Post-installation Steps for Linux.

Prepare the NVIDIA Runtime

DGTools requires CUDA 11 and an NVIDIA GPU to work.
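You can confirm the NVIDIA driver tooling is visible before launching a training run. This is a minimal sketch, not part of DGTools itself; nvidia-smi ships with the NVIDIA drivers and will be absent on machines without them:

```shell
# Minimal sketch: check whether NVIDIA driver tooling is present
# before attempting to run DGTools.
if command -v nvidia-smi >/dev/null 2>&1; then
    GPU_STATUS="available"
    # Print the detected GPU name(s).
    nvidia-smi --query-gpu=name --format=csv,noheader
else
    GPU_STATUS="missing"
    echo "nvidia-smi not found; install the NVIDIA drivers and container toolkit first" >&2
fi
```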

Set Up Deepgram On-premises

If you intend to run DGTools in the same environment as your Deepgram on-premises installation, you must stop Deepgram API and Deepgram Engine before you start DGTools.

You will need your Deepgram on-premises license key to use DGTools. It can be found in your api.toml or engine.toml file and looks like this:

key = "01234567-89ab-cdef-0123-456789abcdef"
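If you would rather extract the key programmatically, a sed one-liner can pull it out of the TOML file. This is a hedged sketch: the sample file below stands in for your real api.toml, whose location depends on your deployment.

```shell
# Hedged sketch: extract the license key from a Deepgram TOML config file.
# A sample file is created here so the example is self-contained.
cat > /tmp/api.toml <<'EOF'
[license]
key = "01234567-89ab-cdef-0123-456789abcdef"
EOF

# sed captures the quoted value from the first `key = "..."` line.
LICENSE_KEY=$(sed -n 's/^[[:space:]]*key[[:space:]]*=[[:space:]]*"\(.*\)".*/\1/p' /tmp/api.toml | head -n 1)
echo "$LICENSE_KEY"
```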

Installing

Create the dgtools directory for the license file and training data

Create the dgtools directory and its dgdatasets subdirectory in your home directory $HOME. This directory will serve as the installation location of the DGTools license file as well as any user-provided data sets that will be used for model training.

mkdir -p $HOME/dgtools/dgdatasets

Create the DGTOOLS_RESOURCE_DIR environment variable, which points to the new dgtools directory. This statement will also add the DGTOOLS_RESOURCE_DIR environment variable to your .bashrc file.

export DGTOOLS_RESOURCE_DIR=$HOME/dgtools; \
echo "export DGTOOLS_RESOURCE_DIR=\$HOME/dgtools" >> $HOME/.bashrc

Add DGTOOLS_RESOURCE_DIR to your PATH so that the dgtools script can be run from any directory, and persist the change in your .bashrc file.

export PATH=$DGTOOLS_RESOURCE_DIR:$PATH; \
echo "export PATH=\$DGTOOLS_RESOURCE_DIR:\$PATH" >> $HOME/.bashrc

Create the dgtools script

Use the following command to create the dgtools script in the dgtools directory:

cat > $DGTOOLS_RESOURCE_DIR/dgtools <<'EOF'
#!/bin/sh
set -e
DGTOOLS_IMAGE=${DGTOOLS_IMAGE:-"deepgram/onprem-dgtools:2.0.0"}
docker pull $DGTOOLS_IMAGE
docker run \
-t \
-i \
--gpus all \
--volume $DGTOOLS_RESOURCE_DIR:/dgtraining \
--volume $HOME:$HOME \
--user $(id -u):$(id -g) \
--rm $DGTOOLS_IMAGE \
"$(pwd)" "$@"
EOF

Make the script executable:

chmod +x $DGTOOLS_RESOURCE_DIR/dgtools

Create the dgtools.toml license file

Use the following command to create the dgtools.toml license file in the dgtools directory:

echo -e "[license]\n\nkey = \"REPLACE_WITH_YOUR_KEY\"\nserver_url = \"https://license.deepgram.com\"" > $DGTOOLS_RESOURCE_DIR/dgtools.toml
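Be sure to replace REPLACE_WITH_YOUR_KEY with your actual license key. A quick check like the following sketch can catch a forgotten placeholder before you start training (a demo directory is used here so the example is self-contained):

```shell
# Minimal sketch: fail fast if dgtools.toml still contains the key placeholder.
DEMO_DIR=/tmp/dgtools-demo
mkdir -p "$DEMO_DIR"
printf '[license]\n\nkey = "REPLACE_WITH_YOUR_KEY"\nserver_url = "https://license.deepgram.com"\n' > "$DEMO_DIR/dgtools.toml"

if grep -q 'REPLACE_WITH_YOUR_KEY' "$DEMO_DIR/dgtools.toml"; then
    echo "dgtools.toml still contains the placeholder; edit it before running dgtools" >&2
    PLACEHOLDER_PRESENT=1
else
    PLACEHOLDER_PRESENT=0
fi
```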

Copy any data sets to be used for model training into the new dgtools/dgdatasets subdirectory:

cp -r [DATASET_DIRECTORY] $DGTOOLS_RESOURCE_DIR/dgdatasets/.

Testing the Installation

Download the Docker container image for DGTools and test the installation (start a new shell or run source $HOME/.bashrc first so that the dgtools script is on your PATH):

dgtools version

The output should look similar to:

2.0.0: Pulling from deepgram/onprem-dgtools
Digest: sha256:cf4174d64da30db9c233c9ffc85e035341b838e938082f76a74f7c5ea76db746
Status: Image is up to date for deepgram/onprem-dgtools:2.0.0
docker.io/deepgram/onprem-dgtools:2.0.0
Copyright © 2022 Deepgram, Inc.
DGTools v2.0.0

Using DGTools

If you are migrating from dgtoolsbeta

  1. Delete the old alignment data in any data set subdirectory by deleting the resource_properties folder.

  2. Delete the old feature sets in the $DGTOOLS_RESOURCE_DIR location by deleting the .features_cache hidden directory.

  3. Delete the old training splits metadata (which files are used for training and which for validation) in the $DGTOOLS_RESOURCE_DIR/.train_meta location by deleting the datasetsplits.meta_jm hidden directory.

DGTools autotrain subcommand

The primary subcommand for dgtools is dgtools autotrain. It allows a user to custom-train an initial, base model provided by Deepgram. Additionally, users can rename models and “top-off” a model with additional training data.

Training Data

DGTools searches for training data in dgtools/dgdatasets. You can copy new data sets into subdirectories within dgtools/dgdatasets, and DGTools will automatically find them the next time it runs. Data sets must contain audio files and associated labeled audio transcript text files; a given audio file and its associated transcript must share the same file name with different file extensions. Each data set must be self-contained within its own subdirectory (for example, $DGTOOLS_RESOURCE_DIR/dgdatasets/example_data_set).
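The audio/transcript pairing requirement can be checked before training. The sketch below is illustrative only; it assumes .wav audio and .txt transcripts (your extensions may differ) and uses demo files so it is self-contained:

```shell
# Hedged sketch: verify every .wav in a data set has a same-named .txt transcript.
DATASET=/tmp/example_data_set
mkdir -p "$DATASET"
# Demo files: one properly paired sample and one missing its transcript.
touch "$DATASET/call_001.wav" "$DATASET/call_001.txt" "$DATASET/call_002.wav"

MISSING=""
for audio in "$DATASET"/*.wav; do
    transcript="${audio%.wav}.txt"
    if [ ! -f "$transcript" ]; then
        echo "missing transcript for $audio" >&2
        MISSING="$MISSING $audio"
    fi
done
```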

Terminal sessions and tmux

We recommend running DGTools inside a tmux session so that training is not interrupted if your terminal connection drops.

You can start a new named session with the tmux new -s SESSION_NAME command.

You can list the current tmux sessions with the tmux ls command.

You can reattach to a tmux session with the tmux attach -t SESSION_ID command, where SESSION_ID matches your session ID from the current sessions list.

Using autotrain from scratch

To use DGTools to train an initial, base model, provide the path location of that model file to the dgtools autotrain subcommand. DGTools autotrain takes the path to the base model to train as its only required parameter. It will train the model with the data sets from the dgdatasets directory and produce a *.last.dg file (for example, general.last.dg).

dgtools autotrain ./general.dg

The command's output includes a sample countdown that indicates how many samples remain in the training process.

Topping-off a model with autotrain --top-off

If you are just adding additional training data to dgdatasets/ and want to re-run autotrain, you can "top off" the model using the --top-off flag to skip adding new vocabulary words and speed up the operation.

dgtools autotrain ./general.last.dg --top-off

Revisiting dgdatasets/ with --revisit-count N

If you want to adjust the number of times a given sample within dgdatasets is revisited during autotrain, you can pass --revisit-count N, where N is the number of times a sample should be revisited. By default, dgtools autotrain revisits samples 20 times, whereas dgtools autotrain --top-off revisits samples 10 times.

dgtools autotrain ./general.last.dg --top-off --revisit-count 100

The default --revisit-count values for autotrain and topping-off were chosen to find a balance with training and the risk of “over-fitting” a model. A model is considered to be “over-fit” when training starts to negatively impact its validation accuracy. Validation accuracy is a useful measure for determining when a model’s training is “complete” and is something which DGTools uses to create model checkpoints.

Continuous Checkpointing

DGTools creates “checkpoint” models as it trains. It does this by tracking the validation accuracy of the intermediate models it produces while training and preserving the best two copies as *.best.VALIDATION_ACCURACY.dg files (for example, general.best.842.dg), where VALIDATION_ACCURACY is a three-digit number that can be read as a percentage (842 corresponds to “84.2%”).
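For illustration, the VALIDATION_ACCURACY component of a checkpoint filename can be converted back to a percentage with a little shell. This is a hedged sketch assuming the three-digit encoding described above:

```shell
# Hedged sketch: turn a checkpoint filename like general.best.842.dg
# into a human-readable validation accuracy ("84.2%").
CHECKPOINT="general.best.842.dg"

# Strip the trailing .dg, then take the last dot-separated field.
ACCURACY_RAW=$(echo "${CHECKPOINT%.dg}" | awk -F. '{print $NF}')

# Insert a decimal point before the final digit: 842 -> 84.2
ACCURACY="${ACCURACY_RAW%?}.${ACCURACY_RAW#${ACCURACY_RAW%?}}%"
echo "$CHECKPOINT -> $ACCURACY"
```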

In addition to the two best checkpoint models, DGTools also preserves the two most recent checkpoint models. A model is saved as a recent checkpoint only if it is not already one of the two best models.

Marking dgdatasets/ Data Samples for Inclusion in Training or Validation

In a given data set subdirectory within dgdatasets, there may exist two different files:

  • keep_train.lst may contain a source identifier (for example, a filename without the path or file extension) of files to attempt to include in the training list.
  • keep_validate.lst may contain a source identifier (for example, a filename without the path or file extension) of files to attempt to include in the validation list.

Comments (lines beginning with #) in either file will be ignored.

If a source identifier was already assigned to the training or validation list during a previous model training iteration, a warning will be printed to the screen and the new inclusion designation will be ignored. This designation persists in DGTools unless you follow the steps described in the “If you are migrating from dgtoolsbeta” section of this guide.
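As an illustration of the comment handling, the sketch below strips comment lines and blank lines from a keep_train.lst file to recover the bare source identifiers. The exact parsing DGTools performs may differ; this is only a sketch over a self-contained demo file:

```shell
# Hedged sketch: extract source identifiers from keep_train.lst,
# ignoring comment lines and blank lines.
cat > /tmp/keep_train.lst <<'EOF'
# samples hand-picked for training
call_001
call_007

# more samples
call_019
EOF

IDENTIFIERS=$(grep -v '^#' /tmp/keep_train.lst | grep -v '^$')
echo "$IDENTIFIERS"
```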

DGTools makedeployable subcommand

The makedeployable subcommand packages a trained model file (such as a .last or .best file produced by autotrain) into a deployable model file for use with your Deepgram on-premises deployment.

Providing makedeployable an output model name and tags

To output a model with a specific filename, use the --name and --tags flags to provide the name and any tags for the model’s metadata. You can specify an arbitrary number of string tags via the --tags flag. If you do not use these flags, the name is derived from the model’s existing metadata rather than the new model’s metadata being updated with a name you supply. The .last model file should generally be used as the deployable file, but as a best practice we recommend also evaluating the .best model files for deployment in addition to the .last model file.

$ dgtools makedeployable general.last.dg --name deployed_model --tags 1
Initializing the environment...
Complete.
Created deployed_model_en__1_deployable.dg.

$ ls
general.dg general.last.dg deployed_model_en__1_deployable.dg

Other DGTools subcommands

In addition to autotrain and makedeployable, DGTools provides subcommands for printing built-in help, listing the supported training languages, retrieving and printing a model file’s metadata information, and printing the DGTools version information.

help

The help subcommand can be used to print out the built-in help information for DGTools:

$ dgtools help
Usage:
dgtools [SUBCOMMAND] [OPTIONS]
Valid subcommands are: autotrain, makedeployable, modelinfo, listlang, help, version. For help on individual subcommands, type the '--help' option after the subcommand.
See https://developers.deepgram.com/on-prem/optional-components/dgtools/ for more information.

listlang

The listlang subcommand can be used to report the supported training languages:

$ dgtools listlang
Currently supported languages: en

modelinfo

The modelinfo subcommand can be used to retrieve and print a given model’s metadata information:

$ dgtools modelinfo ./deployed_model_en__1_deployable.dg
Name: deployed_model
UUID: a965179c-a8c3-457d-bc67-32b66f579656
Version: 2021-08-18
Tags: topped-off, example
Languages: en

version

The version subcommand can be used to print the current DGTools version information:

$ dgtools version
Copyright © 2022 Deepgram, Inc. 
DGTools v2.0.0

You can also use the following docker command to output the DGTools container image information:

docker image inspect deepgram/onprem-dgtools:2.0.0
