Managing Datasets
Hotpepper provides administrators with tools that allow them to manage datasets, including adding datasets, activating and deactivating datasets, prioritizing datasets, and packaging labeled data. To access these tools, navigate to Dashboard > Admin > Datasets.

Reviewing Progress

Hotpepper allows administrators to track progress for dataset processing, including:
- Total audio hours completed in the past 24 hours
- Total files completed in the past 24 hours
- Total audio hours completed over the past 7 days
- Total files completed over the past 7 days
- Total files and audio hours completed in the last 24 hours per level for each dataset
- Total audio hours completed for all time and file status for each dataset
Showing/Hiding Datasets

Administrators can control which audio files show in the labeling level queues by activating and deactivating datasets. These actions are not permanent and may be changed at any time.
Show Dataset
To show a dataset:
-
Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
-
Locate the dataset you want to show, and in the Status column, select Activate.
Hide Dataset
To hide a dataset:
-
Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
-
Locate the dataset you want to show, and in the Status column, select Deactivate.
Prioritizing Datasets

Administrators can control the order in which files are assigned to users for labeling by prioritizing datasets. Files from datasets marked as higher priority are assigned to users before files from datasets marked as lower priority.
To prioritize a dataset:
-
Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
-
Locate the dataset you want to prioritize, and in the Priority column, enter the desired location in the queue.
-
Select the green checkmark to save.
Packaging Labeled Data

Once all audio files in a dataset have been labeled, administrators can package the data so it can be used to train a custom model using Deepgram’s DG Tools.
Configure Package Directory
As part of the required Hotpepper setup, the location where packaged datasets should be stored should already have been configured in Hotpepper’s config.toml
file:
# A directory for outputting packaged datasets.
packaged_dataset_location = "/packaged"
When you configure Hotpepper using the config.toml
file, keep in mind that the configured settings should match what are mapped in the Docker Compose file when deploying Deepgram products. These are not file paths on your actual underlying file system; rather, they are file paths in the Docker container’s file system. To learn more, visit On-Premise Guide: Get and Configure Deepgram Products.
Package Dataset
If the package directory has been configured, you can package the dataset:
-
Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
-
Locate the dataset you want to package, and select Package. Hotpepper will create a
tar
archive of audio and text label files in the configured directory.
Adding Datasets

Before a new dataset can be added to the Hotpepper database, the audio files must be collected and placed in a subdirectory on the underlying file system. Once the audio files are in place, the dataset can be added to the database through Hotpepper.
Configure Dataset Directory
As part of the required Hotpepper setup, the location where audio data should be stored should already have been configured in Hotpepper’s config.toml
file:
# Path to the directory containing input datasets. New datasets are
# created by adding subdirectories to this folder and placing audio
# data there. This directory should be structured like so:
# /datasets/
# |_ dataset1/
# |_ audio1.mp3
# |_ audio2.mp3
datasets = "/datasets"
When you configure Hotpepper using the config.toml
file, keep in mind that the configured settings should match what are mapped in the Docker Compose file when deploying Deepgram products. These are not file paths on your actual underlying file system; rather, they are file paths in the Docker container’s file system. To learn more, visit Hotpepper User Guide: Setting Up.
Create Audio Collection
If the package directory has been configured, you can create the audio collection(s) in the configured directory:
-
Navigate to the configured directory, and create one or more subdirectories. Each subdirectory represents a dataset.
-
Inside each subdirectory, add one or more audio files that belong to the corresponding dataset. Hotpepper accepts audio formats that can be played by the HTML5 audio player element (
mp3
,wav
, orogg
). To use minimal bandwidth, Deepgram recommendsmp3
.
Pre-populate Transcripts (Optional)
Optionally, you can pre-populate transcripts for reviewers by pre-seeding file transcripts in the Hotpepper database:
-
Locate the audio file for which you would like to pre-seed a transcript (for example,
file1.mp3
). -
Save the transcript of the audio file to a text file with the same name as the audio file (for example,
file1.txt
). -
Move the file containing the transcript into the same subdirectory as the located file.
For example, let’s say our configured datasets
directory is structured like so:
# /datasets/
# |_ dataset1/
# |_ audio1.mp3
# |_ audio2.mp3
To pre-populate the Transcript field when a user reviews the audio1.mp3
file, we would save the transcript of audio1.mp3
to a file called audio1.txt
and move it into the dataset1
subdirectory:
# /datasets/
# |_ dataset1/
# |_ audio1.mp3
# |_ audio1.txt # contains the transcript of audio1.mp3
# |_ audio2.mp3
Add Dataset to Database
Now that you have configured the dataset directory and created the audio collection, you must add the dataset to the Hotpepper database:
-
Navigate to Dashboard > Admin > Datasets, and locate the Add a Dataset section.
-
Enter the Dataset Name.
-
If the dataset contains data subject to HIPAA regulations, select Contains HIPAA data. Access will be restricted to only users marked as HIPAA authorized. To learn how to authorize a HIPAA user, visit Hotpepper User Guide: Adding Users.
-
Select Add Dataset. Hotpepper will scan the configured dataset directory, find all audio files, calculate durations, read any provided pre-seed transcripts, and enter a row for each audio file in the database.