Hotpepper provides administrators with tools that allow them to manage datasets, including adding datasets, activating and deactivating datasets, prioritizing datasets, and packaging labeled data. To access these tools, navigate to Dashboard > Admin > Datasets.
Hotpepper allows administrators to track progress for dataset processing, including:
Administrators can control which audio files show in the labeling level queues by activating and deactivating datasets. These actions are not permanent and may be changed at any time.
To show a dataset:
To hide a dataset:
Administrators can control the order in which files are assigned to users for labeling by prioritizing datasets. Files from datasets marked as higher priority are assigned to users before files from datasets marked as lower priority.
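The ordering rule described above can be pictured with a small sketch. It is an illustration only and does not reflect Hotpepper's actual implementation: the `QueuedFile` record, the `assignment_order` function, and the convention that a larger integer means higher priority are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedFile:
    """Hypothetical record for an audio file waiting to be labeled."""
    name: str
    dataset: str


def assignment_order(files, dataset_priority):
    """Return files ordered so that higher-priority datasets come first.

    `dataset_priority` maps a dataset name to an integer; larger means
    higher priority (an assumption of this sketch). Datasets missing
    from the mapping default to priority 0.
    """
    # sorted() is stable, so files keep their original relative order
    # within the same priority level.
    return sorted(files, key=lambda f: -dataset_priority.get(f.dataset, 0))


queue = [
    QueuedFile("a1.mp3", "dataset1"),
    QueuedFile("b1.mp3", "dataset2"),
    QueuedFile("a2.mp3", "dataset1"),
]
ordered = assignment_order(queue, {"dataset1": 0, "dataset2": 1})
print([f.name for f in ordered])
```

Here the file from `dataset2` is listed first because its dataset carries the higher priority value; the two `dataset1` files follow in their original order.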
To prioritize a dataset:
Once all audio files in a dataset have been labeled, administrators can package the data so it can be used to train a custom model using Deepgram’s DG Tools.
As part of the required Hotpepper setup, the location where packaged datasets are stored should already have been configured in Hotpepper’s config.toml file:
```toml
# A directory for outputting packaged datasets.
packaged_dataset_location = "/packaged"
```
When you configure Hotpepper using the config.toml file, keep in mind that the configured paths should match the paths mapped in the Docker Compose file used to deploy Deepgram products. These are not file paths on your underlying file system; rather, they are file paths in the Docker container’s file system. To learn more, visit On-premise Survival Guide: Get and Configure Deepgram Products.
If the package directory has been configured, you can package the dataset:
Packaging creates a tar archive of audio and text label files in the configured directory.
Before a new dataset can be added to the Hotpepper database, the audio files must be collected and placed in a subdirectory on the underlying file system. Once the audio files are in place, the dataset can be added to the database through Hotpepper.
As part of the required Hotpepper setup, the location where audio data is stored should already have been configured in Hotpepper’s config.toml file:
```toml
# Path to the directory containing input datasets. New datasets are
# created by adding subdirectories to this folder and placing audio
# data there. This directory should be structured like so:
#   /datasets/
#   |_ dataset1/
#      |_ audio1.mp3
#      |_ audio2.mp3
datasets = "/datasets"
```
When you configure Hotpepper using the config.toml file, keep in mind that the configured paths should match the paths mapped in the Docker Compose file used to deploy Deepgram products. These are not file paths on your underlying file system; rather, they are file paths in the Docker container’s file system. To learn more, visit On-premise Survival Guide: Get and Configure Deepgram Products.
If the datasets directory has been configured, you can create the audio collection(s) in the configured directory:
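Audio files can be in mp3, wav, or ogg format. To use minimal bandwidth, Deepgram recommends mp3.
Optionally, you can pre-populate transcripts for reviewers by pre-seeding file transcripts in the Hotpepper database:
For an audio file in a dataset subdirectory (for example, file1.mp3), save its transcript to a text file with the same base name (for example, file1.txt) in the same subdirectory.
For example, let’s say our configured datasets directory is structured like so: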
```toml
# /datasets/
# |_ dataset1/
#    |_ audio1.mp3
#    |_ audio2.mp3
```
To pre-populate the Transcript field when a user reviews the audio1.mp3 file, we would save the transcript of audio1.mp3 to a file called audio1.txt and move it into the dataset1 subdirectory:
```toml
# /datasets/
# |_ dataset1/
#    |_ audio1.mp3
#    |_ audio1.txt  # contains the transcript of audio1.mp3
#    |_ audio2.mp3
```
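Before adding a dataset, it can help to confirm that the subdirectory contains only the expected files and to see which audio files already have a pre-seeded transcript. The following is a minimal sketch, not part of Hotpepper: the `check_dataset` helper and its report keys are hypothetical, and matching file extensions case-insensitively is an assumption. It only inspects the directory and does not touch the Hotpepper database.

```python
import tempfile
from pathlib import Path

# Audio formats named in this guide.
SUPPORTED_AUDIO = {".mp3", ".wav", ".ogg"}


def check_dataset(dataset_dir):
    """Summarize the files in one dataset subdirectory.

    Returns a dict of sorted file names:
      audio      - files in a supported audio format
      seeded     - audio files with a same-named .txt transcript
      unseeded   - audio files without a transcript
      unexpected - files that are neither supported audio nor .txt
    """
    dataset_dir = Path(dataset_dir)
    files = sorted(p for p in dataset_dir.iterdir() if p.is_file())
    audio = [p for p in files if p.suffix.lower() in SUPPORTED_AUDIO]
    # audio1.mp3 is considered seeded when audio1.txt sits next to it.
    seeded = [p for p in audio if p.with_suffix(".txt").is_file()]
    unseeded = [p for p in audio if p not in seeded]
    unexpected = [
        p for p in files
        if p not in audio and p.suffix.lower() != ".txt"
    ]
    return {
        "audio": [p.name for p in audio],
        "seeded": [p.name for p in seeded],
        "unseeded": [p.name for p in unseeded],
        "unexpected": [p.name for p in unexpected],
    }


# Recreate the example layout in a temporary directory and check it.
with tempfile.TemporaryDirectory() as root:
    dataset1 = Path(root) / "dataset1"
    dataset1.mkdir()
    (dataset1 / "audio1.mp3").touch()
    (dataset1 / "audio2.mp3").touch()
    (dataset1 / "audio1.txt").write_text("transcript of audio1.mp3\n")
    report = check_dataset(dataset1)

print(report)
```

In this layout, audio1.mp3 is reported as seeded and audio2.mp3 as unseeded, so a reviewer would see a pre-populated Transcript field only for audio1.mp3.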
Now that you have configured the dataset directory and created the audio collection, you must add the dataset to the Hotpepper database: