Managing Datasets

Last updated 06/18/2021

Hotpepper gives administrators tools to manage datasets: adding datasets, activating and deactivating them, prioritizing them, and packaging labeled data. To access these tools, navigate to Dashboard > Admin > Datasets.

Hotpepper Administrator Dataset View

Reviewing Progress

Hotpepper allows administrators to track progress for dataset processing, including:

  • Total audio hours completed in the past 24 hours
  • Total files completed in the past 24 hours
  • Total audio hours completed over the past 7 days
  • Total files completed over the past 7 days
  • Total files and audio hours completed in the past 24 hours per level for each dataset
  • All-time total audio hours completed and file status for each dataset

Showing/Hiding Datasets

Administrators can control which audio files appear in the labeling level queues by activating (showing) and deactivating (hiding) datasets. These actions are not permanent and can be reversed at any time.

Show Dataset

To show a dataset:

  1. Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
  2. Locate the dataset you want to show, and in the Status column, select Activate.

Hide Dataset

To hide a dataset:

  1. Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
  2. Locate the dataset you want to hide, and in the Status column, select Deactivate.

Prioritizing Datasets

Administrators can control the order in which files are assigned to users for labeling by prioritizing datasets. Files from datasets marked as higher priority are assigned to users before files from datasets marked as lower priority.

To prioritize a dataset:

  1. Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
  2. Locate the dataset you want to prioritize, and in the Priority column, enter its desired position in the queue.
  3. Select the green checkmark to save.
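
The ordering rule described above can be illustrated with a minimal sketch. This is not Hotpepper's implementation: the Dataset class, the queue_position field, and the assumption that a smaller number means an earlier place in the queue are all hypothetical and used only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical stand-in for a dataset row; not Hotpepper's schema."""
    name: str
    queue_position: int  # assumption: 1 means the front of the queue
    files: list[str]


def assignment_order(datasets: list[Dataset]) -> list[str]:
    """Return files in the order they would be handed out under this sketch."""
    # Datasets earlier in the queue contribute all of their files first.
    ordered = sorted(datasets, key=lambda d: d.queue_position)
    return [name for d in ordered for name in d.files]


if __name__ == "__main__":
    queue = [
        Dataset("dataset2", 2, ["b1.mp3"]),
        Dataset("dataset1", 1, ["a1.mp3", "a2.mp3"]),
    ]
    print(assignment_order(queue))
```

How Hotpepper interprets the value in the Priority column may differ from this sketch, so confirm the resulting order in the labeling queues after saving.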

Packaging Labeled Data

Once all audio files in a dataset have been labeled, administrators can package the data so it can be used to train a custom model using Deepgram’s DG Tools.

Configure Package Directory

As part of the required Hotpepper setup, the directory for storing packaged datasets should already be configured in Hotpepper’s config.toml file:

# A directory for outputting packaged datasets.
packaged_dataset_location = "/packaged"

When you configure Hotpepper using the config.toml file, keep in mind that the configured paths should match the paths mapped in the Docker Compose file used to deploy Deepgram products. These are not file paths on your underlying file system; rather, they are file paths in the Docker container’s file system. To learn more, visit On-Premise Guide: Get and Configure Deepgram Products.

Package Dataset

If the package directory has been configured, you can package the dataset:

  1. Navigate to Dashboard > Admin > Datasets, and locate the dataset listing.
  2. Locate the dataset you want to package, and select Package. Hotpepper will create a tar archive of audio and text label files in the configured directory.
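
After packaging, it can be useful to confirm what the archive contains before passing it on for training. The sketch below uses Python's standard tarfile module to list the regular files in an archive. The archive path is a placeholder: this guide does not state how Hotpepper names the archive, so substitute the file that appears in your configured package directory.

```python
import tarfile


def list_package(archive_path: str) -> list[str]:
    """Return the sorted names of regular files inside a tar archive."""
    # "r:*" lets tarfile detect the compression (if any) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


if __name__ == "__main__":
    # Placeholder path; replace with the archive Hotpepper produced.
    for name in list_package("/packaged/example.tar"):
        print(name)
```

Keep in mind that /packaged is a path inside the container; on the host, look in the directory that is mapped to it.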

Adding Datasets

Before a new dataset can be added to the Hotpepper database, the audio files must be collected and placed in a subdirectory on the underlying file system. Once the audio files are in place, the dataset can be added to the database through Hotpepper.

Configure Dataset Directory

As part of the required Hotpepper setup, the directory for storing audio data should already be configured in Hotpepper’s config.toml file:

# Path to the directory containing input datasets. New datasets are
# created by adding subdirectories to this folder and placing audio 
# data there. This directory should be structured like so:
# /datasets/
#   |_ dataset1/
#      |_ audio1.mp3
#      |_ audio2.mp3
datasets = "/datasets"

When you configure Hotpepper using the config.toml file, keep in mind that the configured paths should match the paths mapped in the Docker Compose file used to deploy Deepgram products. These are not file paths on your underlying file system; rather, they are file paths in the Docker container’s file system. To learn more, visit On-Premise Guide: Get and Configure Deepgram Products.

Create Audio Collection

If the dataset directory has been configured, you can create the audio collection(s) in the configured directory:

  1. Navigate to the configured directory, and create one or more subdirectories. Each subdirectory represents a dataset.
  2. Inside each subdirectory, add one or more audio files that belong to the corresponding dataset. Hotpepper accepts audio formats that can be played by the HTML5 audio player element (mp3, wav, or ogg). To use minimal bandwidth, Deepgram recommends mp3.
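
Before adding a dataset, you may want to check that each subdirectory contains audio in one of the formats listed above. The sketch below walks the datasets directory and reports the audio files found per dataset. It matches on file extension only, case-insensitively; both choices are assumptions made for illustration and may not reflect how Hotpepper itself detects audio files.

```python
from pathlib import Path

# Formats named in this guide; matching by extension is an assumption.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg"}


def find_audio(datasets_root: str) -> dict[str, list[str]]:
    """Map each dataset subdirectory name to its sorted audio file names."""
    found: dict[str, list[str]] = {}
    for sub in sorted(Path(datasets_root).iterdir()):
        if not sub.is_dir():
            continue  # only subdirectories represent datasets
        found[sub.name] = sorted(
            f.name
            for f in sub.iterdir()
            if f.is_file() and f.suffix.lower() in AUDIO_EXTENSIONS
        )
    return found


if __name__ == "__main__":
    # "/datasets" is the container path from config.toml; on the host,
    # use the directory that is mapped to it.
    for dataset, files in find_audio("/datasets").items():
        print(dataset, files)
```

A dataset that maps to an empty list has no recognized audio files yet.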

Pre-populate Transcripts (Optional)

Optionally, you can pre-populate transcripts for reviewers by pre-seeding file transcripts in the Hotpepper database:

  1. Locate the audio file for which you would like to pre-seed a transcript (for example, file1.mp3).
  2. Save the transcript of the audio file to a text file with the same name as the audio file (for example, file1.txt).
  3. Move the transcript file into the same subdirectory as the audio file.

For example, let’s say our configured datasets directory is structured like so:

# /datasets/
#   |_ dataset1/
#      |_ audio1.mp3
#      |_ audio2.mp3

To pre-populate the Transcript field when a user reviews the audio1.mp3 file, we would save the transcript of audio1.mp3 to a file called audio1.txt and move it into the dataset1 subdirectory:

# /datasets/
#   |_ dataset1/
#      |_ audio1.mp3
#      |_ audio1.txt  # contains the transcript of audio1.mp3
#      |_ audio2.mp3
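
The naming convention above (same file name, .txt extension, same subdirectory) can be checked with a short script before the dataset is added. This is a minimal sketch, not Hotpepper's scanning code: the preseed_transcripts helper is hypothetical, and the UTF-8 encoding and whitespace stripping are assumptions.

```python
from pathlib import Path
from typing import Optional

# Formats named in this guide; matching by extension is an assumption.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg"}


def preseed_transcripts(dataset_dir: str) -> dict[str, Optional[str]]:
    """Map each audio file name to its pre-seed transcript, or None."""
    transcripts: dict[str, Optional[str]] = {}
    for audio in sorted(Path(dataset_dir).iterdir()):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        # audio1.mp3 pairs with audio1.txt in the same subdirectory.
        sidecar = audio.with_suffix(".txt")
        transcripts[audio.name] = (
            sidecar.read_text(encoding="utf-8").strip()
            if sidecar.is_file()
            else None
        )
    return transcripts


if __name__ == "__main__":
    for name, text in preseed_transcripts("/datasets/dataset1").items():
        print(name, "->", "pre-seeded" if text is not None else "no transcript")
```

Files reported with no transcript are not an error, since pre-seeding is optional.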

Add Dataset to Database

Now that you have configured the dataset directory and created the audio collection, you must add the dataset to the Hotpepper database:

  1. Navigate to Dashboard > Admin > Datasets, and locate the Add a Dataset section.
  2. Enter the Dataset Name.
  3. If the dataset contains data subject to HIPAA regulations, select Contains HIPAA data. Access will be restricted to only users marked as HIPAA authorized. To learn how to authorize a HIPAA user, visit Hotpepper User Guide: Adding Users.
  4. Select Add Dataset. Hotpepper will scan the configured dataset directory, find all audio files, calculate durations, read any provided pre-seed transcripts, and enter a row for each audio file in the database.