Automatically Generating WebVTT & SRT Captions

One use for the Deepgram API includes providing captions for audio and video, which is critical for accessibility. In this guide, you’ll learn how to automatically generate WebVTT and SRT captions for an audio file. We will provide two sets of code samples—one without using the Deepgram SDK so you can see the technique, and one using Deepgram’s SDKs to make it even easier.

If you’d like to learn more about inclusive design and accessibility, we recommend checking out Microsoft’s Inclusive Toolkit.

Before You Begin

Before you run the code, you’ll need to do a few things.

Create a Deepgram Account

Before you can use Deepgram, you’ll need to create a Deepgram account. Signup is free and includes $200 in free credit and access to all of Deepgram’s features!

Create a Deepgram API Key

Before you start, you’ll need to follow the steps in the Make Your First API Request guide to obtain a Deepgram API key, and configure your environment if you are choosing to use a Deepgram SDK.

Configure the Environment

We assume you have already configured a Node.js or Python development environment on your machine. Download Node.js.

If you get stuck at any point, you can find help in our Github community discussions!

Getting Familiar with Captioning Formats

In this tutorial, we are going to work with two common and similar caption formats: WebVTT and SRT. Both formats contain only subtitle information, which must be added to video for a final product. When the caption files are loaded into a compatible video platform, captions will be displayed in the foreground of media, as per the information contained in that file.

WebVTT Files

Web Video Text Track (WebVTT) files generally consist of a sequence of text segments associated with a time-interval, called a cue. It is mainly used to mark up external text track resources in connection with the HTML <track> element. WebVTT files provide captions or subtitles for video content, and also text video descriptions [MAUR], chapters for content navigation, and more generally any form of metadata that is time-aligned with audio or video content. To learn more, visit W3C’s WebVTT: The Web Video Text Tracks Format.

An example WebVTT file:

.vtt

WEBVTT
NOTE
Transcription provided by Deepgram
Request Id: 686278aa-d315-4aeb-b2a9-713615544366
Created: 2023-10-27T15:35:56.637Z
Duration: 25.933313
Channels: 1
00:00:00.080 --> 00:00:03.220
Yeah. As as much as, it's worth celebrating,
00:00:04.400 --> 00:00:05.779
the first, spacewalk,
00:00:06.319 --> 00:00:07.859
with an all female team,
00:00:08.475 --> 00:00:10.715
I think many of us are looking forward
00:00:10.715 --> 00:00:13.215
to it just being normal and
00:00:13.835 --> 00:00:16.480
I think if it signifies anything, It is
00:00:16.779 --> 00:00:18.700
to honor the the women who came before
00:00:18.700 --> 00:00:21.680
us who, were skilled and qualified,
00:00:22.300 --> 00:00:24.779
and didn't get the same opportunities that we
00:00:24.779 --> 00:00:25.439
have today.

SRT Files

SubRip Text (SRT) files also generally consist of a sequence of text segments associated with a time-interval. To learn more, visit the open source Matroska multimedia container format website.

An example SRT file:

.srt

1
00:00:00,080 --> 00:00:03,220
Yeah. As as much as, it's worth celebrating,
2
00:00:04,400 --> 00:00:07,859
the first, spacewalk, with an all female team,
3
00:00:08,475 --> 00:00:10,715
I think many of us are looking forward
4
00:00:10,715 --> 00:00:14,235
to it just being normal and I think
5
00:00:14,235 --> 00:00:17,340
if it signifies anything, It is to honor
6
00:00:17,340 --> 00:00:19,820
the the women who came before us who,
7
00:00:20,140 --> 00:00:23,580
were skilled and qualified, and didn't get the
8
00:00:23,580 --> 00:00:25,439
same opportunities that we have today.

Note that both WebVTT and SRT are similar in their basic forms—the difference is that the millisecond separator is . in WebVTT and , in SRT.

Transcribing Captions

Now that you understand the basics of the WebVTT and SRT captioning formats, you can start transcribing your captions.

This article covers creating subtitle files from files submitted to the Deepgram pre-recorded audio API. For an example of creating subtitle files with the Deepgram live streaming API, check out the streaming test suite.

Choose an Audio File

Locate a hosted audio file that you would like to caption and make note of its URL. If you can’t find one, you can use <https://static.deepgram.com/examples/deep-learning-podcast-clip.wav>.

Install the SDK

Open your terminal, navigate to the location on your drive where you want to create your project, and install the Deepgram SDK.

$ npm i @deepgram/sdk
> # captions also available as a standalone package
> # npm i @deepgram/captions

Write the Code

In your terminal, create a new index.js or index.py file in your project’s location, open it in your code editor, and populate it with code.

Set Up Dependencies

Initialize your dependencies:

1 import "fs" from "fs";
2 import { createClient, webvtt, srt } from "@deepgram/sdk";
3 
4 const deepgram = createClient("YOUR_API_KEY");

Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.

Get the Transcript

To receive timestamps of phrases to include in your caption files, ask Deepgram to include utterances (a chain of words or, more simply, a phrase):

1 const { result, error } =  deepgram.listen.prerecorded.transcribeUrl(
2   {
3 	  url: "YOUR_FILE_LOCATION",
4   },
5   { smart_format: true, utterances: true }
6 );

Replace YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.

Create a Write Stream

With JavaScript, open a writable stream, so you will be able to insert text directly into your file. When you open your stream, you should pass in the a flag, so that any time you write data to the stream, it will be appended to the end. Add inside the .then() block.

JavaScript

1 // WebVTT Filename
2 const stream = fs.createWriteStream("output.vtt", { flags: "a" });
3 
4 // SRT Filename
5 const stream = fs.createWriteStream("output.srt", { flags: "a" });

Write the Captions

The WebVTT and SRT formats are very similar, and each requires a block of text per utterance.

1 const captions = webvtt(result);
2 
3 stream.write(captions);

Deepgram provides seconds back as a number (15.4 means 15.4 seconds), but both formats require times as HH:MM:SS.milliseconds and getting the end of a Date().toISOString() will achieve this for us.

1 const captions = srt(result);
2 
3 stream.write(captions);

The differences between the non-SDK WebVTT and SRT code include:

The WebVTT code has a WEBVTT line at the top, whereas the SRT code does not.
The millisecond separator is . for WebVTT whereas it is , for SRT.
In the WebVTT file, there is a - before the utterance, whereas in the SRT code, there is not.