Migrating From Google Speech-to-Text (STT) to Deepgram
Learn how to migrate from Google Speech-to-Text (STT) to Deepgram. This guide is for developers who currently use Google STT and want to move to Deepgram.
This guide covers migrating your transcription processes from Google Speech-to-Text (STT) to Deepgram, including key differences, code examples, and best practices.
Getting Started
Before you can use Deepgram, you’ll need to create a Deepgram account. Signup is free and includes $200 in free credit and access to all of Deepgram’s features!
Next, follow the steps in the Make Your First API Request guide to obtain a Deepgram API key, and configure your environment if you choose to use a Deepgram SDK.
Migration Process
During the migration, you will need to obtain a Deepgram API key, update your API calls, and adjust how you handle the responses. The sections below walk through the key differences.
Differences
Once you’ve selected your model, Deepgram provides many features and capabilities to help you transcribe and classify your audio. However, some capabilities and concepts are implemented differently from Google STT.
Key Implementation Differences
Both APIs share common features but differ in terminology and implementation:
General
Both Deepgram and Google provide you with the following values by default:
- Words
- Timing
- Confidence
Additionally, Google provides the following values by default:
- alternatives
- channel_tag
- language_code
Deepgram Default JSON Response
Interim Result
Final Result
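To make the comparison concrete, here is a rough sketch of the shape of a Deepgram live-streaming result. The field values below are placeholders, not real output, and the actual response includes additional metadata; interim results arrive with is_final set to false, and the finalized version of the same audio arrives with is_final set to true.

```python
# Illustrative shape of a Deepgram live-streaming result (placeholder values only).
interim_result = {
    "channel_index": [0, 1],
    "duration": 1.0,          # seconds of audio covered by this result
    "start": 0.0,             # offset of this result within the stream
    "is_final": False,        # True once Deepgram finalizes this stretch of audio
    "channel": {
        "alternatives": [
            {
                "transcript": "another big",
                "confidence": 0.93,
                "words": [
                    {"word": "another", "start": 0.1, "end": 0.5, "confidence": 0.99},
                    {"word": "big", "start": 0.5, "end": 0.9, "confidence": 0.87},
                ],
            }
        ]
    },
}
```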
Search
Deepgram provides acoustic-based search that identifies phrases by audio patterns rather than text matches. This approach finds phrases even when transcription is imperfect, offering higher accuracy than text-based search methods.
Deepgram’s Search feature accepts a query containing the word or phrase you want to find and returns any matches in the JSON response. Each match includes the start time and end time of when that phrase was possibly uttered, along with a confidence rating. In the response, you will see the query (the word or phrase you searched for) and hits (an array of objects giving the confidence, start, end, and snippet of each possible match). You can include up to 25 search terms per request.
Sample code
You can search for multiple terms individually: search=epistemology&search=warwick
You can search for a phrase. URL-encode the phrase when submitting it: search=social%20epistemology
As an example, we used a WAV audio file that contains the first 20 seconds of a college lecture on epistemology. The term “epistemology” in this audio file is sufficiently technical that our model will not transcribe it accurately, but our phonetic search will still be able to find it.
Input
search=epistemology
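As a rough sketch of that request, here is one way to send a local WAV file with a search term using Python's requests library against the /v1/listen endpoint (the file name and API key are placeholders; the Deepgram SDKs expose the same option):

```python
import requests

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

# Send a local WAV file and ask Deepgram to search for "epistemology".
with open("epistemology-lecture.wav", "rb") as audio:  # hypothetical file name
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"search": "epistemology"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )

print(response.json())
```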
Response
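The exact payload depends on your audio, but based on the fields described above, the search portion of the JSON response has roughly this shape (all values below are placeholders, not actual results):

```python
# Illustrative structure of the "search" results only -- values are placeholders.
search_results = [
    {
        "query": "epistemology",      # the word or phrase you searched for
        "hits": [
            {
                "confidence": 0.96,   # confidence that this is a real match
                "start": 10.1,        # seconds into the audio where the match begins
                "end": 11.2,          # seconds into the audio where the match ends
                "snippet": "...",     # the (possibly imperfect) transcript around the match
            }
        ],
    }
]
```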
Custom Vocabulary
Both Deepgram and Google provide users the ability to improve the accuracy of specific keywords or vocabulary. Deepgram has two methods to perform this: our Keyword Boosting feature and data-driven AI model training. Google has one primary way to perform this, which they call speech/model adaptation and model adaptation boost.
Deepgram Keyword Boosting
Deepgram’s Keywords feature improves transcription accuracy by focusing on specific terms you provide with your audio request.
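For example, a request along these lines passes keywords alongside the audio; this is a sketch against the REST endpoint, and the terms, boost value, and file name are made up:

```python
import requests

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

# Repeat the keywords parameter once per term; an optional intensifier after
# the colon raises how strongly that term is favored.
params = [
    ("keywords", "epistemology:2"),
    ("keywords", "warwick"),
]

with open("epistemology-lecture.wav", "rb") as audio:  # hypothetical file name
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params=params,
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )

print(response.json())
```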
Deepgram AI Model Training
If you have more than 200 unique keywords, or domain-specific terms that Keyword Boosting doesn't handle effectively, consider upgrading to a trained AI model. Deepgram can train a custom model using your audio samples and transcripts to learn your specific language, accents, and terminology.
Google Speech Model Adaptation
You can use the model adaptation feature to help Speech-to-Text recognize specific words or phrases more frequently than other options that might otherwise be suggested. For example, suppose that your audio data often includes the word “weather”. When Google encounters the word “weather,” you want it to transcribe the word as “weather” more often than “whether.” In this case, you might use model adaptation to bias Speech-to-Text toward recognizing “weather.”
Google Model adaptation is helpful in the following use cases:
- Improving the accuracy of words and phrases that occur frequently in your audio data. For example, you can alert the recognition model to voice commands that are typically spoken by your users.
- Expanding the vocabulary of words recognized by Google STT. If your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using model adaptation.
- Assigning weighted values to phrase items. You can use Google Model Adaptation boost to assign a weighted value to phrase items in a PhraseSet resource. Google Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.
Google Speech Adaptation also has some drawbacks. To increase the probability that Google STT recognizes a word, Google recommends that you submit key phrases through its SpeechContext object. In addition to submitting keywords and phrases, Google also recommends including Class tokens by language to identify phone numbers, addresses, radio stations, and so on. In situations where model adaptation + boosting + speech context + class tokens are necessary with Google, Deepgram’s system may already perform well without the aid of additional requests. If Deepgram does not recognize these words out of the box, you can leverage Keyword Boosting or consider upgrading to a trained AI model. To learn more, contact your Customer Success Manager.
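For comparison, here is a minimal sketch of Google's approach using the google-cloud-speech Python client with a SpeechContext and boost. The phrase, boost value, and bucket URI are illustrative, boost on SpeechContext is only available on newer API surfaces, and PhraseSet-based adaptation is configured separately:

```python
from google.cloud import speech

client = speech.SpeechClient()

# Bias recognition toward "weather" over "whether" (illustrative boost value).
weather_context = speech.SpeechContext(phrases=["weather"], boost=20.0)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[weather_context],
)
audio = speech.RecognitionAudio(uri="gs://your-bucket/your-audio.wav")  # placeholder URI

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```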
What to Expect in the JSON Response
Both Deepgram and Google will provide you with the following values:
- transcript
- start_time (duration)
- end_time (duration)
- word (string)
- confidence (float)
In addition, by default, Google will provide:
- name (if you have a set of phrases)
- value (single phrase)
- phrases
- boost
- speaker_tag
Don’t transfer keyword lists from other vendors. Start without keywords, as Deepgram may already perform well. Add keywords gradually and increase boosts carefully to avoid negative effects. See our Keyword Boosting documentation for details.
Batch Transcription
Google uses different methods based on file length, while Deepgram uses one approach for all audio lengths, from one minute to multiple hours.
Asynchronous vs. Synchronous
Google uses synchronous recognition for audio under 60 seconds and asynchronous for longer files (up to 480 minutes), requiring Google Cloud storage. Deepgram supports any audio length without storage restrictions.
What to Expect in the JSON Response
Both Deepgram and Google will provide you with the following values:
- transcript
- start_time (duration)
- end_time (duration)
- word (string)
- confidence (float)
Sample Code
Deepgram supports both transcription of files on your local machine and transcription of files stored remotely.
Deepgram
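A minimal sketch of both cases, calling the REST endpoint directly with Python's requests library (the Deepgram Python SDK wraps the same endpoint; the file name, URL, and API key are placeholders):

```python
import requests

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder
BASE_URL = "https://api.deepgram.com/v1/listen"
HEADERS = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}

# 1. Local file: send the raw audio bytes in the request body.
with open("conversation.wav", "rb") as audio:  # hypothetical file name
    local_response = requests.post(
        BASE_URL,
        headers={**HEADERS, "Content-Type": "audio/wav"},
        data=audio,
    )
print(local_response.json())

# 2. Remote file: send a small JSON body pointing at the file's URL.
remote_response = requests.post(
    BASE_URL,
    headers=HEADERS,
    json={"url": "https://example.com/audio/conversation.wav"},  # placeholder URL
)
print(remote_response.json())
```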
To see more, visit Deepgram’s Python SDK GitHub repo.
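Google Speech-to-Text
For comparison, a minimal sketch of Google's asynchronous (long-running) recognition, assuming the google-cloud-speech Python client; the bucket URI, encoding, and timeout are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
# Audio longer than about a minute must be referenced from Google Cloud Storage.
audio = speech.RecognitionAudio(uri="gs://your-bucket/conversation.wav")  # placeholder URI

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

for result in response.results:
    print(result.alternatives[0].transcript)
```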
To see more, visit Google’s documentation on transcribing a file in Cloud Storage using a transcription model.
Live Streaming
Both APIs provide streaming transcription. Deepgram offers lower latency, higher accuracy, and easier stream management.
When using Deepgram to transcribe live streaming audio, two features that can help you further understand your audio are Endpointing and Interim Results (see the streaming sketch after the list below). Both features monitor incoming live streaming audio and can indicate the end of a type of processing, but they are used in very different ways:
- Deepgram live streaming looks for any deviation in the natural flow of speech and returns a finalized response at these places. To learn more about this feature, see Endpointing.
- Deepgram live streaming can also return a series of interim transcripts followed by a final transcript. To learn more, see Interim Results.
- If your downstream natural language processing (NLP) requires complete utterances, review our Best Practices for Utterance Segmentation.
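Here is a minimal live-streaming sketch that connects to Deepgram's WebSocket endpoint with the third-party websockets package, requests interim results, and prints each transcript as it arrives. The audio source, encoding parameters, and API key are placeholders, and the Deepgram SDKs provide equivalent live clients:

```python
import asyncio
import json

import websockets  # third-party "websockets" package

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

# interim_results=true asks for interim transcripts; adjust encoding and
# sample_rate to match your audio source.
URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=16000&interim_results=true"
)

async def transcribe(chunks):
    # On newer websockets releases this keyword argument is "additional_headers".
    async with websockets.connect(
        URL, extra_headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"}
    ) as ws:

        async def sender():
            for chunk in chunks:  # chunks: an iterable of raw PCM byte strings
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                result = json.loads(message)
                alternatives = result.get("channel", {}).get("alternatives", [])
                if alternatives:
                    print(result.get("is_final"), alternatives[0].get("transcript"))

        await asyncio.gather(sender(), receiver())

# asyncio.run(transcribe(my_audio_chunks))  # supply your own audio chunks
```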
Google Speech-to-Text
Google can transcribe streaming audio, such as input from a microphone, in two ways: streaming speech recognition and endless streaming. The key differences include:
- There are audio limits for streaming speech recognition requests:
  - Synchronous requests: 1 minute
  - Asynchronous requests: 480 minutes
  - Streaming requests: 5 minutes
- Streaming speech recognition is available via gRPC only.
- Audio longer than approximately one minute must use the uri field to reference an audio file in Google Cloud Storage.
- If you need to stream content for more than five minutes, you must use endless streaming.
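For reference, a minimal streaming sketch assuming the google-cloud-speech v1 Python client; the audio chunks, encoding, and sample rate are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(config=config, interim_results=True)

def audio_requests(chunks):
    # chunks: an iterable of raw PCM byte strings, e.g. read from a microphone
    for chunk in chunks:
        yield speech.StreamingRecognizeRequest(audio_content=chunk)

my_audio_chunks = []  # supply your own audio chunks here

responses = client.streaming_recognize(
    config=streaming_config, requests=audio_requests(my_audio_chunks)
)
for response in responses:
    for result in response.results:
        print(result.is_final, result.alternatives[0].transcript)
```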
What to Expect in the JSON Response
By default, both Deepgram and Google will provide you with the following values:
- transcript
- start_time (duration)
- end_time (duration)
- word (string)
- confidence (float)
Additionally, Google and Deepgram each return provider-specific fields in their live-streaming responses; see each provider's API reference for the complete schema.
Configure Environment
We provide sample scripts in Python and Node.js and assume you have already configured either a Python or Node development environment. System requirements will vary depending on the programming language you use:
- Node.js
  - cross-fetch >= 3.1.5
- Python
  - python >= 3.7
  - aiohttp >= 3.8.1
Get Transcripts and Customize Response
To get transcripts and customize your response, you can use our SDKs and code samples for:
Migration Best Practices
Each API service has different limitations to ensure performance. Known differences between Deepgram and Google STT are detailed below, along with information about how our customers address these differences during the migration process.