Deploy Deepgram on Amazon SageMaker
Deepgram can be deployed into your own Amazon Virtual Private Cloud (VPC) environment using Amazon SageMaker AI. Simply subscribe to the Deepgram product in the AWS Marketplace and then deploy a SageMaker Endpoint, using our pre-made SageMaker Model Package.
For an overview of running Deepgram on SageMaker, including benefits, tradeoffs, and pricing, see Amazon SageMaker.
Supported Products
Follow this AWS Marketplace link to see the Deepgram products that are supported on the SageMaker AI platform. No login to your AWS account is required to view this public AWS Marketplace website.
For Speech-to-Text (STT), Deepgram publishes a separate product listing for each combination of:
- Model family — such as Nova-3 or Flux
- Language coverage — monolingual or multilingual
- Processing mode — streaming or batch
For example, Deepgram Voice AI- Nova-3 Monolingual Speech-to-Text (STT) Streaming is one listing.
For Text-to-Speech (TTS), Deepgram publishes a single product listing per model family (such as Aura-2), with no separate listings for language coverage or processing mode. Subscribe to and deploy a SageMaker Endpoint for each product you wish to utilize. Your application code will need to route requests to the SageMaker Endpoint for the product you wish to run inference against.
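The per-product routing described above can be sketched as a small lookup table. This is only an illustration: the endpoint names and the product/mode keys below are hypothetical placeholders for whatever you named your own SageMaker Endpoints.

```python
from typing import Optional

# Hypothetical mapping from a Deepgram product (and processing mode) to the
# SageMaker Endpoint deployed for it. Every name here is a placeholder.
ENDPOINTS = {
    ("stt", "streaming"): "deepgram-nova3-streaming",
    ("stt", "batch"): "deepgram-nova3-batch",
    ("tts", None): "deepgram-aura2",
}


def endpoint_for(product: str, mode: Optional[str] = None) -> str:
    """Return the endpoint name to invoke for a product and processing mode."""
    # TTS listings are not split by processing mode, so the mode is ignored.
    key = (product, None) if product == "tts" else (product, mode)
    try:
        return ENDPOINTS[key]
    except KeyError:
        raise ValueError(f"no endpoint deployed for {product!r} / {mode!r}") from None
```

Keeping the mapping in one place means application code asks for a product rather than hard-coding endpoint names at each call site.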
Within a listing, individual languages are delivered as versions of the model package. A monolingual listing may offer one version covering English and French, and another covering Vietnamese and Thai. Read the version name and its release notes to understand the set of languages each version provides, and select the version that matches the languages you need when deploying.
Language Requests: If there is a transcription language that is not currently available on the AWS Marketplace, please work with your account manager to request additional language models to be added. For a full list of the Deepgram supported transcription languages, check out this document. You can also view the Changelog to see recent product announcements.
Limitations
When using Deepgram services in Amazon SageMaker, please be aware of the following limitations.
- Deepgram cannot call Large Language Model (LLM) services
- Deepgram cannot invoke user-defined callback URLs
- Passing a JSON payload for transcription (e.g., referencing a file stored in cloud storage via URL) is unsupported, as the SageMaker isolation model prevents the container from reaching out to external cloud storage
- Deepgram custom metrics are not currently available through Amazon SageMaker Endpoints
- For streaming invocations, the connection remains open until you explicitly close the input stream or the endpoint closes the connection, supporting up to 30 minutes of connection time.
- For non-streaming invocations, the maximum size of the input data is 25 MB for real-time endpoints. For larger files, use asynchronous endpoints, which support payloads up to 1 GB with near real-time latency and can scale to zero when there are no requests to be processed.
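The two payload limits above suggest a simple size check before sending a pre-recorded file. The sketch below is an illustration under stated assumptions: it treats 1 MB as 1,000,000 bytes to stay on the conservative side (the exact byte thresholds enforced by SageMaker are not stated here), and the function name is hypothetical.

```python
MB = 1_000_000  # assumption: decimal megabytes, chosen to be conservative
GB = 1_000 * MB

SYNC_LIMIT_BYTES = 25 * MB   # non-streaming request to a real-time endpoint
ASYNC_LIMIT_BYTES = 1 * GB   # payload for an asynchronous endpoint


def pick_invocation_mode(size_bytes: int) -> str:
    """Pick a non-streaming invocation mode for a pre-recorded file."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= SYNC_LIMIT_BYTES:
        return "synchronous"
    if size_bytes <= ASYNC_LIMIT_BYTES:
        return "asynchronous"
    raise ValueError("file exceeds the 1 GB asynchronous payload limit")
```

In practice you would pass `os.path.getsize(path)` and route the file to the matching endpoint type.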
Prerequisites
- An AWS account
- AWS IAM permissions to SageMaker and Marketplace
  - IAM Policy: AWSMarketplaceManageSubscriptions
  - IAM Policy: AmazonSageMakerFullAccess
Subscribe to Deepgram Products
Before you can deploy Deepgram on Amazon SageMaker AI, you’ll need to subscribe to the product in the AWS Marketplace. Keep in mind that you are not billed for the product until you deploy an Amazon SageMaker AI Endpoint resource.
Log in to the AWS Management Console for the account you’d like to deploy in.
Create AWS IAM Role for SageMaker Execution
Follow the AWS documentation to create an AWS Identity & Access Management (IAM) role that will be used to run SageMaker Model Endpoints. You only need to create a single SageMaker execution role, and can reuse this IAM Role to deploy multiple SageMaker Endpoints.
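If you prefer to script the role creation, a minimal sketch with boto3 might look like the following. It covers only the trust relationship that lets the SageMaker service assume the role; the permissions the role needs should still come from the AWS documentation referenced above. The function names and the role name are hypothetical, and `iam_client` is assumed to be a `boto3.client("iam")`.

```python
import json


def sagemaker_trust_policy() -> dict:
    """Trust policy allowing the SageMaker service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_execution_role(iam_client, role_name: str) -> str:
    """Create the role and return its ARN (permissions are attached separately)."""
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy()),
    )
    return response["Role"]["Arn"]
```

Because one execution role can be reused across endpoints, this only needs to run once per account.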
Deploy Deepgram Model Package for SageMaker AI
Once you’ve subscribed to the Deepgram product on AWS Marketplace, you can deploy a SageMaker AI Endpoint. The SageMaker “Endpoint” resource represents the compute instance that runs the Deepgram Voice AI services. After you initiate resource creation, the Endpoint takes several minutes to deploy.
Which endpoint type should you deploy? Deploy a real-time endpoint for live streaming and synchronous (single-file) transcription, or an asynchronous endpoint for large pre-recorded files (up to 1 GB) and scale-to-zero. See Auto-Scaling SageMaker Endpoints for a full comparison.
In the AWS Management Console, navigate to the AWS Marketplace Manage subscriptions console
On the Active subscriptions tab, find the subscription for the Deepgram product you want to deploy (e.g., Deepgram Voice AI- Nova-3 Monolingual Speech-to-Text (STT) Streaming)
Under the Version header, select the product version from the dropdown. If the listing has more than one version, read the version name and the release notes to understand the set of languages (or features) each version provides, and choose the version that matches your needs
Under IAM Role, select the SageMaker execution role that you created
(Asynchronous endpoints only) Configure async invocation.
If you are deploying an asynchronous endpoint, expand the Async invocation config section and toggle it on, then set the S3 output path — the S3 location (for example, s3://your-bucket/output/) where transcription results are written. The remaining fields are optional.
For a real-time endpoint (streaming and synchronous invocation), leave Async invocation config turned off and continue to the next step unchanged.
To autoscale an asynchronous endpoint — including scaling to zero when idle — see Auto-Scaling Asynchronous Endpoints.
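If you create the endpoint configuration through the SageMaker API instead of the console, the S3 output path from the async invocation step corresponds to the `AsyncInferenceConfig` parameter of `CreateEndpointConfig`. The helper below is a hypothetical sketch that builds only that fragment; the bucket path is a placeholder.

```python
def async_inference_config(s3_output_path: str) -> dict:
    """Build the AsyncInferenceConfig fragment for create_endpoint_config."""
    if not s3_output_path.startswith("s3://"):
        raise ValueError("S3 output path must start with s3://")
    # Only the required output path is set; other fields are optional.
    return {"OutputConfig": {"S3OutputPath": s3_output_path}}
```

The returned dictionary would be passed as `AsyncInferenceConfig=...` alongside the endpoint configuration name and production variants. For a real-time endpoint, omit the parameter entirely.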
After following these steps, you should see a new Endpoint in your AWS account.
If you don’t see the Endpoint, ensure that you have selected the correct AWS region in the AWS Management Console.
It may take several minutes for the Endpoint to change to status InService.
Once the Endpoint status has changed to InService, you can monitor the Amazon CloudWatch Logs for the Endpoint to ensure normal operation of the Deepgram services.
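To check the status from code rather than the console, you can poll `DescribeEndpoint`. The sketch below assumes `sagemaker_client` is a `boto3.client("sagemaker")` created in the same region as the endpoint; the function name, polling interval, and timeout are illustrative choices.

```python
import time


def wait_until_in_service(
    sagemaker_client,
    endpoint_name: str,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 1800.0,
    sleep=time.sleep,
) -> str:
    """Poll the endpoint until it reports InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sagemaker_client.describe_endpoint(
            EndpointName=endpoint_name
        )["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} is still {status}")
        sleep(poll_seconds)
```

boto3 also ships a built-in waiter, `sagemaker_client.get_waiter("endpoint_in_service")`, which performs a similar loop.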
Inference
Once your endpoint is deployed and in service, you invoke it to transcribe audio. The endpoint supports three invocation modes, depending on which endpoint type you deployed and how you need the response returned.
Not sure which endpoint type you need? See Auto-Scaling SageMaker Endpoints for a full comparison of real-time and asynchronous endpoints and guidance on choosing between them.
Passing Deepgram parameters. For synchronous and asynchronous invocations, the Deepgram model and feature parameters are passed in the CustomAttributes field (the X-Amzn-SageMaker-Custom-Attributes header) as v1/listen?model=...&language=.... For streaming, the same values are split across ModelInvocationPath (v1/listen) and ModelQueryString. In all cases an API path such as v1/listen is required — without it the container returns a 404. The examples on this page use v1/listen (speech-to-text), but other routes are available (for example, v1/speak for text-to-speech).
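A small helper can build both forms from the same set of parameters. This is a sketch, not part of any SDK: the function names are hypothetical, and it assumes `ModelQueryString` takes the query without a leading `?`.

```python
from urllib.parse import urlencode


def custom_attributes(path: str = "v1/listen", **params: str) -> str:
    """Build the CustomAttributes value for synchronous and asynchronous calls."""
    query = urlencode(params)
    return f"{path}?{query}" if query else path


def streaming_fields(path: str = "v1/listen", **params: str) -> dict:
    """Split the same values into the two fields used for streaming."""
    # Assumption: the query string is passed without a leading "?".
    return {"ModelInvocationPath": path, "ModelQueryString": urlencode(params)}
```

For example, `custom_attributes(model="nova-3", language="en")` returns `v1/listen?model=nova-3&language=en`.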
Complete, runnable examples for all three modes — in Python, TypeScript, and Java — are maintained in the deepgram-devs/dg-sagemaker repository. The sections below explain each mode and link to the corresponding example. See the repository’s README for setup and prerequisites.
Use the Deepgram SDKs with the SageMaker transport
You don’t have to call the AWS APIs directly. The Deepgram SDKs can target a SageMaker endpoint through a SageMaker transport, so you keep the same client-side request and response patterns whether you call the Deepgram-hosted API or your own SageMaker deployment. You swap the transport; your listen request and result-handling code stays the same.
For example, the Deepgram Java SDK pairs with the Deepgram SageMaker transport (com.deepgram:deepgram-sagemaker).
The remaining sections show the underlying AWS APIs directly, which apply to any language.
Streaming (real-time)
Use streaming for live, interactive transcription over a persistent bidirectional connection. You send audio chunks and receive transcription results as the audio is processed, up to 30 minutes per connection.
Streaming uses the HTTP/2 bidirectional streaming client (@aws-sdk/client-sagemaker-runtime-http2 in TypeScript, aws_sdk_sagemaker_runtime_http2 in Python) against the SageMaker bidirectional runtime endpoint (https://runtime.sagemaker.<region>.amazonaws.com:8443). The request Body is an async iterable of payload parts:
- Binary audio is sent as a Bytes payload with DataType: "BINARY".
- Control messages (for example, KeepAlive and CloseStream) are sent as UTF-8 encoded JSON with DataType: "UTF8".
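The two payload shapes above can be sketched with plain dictionaries. This is an illustration only: the real HTTP/2 SDK wraps payload parts in its own typed classes, so treat the dictionary layout and function names as hypothetical. The control message body is assumed to follow Deepgram's streaming convention of a JSON object with a `type` field.

```python
import json
from typing import AsyncIterator, Iterable


def audio_part(chunk: bytes) -> dict:
    """Wrap a chunk of raw audio as a binary payload part."""
    return {"Bytes": chunk, "DataType": "BINARY"}


def control_part(message_type: str) -> dict:
    """Wrap a control message (e.g. KeepAlive, CloseStream) as UTF-8 JSON."""
    body = json.dumps({"type": message_type}).encode("utf-8")
    return {"Bytes": body, "DataType": "UTF8"}


async def payload_parts(chunks: Iterable[bytes]) -> AsyncIterator[dict]:
    """Yield audio parts in order, then signal the end of the stream."""
    for chunk in chunks:
        yield audio_part(chunk)
    yield control_part("CloseStream")
```

An async generator matches the async-iterable request Body described above; a live capture would yield chunks as they arrive and send `control_part("KeepAlive")` during silence.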
For the complete examples — file and microphone capture, payload wrapping, keepalive handling, and stream processing — see:
- TypeScript: js-stt/stt.file.ts and stt.microphone.ts
- Python: python-stt/stt_wav_stress.py (stream subcommand)
Synchronous (real-time)
Use synchronous invocation to transcribe a single pre-recorded file and receive the full transcript in one immediate response. This is Deepgram’s “batch” transcription on a real-time endpoint — there is no streaming connection and no queue. The request body is capped at 25 MB; use streaming or asynchronous invocation for larger audio.
You send the audio as the request body to InvokeEndpoint, pass the Deepgram parameters via CustomAttributes, and parse the transcript from the JSON response.
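A minimal sketch of that call with boto3 follows. It assumes `runtime_client` is a `boto3.client("sagemaker-runtime")`, that the audio is a WAV file (the `audio/wav` content type is an assumption), and that the response follows Deepgram's pre-recorded JSON shape; the function name is hypothetical.

```python
import json


def transcribe_sync(
    runtime_client,
    endpoint_name: str,
    audio: bytes,
    custom_attributes: str,
    content_type: str = "audio/wav",
) -> str:
    """Send one audio file to a real-time endpoint and return its transcript."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        CustomAttributes=custom_attributes,  # e.g. "v1/listen?model=...&language=..."
        Body=audio,
    )
    result = json.loads(response["Body"].read())
    # Assumed response shape: first channel, first alternative.
    return result["results"]["channels"][0]["alternatives"][0]["transcript"]
```

The repository example remains the reference; this sketch only shows where each piece of the request goes.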
For the complete example, see python-stt/stt_wav_stress.py (batch subcommand) in the repository.
Asynchronous
Use asynchronous invocation for large or long-form pre-recorded files — up to 1 GB, with up to one hour of processing time. Requests are queued and processed with near real-time latency, and the result is written back to Amazon S3.
The flow is:
- Upload the audio file to an S3 bucket.
- Call InvokeEndpointAsync with InputLocation pointing to the uploaded file and the Deepgram parameters in CustomAttributes.
- SageMaker immediately returns an OutputLocation and a FailureLocation in S3, and processes the request from the queue.
- Poll the OutputLocation (success) and FailureLocation (error) prefixes until one appears, then download and parse the result, or react to an Amazon SNS notification, if configured.
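The steps above can be sketched with boto3 as follows. This is a simplified illustration, not the repository example: `runtime_client` and `s3_client` are assumed to be `boto3.client("sagemaker-runtime")` and `boto3.client("s3")`, the `audio/wav` content type and the polling interval are assumptions, and the function names are hypothetical.

```python
import json
import time
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def object_exists(s3_client, uri: str) -> bool:
    """Return True once an object is present at (or under) the given URI."""
    bucket, key = split_s3_uri(uri)
    listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return listing.get("KeyCount", 0) > 0


def transcribe_async(runtime_client, s3_client, endpoint_name, local_path,
                     bucket, key, custom_attributes, content_type="audio/wav",
                     poll_seconds=5.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Upload, invoke asynchronously, then poll S3 for the result."""
    s3_client.upload_file(local_path, bucket, key)            # step 1
    response = runtime_client.invoke_endpoint_async(           # step 2
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType=content_type,
        CustomAttributes=custom_attributes,
    )
    output_uri = response["OutputLocation"]                    # step 3
    failure_uri = response["FailureLocation"]
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:                         # step 4
        if object_exists(s3_client, output_uri):
            out_bucket, out_key = split_s3_uri(output_uri)
            body = s3_client.get_object(Bucket=out_bucket, Key=out_key)["Body"]
            return json.loads(body.read())
        if object_exists(s3_client, failure_uri):
            raise RuntimeError(f"inference failed, see {failure_uri}")
        sleep(poll_seconds)
    raise TimeoutError("no result appeared before the timeout")
```

Polling is the simplest option to demonstrate; for production workloads, the SNS notification route avoids repeated S3 requests.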
Asynchronous invocation requires an endpoint deployed with Async invocation config enabled (with an S3 output path), as described in the deployment steps above. To autoscale an asynchronous endpoint — including scaling to zero when idle — see Auto-Scaling Asynchronous Endpoints.
For the complete example — S3 upload, invocation, and polling for results — see python-stt/stt_wav_async.py in the repository.
Troubleshooting
If you’re experiencing any issues with your Deepgram deployment on Amazon SageMaker AI, you can obtain the Deepgram container logs from the Amazon CloudWatch service. If you open the SageMaker AI Endpoint resource details, there will be a link to open the Amazon CloudWatch Log Group for that endpoint. Within the CloudWatch Log Group, there should be a Log Stream that contains the Deepgram logs for all components. You can use the Amazon CloudWatch Logs Live Tail feature to watch logs in near-real-time while you are sending requests to the Deepgram API, via the SageMaker AI APIs.
To run Live Tail locally, use the AWS CLI command aws logs start-live-tail, passing the ARN of the endpoint’s CloudWatch Log Group.
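As a sketch, the helper below assembles an `aws logs start-live-tail` invocation for an endpoint's log group. It assumes the standard `aws` partition and the usual SageMaker log group naming of `/aws/sagemaker/Endpoints/<endpoint-name>`; the function name and the example values are hypothetical, so confirm the log group name from the link in the endpoint details.

```python
def live_tail_command(region: str, account_id: str, endpoint_name: str) -> list:
    """Build the AWS CLI arguments that live-tail an endpoint's log group."""
    # Assumed log group naming for SageMaker endpoints.
    log_group_arn = (
        f"arn:aws:logs:{region}:{account_id}:log-group:"
        f"/aws/sagemaker/Endpoints/{endpoint_name}"
    )
    return [
        "aws", "logs", "start-live-tail",
        "--region", region,
        "--log-group-identifiers", log_group_arn,
    ]
```

You could run the result with `subprocess.run(live_tail_command("us-east-1", "123456789012", "my-endpoint"), check=True)`, or join the list with spaces and paste it into a terminal.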
Checklist
If you experience any issues using Deepgram services running on the Amazon SageMaker AI platform, please review this checklist before contacting Deepgram support.
- Ensure that your application’s AWS IAM User or IAM Role has permission to call the InvokeEndpointWithBidirectionalStream SageMaker AI action.
- Ensure your application is targeting the correct AWS account and region, where your SageMaker Endpoint exists.
- Ensure the Deepgram product you’ve deployed (e.g., streaming Speech-to-Text), from the AWS Marketplace, corresponds to the Deepgram API you’re calling.
- There is a known compatibility issue using pre-Blackwell NVIDIA GPUs with the latest SageMaker-provided AMI named al2023-ami-sagemaker-inference-gpu-4-1, which includes the NVIDIA 580 driver version. When creating your SageMaker Endpoint Configuration resource with a g4dn, g5, g6, or g6e instance family, be sure to use one of the AMIs before this version. You can also reference this AWS supported configurations table.
- If you have received a SageMaker private offer for a management account of an AWS organization, you may use AWS License Manager to grant usage of the SageMaker private offer to member accounts within your AWS organization as a Marketplace license entitlement.
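For the first checklist item, an identity policy scoped to a single endpoint might look like the sketch below. The IAM action name for bidirectional streaming is assumed to follow the usual `sagemaker:<ApiName>` pattern, so verify it against the AWS documentation; the function name and the ARN in the usage note are hypothetical.

```python
def invoke_policy(endpoint_arn: str) -> dict:
    """Build an IAM policy document allowing invocation of one endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sagemaker:InvokeEndpoint",
                    "sagemaker:InvokeEndpointAsync",
                    # Assumed action name, mirroring the API operation name.
                    "sagemaker:InvokeEndpointWithBidirectionalStream",
                ],
                "Resource": endpoint_arn,
            }
        ],
    }
```

Calling `invoke_policy("arn:aws:sagemaker:us-east-1:123456789012:endpoint/my-endpoint")` yields a document you can serialize with `json.dumps` and attach to the application's IAM User or Role.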