Auto-Scaling SageMaker Endpoints

Compare real-time and asynchronous SageMaker endpoint types and choose the right autoscaling strategy for your Deepgram deployment.

Deepgram models deployed on Amazon SageMaker can automatically scale the number of instances behind an endpoint in response to load, using AWS Application Auto Scaling. How you configure autoscaling — and whether the endpoint can scale all the way down to zero — depends on which type of endpoint you deploy.

This page explains the two endpoint types and helps you choose between them. For step-by-step setup, follow the guide that matches your deployment:

  • Auto-Scaling Real-Time Endpoints — for streaming requests and synchronous pre-recorded requests that return an immediate response.
  • Auto-Scaling Asynchronous Endpoints — for pre-recorded files processed from a queue, with scale-to-zero support.

Real-time vs. asynchronous endpoints

Both endpoint types can transcribe pre-recorded audio — the difference is how the request is processed and returned, and that determines which signal each scales on.

A real-time endpoint serves two kinds of requests:

  • Streaming — live audio over a bidirectional stream (InvokeEndpointWithBidirectionalStream), up to 30 minutes per connection, with results streamed back as the audio is processed.
  • Synchronous (pre-recorded) — a single file up to 25 MB submitted with InvokeEndpoint, processed in real time and returned in one immediate, synchronous response (Deepgram’s “batch” API).

An asynchronous endpoint serves one kind of request:

  • Pre-recorded files only — a file up to 1 GB, submitted to a queue with InvokeEndpointAsync via an S3 pointer and processed asynchronously with near real-time latency. The result is written back to S3; there is no synchronous response and no streaming.

| | Real-time endpoint | Asynchronous endpoint |
|---|---|---|
| Supported requests | Streaming and synchronous pre-recorded | Pre-recorded files only |
| Invocation | InvokeEndpoint (synchronous) or InvokeEndpointWithBidirectionalStream (streaming) | InvokeEndpointAsync with an S3 payload pointer |
| Max input size | 25 MB per file | 1 GB per file |
| Max duration | 30 min per streaming connection | Up to 1 hour processing per request |
| Response | Immediate / synchronous (or streamed) | Asynchronous, written to S3 (near real-time) |
| Scales to zero | No (minimum 1 instance) | Yes (MinCapacity=0) |
| Scaling metric | ConcurrentRequestsPerModel | ApproximateBacklogSizePerInstance |
| Setup guide | Auto-Scaling Real-Time Endpoints | Auto-Scaling Asynchronous Endpoints |
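
As a rough illustration of the two pre-recorded invocation paths compared above, the following sketch builds the keyword arguments for the Boto3 `sagemaker-runtime` calls `invoke_endpoint` and `invoke_endpoint_async`. The endpoint names, bucket, file name, and the `audio/wav` content type are hypothetical placeholders, and treating 25 MB as 25,000,000 bytes is an assumption. Any Deepgram-specific request options are not covered here; see the setup guides for those.

```python
from typing import Any, Dict

# 25 MB synchronous limit from the comparison above.
# Assumption: decimal megabytes; confirm the exact byte limit for your deployment.
MAX_SYNC_BYTES = 25 * 1000 * 1000


def build_sync_request(endpoint_name: str, audio: bytes,
                       content_type: str = "audio/wav") -> Dict[str, Any]:
    """Keyword arguments for invoke_endpoint on a real-time endpoint."""
    if len(audio) > MAX_SYNC_BYTES:
        raise ValueError("file exceeds the 25 MB synchronous limit; "
                         "use an asynchronous endpoint instead")
    return {"EndpointName": endpoint_name,
            "ContentType": content_type,
            "Body": audio}


def build_async_request(endpoint_name: str, s3_uri: str,
                        content_type: str = "audio/wav") -> Dict[str, Any]:
    """Keyword arguments for invoke_endpoint_async on an asynchronous endpoint."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("InputLocation must be an S3 URI")
    return {"EndpointName": endpoint_name,
            "ContentType": content_type,
            "InputLocation": s3_uri}


def main() -> None:
    # Not called automatically: run it only after replacing the hypothetical
    # endpoint names, bucket, and file name with real resources.
    import boto3  # imported here so the builders above work without AWS access

    runtime = boto3.client("sagemaker-runtime")

    # Real-time endpoint: the transcript comes back in the response body.
    with open("call.wav", "rb") as f:
        sync = runtime.invoke_endpoint(
            **build_sync_request("my-realtime-endpoint", f.read()))
    print(sync["Body"].read()[:200])

    # Asynchronous endpoint: the request is queued and the result lands in S3.
    queued = runtime.invoke_endpoint_async(
        **build_async_request("my-async-endpoint", "s3://my-bucket/audio/call.wav"))
    print(queued["OutputLocation"])
```

The synchronous call carries the audio bytes in the request, while the asynchronous call carries only a pointer to an object that is already in S3, which is why the size limits differ.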

Which should I use?

  • Choose a real-time endpoint when you need live streaming transcription, or when you transcribe pre-recorded files of 25 MB or less and want an immediate synchronous response. These endpoints keep a minimum of one instance running at all times.
  • Choose an asynchronous endpoint when you transcribe pre-recorded files — especially large ones (up to 1 GB) — or have spiky or sporadic traffic where paying for idle instances is wasteful. Asynchronous endpoints can autoscale to zero when the queue is empty, so you only pay while requests are processing. They do not support live streaming.
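
The guidance above can be condensed into a small decision helper. This is only a sketch of the rules stated on this page, not an official selection tool: the function name, its parameters, and the use of decimal megabytes and gigabytes are assumptions made for illustration.

```python
# Assumption: decimal units; confirm exact byte limits for your deployment.
MB = 1000 * 1000
GB = 1000 * MB

REAL_TIME = "real-time"
ASYNCHRONOUS = "asynchronous"


def choose_endpoint_type(live_streaming: bool,
                         file_size_bytes: int = 0,
                         needs_immediate_response: bool = False) -> str:
    """Return the endpoint type suggested by the rules on this page."""
    # Live streaming is only available on real-time endpoints.
    if live_streaming:
        return REAL_TIME
    # Asynchronous endpoints accept pre-recorded files up to 1 GB.
    if file_size_bytes > 1 * GB:
        raise ValueError("files over 1 GB exceed the asynchronous endpoint limit")
    if file_size_bytes > 25 * MB:
        # Synchronous requests are capped at 25 MB per file.
        if needs_immediate_response:
            raise ValueError("synchronous responses are limited to files of 25 MB or less")
        return ASYNCHRONOUS
    # Small files: real-time for an immediate response, otherwise the queue,
    # which can scale to zero when idle.
    return REAL_TIME if needs_immediate_response else ASYNCHRONOUS
```

For example, `choose_endpoint_type(False, 500 * MB)` returns `"asynchronous"`, while `choose_endpoint_type(False, 10 * MB, needs_immediate_response=True)` returns `"real-time"`.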

How autoscaling works

Both endpoint types integrate with AWS Application Auto Scaling. At a high level, you:

  1. Register a scalable target — tell Application Auto Scaling which endpoint variant to scale and the minimum and maximum instance counts.
  2. Define a scaling policy — a target-tracking policy that adds or removes instances to keep a chosen CloudWatch metric near a target value.
  3. Apply the policy and monitor scaling activity in CloudWatch.
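
The three steps above might look roughly like this with the Boto3 `application-autoscaling` client. It is a minimal sketch, not the exact configuration from the setup guides: the helper names, endpoint name, variant name, policy name, and capacity values are all hypothetical. The tracking configuration is passed in as a dictionary because its contents depend on the endpoint type. The final call lists scaling activity through Application Auto Scaling itself, which can complement the CloudWatch view.

```python
from typing import Any, Dict, List


def scalable_target_kwargs(endpoint_name: str, variant_name: str,
                           min_capacity: int, max_capacity: int) -> Dict[str, Any]:
    """Step 1: arguments for register_scalable_target."""
    if min_capacity < 0 or max_capacity < min_capacity:
        raise ValueError("need 0 <= min_capacity <= max_capacity")
    return {
        "ServiceNamespace": "sagemaker",
        # SageMaker variants are addressed as endpoint/<name>/variant/<variant>.
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def target_tracking_policy_kwargs(target: Dict[str, Any], policy_name: str,
                                  tracking_config: Dict[str, Any]) -> Dict[str, Any]:
    """Step 2: arguments for put_scaling_policy with a target-tracking policy."""
    return {
        "PolicyName": policy_name,
        "ServiceNamespace": target["ServiceNamespace"],
        "ResourceId": target["ResourceId"],
        "ScalableDimension": target["ScalableDimension"],
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": tracking_config,
    }


def apply_and_list_activity(target: Dict[str, Any],
                            policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Step 3: apply both, then return recent scaling activities."""
    import boto3  # imported here so the builders above work without AWS access

    autoscaling = boto3.client("application-autoscaling")
    autoscaling.register_scalable_target(**target)
    autoscaling.put_scaling_policy(**policy)
    response = autoscaling.describe_scaling_activities(
        ServiceNamespace=target["ServiceNamespace"],
        ResourceId=target["ResourceId"],
    )
    return response["ScalingActivities"]
```

Keeping the argument builders separate from the AWS calls makes it easy to review the exact request before anything is changed on the endpoint.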

The difference between the two is the signal they scale on:

  • Real-time endpoints scale on concurrency — the ConcurrentRequestsPerModel metric — because each in-flight request, whether a streaming connection or a synchronous pre-recorded request, represents load on an instance.
  • Asynchronous endpoints scale on queue depth — the ApproximateBacklogSizePerInstance metric — and allow a minimum capacity of zero, enabling scale-to-zero during idle periods.
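
To make the difference in scaling signals concrete, here is a hedged sketch of the target-tracking configuration each endpoint type might use in an Application Auto Scaling `put_scaling_policy` call. The function names and target values are hypothetical. The predefined metric type and the `AWS/SageMaker` namespace follow AWS's published autoscaling examples as best understood here, so verify them against the setup guide for your endpoint type before relying on them.

```python
from typing import Any, Dict

# Minimum instance counts stated on this page.
MIN_CAPACITY = {"real-time": 1, "asynchronous": 0}


def real_time_tracking_config(target_concurrency: float) -> Dict[str, Any]:
    """Track concurrent in-flight requests per model on a real-time endpoint."""
    return {
        "TargetValue": target_concurrency,
        "PredefinedMetricSpecification": {
            # Assumption: the predefined type that corresponds to the
            # ConcurrentRequestsPerModel CloudWatch metric.
            "PredefinedMetricType":
                "SageMakerVariantConcurrentRequestsPerModelHighResolution",
        },
    }


def async_tracking_config(endpoint_name: str,
                          target_backlog_per_instance: float) -> Dict[str, Any]:
    """Track queue depth per instance on an asynchronous endpoint."""
    return {
        "TargetValue": target_backlog_per_instance,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
    }
```

Either dictionary would be passed as `TargetTrackingScalingPolicyConfiguration`, with the scalable target's `MinCapacity` taken from `MIN_CAPACITY`. Scaling an asynchronous endpoint back out from zero instances may need an additional policy. The asynchronous setup guide and the AWS asynchronous inference documentation cover that case.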

For the exact metrics, policies, and code, follow the guide for your endpoint type below.

Use multiple instance types for resilience

A single instance type can become temporarily unavailable in a given region or Availability Zone, which may prevent your endpoint from scaling out when traffic increases. To reduce this risk, configure your endpoint variant with multiple instance types in priority order using SageMaker’s heterogeneous instance pools. SageMaker provisions instances from your highest-priority pool first and falls back to lower-priority pools when capacity in the preferred pool is constrained.

This applies to both real-time and asynchronous endpoints.

The following endpoint configuration lists ml.g6.2xlarge as the preferred instance type and falls back to ml.g6e.2xlarge if the first pool is unavailable:

Boto3

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_endpoint_config(
    EndpointConfigName="my-deepgram-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",
            "InitialInstanceCount": 2,
            # Pools are tried in priority order (1 = most preferred).
            "InstancePools": [
                {
                    "InstanceType": "ml.g6.2xlarge",
                    "Priority": 1,
                },
                {
                    # Fallback pool, used when the preferred pool lacks capacity.
                    "InstanceType": "ml.g6e.2xlarge",
                    "Priority": 2,
                },
            ],
            "VariantInstanceProvisionTimeoutInSeconds": 600,
        }
    ],
)
```

When you use multiple instance types, give additional consideration to your scaling policy. The predefined scaling metrics (ConcurrentRequestsPerModel for real-time endpoints, ApproximateBacklogSizePerInstance for asynchronous endpoints) do not account for capacity differences between pools, so a target-tracking policy that uses them directly may scale unevenly across instance types. For mixed fleets, AWS recommends driving the scaling policy from a weighted custom metric instead. See the AWS heterogeneous endpoints documentation for details on configuring weighted custom metrics.

Related resources

  • Auto-Scaling Real-Time Endpoints
  • Auto-Scaling Asynchronous Endpoints
  • Deploy Deepgram on Amazon SageMaker
  • Amazon SageMaker Asynchronous Inference (AWS)
  • Automatic scaling of Amazon SageMaker models (AWS)