Auto-Scaling Asynchronous SageMaker Endpoints

Configure autoscaling with scale-to-zero for asynchronous Deepgram endpoints on Amazon SageMaker using queue-depth metrics.

Deepgram’s speech-to-text models can be deployed on Amazon SageMaker as asynchronous inference endpoints, which queue incoming requests and read their payloads from Amazon S3. Async endpoints handle pre-recorded files only (no streaming), with payloads up to 1 GB, processing times up to one hour, and near real-time latency. Unlike real-time endpoints, they can autoscale to zero when there are no requests to process, so you only pay while the endpoint is actively working.

This guide covers how to configure autoscaling — including scale-to-zero — for an asynchronous Deepgram endpoint. For a comparison of endpoint types and when to use each, see Auto-Scaling SageMaker Endpoints.

Need live streaming instead? This page covers batch/non-streaming workloads on asynchronous endpoints. For live streaming speech-to-text on real-time endpoints, see Auto-Scaling Real-Time Endpoints. Real-time endpoints scale between a minimum and maximum instance count (both ≥ 1) and cannot scale to zero.

Asynchronous endpoints are the right choice when you process pre-recorded audio, work with large files (up to 1 GB), or have spiky or sporadic traffic.

How it works

SageMaker integrates with AWS Application Auto Scaling to adjust the number of instances behind your endpoint in response to load. For asynchronous endpoints, the relevant signal is the request queue depth, exposed through the ApproximateBacklogSizePerInstance CloudWatch metric (the number of queued requests divided by the current instance count).

A target-tracking scaling policy adds instances when the per-instance backlog rises above your target and removes them as the backlog drains. Because async endpoints allow a minimum capacity of zero, the fleet can scale all the way down to no instances during idle periods. Requests received while at zero instances are queued, and the endpoint scales back up to process them.
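The relationship between queue depth, instance count, and the target can be sketched as plain arithmetic. The following is a minimal illustration, not how Application Auto Scaling is implemented: the function names are hypothetical, and real scaling decisions also depend on cooldowns and alarm evaluation periods.

```python
import math
from typing import Optional


def backlog_per_instance(queued_requests: int, instance_count: int) -> Optional[float]:
    """Queued requests divided by the current instance count."""
    if instance_count <= 0:
        # Division is undefined at zero instances; this sketch does not model
        # what CloudWatch reports in that state.
        return None
    return queued_requests / instance_count


def instances_to_meet_target(queued_requests: int, target_value: float,
                             min_capacity: int, max_capacity: int) -> int:
    """Smallest instance count whose per-instance backlog is at or below the
    target, clamped to the registered capacity bounds."""
    needed = math.ceil(queued_requests / target_value)
    return max(min_capacity, min(max_capacity, needed))


print(backlog_per_instance(23, 2))               # 11.5
print(instances_to_meet_target(23, 5.0, 0, 5))   # 5
print(instances_to_meet_target(0, 5.0, 0, 5))    # 0
```

With 23 queued requests and a target of 5 per instance, five instances bring the per-instance backlog to 4.6, which is below the target. An empty queue needs no instances, which is what a minimum capacity of zero permits.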

Waking promptly from zero takes an extra policy. By default, a scaled-to-zero endpoint will not scale up until the backlog exceeds your target value, which can mean a long wait for the first request after an idle period. Add the scale-up-from-zero policy described later in this guide so the endpoint wakes on the first queued request.

Prerequisites

  • A Deepgram model deployed to a SageMaker asynchronous inference endpoint (an endpoint configuration with an AsyncInferenceConfig object). See Deploy Deepgram on Amazon SageMaker.
  • An Amazon S3 bucket for request and response payloads.
  • IAM permissions to register scalable targets and manage scaling policies (for example, AmazonSageMakerFullAccess plus Application Auto Scaling permissions).
  • The AWS CLI or AWS SDK for Python (Boto3) configured with credentials for your account.
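To confirm the first two prerequisites, you can inspect the endpoint configuration for an AsyncInferenceConfig object and its S3 output path. The helper below is a sketch that works on the dictionary returned by the SageMaker `describe_endpoint_config` call; the helper name and the sample bucket path are hypothetical.

```python
from typing import Any, Dict, Optional


def async_output_path(endpoint_config: Dict[str, Any]) -> Optional[str]:
    """Return the S3 output path if the config describes an asynchronous
    endpoint, or None if AsyncInferenceConfig is absent."""
    async_config = endpoint_config.get("AsyncInferenceConfig")
    if not async_config:
        return None
    return async_config.get("OutputConfig", {}).get("S3OutputPath")


# Sample response fragments (hypothetical values, for illustration only)
async_cfg = {
    "AsyncInferenceConfig": {
        "OutputConfig": {"S3OutputPath": "s3://example-bucket/async-output/"}
    }
}
realtime_cfg = {"ProductionVariants": []}

print(async_output_path(async_cfg))     # s3://example-bucket/async-output/
print(async_output_path(realtime_cfg))  # None
```

In practice you would pass the result of `boto3.client('sagemaker').describe_endpoint_config(EndpointConfigName=...)`, using your own endpoint configuration name.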

Use multiple instance types for resilience

For improved availability, configure your endpoint with multiple instance types so SageMaker can fall back to an alternative pool when your preferred instance type is constrained. This applies to both real-time and asynchronous endpoints. See Use multiple instance types for resilience in the parent guide for configuration details and code examples.

Register the scalable target

Register your endpoint variant with Application Auto Scaling and set the instance bounds. The key difference from a real-time endpoint is MinCapacity=0, which allows the endpoint to scale down to zero instances.

import boto3

client = boto3.client('application-autoscaling')

# endpoint_name and variant_name must already hold the names of your
# endpoint and its production variant (e.g. 'variant1').
# Application Auto Scaling references the endpoint variant by resource ID.
resource_id = 'endpoint/' + endpoint_name + '/variant/' + variant_name

client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=0,  # Allows scale-to-zero when the queue is empty
    MaxCapacity=5,  # Set to your peak instance count
)
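After registering, it can be useful to read the bounds back and confirm that the minimum is zero. The sketch below wraps the Application Auto Scaling `describe_scalable_targets` call; the helper name is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
from typing import Any, Optional, Tuple


def get_capacity_bounds(autoscaling_client: Any,
                        resource_id: str) -> Optional[Tuple[int, int]]:
    """Return (MinCapacity, MaxCapacity) for the registered variant,
    or None if no scalable target is registered for it."""
    response = autoscaling_client.describe_scalable_targets(
        ServiceNamespace="sagemaker",
        ResourceIds=[resource_id],
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )
    targets = response.get("ScalableTargets", [])
    if not targets:
        return None
    return targets[0]["MinCapacity"], targets[0]["MaxCapacity"]


# Usage, with the client and resource_id from the registration step:
#   bounds = get_capacity_bounds(client, resource_id)
#   A result of (0, 5) means scale-to-zero is allowed, with a ceiling of five.
```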

Create a target-tracking scaling policy

Apply a target-tracking policy on the ApproximateBacklogSizePerInstance metric, referenced through a CustomizedMetricSpecification. The TargetValue is the number of queued requests per instance you’re willing to tolerate before adding capacity. Start with a small value and tune it against your latency requirements.

client.put_scaling_policy(
    PolicyName='AsyncBacklogTargetTracking',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # Target ApproximateBacklogSizePerInstance
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name},
            ],
            'Statistic': 'Average',
        },
        # Optional: tune how quickly the endpoint scales in/out
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 300,
    },
)

Choose a target value

Benchmark a single instance with representative audio files to find the per-instance backlog at which queue wait times stay within your latency budget. Set TargetValue to a level below that threshold so the policy adds capacity before the queue grows faster than one instance can drain it.
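As a rough starting point before benchmarking, the threshold can be estimated from a latency budget and an average processing time. The sketch below assumes a deliberately simplified model that is not from the source: every request takes the same time, each instance works on a fixed number of requests at once, and `headroom` is an arbitrary safety factor. All names and numbers are hypothetical, and measured results should replace them.

```python
import math


def max_tolerable_backlog(latency_budget_s: float, avg_processing_s: float,
                          concurrent_requests: int = 1) -> int:
    """Largest per-instance backlog that one instance can clear within the
    latency budget under the simplified model."""
    # Each "wave" of concurrent_requests takes avg_processing_s to clear.
    waves = math.floor(latency_budget_s / avg_processing_s)
    return waves * concurrent_requests


def suggested_target_value(latency_budget_s: float, avg_processing_s: float,
                           concurrent_requests: int = 1,
                           headroom: float = 0.5) -> float:
    """A TargetValue below the tolerable threshold, never less than 1."""
    threshold = max_tolerable_backlog(latency_budget_s, avg_processing_s,
                                      concurrent_requests)
    return float(max(1, math.floor(threshold * headroom)))


# Example: 10-minute budget, 30 s per file, 2 files processed at once
print(max_tolerable_backlog(600, 30, 2))   # 40
print(suggested_target_value(600, 30, 2))  # 20.0
```

Here one instance could clear 40 queued files within ten minutes, so a target of 20 leaves room for new instances to start before the budget is exceeded.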

Scale up from zero for new requests

When an endpoint has scaled down to zero, the target-tracking policy above won’t bring it back until the backlog exceeds your target value, so a single request arriving after an idle period can sit in the queue for a long time. To wake the endpoint on the first queued request, add a step-scaling policy driven by the HasBacklogWithoutCapacity metric, which reports 1 when there are queued requests but zero instances and 0 otherwise.

cw_client = boto3.client('cloudwatch')

# 1. Step-scaling policy: add one instance when at zero capacity with a backlog
response = client.put_scaling_policy(
    PolicyName='HasBacklogWithoutCapacity-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='StepScaling',
    StepScalingPolicyConfiguration={
        'AdjustmentType': 'ChangeInCapacity',
        'MetricAggregationType': 'Average',
        'Cooldown': 300,
        'StepAdjustments': [
            {'MetricIntervalLowerBound': 0, 'ScalingAdjustment': 1},
        ],
    },
)

# 2. Alarm on HasBacklogWithoutCapacity that triggers the policy above
cw_client.put_metric_alarm(
    AlarmName='HasBacklogWithoutCapacity-Alarm',
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods=2,
    DatapointsToAlarm=2,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': endpoint_name},
    ],
    Period=60,
    AlarmActions=[response['PolicyARN']],
)

With both policies in place, the target-tracking policy handles scaling under sustained load, while the step-scaling policy ensures a scaled-to-zero endpoint wakes promptly on the first incoming request.
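One way to check that both policies are attached is to list them with the Application Auto Scaling `describe_scaling_policies` call. The following is a sketch with hypothetical helper names; it takes the client as a parameter and, for brevity, reads only the first page of results.

```python
from typing import Any, Dict


def scaling_policy_types(autoscaling_client: Any, resource_id: str) -> Dict[str, str]:
    """Map each policy name on the variant to its policy type."""
    response = autoscaling_client.describe_scaling_policies(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )
    return {p["PolicyName"]: p["PolicyType"]
            for p in response.get("ScalingPolicies", [])}


def has_both_policies(policy_types: Dict[str, str]) -> bool:
    """True when a target-tracking and a step-scaling policy are both present."""
    return {"TargetTrackingScaling", "StepScaling"} <= set(policy_types.values())


# Usage, with the client and resource_id from the earlier steps:
#   has_both_policies(scaling_policy_types(client, resource_id))
```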

Monitor scaling activity

Watch these CloudWatch metrics (namespace AWS/SageMaker, dimensioned by EndpointName) to confirm scaling behaves as expected:

  • ApproximateBacklogSize — total requests in the queue.
  • ApproximateBacklogSizePerInstance — the target-tracking signal.
  • HasBacklogWithoutCapacity — non-zero when requests are queued but no instances are running (the scale-up-from-zero trigger).
  • Instance count — confirm the endpoint scales to zero when idle and back up under load.
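The queue metrics listed above can also be read programmatically. The sketch below wraps the CloudWatch `get_metric_statistics` call and returns recent one-minute averages in time order; the helper name and the 30-minute default window are assumptions, and the client is passed in as a parameter.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, List, Tuple


def recent_metric_averages(cloudwatch_client: Any, endpoint_name: str,
                           metric_name: str,
                           minutes: int = 30) -> List[Tuple[datetime, float]]:
    """Return (timestamp, average) pairs for an AWS/SageMaker endpoint metric."""
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric_name,
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    # CloudWatch does not guarantee datapoint order, so sort by timestamp.
    points = sorted(response.get("Datapoints", []), key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Average"]) for p in points]


# Usage, with cw_client and endpoint_name from the earlier steps:
#   recent_metric_averages(cw_client, endpoint_name, "ApproximateBacklogSize")
#   recent_metric_averages(cw_client, endpoint_name, "HasBacklogWithoutCapacity")
```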

FAQ

Q: Does the asynchronous endpoint scale down to 0 during periods of no traffic?

A: Yes. Set MinCapacity=0 when registering the scalable target. Requests received while at zero instances are queued, and the endpoint scales back up to process them. Add the HasBacklogWithoutCapacity scale-up policy so the endpoint wakes on the first queued request instead of waiting for the backlog to exceed your target value.

Q: How is this different from real-time endpoint autoscaling?

A: Real-time endpoints serve streaming and synchronous pre-recorded requests, scale on concurrent in-flight requests (ConcurrentRequestsPerModel), and require a minimum of 1 instance. Asynchronous endpoints serve pre-recorded files only, scale on queue depth (ApproximateBacklogSizePerInstance), and can scale to zero. See Auto-Scaling Real-Time Endpoints for the real-time guide.

Q: What’s the maximum input size and processing time?

A: Asynchronous endpoints accept payloads up to 1 GB and processing times up to one hour per request. For the 25 MB real-time limit and streaming details, see Deploy Deepgram on Amazon SageMaker.

Related resources

  • Auto-Scaling SageMaker Endpoints
  • Auto-Scaling Real-Time Endpoints
  • Deploy Deepgram on Amazon SageMaker
  • Amazon SageMaker Asynchronous Inference
  • Autoscale an asynchronous endpoint (AWS)