Auto-Scaling SageMaker Endpoints
Use the CloudWatch ConcurrentRequestsPerModel metric to automatically scale your Amazon SageMaker streaming endpoints based on real-time concurrency.
Deepgram streaming Speech-to-Text (STT) on Amazon SageMaker uses long-lived bidirectional connections. Because each connection holds resources for its entire duration, the number of concurrent streams is the most accurate indicator of instance load. The CloudWatch ConcurrentRequestsPerModel metric tracks in-flight requests, including queued requests, and emits data every 10 seconds. This makes it the recommended metric for scaling streaming workloads.
Before configuring auto scaling, you must have a Deepgram SageMaker Endpoint deployed and running with status InService. See Deploy Deepgram on Amazon SageMaker for setup instructions.
How it works
Amazon SageMaker integrates with AWS Application Auto Scaling to add or remove instances backing your endpoint. When you create a target tracking scaling policy with the ConcurrentRequestsPerModel metric, SageMaker:
- Monitors the number of concurrent bidirectional streaming connections per instance.
- Triggers a scale-out when concurrency exceeds your target value.
- Triggers a scale-in when concurrency drops below the target value.
Because ConcurrentRequestsPerModel is a high-resolution metric (10-second intervals), SageMaker detects the need to scale out up to 6x faster than standard one-minute metrics such as InvocationsPerInstance.
Prerequisites
- A deployed Deepgram SageMaker Endpoint with status InService
- AWS IAM permissions for Application Auto Scaling:
  - IAM Policy: AmazonSageMakerFullAccess
  - IAM Policy: Application Auto Scaling identity-based policies
- The AWS CLI installed and configured, or the AWS SDK for Python (boto3) available in your environment
Use multiple instance types for resilience
A single instance type can become temporarily unavailable in a given region or Availability Zone, which may prevent your endpoint from scaling out when traffic increases. To reduce this risk, configure your endpoint variant with multiple instance types in priority order using SageMaker’s heterogeneous instance pools. SageMaker provisions instances from your highest-priority pool first and falls back to lower-priority pools when capacity in the preferred pool is constrained.
For example, an endpoint configuration can list ml.g6.2xlarge as the preferred instance type and fall back to ml.g6e.2xlarge if the first pool is unavailable.
When you use multiple instance types, give additional consideration to your scaling policy. The predefined ConcurrentRequestsPerModel metric does not account for capacity differences between pools, so a target tracking policy that uses it directly may scale unevenly across instance types. For mixed fleets, AWS recommends driving the scaling policy from a weighted custom metric instead. See the AWS heterogeneous endpoints documentation for details on configuring weighted custom metrics for heterogeneous endpoints.
Register the scalable target
Before you can attach a scaling policy, register your SageMaker Endpoint variant as a scalable target with Application Auto Scaling. This defines the minimum and maximum instance count for horizontal scaling.
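As a minimal sketch of this registration step with boto3, the helper below wraps the Application Auto Scaling register_scalable_target call. The function names, the client passed in as an argument, and the capacity values in the usage comment are illustrative assumptions, not values taken from this guide.

```python
SERVICE_NAMESPACE = "sagemaker"
SCALABLE_DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def variant_resource_id(endpoint_name, variant_name="AllTraffic"):
    """Build the Application Auto Scaling resource ID for an endpoint variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def register_variant(client, endpoint_name, min_capacity, max_capacity,
                     variant_name="AllTraffic"):
    """Register the variant as a scalable target and return the API response.

    `client` is expected to be boto3.client("application-autoscaling").
    """
    return client.register_scalable_target(
        ServiceNamespace=SERVICE_NAMESPACE,
        ResourceId=variant_resource_id(endpoint_name, variant_name),
        ScalableDimension=SCALABLE_DIMENSION,
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )


# Example usage (requires boto3 and AWS credentials; capacities are placeholders):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   register_variant(client, "YOUR_ENDPOINT_NAME", min_capacity=1, max_capacity=4)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.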
Replace YOUR_ENDPOINT_NAME with the name of your SageMaker Endpoint. AllTraffic is the default SageMaker Endpoint Variant name assigned when you create an endpoint with a single production variant. If you configured a custom variant name, replace AllTraffic with that name. Adjust --min-capacity and --max-capacity (MinCapacity and MaxCapacity in boto3) to match your expected traffic range.
Create a target tracking scaling policy
Define a target tracking policy that uses the ConcurrentRequestsPerModel high-resolution metric. The TargetValue represents the desired number of concurrent streaming connections per instance. When the average concurrency across instances exceeds this value, SageMaker adds instances. When it drops below, SageMaker removes instances.
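A target tracking policy of this kind might be applied with boto3 roughly as follows. This is a sketch under stated assumptions: the policy name and the cooldown values are hypothetical placeholders, and the predefined metric type shown is the Application Auto Scaling identifier for the high-resolution ConcurrentRequestsPerModel metric.

```python
def target_tracking_config(target_value, scale_in_cooldown=300, scale_out_cooldown=60):
    """Build the target tracking configuration.

    The cooldown defaults (in seconds) are illustrative assumptions; tune them
    for your own workload.
    """
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType":
                "SageMakerVariantConcurrentRequestsPerModelHighResolution",
        },
        "ScaleInCooldown": scale_in_cooldown,
        "ScaleOutCooldown": scale_out_cooldown,
    }


def apply_policy(client, endpoint_name, target_value, variant_name="AllTraffic",
                 policy_name="deepgram-concurrency-target-tracking"):
    """Create or update the policy; `policy_name` is a hypothetical name.

    `client` is expected to be boto3.client("application-autoscaling").
    """
    return client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/{variant_name}",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=target_tracking_config(target_value),
    )


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   apply_policy(client, "YOUR_ENDPOINT_NAME", target_value=7)
```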
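With the AWS CLI, save the policy configuration to a file named scaling-policy.json, then apply it with aws application-autoscaling put-scaling-policy, passing the file through --target-tracking-scaling-policy-configuration file://scaling-policy.json.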
Configuration parameters
Choose a target value
The correct TargetValue depends on your instance type, Deepgram model, and feature configuration. Streaming connections hold GPU resources for the entire session, so each instance supports a finite number of concurrent streams at acceptable latency.
To determine the right target value:
- Deploy a single instance and open concurrent streams incrementally.
- Monitor response latency. The Measuring streaming latency guide describes how to benchmark.
- Identify the highest concurrency level at which average latency remains below 400 ms.
- Set TargetValue to approximately 70-80% of that limit to give the auto scaler time to add capacity before latency degrades.
For example, if an ml.g5.2xlarge instance handles 10 concurrent streams at acceptable latency, set TargetValue to 7 or 8.
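The 70-80% rule can be written as a small calculation. The function below is an illustrative helper, not part of any SDK. It uses integer arithmetic and rounds down, which errs toward scaling out slightly earlier.

```python
def target_value_range(max_streams, low_pct=70, high_pct=80):
    """Return (low, high) TargetValue candidates for a measured stream limit.

    `max_streams` is the concurrency level found by load testing a single
    instance. Results are rounded down and never drop below 1.
    """
    low = max(1, max_streams * low_pct // 100)
    high = max(low, max_streams * high_pct // 100)
    return low, high


print(target_value_range(10))  # (7, 8)
```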
If your endpoint uses heterogeneous instance pools, the predefined ConcurrentRequestsPerModel metric is not sufficient on its own because per-instance capacity varies across pools. Follow the AWS guidance on driving the scaling policy from a weighted custom metric for mixed fleets. See Use heterogeneous instance type endpoints for details.
Verify the scaling policy
After applying the policy, confirm it is active:
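One way to run this check from Python is to list the policies attached to the variant with describe_scaling_policies. The helper name and the summary tuple it returns are assumptions made for illustration.

```python
def list_policies(client, endpoint_name, variant_name="AllTraffic"):
    """Return (policy name, policy type, target value) for each attached policy.

    `client` is expected to be boto3.client("application-autoscaling").
    """
    response = client.describe_scaling_policies(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/{variant_name}",
    )
    return [
        (
            policy["PolicyName"],
            policy["PolicyType"],
            policy.get("TargetTrackingScalingPolicyConfiguration", {}).get("TargetValue"),
        )
        for policy in response["ScalingPolicies"]
    ]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   for name, kind, target in list_policies(client, "YOUR_ENDPOINT_NAME"):
#       print(name, kind, target)
```

An empty list means no policy is attached to that variant.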
You can also view the auto scaling configuration in the Amazon SageMaker console under Endpoints > your endpoint > Endpoint runtime settings.
Monitor scaling activity
Amazon CloudWatch automatically creates alarms when you apply a target tracking policy. You can monitor these alarms and the scaling activity in the CloudWatch console.
Key metrics to watch in the AWS/SageMaker namespace:
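ConcurrentRequestsPerModel, the metric that drives the policy, is one metric worth checking directly. The sketch below reads its recent values with the CloudWatch get_metric_statistics call. It assumes the metric is published with EndpointName and VariantName dimensions, and the 60-second period and 15-minute window are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def recent_concurrency(cloudwatch, endpoint_name, variant_name="AllTraffic",
                       minutes=15, now=None):
    """Return [(timestamp, max concurrency)] sorted by time.

    `cloudwatch` is expected to be boto3.client("cloudwatch"). The dimension
    names below are an assumption; confirm them in the CloudWatch console.
    """
    end = now or datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ConcurrentRequestsPerModel",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Maximum"]) for p in points]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   for ts, value in recent_concurrency(boto3.client("cloudwatch"), "YOUR_ENDPOINT_NAME"):
#       print(ts.isoformat(), value)
```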

To view scaling events:
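Scaling events can be listed from Python with describe_scaling_activities. The following is a sketch; the helper name, the fields it selects, and the default limit of 10 are illustrative assumptions.

```python
def recent_activities(client, endpoint_name, variant_name="AllTraffic", limit=10):
    """Return (start time, status code, description) for recent scaling events.

    `client` is expected to be boto3.client("application-autoscaling").
    """
    response = client.describe_scaling_activities(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/{variant_name}",
        MaxResults=limit,
    )
    return [
        (activity["StartTime"], activity["StatusCode"], activity["Description"])
        for activity in response["ScalingActivities"]
    ]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   for start, status, description in recent_activities(client, "YOUR_ENDPOINT_NAME"):
#       print(start, status, description)
```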
FAQ
Does auto-scaling scale down to 0 during periods of no traffic?
No. SageMaker managed auto scaling requires a minimum of 1 instance and scales between any minimum and maximum instance counts greater than or equal to 1. To save on infrastructure and software licensing costs, you can delete the endpoint and recreate it when needed. You can do this on a schedule (for example, a Lambda function invoked by cron triggers) or as part of your workload processing, so that an endpoint is created from a pre-configured endpoint config as needed.
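The delete-and-recreate approach could look like the sketch below, which uses the SageMaker delete_endpoint and create_endpoint calls and keeps the endpoint config in place between runs. The function names are hypothetical, and wiring them into a Lambda handler or a schedule is left out.

```python
def stop_endpoint(sagemaker, endpoint_name):
    """Delete the endpoint; the endpoint config is left in place for reuse.

    `sagemaker` is expected to be boto3.client("sagemaker").
    """
    sagemaker.delete_endpoint(EndpointName=endpoint_name)


def start_endpoint(sagemaker, endpoint_name, endpoint_config_name):
    """Recreate the endpoint from an existing config and wait until it is InService."""
    sagemaker.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
    sagemaker.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   sm = boto3.client("sagemaker")
#   stop_endpoint(sm, "YOUR_ENDPOINT_NAME")
#   start_endpoint(sm, "YOUR_ENDPOINT_NAME", "YOUR_ENDPOINT_CONFIG_NAME")
```

After recreating the endpoint, it is worth confirming that the scalable target and scaling policy are still attached, and re-applying them if they are not.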