
Auto-Scaling Real-Time Endpoints

Use the CloudWatch ConcurrentRequestsPerModel metric to automatically scale your Amazon SageMaker real-time endpoints based on concurrent in-flight requests.

A Deepgram real-time endpoint serves two kinds of requests: streaming requests over a bidirectional stream (InvokeEndpointWithBidirectionalStream, up to 30 minutes each) and synchronous pre-recorded requests (InvokeEndpoint: a single file up to 25 MB, returned in one immediate response; this is Deepgram’s “batch” API). Both are in-flight invocations that load the instance, so the number of concurrent requests is the right scaling signal, and the high-resolution ConcurrentRequestsPerModel metric measures it directly.

Need scale-to-zero? Real-time endpoints keep a minimum of one instance and cannot scale to zero. For batch workloads that can scale to zero during idle periods, see Auto-Scaling Asynchronous Endpoints.

Before configuring auto scaling, you must have a Deepgram SageMaker Endpoint deployed and running with status InService. See Deploy Deepgram on Amazon SageMaker for setup instructions.

How it works

Amazon SageMaker integrates with AWS Application Auto Scaling to add or remove instances backing your endpoint. When you create a target tracking scaling policy with the ConcurrentRequestsPerModel metric, SageMaker:

  1. Monitors the number of concurrent in-flight requests (streaming connections and synchronous pre-recorded requests) per instance.
  2. Triggers a scale-out when concurrency exceeds your target value.
  3. Triggers a scale-in when concurrency drops below the target value.

Because ConcurrentRequestsPerModel is a high-resolution metric (10-second intervals), SageMaker detects the need to scale out up to 6x faster than standard one-minute metrics such as InvocationsPerInstance.

Prerequisites

  • A deployed Deepgram SageMaker Endpoint with status InService
  • AWS IAM permissions for Application Auto Scaling:
    • IAM Policy: AmazonSageMakerFullAccess
    • IAM Policy: Application Auto Scaling identity-based policies
  • The AWS CLI installed and configured, or the AWS SDK for Python (boto3) available in your environment

Use multiple instance types for resilience

For improved availability, configure your endpoint with multiple instance types so SageMaker can fall back to an alternative pool when your preferred instance type is constrained. This applies to both real-time and asynchronous endpoints. See Use multiple instance types for resilience in the parent guide for configuration details and code examples.

Register the scalable target

Before you can attach a scaling policy, register your SageMaker Endpoint variant as a scalable target with Application Auto Scaling. This defines the minimum and maximum instance count for horizontal scaling.

AWS CLI:

    aws application-autoscaling register-scalable-target \
      --service-namespace sagemaker \
      --resource-id endpoint/YOUR_ENDPOINT_NAME/variant/AllTraffic \
      --scalable-dimension sagemaker:variant:DesiredInstanceCount \
      --min-capacity 1 \
      --max-capacity 4

Replace YOUR_ENDPOINT_NAME with the name of your SageMaker Endpoint. AllTraffic is the default SageMaker Endpoint Variant name assigned when you create an endpoint with a single production variant. If you configured a custom variant name, replace AllTraffic with that name. Adjust --min-capacity and --max-capacity to match your expected traffic range.
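
The same registration can also be done from Python with boto3. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The helper names scalable_target_args and register_endpoint_variant are illustrative and not part of any Deepgram or AWS API; only register_scalable_target is a boto3 call.

```python
def scalable_target_args(endpoint_name, variant_name="AllTraffic",
                         min_capacity=1, max_capacity=4):
    """Build the keyword arguments for register_scalable_target."""
    if min_capacity < 1:
        raise ValueError("real-time endpoints need at least 1 instance")
    if max_capacity < min_capacity:
        raise ValueError("max_capacity must be >= min_capacity")
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def register_endpoint_variant(endpoint_name, **capacity):
    """Register the variant with Application Auto Scaling (calls AWS)."""
    import boto3  # imported here so the pure helper above has no AWS dependency

    client = boto3.client("application-autoscaling")
    return client.register_scalable_target(
        **scalable_target_args(endpoint_name, **capacity)
    )
```

Calling register_endpoint_variant("YOUR_ENDPOINT_NAME", min_capacity=1, max_capacity=4) mirrors the CLI command above. Keeping the argument builder separate makes it possible to check the resource ID and capacity bounds before anything is sent to AWS.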

Create a target tracking scaling policy

Define a target tracking policy that uses the ConcurrentRequestsPerModel high-resolution metric. The TargetValue represents the desired number of concurrent in-flight requests (streaming connections and synchronous pre-recorded requests) per instance. When the average concurrency across instances exceeds this value, SageMaker adds instances. When it drops below this value, SageMaker removes instances.

AWS CLI:

Save the following policy configuration to a file named scaling-policy.json:

    {
      "TargetValue": 5.0,
      "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
      },
      "ScaleInCooldown": 300,
      "ScaleOutCooldown": 60
    }

Apply the policy:

    aws application-autoscaling put-scaling-policy \
      --policy-name deepgram-streaming-concurrency-policy \
      --service-namespace sagemaker \
      --resource-id endpoint/YOUR_ENDPOINT_NAME/variant/AllTraffic \
      --scalable-dimension sagemaker:variant:DesiredInstanceCount \
      --policy-type TargetTrackingScaling \
      --target-tracking-scaling-policy-configuration file://scaling-policy.json
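
A boto3 version of the same policy might look like the sketch below. It assumes boto3 is installed and credentials are configured. The helper names target_tracking_config and apply_scaling_policy are hypothetical; put_scaling_policy is the boto3 call, and the default values are the ones from the policy configuration above.

```python
def target_tracking_config(target_value=5.0, scale_in_cooldown=300,
                           scale_out_cooldown=60):
    """Build the target tracking configuration shown in scaling-policy.json."""
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType":
                "SageMakerVariantConcurrentRequestsPerModelHighResolution"
        },
        "ScaleInCooldown": scale_in_cooldown,
        "ScaleOutCooldown": scale_out_cooldown,
    }


def apply_scaling_policy(endpoint_name, variant_name="AllTraffic",
                         policy_name="deepgram-streaming-concurrency-policy",
                         **config):
    """Attach the policy to an already registered scalable target (calls AWS)."""
    import boto3  # imported here so the config builder has no AWS dependency

    client = boto3.client("application-autoscaling")
    return client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/{variant_name}",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=target_tracking_config(**config),
    )
```

For example, apply_scaling_policy("YOUR_ENDPOINT_NAME", target_value=7) would apply the same policy with a different target. The scalable target must already be registered for the call to succeed.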

Configuration parameters

| Parameter | Description |
| --- | --- |
| TargetValue | The target number of concurrent requests per instance. Set this based on your benchmarking results. |
| PredefinedMetricType | Use SageMakerVariantConcurrentRequestsPerModelHighResolution for the high-resolution concurrency metric. |
| ScaleOutCooldown | (Optional) Seconds to wait after a scale-out before another scale-out can occur. A lower value (such as 60) allows faster reaction to traffic spikes. |
| ScaleInCooldown | (Optional) Seconds to wait after a scale-in before another scale-in can occur. A higher value (such as 300) prevents premature removal of instances while streams are still active. |

Choose a target value

The correct TargetValue depends on your instance type, Deepgram model, and feature configuration. Streaming connections hold GPU resources for the entire session, so each instance supports a finite number of concurrent streams at acceptable latency.

To determine the right target value:

  1. Deploy a single instance and open concurrent streams incrementally.
  2. Monitor response latency. The Measuring streaming latency guide describes how to benchmark.
  3. Identify the highest concurrency level at which average latency remains below 400 ms.
  4. Set TargetValue to approximately 70-80% of that limit to give the auto scaler time to add capacity before latency degrades.

For example, if a g5.2xlarge instance handles 10 concurrent streams at acceptable latency, set TargetValue to 7 or 8.
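
Steps 3 and 4 can be sketched as a small calculation. This is an illustrative helper only, assuming you have already collected average latency per concurrency level from your own benchmark. The function names and the input shape (a dict mapping concurrency to average latency in milliseconds) are assumptions made for this example.

```python
def highest_concurrency_within(latency_by_concurrency, max_latency_ms=400.0):
    """Return the highest tested concurrency before latency reaches the limit.

    Walks the tested levels in ascending order and stops at the first level
    whose average latency is at or above max_latency_ms.
    """
    best = None
    for concurrency in sorted(latency_by_concurrency):
        if latency_by_concurrency[concurrency] >= max_latency_ms:
            break
        best = concurrency
    if best is None:
        raise ValueError("no tested concurrency level meets the latency limit")
    return best


def target_value_range(concurrency_limit, low=0.70, high=0.80):
    """Return the (low, high) TargetValue range at 70-80% of the limit."""
    return (round(concurrency_limit * low, 2), round(concurrency_limit * high, 2))
```

With the example above, target_value_range(10) returns (7.0, 8.0), which matches the suggested TargetValue of 7 or 8.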

If your endpoint uses heterogeneous instance pools, the predefined ConcurrentRequestsPerModel metric is not sufficient on its own because per-instance capacity varies across pools. Follow the AWS guidance on driving the scaling policy from a weighted custom metric for mixed fleets. See Use heterogeneous instance type endpoints for details.

Verify the scaling policy

After applying the policy, confirm it is active:

AWS CLI:

    aws application-autoscaling describe-scaling-policies \
      --service-namespace sagemaker \
      --resource-id endpoint/YOUR_ENDPOINT_NAME/variant/AllTraffic

You can also view the auto scaling configuration in the Amazon SageMaker console under Endpoints > your endpoint > Endpoint runtime settings.
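
The same check can be scripted with boto3. The sketch below assumes boto3 is installed and credentials are configured. The helper names summarize_policies and fetch_policies are hypothetical; describe_scaling_policies is the boto3 call, and the summary reads only fields from its response.

```python
def summarize_policies(response):
    """Reduce a describe_scaling_policies response to the fields worth checking."""
    summaries = []
    for policy in response.get("ScalingPolicies", []):
        config = policy.get("TargetTrackingScalingPolicyConfiguration", {})
        metric = config.get("PredefinedMetricSpecification", {})
        summaries.append({
            "name": policy.get("PolicyName"),
            "type": policy.get("PolicyType"),
            "target_value": config.get("TargetValue"),
            "metric": metric.get("PredefinedMetricType"),
            "alarms": [alarm["AlarmName"] for alarm in policy.get("Alarms", [])],
        })
    return summaries


def fetch_policies(endpoint_name, variant_name="AllTraffic"):
    """Fetch and summarize the policies attached to a variant (calls AWS)."""
    import boto3  # imported here so summarize_policies has no AWS dependency

    client = boto3.client("application-autoscaling")
    response = client.describe_scaling_policies(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/{variant_name}",
    )
    return summarize_policies(response)
```

An empty list from fetch_policies means no policy is attached to that variant. A non-empty alarms list indicates that the CloudWatch alarms backing the policy have been created.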

Monitor scaling activity

When you apply a target tracking policy, Application Auto Scaling automatically creates and manages the CloudWatch alarms that trigger scaling. You can monitor these alarms and the scaling activity in the CloudWatch console.

Key metrics to watch in the AWS/SageMaker namespace:

| Metric | Description |
| --- | --- |
| ConcurrentRequestsPerModel | Number of in-flight requests per instance, including queued requests. Emitted every 10 seconds. |
| InvocationsPerInstance | Number of invocations per instance per minute. Useful as a secondary metric. |

[Figure: CloudWatch Metrics console showing the ConcurrentRequestsPerModel metric for a SageMaker Endpoint Variant]
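
To read the concurrency metric outside the console, a boto3 query along the following lines may help. Treat it as a sketch under assumptions: it assumes the metric is published with EndpointName and VariantName dimensions and can be retrieved at a 10-second period, so verify both against your own CloudWatch console. The helper names concurrency_query and fetch_concurrency are hypothetical; get_metric_statistics is the boto3 call.

```python
from datetime import datetime, timedelta, timezone


def concurrency_query(endpoint_name, variant_name="AllTraffic",
                      minutes=15, now=None):
    """Build get_metric_statistics arguments for ConcurrentRequestsPerModel."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ConcurrentRequestsPerModel",
        # Assumed dimensions; confirm them in the CloudWatch console.
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 10,  # seconds, matching the 10-second emission interval
        "Statistics": ["Average", "Maximum"],
    }


def fetch_concurrency(endpoint_name, **options):
    """Return datapoints sorted by timestamp (calls AWS)."""
    import boto3  # imported here so the query builder has no AWS dependency

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **concurrency_query(endpoint_name, **options)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Comparing the Average values against your TargetValue shows how close the endpoint is running to the scale-out threshold.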

To view scaling events:

    aws application-autoscaling describe-scaling-activities \
      --service-namespace sagemaker \
      --resource-id endpoint/YOUR_ENDPOINT_NAME/variant/AllTraffic
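
Scaling events can also be listed from Python. The sketch below assumes boto3 is installed and credentials are configured. The helper names format_activities and fetch_activities are hypothetical; describe_scaling_activities is the boto3 call, and the formatter reads only the StartTime, StatusCode, and Description fields of its response.

```python
def format_activities(response):
    """Turn a describe_scaling_activities response into one line per event."""
    lines = []
    for activity in response.get("ScalingActivities", []):
        start = activity.get("StartTime", "")
        # boto3 returns datetime objects; fall back to str() for anything else.
        started = start.isoformat() if hasattr(start, "isoformat") else str(start)
        status = activity.get("StatusCode", "Unknown")
        description = activity.get("Description", "")
        lines.append(f"{started}  {status}  {description}")
    return lines


def fetch_activities(endpoint_name, variant_name="AllTraffic"):
    """Fetch and format recent scaling activities for a variant (calls AWS)."""
    import boto3  # imported here so format_activities has no AWS dependency

    client = boto3.client("application-autoscaling")
    response = client.describe_scaling_activities(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/{variant_name}",
    )
    return format_activities(response)
```

Printing the result of fetch_activities("YOUR_ENDPOINT_NAME") line by line gives a quick timeline of scale-out and scale-in events.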

FAQ

Does auto-scaling scale down to 0 during periods of no traffic?

Not for real-time endpoints. SageMaker managed auto-scaling for real-time endpoints requires a minimum of 1 instance and scales between your configured minimum and maximum (both ≥ 1). To reduce costs during idle periods, you can delete and recreate endpoints via scheduled orchestration or workload-triggered provisioning — or use Auto-Scaling Asynchronous Endpoints, which do support scaling to zero.


What’s Next

  • Auto-Scaling SageMaker Endpoints
  • Auto-Scaling Asynchronous Endpoints
  • Deploy Deepgram on Amazon SageMaker
  • Measuring Streaming Latency