Auto-Scaling SageMaker Endpoints
Compare real-time and asynchronous SageMaker endpoint types and choose the right autoscaling strategy for your Deepgram deployment.
Deepgram models deployed on Amazon SageMaker can automatically scale the number of instances behind an endpoint in response to load, using AWS Application Auto Scaling. How you configure autoscaling — and whether the endpoint can scale all the way down to zero — depends on which type of endpoint you deploy.
This page explains the two endpoint types and helps you choose between them. For step-by-step setup, follow the guide that matches your deployment:
Both endpoint types can transcribe pre-recorded audio — the difference is how the request is processed and returned, and that determines which signal each scales on.
A real-time endpoint serves two kinds of requests:
- Streaming audio sent over a bidirectional stream (InvokeEndpointWithBidirectionalStream), up to 30 minutes per connection, with results streamed back as the audio is processed.
- Pre-recorded audio sent with InvokeEndpoint, processed in real time and returned in one immediate, synchronous response (Deepgram’s “batch” API).
An asynchronous endpoint serves one kind of request:
- Pre-recorded audio submitted to InvokeEndpointAsync via an S3 pointer and processed asynchronously with near real-time latency. The result is written back to S3; there is no synchronous response and no streaming.
Both endpoint types integrate with AWS Application Auto Scaling. At a high level, you register the endpoint variant as a scalable target with a minimum and maximum instance count, then attach a scaling policy that tracks a load metric.
The difference between the two is the signal they scale on:
- Real-time endpoints scale on the ConcurrentRequestsPerModel metric, because each in-flight request, whether a streaming connection or a synchronous pre-recorded request, represents load on an instance.
- Asynchronous endpoints scale on the ApproximateBacklogSizePerInstance metric and allow a minimum capacity of zero, enabling scale-to-zero during idle periods.
For the exact metrics, policies, and code, follow the guide for your endpoint type below.
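As a rough illustration of how the two scaling signals differ, the sketch below builds the request parameters for Application Auto Scaling's RegisterScalableTarget and PutScalingPolicy operations as plain Python dictionaries. It makes no AWS calls. The endpoint name, variant name, policy names, and target values are hypothetical placeholders. The use of the SageMakerVariantConcurrentRequestsPerModelHighResolution predefined metric type for real-time endpoints and a customized metric specification for ApproximateBacklogSizePerInstance is an assumption based on the general AWS documentation, not something this page specifies. The endpoint-specific guides remain the reference for exact values.

```python
from typing import Any, Dict


def scalable_target(endpoint_name: str, variant_name: str,
                    min_capacity: int, max_capacity: int) -> Dict[str, Any]:
    """Parameters for RegisterScalableTarget on one endpoint variant."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def _target_ids(target: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the three fields that identify the scalable target."""
    keys = ("ServiceNamespace", "ResourceId", "ScalableDimension")
    return {key: target[key] for key in keys}


def realtime_policy(target: Dict[str, Any],
                    concurrent_requests_target: float) -> Dict[str, Any]:
    """Target tracking on in-flight requests (real-time endpoints)."""
    return {
        "PolicyName": "realtime-concurrency-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        **_target_ids(target),
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": concurrent_requests_target,
            "PredefinedMetricSpecification": {
                # Assumed predefined type for ConcurrentRequestsPerModel.
                "PredefinedMetricType":
                    "SageMakerVariantConcurrentRequestsPerModelHighResolution",
            },
        },
    }


def async_policy(target: Dict[str, Any], endpoint_name: str,
                 backlog_per_instance_target: float) -> Dict[str, Any]:
    """Target tracking on queue backlog (asynchronous endpoints)."""
    return {
        "PolicyName": "async-backlog-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        **_target_ids(target),
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": backlog_per_instance_target,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                ],
                "Statistic": "Average",
            },
        },
    }


# Hypothetical example: an async endpoint that may scale down to zero.
async_target = scalable_target("my-endpoint", "AllTraffic", 0, 4)
async_request = async_policy(async_target, "my-endpoint", 5.0)
```

With boto3 installed and credentials configured, dictionaries like these could be passed as keyword arguments to the `register_scalable_target` and `put_scaling_policy` methods of the `application-autoscaling` client. The only structural differences between the two policies are the metric specification and the minimum capacity on the target.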
A single instance type can become temporarily unavailable in a given region or Availability Zone, which may prevent your endpoint from scaling out when traffic increases. To reduce this risk, configure your endpoint variant with multiple instance types in priority order using SageMaker’s heterogeneous instance pools. SageMaker provisions instances from your highest-priority pool first and falls back to lower-priority pools when capacity in the preferred pool is constrained.
This applies to both real-time and asynchronous endpoints.
The following endpoint configuration lists ml.g6.2xlarge as the preferred instance type and falls back to ml.g6e.2xlarge if the first pool is unavailable:
When you use multiple instance types, give additional consideration to your scaling policy. The predefined scaling metrics (ConcurrentRequestsPerModel for real-time endpoints, ApproximateBacklogSizePerInstance for async endpoints) do not account for capacity differences between pools, so a target tracking policy that uses them directly may scale unevenly across instance types. For mixed fleets, AWS recommends driving the scaling policy from a weighted custom metric instead. See the AWS heterogeneous endpoints documentation for details on configuring one.
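To see why an unweighted per-instance metric can mislead on a mixed fleet, the following arithmetic sketch compares a plain per-instance average with a capacity-weighted one. The weights are hypothetical values chosen for illustration, not measured Deepgram or AWS figures, and the formula is only one plausible way to weight load, not the method AWS documents.

```python
from typing import Dict


def load_per_instance(in_flight: Dict[str, int],
                      instances: Dict[str, int]) -> float:
    """Unweighted view: total in-flight requests divided by instance count."""
    total_instances = sum(instances.values())
    if total_instances == 0:
        return 0.0
    return sum(in_flight.values()) / total_instances


def load_per_capacity_unit(in_flight: Dict[str, int],
                           instances: Dict[str, int],
                           weights: Dict[str, float]) -> float:
    """Weighted view: each instance counts as `weights[type]` capacity units."""
    capacity_units = sum(count * weights[itype]
                         for itype, count in instances.items())
    if capacity_units == 0:
        return 0.0
    return sum(in_flight.values()) / capacity_units


# Hypothetical fleet: two preferred instances and one fallback instance.
instances = {"ml.g6.2xlarge": 2, "ml.g6e.2xlarge": 1}
in_flight = {"ml.g6.2xlarge": 24, "ml.g6e.2xlarge": 24}
# Hypothetical weights: assume the fallback type handles twice the load.
weights = {"ml.g6.2xlarge": 1.0, "ml.g6e.2xlarge": 2.0}

unweighted = load_per_instance(in_flight, instances)            # 48 / 3
weighted = load_per_capacity_unit(in_flight, instances, weights)  # 48 / 4
```

Under these assumed weights the unweighted figure is 16 requests per instance while the weighted figure is 12 per capacity unit, so a policy with a target between those two values would reach different scaling decisions depending on which metric it tracks.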