Observability for Amazon SageMaker
Amazon SageMaker publishes endpoint metrics and container logs to Amazon CloudWatch automatically. This page explains which metrics matter most for Deepgram workloads, how to monitor latency for streaming and pre-recorded traffic, how to enable per-instance enhanced metrics, how to access Deepgram container logs, and how to configure alarms that alert you before performance degrades.
Before configuring observability, you must have a Deepgram SageMaker Endpoint deployed and running with status InService. See Deploy Deepgram on Amazon SageMaker for setup instructions.
CloudWatch metrics
SageMaker publishes metrics to two CloudWatch namespaces. The metrics most relevant to Deepgram streaming workloads are listed below.
Invocation metrics (AWS/SageMaker)
These metrics track request-level behavior for your endpoint variant.
Dimensions for invocation metrics: EndpointName, VariantName. Invocation metrics publish at 1-minute granularity, except ConcurrentRequestsPerModel, which publishes every 10 seconds. Each streaming session also publishes one sample to the error metrics (0 on success), so error-rate alarms cover streaming and pre-recorded traffic alike.
Endpoint instance metrics (/aws/sagemaker/Endpoints)
These metrics track resource utilization on the EC2 instances backing your endpoint.
Dimensions for endpoint instance metrics: EndpointName, VariantName. With enhanced metrics enabled, per-instance and per-GPU variants of these metrics become available.
Key metric: ConcurrentRequestsPerModel
For Deepgram streaming workloads, ConcurrentRequestsPerModel is the most important metric. It reflects the actual concurrency load because Deepgram uses long-lived bidirectional streaming connections where each connection holds GPU resources for the entire session.
Key characteristics:
- High-resolution: Emitted every 10 seconds, making it up to 6x faster than standard 1-minute metrics at detecting load changes.
- Includes queued requests: The count includes requests waiting in the SageMaker queue, not just those actively being processed.
- Directly maps to capacity: Each instance supports a finite number of concurrent streams at acceptable latency. See Auto-Scaling SageMaker Endpoints for guidance on determining the right concurrency limit for your instance type.
To query this metric with the AWS CLI:
AWS documents Minimum and Maximum as the valid statistics for ConcurrentRequestsPerModel. Use Maximum for capacity monitoring and alarms, and set --period 10 to see the full 10-second resolution.
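The same query can be sketched in Python with boto3. This is a minimal sketch rather than a reference implementation: the function names, the 15-minute window, and the "AllTraffic" variant name are assumptions chosen for illustration. Substitute your own endpoint and variant names.

```python
from datetime import datetime, timedelta, timezone


def concurrency_query(endpoint_name, variant_name="AllTraffic", minutes=15, now=None):
    """Build get_metric_statistics arguments for ConcurrentRequestsPerModel."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ConcurrentRequestsPerModel",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},  # assumed variant name
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 10,  # full 10-second resolution
        "Statistics": ["Maximum"],
    }


def peak_concurrency(datapoints):
    """Return the highest Maximum across datapoints, or None if there are none."""
    return max((point["Maximum"] for point in datapoints), default=None)


def fetch_peak_concurrency(endpoint_name, **kwargs):
    """Query CloudWatch and return the peak concurrency for the window."""
    import boto3  # imported here so the helpers above work without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**concurrency_query(endpoint_name, **kwargs))
    return peak_concurrency(response["Datapoints"])
```

Keeping the window short matters here: CloudWatch only retains sub-minute datapoints for a few hours, so a 10-second period cannot be used to look far back in time.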
Monitor latency
Total request latency has three components: network latency (client ↔ SageMaker runtime, not visible in CloudWatch — measure it client-side), OverheadLatency (SageMaker routing and platform processing), and ModelLatency (processing time inside the Deepgram container).
All SageMaker latency metrics are reported in microseconds: a 2-second threshold is 2000000.
Pre-recorded requests
ModelLatency and OverheadLatency are emitted once per InvokeEndpoint request. Use percentile statistics rather than averages — averages hide tail latency. ModelLatency scales with the duration of the submitted audio, so tune thresholds against your own traffic baseline:
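A percentile query for pre-recorded traffic might look like the following boto3 sketch. The helper names, the one-hour window, and the choice of p50/p95/p99 are illustrative assumptions. The conversion helper reflects the microsecond unit described above.

```python
from datetime import datetime, timedelta, timezone

MICROSECONDS_PER_SECOND = 1_000_000


def latency_query(endpoint_name, variant_name, metric_name="ModelLatency", hours=1, now=None):
    """Build get_metric_statistics arguments that request latency percentiles."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,  # or "OverheadLatency"
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,
        "ExtendedStatistics": ["p50", "p95", "p99"],  # percentiles, not averages
    }


def percentiles_in_seconds(datapoint):
    """Convert one datapoint's percentile values from microseconds to seconds."""
    return {
        name: value / MICROSECONDS_PER_SECOND
        for name, value in datapoint["ExtendedStatistics"].items()
    }


def fetch_latency(endpoint_name, variant_name, **kwargs):
    """Return (timestamp, percentiles-in-seconds) pairs, oldest first."""
    import boto3  # imported here so the helpers above work without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **latency_query(endpoint_name, variant_name, **kwargs)
    )
    points = sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
    return [(point["Timestamp"], percentiles_in_seconds(point)) for point in points]
```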
Streaming sessions
ModelLatency is not emitted for streaming sessions, and OverheadLatency carries no useful signal for them. Monitor FirstChunkLatency instead: it records one sample per session, measuring the time from session start until the first response chunk reaches the client — how quickly a new connection becomes productive. It supports percentile statistics, so alert on p99 (see Streaming latency alarm).
MidStreamErrors counts errors that occur after a response has started streaming. It is only populated when mid-stream failures occur — on a healthy endpoint this metric may not appear in CloudWatch at all.
Enable enhanced metrics
By default, SageMaker reports utilization metrics aggregated across all instances behind the endpoint. Because Deepgram streaming connections are long-lived and remain on the instance that accepted them, load is not always spread evenly — an endpoint-level average can look healthy while one instance or GPU is saturated.
Enhanced metrics add per-instance and per-GPU utilization visibility, the same way per-instance Log Streams isolate logs:
- CPUUtilizationNormalized and MemoryUtilization gain an InstanceId dimension.
- GPUUtilizationNormalized and GPUMemoryUtilizationNormalized gain InstanceId and per-GPU GpuId dimensions.
- The *Normalized variants report 0–100% regardless of core or GPU count, unlike their summed counterparts.
- Utilization metrics publish at a configurable interval: 10, 30, 60 (default), 120, 180, 240, or 300 seconds.
Enhanced metrics are enabled with MetricsConfig on the endpoint configuration:
AWS CLI
Python (boto3)
To enable enhanced metrics on an endpoint that is already serving traffic, create a new endpoint configuration with the same production variants plus MetricsConfig, then update the endpoint. The update runs as a blue/green deployment (typically several minutes) and the endpoint stays InService throughout; the new metrics appear once it completes. See Update an Amazon SageMaker Endpoint for how rollouts interact with long-lived streaming sessions.
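The copy-and-update flow described above can be sketched with boto3 as follows. This is a sketch under stated assumptions: the function names and the list of copied fields are illustrative, and the contents of MetricsConfig are deliberately left to the caller because this page does not spell out its field names. Check the SageMaker API reference for the exact shape before using it.

```python
# Fields carried over from the existing endpoint configuration. This list is an
# assumption for illustration; extend it if your configuration uses other options.
COPIED_FIELDS = ("ProductionVariants", "DataCaptureConfig", "KmsKeyId")


def config_with_metrics(existing_config, new_config_name, metrics_config):
    """Build create_endpoint_config arguments from an existing configuration."""
    request = {
        "EndpointConfigName": new_config_name,
        "MetricsConfig": metrics_config,  # caller-supplied; shape not defined here
    }
    for field in COPIED_FIELDS:
        if field in existing_config:
            request[field] = existing_config[field]
    return request


def enable_enhanced_metrics(endpoint_name, new_config_name, metrics_config):
    """Create a new configuration with MetricsConfig and roll the endpoint onto it."""
    import boto3  # imported here so the helper above works without boto3 installed

    sagemaker = boto3.client("sagemaker")
    current_name = sagemaker.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    existing = sagemaker.describe_endpoint_config(EndpointConfigName=current_name)
    sagemaker.create_endpoint_config(
        **config_with_metrics(existing, new_config_name, metrics_config)
    )
    sagemaker.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=new_config_name)
```

Copying the production variants from the live configuration, instead of retyping them, reduces the risk that the update changes instance types or counts by accident.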
After the deployment completes and traffic arrives, list the per-instance and per-GPU metric streams:
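In Python, the listing could be approximated with the CloudWatch list_metrics call and a small grouping step. The function names are hypothetical, and the sketch assumes the InstanceId and GpuId dimension names described earlier on this page.

```python
from collections import defaultdict


def group_streams(metrics):
    """Map each InstanceId to the sorted GpuId values seen for it.

    Instance-level streams (no GpuId dimension) yield an empty list.
    """
    gpus_by_instance = defaultdict(set)
    for metric in metrics:
        dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        if "InstanceId" not in dims:
            continue  # endpoint-level aggregate stream
        gpus = gpus_by_instance[dims["InstanceId"]]
        if "GpuId" in dims:
            gpus.add(dims["GpuId"])
    return {instance: sorted(gpus) for instance, gpus in gpus_by_instance.items()}


def list_streams(endpoint_name, metric_name="GPUUtilizationNormalized"):
    """List per-instance and per-GPU streams for one utilization metric."""
    import boto3  # imported here so group_streams works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    paginator = cloudwatch.get_paginator("list_metrics")
    metrics = []
    for page in paginator.paginate(
        Namespace="/aws/sagemaker/Endpoints",
        MetricName=metric_name,
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
    ):
        metrics.extend(page["Metrics"])
    return group_streams(metrics)
```

An empty result usually means the deployment has not finished or no traffic has arrived yet, since the streams only appear once datapoints are published.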
Metrics published at intervals below 60 seconds are high-resolution and are retained by CloudWatch for only 3 hours. CloudWatch bills utilization metrics per metric stream, and per-instance and per-GPU dimensions add streams as the endpoint scales out. See Amazon CloudWatch pricing.
CloudWatch Logs
SageMaker streams container output to Amazon CloudWatch Logs automatically. Deepgram server logs — including startup messages, inference activity, configuration application, and error details — appear in these log streams.
Log Group and Log Streams
Each SageMaker Endpoint writes to a CloudWatch Log Group named:
Within the Log Group, SageMaker creates one Log Stream per EC2 instance backing the endpoint. If auto-scaling adds instances, new Log Streams appear automatically. The Log Stream name includes the variant name and instance identifier, following the pattern:
This per-instance separation lets you isolate issues to a specific instance when diagnosing problems across a scaled-out fleet.
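As an illustration, the naming can be handled with two small helpers. The sketch assumes the standard SageMaker convention of a log group at /aws/sagemaker/Endpoints/<endpoint-name> and log streams named <variant-name>/<instance-id>. Verify both against your own account, because the function names and that layout are assumptions here.

```python
def log_group_name(endpoint_name):
    """Return the assumed CloudWatch log group for an endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def split_stream_name(stream_name):
    """Split an assumed '<variant>/<instance-id>' stream name into its parts."""
    variant, _, instance = stream_name.partition("/")
    return variant, instance


def instances_by_variant(endpoint_name):
    """List (variant, instance) pairs, most recently active stream first."""
    import boto3  # imported here so the helpers above work without boto3 installed

    logs = boto3.client("logs")
    response = logs.describe_log_streams(
        logGroupName=log_group_name(endpoint_name),
        orderBy="LastEventTime",
        descending=True,
    )
    return [split_stream_name(s["logStreamName"]) for s in response["logStreams"]]
```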
View logs in the console
Open the SageMaker console
Navigate to the Amazon SageMaker AI console and select Endpoints from the left menu.
View logs with the AWS CLI
Stream logs in near-real-time using the CloudWatch Logs Live Tail feature:
To filter for a specific Log Stream (a specific instance):
To search recent logs for errors:
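A boto3 version of an error search might look like the sketch below. The log group layout, the "ERROR" filter pattern, the 30-minute window, and the function names are all assumptions. Adjust the pattern to match how errors actually appear in your Deepgram container output.

```python
import time


def error_search(endpoint_name, minutes=30, pattern="ERROR", stream_name=None, now_ms=None):
    """Build filter_log_events arguments for a recent-error search."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # CloudWatch Logs uses epoch milliseconds
    request = {
        "logGroupName": f"/aws/sagemaker/Endpoints/{endpoint_name}",  # assumed layout
        "startTime": now_ms - minutes * 60 * 1000,
        "endTime": now_ms,
        "filterPattern": pattern,  # assumed pattern
    }
    if stream_name:
        request["logStreamNames"] = [stream_name]  # restrict to one instance
    return request


def search_errors(endpoint_name, **kwargs):
    """Yield (log stream, message) pairs for matching events."""
    import boto3  # imported here so error_search works without boto3 installed

    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    for page in paginator.paginate(**error_search(endpoint_name, **kwargs)):
        for event in page["events"]:
            yield event["logStreamName"], event["message"]
```

Passing stream_name narrows the search to a single instance, which mirrors the per-stream filter described above.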
What to look for in Deepgram logs
Configure CloudWatch alarms
CloudWatch alarms monitor a metric and trigger notifications or actions when the metric crosses a threshold. Use alarms to detect capacity issues, error spikes, and infrastructure problems before they affect users.
Recommended alarms
The following alarms cover the most critical failure and capacity scenarios for Deepgram on SageMaker.
High concurrency alarm
Alert when concurrent streams approach the capacity limit of your instances. Set the threshold to approximately 80–90% of the maximum concurrent streams your instance type can handle at acceptable latency.
AWS CLI
Python (boto3)
Adjust Threshold based on your benchmarking results. For example, if a g5.2xlarge instance handles 100 concurrent streams at acceptable latency, set the alarm threshold to 80.
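The threshold arithmetic and the alarm definition can be sketched together in Python. The alarm name, the three-period evaluation window, the 60-second period, and the helper names are illustrative assumptions. The 100-stream figure is the example from the text, not a benchmark result.

```python
def concurrency_threshold(max_streams, headroom_percent=80):
    """Return the alarm threshold as a percentage of benchmarked capacity."""
    return max_streams * headroom_percent // 100


def concurrency_alarm(endpoint_name, variant_name, threshold, sns_topic_arn):
    """Build put_metric_alarm arguments for a high-concurrency alarm."""
    return {
        "AlarmName": f"{endpoint_name}-high-concurrency",  # hypothetical name
        "Namespace": "AWS/SageMaker",
        "MetricName": "ConcurrentRequestsPerModel",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Maximum",
        "Period": 60,  # evaluates the per-minute peak of the 10-second samples
        "EvaluationPeriods": 3,  # assumed: three consecutive breaching minutes
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_concurrency_alarm(endpoint_name, variant_name, max_streams, sns_topic_arn):
    """Create or update the alarm in CloudWatch."""
    import boto3  # imported here so the helpers above work without boto3 installed

    threshold = concurrency_threshold(max_streams)
    boto3.client("cloudwatch").put_metric_alarm(
        **concurrency_alarm(endpoint_name, variant_name, threshold, sns_topic_arn)
    )
```

With the example from the text, concurrency_threshold(100) gives 80, and passing headroom_percent=90 gives 90.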
Error rate alarm
Alert when the endpoint returns errors. A sustained error rate indicates a container or model issue that needs investigation.
AWS CLI
Python (boto3)
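A minimal boto3 sketch of such an alarm follows. It assumes the standard Invocation5XXErrors metric in the AWS/SageMaker namespace. The alarm name, the threshold of five errors per minute, and the five-period window are placeholders to tune against your own traffic.

```python
def error_alarm(endpoint_name, variant_name, sns_topic_arn,
                metric_name="Invocation5XXErrors", errors_per_minute=5):
    """Build put_metric_alarm arguments for a sustained-error alarm."""
    return {
        "AlarmName": f"{endpoint_name}-{metric_name}",  # hypothetical name
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",  # count of errors per period
        "Period": 60,
        "EvaluationPeriods": 5,  # assumed: evaluate the last five minutes
        "DatapointsToAlarm": 3,  # assumed: alarm when three of them breach
        "Threshold": float(errors_per_minute),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_error_alarm(endpoint_name, variant_name, sns_topic_arn, **kwargs):
    """Create or update the alarm in CloudWatch."""
    import boto3  # imported here so error_alarm works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(
        **error_alarm(endpoint_name, variant_name, sns_topic_arn, **kwargs)
    )
```

Requiring three breaching datapoints out of five is one way to express "sustained": a single bad minute does not page anyone, while a persistent fault does.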
Streaming latency alarm
Alert when the time to first response on new streaming sessions degrades. FirstChunkLatency supports percentiles — alarm on p99 so a handful of slow session starts is caught without being averaged away. Baseline your endpoint first and set the threshold above the steady-state p99 (in microseconds).
AWS CLI
Python (boto3)
FirstChunkLatency only receives samples when new sessions start, so keep --treat-missing-data notBreaching to avoid alarming during idle periods. For pre-recorded endpoints, create the equivalent alarm on ModelLatency with p95 or p99, and remember that its value scales with the duration of the submitted audio.
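The percentile alarm could be expressed with boto3 roughly as follows. The alarm name, the evaluation window, and the function names are assumptions. The threshold is passed in seconds and converted to microseconds, since that is the unit the latency metrics use.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def latency_alarm(endpoint_name, variant_name, threshold_seconds, sns_topic_arn,
                  metric_name="FirstChunkLatency", percentile="p99"):
    """Build put_metric_alarm arguments for a percentile latency alarm."""
    return {
        "AlarmName": f"{endpoint_name}-{metric_name}-{percentile}",  # hypothetical name
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "ExtendedStatistic": percentile,  # percentiles use this instead of Statistic
        "Period": 60,
        "EvaluationPeriods": 5,  # assumed window
        "Threshold": float(threshold_seconds * MICROSECONDS_PER_SECOND),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # idle periods publish no samples
        "AlarmActions": [sns_topic_arn],
    }


def create_latency_alarm(endpoint_name, variant_name, threshold_seconds, sns_topic_arn, **kwargs):
    """Create or update the alarm in CloudWatch."""
    import boto3  # imported here so latency_alarm works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(
        **latency_alarm(endpoint_name, variant_name, threshold_seconds, sns_topic_arn, **kwargs)
    )
```

For a pre-recorded endpoint, the same helper can be called with metric_name="ModelLatency" and percentile="p95".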
GPU utilization alarm
GPU utilization is not a primary metric for Deepgram workloads — ConcurrentRequestsPerModel is a more direct indicator of load. However, GPU utilization is still useful for gauging the general level of load that a particular GPU instance is under and for detecting hardware-level saturation.
AWS CLI
Python (boto3)
For instances with a single GPU (such as g5.2xlarge), GPUUtilization ranges from 0–100%. For multi-GPU instances, the range is 0%–(100% × number of GPUs). Adjust the threshold accordingly.
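The scaling rule above can be folded into the alarm definition. In this sketch the 85% per-GPU level, the alarm name, the evaluation window, and the function names are illustrative assumptions. The metric lives in the /aws/sagemaker/Endpoints namespace described earlier.

```python
def gpu_threshold(per_gpu_percent, gpu_count):
    """Scale a per-GPU percentage to the summed GPUUtilization range."""
    return per_gpu_percent * gpu_count


def gpu_alarm(endpoint_name, variant_name, gpu_count, sns_topic_arn, per_gpu_percent=85):
    """Build put_metric_alarm arguments for a GPU saturation alarm."""
    return {
        "AlarmName": f"{endpoint_name}-gpu-utilization",  # hypothetical name
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": "GPUUtilization",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,  # assumed: five consecutive minutes
        "Threshold": float(gpu_threshold(per_gpu_percent, gpu_count)),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
    }


def create_gpu_alarm(endpoint_name, variant_name, gpu_count, sns_topic_arn, **kwargs):
    """Create or update the alarm in CloudWatch."""
    import boto3  # imported here so the helpers above work without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(
        **gpu_alarm(endpoint_name, variant_name, gpu_count, sns_topic_arn, **kwargs)
    )
```

On a single-GPU instance the threshold stays at 85, while a four-GPU instance yields 340 on the summed scale.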
Alarm actions
Each alarm example above sends a notification to an Amazon SNS topic. Create an SNS topic and subscribe your preferred notification channel (email, Slack via AWS Chatbot, PagerDuty, or a Lambda function) before configuring alarms.
View alarms
List active alarms for your endpoint:
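One way to do this from Python is to page through describe_alarms and keep the alarms whose dimensions name the endpoint. The function names are hypothetical, and the sketch filters on the EndpointName dimension instead of relying on any alarm naming scheme.

```python
def alarms_for_endpoint(alarms, endpoint_name):
    """Keep alarms that carry an EndpointName dimension equal to endpoint_name."""
    matching = []
    for alarm in alarms:
        dims = {d["Name"]: d["Value"] for d in alarm.get("Dimensions", [])}
        if dims.get("EndpointName") == endpoint_name:
            matching.append(alarm)
    return matching


def active_alarms(endpoint_name):
    """Return names of alarms for the endpoint that are currently in ALARM state."""
    import boto3  # imported here so alarms_for_endpoint works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    paginator = cloudwatch.get_paginator("describe_alarms")
    alarms = []
    for page in paginator.paginate(StateValue="ALARM"):
        alarms.extend(page["MetricAlarms"])
    return [alarm["AlarmName"] for alarm in alarms_for_endpoint(alarms, endpoint_name)]
```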
You can also view alarms in the CloudWatch console under Alarms → All alarms.
Build a CloudWatch dashboard
Combine the key metrics into a single dashboard for at-a-glance monitoring. The following AWS CLI command creates a dashboard with widgets for concurrency, GPU utilization, errors, and latency.
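An equivalent dashboard can be assembled in Python and published with put_dashboard. This is a sketch: the widget titles, the two-column layout, the dashboard name, and the function names are assumptions, and the metric selection simply mirrors the four areas named above.

```python
import json


def dashboard_body(endpoint_name, variant_name, region):
    """Return a CloudWatch dashboard body (JSON string) with four metric widgets."""
    dims = ["EndpointName", endpoint_name, "VariantName", variant_name]
    # (title, namespace, metric names, statistic)
    panels = [
        ("Concurrency", "AWS/SageMaker", ["ConcurrentRequestsPerModel"], "Maximum"),
        ("GPU utilization", "/aws/sagemaker/Endpoints", ["GPUUtilization"], "Average"),
        ("Errors", "AWS/SageMaker", ["Invocation4XXErrors", "Invocation5XXErrors"], "Sum"),
        ("Latency (p99)", "AWS/SageMaker", ["FirstChunkLatency", "ModelLatency"], "p99"),
    ]
    widgets = []
    for index, (title, namespace, names, stat) in enumerate(panels):
        widgets.append({
            "type": "metric",
            "x": (index % 2) * 12,  # two widgets per row on the 24-column grid
            "y": (index // 2) * 6,
            "width": 12,
            "height": 6,
            "properties": {
                "title": title,
                "region": region,
                "stat": stat,
                "period": 60,
                "metrics": [[namespace, name, *dims] for name in names],
            },
        })
    return json.dumps({"widgets": widgets})


def create_dashboard(endpoint_name, variant_name, region):
    """Create or replace the dashboard in CloudWatch."""
    import boto3  # imported here so dashboard_body works without boto3 installed

    boto3.client("cloudwatch", region_name=region).put_dashboard(
        DashboardName=f"{endpoint_name}-observability",  # hypothetical name
        DashboardBody=dashboard_body(endpoint_name, variant_name, region),
    )
```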
After creating the dashboard, view it in the CloudWatch console under Dashboards.