Observability for Amazon SageMaker
Amazon SageMaker publishes endpoint metrics and container logs to Amazon CloudWatch automatically. This page explains which metrics matter most for Deepgram workloads, how to access Deepgram container logs, and how to configure alarms that alert you before performance degrades.
Before configuring observability, you must have a Deepgram SageMaker Endpoint deployed and running with status InService. See Deploy Deepgram on Amazon SageMaker for setup instructions.
CloudWatch metrics
SageMaker publishes metrics to two CloudWatch namespaces. The metrics most relevant to Deepgram streaming workloads are listed below.
Invocation metrics (AWS/SageMaker)
These metrics track request-level behavior for your endpoint variant. The most relevant include:
- Invocations and InvocationsPerInstance: the number of requests sent to the endpoint, in total and per instance.
- ModelLatency: the time the container takes to respond to a request, in microseconds.
- Invocation4XXErrors and Invocation5XXErrors: requests that returned client-side and server-side errors, respectively.
- ConcurrentRequestsPerModel: the number of requests currently open against the model, including queued requests (covered in detail below).
Dimensions for invocation metrics: EndpointName, VariantName.
Endpoint instance metrics (/aws/sagemaker/Endpoints)
These metrics track resource utilization on the EC2 instances backing your endpoint:
- CPUUtilization and MemoryUtilization: CPU and memory usage as a percentage (CPU can exceed 100% on multi-vCPU instances).
- GPUUtilization and GPUMemoryUtilization: GPU compute and GPU memory usage as a percentage (can exceed 100% on multi-GPU instances).
- DiskUtilization: disk space used as a percentage.
Dimensions for endpoint instance metrics: EndpointName, VariantName.
Key metric: ConcurrentRequestsPerModel
For Deepgram streaming workloads, ConcurrentRequestsPerModel is the most important metric. It reflects the actual concurrency load because Deepgram uses long-lived bidirectional streaming connections where each connection holds GPU resources for the entire session.
Key characteristics:
- High-resolution: Emitted every 10 seconds, making it up to 6x faster than standard 1-minute metrics at detecting load changes.
- Includes queued requests: The count includes requests waiting in the SageMaker queue, not just those actively being processed.
- Directly maps to capacity: Each instance supports a finite number of concurrent streams at acceptable latency. See Auto-Scaling SageMaker Endpoints for guidance on determining the right concurrency limit for your instance type.
To query this metric with the AWS CLI:
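The following is an illustrative query; the endpoint name my-deepgram-endpoint and variant name AllTraffic are placeholders, and the GNU-style date invocation may need adjusting on macOS/BSD:

```shell
# Maximum concurrent streams over the last hour, at 10-second resolution.
# Note: sub-minute periods are only available for roughly the most recent
# 3 hours of high-resolution data.
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name ConcurrentRequestsPerModel \
  --dimensions Name=EndpointName,Value=my-deepgram-endpoint \
               Name=VariantName,Value=AllTraffic \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 10 \
  --statistics Maximum
```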
CloudWatch Logs
SageMaker streams container output to Amazon CloudWatch Logs automatically. Deepgram server logs — including startup messages, inference activity, configuration application, and error details — appear in these log streams.
Log Group and Log Streams
Each SageMaker Endpoint writes to a CloudWatch Log Group named /aws/sagemaker/Endpoints/&lt;endpoint-name&gt;, where &lt;endpoint-name&gt; is the name of your SageMaker Endpoint.
Within the Log Group, SageMaker creates one Log Stream per EC2 instance backing the endpoint. If auto-scaling adds instances, new Log Streams appear automatically. The Log Stream name includes the variant name and instance identifier, following the pattern &lt;variant-name&gt;/&lt;instance-id&gt;.
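To enumerate the Log Streams (one per instance) for an endpoint's Log Group, you can run something like the following; the endpoint name is a placeholder:

```shell
# List Log Stream names, most recently active first.
aws logs describe-log-streams \
  --log-group-name /aws/sagemaker/Endpoints/my-deepgram-endpoint \
  --order-by LastEventTime \
  --descending \
  --query 'logStreams[].logStreamName'
```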
This per-instance separation lets you isolate issues to a specific instance when diagnosing problems across a scaled-out fleet.
View logs in the console
1. Open the SageMaker console: Navigate to the Amazon SageMaker AI console and select Endpoints from the left menu.
2. Select your endpoint, then in the Monitor section choose View logs to open the endpoint's Log Group in CloudWatch Logs.
3. Select a Log Stream to view the output of an individual instance.
View logs with the AWS CLI
Stream logs in near-real-time using the CloudWatch Logs Live Tail feature:
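For example (Live Tail requires AWS CLI v2; the region, account ID, and endpoint name in the Log Group ARN below are placeholders):

```shell
# Live-tail all Log Streams in the endpoint's Log Group.
aws logs start-live-tail \
  --log-group-identifiers \
  arn:aws:logs:us-east-1:123456789012:log-group:/aws/sagemaker/Endpoints/my-deepgram-endpoint
```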
To filter for a specific Log Stream (a specific instance):
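An illustrative sketch; the Log Group ARN and the Log Stream name are placeholders to replace with values from your own account:

```shell
# Live-tail only the stream for one backing instance.
aws logs start-live-tail \
  --log-group-identifiers \
  arn:aws:logs:us-east-1:123456789012:log-group:/aws/sagemaker/Endpoints/my-deepgram-endpoint \
  --log-stream-names AllTraffic/i-0123456789abcdef0
```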
To search recent logs for errors:
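For example, a sketch that scans the last hour of logs (endpoint name is a placeholder):

```shell
# Find log events from the last hour whose message contains "ERROR".
# --start-time is expressed in milliseconds since the epoch.
aws logs filter-log-events \
  --log-group-name /aws/sagemaker/Endpoints/my-deepgram-endpoint \
  --filter-pattern "ERROR" \
  --start-time "$(( ($(date +%s) - 3600) * 1000 ))" \
  --query 'events[].message' \
  --output text
```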
What to look for in Deepgram logs
Configure CloudWatch alarms
CloudWatch alarms monitor a metric and trigger notifications or actions when the metric crosses a threshold. Use alarms to detect capacity issues, error spikes, and infrastructure problems before they affect users.
Recommended alarms
The following alarms cover the most critical failure and capacity scenarios for Deepgram on SageMaker.
High concurrency alarm
Alert when concurrent streams approach the capacity limit of your instances. Set the threshold to approximately 80–90% of the maximum concurrent streams your instance type can handle at acceptable latency.
AWS CLI
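An illustrative command; the alarm name, endpoint name, threshold, and SNS topic ARN are placeholders to replace with your own values:

```shell
# Alarm when max concurrency stays above 80 streams for 3 consecutive minutes.
aws cloudwatch put-metric-alarm \
  --alarm-name deepgram-high-concurrency \
  --namespace AWS/SageMaker \
  --metric-name ConcurrentRequestsPerModel \
  --dimensions Name=EndpointName,Value=my-deepgram-endpoint \
               Name=VariantName,Value=AllTraffic \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:deepgram-alerts
```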
Python (boto3)
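An illustrative sketch using boto3's put_metric_alarm; the helper name, endpoint name, and SNS topic ARN are placeholders:

```python
def create_high_concurrency_alarm(cloudwatch, endpoint_name, sns_topic_arn, threshold=80):
    """Alarm when max concurrency exceeds the threshold for 3 consecutive minutes."""
    cloudwatch.put_metric_alarm(
        AlarmName="deepgram-high-concurrency",
        Namespace="AWS/SageMaker",
        MetricName="ConcurrentRequestsPerModel",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=3,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )

# Usage (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# create_high_concurrency_alarm(
#     boto3.client("cloudwatch"),
#     "my-deepgram-endpoint",
#     "arn:aws:sns:us-east-1:123456789012:deepgram-alerts",
# )
```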
Adjust Threshold based on your benchmarking results. For example, if a g5.2xlarge instance handles 100 concurrent streams at acceptable latency, set the alarm threshold to 80.
Error rate alarm
Alert when the endpoint returns errors. A sustained error rate indicates a container or model issue that needs investigation.
AWS CLI
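An illustrative command using the Invocation5XXErrors metric; the alarm name, endpoint name, threshold, and SNS topic ARN are placeholders:

```shell
# Alarm when more than 5 server-side errors occur per minute
# for 2 consecutive minutes.
aws cloudwatch put-metric-alarm \
  --alarm-name deepgram-5xx-errors \
  --namespace AWS/SageMaker \
  --metric-name Invocation5XXErrors \
  --dimensions Name=EndpointName,Value=my-deepgram-endpoint \
               Name=VariantName,Value=AllTraffic \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:deepgram-alerts
```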
Python (boto3)
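An illustrative boto3 sketch; the helper name and all resource names are placeholders:

```python
def create_error_rate_alarm(cloudwatch, endpoint_name, sns_topic_arn, threshold=5):
    """Alarm when 5XX errors exceed the threshold per minute for 2 consecutive minutes."""
    cloudwatch.put_metric_alarm(
        AlarmName="deepgram-5xx-errors",
        Namespace="AWS/SageMaker",
        MetricName="Invocation5XXErrors",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )

# Usage (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# create_error_rate_alarm(
#     boto3.client("cloudwatch"),
#     "my-deepgram-endpoint",
#     "arn:aws:sns:us-east-1:123456789012:deepgram-alerts",
# )
```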
GPU utilization alarm
GPU utilization is not a primary metric for Deepgram workloads — ConcurrentRequestsPerModel is a more direct indicator of load. However, GPU utilization remains useful for gauging the overall load on a GPU instance and for detecting hardware-level saturation.
AWS CLI
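An illustrative command for a single-GPU instance; the alarm name, endpoint name, threshold, and SNS topic ARN are placeholders:

```shell
# Alarm when average GPU utilization stays above 85% for 5 consecutive
# minutes (threshold assumes a single-GPU instance; see note below for
# multi-GPU instances).
aws cloudwatch put-metric-alarm \
  --alarm-name deepgram-gpu-utilization \
  --namespace /aws/sagemaker/Endpoints \
  --metric-name GPUUtilization \
  --dimensions Name=EndpointName,Value=my-deepgram-endpoint \
               Name=VariantName,Value=AllTraffic \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:deepgram-alerts
```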
Python (boto3)
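An illustrative boto3 sketch; the helper name and all resource names are placeholders, and the default threshold assumes a single-GPU instance:

```python
def create_gpu_utilization_alarm(cloudwatch, endpoint_name, sns_topic_arn, threshold=85):
    """Alarm when average GPU utilization exceeds the threshold for 5 consecutive minutes."""
    cloudwatch.put_metric_alarm(
        AlarmName="deepgram-gpu-utilization",
        Namespace="/aws/sagemaker/Endpoints",
        MetricName="GPUUtilization",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=5,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )

# Usage (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# create_gpu_utilization_alarm(
#     boto3.client("cloudwatch"),
#     "my-deepgram-endpoint",
#     "arn:aws:sns:us-east-1:123456789012:deepgram-alerts",
# )
```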
For instances with a single GPU (such as g5.2xlarge), GPUUtilization ranges from 0–100%. For multi-GPU instances, the range is 0%–(100% × number of GPUs). Adjust the threshold accordingly.
Alarm actions
Each alarm example above sends a notification to an Amazon SNS topic. Create an SNS topic and subscribe your preferred notification channel (email, Slack via AWS Chatbot, PagerDuty, or a Lambda function) before configuring alarms.
View alarms
List active alarms for your endpoint:
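For example, if your alarm names share a common prefix such as deepgram- (a placeholder naming convention):

```shell
# Show the name and current state of each matching alarm.
aws cloudwatch describe-alarms \
  --alarm-name-prefix deepgram- \
  --query 'MetricAlarms[].{Name:AlarmName,State:StateValue}'
```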
You can also view alarms in the CloudWatch console under Alarms → All alarms.
Build a CloudWatch dashboard
Combine the key metrics into a single dashboard for at-a-glance monitoring. The following AWS CLI command creates a dashboard with widgets for concurrency, GPU utilization, and errors.
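An illustrative sketch; the dashboard name, region, endpoint name, and widget layout are placeholders to adapt to your deployment:

```shell
# Create (or overwrite) a dashboard with concurrency, GPU, and error widgets.
aws cloudwatch put-dashboard \
  --dashboard-name deepgram-sagemaker \
  --dashboard-body '{
    "widgets": [
      {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
       "properties": {
         "title": "Concurrent streams",
         "region": "us-east-1", "stat": "Maximum", "period": 60,
         "metrics": [["AWS/SageMaker", "ConcurrentRequestsPerModel",
                      "EndpointName", "my-deepgram-endpoint",
                      "VariantName", "AllTraffic"]]}},
      {"type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
       "properties": {
         "title": "GPU utilization",
         "region": "us-east-1", "stat": "Average", "period": 60,
         "metrics": [["/aws/sagemaker/Endpoints", "GPUUtilization",
                      "EndpointName", "my-deepgram-endpoint",
                      "VariantName", "AllTraffic"]]}},
      {"type": "metric", "x": 0, "y": 6, "width": 12, "height": 6,
       "properties": {
         "title": "5XX errors",
         "region": "us-east-1", "stat": "Sum", "period": 60,
         "metrics": [["AWS/SageMaker", "Invocation5XXErrors",
                      "EndpointName", "my-deepgram-endpoint",
                      "VariantName", "AllTraffic"]]}}
    ]
  }'
```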
After creating the dashboard, view it in the CloudWatch console under Dashboards.