Observability for Amazon SageMaker

Monitor Deepgram SageMaker Endpoints with Amazon CloudWatch metrics, container logs, and alarms.

Amazon SageMaker publishes endpoint metrics and container logs to Amazon CloudWatch automatically. This page explains which metrics matter most for Deepgram workloads, how to access Deepgram container logs, and how to configure alarms that alert you before performance degrades.

Before configuring observability, you must have a Deepgram SageMaker Endpoint deployed and running with status InService. See Deploy Deepgram on Amazon SageMaker for setup instructions.

CloudWatch metrics

SageMaker publishes metrics to two CloudWatch namespaces. The metrics most relevant to Deepgram streaming workloads are listed below.

Invocation metrics (AWS/SageMaker)

These metrics track request-level behavior for your endpoint variant.

| Metric | Description | Relevance to Deepgram |
| --- | --- | --- |
| ConcurrentRequestsPerModel | Number of in-flight requests per instance, including queued requests. Emitted every 10 seconds (high-resolution). | Primary metric for Deepgram streaming. Each bidirectional streaming connection holds GPU resources for its entire duration, so this metric directly reflects the load on each instance. |
| Invocations | Total number of InvokeEndpoint requests sent to the endpoint. | Useful for tracking overall request volume and correlating with billing. |
| InvocationsPerInstance | Invocations normalized by instance count. Emitted every minute. | Secondary throughput indicator. Less useful for streaming workloads than ConcurrentRequestsPerModel because it counts completed invocations, not active connections. |
| ModelLatency | Time taken by the Deepgram container to process a request and return a response. Units: microseconds. | Tracks inference latency inside the container. For streaming, this measures the per-chunk response time. |
| OverheadLatency | Time added by SageMaker infrastructure (request routing, response serialization) outside the model container. Units: microseconds. | High values indicate SageMaker platform overhead rather than Deepgram processing delays. |
| Invocation4XXErrors | Count of requests that returned a 4xx HTTP status. | Indicates client-side issues such as malformed requests or unsupported parameters. |
| Invocation5XXErrors | Count of requests that returned a 5xx HTTP status. | Indicates server-side failures. Spikes may signal container crashes, out-of-memory errors, or GPU issues. |
| InvocationModelErrors | Count of requests that did not result in a 2xx response, including timeouts and malformed responses. | Broader error metric that captures both 4xx and 5xx errors plus low-level failures. |

Dimensions for invocation metrics: EndpointName, VariantName.
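
To confirm that these metrics are being published for your endpoint, you can list them with the AWS CLI, using the same placeholders as the examples later on this page:

aws cloudwatch list-metrics \
  --namespace AWS/SageMaker \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
  --region YOUR_AWS_REGION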

Endpoint instance metrics (/aws/sagemaker/Endpoints)

These metrics track resource utilization on the EC2 instances backing your endpoint.

| Metric | Description | Relevance to Deepgram |
| --- | --- | --- |
| CPUUtilization | Sum of utilization across all CPU cores. Range: 0%–(100% × number of cores). | Deepgram is primarily GPU-bound, but high CPU utilization can indicate bottlenecks in audio preprocessing or API request handling. |
| GPUUtilization | Sum of utilization across all GPUs. Range: 0%–(100% × number of GPUs). | Key resource metric. Deepgram inference runs on GPU. Sustained values near maximum indicate the instance is at capacity. |
| GPUMemoryUtilization | Percentage of GPU memory in use, summed across GPUs. | High GPU memory utilization can preclude loading additional models or serving more concurrent streams. |
| MemoryUtilization | Percentage of system RAM in use. | Monitor to detect memory leaks or insufficient instance sizing. |
| DiskUtilization | Percentage of disk space in use. | Models and logs consume disk. Alert if utilization approaches capacity. |

Dimensions for endpoint instance metrics: EndpointName, VariantName.
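
These instance metrics live in the /aws/sagemaker/Endpoints namespace rather than AWS/SageMaker, so queries must name that namespace explicitly or they return no data points. For example, to retrieve average GPU utilization over the past hour:

aws cloudwatch get-metric-statistics \
  --namespace /aws/sagemaker/Endpoints \
  --metric-name GPUUtilization \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
               Name=VariantName,Value=AllTraffic \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average \
  --region YOUR_AWS_REGION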

Key metric: ConcurrentRequestsPerModel

For Deepgram streaming workloads, ConcurrentRequestsPerModel is the most important metric. It reflects the actual concurrency load because Deepgram uses long-lived bidirectional streaming connections where each connection holds GPU resources for the entire session.

Key characteristics:

  • High-resolution: Emitted every 10 seconds, making it up to 6x faster than standard 1-minute metrics at detecting load changes.
  • Includes queued requests: The count includes requests waiting in the SageMaker queue, not just those actively being processed.
  • Directly maps to capacity: Each instance supports a finite number of concurrent streams at acceptable latency. See Auto-Scaling SageMaker Endpoints for guidance on determining the right concurrency limit for your instance type.

To query this metric with the AWS CLI:

aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name ConcurrentRequestsPerModel \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
               Name=VariantName,Value=AllTraffic \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average Maximum \
  --region YOUR_AWS_REGION

CloudWatch Logs

SageMaker streams container output to Amazon CloudWatch Logs automatically. Deepgram server logs — including startup messages, inference activity, configuration application, and error details — appear in these log streams.

Log Group and Log Streams

Each SageMaker Endpoint writes to a CloudWatch Log Group named:

/aws/sagemaker/Endpoints/YOUR_ENDPOINT_NAME

Within the Log Group, SageMaker creates one Log Stream per EC2 instance backing the endpoint. If auto-scaling adds instances, new Log Streams appear automatically. The Log Stream name includes the variant name and instance identifier, following the pattern:

AllTraffic/i-0123456789abcdef0

This per-instance separation lets you isolate issues to a specific instance when diagnosing problems across a scaled-out fleet.
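
When diagnosing a specific instance, you can list the endpoint's Log Streams sorted by most recent activity:

aws logs describe-log-streams \
  --log-group-name /aws/sagemaker/Endpoints/YOUR_ENDPOINT_NAME \
  --order-by LastEventTime \
  --descending \
  --region YOUR_AWS_REGION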

View logs in the console

1. Open the SageMaker console: navigate to the Amazon SageMaker AI console and select Endpoints from the left menu.
2. Select your endpoint: click the endpoint name to open its details page.
3. Open CloudWatch Logs: under Monitor, click the View logs link. This opens the CloudWatch Log Group for the endpoint.
4. Select a Log Stream: choose a Log Stream corresponding to the instance you want to inspect. Each stream contains the full Deepgram container output for that instance.

View logs with the AWS CLI

Stream logs in near-real-time with the AWS CLI tail command:

aws logs tail /aws/sagemaker/Endpoints/YOUR_ENDPOINT_NAME \
  --follow \
  --region YOUR_AWS_REGION

To narrow the output to specific Log Streams, pass a stream name prefix. The variant name alone matches all of its instances; append an instance ID (for example, AllTraffic/i-0123456789abcdef0) to target a single instance:

aws logs tail /aws/sagemaker/Endpoints/YOUR_ENDPOINT_NAME \
  --follow \
  --log-stream-name-prefix AllTraffic/ \
  --region YOUR_AWS_REGION

To search recent logs for errors:

aws logs filter-log-events \
  --log-group-name /aws/sagemaker/Endpoints/YOUR_ENDPOINT_NAME \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s000) \
  --region YOUR_AWS_REGION

What to look for in Deepgram logs

| Log pattern | Meaning |
| --- | --- |
| INFO Starting Deepgram SageMaker configuration... | Container is applying environment variable overrides. See Configure Amazon SageMaker Deployments. |
| INFO Configuration complete. | Environment variables were applied successfully. |
| INFO Deepgram Engine is ready | The inference engine has loaded models and is accepting requests. |
| ERROR | An error occurred. Check the message for details; common causes include GPU initialization failures, model loading errors, or out-of-memory conditions. |
| thread 'main' panicked at | A Rust-level application panic. The container process has crashed and the instance will restart. Capture the full backtrace from the log stream for diagnosis. |
| WARNING entries containing skipped | An environment variable override was not applied, usually due to a parse error in the TOML expression. |
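
If you want to alarm on these log patterns rather than just search for them, one approach is a CloudWatch metric filter that counts ERROR lines. This is a sketch; the metric name and namespace below are arbitrary choices for illustration, not Deepgram conventions:

aws logs put-metric-filter \
  --log-group-name /aws/sagemaker/Endpoints/YOUR_ENDPOINT_NAME \
  --filter-name deepgram-error-count \
  --filter-pattern "ERROR" \
  --metric-transformations \
      metricName=DeepgramErrorCount,metricNamespace=Custom/Deepgram,metricValue=1,defaultValue=0 \
  --region YOUR_AWS_REGION

The resulting DeepgramErrorCount metric can then drive an alarm like those in the next section.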

Configure CloudWatch alarms

CloudWatch alarms monitor a metric and trigger notifications or actions when the metric crosses a threshold. Use alarms to detect capacity issues, error spikes, and infrastructure problems before they affect users.

The following alarms cover the most critical failure and capacity scenarios for Deepgram on SageMaker.

High concurrency alarm

Alert when concurrent streams approach the capacity limit of your instances. Set the threshold to approximately 80–90% of the maximum concurrent streams your instance type can handle at acceptable latency.

aws cloudwatch put-metric-alarm \
  --alarm-name deepgram-high-concurrency \
  --namespace AWS/SageMaker \
  --metric-name ConcurrentRequestsPerModel \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
               Name=VariantName,Value=AllTraffic \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:YOUR_REGION:YOUR_ACCOUNT_ID:YOUR_SNS_TOPIC \
  --region YOUR_AWS_REGION

Adjust the --threshold value based on your benchmarking results. For example, if a g5.2xlarge instance handles 100 concurrent streams at acceptable latency, set the alarm threshold to 80.

Error rate alarm

Alert when the endpoint returns errors. A sustained error rate indicates a container or model issue that needs investigation.

aws cloudwatch put-metric-alarm \
  --alarm-name deepgram-invocation-errors \
  --namespace AWS/SageMaker \
  --metric-name Invocation5XXErrors \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
               Name=VariantName,Value=AllTraffic \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:YOUR_REGION:YOUR_ACCOUNT_ID:YOUR_SNS_TOPIC \
  --region YOUR_AWS_REGION
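
The alarm above fires on an absolute count of 5xx responses. If you would rather alarm on an error rate, CloudWatch metric math can divide Invocation5XXErrors by Invocations. The following is a sketch; the 1% threshold and 5-minute period are illustrative, not Deepgram recommendations:

aws cloudwatch put-metric-alarm \
  --alarm-name deepgram-error-rate \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:YOUR_REGION:YOUR_ACCOUNT_ID:YOUR_SNS_TOPIC \
  --metrics '[
    {"Id": "rate", "Label": "5xx error rate (%)",
     "Expression": "100 * errors / invocations"},
    {"Id": "errors", "ReturnData": false,
     "MetricStat": {"Stat": "Sum", "Period": 300,
       "Metric": {"Namespace": "AWS/SageMaker", "MetricName": "Invocation5XXErrors",
         "Dimensions": [{"Name": "EndpointName", "Value": "YOUR_ENDPOINT_NAME"},
                        {"Name": "VariantName", "Value": "AllTraffic"}]}}},
    {"Id": "invocations", "ReturnData": false,
     "MetricStat": {"Stat": "Sum", "Period": 300,
       "Metric": {"Namespace": "AWS/SageMaker", "MetricName": "Invocations",
         "Dimensions": [{"Name": "EndpointName", "Value": "YOUR_ENDPOINT_NAME"},
                        {"Name": "VariantName", "Value": "AllTraffic"}]}}}
  ]' \
  --region YOUR_AWS_REGION

Periods with zero traffic produce no data points for the expression; add --treat-missing-data notBreaching if that should not trigger the alarm.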

GPU utilization alarm

GPU utilization is not a primary metric for Deepgram workloads — ConcurrentRequestsPerModel is a more direct indicator of load. However, GPU utilization is still useful for gauging the general level of load that a particular GPU instance is under and for detecting hardware-level saturation.

aws cloudwatch put-metric-alarm \
  --alarm-name deepgram-high-gpu \
  --namespace /aws/sagemaker/Endpoints \
  --metric-name GPUUtilization \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
               Name=VariantName,Value=AllTraffic \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:YOUR_REGION:YOUR_ACCOUNT_ID:YOUR_SNS_TOPIC \
  --region YOUR_AWS_REGION

For instances with a single GPU (such as g5.2xlarge), GPUUtilization ranges from 0–100%. For multi-GPU instances, the range is 0%–(100% × number of GPUs). Adjust the threshold accordingly.

Alarm actions

Each alarm example above sends a notification to an Amazon SNS topic. Create an SNS topic and subscribe your preferred notification channel (email, Slack via AWS Chatbot, PagerDuty, or a Lambda function) before configuring alarms. Note that email subscriptions must be confirmed from the recipient's inbox before notifications are delivered.

aws sns create-topic --name deepgram-sagemaker-alerts --region YOUR_AWS_REGION

aws sns subscribe \
  --topic-arn arn:aws:sns:YOUR_REGION:YOUR_ACCOUNT_ID:deepgram-sagemaker-alerts \
  --protocol email \
  --notification-endpoint your-team@example.com \
  --region YOUR_AWS_REGION

View alarms

List active alarms for your endpoint:

aws cloudwatch describe-alarms \
  --alarm-name-prefix deepgram- \
  --region YOUR_AWS_REGION

You can also view alarms in the CloudWatch console under Alarms > All alarms.
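
To confirm the alarm-to-SNS notification path end to end, you can temporarily force an alarm into the ALARM state; CloudWatch reverts it to its computed state at the next evaluation:

aws cloudwatch set-alarm-state \
  --alarm-name deepgram-high-concurrency \
  --state-value ALARM \
  --state-reason "Testing notification delivery" \
  --region YOUR_AWS_REGION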

Build a CloudWatch dashboard

Combine the key metrics into a single dashboard for at-a-glance monitoring. The following AWS CLI command creates a dashboard with widgets for concurrency, GPU utilization, and errors.

# Create a CloudWatch dashboard
aws cloudwatch put-dashboard \
  --dashboard-name deepgram-sagemaker \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "properties": {
          "title": "Concurrent Requests Per Model",
          "metrics": [
            ["AWS/SageMaker", "ConcurrentRequestsPerModel",
             "EndpointName", "YOUR_ENDPOINT_NAME",
             "VariantName", "AllTraffic"]
          ],
          "period": 60,
          "stat": "Maximum",
          "region": "YOUR_AWS_REGION"
        }
      },
      {
        "type": "metric",
        "properties": {
          "title": "GPU Utilization",
          "metrics": [
            ["/aws/sagemaker/Endpoints", "GPUUtilization",
             "EndpointName", "YOUR_ENDPOINT_NAME",
             "VariantName", "AllTraffic"]
          ],
          "period": 300,
          "stat": "Average",
          "region": "YOUR_AWS_REGION"
        }
      },
      {
        "type": "metric",
        "properties": {
          "title": "Invocation Errors (5xx)",
          "metrics": [
            ["AWS/SageMaker", "Invocation5XXErrors",
             "EndpointName", "YOUR_ENDPOINT_NAME",
             "VariantName", "AllTraffic"]
          ],
          "period": 300,
          "stat": "Sum",
          "region": "YOUR_AWS_REGION"
        }
      },
      {
        "type": "metric",
        "properties": {
          "title": "Model Latency (avg, microseconds)",
          "metrics": [
            ["AWS/SageMaker", "ModelLatency",
             "EndpointName", "YOUR_ENDPOINT_NAME",
             "VariantName", "AllTraffic"]
          ],
          "period": 60,
          "stat": "Average",
          "region": "YOUR_AWS_REGION"
        }
      }
    ]
  }' \
  --region YOUR_AWS_REGION

After creating the dashboard, view it in the CloudWatch console under Dashboards.
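
To verify that the dashboard body was stored as intended, you can fetch the definition back:

aws cloudwatch get-dashboard \
  --dashboard-name deepgram-sagemaker \
  --region YOUR_AWS_REGION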