Deploy with Terraform

Use Terraform to deploy Deepgram on Amazon SageMaker from an AWS Marketplace Model Package subscription.

This guide provides a complete Terraform configuration for deploying Deepgram on Amazon SageMaker. The configuration creates an IAM execution role, a SageMaker Model from your AWS Marketplace subscription, an Endpoint Configuration, and a live Endpoint. Optional resources, enabled with enable_autoscaling = true, add auto-scaling. The same configuration deploys either a real-time endpoint (the default) or, with enable_async_inference = true, an asynchronous endpoint that processes large pre-recorded files from S3 and can scale to zero.

Before running Terraform, you must subscribe to a Deepgram product on the AWS Marketplace and note the Model Package ARN. See Subscribe to Deepgram Products for instructions.

Prerequisites

  • Terraform 1.5 or later
  • AWS credentials configured for the target account (via environment variables, shared credentials file, or an IAM role)
  • An active AWS Marketplace subscription to a Deepgram SageMaker product
  • The Model Package ARN for the subscribed product (found in the SageMaker console under Marketplace Model Packages > AWS Marketplace Subscriptions)

Project layout

deepgram-sagemaker-terraform/
├── main.tf # Provider and resource definitions
├── variables.tf # Input variables
├── outputs.tf # Endpoint name, ARN, and status outputs
└── terraform.tfvars # Your variable values (do not commit secrets)

Variables

Create variables.tf with the input variables the configuration needs. The only required value is the Model Package ARN from your Marketplace subscription.

variables.tf
variable "aws_region" {
  description = "AWS region where the SageMaker Endpoint will be deployed."
  type        = string
  default     = "us-east-1"
}

variable "model_package_arn" {
  description = "ARN of the Deepgram Model Package from AWS Marketplace."
  type        = string
}

variable "model_name" {
  description = "Name for the SageMaker Model resource."
  type        = string
  default     = "deepgram-stt"
}

variable "endpoint_name" {
  description = "Name for the SageMaker Endpoint."
  type        = string
  default     = "deepgram-stt-endpoint"
}

variable "instance_type" {
  description = "SageMaker instance type for the endpoint."
  type        = string
  default     = "ml.g5.2xlarge"
}

variable "initial_instance_count" {
  description = "Number of instances to launch at endpoint creation."
  type        = number
  default     = 1
}

variable "variant_name" {
  description = "Name of the production variant."
  type        = string
  default     = "AllTraffic"
}

variable "deepgram_engine_env" {
  description = "Map of DEEPGRAM_ENGINE_* environment variables for TOML overrides."
  type        = map(string)
  default     = {}
}

variable "deepgram_api_env" {
  description = "Map of DEEPGRAM_API_* environment variables for TOML overrides."
  type        = map(string)
  default     = {}
}

variable "enable_autoscaling" {
  description = "Enable auto-scaling for the endpoint."
  type        = bool
  default     = false
}

variable "autoscaling_min_capacity" {
  description = "Minimum instance count for auto-scaling."
  type        = number
  default     = 1
}

variable "autoscaling_max_capacity" {
  description = "Maximum instance count for auto-scaling."
  type        = number
  default     = 4
}

variable "autoscaling_target_value" {
  description = "Target concurrent requests per instance for the scaling policy."
  type        = number
  default     = 5.0
}

variable "enable_async_inference" {
  description = "Deploy an asynchronous endpoint (queued, S3 in/out) instead of a real-time endpoint. Async endpoints accept only asynchronous invocations."
  type        = bool
  default     = false
}

variable "async_s3_output_path" {
  description = "S3 URI for async transcription output, e.g. s3://my-bucket/output/. Required when enable_async_inference = true."
  type        = string
  default     = ""

  validation {
    condition     = var.async_s3_output_path == "" || can(regex("^s3://", var.async_s3_output_path))
    error_message = "async_s3_output_path must be an s3:// URI."
  }
}

variable "async_s3_failure_path" {
  description = "Optional S3 URI for async failure output, e.g. s3://my-bucket/failures/."
  type        = string
  default     = ""
}

In async mode, autoscaling_target_value is interpreted as the target ApproximateBacklogSizePerInstance (queued requests per instance), and autoscaling_min_capacity may be set to 0 to enable scale-to-zero. In real-time mode, autoscaling_target_value remains the target number of concurrent requests per instance, and autoscaling_min_capacity must be at least 1.
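The variables above do not enforce the mode-specific rule on autoscaling_min_capacity. One way to surface a mistake early is a check block, which Terraform supports from version 1.5. This is a minimal sketch, not part of the original configuration: the block name is arbitrary, and a failed check is reported as a warning during plan and apply rather than stopping the run.

```hcl
# Hypothetical guard: warn when the capacity settings do not match the
# endpoint mode. The check name is illustrative only.
check "autoscaling_capacity_matches_mode" {
  assert {
    # Scale-to-zero (min capacity 0) only applies to async endpoints.
    condition     = var.enable_async_inference || var.autoscaling_min_capacity >= 1
    error_message = "autoscaling_min_capacity = 0 is only supported when enable_async_inference = true."
  }

  assert {
    # The minimum can never exceed the maximum.
    condition     = var.autoscaling_min_capacity <= var.autoscaling_max_capacity
    error_message = "autoscaling_min_capacity must not exceed autoscaling_max_capacity."
  }
}
```

If you prefer a hard failure, the same conditions could instead be expressed as a lifecycle precondition on the auto-scaling target resource.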

Main configuration

Create main.tf with the provider, IAM role, and SageMaker resources. The configuration uses the Model Package ARN from your AWS Marketplace subscription to create the model without referencing a container image directly.

main.tf
###############################################################################
# Provider
###############################################################################

terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

###############################################################################
# IAM Role: SageMaker Execution
###############################################################################

data "aws_iam_policy_document" "sagemaker_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "sagemaker_execution" {
  name               = "${var.model_name}-execution-role"
  assume_role_policy = data.aws_iam_policy_document.sagemaker_assume_role.json
}

resource "aws_iam_role_policy_attachment" "sagemaker_full_access" {
  role       = aws_iam_role.sagemaker_execution.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

###############################################################################
# Async S3 access (only when enable_async_inference = true)
###############################################################################

locals {
  async_buckets = var.enable_async_inference ? toset(compact([
    try(split("/", replace(var.async_s3_output_path, "s3://", ""))[0], ""),
    try(split("/", replace(var.async_s3_failure_path, "s3://", ""))[0], ""),
  ])) : toset([])
}

resource "aws_iam_role_policy" "async_s3" {
  count = var.enable_async_inference ? 1 : 0

  name = "${var.model_name}-async-s3"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = [for b in local.async_buckets : "arn:aws:s3:::${b}/*"]
      },
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = [for b in local.async_buckets : "arn:aws:s3:::${b}"]
      },
    ]
  })
}

###############################################################################
# Merge Deepgram environment variables
###############################################################################

locals {
  deepgram_env = merge(
    { for k, v in var.deepgram_engine_env : "DEEPGRAM_ENGINE_${k}" => v },
    { for k, v in var.deepgram_api_env : "DEEPGRAM_API_${k}" => v },
  )
}

###############################################################################
# SageMaker Model: from AWS Marketplace Model Package
###############################################################################

resource "aws_sagemaker_model" "deepgram" {
  name                     = var.model_name
  execution_role_arn       = aws_iam_role.sagemaker_execution.arn
  enable_network_isolation = true

  primary_container {
    model_package_name = var.model_package_arn
    environment        = local.deepgram_env
  }
}

###############################################################################
# SageMaker Endpoint Configuration
###############################################################################

resource "aws_sagemaker_endpoint_configuration" "deepgram" {
  name = "${var.endpoint_name}-config"

  production_variants {
    variant_name           = var.variant_name
    model_name             = aws_sagemaker_model.deepgram.name
    initial_instance_count = var.initial_instance_count
    instance_type          = var.instance_type
  }

  dynamic "async_inference_config" {
    for_each = var.enable_async_inference ? [1] : []
    content {
      output_config {
        s3_output_path  = var.async_s3_output_path
        s3_failure_path = var.async_s3_failure_path != "" ? var.async_s3_failure_path : null
      }
    }
  }

  lifecycle {
    precondition {
      condition     = !var.enable_async_inference || var.async_s3_output_path != ""
      error_message = "async_s3_output_path is required when enable_async_inference = true."
    }
  }
}

###############################################################################
# SageMaker Endpoint
###############################################################################

resource "aws_sagemaker_endpoint" "deepgram" {
  name                 = var.endpoint_name
  endpoint_config_name = aws_sagemaker_endpoint_configuration.deepgram.name
}

###############################################################################
# Auto-Scaling (optional)
###############################################################################

resource "aws_appautoscaling_target" "sagemaker" {
  count = var.enable_autoscaling ? 1 : 0

  max_capacity       = var.autoscaling_max_capacity
  min_capacity       = var.autoscaling_min_capacity
  resource_id        = "endpoint/${aws_sagemaker_endpoint.deepgram.name}/variant/${var.variant_name}"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "sagemaker" {
  count = var.enable_autoscaling ? 1 : 0

  name               = "${var.endpoint_name}-concurrency-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sagemaker[0].resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker[0].scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker[0].service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = var.autoscaling_target_value

    # Real-time: scale on concurrent requests per model.
    dynamic "predefined_metric_specification" {
      for_each = var.enable_async_inference ? [] : [1]
      content {
        predefined_metric_type = "SageMakerVariantConcurrentRequestsPerModelHighResolution"
      }
    }

    # Async: scale on queue depth per instance.
    dynamic "customized_metric_specification" {
      for_each = var.enable_async_inference ? [1] : []
      content {
        metric_name = "ApproximateBacklogSizePerInstance"
        namespace   = "AWS/SageMaker"
        statistic   = "Average"
        dimensions {
          name  = "EndpointName"
          value = aws_sagemaker_endpoint.deepgram.name
        }
      }
    }

    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

###############################################################################
# Async scale-from-zero (only when async + autoscaling are enabled)
###############################################################################

resource "aws_appautoscaling_policy" "async_scale_from_zero" {
  count = var.enable_async_inference && var.enable_autoscaling ? 1 : 0

  name               = "${var.endpoint_name}-scale-from-zero"
  policy_type        = "StepScaling"
  resource_id        = aws_appautoscaling_target.sagemaker[0].resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker[0].scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker[0].service_namespace

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    metric_aggregation_type = "Average"
    cooldown                = 300

    step_adjustment {
      metric_interval_lower_bound = 0
      scaling_adjustment          = 1
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "async_has_backlog" {
  count = var.enable_async_inference && var.enable_autoscaling ? 1 : 0

  alarm_name          = "${var.endpoint_name}-has-backlog-without-capacity"
  namespace           = "AWS/SageMaker"
  metric_name         = "HasBacklogWithoutCapacity"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 2
  datapoints_to_alarm = 2
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "missing"

  dimensions = {
    EndpointName = aws_sagemaker_endpoint.deepgram.name
  }

  alarm_actions = [aws_appautoscaling_policy.async_scale_from_zero[0].arn]
}

Outputs

Create outputs.tf to surface the endpoint details after terraform apply completes.

outputs.tf
output "endpoint_name" {
  description = "Name of the deployed SageMaker Endpoint."
  value       = aws_sagemaker_endpoint.deepgram.name
}

output "endpoint_arn" {
  description = "ARN of the deployed SageMaker Endpoint."
  value       = aws_sagemaker_endpoint.deepgram.arn
}

output "model_name" {
  description = "Name of the SageMaker Model."
  value       = aws_sagemaker_model.deepgram.name
}

output "execution_role_arn" {
  description = "ARN of the IAM execution role."
  value       = aws_iam_role.sagemaker_execution.arn
}

output "async_s3_output_path" {
  description = "S3 location where async transcription results are written (async mode only)."
  value       = var.enable_async_inference ? var.async_s3_output_path : null
}

Example variable values

Create a terraform.tfvars file with your specific values. Replace the model_package_arn with the ARN from your AWS Marketplace subscription.

terraform.tfvars
aws_region        = "us-east-1"
model_package_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/deepgram-stt-nova-3/1"
model_name        = "deepgram-streaming-stt"
endpoint_name     = "my-deepgram-stt"
instance_type     = "ml.g5.2xlarge"

# Optional: Deepgram configuration overrides
deepgram_engine_env = {
  "01" = "max_active_requests=120"
}
deepgram_api_env = {
  "01" = "features.listen_v2=true"
}

# Optional: Enable auto-scaling
enable_autoscaling       = true
autoscaling_min_capacity = 1
autoscaling_max_capacity = 4
autoscaling_target_value = 5.0

# Optional: deploy an asynchronous endpoint instead of real-time
# enable_async_inference   = true
# async_s3_output_path     = "s3://my-deepgram-async/output/"
# async_s3_failure_path    = "s3://my-deepgram-async/failures/"
# enable_autoscaling       = true
# autoscaling_min_capacity = 0   # async supports scale-to-zero
# autoscaling_target_value = 5.0 # target ApproximateBacklogSizePerInstance

Do not commit terraform.tfvars to version control if it contains sensitive values. Add it to .gitignore or use environment variables instead.

Deploy

1. Initialize the Terraform working directory

$ terraform init

2. Preview the resources Terraform will create

$ terraform plan

Verify the plan shows the expected resources: an IAM role with its policy attachment, a SageMaker Model, an Endpoint Configuration, and an Endpoint, plus the auto-scaling and async resources if you enabled them.

3. Apply the configuration

$ terraform apply

Terraform creates the resources and waits for the SageMaker Endpoint to reach InService status. This typically takes several minutes.

4. Verify the endpoint

Confirm the endpoint is running:

$ aws sagemaker describe-endpoint \
    --endpoint-name "$(terraform output -raw endpoint_name)" \
    --region "$(terraform output -raw aws_region 2>/dev/null || echo "us-east-1")" \
    --query "EndpointStatus"

The output should be "InService". Because outputs.tf does not define an aws_region output, the --region argument falls back to us-east-1; replace that value if you deployed to a different region.
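The describe-endpoint command reads an aws_region output, which the outputs.tf shown earlier does not define, so the command always uses its us-east-1 fallback. A small sketch of an extra output that would let the command pick up the deployed region; this block is an addition for illustration and is not part of the original outputs.tf:

```hcl
# Hypothetical addition to outputs.tf: expose the region so that
# `terraform output -raw aws_region` returns a value.
output "aws_region" {
  description = "AWS region where the SageMaker Endpoint is deployed."
  value       = var.aws_region
}
```

After adding the output, run terraform apply once more so the new value is written to state before you query it.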

Validate the endpoint

After the endpoint reaches InService, run a test inference to confirm it returns results. See Validate a Deepgram SageMaker Endpoint for the full testing guide using the dg-sagemaker test clients.

Customize the deployment

Instance types

Choose an instance type based on the Deepgram product you are deploying. GPU-accelerated instances are required.

Product | Recommended instance type | Notes
--- | --- | ---
Speech-to-Text (Nova-3, Flux) | ml.g5.2xlarge | Single NVIDIA A10G GPU (24 GB GPU memory), 32 GB instance memory
Text-to-Speech (Aura) | ml.g5.12xlarge | 4 NVIDIA A10G GPUs (TTS requires 2+ GPUs)

For a full list of compatible instances, see the Deployment Environments hardware specifications.
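Since GPU-accelerated instances are required, you may want Terraform to reject an obviously unsuitable instance type at plan time. The following is a rough sketch of the instance_type variable with a validation block added. It assumes that GPU-backed SageMaker instance families start with ml.g or ml.p; that pattern is a heuristic, not a list of instances Deepgram supports, so treat the hardware specifications as the authority.

```hcl
variable "instance_type" {
  description = "SageMaker instance type for the endpoint."
  type        = string
  default     = "ml.g5.2xlarge"

  validation {
    # Assumption: GPU instance families are named ml.g* or ml.p*
    # (for example ml.g5.2xlarge). This does not check Deepgram compatibility.
    condition     = can(regex("^ml\\.(g|p)[0-9]", var.instance_type))
    error_message = "instance_type must be a GPU-accelerated SageMaker instance type, for example ml.g5.2xlarge."
  }
}
```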

Environment variable overrides

Pass Deepgram configuration overrides through the deepgram_engine_env and deepgram_api_env variables. Each map key becomes the suffix of the environment variable name (for example, the key "01" in deepgram_engine_env produces DEEPGRAM_ENGINE_01), and the value is the TOML expression. See Configure Amazon SageMaker Deployments for the full reference.

deepgram_engine_env = {
  "01" = "flux.max_streams=25"
  "02" = "chunking.streaming.step=0.5"
}
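To see exactly which environment variables the container receives, it can help to trace the merge in main.tf by hand. The comment below shows what local.deepgram_env evaluates to for the two engine overrides above and no API overrides. The output block is a hypothetical debugging aid, not part of the original configuration, and can be removed once you have confirmed the values.

```hcl
# With the deepgram_engine_env value shown above and deepgram_api_env = {},
# local.deepgram_env evaluates to:
#
#   {
#     DEEPGRAM_ENGINE_01 = "flux.max_streams=25"
#     DEEPGRAM_ENGINE_02 = "chunking.streaming.step=0.5"
#   }

# Hypothetical debugging output: prints the merged map after plan/apply.
output "deepgram_env" {
  description = "Environment variables passed to the Deepgram container."
  value       = local.deepgram_env
}
```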

VPC configuration

To deploy the endpoint inside a VPC, add a vpc_config block to the aws_sagemaker_model resource:

VPC configuration
resource "aws_sagemaker_model" "deepgram" {
  name                     = var.model_name
  execution_role_arn       = aws_iam_role.sagemaker_execution.arn
  enable_network_isolation = true

  primary_container {
    model_package_name = var.model_package_arn
    environment        = local.deepgram_env
  }

  vpc_config {
    subnets            = ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
    security_group_ids = ["sg-0123456789abcdef0"]
  }
}
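If you want to keep the subnet and security group IDs out of main.tf, one option is to pass them as variables and attach the vpc_config block only when subnets are supplied. This is a sketch under that assumption: the variable names vpc_subnet_ids and vpc_security_group_ids are hypothetical and do not appear in the variables.tf above.

```hcl
# Hypothetical variables; leave both empty to deploy without a VPC.
variable "vpc_subnet_ids" {
  description = "Subnet IDs for the SageMaker Model. Empty disables vpc_config."
  type        = list(string)
  default     = []
}

variable "vpc_security_group_ids" {
  description = "Security group IDs used with vpc_subnet_ids."
  type        = list(string)
  default     = []
}

resource "aws_sagemaker_model" "deepgram" {
  name                     = var.model_name
  execution_role_arn       = aws_iam_role.sagemaker_execution.arn
  enable_network_isolation = true

  primary_container {
    model_package_name = var.model_package_arn
    environment        = local.deepgram_env
  }

  # Emit vpc_config only when at least one subnet is provided.
  dynamic "vpc_config" {
    for_each = length(var.vpc_subnet_ids) > 0 ? [1] : []
    content {
      subnets            = var.vpc_subnet_ids
      security_group_ids = var.vpc_security_group_ids
    }
  }
}
```

This mirrors the dynamic block pattern the configuration already uses for async_inference_config, so the same main.tf can serve both VPC and non-VPC deployments.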

Asynchronous endpoints

By default this configuration deploys a real-time endpoint for streaming and synchronous transcription. To deploy an asynchronous endpoint instead, set enable_async_inference = true and provide an async_s3_output_path. Asynchronous endpoints are intended for large pre-recorded files (up to 1 GB), queued processing, and scale-to-zero.

Asynchronous inference is a distinct endpoint mode: an async endpoint accepts only asynchronous invocations (InvokeEndpointAsync with S3 input/output) and cannot serve streaming or synchronous requests. Switching enable_async_inference replaces the endpoint configuration and endpoint.

When async is enabled, the configuration also:

  • grants the execution role s3:GetObject, s3:PutObject, and s3:ListBucket on the output and failure buckets;
  • switches the autoscaling target metric to ApproximateBacklogSizePerInstance and allows autoscaling_min_capacity = 0 for scale-to-zero;
  • adds a scale-from-zero policy so the endpoint wakes on the first queued request instead of waiting for the backlog to exceed the target value.
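The S3 permissions listed above cover the output and failure buckets only. If the audio you submit lives in a different bucket, the execution role may also need read access to it, depending on what the attached AmazonSageMakerFullAccess policy already allows for that bucket. A minimal sketch of such a grant follows; the variable async_s3_input_bucket and the resource name async_s3_input are hypothetical and not part of the original configuration.

```hcl
# Hypothetical variable: bucket that holds audio submitted for async inference.
variable "async_s3_input_bucket" {
  description = "Name of the S3 bucket containing async input audio (optional)."
  type        = string
  default     = ""
}

# Grant read-only access to the input bucket when async mode is on
# and an input bucket name has been provided.
resource "aws_iam_role_policy" "async_s3_input" {
  count = var.enable_async_inference && var.async_s3_input_bucket != "" ? 1 : 0

  name = "${var.model_name}-async-s3-input"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = ["arn:aws:s3:::${var.async_s3_input_bucket}/*"]
      },
    ]
  })
}
```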

For invocation details, see Deploy Deepgram on Amazon SageMaker. For autoscaling details, see Auto-Scaling Asynchronous Endpoints.

Tear down

To delete all resources created by this configuration:

$terraform destroy

This removes the SageMaker Endpoint, Endpoint Configuration, Model, auto-scaling resources (if enabled), and the IAM execution role. You are no longer billed for SageMaker compute after the endpoint is deleted. Your AWS Marketplace subscription remains active.