Configure Modal Resources

Configure hardware, autoscaling, and region selection for a Deepgram Modal deployment.

# modal_deepgram/app.py
@app.cls(
    image=engine_base_image.env({"DEPLOY_LABEL": DEPLOY_LABEL}),
    volumes={
        MODELS_PATH: models_vol,
        CACHE_PATH: cache_vol,
    },
    gpu="L4",
    secrets=[modal.Secret.from_name("deepgram")],
    timeout=30 * MINUTES,
    cpu=4,
    memory=32 * 1024,  # MB
    min_containers=1,
    region="us-west",
)
@modal.concurrent(target_inputs=64)
@modal.experimental.http_server(port=API_PORT, proxy_regions=["us-west"])
class DeepgramServer(DeepgramServerBase):
    ...

Configure autoscaling

Modal automatically scales the number of Deepgram containers up and down based on per-container concurrency.

See their Scaling Out guide and Input Conccurrency guide for the different parameters and their functionality. Note that not all available parameters are surfaced in app.py.

Deepgram recommends keeping at least one container active to ensure that lulls in traffic don’t lead to queuing or 503s when scaling back up from zero. In Modal, set min_containers = 1.

Web endpoints served with the http_server only accept a value for target_inputs and not max_inputs. This number should be set slightly below the active request limit in your engine.toml file (see Auto-Scaling: Enforcing Limits).

Configure regions

To optimize network latency, you will likely want to set the PROXY_REGION AND SERVER_REGION and route traffic from clients in those regions to that deployment.

PROXY_REGION specifies the location of the Modal proxy that routes requests to containers. It can take one of four values: us-east, us-west, eu-west, ap-south.

SERVER_REGION specifies which region(s) the server containers can reside in. See the Modal Region Selection doc for more information.

# modal_deepgram/app.py
@app.cls(
    image=engine_base_image.env({"DEPLOY_LABEL": DEPLOY_LABEL}),
    volumes={
        MODELS_PATH: models_vol,
        CACHE_PATH: cache_vol,
    },
    gpu="L4",
    secrets=[modal.Secret.from_name("deepgram")],
    timeout=30 * MINUTES,
    cpu=4,
    memory=32 * 1024,  # MB
    min_containers=1,
    region="us-west",
)
@modal.concurrent(target_inputs=64)
@modal.experimental.http_server(port=API_PORT, proxy_regions=["us-west"])
class DeepgramServer(DeepgramServerBase):
    ...

Configure autoscaling

Modal automatically scales the number of Deepgram containers up and down based on per-container concurrency.

See their Scaling Out guide and Input Conccurrency guide for the different parameters and their functionality. Note that not all available parameters are surfaced in app.py.

Deepgram recommends keeping at least one container active to ensure that lulls in traffic don’t lead to queuing or 503s when scaling back up from zero. In Modal, set min_containers = 1.

Configure regions

To optimize network latency, you will likely want to set the PROXY_REGION AND SERVER_REGION and route traffic from clients in those regions to that deployment.

PROXY_REGION specifies the location of the Modal proxy that routes requests to containers. It can take one of four values: us-east, us-west, eu-west, ap-south.

SERVER_REGION specifies which region(s) the server containers can reside in. See the Modal Region Selection doc for more information.

1	# modal_deepgram/app.py
2
3	@app.cls(
4	image=engine_base_image.env({"DEPLOY_LABEL": DEPLOY_LABEL}),
5	volumes={
6	MODELS_PATH: models_vol,
7	CACHE_PATH: cache_vol,
8	},
9	gpu="L4",
10	secrets=[modal.Secret.from_name("deepgram")],
11	timeout=30 * MINUTES,
12	cpu=4,
13	memory=32 * 1024, # MB
14	min_containers=1,
15	region="us-west",
16	)
17	@modal.concurrent(target_inputs=64)
18	@modal.experimental.http_server(port=API_PORT, proxy_regions=["us-west"])
19	class DeepgramServer(DeepgramServerBase):
20	...