Compute and Autoscaling
With Modal, hardware resources and autoscaling configuration are specified as code. To change any of the parameters in this section, edit the values in app.py and redeploy.
When you clone the repo, the values are configured for an STT deployment in us-west.
Configure hardware
Hardware is set as literals at the top of modal_deepgram/app.py, sized for STT (Nova family). Edit them per workload, then redeploy.
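For illustration, the literals might look like the following. The names and values here are hypothetical; check app.py for the real ones, and size them per Deepgram's hardware minimums for your model family.

```python
# Hypothetical hardware literals at the top of modal_deepgram/app.py.
# Actual names and values in the repo may differ.
GPU_CONFIG = "L4"    # single GPU is typical for STT (Nova family)
CPU_CORES = 4.0      # CPU request per container
MEMORY_MB = 16384    # memory request per container, in MiB
```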
For TTS (Aura-2), use two GPUs: one for the generative model and one for the vocoder.
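Modal expresses a multi-GPU request as a "TYPE:COUNT" string. A minimal sketch, where the GPU type "L4" is an assumption:

```python
# Request two GPUs on one container for TTS (Aura-2): one serves the
# generative model, the other the vocoder. The GPU type is illustrative.
GPU_CONFIG = "L4:2"
```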
For Deepgram’s hardware minimums, see Deployment Environments → Engine. For Modal’s GPU options, see Modal: GPU.
Configure autoscaling
Modal automatically scales the number of Deepgram containers up and down based on per-container concurrency.
See Modal's Scaling Out guide and Input Concurrency guide for the available parameters and their behavior. Note that not all available parameters are surfaced in app.py.
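As a sketch of the knobs involved (parameter names are from Modal's SDK; the values are illustrative, and which of these app.py actually surfaces may differ):

```python
import modal

app = modal.App("deepgram-stt")  # hypothetical app name

@app.function(
    gpu="L4",               # hardware, as configured above
    max_containers=10,      # hard ceiling on the number of replicas
    buffer_containers=1,    # idle headroom provisioned ahead of demand
    scaledown_window=300,   # seconds a container may idle before scale-down
)
def transcribe():
    ...
```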
Notes
Deepgram recommends keeping at least one container active so that lulls in traffic don't lead to queuing or 503s when scaling back up from zero. In Modal, set min_containers = 1.
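This note translates to something like the following sketch; only min_containers is from the text above, the rest is illustrative:

```python
import modal

app = modal.App("deepgram-stt")  # hypothetical app name

@app.function(
    gpu="L4",
    min_containers=1,  # keep one container warm so lulls never scale to zero
)
def transcribe():
    ...
```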
Web endpoints served with the http_server only accept a value for target_inputs and not max_inputs.
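A hedged sketch of what this looks like, assuming the repo's http_server function is a Modal web server. Modal's modal.concurrent decorator takes both limits; per the note above, only target_inputs drives autoscaling for the web endpoint, and the values here are illustrative:

```python
import modal

app = modal.App("deepgram-stt")  # hypothetical app name

@app.function(gpu="L4")
@modal.concurrent(max_inputs=8, target_inputs=4)  # autoscaler aims for 4 in-flight requests per container
@modal.web_server(port=8080)
def http_server():
    ...
```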
Region Selection
To optimize network latency, most Deepgram deployments will set the PROXY_REGION and SERVER_REGION and route traffic from clients in those regions to that deployment.
PROXY_REGION specifies the location of the Modal proxy that routes requests to containers. It can take one of four values: us-east, us-west, eu-west, ap-south.
SERVER_REGION specifies which region(s) the server containers can reside in. See the Modal Region Selection doc for more information.
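A sketch of pinning both knobs, assuming they are plain literals passed into Modal's region parameter; the variable names come from this section, but the exact wiring in app.py may differ:

```python
import modal

PROXY_REGION = "us-west"   # where Modal's proxy terminates client connections
SERVER_REGION = "us-west"  # where server containers may be placed

app = modal.App("deepgram-stt")  # hypothetical app name

@app.function(gpu="L4", region=SERVER_REGION)
def transcribe():
    ...
```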