Auto-Scaling
Auto-scaling is a complex endeavor. It is highly dependent on the application deployed, the use case of that application, and business needs and constraints. As such, this guide does not attempt to address all scenarios. A basic understanding of containerized infrastructure is assumed, and specific implementation details are out of scope save for some selected example configurations. References to other materials may also be provided where appropriate.
Please reach out to your Deepgram Account Representative as needed for recommendations and support in scaling your infrastructure associated with your Deepgram self-hosted deployment.
Context
This guide and the following recommendations are written in the context of a Kubernetes deployment. The deepgram-self-hosted Helm chart implements many of these recommendations for you. See the Autoscaling section of the chart README for more details.
With that said, the strategies presented here are transferable to other container orchestration tools, although some implementation details may need to be adjusted for your environment, tooling, or use case. It may therefore be helpful to understand the following Kubernetes-specific terms:
pod
- a single container for the sake of this guide. Ignoring sidecars and DaemonSets, this is the smallest object of control in the Kubernetes environment
node
- a worker running pods, analogous to a host instance running containers. This may be physical or virtual
cluster
- a group of worker nodes (hosts)
metrics
- application performance data provided in Prometheus exposition format (OpenMetrics), used to monitor and make decisions about scaling
metrics server
- service or tooling used to consume metrics, such as Prometheus, CloudWatch, etc.
Scaling Metrics
In order to make auto-scaling decisions you will require some insight into the applications running in your environment. The Metrics Guide describes available metrics and status endpoints for Deepgram containers.
In particular, the Engine metrics should be utilized to monitor and scale your self-hosted deployment. The list of available Engine metrics can be found under the Available Metrics section of the Metrics Guide.
Please note that some metrics will not be exposed until a request of the given type has been submitted. For example, the engine_requests_total counters for batch and streaming speech-to-text or text-to-speech will not appear until you make a request of the corresponding type.
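Because a series may be absent until the first request of its type arrives, anything that reads these metrics should treat a missing series as zero rather than as an error. The following is a minimal Python sketch of that idea. The sample payload and the helper name `sum_by_kind` are illustrative assumptions and not part of any Deepgram tooling; in practice Prometheus or your metrics collector would do the parsing.

```python
import re
from collections import defaultdict

# Extracts the value of the `kind` label from a label set.
KIND_RE = re.compile(r'(?:^|,)\s*kind="([^"]*)"')


def sum_by_kind(exposition_text, metric_name, kinds=("batch", "stream", "tts")):
    """Sum a labelled metric per `kind`; kinds with no series yet count as 0."""
    sample_re = re.compile(r"^" + re.escape(metric_name) + r"\{([^}]*)\}\s+(\S+)")
    totals = defaultdict(float)
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        sample = sample_re.match(line)
        if sample is None:
            continue
        kind = KIND_RE.search(sample.group(1))
        if kind is not None:
            totals[kind.group(1)] += float(sample.group(2))
    # Report every expected kind, defaulting to 0.0 when no series exists yet.
    return {kind: totals.get(kind, 0.0) for kind in kinds}


# Hypothetical payload: only batch requests have been made so far.
SAMPLE = """\
# TYPE engine_requests_total counter
engine_requests_total{kind="batch",response_status="2xx"} 7
engine_requests_total{kind="batch",response_status="4xx"} 1
"""

print(sum_by_kind(SAMPLE, "engine_requests_total"))
# {'batch': 8.0, 'stream': 0.0, 'tts': 0.0}
```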
Consuming Metrics
Out of the box, Kubernetes does not provide any metrics for use with auto-scaling. It is incumbent upon the operator or engineer deploying Kubernetes to also deploy something to capture and serve metrics pertinent to their needs. Kubernetes provides an optional metrics server that is compatible with Prometheus-style metrics, which may be deployed for this purpose. If deployed, it will automatically monitor your cluster for standard resource metrics such as CPU and memory.
# Via `helm`
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server
# Via `kubectl`
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Kubernetes can also be configured with a Prometheus adapter to make custom metrics or external metrics from an external collector available through the Kubernetes API. This is useful for utilizing the Deepgram Engine metrics in your auto-scaling decision-making.
For cloud-deployed environments, your service provider may include its own metrics server, such as Amazon CloudWatch or Google Cloud Monitoring (formerly Stackdriver). Your organization may also already employ its own metrics infrastructure, such as Datadog or self-managed Prometheus. You will need to ensure the Deepgram metrics are monitored by your specific metrics collectors and also made available to your auto-scaling control plane via a Prometheus adapter.
Benchmarks
Many sections of this guide rely on reliable performance benchmarks for various combinations of hardware, audio or text input, and features used. Deepgram has pre-computed many of the relevant benchmarks, and can assist you in additional benchmarking efforts for your specific use case. Please contact your Deepgram Account Representative for details beyond this guide.
Using Metrics
The metrics exposed by Deepgram provide insight into the Engine's performance and are best used in conjunction with performance benchmarks for your specific use case, as detailed above. Engine instances run on GPUs and are designed to optimize their use of GPU resources, so GPU usage will generally remain consistent, and traditional metrics such as CPU and memory are not primary indicators of performance.
Batch Speech-to-Text
Deepgram is highly efficient at processing pre-recorded batch audio and can generate transcripts for an hour of audio in under 30 seconds. For extensive backlogs of audio you may still need to scale beyond a single instance of Engine to expedite the process, or to meet increasing customer needs. The primary metric of concern for scale considerations is the active requests: engine_active_requests{kind="batch"}.
The Deepgram Engine will attempt to queue all incoming requests. Therefore, it is critical to establish a baseline for acceptable performance (the time to return transcripts) on your hardware via benchmarking. When setting this baseline, keep in mind that file sizes may vary, and consider whether or not you require an immediate response.
Once the baseline has been established, the active requests metric can be used to scale before additional requests impact transcription times in your batch environment.
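As a rough illustration of how such a baseline can drive a scaling decision, the sketch below computes a replica count from the total number of active batch requests. The per-Engine target of 10 requests and the replica bounds are placeholder assumptions; your own benchmarks should supply the real values. The arithmetic is the same ceiling division the Kubernetes Horizontal Pod Autoscaler applies to an average-value target.

```python
import math


def desired_engine_replicas(total_active_requests, target_per_engine,
                            min_replicas=1, max_replicas=10):
    """Replicas needed so each Engine stays at or below its benchmarked target."""
    if target_per_engine <= 0:
        raise ValueError("target_per_engine must be positive")
    wanted = math.ceil(total_active_requests / target_per_engine)
    # Clamp to the bounds the operator is willing to run.
    return max(min_replicas, min(max_replicas, wanted))


# 45 active batch requests with a hypothetical baseline of 10 per Engine.
print(desired_engine_replicas(45, 10))  # 5
```

Scaling on a target somewhat below the benchmarked limit leaves headroom for new instances to start before transcription times degrade.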
Streaming Speech-to-Text
Streaming audio is time sensitive for the majority of use cases, and establishing acceptable response latencies is critical to determining appropriate scaling points for your application. The specific combination of model and feature set used will determine how many streams your instances can support with acceptable latencies for your use.
The Measuring Streaming Latency guide can help you as you benchmark your system to establish the number of streams you can support at acceptable latencies. You will want to make sure that all simultaneous audio streams are able to maintain an average response time of 400ms or less. The more streams you open up for transcription, the higher the request latency will climb.
Performance can be negatively impacted by a variety of other factors, including the features you are using and the size of the chunks of audio you are sending. Consult your Deepgram Account Representative to analyze your use case, expected throughput, and risk tolerance.
Once you determine the number of streams you can serve at acceptable latency, you can use this number to monitor capacity using the engine_active_requests{kind="stream"} metric for streaming requests.
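The sketch below shows one way to turn benchmark results into a per-instance stream capacity and then compare the live metric against it. The benchmark figures are invented for illustration, and both helper names are hypothetical. The 400 ms default is the average response time target mentioned above.

```python
def max_streams_within_latency(benchmark, max_latency_ms=400.0):
    """Largest benchmarked stream count whose average latency stays within budget.

    `benchmark` is an iterable of (concurrent_streams, avg_latency_ms) pairs.
    The scan stops at the first measurement over budget, which is the
    conservative choice given that latency climbs as streams are added.
    """
    best = 0
    for streams, latency_ms in sorted(benchmark):
        if latency_ms > max_latency_ms:
            break
        best = streams
    return best


def stream_utilization(active_streams, capacity):
    """Fraction of the benchmarked capacity currently in use."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return active_streams / capacity


# Hypothetical benchmark results for one Engine instance.
results = [(10, 180.0), (20, 240.0), (30, 330.0), (40, 410.0), (50, 520.0)]
capacity = max_streams_within_latency(results)
print(capacity)  # 30
print(stream_utilization(24, capacity))
```

A scaling rule would then trigger when utilization, computed from engine_active_requests{kind="stream"}, crosses a threshold you choose based on your risk tolerance.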
Batch Text-to-Speech
The Self-Hosted Text to Speech guide details the unique latency and availability requirements specific to text-to-speech requests that are worth optimizing for. You can monitor the number of TTS requests with engine_active_requests{kind="tts"}.
Summary
For batch speech-to-text, monitor:
- engine_active_requests{kind="batch"}
For streaming speech-to-text, monitor:
- engine_active_requests{kind="stream"}
- engine_estimated_stream_capacity (optional)
For batch text-to-speech, monitor:
- engine_active_requests{kind="tts"}
Additionally, success & error counts are available under the status buckets:
engine_requests_total{kind="{batch,stream,tts}",response_status="{1xx,2xx,3xx,4xx,5xx}"}
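Because engine_requests_total is a counter, it is usually more informative to look at how it changes between two scrapes than at its absolute value. The following sketch computes the share of 5xx responses over an interval from two snapshots keyed by response_status. The snapshot values and the helper name are assumptions for illustration, and counter resets (for example after a container restart) are deliberately not handled.

```python
def error_ratio(before, after, error_classes=("5xx",)):
    """Share of requests in `error_classes` between two counter snapshots.

    `before` and `after` map a response_status class (e.g. "2xx") to the
    counter value at each scrape. Statuses missing from `before` count as 0.
    """
    deltas = {status: after[status] - before.get(status, 0) for status in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no traffic in the interval
    errors = sum(deltas.get(status, 0) for status in error_classes)
    return errors / total


# Hypothetical snapshots for kind="stream", taken one scrape interval apart.
earlier = {"2xx": 100, "5xx": 2}
later = {"2xx": 190, "5xx": 12}
print(error_ratio(earlier, later))  # 0.1
```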
Deployment Considerations
Recall that self-hosting Deepgram consists of a number of separate components, two of which we are concerned with scaling. These include Engine containers, which handle inference tasks, and API containers, which handle request brokering to the Engine containers, as well as some post-processing coordination.
Each API container can generally support up to 16 Engine containers, with a more typical ratio of 1:4. However, some customers choose to deploy API instances 1:1 with Engine, as the API container is comparatively lightweight.
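To make the ratios concrete, the sketch below derives an API replica count from an Engine replica count. The default of 4 Engines per API container reflects the typical ratio above, and the upper bound of 16 reflects the stated maximum. The function name is hypothetical.

```python
import math


def api_replicas_for(engine_replicas, engines_per_api=4):
    """Number of API containers needed to front a given number of Engines."""
    if not 1 <= engines_per_api <= 16:
        raise ValueError("engines_per_api must be between 1 and 16")
    if engine_replicas <= 0:
        return 0
    return math.ceil(engine_replicas / engines_per_api)


print(api_replicas_for(10))      # 3  (typical 1:4 ratio)
print(api_replicas_for(10, 16))  # 1  (upper bound of 1:16)
print(api_replicas_for(10, 1))   # 10 (1:1 deployment)
```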
GPU Instances
Ideally, all Engine instances should be deployed to NVIDIA GPU-enabled instances. A GPU cannot be shared between Engine instances. The remaining components, such as API and License Proxy containers, can generally be deployed to CPU-only nodes and replicated as desired.
While the Deepgram Engine can make use of multiple NVIDIA GPUs, in most cases, deploying a single Engine container per NVIDIA GPU will be the more efficient deployment strategy. An enterprise-scale deployment should be prepared for horizontal scaling, which is the act of adding node (server) replicas to a deployment without adding resources to those nodes.
CPU-only Instances
Deepgram Engine has been implemented and is optimized to take advantage of NVIDIA GPUs, but it does support handling inference tasks in CPU-only environments. CPU-only instances should only be provisioned in use-cases and scenarios where NVIDIA GPU-accelerated instances are unavailable.
Container Image Cache
Each time a pod is deployed, a container image must be pulled from either a local Container Image Cache or a Container Registry to instantiate the workload. When replicating pods on a single node, the container runtime on the node will generally handle this transparently for the operator by referencing the local Container Image Cache. When deploying new nodes, on the other hand, there are no local copies of the container images in the Container Image Cache, so the images must be pulled from a remote Container Registry.
It is important to consider where the container images are pulled from when scaling out your deployment to new nodes, as this will impact the time required to scale. Some environments, such as Amazon ECS and EKS, Azure AKS, and Google GKE, provide caching mechanisms, either automatic or opt-in, as well as local Container Registries. Others, such as AWS Fargate, have no native caching available.
To reduce the node deployment time we recommend you cache images in a local Container Registry that will be used in a horizontal scaling scenario. This can be done by providing a cache service in your data center if you are running on your own hardware. If you are using a cloud provider such as AWS, Azure, or Google Cloud, you can deploy an in-region cache in your cloud environment.
Scaling Considerations
When scaling Deepgram horizontally, we generally recommend:
- Multiple servers hosted on VPC, DC, or bare metal
- A customer-provided proxy (for example, NGINX, HAProxy, Apache) to handle TLS and request distribution across API instances
- A single GPU per machine
If you want to increase your capacity to handle inference requests, you can add more GPU instances to your deployment and add Deepgram Engine containers to take advantage of the additional compute. Deciding when to add GPU instances to your environment depends on the type of requests you are receiving. See the Using Metrics section for details on benchmarking and the available metrics.
When to Scale
When resources become constrained, it's time to scale. There are two basic strategies for scaling:
- Vertical scaling, sometimes referred to as "scale-up": resources such as CPU, memory, or storage are added to existing instances
- Horizontal scaling, also called "scale-out": new instances of workers are added
Deepgram generally recommends horizontal scaling for your self-hosted environment.
Enforcing Limits
The Deepgram system is designed to accept as much work as possible by default, irrespective of the performance implications of your specific infrastructure. Once the scaling threshold has been established for your particular use case and performance needs, you may wish to prevent existing instances from accepting additional load beyond the pre-calculated capacity.
This can be accomplished by setting a maximum request limit in your Engine configuration (engine.toml) with the max_active_requests value. In the deepgram-self-hosted Helm chart, this value is controlled by the engine.concurrencyLimit.activeRequests configuration value. Incoming requests to an Engine instance beyond this limit will return an error instead of slowing down other requests.
When an API instance receives a request and assigns inference work to a downstream Engine instance, that Engine instance may return an error if it is over capacity. The API instance will then attempt to retry the request with other Engine instances, if available. If no Engine instance is available to meet the request after exhausting all retry attempts, the API instance will ultimately return a 503 HTTP response to the calling application, confirming the failure.
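The toy model below illustrates the behavior described here: each Engine rejects work beyond its limit, the API tries the next Engine, and a 503 is returned only when every Engine is full. It is a simplified simulation under stated assumptions, not Deepgram's actual retry logic; the class, the function, and the in-order retry policy are all hypothetical.

```python
class Engine:
    """Toy Engine that refuses work beyond max_active_requests."""

    def __init__(self, name, max_active_requests):
        self.name = name
        self.max_active_requests = max_active_requests
        self.active = 0

    def try_accept(self):
        # Over capacity: reject rather than slow down in-flight requests.
        if self.active >= self.max_active_requests:
            return False
        self.active += 1
        return True


def dispatch(engines):
    """Try each Engine in turn; return (http_status, engine_name)."""
    for engine in engines:
        if engine.try_accept():
            return 200, engine.name
    # Every retry was rejected, so the caller sees a 503.
    return 503, None


pool = [Engine("engine-0", 2), Engine("engine-1", 2)]
statuses = [dispatch(pool)[0] for _ in range(5)]
print(statuses)  # [200, 200, 200, 200, 503]
```

A sustained rate of 503 responses under this kind of limit is a signal that the deployment needs more Engine capacity.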
Scaling Down
Reclaiming unused or underutilized resources is highly appealing for a variety of reasons, including cost savings, resource management, and security footprint concerns. Scaling down can be difficult to do correctly, especially for streaming use cases. It's important to ensure that active work is not lost, that users are not disconnected, and that the deployment does not scale back up shortly after scaling down ("yo-yo-ing"). Below we present two possible scale-down pathways.
Schedule Redeployment
The simplest method of reclaiming resources is the "redeployment" strategy, where part or all of the infrastructure of an application is periodically refreshed. This may be done during a regularly scheduled maintenance window where patches and upgrades are also deployed, or it may be done on a scheduled basis such as weekly or nightly, during slow periods.
Regardless of when the procedure occurs, the result is to bring the environment back down to a baseline level on a regular basis to reclaim resources. This strategy is almost always utilized in conjunction with auto-scaling up, allowing the environment to grow as required between resets.
If the full environment cannot be taken offline as a whole, care must be taken to first divert traffic from the segment being reset to other segments temporarily. Once reset, traffic is then restored and another segment is reset until all environments have been redeployed. This can be handled automatically in Kubernetes with rolling updates, such as running helm upgrade with the deepgram-self-hosted Helm chart.
Full Auto-Scaling
A fully automated environment that scales both up and down on-demand is the most challenging to achieve. Preventing resource yo-yo-ing requires some level of trade-off to keep resources around long enough to ensure they are no longer needed, and must be balanced with the level of aggressiveness used to scale out. These strategies will depend entirely upon your business needs and may require multiple deployments to handle different business outcomes from the same application.
Streams connected to Deepgram may be held open by an application even when not doing useful work, unlike batch mode requests which are cleared upon transcript completion. Because of this, it can be difficult to determine when an Engine instance is in fact no longer processing useful information. We recommend the following approach:
- Determine a grace period threshold within which most streaming requests in your environment should finish. This should be long enough to account for the highest reasonable outlier within your normal distribution of audio durations. For example, if your average audio is 5 minutes, and you have outliers at 20, 35, and 50 minutes on a normal day, you'd likely set your threshold to 60 minutes.
- Send a SIGINT signal to the API/Engine container that you are planning to scale down. This will instruct the container to stop listening on its active port, and therefore stop accepting new connections, but to continue processing existing requests.
- Wait for the predetermined threshold for existing requests to complete.
- Beyond this threshold, there may be straggling streaming requests. You may choose to forcibly kill the container and any remaining requests at this point.
This behavior is automatically available in Kubernetes by setting the grace period. This period is configurable via global.outstandingRequestGracePeriod in the deepgram-self-hosted Helm chart. See the Helm chart documentation for details.
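Outside Kubernetes, the drain steps listed above can be scripted. The sketch below applies them to a local process as a stand-in for a container: it sends SIGINT, waits up to the grace period, and force-kills whatever remains. It assumes a POSIX system, and the `drain` helper is hypothetical. With a real container you would send the signal through your container runtime instead.

```python
import signal
import subprocess


def drain(process, grace_period_seconds):
    """Signal a process to stop accepting work, wait, then kill stragglers.

    Returns True if the process exited on its own within the grace period,
    and False if it had to be killed.
    """
    # Step 1: ask the process to stop accepting new connections.
    process.send_signal(signal.SIGINT)
    try:
        # Step 2: give in-flight requests time to finish.
        process.wait(timeout=grace_period_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Step 3: forcibly end any remaining requests.
        process.kill()
        process.wait()
        return False
```

For example, `drain(subprocess.Popen(["sleep", "600"]), grace_period_seconds=3600)` would apply the 60 minute threshold from the example above to a placeholder process.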
Lastly, care should be taken to ensure the replicas do not spin back up immediately after scaling down, sometimes called "yo-yo-ing" or "flapping". Kubernetes scaling policies directly support an anti-flapping mechanism to prevent immediate reallocation of replicas via the stabilizationWindowSeconds directive. HashiCorp Nomad refers to this as a scaling policy cooldown, and your control plane may use other wording.
Scaling behavior, such as the stabilizationWindowSeconds, can be configured in the deepgram-self-hosted Helm chart with scaling.auto.{api,engine}.behavior configuration values. See the Helm chart documentation for details.
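To show why a stabilization window dampens flapping, the sketch below mimics how the Kubernetes Horizontal Pod Autoscaler treats scale-down: instead of acting on the latest recommendation, it uses the highest recommendation seen within the window. The timestamps, replica counts, and function name are illustrative assumptions.

```python
def stabilized_scale_down(recommendations, window_seconds, now):
    """Highest replica recommendation observed within the stabilization window.

    `recommendations` is a list of (timestamp_seconds, desired_replicas).
    """
    recent = [replicas for timestamp, replicas in recommendations
              if now - window_seconds <= timestamp <= now]
    if not recent:
        raise ValueError("no recommendations inside the window")
    return max(recent)


# Load dropped over three minutes; recommendations were taken once per minute.
history = [(0, 8), (60, 6), (120, 3), (180, 3)]
print(stabilized_scale_down(history, window_seconds=300, now=180))  # 8
print(stabilized_scale_down(history, window_seconds=90, now=180))   # 3
```

With the longer window the deployment holds at 8 replicas until the earlier peak ages out, which trades some cost for protection against scaling back up moments later.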
Summary
Scaling your Deepgram deployment to handle your production traffic in an efficient and performant manner is a long-term challenge that is highly dependent on your use case and constraints. Deepgram publishes a variety of metrics to aid you in determining when Deepgram services are under heavy load, which can aid in scaling decisions.
If you have further questions about how to effectively benchmark your self-hosted deployment, or scale effectively, please contact your Deepgram Account Representative or Support.