Auto-scaling

This guide and the following recommendations are written in the context of a Kubernetes deployment. However, the strategies presented here are transferable to other container control planes, though some implementation details may need to be adjusted for your environment, tooling, or use case. It may therefore be helpful to understand the following Kubernetes-specific terms:

  • pod - a single container, for the purposes of this guide. Ignoring sidecars and DaemonSets, this is the smallest unit of control in a Kubernetes environment
  • node - a worker that runs pods, analogous to a host instance running containers. Nodes may be physical or virtual
  • cluster - a group of worker nodes (hosts)
  • metrics - application performance data, provided in the Prometheus exposition format (OpenMetrics), used to monitor and make decisions about scaling
  • metrics server - a service or tool used to consume metrics, such as Prometheus, CloudWatch, etc.

Auto-scaling is a complex endeavor. It is highly dependent on the application deployed, the use case of that application, and business needs and constraints. As such, this guide does not attempt to address all scenarios. A basic understanding of containerized infrastructure is assumed, and specific implementation details are out of scope save for some selected example configurations. References to other materials may also be provided where appropriate.

Scaling Metrics

In order to make auto-scaling decisions you will require some insight into the applications running in your environment. Basic metrics are available for the Deepgram API and Deepgram License Proxy. Additionally, each Deepgram Engine instance publishes a more robust set of system metrics in the Prometheus exposition format (now OpenMetrics) on the /metrics endpoint. These Engine metrics should be utilized to monitor and scale your on-prem deployment.

The list of available Deepgram metrics can be found under the Available Metrics section of the Deepgram Metrics guide.

Please note that some metrics will not be exposed until a request of a given type has been submitted. For example, the engine_requests_total counters for ASR batch and streaming will not appear until you make an ASR batch or streaming request, respectively.

Consuming Metrics

Out of the box, Kubernetes does not provide any metrics for use with auto-scaling. It is incumbent upon the operator or engineer deploying Kubernetes to also deploy something to capture and serve metrics pertinent to their needs. Kubernetes provides an optional metrics server that is compatible with Prometheus-style metrics, which may be deployed for this purpose. If deployed, it will automatically monitor your clusters for standard metrics such as CPU, memory, and I/O. This can be deployed in your environment directly with kubectl:

$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Kubernetes can also be configured with a Prometheus adapter to make custom metrics or external metrics from an external collector available through the Kubernetes API. This is useful for utilizing the Deepgram Engine metrics in your auto-scaling decision-making. For cloud deployed environments, your service provider may include their own metrics server, e.g. AWS CloudWatch, Google Stackdriver, etc. Your organization may also already employ its own metrics infrastructure such as Datadog, or self-managed Prometheus. You will need to ensure the Deepgram metrics are monitored by your specific metrics collectors and also made available to your auto-scaling control plane via a Prometheus adapter.
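If you use the open-source prometheus-adapter for this purpose, a discovery rule is needed to map the Engine series onto the Kubernetes custom metrics API. The excerpt below is an illustrative sketch only; the label names (namespace, pod) assume Prometheus attaches the standard Kubernetes labels when scraping your Engine pods:

```yaml
# prometheus-adapter configuration excerpt (illustrative)
rules:
  - seriesQuery: 'engine_active_requests{namespace!="",pod!=""}'
    resources:
      # map the Prometheus labels back onto Kubernetes objects
      overrides:
        namespace: { resource: "namespace" }
        pod: { resource: "pod" }
    name:
      as: "engine_active_requests"
    # aggregate the series when the metrics API asks for grouped values
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```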

Using Metrics

The Deepgram metrics exposed provide insight into the Engine's performance and are best used in conjunction with proper benchmarking of your specific use case, detailed further on in this document. Because the Engine instances run on GPUs and are designed to optimize their use of GPU resources, traditional metrics such as CPU and memory are not primary indicators of performance, and GPU usage will generally remain consistent.

Deepgram is highly efficient at processing pre-recorded batch audio and can generate transcripts for an hour of audio in under 30 seconds. For extensive backlogs of audio you may still find the need to scale beyond a single instance of Engine to expedite the process, or to meet increasing customer needs. The primary metric of concern for scale considerations is the active request count: engine_active_requests{kind="batch"}. Because Engine will attempt to queue all incoming requests, it is critical to establish a baseline for acceptable performance (the time to return transcripts) on your hardware via benchmarking. Once the baseline has been established, the active requests metric can be utilized to scale before additional requests impact transcription times in your batch environment.
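As a sketch, once this metric is available through a Prometheus adapter as a custom pods metric, a HorizontalPodAutoscaler can target it directly. The threshold of 30 concurrent batch requests per pod below is a hypothetical benchmark result, not a recommendation; substitute your own baseline:

```yaml
# illustrative HPA; assumes engine_active_requests is exposed as a
# custom pods metric and 30 requests per pod is your benchmarked threshold
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: engine-batch
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: engine
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: engine_active_requests
          selector:
            matchLabels:
              kind: batch
        target:
          type: AverageValue
          averageValue: "30"
```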

Streaming audio is time sensitive for the majority of use cases, and establishing acceptable response latencies is critical to determining appropriate scaling points for your application. The specific combination of model and feature set used will determine how many streams your instances can support with acceptable latencies for your use. The engine_estimated_stream_capacity metric can help provide a rough guide, however, this metric provides real-time output and will require monitoring under load to make the best use. Benchmarking will provide a more robust determination of performance. Once you've established the appropriate benchmarks, you can monitor capacity using the engine_active_requests{kind="stream"} metric for streaming requests.
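For monitoring purposes, the ratio of active streams to estimated capacity can be expressed as a PromQL query. This is a sketch that assumes both series are scraped from each Engine instance:

```promql
# percentage of estimated stream capacity currently in use, per instance
sum by (instance) (engine_active_requests{kind="stream"})
  / sum by (instance) (engine_estimated_stream_capacity) * 100
```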

Summary

For batch audio monitor:

  • engine_active_requests{kind="batch"}

For streaming audio monitor:

  • engine_active_requests{kind="stream"}
  • engine_estimated_stream_capacity

Additionally, success & error counts are available under the status buckets:

engine_requests_total{kind="batch",response_status="1xx"}
engine_requests_total{kind="batch",response_status="2xx"}
engine_requests_total{kind="batch",response_status="3xx"}
engine_requests_total{kind="batch",response_status="4xx"}
engine_requests_total{kind="batch",response_status="5xx"}
engine_requests_total{kind="stream",response_status="1xx"}
engine_requests_total{kind="stream",response_status="2xx"}
engine_requests_total{kind="stream",response_status="3xx"}
engine_requests_total{kind="stream",response_status="4xx"}
engine_requests_total{kind="stream",response_status="5xx"}
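These counters can also drive alerting on error rates. As an illustrative PromQL sketch, the share of batch responses failing with 5xx errors over the last five minutes:

```promql
# fraction of batch responses that were 5xx errors over 5 minutes
sum(rate(engine_requests_total{kind="batch",response_status="5xx"}[5m]))
  / sum(rate(engine_requests_total{kind="batch"}[5m]))
```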

Deployment Considerations

Recall that Deepgram on-premises consists of a number of separate components, two of which we are concerned with scaling: onprem-engine instances (Engine), which handle inference tasks, and onprem-api instances (API), which broker requests to the Engine instances and coordinate some post-processing. Each API instance can generally support up to four Engine instances; however, some customers choose to deploy API instances 1:1 with Engine, as the API is comparatively lightweight. Ideally, all Engine instances should be deployed to NVIDIA GPU-enabled nodes, and GPUs cannot be shared among Engine instances. The remaining components, excluding API & Engine nodes, can generally be deployed to a single node and replicated as desired.

Single NVIDIA GPU vs Multi-NVIDIA GPU Instances

While the Deepgram Engine can make use of multiple NVIDIA GPUs, in most cases single NVIDIA GPU instances will be the more efficient deployment strategy. This is especially true where inference workloads are very homogeneous. If your workloads are highly variable, you may in some cases find performance increases by utilizing multi-NVIDIA GPU instances.

In either scenario Engine instances cannot share NVIDIA GPU resources. Each Engine instance you run will need to be deployed to a separate NVIDIA GPU-enabled node.

🚧

While the Deepgram Engine can take advantage of multiple GPUs and works with the latest high-powered NVIDIA GPUs, an enterprise-scale deployment should be prepared for horizontal scaling, which is the act of adding node (server) replicas to a deployment without adding resources to those nodes.

CPU-only Instances

Deepgram Engine is implemented and optimized to take advantage of NVIDIA GPUs, but it does support handling inference tasks in CPU-only environments. CPU-only instances should only be provisioned in scenarios where NVIDIA GPU-accelerated instances are unavailable.

Container Image Cache

Each time a pod is deployed, a container image must be pulled from either a local Container Image Cache or a remote Container Registry to instantiate the workload. When replicating pods on a single node, the container runtime will generally handle this transparently for the operator by referencing the local Container Image Cache. When deploying new nodes, on the other hand, there are no local copies of the container images in the Container Image Cache, so the images must be pulled from a remote Container Registry.

It is important to consider where the container images are pulled from when scaling out your deployment to new nodes, as this will impact the time required to scale. Some environments, such as Amazon's ECS & EKS, Azure's AKS, and Google's GKE, provide caching mechanisms you can leverage (or that are automatic for those environments) as well as local Container Registries. Others, such as Amazon's Fargate, have no native caching available.

To reduce the node deployment time we recommend you cache images in a local Container Registry that will be used in a horizontal scaling scenario. This can be done by providing a cache service in your data center if you are running on your own hardware. If you are using a cloud provider such as AWS, Azure, or Google Cloud, you can deploy an in-region cache in your cloud environment.

Scaling Considerations

When scaling Deepgram horizontally, we generally recommend:

  • Multiple servers hosted on VPC, DC, or bare metal
  • A customer-provided proxy (for example, NGINX, HAProxy, Apache) to handle TLS and request distribution across API instances
  • A single GPU per machine

If you want to increase your capacity to handle inference requests, you can add more GPUs to your deployment by spinning up additional Deepgram Engine nodes. Factors that can help you decide when to increase capacity vary depending on whether you are performing transcription on pre-recorded audio or live streaming audio, or performing other language AI requests.

Pre-recorded Audio

To determine when to increase your capacity for pre-recorded audio requests, you should first determine the acceptable amount of request latency for your business requirements and customer SLAs. When considering your own requirements, remember that you may be processing files of varying sizes and that, generally, you won't require an immediate response.

Live Streaming Audio

To determine when to increase your capacity for live streaming audio requests, you will want to make sure that all simultaneous audio streams are able to maintain an average response time of 400 ms or less. The more streams you open for transcription, the higher the request latency will climb.

Performance can also be negatively impacted by a variety of other factors, including the features you are using and the size of the chunks of audio you are sending. In this case, we recommend working with a Deepgram Applied Engineer to analyze your use case, expected throughput, and risk tolerance.

Other Language AI

If you are using other Deepgram services beyond ASR, please contact Support for recommendations on scaling considerations.

Scaling Up/Out

When resources become constrained, it's time to scale. There are two basic strategies for scaling: vertical scaling, sometimes referred to as "scale-up", where resources such as CPU, memory, or storage are added to an existing system; and horizontal scaling, also called "scale-out", where new instances of workers are added. Not all applications support all scaling strategies.

Scaling is highly dependent upon the software application you're deploying, and the type of work that application is performing. Even within an application, differing use cases may create differing loads and necessitate differing resource requirements. For instance, if you're primarily concerned with processing pre-recorded audio in large batches you may favor a highly cost-effective approach with high density instances where time to complete a transcription is less critical. On the other hand, if you're processing real-time streams of audio, latency is likely very important to you, and you may trade density for high performance by favoring fewer streams per deployed instance.

Benchmarking

Many factors impact the performance of your Engine instances. These include the inference model used, the quality and format of the input (e.g. audio), and the specific set of natural language understanding features employed. To best understand your use case, we strongly recommend you benchmark your specific settings to establish the best possible baseline for your needs. Additionally, this will allow you to establish the thresholds for scaling that meet your (or your client's) specific application needs well in advance of production deployment.

Benchmarking should be performed with representative input data of the same quality, duration, format, and features you intend to deploy in production. The goal is to determine your threshold for the performance of a single instance, or in other words, the inflection point at which more resources or instances are required to maintain the performance levels desired. This establishes your scaling threshold. In addition to the scaling threshold, you will need to understand the time required to provision new resources to cope with the additional load as you scale. This lead time will factor into your determination of when to provision resources.

When considering your scaling strategy and establishing your thresholds it is important to understand that while API instances can generally leverage scale-up (vertical) or scale-out (horizontal) strategies to a certain extent, Engine instances are GPU bound and cannot be vertically scaled. This means you must add new Engine instances when your existing instances become load-saturated.

API Scaling

The API instances are the ingress handlers to your Deepgram AI platform. Not only do they provide programmatic interfaces for your applications, but they also broker retries, coordinate complex transcription requests, and perform some post-processing tasks. All of this workload is traditional CPU-based processing that can benefit from some vertical scaling.

Additionally, a single API instance can manage multiple Engine instances; the recommended maximum ratio is 4:1 Engine to API instances. API pods can run alongside Engine pods on the same node if required, provided the node is sufficiently sized.

Engine Scaling

The Engine is the workhorse of Deepgram and leverages all the power of modern advanced GPU programming. Each Engine instance must have at least one dedicated GPU to run on and multiple GPU instances are supported. Engine instances may not share GPUs, nor are GPUs vertically scalable. Therefore, to add Engine capacity you must horizontally scale Engine pods, with one pod per GPU-enabled node (host).

Fine-Tuning

The Deepgram system is designed to accept as much work as possible by default, irrespective of the performance implications of your specific infrastructure. Once the scaling threshold has been established for your particular use case and performance needs, you may wish to prevent existing instances from accepting any additional load. This can be accomplished by setting a maximum request limit in your Engine configurations (engine.toml) such that requests beyond the threshold of acceptable performance will return an error instead of slowing down other requests. The max_active_requests limiter provides direct control over this limit.
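An illustrative engine.toml excerpt follows; the exact placement of this key may vary by Engine version, so consult your own configuration file for the surrounding context:

```toml
# engine.toml excerpt (illustrative): requests beyond this limit are
# rejected immediately rather than queued, so in-flight work is protected
max_active_requests = 30
```

The value of 30 here is a placeholder; set it to the threshold your benchmarking established.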

Setting an Engine request limit signals the API instances to retry the request on another Engine instance. If no Engine instance is available to serve the request after exhausting all retry attempts, the API will ultimately return a 503 error to the calling application, confirming the failure. API retries can be configured in the api.toml configuration file using the retries_per_ip parameter to fine-tune as required.
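An illustrative api.toml excerpt (the value shown is a placeholder, not a recommendation):

```toml
# api.toml excerpt (illustrative): number of retry attempts per resolved
# Engine IP before the API gives up and returns a 503
retries_per_ip = 1
```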

Scaling Down

Reclaiming unused or underutilized resources is highly appealing for a variety of reasons including cost savings, resource management, and security footprint concerns to name a few. Scaling down an application can be difficult for as many reasons as it is beneficial. For Deepgram users, scaling down is generally the most difficult for streaming use cases, and this section will primarily target those.

Determining when to scale down requires as much attention, if not more, than scaling up or out. It also presents a number of unique challenges, as scaling down is almost always disruptive. Ensuring that active work is not lost, that users are not disconnected, and that the environment does not yo-yo (scale back up shortly after scaling down) are all important considerations. Below we present two possible scale-down pathways.

Schedule Redeployment

The simplest method of reclaiming resources employed in infrastructure management today is the redeployment strategy. CI/CD pipelines, infrastructure-as-code management, and cloud platforms have enabled programmatic access to manage nearly all infrastructure and applications. In the redeployment scenario part or all of the infrastructure of an application is periodically refreshed. This may be done during a regularly scheduled maintenance window where patches and upgrades are also deployed, or it may be done on a scheduled basis such as weekly or nightly, during slow periods. Regardless of when the procedure occurs, the result is to bring the environment back down to a baseline level on a regular basis to reclaim resources. This strategy is almost always utilized in conjunction with auto-scaling up, allowing the environment to grow as required between resets.

If the full environment cannot be taken offline as a whole, care must be taken to first divert traffic from the segment or region being reset to other segments or regions temporarily. Once reset, traffic is then restored and another segment or region is reset until all environments have been redeployed. In some cases, a temporary environment may be utilized to handle part or all of the normal traffic while the reset is taking place, and then decommissioned upon completion of the routine.

Full Auto-Scaling

A fully automated environment that scales both up and down on demand is the most challenging to achieve. Preventing resource yo-yo-ing requires some level of trade-off to keep resources around long enough to ensure they are no longer needed, and must be balanced with the level of aggressiveness used to scale up or out. These strategies will depend entirely upon your business needs and may require multiple deployments to handle different business outcomes from the same application.

Streams connected to Deepgram may be held open by an application even when not doing useful work, unlike batch mode requests which are cleared upon transcript completion. Because of this, it can be difficult to determine when an Engine instance is in fact no longer processing useful information. Therefore we recommend a staged approach to decommissioning Engine instances over a period of time to permit active streams to finish and minimize disruption.

First, a period of observation or "lookback" is determined, across which the usage is averaged to determine the scale-down threshold. This time period may be as short as 10 minutes, though typically it would be over the course of several hours or a full workday if taking a more conservative approach.

Example Prometheus utilization formula, with 1-hour lookback and a resolution of 1 minute:

max_over_time(
	(
	sum(engine_active_requests{kind="stream"}) /
	sum(engine_max_active_requests) * 100
	)[1h:1m]
)

Second, a grace period is established based on the use case to ensure active work has sufficient time to complete prior to removing pods and nodes. This cooldown period should be long enough to account for the highest reasonable outlier within your normal distribution of audio durations. For example, if your average audio is 5 minutes, and you have outliers at 20, 35, and 50 minutes on a normal day, you'd likely set your cooldown to 60 minutes.

In Kubernetes the grace period is configured in the pod spec or template using the terminationGracePeriodSeconds directive.

Example Pod Configuration:

# truncated excerpt of deployment yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: engine

[...]

  template:
    metadata:
      labels:
        run: engine
    spec:
      hostNetwork: true
      imagePullSecrets:
      - name: regcred
      containers:
      - name: engine
        image: deepgram/onprem-engine:3.45.3
        imagePullPolicy: IfNotPresent
        command: [ "impeller" ]
        volumeMounts:
        - name: engine-volume
          mountPath: /etc/config
        - name: models-volume
          mountPath: /models
        ports:
        - containerPort: 8081
          name: engine-port
        args: ["-v", "serve", "/etc/config/engine.toml"]
      # cooldown/grace period
      terminationGracePeriodSeconds: 3600

[...]

Lastly, care should be taken to ensure the replicas do not spin back up immediately after scaling down, sometimes called yo-yo-ing or 'flapping'. Kubernetes scaling policies directly support an anti-flapping mechanism to prevent immediate reallocation of replicas via the stabilizationWindowSeconds directive (see Horizontal Pod Autoscaling). HashiCorp Nomad refers to this as a scaling policy cooldown, and your control plane may use other wording.

Example Kubernetes stabilization window or scaling cooldown:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300

Summary

Scaling your Deepgram deployment to handle your production traffic in an efficient and performant manner is a long-term challenge that is highly dependent on your use case and constraints. Deepgram publishes a variety of metrics to aid you in determining when Deepgram services are under heavy load, which can aid in scaling decisions.

If you have further questions about how to effectively benchmark your on-prem deployment, or scale effectively, please contact Support or your Deepgram Account Executive.


What’s Next

For specific details about the metrics available and integrating Deepgram with a solution like Prometheus, refer to these additional materials.