Metrics Guide
Monitoring system metrics is an important part of maintaining a healthy Deepgram self-hosted deployment. Metrics can also aid in decision making around scaling and performance concerns. To this end, Deepgram services publish a variety of metrics on exposed endpoints that you can query to determine system health.
Each section also contains details on implementing image-specific liveness and readiness probes. These should be implemented when using a container orchestration tool that supports health probes. For example, the deepgram-self-hosted
Helm chart includes these checks by default.
Deepgram API
For self-hosted deployments, the Deepgram API container images expose an endpoint /v1/status
, on port 8080 by default. Querying this endpoint will yield three pieces of information:
- If a successful response is received, the API is alive and listening to messages
- The response body gives a backward-looking indication of
system_health
- The response body indicates how many requests this API instance is processing
Liveness Probe
Use a TCP check on the open port (port 8080 by default).
Readiness Probe
Query the status of the /v1/status/engine
endpoint and check whether it is in a "Connected"
state with at least one Engine container.
curl --silent http://localhost:PORT/v1/status/engine | grep --quiet -e '^{\"engine_connection_status\"\:\"Connected\".*}$'
Make sure to replace the
PORT
placeholder with the port your container is listening on (port 8080 by default).
Model Metadata
Make a GET request to list all models the Engine has loaded, along with their metadata. For broader information, see the Model Metadata endpoint documentation. When pointed at a self-hosted deployment, this endpoint is helpful to confirm that models present in your models directory are being loaded by the container as expected.
curl --location 'http://localhost:8080/v1/models'
Deepgram Engine
The Deepgram Engine container image publishes an extensive set of system metrics. These metrics are configured on a separate endpoint and port than the main service.
Docker/Podman
Choose a host port HOST_PORT
where external queries can be made, and choose a container port CONTAINER_PORT
where Engine can internally publish its metrics. These can be the same port number, since they are binding to different networks (the host network versus the container network)
Port Collision
"Port collision" can occur when you try to bind to the same port from two different services. Since we are binding to both a container port and a host port, we have to be aware of this on two different networks.
When selecting a host port, do not use the same port that is used by any other Deepgram service, or any other service running on the host machine. In the default Deepgram
docker-compose.yml
file, the API often uses port8080
, and the License Proxy often uses ports8443
and8089
. A common default value for the EngineHOST_PORT
is9991
.When selecting a container port, do not select port
8080
, as this is used on the container network to communicate between the Engine and the API. A common default value for the EngineCONTAINER_PORT
is9991
.
Within your docker-compose.yml
file you must publish the internal container port to the external host port, as shown below. See Published Ports in the official Docker documentation for more details.
services:
engine:
# other definitions
ports:
- "HOST_PORT:CONTAINER_PORT"
To modify the Engine configuration, edit your engine.toml
file to specify the container port to publish metrics to:
# To support metrics we need to expose an Engine endpoint
[metrics_server]
host = "0.0.0.0"
port = CONTAINER_PORT
Make sure to replace the placeholders
HOST_PORT
andCONTAINER_PORT
in both of the above snippets.
Metrics may now be queried from the self-hosted instance on the local host at :HOST_PORT/metrics
.
Kubernetes
The Engine metrics endpoint is exposed by default by the deepgram-self-hosted
Helm chart via a NodePort Service. This service's name is defined as the .engine.namePrefix
configuration values appended by -metrics
. By default, this service will be named deepgram-engine-metrics
. See the engine.metricsServer.*
configuration values for more options.
When scaling.auto.enabled
is set to true
, the Engine metrics endpoint will be automatically scraped by Prometheus and used for auto-scaling.
Available Metrics
Upon startup of the containers, a limited set of metrics will be available until the first request is made. After the first request is made a complete set of metrics will be available.
Initial Metrics
engine_estimated_stream_capacity
value will increase as you open more streams until you reach the GPU capacity. This means it will start off low and increase as more streams are opened. When engine_estimated_stream_capacity
stops increasing this is when you have reached the GPU Stream capacity
# HELP engine_estimated_stream_capacity The number of streams the node believes it can serve with acceptable latency.
# TYPE engine_estimated_stream_capacity gauge
engine_estimated_stream_capacity <integer>
# HELP engine_active_requests Number of active ASR requests
# TYPE engine_active_requests gauge
engine_active_requests{kind="batch"} <integer>
engine_active_requests{kind="stream"} <integer>
engine_active_requests{kind="tts"} <integer>
Complete Metrics
# HELP engine_active_requests Number of active ASR requests
# TYPE engine_active_requests gauge
engine_active_requests{kind="batch"} <integer>
engine_active_requests{kind="stream"} <integer>
engine_active_requests{kind="tts"} <integer>
# HELP engine_batch_response_time_seconds Time to process a batch request.
# TYPE engine_batch_response_time_seconds histogram
engine_batch_response_time_seconds_bucket{le="1"} <integer>
engine_batch_response_time_seconds_bucket{le="2.5"} <integer>
engine_batch_response_time_seconds_bucket{le="5"} <integer>
engine_batch_response_time_seconds_bucket{le="10"} <integer>
engine_batch_response_time_seconds_bucket{le="30"} <integer>
engine_batch_response_time_seconds_bucket{le="60"} <integer>
engine_batch_response_time_seconds_bucket{le="+Inf"} <integer>
engine_batch_response_time_seconds_sum <float>
engine_batch_response_time_seconds_count <integer>
# HELP engine_estimated_stream_capacity The number of streams the node believes it can serve with acceptable latency.
# TYPE engine_estimated_stream_capacity gauge
engine_estimated_stream_capacity <integer>
# HELP engine_requests_total Number of ASR requests.
# TYPE engine_requests_total counter
engine_requests_total{kind="batch",response_status="1xx"} <integer>
engine_requests_total{kind="batch",response_status="2xx"} <integer>
engine_requests_total{kind="batch",response_status="3xx"} <integer>
engine_requests_total{kind="batch",response_status="4xx"} <integer>
engine_requests_total{kind="batch",response_status="5xx"} <integer>
engine_requests_total{kind="stream",response_status="1xx"} <integer>
engine_requests_total{kind="stream",response_status="2xx"} <integer>
engine_requests_total{kind="stream",response_status="3xx"} <integer>
engine_requests_total{kind="stream",response_status="4xx"} <integer>
engine_requests_total{kind="stream",response_status="5xx"} <integer>
engine_requests_total{kind="tts",response_status="1xx"} <integer>
engine_requests_total{kind="tts",response_status="2xx"} <integer>
engine_requests_total{kind="tts",response_status="3xx"} <integer>
engine_requests_total{kind="tts",response_status="4xx"} <integer>
engine_requests_total{kind="tts",response_status="5xx"} <integer>
# HELP engine_stream_connection_latency_seconds Latency establishing a streaming connection.
# TYPE engine_stream_connection_latency_seconds histogram
engine_stream_latency_sum <float>
engine_stream_latency_count <integer>
# HELP engine_stream_latency Latency processing streaming requests, in seconds.
# TYPE engine_stream_latency histogram
engine_stream_latency_bucket{le="0.1"} <integer>
engine_stream_latency_bucket{le="0.2"} <integer>
engine_stream_latency_bucket{le="0.3"} <integer>
engine_stream_latency_bucket{le="0.4"} <integer>
engine_stream_latency_bucket{le="0.5"} <integer>
engine_stream_latency_bucket{le="0.6"} <integer>
engine_stream_latency_bucket{le="0.7"} <integer>
engine_stream_latency_bucket{le="0.8"} <integer>
engine_stream_latency_bucket{le="0.9"} <integer>
engine_stream_latency_bucket{le="1.0"} <integer>
engine_stream_latency_bucket{le="1.1"} <integer>
engine_stream_latency_bucket{le="1.2"} <integer>
engine_stream_latency_bucket{le="1.3"} <integer>
engine_stream_latency_bucket{le="1.4"} <integer>
engine_stream_latency_bucket{le="1.5"} <integer>
engine_stream_latency_bucket{le="1.6"} <integer>
engine_stream_latency_bucket{le="1.7"} <integer>
engine_stream_latency_bucket{le="1.8"} <integer>
engine_stream_latency_bucket{le="1.9"} <integer>
engine_stream_latency_bucket{le="2.0"} <integer>
engine_stream_latency_bucket{le="5.0"} <integer>
engine_stream_latency_bucket{le="10.0"} <integer>
engine_stream_latency_bucket{le="+Inf"} <integer>
Streaming latency metrics
Together, the engine_stream_latency_bucket
metrics may be used to plot a histogram of streaming latency in Grafana, using the PromQL queries below. This set of metrics tracks the latency of each new stream. With this data, distribution and trends in streaming latency may be monitored for each deployment.
histogram_quantile(0.10, sum by (le) (rate(engine_stream_latency_bucket[5m]))) or vector(0)
histogram_quantile(0.25, sum by (le) (rate(engine_stream_latency_bucket[5m]))) or vector(0)
histogram_quantile(0.50, sum by (le) (rate(engine_stream_latency_bucket[5m]))) or vector(0)
histogram_quantile(0.75, sum by (le) (rate(engine_stream_latency_bucket[5m]))) or vector(0)
histogram_quantile(0.90, sum by (le) (rate(engine_stream_latency_bucket[5m]))) or vector(0)
histogram_quantile(0.95, sum by (le) (rate(engine_stream_latency_bucket[5m]))) or vector(0)
histogram_quantile(0.99, sum by (le) (rate(engine_stream_latency_bucket[5m]))) or vector(0)
Liveness Probe
Use a TCP check on the open port (port 8080 by default).
Readiness Probe
Use a TCP check on the open port (port 8080 by default).
Deepgram License Proxy
For self-hosted deployments, the Deepgram License Proxy container images expose an endpoint /v1/status
, on port 8080 by default. Querying this endpoint will indicate if the license proxy is able to communicate with the Deepgram license server.
Liveness Probe
Use a TCP check on the open status port (port 8080 by default).
Readiness Probe
Query the status of the /v1/status
endpoint and check the connection state.
curl --silent http://localhost:8080/v1/status | grep --quiet -e '^{.*\"state\"\:\"\(Connected\|TrustBased\)\".*}$'
Make sure to replace the
PORT
placeholder with the port your container is listening on (port 8080 by default).
Summary
To access metrics for the API, Engine, and License Proxy containers, run the following CURL request from the same machine the containers are running on. The ports in the commands below are the default port numbers; check your configuration files to see if the port mapping was changed.
- API:
curl "http://localhost:8080/v1/status"
andcurl "http://localhost:8080/v1/status/engine"
- Engine:
curl "http://localhost:9991/metrics"
- License Proxy:
curl "http://localhost:8080/v1/status"
Updated 17 days ago
You may want to setup tooling for ingesting and monitoring system metrics, for example, with our Prometheus guide.