Why your LiteLLM Prometheus metrics flicker under multiple workers
The Prometheus multiprocess problem in a nutshell — what `prometheus_client` actually does across forked workers, why LiteLLM's metrics are unusable with --num_workers, and how to fix it cleanly.
This isn’t a high-reaction issue. #10595 sits at zero 👍 in the LiteLLM repo. But every team running LiteLLM with --num_workers > 1 and Prometheus scraping eventually hits it, and the symptoms are exactly the kind that erode trust in your observability stack: counters go down. Counts oscillate. Scrape-to-scrape, the same metric reports different values depending on which worker happens to answer the /metrics request.
The issue is correctly diagnosed by the reporter and even points at the fix. It hasn’t been merged because the fix is operationally awkward, not because the diagnosis is wrong. This post explains what’s actually going on, why the workaround is non-trivial, and how to deploy LiteLLM with reliable Prometheus metrics today.
What you’ll see
Set --num_workers 4, cause some failures, then scrape /metrics repeatedly. You’ll see:
```
litellm_proxy_failed_requests_metric_total 12
# wait 5 seconds
litellm_proxy_failed_requests_metric_total 0
# wait 5 seconds
litellm_proxy_failed_requests_metric_total 7
# wait 5 seconds
litellm_proxy_failed_requests_metric_total 12
```
This is not a bug in LiteLLM’s metric recording. The recording is correct — each worker is faithfully tracking the failures it has personally seen. The bug is that /metrics returns the state of one worker, chosen at random. Counters are per-worker, scrapes hit one worker at a time, and Prometheus interprets the up-and-down pattern as a counter reset followed by a slow recovery.
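The mechanism is easy to see in a toy simulation (no LiteLLM involved; all numbers are illustrative): four per-worker counters, and a scrape that lands on whichever worker answers.

```python
import random

# Four workers, each tracking only the failures it personally handled.
worker_counters = [12, 0, 7, 3]

def scrape():
    # /metrics answers from whichever worker took the request.
    return random.choice(worker_counters)

# The true total is stable, but successive scrapes are not.
total = sum(worker_counters)  # 22 failures actually happened
samples = {scrape() for _ in range(1000)}
print(sorted(samples))  # [0, 3, 7, 12] in practice; the true total 22 never appears
```

Prometheus sees each drop (12 to 0, 12 to 7) as a counter reset, which is exactly the up-and-down pattern in the scrape output above.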
The reporter’s diagnosis confirms this experimentally:
- Specify `--num_workers 4`, cause some failures… `litellm_proxy_failed_requests_metric_total` now goes up and down at random.
- Do not specify `--num_workers`… `litellm_proxy_failed_requests_metric_total` never decreases. (The single-worker case works correctly.)
Why prometheus_client doesn’t share by default
The prometheus_client Python library stores metrics in module-level globals: a Counter or Gauge object holds its current value in a Python attribute on the instance, in process memory. When gunicorn forks workers from a master, each worker inherits the master’s metric objects via copy-on-write — but the moment any worker increments a counter, that page is copied, and from then on each worker has its own copy.
This is not a bug. It’s the only thing fork-and-CoW can do. In-process Python objects can’t be shared across processes without explicit IPC.
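A minimal POSIX-only sketch of the divergence, using a plain module-level integer in place of a real `Counter`:

```python
import os

counter = 0  # module-level state, like a prometheus_client Counter

pid = os.fork()
if pid == 0:
    # Child: the increment copies the page; only the child's copy changes.
    counter += 10
    os._exit(0)

os.waitpid(pid, 0)
# The parent never sees the child's increment.
print(counter)  # 0
```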
The library’s designers know this, and prometheus_client ships an opt-in multiprocess mode specifically for this case. In multiprocess mode:
- Each worker writes its metric values to memory-mapped files in a shared directory (configured via `PROMETHEUS_MULTIPROC_DIR`).
- The `/metrics` endpoint, instead of returning the in-memory state of the responding worker, reads all the files in the directory, aggregates them, and returns the combined view.
- The aggregation logic is metric-type-aware: counters sum across files; gauges can be `min`, `max`, `liveall`, `livesum`, or `mostrecent`; histograms sum buckets.
This works. The catch is in the small print.
The catch nobody mentions until they hit it
The PROMETHEUS_MULTIPROC_DIR environment variable must be set to a directory that the client library can use for metrics. This directory must be wiped between process/Gunicorn runs (before startup is recommended).
The reporter quotes this directly, and it’s the entire reason this isn’t a one-line PR. Multiprocess mode works only if:
1. The directory exists at startup.
2. The directory is empty at startup (or contains only files from this exact run).
3. Every worker has write access to it.
4. On worker shutdown, the worker's files are cleaned up. `prometheus_client` provides a `mark_process_dead(pid)` hook to do this, but you have to wire it into your shutdown handler.
5. The directory survives across worker restarts within a single proxy lifetime, but is wiped between proxy restarts.
Number 5 is the operational headache. If you wipe the directory on every gunicorn restart, you lose metric history across reloads — counters reset to zero whenever you deploy. If you don’t wipe it, stale files from previous runs accumulate and skew aggregation. The “right” answer depends on how you deploy and what your operators expect.
Hence the issue’s status: the reporter has a working diagnosis and a clear path to a fix, but didn’t open a PR because they didn’t know how the LiteLLM team wanted to resolve the directory-management question. That’s a maintainer decision, not a technical one.
What a real fix looks like
A clean fix has three parts.
1. Set up the multiprocess directory at proxy startup. LiteLLM’s proxy entrypoint should:
```python
import os
import tempfile


def setup_prometheus_multiproc() -> str:
    # Honor an explicit env var if the operator set one
    multiproc_dir = os.environ.get("PROMETHEUS_MULTIPROC_DIR")
    if multiproc_dir is None:
        multiproc_dir = tempfile.mkdtemp(prefix="litellm_prom_")
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = multiproc_dir
    os.makedirs(multiproc_dir, exist_ok=True)
    # Wipe stale files from previous runs at startup
    for f in os.listdir(multiproc_dir):
        os.remove(os.path.join(multiproc_dir, f))
    return multiproc_dir
```
2. Register a worker shutdown hook. When a worker exits, call multiprocess.mark_process_dead(os.getpid()) so its files are cleaned up. With gunicorn, this is the worker_exit server hook.
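With gunicorn, that wiring is a few lines in the config file; this sketch uses gunicorn's documented `worker_exit` server hook:

```python
# gunicorn_conf.py -- shutdown wiring for multiprocess metrics
from prometheus_client import multiprocess


def worker_exit(server, worker):
    # Remove/merge this worker's mmap files so dead PIDs
    # don't linger in the aggregated view.
    multiprocess.mark_process_dead(worker.pid)
```

Launched as `gunicorn -c gunicorn_conf.py app:app`, gunicorn calls the hook once per exiting worker.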
3. Replace the /metrics handler. Instead of using the default prometheus_client handler that returns the local registry, return an aggregated view:
```python
from fastapi import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    generate_latest,
    multiprocess,
)


@app.get("/metrics")
async def metrics():
    # Aggregate every worker's mmap files into a fresh registry
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return Response(generate_latest(registry), media_type=CONTENT_TYPE_LATEST)
```
That’s the entire fix. About 30 lines, all in the proxy entrypoint.
The reason it’s been open for ~11 months despite being well-understood is that integrating it cleanly with LiteLLM’s existing Prometheus setup requires touching several files and making opinionated decisions about defaults. The decisions are:
- Should multiprocess mode be opt-in (env var) or auto-enabled when `--num_workers > 1`?
- Should the directory be auto-created in `/tmp` or required from the user?
- Should the directory be wiped on startup, or only on first launch (so it survives crashes within a deployment)?
Reasonable people disagree on each of these.
What teams do today
The issue has only a handful of comments, but the ecosystem’s standard workarounds for “Prometheus + multi-worker Python proxy” apply here:
- Deploy with `--num_workers 1` and scale horizontally instead. Run more proxy containers behind a load balancer rather than more workers per container. Each container has one worker, so there is no multiprocess problem. Simple, but memory-inefficient because each container pays the LiteLLM startup cost separately.
- Run a sidecar metrics aggregator. Skip Prometheus's built-in multiprocess support and instead push metrics from each worker to a sidecar (StatsD, OpenTelemetry collector) that handles aggregation. Adds complexity but works with any number of workers and any framework.
- Patch the proxy locally. Apply the 30-line fix above to your own LiteLLM image. This is what teams who can’t change their deployment topology end up doing.
- Disable Prometheus and use a different observability stack. OpenTelemetry’s metric SDK has its own multi-process story (broadly similar) but the LiteLLM OTEL integration is independent of the Prometheus integration, so you can opt out of one without losing the other.
The broader lesson
Anything that uses prometheus_client in a forking server is going to hit this. It’s not unique to LiteLLM; the same issue exists in raw Flask, FastAPI, Django, and any other Python web framework with multi-process workers and Prometheus metrics. The multiprocess mode in prometheus_client exists because the library’s designers knew about this and provided a fix — but the fix has operational gotchas that frameworks need to wrap for their users.
The right pattern for any framework that ships Prometheus metrics out of the box:
- Auto-detect the number of workers at startup.
- If the count is `> 1`, transparently set up multiprocess mode with a sane default directory.
- Provide a `/metrics` handler that uses `MultiProcessCollector`.
- Document the directory management story prominently.
LiteLLM does the first half (it ships Prometheus integration) but not the second half (it doesn’t handle the multi-worker case). #10595 is exactly the gap between those two halves, and the fix is well-understood. It just needs maintainer attention and a defaults decision.
References
- Upstream issue: #10595
- `prometheus_client` multiprocess mode: prometheus.github.io/client_python
- gunicorn server hooks: docs.gunicorn.org