
Why your LiteLLM Prometheus metrics flicker under multiple workers

The Prometheus multiprocess problem in a nutshell — what `prometheus_client` actually does across forked workers, why LiteLLM's metrics are unusable with --num_workers, and how to fix it cleanly.

Dipankar Sarkar · ops · prometheus · observability · gunicorn · uvicorn
Upstream issue
#10595 — [Bug]: Prometheus metrics aren't shared across Uvicorn workers
Opened May 6, 2025 · status: open · 0 👍 · 4 comments

This isn’t a high-reaction issue. #10595 sits at zero 👍 in the LiteLLM repo. But every team running LiteLLM with --num_workers > 1 and Prometheus scraping eventually hits it, and the symptoms are exactly the kind that erode trust in your observability stack: counters go down. Counts oscillate. Scrape-to-scrape, the same metric reports different values depending on which worker happens to answer the /metrics request.

The issue is correctly diagnosed by the reporter and even points at the fix. It hasn’t been merged because the fix is operationally awkward, not because the diagnosis is wrong. This post explains what’s actually going on, why the workaround is non-trivial, and how to deploy LiteLLM with reliable Prometheus metrics today.

What you’ll see

Set --num_workers 4, cause some failures, then scrape /metrics repeatedly. You’ll see:

litellm_proxy_failed_requests_metric_total 12
# wait 5 seconds
litellm_proxy_failed_requests_metric_total 0
# wait 5 seconds
litellm_proxy_failed_requests_metric_total 7
# wait 5 seconds
litellm_proxy_failed_requests_metric_total 12

This is not a bug in LiteLLM’s metric recording. The recording is correct — each worker is faithfully tracking the failures it has personally seen. The bug is that /metrics returns the state of one worker, chosen at random. Counters are per-worker, scrapes hit one worker at a time, and Prometheus interprets the up-and-down pattern as a counter reset followed by a slow recovery.
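The flicker can be reproduced without LiteLLM at all. Here is a minimal sketch in plain Python (no prometheus_client involved; the per-worker counts are taken from the example scrape above) of per-worker counters plus a scrape that hits one worker at random:

```python
import random

# Four workers, each counting only the failures it personally handled.
worker_counters = {"w1": 12, "w2": 0, "w3": 7, "w4": 12}

def scrape():
    # /metrics answers from ONE worker, chosen effectively at random
    # by the load balancer in front of the workers.
    worker = random.choice(list(worker_counters))
    return worker_counters[worker]

samples = {scrape() for _ in range(200)}
print(sorted(samples))                 # flickers among 0, 7, and 12
print(sum(worker_counters.values()))   # 31 -- what an aggregated view would report
```

Prometheus interprets each drop in the scraped value as a counter reset, which is why rate() and increase() queries go haywire on top of it.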

The reporter’s diagnosis confirms this experimentally:

Specify --num_workers 4, cause some failures… litellm_proxy_failed_requests_metric_total now goes up and down at random.

Do not specify --num_workers: litellm_proxy_failed_requests_metric_total does not ever decrease. (The single-worker case works correctly.)

Why prometheus_client doesn’t share by default

The prometheus_client Python library stores metrics in module-level globals: a Counter or Gauge object holds its current value in a Python attribute on the instance, in process memory. When gunicorn forks workers from a master, each worker inherits the master’s metric objects via copy-on-write — but the moment any worker increments a counter, that page is copied, and from then on each worker has its own copy.

This is not a bug. It’s the only thing fork-and-CoW can do. In-process Python objects can’t be shared across processes without explicit IPC.

The library’s designers know this, and prometheus_client ships an opt-in multiprocess mode specifically for this case. In multiprocess mode:

  • Each worker writes its metric values to memory-mapped files in a shared directory (configured via PROMETHEUS_MULTIPROC_DIR).
  • The /metrics endpoint, instead of returning the in-memory state of the responding worker, reads all the files in the directory, aggregates them, and returns the combined view.
  • The aggregation logic is metric-type-aware: counters sum across files; gauges can be min, max, liveall, livesum, or mostrecent; histograms sum buckets.
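A toy model of that type-aware aggregation (the sample values and file names are made up for illustration; the real collector reads them from the mmap files in the directory):

```python
# Per-worker samples, roughly as MultiProcessCollector would read them
# from the per-PID files in PROMETHEUS_MULTIPROC_DIR.
counter_files = {"counter_101.db": 12, "counter_102.db": 0,
                 "counter_103.db": 7,  "counter_104.db": 12}
gauge_files   = {"gauge_101.db": 3, "gauge_102.db": 9,
                 "gauge_103.db": 5, "gauge_104.db": 1}

total_counter = sum(counter_files.values())   # counters always sum: 31
gauge_max     = max(gauge_files.values())     # multiprocess_mode="max": 9
gauge_livesum = sum(gauge_files.values())     # "livesum" over live workers: 18

print(total_counter, gauge_max, gauge_livesum)
```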

This works. The catch is in the small print.

The catch nobody mentions until they hit it

The PROMETHEUS_MULTIPROC_DIR environment variable must be set to a directory that the client library can use for metrics. This directory must be wiped between process/Gunicorn runs (before startup is recommended).

The reporter quotes this directly, and it’s the entire reason this isn’t a one-line PR. Multiprocess mode works only if:

  1. The directory exists at startup.
  2. The directory is empty at startup (or contains only files from this exact run).
  3. Every worker has write access to it.
  4. On worker shutdown, the worker’s files are cleaned up — prometheus_client provides a mark_process_dead(pid) helper for this, but you have to wire it into your shutdown handling yourself.
  5. The directory survives across worker restarts within a single proxy lifetime, but is wiped between proxy restarts.

Number 5 is the operational headache. If you wipe the directory on every gunicorn restart, you lose metric history across reloads — counters reset to zero whenever you deploy. If you don’t wipe it, stale files from previous runs accumulate and skew aggregation. The “right” answer depends on how you deploy and what your operators expect.

Hence the issue’s status: the reporter has a working diagnosis and a clear path to a fix, but didn’t open a PR because they didn’t know how the LiteLLM team wanted to resolve the directory-management question. That’s a maintainer decision, not a technical one.

What a real fix looks like

A clean fix has three parts.

1. Set up the multiprocess directory at proxy startup. LiteLLM’s proxy entrypoint should:

import os
import tempfile

def setup_prometheus_multiproc():
    # Must run before prometheus_client creates any metric objects:
    # the library chooses its value backend based on this environment
    # variable when it is imported.
    multiproc_dir = os.environ.get("PROMETHEUS_MULTIPROC_DIR")
    if multiproc_dir is None:
        multiproc_dir = tempfile.mkdtemp(prefix="litellm_prom_")
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = multiproc_dir

    # Wipe stale files from previous runs at startup
    os.makedirs(multiproc_dir, exist_ok=True)
    for f in os.listdir(multiproc_dir):
        os.remove(os.path.join(multiproc_dir, f))

    return multiproc_dir

2. Register a worker shutdown hook. When a worker dies, call multiprocess.mark_process_dead(pid) with the dead worker’s PID so its live-gauge files are cleaned up. With gunicorn, the prometheus_client docs recommend doing this from the child_exit server hook.
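With gunicorn this is a few lines in the server config (the filename is illustrative; the hook signature follows gunicorn’s server-hooks API):

```python
# gunicorn_conf.py (illustrative filename)
from prometheus_client import multiprocess

def child_exit(server, worker):
    # Called in the master when a worker dies: remove that worker's
    # live-gauge files so they stop contributing to aggregation.
    multiprocess.mark_process_dead(worker.pid)
```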

3. Replace the /metrics handler. Instead of using the default prometheus_client handler that returns the local registry, return an aggregated view:

from fastapi import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    generate_latest,
    multiprocess,
)

@app.get("/metrics")
async def metrics():
    # Aggregate every worker's mmap files into a fresh registry
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return Response(generate_latest(registry), media_type=CONTENT_TYPE_LATEST)

That’s the entire fix. About 30 lines, all in the proxy entrypoint.

The reason it’s been open for ~11 months despite being well-understood is that integrating it cleanly with LiteLLM’s existing Prometheus setup requires touching several files and making opinionated decisions about defaults. The decisions are:

  • Should multiprocess mode be opt-in (env var) or auto-enabled when --num_workers > 1?
  • Should the directory be auto-created in /tmp or required from the user?
  • Should the directory be wiped on startup, or only on first launch (so it survives crashes within a deployment)?

Reasonable people disagree on each of these.

What teams do today

The issue has only a handful of comments, but the ecosystem’s standard workarounds for “Prometheus + multi-worker Python proxy” apply here:

  1. Deploy with --num_workers 1 and scale horizontally instead. Run more proxy containers behind a load balancer rather than more workers per container. Each container has one worker, so there is no multiprocess problem. Simple, but heavier on memory, because each container loads a full LiteLLM process separately.
  2. Run a sidecar metrics aggregator. Skip Prometheus’s built-in multiprocess support and instead push metrics from each worker to a sidecar (StatsD, OpenTelemetry collector) that handles aggregation. Adds complexity but works with any number of workers and any framework.
  3. Patch the proxy locally. Apply the 30-line fix above to your own LiteLLM image. This is what teams who can’t change their deployment topology end up doing.
  4. Disable Prometheus and use a different observability stack. OpenTelemetry’s metric SDK has its own multi-process story (broadly similar) but the LiteLLM OTEL integration is independent of the Prometheus integration, so you can opt out of one without losing the other.
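Workaround 1 in practice looks roughly like this (the image tag, ports, and container names are illustrative, not prescriptive):

```shell
# Four single-worker LiteLLM containers behind your load balancer
for i in 1 2 3 4; do
  docker run -d --name "litellm-$i" -p "400$i:4000" \
    ghcr.io/berriai/litellm:main-latest \
    --port 4000 --num_workers 1
done
```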

The broader lesson

Anything that uses prometheus_client in a forking server is going to hit this. It’s not unique to LiteLLM; the same issue exists in raw Flask, FastAPI, Django, and any other Python web framework with multi-process workers and Prometheus metrics. The multiprocess mode in prometheus_client exists because the library’s designers knew about this and provided a fix — but the fix has operational gotchas that frameworks need to wrap for their users.

The right pattern for any framework that ships Prometheus metrics out of the box:

  • Auto-detect the number of workers at startup.
  • If > 1, transparently set up multiprocess mode with a sane default directory.
  • Provide a /metrics handler that uses MultiProcessCollector.
  • Document the directory management story prominently.
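The first two steps can be sketched as a hypothetical framework-side helper (the function name and defaults are mine, not LiteLLM’s):

```python
import os
import tempfile

def configure_metrics(num_workers: int):
    # Hypothetical helper: transparently enable multiprocess mode
    # only when more than one worker is requested.
    if num_workers <= 1:
        return None  # single worker: the default in-process registry is fine
    # Must happen before prometheus_client creates any metric objects.
    multiproc_dir = os.environ.setdefault(
        "PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp(prefix="metrics_")
    )
    os.makedirs(multiproc_dir, exist_ok=True)
    return multiproc_dir
```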

LiteLLM does the first half (it ships Prometheus integration) but not the second half (it doesn’t handle the multi-worker case). #10595 is exactly the gap between those two halves, and the fix is well-understood. It just needs maintainer attention and a defaults decision.
