Accelerating the LiteLLM proxy with Fast LiteLLM
A production-ready guide to running the LiteLLM proxy server with Fast LiteLLM under gunicorn, Docker, and systemd — including the import-order trap that catches most teams.
The LiteLLM proxy is where Fast LiteLLM earns its keep. A single proxy instance fielding hundreds of concurrent requests bottlenecks on connection pooling and rate limiting long before any model provider becomes the limiting factor. This guide walks through the production-grade setup.
The import-order trap
Before anything else: fast_litellm must be imported before litellm in every process that handles requests. The acceleration works by monkey-patching LiteLLM’s hot-path classes at import time. If LiteLLM has already been imported, the patches no-op silently and you get pure-Python performance with no error.
This sounds obvious until you remember that gunicorn forks workers. If you import fast_litellm inside the worker after litellm has already been imported in the master, every worker is unaccelerated.
The fix: import fast_litellm in the master process, before forking, with gunicorn’s --preload flag.
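A cheap guard catches the trap at startup instead of letting it degrade silently. The helper below is illustrative and assumes nothing about fast_litellm's API; it only inspects Python's module registry:

```python
import sys

def assert_not_yet_imported(module_name: str = "litellm") -> None:
    """Fail fast if module_name is already loaded.

    If litellm is already in sys.modules, fast_litellm's patches
    would silently no-op, so crashing loudly at startup is better.
    """
    if module_name in sys.modules:
        raise RuntimeError(
            f"{module_name} was imported before fast_litellm; "
            "acceleration would silently no-op"
        )

# Call at the very top of your entrypoint, before any import that
# might pull in litellm transitively:
assert_not_yet_imported("litellm")
```

Placed above the `import fast_litellm` line, this turns the silent no-op into an immediate, debuggable failure.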
The wrapper module pattern
Create a tiny app.py that imports things in the correct order:
```python
# app.py
import fast_litellm                          # 1. accelerate first
from litellm.proxy.proxy_server import app  # 2. then load the proxy
```
That’s the entire file. Then point gunicorn at it:
```shell
gunicorn app:app \
  --preload \
  -w 4 \
  -k uvicorn.workers.UvicornWorker \
  -b 0.0.0.0:4000
```
Two things matter here:
- `--preload` runs `app.py` once in the master process before forking. The patched LiteLLM is then inherited by every worker via copy-on-write. Without `--preload`, each worker runs `app.py` independently and you lose the global-state benefit.
- `UvicornWorker` is the right worker class for LiteLLM's async proxy. Don't use the default sync worker.
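The same flags can live in a gunicorn configuration file, which keeps the Dockerfile and unit file below shorter. A minimal equivalent:

```python
# gunicorn.conf.py -- equivalent to the CLI flags above
preload_app = True
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:4000"
```

Then launch with `gunicorn app:app -c gunicorn.conf.py`.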
Docker
The same pattern, just packaged:
```dockerfile
FROM python:3.12-slim

WORKDIR /app

RUN pip install --no-cache-dir fast-litellm gunicorn uvicorn

COPY app.py config.yaml ./

ENV LITELLM_CONFIG=/app/config.yaml

EXPOSE 4000

CMD ["gunicorn", "app:app", \
     "--preload", \
     "-w", "4", \
     "-k", "uvicorn.workers.UvicornWorker", \
     "-b", "0.0.0.0:4000"]
```
fast-litellm pulls litellm as a dependency, so you don’t need to install both explicitly.
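If you deploy with Docker Compose, the container above translates to a short service definition. A sketch (service name is arbitrary; the volume mount lets you swap the config without rebuilding the image):

```yaml
services:
  litellm-proxy:
    build: .
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    restart: unless-stopped
```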
systemd
For VM deployments, a typical unit file:
```ini
# /etc/systemd/system/litellm-proxy.service
[Unit]
Description=LiteLLM Proxy (Fast LiteLLM accelerated)
After=network.target

[Service]
Type=notify
User=litellm
WorkingDirectory=/srv/litellm
Environment=LITELLM_CONFIG=/srv/litellm/config.yaml
ExecStart=/srv/litellm/.venv/bin/gunicorn app:app \
    --preload \
    -w 4 \
    -k uvicorn.workers.UvicornWorker \
    -b 0.0.0.0:4000
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
Type=notify requires gunicorn to support sd_notify, which it does as of recent versions. If you see startup races, drop to Type=simple.
Worker count
The temptation is to set -w to the CPU count. For LiteLLM specifically, that’s usually too many. The proxy is heavily I/O-bound — each worker spends most of its time waiting on upstream providers. Two workers per CPU is a reasonable starting point, and you should benchmark from there.
With Fast LiteLLM enabled, the connection pool is shared within a worker but not across workers. Each worker has its own pool. This matters for capacity planning: if you run too many workers, you fragment the connection pool and lose some of the pooling benefit.
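The arithmetic is worth making explicit. The numbers below are placeholders (the per-worker pool size in particular is hypothetical; check your actual pool settings), but the shape of the tradeoff holds:

```python
import os

# Starting point from above: two workers per CPU for an I/O-bound proxy.
cpus = os.cpu_count() or 1
workers = 2 * cpus

# Hypothetical per-worker pool size -- substitute your real configuration.
pool_size_per_worker = 50

# Pools are per-worker, so upstream connections scale with worker count...
total_upstream_connections = workers * pool_size_per_worker

# ...while the reuse benefit of any single pool shrinks as traffic is
# spread across more, smaller pools.
requests_per_second = 400
requests_per_pool = requests_per_second / workers

print(f"{workers} workers -> {total_upstream_connections} upstream "
      f"connections, ~{requests_per_pool:.0f} req/s per pool")
```

Doubling the worker count doubles the upstream connection footprint while halving the traffic each pool sees, which is exactly the fragmentation the paragraph above describes.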
Verifying it’s working
After deploy, check the proxy is actually accelerated:
```shell
curl http://localhost:4000/health/readiness
```
In your logs at startup, look for:
```text
fast_litellm: active components = ['connection_pool', 'rate_limiter', 'token_counter']
```
If that line is missing, your import order is wrong somewhere. The most common culprit is an `__init__.py` somewhere in your package that imports `litellm` directly; that runs before `app.py` does.
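To track down the offending early import, CPython's `-X importtime` flag prints every import in the order it happens, so you can see what pulls in litellm before the wrapper module (here assumed to be `app.py`, as above) gets a chance:

```shell
python3 -X importtime -c "import app" 2>&1 | grep litellm | head -n 5
```

The first litellm line in the output tells you which parent module triggered the import.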
Configuration
Fast LiteLLM is zero-config by default. For staged rollouts:
```shell
# Disable a specific component
export FAST_LITELLM_RUST_ROUTING=false

# Canary: enable batch token counting on 10% of calls
export FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10
```
The canary mode is useful when you’re rolling out acceleration to a hot path and want gradual exposure.
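To make the canary semantics concrete, here is a sketch of how a value like `canary:10` could be interpreted. This is illustrative only; fast_litellm's actual parsing may differ:

```python
import random

def component_enabled(env_value: str) -> bool:
    """Interpret a FAST_LITELLM_* flag value.

    Illustrative sketch, not fast_litellm's real implementation.
    Supports "true"/"false" plus "canary:<pct>" for gradual rollout.
    """
    value = env_value.strip().lower()
    if value in ("false", "0", "off"):
        return False
    if value.startswith("canary:"):
        pct = float(value.split(":", 1)[1])
        # Take the accelerated path on roughly pct% of calls,
        # decided independently per call.
        return random.random() * 100.0 < pct
    return True  # default: enabled
```

Under this reading, `canary:10` sends roughly one in ten calls down the accelerated path while the rest use the pure-Python fallback.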
What this doesn’t change
- Provider integrations, cost tracking, callbacks, and the proxy’s HTTP API are all unchanged.
- Authentication, virtual keys, and team management work exactly as documented in the LiteLLM proxy docs.
- Prometheus metrics from LiteLLM are still emitted — Fast LiteLLM adds its own additional metrics on top.
What’s next
- Rate limiting at scale — the memory story for high-cardinality workloads.
- Benchmarks — what to expect from real workloads.