Guide

Accelerating the LiteLLM proxy with Fast LiteLLM

A production-ready guide to running the LiteLLM proxy server with Fast LiteLLM under gunicorn, Docker, and systemd — including the import-order trap that catches most teams.

Dipankar Sarkar · · tutorial · proxy · gunicorn · uvicorn · deployment

The LiteLLM proxy is where Fast LiteLLM earns its keep. A single proxy instance fielding hundreds of concurrent requests bottlenecks on connection pooling and rate limiting long before any model provider becomes the limiting factor. This guide walks through the production-grade setup.

The import-order trap

Before anything else: fast_litellm must be imported before litellm in every process that handles requests. The acceleration works by monkey-patching LiteLLM’s hot-path classes at import time. If LiteLLM has already been imported, the patches no-op silently and you get pure-Python performance with no error.
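Because the failure is silent, it can help to fail loudly instead. The guard below is a hypothetical helper (not part of either library) that uses only the standard library's `sys.modules` to detect whether `litellm` slipped in first:

```python
import sys

def assert_import_order(target: str = "litellm") -> None:
    """Fail loudly if `target` is already imported: fast_litellm's
    monkey-patches would otherwise no-op silently.
    Hypothetical guard, not part of either library."""
    if target in sys.modules:
        raise RuntimeError(
            f"{target} was imported before fast_litellm; "
            "acceleration is silently disabled"
        )

# Put this at the very top of your entrypoint,
# before `import fast_litellm`.
assert_import_order()
```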

This sounds obvious until you remember that gunicorn forks workers. If you import fast_litellm inside the worker after litellm has already been imported in the master, every worker is unaccelerated.

The fix: import fast_litellm in the master process, before forking, with gunicorn’s --preload flag.

The wrapper module pattern

Create a tiny app.py that imports things in the correct order:

# app.py
import fast_litellm                          # 1. accelerate first
from litellm.proxy.proxy_server import app   # 2. then load the proxy

That’s the entire file. Then point gunicorn at it:

gunicorn app:app \
    --preload \
    -w 4 \
    -k uvicorn.workers.UvicornWorker \
    -b 0.0.0.0:4000

Two things matter here:

  • --preload runs app.py once in the master process before forking. The patched LiteLLM is then inherited by every worker via copy-on-write. Without --preload, each worker imports app.py independently; the patches still apply per worker, but you lose the shared copy-on-write memory.
  • UvicornWorker is the right worker class for LiteLLM’s async proxy. Don’t use the default sync worker.
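If you prefer keeping these settings in version control rather than a shell command, the same flags map onto a standard gunicorn config file (`preload_app`, `workers`, `worker_class`, and `bind` are all stock gunicorn settings):

```python
# gunicorn.conf.py — file-based equivalent of the CLI flags above
preload_app = True                              # import app.py in the master, pre-fork
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"  # async worker for the ASGI proxy
bind = "0.0.0.0:4000"
```

Then start with `gunicorn app:app -c gunicorn.conf.py`.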

Docker

The same pattern, just packaged:

FROM python:3.12-slim

WORKDIR /app
RUN pip install --no-cache-dir fast-litellm gunicorn uvicorn
COPY app.py config.yaml ./

ENV LITELLM_CONFIG=/app/config.yaml
EXPOSE 4000

CMD ["gunicorn", "app:app", \
     "--preload", \
     "-w", "4", \
     "-k", "uvicorn.workers.UvicornWorker", \
     "-b", "0.0.0.0:4000"]

fast-litellm pulls litellm as a dependency, so you don’t need to install both explicitly.

systemd

For VM deployments, a typical unit file:

# /etc/systemd/system/litellm-proxy.service
[Unit]
Description=LiteLLM Proxy (Fast LiteLLM accelerated)
After=network.target

[Service]
Type=notify
User=litellm
WorkingDirectory=/srv/litellm
Environment=LITELLM_CONFIG=/srv/litellm/config.yaml
ExecStart=/srv/litellm/.venv/bin/gunicorn app:app \
    --preload \
    -w 4 \
    -k uvicorn.workers.UvicornWorker \
    -b 0.0.0.0:4000
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Type=notify requires gunicorn to support sd_notify, which it does as of recent versions. If you see startup races, drop to Type=simple.

Worker count

The temptation is to set -w to the CPU count. For LiteLLM specifically, that’s usually too many. The proxy is heavily I/O-bound — each worker spends most of its time waiting on upstream providers. Two workers per CPU is a reasonable starting point, and you should benchmark from there.

With Fast LiteLLM enabled, the connection pool is shared within a worker but not across workers. Each worker has its own pool. This matters for capacity planning: if you run too many workers, you fragment the connection pool and lose some of the pooling benefit.
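The back-of-envelope arithmetic can be sketched as follows (hypothetical helper names; the 2×-CPU heuristic is the starting point above, not a guarantee):

```python
import os

def suggested_workers(per_cpu: int = 2) -> int:
    """Starting point for an I/O-bound proxy: two workers per CPU.
    Benchmark before settling on a number."""
    return per_cpu * (os.cpu_count() or 1)

def per_worker_connections(upstream_limit: int, workers: int) -> int:
    """Each worker owns a private pool, so an upstream connection
    budget is split across workers rather than shared."""
    return upstream_limit // workers

# e.g. a 100-connection upstream budget split across 4 workers
# leaves each worker a pool of 25; across 8 workers, only 12 —
# more workers means a more fragmented pool.
```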

Verifying it’s working

After deploy, check the proxy is actually accelerated:

curl http://localhost:4000/health/readiness

In your logs at startup, look for:

fast_litellm: active components = ['connection_pool', 'rate_limiter', 'token_counter']

If that line is missing, your import order is wrong somewhere. The most common culprit is an __init__.py somewhere in your package that imports litellm directly — that runs before app.py does.
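One quick way to hunt for that culprit is to scan your package for files that import litellm directly. This is a rough sketch — it only catches literal `import litellm` lines, not indirect imports:

```python
from pathlib import Path

def find_direct_litellm_imports(root: str = ".") -> list[str]:
    """List .py files under `root` containing a literal
    'import litellm'. Any of these that runs before app.py
    can defeat the patching."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(errors="ignore")
        # Files that import fast_litellm first are assumed safe here.
        if "import litellm" in text and "import fast_litellm" not in text:
            hits.append(str(path))
    return hits
```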

Configuration

Fast LiteLLM is zero-config by default. For staged rollouts:

# Disable a specific component
export FAST_LITELLM_RUST_ROUTING=false

# Canary: enable batch token counting on 10% of calls
export FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10

The canary mode is useful when you’re rolling out acceleration to a hot path and want gradual exposure.
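To make the flag syntax concrete, here is a sketch of how `false` and `canary:N` values could be interpreted — a hypothetical re-implementation for illustration; the real parser lives inside fast_litellm:

```python
import os
import random

def component_enabled(name: str, default: bool = True) -> bool:
    """Interpret a FAST_LITELLM_<NAME> flag: 'false' disables the
    component, 'canary:N' enables it on roughly N% of calls.
    Sketch only, not fast_litellm's actual parser."""
    raw = os.environ.get(f"FAST_LITELLM_{name}", "").strip().lower()
    if not raw:
        return default          # unset: zero-config default applies
    if raw == "false":
        return False            # hard off
    if raw.startswith("canary:"):
        percent = int(raw.split(":", 1)[1])
        return random.random() * 100 < percent
    return True                 # anything else counts as enabled
```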

What this doesn’t change

  • Provider integrations, cost tracking, callbacks, and the proxy’s HTTP API are all unchanged.
  • Authentication, virtual keys, and team management work exactly as documented in the LiteLLM proxy docs.
  • Prometheus metrics from LiteLLM are still emitted — Fast LiteLLM adds its own additional metrics on top.

What’s next