Accelerating the LiteLLM proxy with Fast LiteLLM
A production-ready guide to running the LiteLLM proxy server with Fast LiteLLM under gunicorn, Docker, and systemd — including the import-order trap that catches most teams.
The LiteLLM proxy is where Fast LiteLLM earns its keep. A single proxy instance fielding hundreds of concurrent requests bottlenecks on connection pooling and rate limiting long before any model provider becomes the limiting factor. This guide walks through the production-grade setup.
The import-order trap
Before anything else: fast_litellm must be imported before litellm in every process that handles requests. The acceleration works by monkey-patching LiteLLM’s hot-path classes at import time. If LiteLLM has already been imported, the patches no-op silently and you get pure-Python performance with no error.
This sounds obvious until you remember that gunicorn forks workers. If you import fast_litellm inside the worker after litellm has already been imported in the master, every worker is unaccelerated.
The fix: import fast_litellm in the master process, before forking, with gunicorn’s --preload flag.
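A cheap guard catches the trap at startup instead of letting it degrade silently. The helper below is illustrative and assumes nothing about fast_litellm's API; it only inspects Python's module registry:

```python
import sys

def assert_not_yet_imported(module_name: str = "litellm") -> None:
    """Fail fast if module_name is already loaded.

    If litellm is already in sys.modules, fast_litellm's patches
    would silently no-op, so crashing loudly at startup is better.
    """
    if module_name in sys.modules:
        raise RuntimeError(
            f"{module_name} was imported before fast_litellm; "
            "acceleration would silently no-op"
        )

# Call at the very top of your entrypoint, before any import that
# might pull in litellm transitively:
assert_not_yet_imported("litellm")
```

Placed above the `import fast_litellm` line, this turns the silent no-op into an immediate, debuggable failure.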
The wrapper module pattern
Create a tiny app.py that imports things in the correct order:
```python
# app.py
import fast_litellm                          # 1. accelerate first
from litellm.proxy.proxy_server import app  # 2. then load the proxy
```
That’s the entire file. Then point gunicorn at it:
```shell
gunicorn app:app \
  --preload \
  -w 4 \
  -k uvicorn.workers.UvicornWorker \
  -b 0.0.0.0:4000
```
Two things matter here:
- `--preload` runs `app.py` once in the master process before forking. The patched LiteLLM is then inherited by every worker via copy-on-write. Without `--preload`, each worker runs `app.py` independently and you lose the global-state benefit.
- `UvicornWorker` is the right worker class for LiteLLM's async proxy. Don't use the default sync worker.
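The same flags can live in a gunicorn configuration file, which keeps the Dockerfile and unit file below shorter. A minimal equivalent:

```python
# gunicorn.conf.py -- equivalent to the CLI flags above
preload_app = True
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:4000"
```

Then launch with `gunicorn app:app -c gunicorn.conf.py`.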
Docker
The same pattern, just packaged:
```dockerfile
FROM python:3.12-slim

WORKDIR /app

RUN pip install --no-cache-dir fast-litellm gunicorn uvicorn

COPY app.py config.yaml ./

ENV LITELLM_CONFIG=/app/config.yaml

EXPOSE 4000

CMD ["gunicorn", "app:app", \
     "--preload", \
     "-w", "4", \
     "-k", "uvicorn.workers.UvicornWorker", \
     "-b", "0.0.0.0:4000"]
```
fast-litellm pulls litellm as a dependency, so you don’t need to install both explicitly.
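If you deploy with Docker Compose, the container above translates to a short service definition. A sketch (service name is arbitrary; the volume mount lets you swap the config without rebuilding the image):

```yaml
services:
  litellm-proxy:
    build: .
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    restart: unless-stopped
```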
systemd
For VM deployments, a typical unit file:
```ini
# /etc/systemd/system/litellm-proxy.service
[Unit]
Description=LiteLLM Proxy (Fast LiteLLM accelerated)
After=network.target

[Service]
Type=notify
User=litellm
WorkingDirectory=/srv/litellm
Environment=LITELLM_CONFIG=/srv/litellm/config.yaml
ExecStart=/srv/litellm/.venv/bin/gunicorn app:app \
    --preload \
    -w 4 \
    -k uvicorn.workers.UvicornWorker \
    -b 0.0.0.0:4000
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
Type=notify requires gunicorn to support sd_notify, which it does as of recent versions. If you see startup races, drop to Type=simple.
Worker count
The temptation is to set -w to the CPU count. For LiteLLM specifically, that’s usually too many. The proxy is heavily I/O-bound — each worker spends most of its time waiting on upstream providers. Two workers per CPU is a reasonable starting point, and you should benchmark from there.
With Fast LiteLLM enabled, the connection pool is shared within a worker but not across workers. Each worker has its own pool. This matters for capacity planning: if you run too many workers, you fragment the connection pool and lose some of the pooling benefit.
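The arithmetic is worth making explicit. The numbers below are placeholders (the per-worker pool size in particular is hypothetical; check your actual pool settings), but the shape of the tradeoff holds:

```python
import os

# Starting point from above: two workers per CPU for an I/O-bound proxy.
cpus = os.cpu_count() or 1
workers = 2 * cpus

# Hypothetical per-worker pool size -- substitute your real configuration.
pool_size_per_worker = 50

# Pools are per-worker, so upstream connections scale with worker count...
total_upstream_connections = workers * pool_size_per_worker

# ...while the reuse benefit of any single pool shrinks as traffic is
# spread across more, smaller pools.
requests_per_second = 400
requests_per_pool = requests_per_second / workers

print(f"{workers} workers -> {total_upstream_connections} upstream "
      f"connections, ~{requests_per_pool:.0f} req/s per pool")
```

Doubling the worker count doubles the upstream connection footprint while halving the traffic each pool sees, which is exactly the fragmentation the paragraph above describes.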
Verifying it’s working
After deploy, check the proxy is actually accelerated:
```shell
curl http://localhost:4000/health/readiness
```
In your logs at startup, look for:
```text
fast_litellm: active components = ['connection_pool', 'rate_limiter', 'token_counter']
```
If that line is missing, your import order is wrong somewhere. The most common culprit is an `__init__.py` somewhere in your package that imports `litellm` directly; that runs before `app.py` does.
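To track down the offending early import, CPython's `-X importtime` flag prints every import in the order it happens, so you can see what pulls in litellm before the wrapper module (here assumed to be `app.py`, as above) gets a chance:

```shell
python3 -X importtime -c "import app" 2>&1 | grep litellm | head -n 5
```

The first litellm line in the output tells you which parent module triggered the import.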
Configuration
Fast LiteLLM is zero-config by default. For staged rollouts:
```shell
# Disable a specific component
export FAST_LITELLM_RUST_ROUTING=false

# Canary: enable batch token counting on 10% of calls
export FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10
```
The canary mode is useful when you’re rolling out acceleration to a hot path and want gradual exposure.
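To make the canary semantics concrete, here is a sketch of how a value like `canary:10` could be interpreted. This is illustrative only; fast_litellm's actual parsing may differ:

```python
import random

def component_enabled(env_value: str) -> bool:
    """Interpret a FAST_LITELLM_* flag value.

    Illustrative sketch, not fast_litellm's real implementation.
    Supports "true"/"false" plus "canary:<pct>" for gradual rollout.
    """
    value = env_value.strip().lower()
    if value in ("false", "0", "off"):
        return False
    if value.startswith("canary:"):
        pct = float(value.split(":", 1)[1])
        # Take the accelerated path on roughly pct% of calls,
        # decided independently per call.
        return random.random() * 100.0 < pct
    return True  # default: enabled
```

Under this reading, `canary:10` sends roughly one in ten calls down the accelerated path while the rest use the pure-Python fallback.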
What this doesn’t change
- Provider integrations, cost tracking, callbacks, and the proxy’s HTTP API are all unchanged.
- Authentication, virtual keys, and team management work exactly as documented in the LiteLLM proxy docs.
- Prometheus metrics from LiteLLM are still emitted — Fast LiteLLM adds its own additional metrics on top.
What’s next
- Rate limiting at scale — the memory story for high-cardinality workloads.
- Benchmarks — what to expect from real workloads.