How Fast LiteLLM works
Fast LiteLLM is a drop-in Rust acceleration layer for LiteLLM. It replaces a few specific hot-path components with PyO3-compiled Rust, leaves everything else untouched, and falls back to Python automatically on any error.
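The drop-in contract can be sketched in a few lines: try the Rust path, and on any error serve the same call from the original Python implementation. This is a conceptual sketch, not fast_litellm's internal API; the function names and stand-ins below are illustrative.

```python
def with_fallback(rust_fn, python_fn):
    """Wrap a Rust-backed function so any exception falls back to the
    original Python implementation on the same call (illustrative name,
    not the package's actual helper)."""
    def wrapper(*args, **kwargs):
        try:
            return rust_fn(*args, **kwargs)
        except Exception:
            return python_fn(*args, **kwargs)
    return wrapper

# Stand-ins demonstrating the behaviour:
def rust_tokenize(text):
    raise RuntimeError("simulated failure in the Rust extension")

def py_tokenize(text):
    return text.split()

tokenize = with_fallback(rust_tokenize, py_tokenize)
```

The caller never sees the Rust-side failure; the wrapped function simply returns the Python result.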
The architecture in one diagram
┌─────────────────────────────────────────────────────────────┐
│ LiteLLM Python Package │
├─────────────────────────────────────────────────────────────┤
│ fast_litellm (Python Integration Layer) │
│ ├── Enhanced Monkeypatching │
│ ├── Feature Flags & Gradual Rollout │
│ ├── Performance Monitoring │
│ └── Automatic Fallback │
├─────────────────────────────────────────────────────────────┤
│ Rust Acceleration Components (PyO3) │
│ ├── core (Advanced Routing) │
│ ├── tokens (Token Counting) │
│ ├── connection_pool (Connection Management) │
│ └── rate_limiter (Rate Limiting) │
└─────────────────────────────────────────────────────────────┘

The four accelerated components
Connection pool — 3.2× faster
Replaces LiteLLM's per-provider HTTP connection bookkeeping with a DashMap-backed lock-free pool. The biggest single win — most production proxies bottleneck on connection management under concurrent load, not on the model call itself.
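The idea of a keyed pool is simple enough to sketch in Python. Note the inversion: this sketch needs a single lock around the map, and that lock is exactly the contention the Rust DashMap avoids. Class and parameter names here are illustrative, not the package's API.

```python
import threading

class KeyedPool:
    """Conceptual per-provider connection pool: one connection object
    per key, created lazily and reused. The real component is a
    lock-free DashMap in Rust; this Python version serializes access
    through one lock."""
    def __init__(self, factory):
        self._factory = factory          # builds a connection for a key
        self._pool = {}                  # provider key -> connection
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            conn = self._pool.get(key)
            if conn is None:
                conn = self._factory(key)   # first use of this provider
                self._pool[key] = conn
            return conn

pool = KeyedPool(factory=lambda key: {"provider": key})
a = pool.get("openai")
b = pool.get("openai")   # same object reused, no new connection
```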
Rate limiter — 1.6× faster, 42× less memory
Atomic-counter-based rate limiting in Rust. The memory story matters more than the speed story: when you have thousands of unique API keys, the Python implementation's per-key dict overhead dominates resident memory. Rust uses ~42× less.
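A fixed-window counter per key is one way to picture what an atomic-counter limiter does; the sketch below is illustrative (the actual window semantics and limits are not specified in this document). The per-key tuple here is the Python-side overhead the text describes: in Rust each key is roughly one atomic integer.

```python
import threading, time

class CounterRateLimiter:
    """Sketch of counter-based rate limiting per API key, fixed-window
    style. Window/limit semantics are assumptions for illustration."""
    def __init__(self, limit, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._counts = {}   # key -> (window_start, count)
        self._lock = threading.Lock()

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            start, count = self._counts.get(key, (now, 0))
            if now - start >= self.window_s:
                start, count = now, 0          # window expired: reset
            if count >= self.limit:
                self._counts[key] = (start, count)
                return False                   # over limit for this window
            self._counts[key] = (start, count + 1)
            return True

rl = CounterRateLimiter(limit=2, window_s=60)
```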
Token counter — 1.5–1.7× faster (large texts)
Backed by tiktoken-rs. Best for long documents and batch tokenization. For short messages, FFI overhead actually makes Python faster — we measure this honestly and disable Rust for small inputs automatically.
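The automatic small-input cutoff amounts to a size-based dispatch. A sketch, with an assumed threshold and stand-in counters (the real cutoff and function names are internal to the package):

```python
SMALL_INPUT_CHARS = 512  # assumed cutoff, not the library's real value

def count_tokens(text, rust_count, python_count):
    """Route short inputs to Python (FFI overhead dominates) and long
    inputs to the Rust tokenizer (where the 1.5-1.7x win applies)."""
    if len(text) < SMALL_INPUT_CHARS:
        return python_count(text)
    return rust_count(text)

# Stand-ins for the two implementations, tagged so we can see routing:
def fake_rust(text):
    return ("rust", len(text))

def fake_python(text):
    return ("python", len(text))
```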
Routing — conditional
Advanced router selection. Currently slower than Python for typical workloads because routing involves rich Python objects across the FFI boundary. Off by default; available behind a feature flag for routing-heavy benchmarks.
Production safety
- Automatic fallback: any exception in a Rust path falls back to the Python implementation on the same call, transparently.
- Circuit breaker: after 10 errors in a Rust component, that component is disabled for the remainder of the process.
- Feature flags: each accelerated component can be disabled or rolled out by percentage via env vars (FAST_LITELLM_RUST_ROUTING=false, FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10).
- Performance monitoring: real-time per-component metrics exposed for scraping.
- Type-safe: full Python type stubs ship with the package.
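The flag grammar above (`false`, `canary:10`) maps naturally to a rollout fraction. A sketch of how such values could be parsed; the grammar beyond the two documented examples, and the helper names, are assumptions:

```python
import os, random

def parse_flag(value):
    """Parse a feature-flag value into a rollout fraction:
    "false" -> 0.0, "true" -> 1.0, "canary:10" -> 0.10.
    Illustrative helper, not the package's internal parser."""
    value = value.strip().lower()
    if value in ("false", "0", "off"):
        return 0.0
    if value in ("true", "1", "on"):
        return 1.0
    if value.startswith("canary:"):
        return int(value.split(":", 1)[1]) / 100.0
    raise ValueError(f"unrecognized flag value: {value!r}")

def rust_enabled(name, default="true"):
    """Decide per-call whether the Rust path is taken for a component."""
    fraction = parse_flag(os.environ.get(name, default))
    return random.random() < fraction
```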
What it does not do
Fast LiteLLM does not replace LiteLLM. It does not change the public LiteLLM API, modify provider behaviour, alter cost calculation, or touch the proxy server. It only swaps the hot-path implementations of a few specific components. If LiteLLM doesn't support a model or provider, Fast LiteLLM doesn't either.
Use it with the LiteLLM proxy
When running the LiteLLM proxy under gunicorn, create a wrapper module:
import fast_litellm # apply acceleration before litellm loads
from litellm.proxy.proxy_server import app

Then run with --preload so workers inherit the patched components:

gunicorn app:app --preload -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:4000

See the proxy acceleration guide for systemd, Docker, and config-file walkthroughs.