# Benchmarks
Production-grade Python baselines (thread-safe, lock-protected) compared head-to-head with Fast LiteLLM's Rust implementations. We publish the wins and the losses.
## Methodology
- Each benchmark runs 200 iterations after a warm-up phase.
- Python baselines use the production code path, including thread-safety primitives — not stripped-down toy versions.
- Memory is measured as steady-state RSS after running through 1,000+ unique keys (the high-cardinality test).
- Reproducible: run `python scripts/run_benchmarks.py --iterations 200` in the source repo.
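The timing loop described above can be sketched as follows. This is an illustrative harness, not the actual `scripts/run_benchmarks.py` code; the `bench` function and its parameters are hypothetical.

```python
import time

def bench(fn, *args, warmup=20, iterations=200):
    """Time fn over `iterations` runs after a warm-up phase.

    The warm-up runs absorb one-time costs (lazy imports, caches,
    allocator growth) so the measured loop reflects steady state.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iterations):
        fn(*args)
    # Mean wall-clock seconds per call.
    return (time.perf_counter() - start) / iterations
```

Memory numbers are taken separately: run the workload through 1,000+ unique keys, then read the process's resident set size (e.g. via `psutil` or `/proc/self/status`) once it has settled.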
## Results
| Component | Speedup | Memory | Best for |
|---|---|---|---|
| Connection Pool | 3.2× faster | Same | HTTP connection management |
| Rate Limiting | 1.6× faster | Same | Throttling, quota management |
| Large Text Tokenization | 1.5–1.7× faster | Same | Long documents |
| High-Cardinality Rate Limits | 1.2× faster | 42× less memory | Many unique API keys/users |
| Concurrent Connection Pool | 1.2× faster | Same | Multi-threaded workloads |
| Small Text Tokenization | 0.5× (Python faster) | Same | Short messages — FFI overhead dominates |
| Routing | 0.4× (Python faster) | Same | Model selection — FFI overhead dominates |
## Use Rust acceleration for
- Connection pooling — 3×+ speedup, the single biggest win.
- Rate limiting — 1.5×+ speedup.
- Large text token counting — 1.5×+ speedup.
- High-cardinality workloads (1000+ unique keys) — 40×+ memory savings.
## Python may be faster for
- Small-text token counting. The cost of crossing the Python ↔ Rust boundary dominates the actual tokenization work for short messages. Fast LiteLLM detects this and prefers Python automatically.
- Routing with complex Python objects. Marshalling rich Python objects across FFI is expensive. Routing acceleration is off by default; enable it only after benchmarking your specific workload.
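The automatic fallback for short inputs can be pictured as simple size-based dispatch. This is a hypothetical sketch, not Fast LiteLLM's actual selection logic; the threshold, function names, and whitespace "tokenizer" stand-ins are all illustrative.

```python
# Assumed threshold: below it, the FFI round-trip costs more than it saves.
SMALL_TEXT_THRESHOLD = 512  # characters; tune per workload via benchmarking

def count_tokens(text: str) -> int:
    if len(text) < SMALL_TEXT_THRESHOLD:
        return _count_tokens_python(text)  # short input: skip FFI overhead
    return _count_tokens_rust(text)        # long input: Rust path wins

def _count_tokens_python(text: str) -> int:
    # Stand-in: whitespace split (real tokenizers are subword-based).
    return len(text.split())

def _count_tokens_rust(text: str) -> int:
    # Placeholder for the Rust-backed call; same stand-in logic here.
    return len(text.split())
```

The same reasoning applies to routing: because the inputs are rich Python objects rather than plain strings, marshalling cost is even harder to amortize, which is why that path stays opt-in.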
## Why we publish the losses
"Rust is always faster" benchmarks are marketing, not engineering. FFI overhead is real. The interesting question for any acceleration layer is which specific workloads benefit and which don't. Showing only the wins would tell you nothing useful for capacity planning. The honest table above is what you'd get from running the benchmarks yourself.
## Reproducing locally
```bash
git clone https://github.com/neul-labs/fast-litellm
cd fast-litellm
uv venv && source .venv/bin/activate
uv add --dev maturin
uv run maturin develop
python scripts/run_benchmarks.py --iterations 200
```