Why Fast LiteLLM exists
LiteLLM is excellent at what it does — a unified interface across 100+ LLM providers, with routing, rate limiting, cost tracking, and a proxy server. None of that needs to change. But a few hot-path components were never designed for the kind of concurrency a production proxy now sees.
The bottleneck profile we kept seeing
Across LiteLLM proxy deployments at scale, the same three things show up under load:
- Connection pool contention. Per-provider HTTP connection bookkeeping uses Python data structures protected by locks. Under concurrent load the locks become the bottleneck — long before the upstream providers do.
- Rate limiter memory. Per-key counters are kept in dicts. With thousands of unique API keys (each user, each team, each tenant), the dict overhead dominates resident memory and triggers GC pauses.
- Token counting latency. Pre-flight tokenization runs on every request to estimate cost and check context windows. For long documents this is non-trivial CPU work happening on the request path.
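The first two bullets share the same underlying shape: a single dict guarded by a single lock, touched on every request. A minimal sketch of that pattern (names are hypothetical, not LiteLLM's actual internals):

```python
import threading

class NaiveRateLimiter:
    """Illustrative only: one dict, one lock. Every request serializes
    here, and per-key dict entries grow with key cardinality."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}  # api_key -> request count

    def hit(self, api_key: str) -> int:
        # All threads contend on this one lock regardless of which key
        # they touch -- the contention profile described above.
        with self._lock:
            self._counts[api_key] = self._counts.get(api_key, 0) + 1
            return self._counts[api_key]

limiter = NaiveRateLimiter()
print(limiter.hit("key-1"))  # 1
print(limiter.hit("key-1"))  # 2
```

Under a handful of threads this is fine; under hundreds of concurrent requests, the single lock is the serialization point long before any provider is.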
Why Rust, specifically
All three problems have the same shape: lots of small operations, hot inner loops, contention on shared state. That's the workload profile where the GIL hurts most and where a language with cheap atomics and lock-free data structures shines. Rust + DashMap + tiktoken-rs handles all three with a small native extension that compiles to a wheel.
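DashMap's core idea is sharding: keys hash to independent shards, each with its own lock, so threads touching different keys rarely contend. A rough Python analogue of that idea (illustrative only, not the actual Rust code):

```python
import threading

class ShardedCounter:
    """Rough analogue of DashMap-style sharding: N shards, N locks,
    so threads working on different keys rarely block each other."""

    def __init__(self, shards: int = 16):
        self._shards = [({}, threading.Lock()) for _ in range(shards)]

    def incr(self, key: str) -> int:
        counts, lock = self._shards[hash(key) % len(self._shards)]
        with lock:  # only keys that land in the same shard contend
            counts[key] = counts.get(key, 0) + 1
            return counts[key]

c = ShardedCounter()
print(c.incr("team-a"))  # 1
print(c.incr("team-b"))  # 1
```

In Rust the same structure gets cheaper still: no GIL, no per-object reference counting, and shard access compiles down to a hash plus a short critical section.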
The alternative would be rewriting LiteLLM in a different language. Nobody wants that. A targeted Rust extension via PyO3 lets us keep LiteLLM as the source of truth for everything else and only swap the bottlenecks.
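"Swap only the bottlenecks" typically looks like an optional import: try the native extension, fall back to the pure-Python path if it isn't installed. A sketch of the pattern (the module name and fallback body are hypothetical):

```python
# Hypothetical module name, shown only to illustrate the pattern of
# keeping the Python implementation as the source of truth.
try:
    from fast_litellm import count_tokens  # Rust extension, if installed
except ImportError:
    def count_tokens(text: str) -> int:
        # Pure-Python fallback; a crude whitespace split stands in
        # for the real tokenizer here.
        return len(text.split())

print(count_tokens("hello world"))  # 2 with the fallback
```

Because the fallback is always present, the extension can be adopted, removed, or partially disabled without touching any caller.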
What we explicitly didn't do
- We did not rewrite cost tracking, provider integrations, the proxy server, the router, or the callback system.
- We did not change LiteLLM's public API.
- We did not add a Redis or external service requirement. Acceleration is in-process.
- We did not assume Rust is faster everywhere. Routing and small-text tokenization are slower under acceleration — we measured, documented, and disabled them by default.
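Disabling the slower paths by default implies per-feature flags rather than one global switch. A minimal sketch of how that might be wired (flag names are hypothetical, not the project's actual configuration):

```python
import os

# Hypothetical flag names: the point is per-feature opt-in, so
# components that measured slower can ship defaulted to off.
DEFAULTS = {
    "FAST_CONNECTION_POOL": True,
    "FAST_RATE_LIMITER": True,
    "FAST_ROUTING": False,  # measured slower under acceleration
}

def enabled(feature: str) -> bool:
    raw = os.environ.get(feature)
    if raw is None:
        return DEFAULTS.get(feature, False)
    return raw.strip().lower() in ("1", "true", "yes", "on")

print(enabled("FAST_RATE_LIMITER"))  # True unless overridden
```

Keeping the flags environment-driven means operators can override a default in production without a code change, and a regression in one accelerated component can be rolled back independently of the others.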
Who benefits
- LiteLLM proxy operators running many concurrent requests through a single instance.
- Multi-tenant platforms with high-cardinality rate limiting (per-user, per-team, per-API-key).
- RAG pipelines tokenizing long documents on the request path.
- Teams that can't tolerate Rust as a build dependency — prebuilt wheels mean no Rust toolchain required.
Who probably doesn't need it
- Single-user scripts and notebooks. The bottleneck there is the model call, not the framework.
- Low-concurrency serverless functions where each invocation is a fresh process.
- Workloads dominated by short messages and simple routing — see the benchmarks page for why.
Want to talk through whether this fits your stack? Neul Labs offers free 30-minute scoping calls.