Why Fast LiteLLM exists

LiteLLM is excellent at what it does — a unified interface across 100+ LLM providers, with routing, rate limiting, cost tracking, and a proxy server. None of that needs to change. But a few hot-path components were never designed for the kind of concurrency a production proxy now sees.

The bottleneck profile we kept seeing

Across LiteLLM proxy deployments at scale, the same three things show up under load:

  1. Connection pool contention. Per-provider HTTP connection bookkeeping uses Python data structures protected by locks. Under concurrent load the locks become the bottleneck — long before the upstream providers do.
  2. Rate limiter memory. Per-key counters are kept in dicts. With thousands of unique API keys (each user, each team, each tenant), the dict overhead dominates resident memory and triggers GC pauses.
  3. Token counting latency. Pre-flight tokenization runs on every request to estimate cost and check context windows. For long documents this is non-trivial CPU work happening on the request path.
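Items 1 and 2 share a structural fix: shard the shared state so operations on unrelated keys rarely touch the same lock. A minimal std-only sketch of the idea (the `ShardedCounters` type and shard count are illustrative, not Fast LiteLLM's actual code; DashMap is a tuned, production-grade generalization of this pattern):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Sharded per-key counters: N independent locks instead of one global
// lock, so concurrent requests for different keys rarely contend.
struct ShardedCounters {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedCounters {
    fn new(n_shards: usize) -> Self {
        Self {
            shards: (0..n_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Hash the key to pick a shard; only that shard's lock is taken,
    // so traffic on other keys proceeds in parallel.
    fn incr(&self, key: &str) -> u64 {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let shard = &self.shards[(h.finish() as usize) % self.shards.len()];
        let mut map = shard.lock().unwrap();
        let count = map.entry(key.to_string()).or_insert(0);
        *count += 1;
        *count
    }
}
```

With a single global lock, every request serializes on the same mutex; with sharding, contention only happens when two requests hash to the same shard.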

Why Rust, specifically

All three problems have the same shape: many small operations, hot inner loops, and contention on shared state. That's the workload profile where the GIL hurts most, and where lock-free data structures and cheap atomics shine. Rust with DashMap and tiktoken-rs covers all three in a small native extension that compiles to a wheel.
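To make the "cheap atomics" point concrete, here is a std-only sketch (the `concurrent_count` function is illustrative, not part of Fast LiteLLM): many threads increment one shared counter with a single atomic instruction each, and no mutex is ever acquired.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Lock-free shared counter: each thread does `fetch_add`, which is a
// single atomic hardware operation, so there is no lock to contend on.
fn concurrent_count(n_threads: usize, incrs_per_thread: usize) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..incrs_per_thread {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

Every increment lands exactly once regardless of interleaving, which is the property a rate limiter's per-key counters need under concurrent load.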

The alternative would be rewriting LiteLLM in a different language. Nobody wants that. A targeted Rust extension via PyO3 lets us keep LiteLLM as the source of truth for everything else and only swap the bottlenecks.

What we explicitly didn't do

We didn't rewrite LiteLLM, fork it, or wrap its public interface. The extension replaces only the three hot paths above; LiteLLM remains the source of truth for routing, provider logic, and configuration.

Who benefits

Proxy deployments under real concurrency: many simultaneous requests, thousands of distinct API keys, and long prompts where pre-flight tokenization is measurable CPU time on the request path.

Who probably doesn't need it

Low-traffic deployments. If you serve a handful of keys at modest request rates, the Python hot paths are nowhere near their limits, and a native extension adds packaging complexity for no visible gain.

Want to talk through whether this fits your stack? Neul Labs offers free 30-minute scoping calls.