Why Fast LiteLLM exists

LiteLLM is excellent at what it does — a unified interface across 100+ LLM providers, with routing, rate limiting, cost tracking, and a proxy server. None of that needs to change. But a few hot-path components were never designed for the kind of concurrency a production proxy now sees.

The bottleneck profile we kept seeing

Across LiteLLM proxy deployments at scale, the same three things show up under load:

  1. Connection pool contention. Per-provider HTTP connection bookkeeping uses Python data structures protected by locks. Under concurrent load the locks become the bottleneck — long before the upstream providers do.
  2. Rate limiter memory. Per-key counters are kept in dicts. With thousands of unique API keys (each user, each team, each tenant), the dict overhead dominates resident memory and triggers GC pauses.
  3. Token counting latency. Pre-flight tokenization runs on every request to estimate cost and check context windows. For long documents this is non-trivial CPU work happening on the request path.
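Items 1 and 2 share a structural fix: shard the shared state so operations on unrelated keys rarely touch the same lock. A minimal std-only sketch of the idea (the `ShardedCounters` type and shard count are illustrative, not Fast LiteLLM's actual code; DashMap is a tuned, production-grade generalization of this pattern):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Sharded per-key counters: N independent locks instead of one global
// lock, so concurrent requests for different keys rarely contend.
struct ShardedCounters {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedCounters {
    fn new(n_shards: usize) -> Self {
        Self {
            shards: (0..n_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Hash the key to pick a shard; only that shard's lock is taken,
    // so traffic on other keys proceeds in parallel.
    fn incr(&self, key: &str) -> u64 {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let shard = &self.shards[(h.finish() as usize) % self.shards.len()];
        let mut map = shard.lock().unwrap();
        let count = map.entry(key.to_string()).or_insert(0);
        *count += 1;
        *count
    }
}
```

With a single global lock, every request serializes on the same mutex; with sharding, contention only happens when two requests hash to the same shard.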

Why Rust, specifically

All three problems have the same shape: many small operations, hot inner loops, and contention on shared state. That's the workload profile where the GIL hurts most, and where lock-free data structures and cheap atomics shine. Rust with DashMap and tiktoken-rs covers all three in a small native extension that compiles to a wheel.
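To make the "cheap atomics" point concrete, here is a std-only sketch (the `concurrent_count` function is illustrative, not part of Fast LiteLLM): many threads increment one shared counter with a single atomic instruction each, and no mutex is ever acquired.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Lock-free shared counter: each thread does `fetch_add`, which is a
// single atomic hardware operation, so there is no lock to contend on.
fn concurrent_count(n_threads: usize, incrs_per_thread: usize) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..incrs_per_thread {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

Every increment lands exactly once regardless of interleaving, which is the property a rate limiter's per-key counters need under concurrent load.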

The alternative would be rewriting LiteLLM in a different language. Nobody wants that. A targeted Rust extension via PyO3 lets us keep LiteLLM as the source of truth for everything else and only swap the bottlenecks.

What we explicitly didn't do

We didn't rewrite LiteLLM, fork it, or wrap its public interface. The extension replaces only the three hot paths above; LiteLLM remains the source of truth for routing, provider logic, and configuration.

Who benefits

Proxy deployments under real concurrency: many simultaneous requests, thousands of distinct API keys, and long prompts where pre-flight tokenization is measurable CPU time on the request path.

Who probably doesn't need it

Low-traffic deployments. If you serve a handful of keys at modest request rates, the Python hot paths are nowhere near their limits, and a native extension adds packaging complexity for no visible gain.

Want to talk through whether this fits your stack? Neul Labs offers free 30-minute scoping calls.