
Why LiteLLM needs Rust (in three specific places, not everywhere)

A measured argument for hybrid Python+Rust in LiteLLM's hot path — and the places where Python is still the right answer.

Dipankar Sarkar · · opinion · rust · python · performance · ffi · gil

The “rewrite it in Rust” reflex has hardened into a meme, and it deserves the cynicism it gets. Most rewrites move bottlenecks rather than removing them, and most teams can’t justify the maintenance cost. Fast LiteLLM exists because there’s a smaller, more boring claim to make: Rust is the right answer for three specific components in LiteLLM’s hot path, and the wrong answer for almost everything else.

This post is about how we picked the three.

Rule one: only accelerate the things that profile shows are slow

LiteLLM is a wide library. Provider integrations, cost tracking, callbacks, the router, the proxy server — each one is a candidate for “what if we made this faster.” Almost all of them are wrong answers.

The starting point was profiling a real LiteLLM proxy under load. Not a microbenchmark, not a synthetic test. A production-shaped workload running through an actual LiteLLM proxy with py-spy attached.

What showed up on top, consistently:

  1. Connection pool bookkeeping. The proxy holds HTTP connections to a dozen upstream providers. Each request acquires a connection, dispatches a call, and releases. The acquire/release path involves a Python dict and a lock, and under concurrent load that lock becomes the second-largest contributor to wall time, behind only the actual model call.
  2. Rate limit accounting. Per-key counters incremented on every request, again behind a lock, again contended.
  3. Tokenization on the request path. LiteLLM tokenizes incoming prompts to estimate cost and check context windows. For long documents this is a non-trivial CPU cost happening synchronously before the model call.

Everything else was background noise relative to these three. So we built acceleration for these three and left everything else alone.

Rule two: Rust is for contention, not for raw compute

A lot of the Rust-vs-Python folklore focuses on raw compute. “Rust is X times faster at numerical loops.” That’s true, and almost never the reason to reach for it in a Python project that already uses NumPy or tiktoken.

The real reason to use Rust here is that the GIL turns CPython’s contention into a serial bottleneck. Two threads incrementing a Python counter cannot truly do so in parallel — one has to wait. Two threads incrementing an AtomicU64 in Rust can. For a rate limiter that gets hit on every request, that’s the entire game.

The same logic applies to the connection pool. The pool’s internal data structure is a dict-like map of provider → idle connections. Concurrent acquire/release operations from multiple threads serialize on dict mutation. Replace it with DashMap and the contention goes away.

This is the load-bearing argument for Fast LiteLLM. Rust isn’t faster because Rust is faster. Rust is faster because the contention model is fundamentally different.

Rule three: FFI is not free

Crossing the Python ↔ Rust boundary costs roughly 1–5 microseconds per call, depending on argument complexity. For a hot inner loop with millions of crossings per second, that overhead dominates whatever you saved by going to native code.

Two cases where Fast LiteLLM specifically refuses to use Rust:

  • Small text tokenization. Tokenizing a 10-token user message in Rust takes less time than the FFI crossing. We measured this. Python wins.
  • Routing. LiteLLM’s router takes rich Python objects (model configs, callback functions, fallback chains) as arguments. Marshalling those across FFI is expensive. Even when the routing logic itself is faster in Rust, the marshalling cost erases the win.

We disabled Rust acceleration in both cases. The Routing row in the benchmarks table shows 0.4× — Python is more than twice as fast — and we ship that result on the front page rather than hiding it. A serious acceleration layer has to be willing to say “Python wins here.”

The PyO3 story

PyO3 is what makes any of this approachable. It’s a Rust crate that lets you write Rust functions with Python-compatible signatures and have them appear as ordinary methods on a Python class. Maturin builds them into wheels. There’s no FFI plumbing for the user to write.

The downsides are real:

  • The Rust code has to be aware of the GIL boundary. Operations that need to release the GIL (long-running pure-Rust work) require explicit Python::allow_threads blocks. Forget those and you serialize everything anyway.
  • Error handling crosses two error systems. Rust Result types have to be converted to Python exceptions and back.
  • Build complexity exists, even if maturin hides it. Cross-compiling for every target requires CI infrastructure.
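For concreteness, here is what a PyO3-exposed function looks like, including the `allow_threads` block from the first bullet. This is a hypothetical sketch in PyO3 0.21+ style (exact signatures vary by version), with illustrative names — not Fast LiteLLM's actual API:

```rust
use pyo3::prelude::*;

/// Hypothetical function: a pure-Rust token estimate exposed to Python.
/// The allow_threads closure releases the GIL while the Rust-only work
/// runs, so other Python threads can make progress in the meantime.
#[pyfunction]
fn approx_token_count(py: Python<'_>, text: String) -> PyResult<usize> {
    py.allow_threads(move || {
        // Stand-in for real tokenization: whitespace splitting.
        Ok(text.split_whitespace().count())
    })
}

/// Module definition; maturin builds this into an importable wheel.
#[pymodule]
fn fast_litellm_sketch(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(approx_token_count, m)?)
}
```

From Python this would be an ordinary call — `fast_litellm_sketch.approx_token_count("some prompt")` — which is the whole appeal: the FFI plumbing lives entirely on the Rust side.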

For Fast LiteLLM, the upside outweighs the downsides because the surface area is small. We’re not exposing all of Rust to Python — we’re exposing four functions across four modules. Maintaining that is tractable. Maintaining a full Rust port of LiteLLM would not be.

The boring conclusion

If you’re considering a Rust acceleration layer for any Python project, the questions to answer first are:

  1. Where is the actual bottleneck? Profile under production-shaped load. Don’t guess.
  2. Is the bottleneck contention-shaped? If yes, lock-free Rust data structures will help. If it’s raw compute, NumPy or tiktoken or polars probably already help more, with less integration cost.
  3. Will FFI overhead eat the savings? Measure the per-call cost of crossing the boundary against the cost of the work you’re moving across.
  4. Can you accept the build and maintenance complexity? Cross-compilation, wheels, two error systems, two memory models.

For LiteLLM specifically, the answers were yes-yes-no-yes for three components — and a disqualifying answer somewhere in the chain for everything else — so we built the three. That's the whole story.