Rate limiting LiteLLM at high cardinality
Why per-user rate limiting in pure Python eats memory at scale, and how Fast LiteLLM gets to 42× less RSS without changing your config.
Most performance writeups about rate limiters focus on throughput. For LiteLLM at scale, memory is the more interesting story.
The shape of the problem
LiteLLM’s rate limiter tracks request and token counts per identity. The identity is whatever you key on: an API key, a user ID, a team, a tenant. For a single-tenant setup with a handful of internal users, the data structures are tiny — a few dozen counters. Nobody thinks about it.
For a multi-tenant proxy with thousands of API keys, every active key holds a slot in the per-key counter dict, and every slot drags Python object overhead with it. The slot itself is small. The Python wrapping is not.
A back-of-the-envelope: a Python dict entry for a 32-character API key plus its counter object costs ~250–400 bytes once you account for hash table load factor, the string, the value object, and reference counting overhead. Multiply by 100,000 active keys and you’re at 25–40 MB just for the rate limiter, before any actual request data.
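You can sanity-check that envelope yourself. The sketch below builds a dict shaped like a per-key counter table and measures the allocation cost per entry with tracemalloc; the exact number varies by CPython version and platform, but it lands in the same ballpark.

```python
# Rough measurement of per-entry overhead for a Python counter dict.
# Numbers vary by CPython version and platform; this is a sketch, not a benchmark.
import secrets
import tracemalloc

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

# 100,000 32-character hex keys mapped to integer counters,
# mimicking the per-key request counters in a rate limiter.
counters = {secrets.token_hex(16): i for i in range(100_000)}

after, _ = tracemalloc.get_traced_memory()
per_entry = (after - before) / len(counters)
print(f"~{per_entry:.0f} bytes per entry")
```

On CPython 3.x this typically reports well over 100 bytes per entry: the key string alone is ~80 bytes, the integer value ~28, and the dict's own table accounts for the rest.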
That’s not a crisis on its own. It becomes a crisis when:
- The dict is part of the GC’s working set, so it’s traversed during every full GC pass.
- The dict is locked during writes (it has to be: individual dict operations are protected by the GIL, but a read-modify-write sequence like increment-and-check is not atomic across threads, so the limiter takes a lock). Lock contention scales with cardinality.
- The proxy has multiple workers, each with its own copy of the dict.
In benchmarks with 1,000+ unique keys, the Rust rate limiter uses about 42× less memory than the production Python implementation. That’s not a typo.
Where the savings come from
Three things, in order of impact:
1. No per-entry Python object overhead. A DashMap<String, AtomicU64> entry is, give or take, the length of the key plus 8 bytes for the counter plus a few bytes of bookkeeping. No PyObject headers, no refcount, no GC tracking, no allocation slack.
2. Sharded locking instead of one big lock. DashMap splits its internal map into shards, each guarded by its own read-write lock, so readers on one shard never contend with writers on another. For a rate limiter that gets hit on every request, this matters more than the memory saving.
3. Atomic counter increments. The counter value itself is an AtomicU64. Increment is a single CPU instruction. No lock acquisition for the common case (increment-and-check).
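The accounting itself is simple. Here is a minimal fixed-window increment-and-check in Python — a hypothetical sketch to show the semantics, not Fast LiteLLM's actual code; the real path does this per shard with atomic counters in Rust, avoiding the lock that the Python version needs:

```python
import threading
import time

class FixedWindowLimiter:
    """Hypothetical per-key fixed-window limiter: one counter per key per window."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.counts: dict[str, int] = {}
        self.window_start = time.monotonic()
        # This lock is what the Rust path avoids in the common case:
        # increment-and-check is not atomic in Python, so writes need it.
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.monotonic()
            if now - self.window_start >= self.window_s:
                self.counts.clear()          # new window, reset all counters
                self.window_start = now
            n = self.counts.get(key, 0) + 1  # increment-and-check
            self.counts[key] = n
            return n <= self.limit
```

With limit=2, the third allow() for the same key in a window returns False while other keys are unaffected. In the Rust version the lock disappears: the per-key counter is an AtomicU64 and the increment is a fetch_add.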
What it doesn’t change
A few things that are not magically improved:
- Your rate limit policy is still your policy. Fast LiteLLM doesn’t redefine what counts as a request, what the windows are, or how token limits are computed. It just runs the same accounting more efficiently.
- Distributed rate limiting still needs Redis. If you’re sharding LiteLLM proxies behind a load balancer and need a single source of truth, you still want LiteLLM’s Redis-backed rate limiter. Fast LiteLLM accelerates the in-process path, not the cross-process path.
- The cold path to Redis. Acceleration doesn't help when a limit check has to go out to Redis on a cache miss; it helps the in-process counters that sit in front of it.
Configuration
There’s nothing to configure. The Rust rate limiter is enabled by default when Fast LiteLLM is active. If you want to disable it for a specific deployment:
export FAST_LITELLM_RUST_RATE_LIMITER=false
For a gradual rollout:
export FAST_LITELLM_RUST_RATE_LIMITER=canary:25 # 25% of calls
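The flag values above suggest a deterministic percentage gate. One plausible way such a canary split could work — a guess at the mechanism for illustration, not Fast LiteLLM's actual implementation — is to hash each call's key into a bucket and compare against the percentage:

```python
import zlib

def parse_rollout(value: str) -> int:
    """Map a flag value to a rollout percentage: 'false' -> 0, 'true' -> 100, 'canary:N' -> N.
    Hypothetical helper; the real flag parsing may differ."""
    if value == "false":
        return 0
    if value.startswith("canary:"):
        return int(value.split(":", 1)[1])
    return 100

def use_rust_path(key: str, pct: int) -> bool:
    # crc32 is stable across processes and runs (unlike hash() on str),
    # so a given key is consistently routed to the same path during rollout.
    return zlib.crc32(key.encode()) % 100 < pct

pct = parse_rollout("canary:25")
```

Stable hashing matters for a rollout: each key sticks to one path, so a regression shows up as a consistent cohort rather than as random noise across all traffic.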
Capacity planning rule of thumb
For a rough sizing exercise, count your peak concurrently-active rate-limit keys (not your total keys ever issued — only keys that have made a request in the active window). Multiply by ~20 bytes for the Rust path or ~300 bytes for the Python path. That’s the steady-state RSS attributable to the rate limiter, per worker.
For 100k active keys and 4 workers, that’s 8 MB (Rust) vs 120 MB (Python) of resident memory just for the rate limiter. The difference matters most on small VMs, on serverless platforms with tight memory limits, and on hot reload during deploys when you briefly have two generations of workers running.
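The rule of thumb reduces to one multiplication; checking the numbers from the paragraph above (MB here means 10^6 bytes):

```python
# Sizing check for the rate limiter's steady-state RSS, using the
# per-entry estimates from the text (~20 B Rust, ~300 B Python).
ACTIVE_KEYS = 100_000
WORKERS = 4
RUST_BYTES_PER_KEY = 20
PYTHON_BYTES_PER_KEY = 300

def limiter_rss_mb(bytes_per_key: int) -> float:
    return ACTIVE_KEYS * bytes_per_key * WORKERS / 1e6

print(limiter_rss_mb(RUST_BYTES_PER_KEY))    # 8.0
print(limiter_rss_mb(PYTHON_BYTES_PER_KEY))  # 120.0
```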
What’s next
- Accelerating the LiteLLM proxy — make sure your proxy actually loads the accelerated path.
- Benchmarks — methodology and results.