
A deep dive into the Fast LiteLLM token counting benchmark

Why tokenization with tiktoken-rs is 1.5–1.7× faster on long inputs and 0.5× as fast on short ones — the FFI overhead curve, fully explained.

Dipankar Sarkar · benchmark, tiktoken, benchmarks, ffi, tokenization

The token counting numbers in the Fast LiteLLM benchmarks tell a story that’s easy to miss at a glance. Rust is faster on large inputs and slower on small ones. Both numbers are real and both matter for capacity planning. This post walks through where the curve comes from and what to do about it.

The setup

Fast LiteLLM’s token counter wraps tiktoken-rs, a Rust port of OpenAI’s tokenizer. The Python baseline is OpenAI’s official tiktoken package, which is itself a Rust extension under the hood — so this isn’t “Python vs Rust” at the algorithm level. It’s “two Rust tokenizers with different FFI bridges.”

The benchmark runs 200 iterations of count_tokens(text) for inputs of varying length. Times are wall-clock, measured with time.perf_counter; GIL contention is controlled by running the comparison single-threaded.
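The harness is essentially a tight loop around perf_counter. Here is a minimal sketch of its shape — the stand-in whitespace tokenizer is a placeholder for the real counter, which this post doesn't reproduce:

```python
import time

def count_tokens(text: str) -> int:
    # Placeholder tokenizer. The real benchmark calls the
    # tiktoken / tiktoken-rs wrappers here instead.
    return len(text.split())

def bench(fn, text: str, iterations: int = 200) -> float:
    # Median-of-N wall-clock time per call, in seconds.
    # Median is less noisy than mean for microsecond-scale timings.
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn(text)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

per_call = bench(count_tokens, "Hello world " * 500)
print(f"median per-call time: {per_call * 1e6:.1f} µs")
```

The real script sweeps this over several input sizes; the shape of the loop is the same.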

The shape of the curve

For a 10-token input (“Hello world”), Fast LiteLLM is roughly 0.5× the speed of tiktoken — Python wins by a factor of two. For a 10,000-token input (a long document), Fast LiteLLM is roughly 1.7× the speed of tiktoken. Somewhere in the middle, the curves cross.

The crossover point in our measurements is around 100–200 tokens. Below that, the FFI overhead dominates. Above that, the actual tokenization work dominates and the Rust path’s lower per-byte cost wins.

Why FFI dominates short inputs

Crossing the Python ↔ Rust boundary via PyO3 has a cost. Specifically, each call has to:

  1. Acquire the GIL if it isn’t already held.
  2. Convert the Python str argument into a &str Rust slice. This is mostly free for ASCII but requires UTF-8 validation either way.
  3. Allocate a Rust Vec for the output, run the tokenizer, and convert the result back to a Python list of integers (or, for count_tokens, just return the length as a Python int).
  4. Release the GIL state.

That overhead is roughly 1–2 microseconds per call. For a 10-token input, the actual tokenization is also roughly 1–2 microseconds. So you’re paying double for the same work.
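You can estimate the fixed overhead in your own measurements: if per-call time is roughly linear in input length, t(n) ≈ overhead + per_token · n, then two measured points pin it down. A sketch with illustrative numbers in the ballpark the article describes (not the benchmark's actual measurements):

```python
def estimate_fixed_overhead(t_short, n_short, t_long, n_long):
    # Fit t(n) = overhead + per_token * n through two measured points.
    per_token = (t_long - t_short) / (n_long - n_short)
    overhead = t_short - per_token * n_short
    return overhead, per_token

# Illustrative times in seconds: ~2 µs for a 10-token input,
# ~150 µs for a 10,000-token input.
overhead, per_token = estimate_fixed_overhead(2e-6, 10, 150e-6, 10_000)
print(f"fixed per-call overhead ≈ {overhead * 1e6:.2f} µs")  # ≈ 1.85 µs
```

Tokenization isn't perfectly linear in practice, so treat this as a first-order estimate, not a precise decomposition.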

tiktoken (the Python package) avoids this because its FFI bridge is tighter — it was built by the same team that wrote the tokenizer, and the Python wrapper does minimal marshalling. Fast LiteLLM has to go through PyO3's general-purpose conversion layer, which is slightly less specialized.

This is not a “Rust is slow” story. It’s an “FFI specialization matters” story.

Why Rust dominates long inputs

For a 10,000-token input, the actual tokenization takes hundreds of microseconds. The FFI overhead is still 1–2 microseconds, but it’s now ~0.5% of the total cost instead of 50%. The interesting question becomes “which tokenizer is faster per byte?”
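Plugging the article's ballpark figures in makes the collapse concrete. Assuming ~1.5 µs of fixed FFI overhead, ~1.5 µs of tokenization work at 10 tokens, and ~300 µs at 10,000 tokens:

```python
FFI_OVERHEAD_US = 1.5  # assumed fixed per-call cost, in µs

# tokens -> assumed tokenization work in µs, per the article's ballparks
work_us = {10: 1.5, 10_000: 300.0}

for n, work in work_us.items():
    share = FFI_OVERHEAD_US / (FFI_OVERHEAD_US + work)
    print(f"{n:>6} tokens: FFI is {share:.1%} of the call")
# ->     10 tokens: FFI is 50.0% of the call
# ->  10000 tokens: FFI is 0.5% of the call
```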

tiktoken-rs happens to be slightly faster than tiktoken per byte on long inputs, mainly because of differences in the regex engine used for the byte-pair encoding pre-tokenization step: the two packages run different regex implementations (fancy-regex on the tiktoken-rs side), and those engines have different performance characteristics on long inputs.

The result: for long documents, Fast LiteLLM is consistently 1.5–1.7× faster.

What this means for your stack

If your workload is dominated by short messages (chat applications with one-line user inputs), token counting acceleration won’t help you and may slightly hurt. Disable it:

export FAST_LITELLM_RUST_TOKEN_COUNTER=false

If your workload involves long documents (RAG pipelines, summarization, document analysis), token counting acceleration is a real win and the default is right.

If you have a mix, leave the default. Fast LiteLLM uses an input-length heuristic to route short inputs through Python and long inputs through Rust automatically. The crossover threshold is configurable:

export FAST_LITELLM_TOKEN_COUNTER_THRESHOLD=200
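The routing itself is simple to picture. The following is a hypothetical sketch, not Fast LiteLLM's actual internals — the function names and the chars-per-token proxy are assumptions (a real router can't know the token count before tokenizing, so it has to key off string length):

```python
import os

# Threshold is expressed in tokens; convert it to a rough character
# budget (assumption: ~4 characters per token for English text).
THRESHOLD_TOKENS = int(os.getenv("FAST_LITELLM_TOKEN_COUNTER_THRESHOLD", "200"))
CHARS_PER_TOKEN = 4

def count_tokens(text, python_counter, rust_counter):
    # Short inputs stay on the Python path (lower fixed overhead);
    # long inputs take the Rust path (lower per-byte cost).
    if len(text) < THRESHOLD_TOKENS * CHARS_PER_TOKEN:
        return python_counter(text)
    return rust_counter(text)

# Stub counters just to show the routing decision:
routed = count_tokens("Hello world", lambda t: "python path", lambda t: "rust path")
print(routed)  # a short input routes to the Python path
```

The point of the heuristic is that the dispatch check itself costs nanoseconds, so mixed workloads get close to the better path's performance on both ends of the curve.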

Why we publish the loss

It would be easy to omit the small-text row from the benchmarks table. Most marketing pages would. We left it in because anyone capacity planning a real workload needs both numbers. Hiding the loss would mean a team running a chat workload would deploy Fast LiteLLM expecting a token counting speedup, see no improvement (or a tiny slowdown), and lose trust in the rest of the numbers. Showing the loss means they can plan around it.

The honest table is also the table that survives independent verification. Every benchmark on our benchmarks page is reproducible from the source repo with one command, and people will run it. Inflated numbers get caught.

The reproducer

git clone https://github.com/neul-labs/fast-litellm
cd fast-litellm
uv venv && source .venv/bin/activate
uv add --dev maturin
uv run maturin develop --release
python scripts/benchmark_tokens.py --sizes 10,100,1000,10000 --iterations 200

You should see the crossover yourself. If you don’t — if Fast LiteLLM is winning at 10 tokens or losing at 10,000 — please open an issue. We want to know.