Bisecting the LiteLLM 1.80 → 1.81 performance regression
A walkthrough of how to bisect a performance regression in a release like LiteLLM 1.81.x, the likely culprits in this specific case, and how to verify a fix.
Issue #19921 is the kind of report every operations team dreads: “we upgraded a minor version and everything got slower.” No exception, no error, no log line that points anywhere obvious. UI slow. API slow. Container CPU and memory unchanged. Database utilization unchanged. Just slower.
The reporter’s setup is concrete: 50 internal users, 200 models behind a Postgres-backed LiteLLM proxy, ~10 RPS, Langfuse OTEL callbacks, shuffle routing. Going from 1.80.5 to 1.81.0 / 1.81.3 broke the SLO. 41 comments later, the issue is still open and the underlying cause hasn’t been narrowed down. Several other users have independently confirmed seeing the same regression.
This post is about how you’d actually go about isolating a regression like this, what the most likely culprits are in this specific case, and what a fix would look like.
How to bisect a release-shaped performance regression
The classic mistake when investigating a slowdown is to start guessing. “Maybe it’s the new callback dispatch path.” “Maybe it’s the new model registry.” “Maybe Langfuse has a new version too.” Every guess that doesn’t reproduce wastes time. Bisect first, theorize second.
For a regression that spans multiple commits between two version tags:
- Reproduce the slowdown reliably. Build a representative load test that hits the slow paths the user describes — login, model list, virtual keys, MCP tools, internal users. Each of these is a separate API endpoint. Measure each one with both versions. If you can’t reproduce on a stripped-down setup, the regression is environment-dependent and you need to bring more of the user’s environment into your reproducer.
- Bisect at the version level first. 1.80.5 → 1.81.3 spans several releases. Start by testing 1.80.5 vs 1.81.0; if the regression is already present there, you’ve narrowed it to the 1.80.5 → 1.81.0 window. If 1.81.0 is fine, walk forward through the 1.81.x patch releases until you find the one that introduces the slowdown.
- Bisect at the commit level once you have a single release.
Run `git bisect` between the good and bad release tags. For each midpoint commit, run the load test and report `good` or `bad`. O(log N) commits later you have the offending commit.
- Read the commit, not the release notes. Release notes routinely omit the things that cause performance regressions because nobody notices them at write time. The actual diff is the source of truth.
This sounds laborious because it is. The reason to do it anyway is that any other approach has worse expected cost — guessing-and-checking on a complex codebase usually takes longer than mechanical bisection.
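The commit-level step can be partly automated with `git bisect run` and a small check script. A sketch, where the endpoint URL, baseline p95, and threshold are all hypothetical numbers you would measure for your own setup:

```python
# Hypothetical check script for `git bisect run`. Exit-code convention:
# 0 = good commit, 1 = bad commit, 125 = skip (e.g. build failed).
# Invocation:
#   git bisect start v1.81.0 v1.80.5
#   git bisect run python bisect_check.py
import statistics
import time
import urllib.request

BASELINE_P95_MS = 120.0   # p95 measured on the known-good 1.80.5 tag
THRESHOLD = 1.5           # call a commit bad if p95 exceeds 1.5x baseline

def p95_latency_ms(url, n=50):
    """Hit one slow endpoint n times; return the p95 latency in ms."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(samples, n=20)[18]  # 95th percentile

def verdict(p95_ms, baseline_ms=BASELINE_P95_MS, threshold=THRESHOLD):
    """Map a measurement to git bisect's exit-code convention."""
    return 1 if p95_ms > baseline_ms * threshold else 0

# In the real script you would rebuild the proxy at the current bisect
# commit first and exit 125 if the build fails, then:
#   sys.exit(verdict(p95_latency_ms("http://localhost:4000/v1/models")))
```

The skip code matters: mislabeling an unbuildable commit as `good` or `bad` sends the bisection down the wrong half of the history.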
The likely culprits in 1.81.x
Without doing the bisection myself, here are the structural patterns to look at first based on what’s usually behind release-spanning slowdowns in LiteLLM-shaped codebases:
Database query patterns. The reporter mentions Postgres for backing storage, and the slow operations are exactly the ones that hit the DB: loading models, virtual keys, internal users, MCP tools. The most common cause of “everything got slower with no obvious reason” in a DB-backed proxy is an N+1 query that was introduced in a refactor. A list endpoint that previously did one query now does one query for the list plus one query per item to load related data. The container CPU stays low because the bottleneck is round-trip time to Postgres, not local CPU. The DB CPU stays low because each query is cheap. But latency goes up because you’re now doing 50 sequential queries instead of one. Look for new .refresh(), .fetch_related(), or eagerly-loaded relations in any model loading code that touches LiteLLM_VirtualKey, LiteLLM_TeamTable, LiteLLM_ModelTable, etc.
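A toy model of the pattern, with a counting stub in place of Postgres — none of these names are LiteLLM’s actual data layer, they just show why latency climbs while every individual query stays cheap:

```python
# N+1 illustration: a list endpoint that fetches each item's team in a
# separate query does 1 + N round trips; batching the lookup does 2.

class CountingDB:
    """Stand-in for Postgres that just counts round trips."""
    def __init__(self, teams):
        self.teams = teams
        self.queries = 0

    def fetch_keys(self):
        self.queries += 1
        return [{"key": f"sk-{i}", "team_id": i % 3} for i in range(50)]

    def fetch_team(self, team_id):
        self.queries += 1          # one round trip per call
        return self.teams[team_id]

    def fetch_teams(self, team_ids):
        self.queries += 1          # one round trip for the whole batch
        return {t: self.teams[t] for t in team_ids}

def list_keys_n_plus_1(db):
    keys = db.fetch_keys()
    return [{**k, "team": db.fetch_team(k["team_id"])} for k in keys]

def list_keys_batched(db):
    keys = db.fetch_keys()
    teams = db.fetch_teams({k["team_id"] for k in keys})
    return [{**k, "team": teams[k["team_id"]]} for k in keys]

teams = {0: "eng", 1: "ops", 2: "ml"}
slow = CountingDB(teams); list_keys_n_plus_1(slow)
fast = CountingDB(teams); list_keys_batched(fast)
print(slow.queries, fast.queries)  # 51 vs 2
```

At ~1 ms of round-trip time per query, 51 sequential queries is ~50 ms of added latency with essentially zero added CPU anywhere — exactly the reporter’s symptom profile.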
Callback dispatch path changes. The reporter has Langfuse OTEL callbacks enabled. A change to how callbacks are dispatched — particularly if it added a synchronous await somewhere that used to be fire-and-forget — would slow every API call by the round-trip time to the callback target. This is a single-digit-ms regression per call, but when it’s compounded across many internal API calls per page load, it becomes visible.
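A minimal asyncio sketch of the difference — the 50 ms callback round trip is a made-up number, and the handler names are illustrative, not LiteLLM’s:

```python
# Awaiting a callback inline adds its round-trip time to every request;
# fire-and-forget scheduling does not.
import asyncio
import time

CALLBACK_RTT = 0.05  # pretend the Langfuse ingest round trip is 50 ms

async def emit_callback():
    await asyncio.sleep(CALLBACK_RTT)

async def handle_request_awaited():
    # regression shape: callback awaited on the request path
    await emit_callback()
    return "ok"

async def handle_request_fire_and_forget():
    # pre-regression shape: schedule the callback and return immediately
    task = asyncio.create_task(emit_callback())
    return "ok", task

async def main():
    t0 = time.perf_counter()
    await handle_request_awaited()
    awaited = time.perf_counter() - t0

    t0 = time.perf_counter()
    _, task = await handle_request_fire_and_forget()
    fired = time.perf_counter() - t0
    await task  # let the background task finish cleanly
    return awaited, fired

awaited, fired = asyncio.run(main())
print(f"awaited={awaited*1000:.0f}ms fire-and-forget={fired*1000:.0f}ms")
```

Fire-and-forget has its own failure modes (unbounded task backlog, lost errors), which is exactly the kind of concern that tempts a refactor into adding the await.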
Authentication / authorization middleware. Many API endpoints in the LiteLLM proxy share an auth middleware that resolves the user, the team, the budget, the rate limit. If a refactor in 1.81 added an extra DB roundtrip to that middleware (say, to check a new permission flag), every endpoint pays the cost. This pattern is particularly nasty because it doesn’t show up as a slow individual query — it shows up as “everything is slightly slower.”
OTEL span propagation overhead. Langfuse + OTEL instrumentation can be expensive if span-context propagation is implemented as Python contextvars lookups in hot paths. A change to how spans are emitted, especially adding spans to internal helper functions that are called many times per request, can add up. The fact that the user reports the UI and API are slow suggests it’s not solely the model-call path.
Connection pool sizing. A change to how the DB connection pool is configured — for example, lowering pool_size or max_overflow defaults — would manifest as latency that goes up under load while CPU and memory stay flat. Connections wait for the pool, requests wait for connections, and nothing looks busy.
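A small simulation of that starvation, using a semaphore as a stand-in for a connection pool — sizes and timings are illustrative only:

```python
# 20 concurrent requests, each needing one 10 ms query, through pools
# of size 20 vs size 2. The query is cheap either way; the waiting isn't.
import asyncio
import time

async def query(pool: asyncio.Semaphore):
    async with pool:                 # wait for a free connection
        await asyncio.sleep(0.01)    # the query itself is fast

async def run(pool_size: int, n_requests: int = 20) -> float:
    pool = asyncio.Semaphore(pool_size)
    t0 = time.perf_counter()
    await asyncio.gather(*(query(pool) for _ in range(n_requests)))
    return time.perf_counter() - t0

big = asyncio.run(run(20))   # roughly one query's worth of time
small = asyncio.run(run(2))  # ~10 sequential waves of queries
print(f"pool=20: {big*1000:.0f}ms  pool=2: {small*1000:.0f}ms")
```

Nothing in this simulation burns CPU, and no single query is slow — which is why pool starvation evades both container metrics and slow-query logs.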
Of these, the N+1 query pattern is the most common cause of this exact symptom set in this exact application shape. It would be where I’d start.
How to verify a fix once you find it
The hard part of fixing a performance regression isn’t writing the fix. It’s proving the fix actually works. For a regression like this:
- Capture the load test from your bisection as the regression test. Whatever you used to detect the slowdown should become a permanent benchmark in CI.
- Compare wall-clock latency for each affected endpoint before and after the fix. Median, p95, p99 — not just average.
- Compare DB query counts using `pg_stat_statements` or equivalent. If your fix is for an N+1, you should see the query count for that endpoint drop.
- Re-run with both 1.80.5 and post-fix 1.81.x. The post-fix version should match or beat 1.80.5 on the same workload. If it only matches partially, there are multiple regressions and you need to bisect again from the new known-bad commit.
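The latency comparison can be a small harness over your load-test samples. A sketch with synthetic data standing in for real measurements:

```python
# Compare a candidate build's latency distribution against a known-good
# baseline at median, p95, and p99 (never just the average).
import statistics

def summarize(samples_ms):
    qs = statistics.quantiles(samples_ms, n=100)
    return {"median": statistics.median(samples_ms),
            "p95": qs[94],
            "p99": qs[98]}

def regression_fixed(baseline, candidate, tolerance=1.05):
    """Candidate passes only if every percentile is within tolerance
    of the known-good baseline (e.g. 1.80.5)."""
    b, c = summarize(baseline), summarize(candidate)
    return all(c[k] <= b[k] * tolerance for k in ("median", "p95", "p99"))

# synthetic example: regressed build is ~3x slower across the board
baseline_ms  = list(range(1, 101))
regressed_ms = [x * 3 for x in range(1, 101)]
print(regression_fixed(baseline_ms, baseline_ms))   # True
print(regression_fixed(baseline_ms, regressed_ms))  # False
```

Gating on all three percentiles matters: a fix that restores the median but leaves p99 elevated usually means a second, rarer regression is still hiding behind the first.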
What users in the issue have done
From the comments on #19921, the workarounds are mostly “stay on 1.80.5.” A few users have tried:
- Pinning back to 1.80.5 in production while testing newer versions in staging. This is the most common workaround and works fine until you need a feature or fix from 1.81.x.
- Disabling Langfuse callbacks as a diagnostic step, to confirm whether the regression is in the callback path or elsewhere. Some users report partial improvement; others report none, suggesting the regression has multiple sources.
- Increasing the Postgres connection pool size and worker count, which mitigates contention-driven symptoms even if it doesn’t fix the root cause.
- Switching to in-memory backed routing for non-critical operations, bypassing the DB hot path entirely. Only viable for some deployments.
None of these are the fix. They are stalling tactics until the underlying cause is bisected and patched.
The broader lesson
Performance regressions on minor version bumps are an organizational problem, not an engineering one. A project as wide as LiteLLM cannot rely on individual reviewers spotting “this query is now N+1” or “this middleware now does an extra round trip” in code review. The defense is automated:
- A perf benchmark in CI that runs a representative workload and fails the build if latency regresses by more than a threshold. Even a simple “list 100 keys, list 100 models, run 10 completions” benchmark would have caught this regression on the PR that introduced it.
- Query-count assertions in tests for the most-hit endpoints. Snapshot tests for query count are a cheap way to catch the entire class of N+1 regressions.
- Annotated load test results in PRs, so reviewers can see “this PR adds 12ms to the auth middleware” before merging.
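A query-count snapshot test can be as small as a counting stub around the DB layer. The endpoint and query shapes below are hypothetical, but the mechanism generalizes:

```python
# Snapshot the number of statements an endpoint issues; any refactor
# that introduces an N+1 fails this test immediately.

class QueryCounter:
    """Counting stub standing in for a real DB session."""
    def __init__(self):
        self.count = 0
    def execute(self, sql, params=None):
        self.count += 1
        return []  # stand-in result set

EXPECTED_QUERIES = {"/v1/keys": 2}  # snapshot: one list + one batch

def list_keys(db):
    db.execute("SELECT * FROM keys")
    db.execute("SELECT * FROM teams WHERE id = ANY(%s)", [[1, 2, 3]])

def test_list_keys_query_count():
    db = QueryCounter()
    list_keys(db)
    assert db.count == EXPECTED_QUERIES["/v1/keys"], (
        f"query count changed: {db.count} (possible N+1)")

test_list_keys_query_count()
```

When the snapshot legitimately changes, the diff to `EXPECTED_QUERIES` shows up in review, which is precisely the visibility that plain code review lacks.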
LiteLLM has a CI matrix for compatibility, which is excellent. A perf matrix is the missing piece. Until then, regressions like #19921 are detected by the user, in production, after the upgrade.
References
- Upstream issue: #19921
- `git bisect` documentation: git-scm.com
- Postgres `pg_stat_statements`: postgresql.org/docs