LiteLLM Issue

Why LiteLLM burns extra GitHub Copilot premium requests on agent flows

A deep dive into the X-Initiator header semantics, how Copilot's premium request accounting works, and why LiteLLM's transformation layer over-bills compared to Copilot CLI and OpenCode.

Dipankar Sarkar · provider · github-copilot · billing · agents
Upstream issue
#18155 — [Bug]: GitHub Copilot Provider - Excessive Premium Request Usage
Opened December 17, 2025 · status: open · 9 👍 · 4 comments

GitHub Copilot has a billing model that rewards single-conversation usage and punishes the wrong kind of automation. Each user-initiated chat counts as one “premium request”; subsequent tool calls and follow-ups within the same agentic flow are supposed to be free. Issue #18155 reports that LiteLLM, when used as a Copilot proxy by agentic clients like Claude Code or Codex CLI, burns multiple premium requests per session — even though the equivalent flow through Copilot’s own VS Code chat or the official CLI uses one.

The reporter cites the X-Initiator header as the relevant mechanism and points to two open-source projects (opencode-copilot-auth and copilot-api) that get this right. The issue follows a months-long discussion in #12859, but the over-billing still occurs on the latest release.

This post explains what the X-Initiator header actually does, where LiteLLM’s transformation layer goes wrong, and what a correct implementation looks like.

The Copilot premium request model

GitHub’s Copilot billing documentation describes premium requests as the metered unit for paid Copilot usage. The model is roughly:

  • A user typing a chat message into VS Code, the Copilot CLI, or any IDE integration counts as one premium request.
  • Tool calls, retries, function call follow-ups, and agent loops triggered by that initial user message are free. They share the request budget of the originating call.
  • The way the Copilot backend distinguishes “this was the user typing” from “this was the agent following up” is through a request header: X-Initiator. Values are roughly user for human-typed messages and agent for follow-ups.

This is a sane billing model — you pay for the work the user asked for, not for the work the agent did to satisfy it. It also creates a real financial incentive to send the header correctly. Misclassifying an agent loop as user-initiated multiplies the bill by however many turns the loop took.
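The cost of getting this wrong compounds with loop length. A minimal simulation of the metering rule described above (hypothetical, but faithful to the issue's description: only `user`-initiated calls are metered):

```python
# Hypothetical simulation of Copilot's metering, as described above:
# only requests marked X-Initiator: user count against the budget.

def premium_requests(initiators: list[str]) -> int:
    """Count metered requests in a sequence of X-Initiator values."""
    return sum(1 for i in initiators if i == "user")

# A five-turn agent loop triggered by one user prompt:
correct = ["user", "agent", "agent", "agent", "agent"]
broken  = ["user", "user", "user", "user", "user"]  # misclassified

premium_requests(correct)  # 1 premium request
premium_requests(broken)   # 5 premium requests
```

A five-turn loop misclassified as five user messages costs five times what the same flow costs through Copilot's own clients.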

Where LiteLLM gets it wrong

LiteLLM has two relevant transformation files for the Copilot provider:

  • litellm/llms/github_copilot/chat/transformation.py for chat completions.
  • litellm/llms/github_copilot/responses/transformation.py for the responses API.

Both files do consider the X-Initiator header, according to the reporter’s investigation. But considering the header isn’t enough — the question is whether the value being set matches the semantics Copilot expects. The reporter has confirmed that LiteLLM consumes multiple premium requests for both:

  • Claude Code (Anthropic-shaped chat completions routed through LiteLLM to Copilot)
  • Codex CLI (OpenAI Codex responses-API routed through LiteLLM to Copilot)

The same agent flows through Copilot’s own clients use only one premium request. So the bug isn’t “LiteLLM never sends the header”; it’s “LiteLLM sends the header incorrectly, or sends it on calls that should not have it, or doesn’t preserve it across multi-turn conversations.”

The likely root cause

Without access to the LiteLLM transformation source as it exists today, here’s the structural pattern that almost always causes this exact symptom in provider-translation code:

The header is set per-call based on whether the call has function calls or tool calls in the message history, not based on whether the call originated from a user or an agent. This is a tempting heuristic — “if there are tool calls, this is mid-agent-loop, so it’s agent-initiated; otherwise it’s user-initiated.” It’s wrong because:

  • A user can type a follow-up message after an agent loop has completed. The new message has tool calls in its history (from the previous turn), but it’s still user-initiated.
  • An agent can make a tool call before any function call response exists in the message history. The first call is mid-loop but doesn’t look like it.
  • The “is this an agent loop” status is a property of the session, not of any individual request. You can’t infer it from the message contents alone.
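The failure modes above are easy to demonstrate. Here is a sketch of the message-content heuristic being criticised (hypothetical, not taken from LiteLLM's actual source) together with both misclassification directions:

```python
# Hypothetical sketch of the flawed heuristic: infer the initiator
# from tool traffic in the message payload alone.

def infer_initiator(messages: list[dict]) -> str:
    has_tool_traffic = any(
        m.get("role") == "tool" or m.get("tool_calls") for m in messages
    )
    return "agent" if has_tool_traffic else "user"

# Over-billing: an agent fires a fresh sub-prompt with no tool
# messages yet, so the heuristic reports "user" and a premium
# request is burned even though no human typed anything.
agent_subprompt = [{"role": "user", "content": "Summarise file A"}]
infer_initiator(agent_subprompt)  # "user" — burns a premium request

# Under-billing: a human types a follow-up after a completed loop;
# stale tool traffic in the history flips the fresh turn to "agent".
followup = [
    {"role": "user", "content": "Fix the bug"},
    {"role": "assistant", "tool_calls": [{"id": "t1"}]},
    {"role": "tool", "tool_call_id": "t1", "content": "ok"},
    {"role": "user", "content": "Now add tests"},
]
infer_initiator(followup)  # "agent" — but a human initiated it
```

Both errors come from the same root: the heuristic reads a per-request payload to answer a per-session question.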

The OpenCode and copilot-api implementations the reporter cites get this right by tracking the initiator status at the session level, not the message level. They set X-Initiator: user on the first call after a user input and X-Initiator: agent on every subsequent call until the next user input. The information lives in the proxy’s per-session state, not in the message payload.
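A minimal sketch of that session-level approach (simplified; read the linked OpenCode and copilot-api sources for the real implementations):

```python
# Minimal sketch of session-level initiator tracking, assuming the
# proxy is told when a human actually types (e.g. by the client).

class SessionInitiatorTracker:
    """First upstream call after a user input is 'user'; every
    subsequent call is 'agent' until the next user input."""

    def __init__(self) -> None:
        self._pending_user_turn: dict[str, bool] = {}

    def on_user_input(self, session_id: str) -> None:
        # Called when the human types a new message.
        self._pending_user_turn[session_id] = True

    def initiator_for_next_call(self, session_id: str) -> str:
        if self._pending_user_turn.pop(session_id, False):
            return "user"   # the metered call for this prompt
        return "agent"      # tool-call follow-ups in the same loop

tracker = SessionInitiatorTracker()
tracker.on_user_input("s1")
tracker.initiator_for_next_call("s1")  # "user"
tracker.initiator_for_next_call("s1")  # "agent"
tracker.initiator_for_next_call("s1")  # "agent"
```

The state is tiny, but it lives in the proxy rather than the payload, which is exactly what a stateless transformation layer lacks.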

LiteLLM is a stateless transformation layer by design — each completion call is independent. To track session state, the transformation layer would need to either:

  1. Accept a hint from the caller about whether the call is user- or agent-initiated, via a header or a metadata field.
  2. Maintain its own session state keyed on something stable (a thread ID, a Copilot session token, the conversation message count).

Option 1 is the cleaner answer because it pushes the decision to the caller, which is where the information actually lives. Claude Code knows whether it’s processing a user prompt or executing a tool call; Codex CLI knows the same. The bridge layer just needs to honor what the caller tells it.
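From the caller's side, option 1 could look like the sketch below. The hint header name (`X-LiteLLM-Initiator`) is hypothetical; LiteLLM's completion API does accept per-call extra headers, which is the natural place for such a hint to travel:

```python
# Sketch of option 1 from the caller's side. "X-LiteLLM-Initiator"
# is a hypothetical hint header, not an existing LiteLLM feature.

def copilot_call_kwargs(messages: list[dict], initiated_by_user: bool) -> dict:
    """Build kwargs for litellm.completion(...) carrying the hint."""
    return {
        "model": "github_copilot/gpt-4o",  # illustrative model name
        "messages": messages,
        "extra_headers": {
            "X-LiteLLM-Initiator": "user" if initiated_by_user else "agent",
        },
    }

# Claude Code processing a user prompt:
#   litellm.completion(**copilot_call_kwargs(msgs, initiated_by_user=True))
# Claude Code submitting a tool result mid-loop:
#   litellm.completion(**copilot_call_kwargs(msgs, initiated_by_user=False))
```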

What a correct implementation looks like

Pseudocode for the right shape:

def transform_request(model_response_request, headers):
    initiator = "user"  # default: assume a human-typed message

    # Honor an explicit caller hint when one is present
    if headers.get("X-LiteLLM-Initiator") in ("user", "agent"):
        initiator = headers["X-LiteLLM-Initiator"]
    elif _is_tool_call_response(model_response_request):
        # Tool-call result submissions are always agent-initiated
        initiator = "agent"
    elif _has_assistant_with_tool_calls_as_last_turn(model_response_request):
        # The last assistant turn requested tools; this call follows up
        initiator = "agent"

    upstream_headers = {
        "X-Initiator": initiator,
        # ... other Copilot-specific headers
    }
    transformed_payload = _transform_body(model_response_request)
    return transformed_payload, upstream_headers

The crucial changes:

  1. Default to user only when there’s no other signal. This is safe because most explicit user calls don’t have tool-call structure in them.
  2. Recognize tool-call responses (the message containing function call results) as definitively agent-initiated. These should never burn a premium request.
  3. Recognize assistant-with-tool-calls follow-ups as agent-initiated. When the previous assistant turn requested tools, the next call from the client is the tool result, which is part of the agent loop.
  4. Allow caller override via a header. Agentic clients like Claude Code know better than the proxy and should be trusted when they send a hint.

The reporter’s claim that OpenCode and copilot-api both implement this pattern correctly is verifiable by reading their source — both linked in the issue. They are short files (under 200 lines each), and they’re a useful reference for what LiteLLM’s transformation should look like.

What teams do today

The workarounds reported in #18155 and the related #12859 thread are mostly:

  1. Use LiteLLM’s Copilot integration only for non-agentic flows. Single-shot chat completions don’t trigger the bug. Agent loops do. Some teams have separated their Copilot usage so that anything tool-using goes through a different path.
  2. Use one of the alternative proxies. OpenCode’s auth proxy and ericc-ch’s copilot-api both correctly handle the header. Teams have migrated to these for Copilot specifically while using LiteLLM for everything else.
  3. Disable Copilot for agentic clients entirely and route Claude Code / Codex through the actual provider APIs (Anthropic, OpenAI). This is the cleanest answer if you have the budget for direct provider access — it avoids the Copilot middleman and the billing surprises.
  4. Monitor Copilot usage closely and pin LiteLLM versions when the billing is stable. There’s no good way to detect the over-billing from the LiteLLM side; you only see it in the GitHub Copilot dashboard.

The broader lesson

Provider integrations in unified LLM proxies have a recurring failure mode: the proxy implements the protocol correctly but misses the billing semantics that aren’t part of the protocol. The protocol says “you can send this header”; the billing system says “this header has these specific semantics that depend on session state.” Translating between protocols loses session state by default.

For any project that wraps a metered LLM API, the audit checklist is:

  • Read the provider’s billing docs alongside the API docs. They are usually separate documents and the billing doc has the constraints that matter for cost.
  • Identify which headers, fields, or session structures affect billing, not just functionality. These need first-class support in the transformation layer, not best-effort heuristics.
  • Test billing behavior end-to-end, not just response correctness. A test that confirms “this agent flow consumed one premium request, not five” requires hitting the real Copilot API and reading the dashboard. Hard to automate, important to do at least once per release.
  • Allow callers to override billing-relevant headers explicitly, because the caller usually knows more about their session than the proxy does.

The Copilot premium request bug has been quietly costing teams real money since the integration shipped. The fix is small in code but requires someone to sit down with both the LiteLLM transformation files and the Copilot billing docs and reconcile them carefully. Until then, the workarounds above are the responsible answer.
