Bounding your agent

Agents are not like web servers. A web server request runs for milliseconds, costs fractions of a cent, and either succeeds or fails fast. An agent can run for minutes, chain dozens of LLM calls, consume thousands of tokens per step, and keep going until it decides it's done — or until something breaks.

A long-running or looping agent can quietly exhaust your token quota, blow past your latency budget, or rack up a surprising bill before you've noticed anything is wrong.

The instinct is to look to your LLM provider for controls. They disappoint. Providers can cap your account spend and rate-limit your organisation's throughput — but nothing they offer operates at the level of a single agent run. No provider will stop a run at $0.50 or after 30 seconds on your behalf.

The controls have to live inside the agent. To understand why — and how to implement them — there are three things to work through:

What LLM providers actually guarantee: SLAs cover HTTP availability, not latency or per-run cost. Understanding the gap explains why you cannot outsource bounding to the infrastructure.
The token usage data you already have: every major LLM API returns input and output token counts on every response. That's the raw material for a cost budget.
The accumulation loop: how to sum token spend across steps, track wall-clock time, handle the context window cost spiral, and shut down gracefully when a limit is hit — including where an MCP gateway fits as a backstop for tool call counts.

LLM provider guarantees

Every major provider operates a tiered model. Standard pay-as-you-go carries no formal guarantee; uptime SLAs require a premium tier, and latency guarantees are rarer still:

Provider	Standard tier	Premium tier	Uptime SLA	Latency SLA
OpenAI	No SLA	Priority / Scale	99.9%	Yes — p50, per-request
Anthropic	Best-effort	Priority	99.5% target	No
AWS Bedrock	99.9%	Reserved	99.5% (model response)	No
Google Vertex AI	No SLA	Provisioned Throughput	99.5%	No
Azure OpenAI	99.9%	PTU	99.9%	Yes — PTU only

The uptime SLAs measure one thing: whether the API returns HTTP 200 instead of 5xx. A model responding at 5 tokens/second instead of 100 is, by that definition, fully available. Rate limit events (HTTP 429) are explicitly excluded from all of them. For an agent, a throttled request means a stalled step: the agent backs off, retries, and burns wall-clock time, none of which registers as a provider failure.

Cost controls are even more limited. Account-level spend caps exist, but nothing operates at the level of a single run. Claude Managed Agents is a partial exception: sessions can be interrupted externally via the API, which lets an external watchdog halt a run mid-flight. But the session creation API has no budget parameter, so a session will keep running until you stop it or exhaust your credits. The problem is outsourced to your watchdog, not solved.

Latency compounds in a way a single-turn call doesn't. An agent making 10 sequential LLM calls experiences the sum of per-step latencies, not the average. Tail events compound too: if per-call latencies are independent, the chance that at least one of those 10 calls lands in the per-request p95 tail is about 40%, so the end-to-end distribution is far heavier than any per-request SLA suggests.
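To put a number on the compounding, here is a minimal sketch. It assumes per-call latencies are independent, which is optimistic: during an incident slow calls tend to cluster. The function name is illustrative only.

```python
def p_at_least_one_slow(calls: int, tail_quantile: float = 0.95) -> float:
    """Probability that at least one of `calls` sequential requests is slower
    than the given per-request latency quantile.

    Assumption: latencies are independent across calls.
    """
    return 1.0 - tail_quantile ** calls


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        share = p_at_least_one_slow(n)
        print(f"{n:>2} calls: {share:.0%} chance of at least one p95-tail call")
```

For a 10-call run the result is roughly 40%, and it keeps climbing as the run gets longer.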

The agent is the only place to enforce a run budget

An MCP Gateway can cap tool call counts per session — but it's blind to what happens between tool calls. It never sees the tokens consumed by the LLM calls that drive the agent loop, so it can enforce "no more than 20 tool invocations," but not "$0.50 total inference cost per run."

The agent loop is the only component that sees everything: every LLM call, every response, every token — and every major provider hands you that data on every response.

The usage metadata you already have

The field names differ by provider, but the data is there. Anthropic returns a usage object with input_tokens and output_tokens; OpenAI uses prompt_tokens and completion_tokens; Bedrock and Vertex follow similar patterns under usage and usageMetadata respectively.
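A small normalisation layer keeps the rest of the agent provider-agnostic. The sketch below assumes responses have already been parsed into plain dictionaries and covers only the two providers whose field names are spelled out above. StepUsage and normalise_usage are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class StepUsage:
    input_tokens: int
    output_tokens: int


def normalise_usage(provider: str, response: Mapping[str, Any]) -> StepUsage:
    """Map a provider-specific usage object onto one common shape."""
    usage = response.get("usage") or {}
    if provider == "anthropic":
        return StepUsage(usage.get("input_tokens", 0),
                         usage.get("output_tokens", 0))
    if provider == "openai":
        return StepUsage(usage.get("prompt_tokens", 0),
                         usage.get("completion_tokens", 0))
    # Other providers need their own mapping; fail loudly rather than guess.
    raise ValueError(f"no usage mapping for provider {provider!r}")
```

Raising on an unknown provider is deliberate: a silent zero would make the run look free.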

One thing to account for: prompt caching. Anthropic returns cache_read_input_tokens and cache_creation_input_tokens separately; OpenAI surfaces cached_tokens inside prompt_tokens_details. Cached tokens are priced lower than uncached ones: Anthropic charges around 10% of the base input price for cache reads. If your agent reuses a long system prompt across steps, ignoring cache pricing will overestimate cost. Cache creation costs a little more than a normal input token, so the first call of a cached run pays a premium on the cached prefix. Track the cache fields alongside input and output tokens, or your cost accumulation will be wrong in both directions.
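A cache-aware cost function might look like the sketch below. The rates are placeholders chosen to match the ratios described above, not a real price list, and the same Rates object is used for both providers purely for illustration. It also assumes that Anthropic's input_tokens excludes the two cache counters while OpenAI's cached_tokens is a subset of prompt_tokens; check both against the current API documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Placeholder values for illustration only."""
    input: float = 3.00
    output: float = 15.00
    cache_read: float = 0.30    # about 10% of the base input rate
    cache_write: float = 3.75   # a little above the base input rate


def anthropic_step_cost(usage: dict, rates: Rates = Rates()) -> float:
    # Assumption: input_tokens does NOT include the cached token counts.
    parts = [
        (usage.get("input_tokens", 0), rates.input),
        (usage.get("cache_read_input_tokens", 0), rates.cache_read),
        (usage.get("cache_creation_input_tokens", 0), rates.cache_write),
        (usage.get("output_tokens", 0), rates.output),
    ]
    return sum(tokens * rate for tokens, rate in parts) / 1_000_000


def openai_step_cost(usage: dict, rates: Rates = Rates()) -> float:
    # Assumption: cached_tokens is already counted inside prompt_tokens,
    # so it is subtracted to avoid charging those tokens twice.
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    uncached = usage.get("prompt_tokens", 0) - cached
    total = (uncached * rates.input
             + cached * rates.cache_read
             + usage.get("completion_tokens", 0) * rates.output)
    return total / 1_000_000
```

With these placeholder rates, a step with 1,000 fresh input tokens, 10,000 cache-read tokens, and 500 output tokens costs about a third of what a cache-blind calculation would report.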

The context window cost spiral

Step-count limits miss a cost curve that bends upward. As an agent runs longer, its message history grows, and every subsequent LLM call pays for that entire accumulated context as input tokens. A 10-step agent that starts with a 1,000-token context and adds 500 tokens per step doesn't accumulate cost linearly: step 1 pays for 1,000 input tokens, step 10 pays for 5,500, and the run as a whole pays for 32,500 input tokens rather than the 10,000 a flat per-step estimate would predict.
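The arithmetic behind that example, as a short sketch. It assumes a 1,000-token starting context and exactly 500 new tokens per step; real runs are noisier. The function name is illustrative.

```python
def input_tokens_per_step(steps: int, base: int = 1000, growth: int = 500) -> list[int]:
    """Input tokens billed at each step when the whole history is resent."""
    return [base + growth * i for i in range(steps)]


per_step = input_tokens_per_step(10)
print(per_step[0], per_step[-1])   # 1000 5500
print(sum(per_step))               # 32500
```

The per-step figure grows linearly, so the cumulative input bill grows quadratically with the number of steps.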

Output tokens amplify this further — they're not fixed. The same prompt can produce responses of very different lengths depending on model state, sampling parameters, and load. A degraded model in a brownout may produce longer, less focused completions than usual; that extra output then becomes expensive input in the next step. An "available" model can silently inflate your costs before you've noticed anything is wrong.

Accumulating cost across a run

The main loop of a budget-aware agent checks its budget, calls the model, records usage, and repeats until the task is done or a limit is hit. A few things to keep in mind when building it:

At the top of every iteration, check before you call. Two things: accumulated token cost and elapsed wall-clock time. If either has crossed the limit, skip the LLM call and go straight to wrap-up. Not after — before. You need headroom for that final call.

Track input and output tokens separately. Pull them from the usage field on every response. They're priced differently on every model, so accumulate them as separate counters, multiply by the per-model rates, and sum across all steps since run start. A step counter is a proxy for the wrong thing — actual spend per step varies widely with context size.

Don't hard-stop when you hit the limit. Make one final LLM call: instruct the model to summarise what it has accomplished, note what remains, and return a partial result. A mid-run abort gives the caller nothing usable.

Use budget position as a dial, not just a switch. As accumulated cost climbs, start adapting before the limit is reached: inject a budget-aware instruction into the system prompt ("you have approximately N tokens remaining — be concise"), deprioritise optional tool calls, or tighten reasoning depth. Some models expose this directly — Claude's extended thinking accepts a budget_tokens parameter that caps internal reasoning spend per call. The pattern is the same regardless: the agent adapts continuously, not just at the hard stop.
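The four points above can be sketched as a single loop. This is one possible shape under stated assumptions, not a reference implementation: call_llm is a hypothetical callable that takes the message list and returns a dict with "text", "done", and a "usage" dict carrying input_tokens and output_tokens. The rates are placeholders, and both limits are assumed to be positive.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunBudget:
    max_cost_usd: float
    max_seconds: float
    input_rate: float = 3.00     # USD per million input tokens (placeholder)
    output_rate: float = 15.00   # USD per million output tokens (placeholder)
    input_tokens: int = 0
    output_tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def record(self, usage: dict) -> None:
        # Separate counters: input and output are priced differently.
        self.input_tokens += usage["input_tokens"]
        self.output_tokens += usage["output_tokens"]

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * self.input_rate
                + self.output_tokens * self.output_rate) / 1_000_000

    def fraction_used(self) -> float:
        # Whichever limit is closer to exhaustion drives the decision.
        elapsed = time.monotonic() - self.started
        return max(self.cost_usd / self.max_cost_usd,
                   elapsed / self.max_seconds)


def run_agent(call_llm: Callable[[list], dict], task: str, budget: RunBudget,
              soft_limit: float = 0.75, reserve: float = 0.15) -> str:
    messages = [task]
    hinted = False
    while True:
        used = budget.fraction_used()           # check before calling
        if used >= 1.0 - reserve:
            # Spend the reserved headroom on one graceful wrap-up call.
            messages.append("Budget reached. Summarise what is done and what remains.")
            reply = call_llm(messages)
            budget.record(reply["usage"])
            return reply["text"]
        if used >= soft_limit and not hinted:
            # The dial: nudge the model before the hard limit is reached.
            messages.append("Budget is running low: be concise, skip optional tool calls.")
            hinted = True
        reply = call_llm(messages)
        budget.record(reply["usage"])
        messages.append(reply["text"])
        if reply["done"]:
            return reply["text"]
```

The reserve has to be at least as large as the most expensive step you expect. If it is smaller, the wrap-up call itself can push the run over the limit.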

Where gateways fit

Two gateways can strengthen this setup — one on the LLM side, one on the tool side — and they cover different ground.

An LLM Gateway sits between your agent and the LLM providers. Every inference call flows through it, making it the natural place to aggregate token usage and cost across runs, teams, and models — the observability layer that in-process budget tracking alone can't give you.

An MCP Gateway sits between your agent and tool servers. It's blind to LLM inference costs, but it is the natural insertion point for bounding tool call counts. Both ACP and A2A provide the identifiers needed to scope limits to a run: ACP defines explicit run_id and session_id UUID fields on every run; A2A gives each task a unique id and groups related tasks under a contextId. The agent passes one of these as a correlation header, enabling the gateway to count invocations per run and reject new calls once a ceiling is hit — a useful backstop for runaway tool-calling loops, and one that holds even if the agent's own logic has a bug. It cannot, however, replace agent-side budget enforcement: it never sees the inference costs between tool calls.
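On the gateway side, the per-run ceiling amounts to a counter keyed by the correlation id. The sketch below is an in-memory illustration only: ToolCallLimiter is a hypothetical class, and the x-run-id header name is an assumption, not something defined by ACP, A2A, or MCP. A real gateway would also need locking and expiry of finished runs.

```python
from collections import defaultdict


class ToolCallLimiter:
    """Counts tool invocations per run and rejects calls past a ceiling."""

    def __init__(self, max_calls_per_run: int = 20) -> None:
        self.max_calls = max_calls_per_run
        self._counts: dict[str, int] = defaultdict(int)

    def admit(self, headers: dict) -> bool:
        run_id = headers.get("x-run-id")   # hypothetical correlation header
        if run_id is None:
            # Without a correlation id the call cannot be scoped to a run.
            return False
        if self._counts[run_id] >= self.max_calls:
            return False
        self._counts[run_id] += 1
        return True
```

Rejecting calls that carry no run id is a design choice; a more lenient gateway could count them against a shared default bucket instead.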

But nothing ties all of this together by run_id out of the box. The LLM Gateway sees inference costs, the MCP Gateway sees tool invocations, and the agent holds the live budget, each in isolation. The missing piece is a shared store keyed by run_id that all three can write to and read from during a live run. That store doesn't exist in any of these components today; you'd have to build it, backed by something like Redis or a time-series database, and wire it in yourself.
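The interface of such a store can be small. The sketch below uses an in-process dictionary as a stand-in for Redis or a time-series database, so it only illustrates the shape: RunLedger and its method names are hypothetical, and a real deployment would need a networked backend that all three components can reach.

```python
import threading
from collections import defaultdict


class RunLedger:
    """In-memory stand-in for a shared store keyed by run_id."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._runs = defaultdict(lambda: {"cost_usd": 0.0, "tool_calls": 0})

    def add_inference_cost(self, run_id: str, usd: float) -> None:
        # Written by the LLM gateway after each inference call.
        with self._lock:
            self._runs[run_id]["cost_usd"] += usd

    def add_tool_call(self, run_id: str) -> None:
        # Written by the MCP gateway after each tool invocation.
        with self._lock:
            self._runs[run_id]["tool_calls"] += 1

    def snapshot(self, run_id: str) -> dict:
        # Read by the agent (or a watchdog) during the live run.
        with self._lock:
            return dict(self._runs[run_id])
```

Returning a copy from snapshot keeps readers from mutating the ledger by accident.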

Observability

An observability platform like Langfuse or MLflow Tracing is not that shared store. These platforms are built for post-hoc analysis: they collect traces after the fact and surface them in dashboards, not as a live signal the agent can query mid-run. They tell you what happened; they don't tell the agent what it has left. The in-process tally is still yours to maintain; the observability platform is where you go afterwards to understand why a run cost what it did.

Conclusion

Neither ACP nor A2A currently defines fields for expressing budget constraints on a run. Both provide identifiers (run_id, session_id, contextId) but no standard way to say "this run may spend at most N tokens" or "must complete within T seconds." Budget metadata today lives only in your agent's own configuration, invisible to the protocol.

So a bounded agent is not an afterthought: it is a design constraint that must shape how you write the loop from the start. Concretely, that means:

Assign a run_id at run start. Generate a UUID or take one from the ACP/A2A invocation context. Carry it as a header on every LLM and tool call throughout the run.
Track input and output tokens separately on every step. They have different prices. Accumulate them from the usage field in each response, not from a step counter.
Check the budget before each step, not after. You need headroom for a graceful wrap-up call. A hard stop mid-run produces a result that is useless to the caller.
Set a wall-clock deadline. Token budgets don't catch slow steps or stalled tool calls. Both checks — cost and time — need to run before every step.
Use an MCP Gateway as a backstop for tool calls, not a substitute for the above. It catches runaway loops when the agent's own logic fails; it cannot see inference costs.

But none of this works without observability. The in-process budget check is a circuit breaker; it fires or it doesn't. What you need to understand is why a run was expensive — which step consumed most of the budget, and whether the pattern repeats. That requires traces — one span per LLM call and per tool call, tagged with run_id, with token counts and latency attached. Route inference through an LLM Gateway and correlate everything in a shared store keyed by run_id. Without it, you are flying blind: you know the run stayed within budget, but you have no idea how close it came, or where the cost is concentrated.
