Skip to content

Networking AI Agents on Kubernetes (3): Agents Inside vs. Outside the Cluster

Edit (2026-04-10): Since this post was published, Anthropic announced Claude Managed Agents — a suite of composable APIs for building and deploying cloud-hosted agents at scale. This confirms the direction described in the "Pre-Built Agents" section: all three major AI providers now offer hosted agentic runtimes.

The first two posts in this series treated the LLM as an external dependency: the cluster makes outbound calls to an API, and the challenge is managing that egress — routing, rate limiting, failover, cost. That model assumes the agent loop belongs inside the cluster.

I think that assumption is about to break.

Agentic APIs — where you submit a task and a tool list and the provider runs the reasoning loop — invert the traffic model: instead of your cluster calling out to the LLM, the LLM calls back into your cluster to execute tools. The early signs are already there; the full shift hasn't happened yet, but the direction seems clear. This isn't just an architectural convenience — it follows from where the proprietary models and the valuable compute actually live.

This post is part analysis, part forecast: what that inversion means in practice, how to declare an agent portably, what an MCP Gateway actually needs to do, and where the standard-setting efforts stand today.

The Core Shift

Today's agent architecture runs the loop inside the cluster:

  • Your agent runs in a pod, powered by LangGraph / a custom loop
  • Cllients (maybe users, but maybe other systems) connect to this agent through some specific protocol like ACP / A2A, or maybe a regular REST API.
  • In each step there is a call to the LLM, so LLM API → get a tool call → execute the MCP tool → feed result back
  • In this model, the cluster keeps the Agent loop, the state, and the tool execution

The natural evolution: LLM providers offer agentic APIs — you submit the task and the tool list, they run the loop, and they call your tools directly.

This flips the traffic model: instead of outbound calls from the cluster to the LLM, you now have inbound calls from an external LLM provider into your cluster tools. The caller — user or orchestrating agent — communicates via ACP or A2A regardless of where the agent loop runs.

That reduces the cluster's role to two problems:

  1. Declare the agent (prompts + available tools + extra config)
  2. Expose your MCP tools so the external provider can call them

Declaring an Agent

Declaring an agent requires a way to express things like:

  • System prompt / instructions
  • List of available tools (names, descriptions, input schemas)
  • Parameters (max steps, timeout, model preference)

Several solutions exist for declaring an agent nowadays, depending on the environment where they are intended to be run:

  • OpenAI Responses API: tools defined per-request as proprietary JSON schemas — not portable across providers. The older Assistants API had a more persistent agent-object model (instructions + tools defined once, run against repeatedly), but is deprecated as of 2026-08-26 in favour of the Responses API and Conversations API.
  • A2A Agent Cards: created by Google, donated to the Linux Foundation in June 2025, now an open standard. Designed for declaring agent capabilities and endpoints — closer to service discovery than task submission. By August 2025, ACP and A2A had joined forces under LF AI & Data, converging towards a unified standard.
  • Kagent CRDs (covered in a previous post): declarative K8s-native agent spec, designed for in-cluster agents — not for submission to external providers.
  • Docker cagent: Docker's open-source multi-agent runtime using declarative YAML. Agents define their model, instructions, and MCP toolsets in a single file (docker agent run agent.yaml). Agents can be versioned and pushed to OCI registries like container images — a compelling distribution model, oriented toward local or in-cluster execution.

A declarative agent spec (YAML, JSON, CRD) defines what an agent is — its identity, instructions, and available tools. It does not define what the agent should do in a given invocation. At runtime, additional information must be provided:

  • The task input: the actual user request or trigger payload
  • Token budget: how many tokens this run is allowed to consume
  • Step / timeout limit: how long the agent can run before being terminated
  • User identity and context: who triggered this run, for authorization and audit
  • Active tool subset: which of the declared tools are enabled for this specific invocation

This distinction matters: the spec is static configuration; the invocation is a separate API call that combines the spec with runtime parameters. Current solutions conflate these two concerns, which makes it hard to reason about cost, safety, and access control separately from agent identity.

Exposing MCP Tools to External Callers

Exposing MCP tools to external callers is not unique to the model shift described above — it is a general requirement for any company that wants to make its capabilities available to AI agents. Flight search engines are already publishing MCP servers so that agents can query routes and fares directly. Travel booking platforms, financial data providers, e-commerce catalogues — any service that wants to be reachable by an autonomous agent needs to solve exactly this problem: how to expose tools over the network, to callers you don't fully control, without leaking credentials or opening your internal services to abuse.

In the context of this shift, the problem is the same but the caller is the LLM provider running your agent's reasoning loop. MCP servers typically run as internal cluster services, unreachable from outside. An external LLM provider calling your tools needs a specialised component in front of them — what we call an MCP Gateway.

What an MCP Gateway Is (and Why a Regular API Gateway Isn't Enough)

A regular API gateway (Nginx, Traefik, a plain Envoy proxy) handles TLS termination, JWT validation, rate limiting, and URL-based routing. That covers the network layer, but falls well short of what MCP requires. Six capabilities set it apart from a standard proxy:

  1. MCP protocol parsing. MCP uses JSON-RPC 2.0 over Streamable HTTP, not plain REST. Every call arrives at the same endpoint (POST /mcp), with the operation (tools/list, tools/call, etc.) and the target tool name inside the JSON body. A regular gateway routes by URL — routing by body content requires custom plugins.

  2. Session lifecycle management. MCP defines a stateful connection lifecycle: initializenotifications/initialized → operation → DELETE (shutdown). The gateway must issue and track mcp-session-id across requests — regular gateways are stateless.

  3. MCP server multiplexing. A single gateway endpoint can aggregate multiple backend MCP servers, deduplicating tool names and routing each tools/call to the right backend. This is how you present a single surface to the LLM provider while keeping tools split across separate internal services.

  4. OAuth 2.1 discovery endpoints. The MCP spec requires exposing /.well-known/oauth-protected-resource so clients can discover the authorization server — a step no standard gateway implements.

  5. Tool filtering and scope enforcement. OAuth scopes must be mapped to MCP tool names, with the gateway enforcing which tools are visible and callable per token. This requires MCP-aware policy logic, not just JWT validation.

  6. SSE / streaming support. MCP responses can be streamed as Server-Sent Events; the gateway must proxy long-lived connections without buffering or timeout interruption.

None of this existed in productised form before mid-2025. Today, Envoy AI Gateway (with a MCPRoute CRD for Kubernetes), Kong AI Gateway 3.12 (with ai-mcp-proxy and ai-mcp-oauth2 plugins), Kgateway (CNCF, K8s Gateway API native), and Solo.io Agent Gateway each cover a different subset of these capabilities — and no single implementation covers every requirement yet.

Why This Shift Will Happen

Following the Compute

Workloads belong where the compute is. Conventional applications run in the cluster because that is where the CPU and memory are spent — the cluster is the computer. AI agent workloads are different. The vast majority of their compute happens at the LLM: every reasoning step, every token generated, every tool-call decision is GPU time at the provider's data center. The agent loop running in your cluster pod is almost entirely idle — waiting for the LLM to respond, parsing JSON, issuing the next call. That is not a workload; it is a relay.

Moving the loop where the compute actually runs is the natural conclusion. What remains in the cluster are the MCP tools — the parts that genuinely interact with local resources: databases, internal APIs, the Kubernetes control plane. They stay because they must; everything else can move out.

The Model Is the Moat

Compute placement is only part of the story. Traditional cloud computing was a question of convenience and economics — managed infrastructure that anyone could replicate on-premises with enough capital and engineering. The technology was not secret; the moat was operational scale and pricing.

AI changes this. Even with vLLM and the best open-source models, you are bounded by what those models can do. The proprietary models — GPT-4o, Claude, Gemini — are only accessible through the providers' APIs. You cannot self-host them. For the first time, cloud providers hold something on-premises infrastructure genuinely cannot replicate: the model itself.

The advantage compounds. AWS, Google, and Microsoft are integrating LLMs directly into their SaaS ecosystems — Bedrock with S3, Lambda, and RDS; Vertex AI with BigQuery and Cloud Functions; Azure OpenAI with the entire Microsoft productivity stack. An agentic API that can natively call managed cloud services without any network hop is a qualitatively different proposition from one that has to tunnel back into your cluster for every tool call.

This may be the inflection point cloud providers have been waiting for: a differentiation that cannot be competed away by running equivalent open-source software on bare metal.

There is some evidence that this is already happening:

  • OpenAI Responses API: you define tools, OpenAI manages the loop, calls back to your tools, and returns the final answer. Explicitly described as "agentic by default", replacing the older Assistants API.
  • Anthropic tool use + extended thinking: moving towards multi-step autonomous execution. In April 2026, Anthropic announced Claude Managed Agents — a suite of composable APIs for building and deploying cloud-hosted agents at scale, closing the gap with AWS and Azure on the hosted agentic runtime front.
  • Google Gemini with tool use: same direction.
  • Azure AI Foundry Agent Service: Microsoft's unified platform for building and running agentic applications, with native MCP support — "bring your own APIs using Model Context Protocol" — and Azure API Management handling governance and observability across models and MCP tools.
  • Amazon Bedrock AgentCore: AWS's runtime for deploying and operating agents at scale. Natively supports MCP, ships with an explicit MCP Gateway component, and integrates with AWS Marketplace for pre-built agents and tools. In April 2026, AWS added the Agent Registry — a cloud-agnostic discovery and governance hub that indexes agents, tools, MCP servers, and agent skills across providers, supporting both MCP and A2A metadata standards.

The pattern is clear: providers want to own the reasoning loop, not just next-token prediction.

The Exception: Security-Mandated In-Cluster Agents

This evolution makes no sense for organizations that are required to run their AI agents inside the cluster for security or compliance reasons. If the agent loop itself must not touch external infrastructure — because it processes regulated data, because network policies prohibit outbound LLM calls, or because the security posture demands full auditability of every reasoning step — then the in-cluster architecture covered in the previous posts remains the only viable path. The agentic API model assumes you can trust an external provider with your agent's reasoning. Not every organization can.

The Next Step: Pre-Built Agents as a Provider Offering

If providers own the reasoning loop and agents are submitted declaratively, the logical next step is for providers to skip the submission entirely — and offer pre-built agents you simply install into your account.

The prompts are already written, hardened, and maintained by the provider. The tools are well-known third-party MCP servers with established schemas. Your job as an operator reduces to three things: pick the agent, point it at your MCP endpoints, and wire up a trigger. No prompt engineering, no tool schema definitions, no agent loop to maintain.

The examples write themselves. A Security Scanner agent pre-configured to query a Splunk MCP server — you point it at your instance, schedule it nightly, and it surfaces anomalies without you writing a single prompt. A Compliance Conformance agent that knows how to interrogate a Docker Registry MCP server and an Artifactory MCP server — you install it, connect your endpoints, and it runs whenever a new image is pushed. The agent knows what to look for; you supply the data sources.

The trigger model is exactly what you would expect from the Lambda analogy: periodic scheduling (a cron expression), reactive triggering from infrastructure events (a new image tag, a failed deployment, a spike in error rate), or on-demand invocation from another agent. The provider handles the execution, the scaling, and the loop. You handle the MCP endpoints and the event wiring.

This is, in effect, an agent marketplace. Cloud providers already sell pre-integrated SaaS connectors and managed runbooks. Managed agents are the same idea, one abstraction higher — except the "function" being sold is not code but a reasoning loop with a curated tool set. Providers that own both the models and the agent catalogue have a compounding advantage: every new MCP server in the ecosystem makes their pre-built agents more useful, without any additional work on their part. Azure AI Foundry is already moving in this direction, offering over 1,400 pre-built connectors to enterprise systems — SAP, Salesforce, Dynamics 365 — alongside native MCP tool integration and event-driven triggers via Azure Logic Apps and Azure Functions. Amazon Bedrock AgentCore takes the same approach on the AWS side: pre-built agents and tools listed on AWS Marketplace, deployable against your own MCP endpoints. The AWS Agent Registry, launched in preview in April 2026, goes further — a cloud-agnostic catalogue that indexes agents, tools, and MCP servers from any provider, with governance metadata and real-time operational data (latency, uptime, usage). Anthropic joined in April 2026 with Claude Managed Agents — composable APIs for deploying cloud-hosted agents at scale. All three are early, concrete versions of exactly this model.

Conclusions

The tooling isn't complete yet. No portable agent spec exists, and no single MCP Gateway implementation covers every requirement. But the direction is set — ACP and A2A are converging, purpose-built MCP gateways are shipping, and the major providers are already running agentic loops at their end. The question is not whether this shift happens, but how fast, and whether your cluster is ready for it. The posture worth adopting now is to start thinking about the cluster as a capability provider rather than a compute host: the MCP tools you expose, the authentication boundary you enforce, and the policy you define at the gateway become the interface your agents depend on — regardless of where the reasoning loop runs.