
Networking AI Agents on Kubernetes (1)

Traditional Kubernetes workloads have predictable, inbound-dominated traffic: many short HTTP requests arrive, get processed, responses go back. The networking model — Services, Ingress, NetworkPolicy — is designed around that shape.

AI agents invert it. Their traffic is dominated by outbound calls: every reasoning step triggers HTTPS requests to LLM provider APIs, tool endpoints, and retrieval systems. There is lateral traffic too — agent-to-agent calls in multi-agent topologies, and requests to locally deployed MCP servers — but that stays within the cluster and fits the standard east-west model reasonably well. The hard part is the external egress: unpredictable in timing, long-lived when streaming, and carrying all the operationally critical signals — cost, model selection, token usage — buried inside the request and response bodies. None of that is visible to standard Kubernetes networking primitives.

Controlling Egress

By default, Kubernetes allows all egress from pods. For agent workloads this is immediately problematic: any agent pod can reach any external endpoint, including exfiltrating data through an LLM prompt or making unconstrained calls that exhaust a provider quota in minutes.

The first instinct is to apply NetworkPolicy egress rules, but this hits a fundamental wall: NetworkPolicy speaks IP CIDRs. LLM provider APIs sit behind CDNs with continuously rotating IPs, so maintaining a static allowlist for api.openai.com is a losing game. CNIs that support FQDN-based egress rules — Cilium being the most capable option — solve this at the DNS level, letting you express intent by hostname rather than by IP block.
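As a sketch, a CiliumNetworkPolicy using toFQDNs might look like the following. The pod labels and the allowed hostnames are illustrative, not a recommendation:

```yaml
# Hypothetical example: allow agent pods to reach only approved LLM
# provider hostnames. Cilium resolves the FQDNs via its DNS proxy, so
# the allowlist tracks rotating CDN IPs automatically.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: agent-llm-egress
spec:
  endpointSelector:
    matchLabels:
      app: agent                  # illustrative label
  egress:
    # Agents must be able to resolve DNS, and Cilium must observe the
    # responses to map hostnames to IPs — hence the dns rule.
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Only these hostnames are reachable, and only on 443.
    - toFQDNs:
        - matchName: api.openai.com
        - matchName: api.anthropic.com
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```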

FQDN-based policies give you L3/L4 control: you know which pods can reach which hostnames. What you still don't know is what they're sending, how many tokens are being consumed, or whether the traffic represents a bug-induced infinite loop burning through your budget. For that, address filtering is not enough — you need a chokepoint that can inspect the traffic.

Why a Regular Gateway Isn't Enough

The natural next step from FQDN policies is an egress proxy — an Envoy or Nginx instance that all agent traffic is funneled through. This is genuinely useful: egress isolation means agent pods have no direct NetworkPolicy egress to provider endpoints, request-count rate limiting gives you a coarse enforcement point, and you get basic access logs with URLs, status codes, and latency.

But standard gateways operate at the HTTP envelope level. They see method, URL, headers, and status code. For LLM traffic, the signals that actually matter are in the body:

  • The model name is in the request body ("model": "gpt-4o"), not the URL.
  • Prompt and completion token counts are in the response body JSON.
  • With stream: true, there is no single response body — only a sequence of SSE events, each carrying a token delta, flowing through the connection until the stream terminates.
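Concretely, for a non-streaming OpenAI-style chat completion, the signals sit in the bodies. The values below are illustrative:

```
# request body (POST https://api.openai.com/v1/chat/completions)
# — the model name is here, not in the URL
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this document."}]}

# response body — the token accounting is here
{"id": "chatcmpl-abc123",
 "choices": [{"index": 0, "message": {"role": "assistant", "content": "..."}}],
 "usage": {"prompt_tokens": 52, "completion_tokens": 118, "total_tokens": 170}}
```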

Body inspection is not an oversight in regular gateways — it is a deliberate trade-off. Parsing and buffering request and response bodies adds CPU overhead, increases memory pressure proportional to payload size, and for streaming responses requires the gateway to consume the entire stream before forwarding it, adding latency that is unacceptable for interactive workloads. For the traffic patterns gateways were designed for — short JSON API responses, HTML pages, binary assets — the trade-off is sensible. For LLM traffic, where payloads routinely run to tens of kilobytes and responses stream for seconds, it becomes a fundamental capability gap.

Without body inspection you lose: token-based rate limiting (request count can be off by orders of magnitude as a cost proxy), model-aware routing, transparent provider failover, per-provider API key injection, and any form of semantic caching. A regular gateway gives you a controlled egress point but leaves you blind to everything that actually matters for LLM traffic.

The LLM Gateway

This is the gap that LLM-specific gateways fill. LiteLLM Proxy sits between agent pods and LLM providers, exposing a single OpenAI-compatible endpoint to agents while handling everything a regular gateway cannot:

  1. Token-based rate limiting. Budgets are enforced in tokens per agent type per minute, not in request counts. A single long-context call can consume as many tokens as hundreds of simple queries — request counting misses this entirely.

  2. API key injection. Agent pods carry no real provider credentials. Each agent authenticates to the gateway with a virtual key scoped to its identity. The gateway holds the actual provider secrets (in a K8s Secret, or synced via External Secrets Operator) and injects them per outbound request. Key rotation requires updating the gateway's configuration, not redeploying every agent.

  3. Intelligent routing. The same /v1/chat/completions request can be directed to different backends depending on model name, prompt token count, provider availability, or latency requirements. If OpenAI returns a 429, the gateway retries transparently against Anthropic or a local vLLM instance without the agent knowing.

  4. Semantic caching. Responses to semantically similar prompts are served from cache, reducing latency and cost for agents with repetitive retrieval or classification patterns.

  5. Unified observability. Every LLM call — regardless of provider, model, or originating agent — flows through one point. Token usage, cost, and latency are available in one place without instrumenting each agent individually.
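A minimal LiteLLM config sketch covering points 1–3 is shown below. The model names, backends, and fallback chain are assumptions for illustration; field names follow LiteLLM's config format, but verify against the version you deploy. Per-key token budgets (tpm limits) are attached to virtual keys minted through the gateway's key-management API rather than set here:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY        # injected from a K8s Secret
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: local-fallback
    litellm_params:
      model: hosted_vllm/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm.inference.svc.cluster.local:8000  # illustrative in-cluster vLLM

litellm_settings:
  # On provider errors such as a 429 from OpenAI, retry transparently
  # against these deployments in order.
  fallbacks:
    - gpt-4o: [claude-sonnet, local-fallback]
  num_retries: 2

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY     # used to mint per-agent virtual keys
```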

In Kubernetes, the deployment pattern is a LiteLLM Deployment behind a ClusterIP Service. Only the gateway pods hold an egress NetworkPolicy to external provider endpoints; agent pods call http://llm-gateway.<namespace>.svc.cluster.local/v1/chat/completions using a standard OpenAI client. The gateway is the only pod with egress to external LLM APIs — a compromised agent pod has no direct path to any provider.
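As a sketch of the agent-side lockdown, a standard NetworkPolicy can restrict agent pods to DNS plus the gateway pods. Labels, namespace names, and the port are illustrative (4000 is LiteLLM's default listen port):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agents-egress-gateway-only
  namespace: agents                  # illustrative namespace
spec:
  podSelector:
    matchLabels:
      app: agent                     # illustrative label
  policyTypes:
    - Egress
  egress:
    # Allow DNS resolution.
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # The only other allowed path: the LLM gateway pods.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llm-gateway
          podSelector:
            matchLabels:
              app: llm-gateway
      ports:
        - port: 4000
          protocol: TCP
```

Note that NetworkPolicy selects pods, not Services — the rule targets the gateway pods directly, and traffic to the ClusterIP Service resolves to them.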

Token Counting Through a Stream

Token-based rate limiting works cleanly for non-streaming calls: the response body includes a usage object with exact prompt and completion token counts. Streaming breaks this. With stream: true, the provider sends a sequence of SSE events carrying token deltas — there is no single response body, and token counts are only fully known when the stream terminates.

Three approaches exist for a gateway to handle this:

  1. Buffer the full stream before forwarding to the agent. Token counts are exact, but this destroys the streaming UX and adds substantial latency before the first token arrives.
  2. Count token deltas from each SSE chunk in real time as they pass through. Works, but requires parsing every chunk and maintaining per-stream state.
  3. Extract token counts from the terminal SSE event. Most providers include a final usage chunk at the end of the stream with total prompt and completion tokens.

LiteLLM uses option 3: chunks are forwarded transparently as they arrive, and usage is captured from the last event. The implication is that token budgets for streaming calls are enforced post-hoc — you don't know the final cost until the stream ends. For pre-call gates you can use prompt token estimates, which are available before the call since the full prompt is in the request body.
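For reference, the tail of an OpenAI-style stream looks roughly like the snippet below. The field values are illustrative; note that OpenAI's Chat Completions API only emits the final usage chunk when the request sets stream_options: {"include_usage": true}, which the gateway can add on the agent's behalf:

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"}}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":52,"completion_tokens":118,"total_tokens":170}}

data: [DONE]
```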

Trade-offs

An LLM gateway is not free. It introduces a new component in the critical path of every LLM call with the associated operational weight:

What you gain:

  • Token-aware rate limiting and cost control
  • API key centralization — single rotation point, agents carry no credentials
  • Provider failover behind a consistent OpenAI-compatible interface
  • Semantic caching shared across all agent types
  • Unified observability across all providers and models
  • No vendor lock-in in agent code

What you pay:

  • Extra hop adds latency — a few ms typically, but first-token latency matters for streaming
  • New component that must be kept highly available
  • Single point of failure: a misconfiguration takes down LLM access for all agents simultaneously
  • Caching requires buffering the full response, which defeats streaming for cache hits
  • At scale, handling all concurrent streaming connections puts significant memory pressure on the gateway
  • Routing rules and virtual key management add operational complexity

For most production deployments the trade-offs favor the gateway: the control it provides over cost, credentials, and routing is hard to replicate at the application level, and the operational overhead is manageable. The critical risk is availability — if the gateway goes down, all agents lose LLM access. Treat it accordingly: multiple replicas, readiness probes, and circuit breakers on provider connections.
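In manifest terms, the availability advice translates to something like the excerpt below. The probe paths match LiteLLM's documented health endpoints, but verify them (and pin a real image tag) for your version; replica counts are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-gateway
spec:
  replicas: 3                        # survive a single-pod failure
  selector:
    matchLabels:
      app: llm-gateway
  template:
    metadata:
      labels:
        app: llm-gateway
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest   # pin a specific tag in production
          ports:
            - containerPort: 4000
          readinessProbe:
            httpGet:
              path: /health/readiness    # LiteLLM's readiness endpoint
              port: 4000
            initialDelaySeconds: 10
          livenessProbe:
            httpGet:
              path: /health/liveliness   # LiteLLM's liveness endpoint
              port: 4000
---
# Keep at least two replicas up during voluntary disruptions (node drains, upgrades).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-gateway-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-gateway
```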

In some organizations the gateway is not optional at all. Security and compliance policies often restrict which LLM providers agents are permitted to call — approved vendors only, no direct internet access to arbitrary model APIs. A gateway becomes the single enforcement point for that policy: it is the only pod with egress to approved providers, and routing rules determine which agent identities are permitted to reach which backends. Without it, enforcing provider allowlists at the agent level is a configuration problem scattered across every agent deployment — fragile and easy to bypass.