Networking AI Agents on Kubernetes (2): Local LLM Inference

In the previous post we established that agent traffic is dominated by outbound LLM calls, and that the right architecture is an LLM gateway — LiteLLM sitting between agent pods and providers, handling token-based rate limiting, API key injection, and intelligent routing. The gateway diagram showed three backends: api.openai.com, api.anthropic.com, and a local vLLM instance inside the cluster. This post is about that third backend.

Most agent deployments start with external LLM APIs. But several forces push towards running inference locally. At high request volume, per-token costs compound quickly and dedicated GPU compute becomes cheaper. Latency becomes harder to control when you depend on external SLAs and provider-side rate limits. Fine-tuned or quantized models built for specific tasks may not be available via any external API.

Sometimes the decision isn't economic at all — it's a hard constraint. In the previous post we noted that some organizations mandate an LLM gateway because only certain approved providers are permitted. The logical extension of that policy, in regulated industries where data residency requirements are strict, is that no external provider is permitted at all. Prompt content cannot leave the cluster. Local inference stops being an architectural option and becomes the only viable path.

Running inference locally on Kubernetes is genuinely harder than deploying a typical stateless workload. GPUs behave nothing like CPU or memory. Models are too large for container images. Scaling an inference server has characteristics that standard autoscalers were not designed for.

GPU as a Different Resource

CPU and memory in Kubernetes are divisible and overcommittable — you can request fractional cores, set limits above requests, and the scheduler handles contention. GPUs are none of those things. nvidia.com/gpu: 1 is an opaque extended resource: a pod either gets the full GPU or it doesn't, and once allocated, that GPU is held for the pod's entire lifetime regardless of how much it is actually being used. For inference pods serving sporadic agent requests, this is expensive — an idle pod still holds its GPU allocation.
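The all-or-nothing semantics show up directly in the pod spec. A minimal sketch (names and image are illustrative); note that for extended resources only a limit is meaningful, since requests must equal limits and overcommit is impossible:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod            # hypothetical name
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: 1        # opaque extended resource: a whole GPU,
                                 # no fractions, no overcommit, held for
                                 # the pod's entire lifetime
```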

GPU memory is the actual constraint, not compute. A model's memory footprint determines which GPUs can host it and how many instances can co-exist. Quantization directly affects this: a 7B model in fp16 requires ~14GB of VRAM, in int8 ~7GB, in GGUF 4-bit around 4GB. These differences determine whether you need a full A100 80GB or a smaller partition. Scheduling decisions downstream of this are mostly mechanical — the real choice happens when you pick the quantization format.
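The arithmetic behind those numbers is simple enough to sketch. This estimates weight memory only; the KV cache and activation memory add a runtime overhead on top, so real headroom requirements are higher:

```python
def weights_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory footprint of model weights alone, in GB.

    KV cache and activations are not included; treat this as a lower bound.
    """
    return params_billions * bytes_per_param

# Figures from the text: a 7B model at different quantization levels.
print(f"fp16:  ~{weights_vram_gb(7, 2.0):.1f} GB")   # 2 bytes/param
print(f"int8:  ~{weights_vram_gb(7, 1.0):.1f} GB")   # 1 byte/param
print(f"4-bit: ~{weights_vram_gb(7, 0.5):.1f} GB")   # 0.5 bytes/param
```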

Topology adds another dimension. Models that exceed a single GPU's memory use tensor parallelism, splitting weights across multiple GPUs. Those GPUs must be on the same node and connected via NVLink for acceptable bandwidth. Kubernetes has no native concept of NVLink topology; you work around it with node labels, affinity rules, and topologySpreadConstraints, or with the LeaderWorkerSet (LWS) API for multi-node inference.
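The label-based workaround looks something like the sketch below. The topology label is a site-specific convention applied when nodes are provisioned, not anything Kubernetes understands natively; the pod name and model are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-tp4                      # hypothetical
spec:
  nodeSelector:
    gpu.topology/nvlink: "true"       # hypothetical label set at provisioning
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args: ["--model", "meta-llama/Llama-3.1-70B-Instruct",
           "--tensor-parallel-size", "4"]   # split weights across 4 GPUs
    resources:
      limits:
        nvidia.com/gpu: 4             # all four must come from the same node
```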

Sharing GPUs: Time-Slicing vs MIG

Full GPU-per-pod allocation is often wasteful, particularly when hosting multiple smaller models or running mixed development and production workloads. Two mechanisms exist for sharing a physical GPU across pods.

Time-Slicing

Configured on the NVIDIA device plugin DaemonSet via a ConfigMap. It exposes N logical GPU resources per physical GPU, each receiving equal time on the hardware. From a pod's perspective it has exclusive access to a GPU; in reality it shares both compute time and VRAM with other slices — there is no memory isolation. One workload can exhaust VRAM and cause OOM errors in adjacent pods.
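Assuming the NVIDIA device plugin's sharing config format, the ConfigMap is short; the device plugin (or GPU Operator) must then be pointed at it:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4     # each physical GPU is advertised as 4 schedulable
                          # GPUs; compute time is shared, VRAM is NOT isolated
```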

Good for: development clusters and small models where predictability isn't critical. Not suitable for: production inference under concurrent load, where the absence of isolation is a meaningful risk.

MIG (Multi-Instance GPU)

Available on A30, A100, H100, and newer architectures (H200, B200). Partitions the GPU at the hardware level — each MIG instance gets dedicated compute engines, a fixed VRAM slice, and isolated memory bandwidth, giving true isolation with QoS guarantees. Instances appear as distinct resources in Kubernetes (nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb, etc.) and are requested like any other extended resource. An A100 80GB can be partitioned as a single 7g.80gb, or split into combinations such as two 2g.20gb and one 3g.40gb.
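From a workload's point of view a MIG slice is requested exactly like a whole GPU, just under the profile's resource name. A container resources fragment:

```yaml
    resources:
      limits:
        nvidia.com/mig-2g.20gb: 1   # a hardware-isolated 20GB slice;
                                    # the profile must already be configured
                                    # on the node
```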

The limitation is rigidity. MIG profiles are configured statically on the node and cannot be changed without taking the GPU offline. If your workload mix shifts (more small models, a new large one), repartitioning means a node maintenance operation: cordon and drain the node, apply the new profile layout, and bring it back. Plan partitioning upfront.

Good for: production multi-tenant clusters where different agent types run different model sizes and need isolated, predictable latency.

DRA (Dynamic Resource Allocation)

Introduced as alpha in Kubernetes 1.26 and significantly restructured in subsequent releases, DRA will eventually replace the device plugin model, enabling more expressive allocation semantics and dynamic GPU sharing. NVIDIA has a DRA driver in development. Not yet ready for production at most organisations, but worth watching.

The Serving Layer

vLLM has become the de facto standard for high-throughput inference in Kubernetes. Two features drive this:

  1. PagedAttention eliminates KV cache fragmentation by managing cache memory in non-contiguous pages, analogous to how an OS manages virtual memory. This dramatically improves GPU memory utilisation under concurrent load.
  2. Continuous batching processes new requests as tokens are generated rather than waiting for a full batch to complete, keeping GPU utilisation high even with variable request sizes.

vLLM exposes an OpenAI-compatible REST API, which means it plugs directly into the LLM gateway pattern covered in the previous post — LiteLLM routes to an internal vLLM Service exactly as it routes to api.openai.com, with no changes to agent code.
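In LiteLLM's config the in-cluster backend is just another model_list entry. A sketch, with a hypothetical model alias and Service name; the openai/ prefix tells LiteLLM to treat the backend as an OpenAI-compatible endpoint:

```yaml
model_list:
  - model_name: local-llama                    # alias that agents request
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm.inference.svc.cluster.local:8000/v1  # hypothetical Service
      api_key: "none"                          # vLLM requires no key by default
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY       # injected by the gateway
```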

SGLang is an emerging alternative worth tracking for agentic workloads specifically. Its RadixAttention mechanism caches the KV state of shared prompt prefixes — valuable when many agents share the same system prompt, which is common in multi-agent deployments.

Model Storage

Models are large. A 7B parameter model in fp16 is approximately 14GB; a 70B model is around 140GB. Container images are the wrong abstraction — pushing a 14GB layer to a registry and pulling it on every pod start is impractical at that scale. The model must live outside the container image.

Three patterns exist, each with a different point on the simplicity-vs-latency trade-off.

  1. Init container — downloads the model from object storage (S3, GCS, HuggingFace Hub) when the pod starts. Operationally simple — no pre-provisioning required — but the cold start cost is prohibitive for large models: 10 to 30 minutes before inference can begin.
  2. PersistentVolume — stores the model once and mounts it into inference pods. ReadWriteMany volumes (NFS, CephFS) allow sharing across replicas but add network latency on the large sequential reads that happen at load time. ReadWriteOnce with node affinity keeps reads local and fast, but binds the inference pod to a specific node, constraining scheduling flexibility.
  3. Node-local storage — pre-pulls the model to node SSDs (via a per-node Job or a DaemonSet that downloads it) and mounts it with hostPath. The fastest option at load time, but requires every GPU node to be pre-provisioned with the model before inference pods can schedule there, and nodes must carry sufficient local storage capacity.

For most deployments, a PersistentVolume with ReadWriteOnce and node affinity is the practical starting point: fast enough, simpler than pre-provisioning, and it allows migration to node-local storage when latency requirements justify the operational overhead.
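A local PersistentVolume captures that starting point: the model is downloaded to a node once, and the volume's nodeAffinity (mandatory for local volumes) pins consumers to that node. Names, paths, and the storage class are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llama-8b-weights           # hypothetical
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-nvme     # hypothetical class
  local:
    path: /mnt/models/llama-8b     # model downloaded here once, out of band
  nodeAffinity:                    # required for local volumes; this is what
    required:                      # binds inference pods to this node
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["gpu-node-1"]   # hypothetical node
```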

Scaling Inference

Autoscaling an inference server is not like autoscaling a web service. CPU utilisation is a poor signal — during active inference it spikes, between requests it drops to near zero, and neither reading reflects whether the server is actually overloaded. Request queue depth is the right signal: vllm:num_requests_waiting, exposed via vLLM's Prometheus endpoint, reflects how many requests are queued waiting for GPU time — the actual bottleneck.

Standard HPA can consume this metric via the Prometheus adapter, but the configuration is non-trivial. KEDA is a cleaner fit: it scales directly on raw Prometheus queries without the adapter layer. The vLLM Production Stack Helm chart has native KEDA integration — a ScaledObject targeting a vLLM Deployment is enabled through a single values.yaml block, with configurable thresholds, cooldown periods, and fallback replica counts.
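A standalone ScaledObject for the same setup is equally compact. A sketch, assuming a vLLM Deployment named vllm and an in-cluster Prometheus; the threshold and cooldown values are illustrative, not recommendations:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm                     # hypothetical vLLM Deployment
  minReplicaCount: 1               # keep one warm replica; cold starts take minutes
  maxReplicaCount: 4
  cooldownPeriod: 300              # avoid flapping while new replicas load the model
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090  # hypothetical
      query: sum(vllm:num_requests_waiting)   # queue depth: the real bottleneck signal
      threshold: "10"
```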

Even with the right signal, general-purpose autoscalers have limits with GPU inference workloads. Throughput doesn't scale linearly with replica count because batching efficiency changes as replicas share load. Cold start times are long enough that reactive scaling often adds capacity too late for latency-sensitive traffic.

Dedicated solutions are emerging that understand LLM workloads natively:

  1. AIBrix — open-source, from ByteDance, under the vllm-project umbrella. Provides LLM-specific autoscaling policies and a Kubernetes-native control plane for vLLM, designed around inference traffic characteristics rather than generic request rates.

  2. NVIDIA Grove — a broader platform approach, offering multi-level autoscaling with topology-aware GPU scheduling for complex inference deployments.

  3. ScaleOps — approaches the problem from a different angle. Rather than replica-based scaling, it focuses on GPU rightsizing and safe workload co-location: packing multiple models onto fewer GPUs based on actual runtime memory and compute behaviour, reducing the total GPU count needed regardless of replica count.

Cold start is the central problem. An inference pod takes several minutes to load a large model into GPU memory before it can serve any request. Scaling up in response to increased load means new capacity isn't available for minutes after it's needed. The practical implication: you cannot scale an inference server to zero if you need predictable response latency. Setting minReplicas: 1 keeps one warm instance running at all times, accepting a continuous GPU cost to avoid cold start penalty on the first request.

Conclusions

Running vLLM in-cluster doesn't change the LLM gateway picture from the previous post. vLLM exposes an OpenAI-compatible endpoint; LiteLLM adds it as a named backend alongside external providers. From the agent's perspective nothing changes — the same POST /v1/chat/completions call reaches whichever backend the gateway's routing rules select. Workloads subject to data residency requirements route to the local cluster; cost-optimised workloads route to cheaper external models; the gateway falls back to external providers if the local inference server is overloaded or unavailable.

The GPU infrastructure is complex to operate, but it remains hidden behind the same interface agents already use.