Instant Agents: How Sub-Second Infrastructure Changes What Agents Can Be

You want to run your own model. Maybe a fine-tuned Llama from Hugging Face, maybe something you trained yourself. The problem is everything that comes after: renting GPUs, building images, managing CUDA drivers, writing autoscaling logic, paying for idle capacity while nobody is hitting your endpoint. It's expensive, it's operationally heavy, and most teams either overpay for managed inference or burn weeks on plumbing.

Modal is a serverless GPU platform that collapses most of that. You define infrastructure in Python decorators (image, GPU class, scaling) and Modal handles the rest. The headline claim: sub-second container starts for cached images, and ~50s for a full GPU inference server cold boot, down from ~2000s for a naive setup.

Modal has published the internals, and they're worth understanding. Four independent systems compound: cloud instance buffers, a custom FUSE filesystem, CPU checkpoint/restore, and GPU checkpoint/restore.

What Modal actually is

No YAML, no Dockerfiles in the critical path. Images, hardware, and scaling are defined in Python decorators:

```python
import modal

image = modal.Image.debian_slim().pip_install("vllm")
app = modal.App("llm-inference", image=image)

@app.function(gpu="A100", scaledown_window=60)
def complete(prompt: str) -> str:
    from vllm import LLM, SamplingParams
    llm = LLM("meta-llama/Llama-3.1-8B-Instruct")
    return llm.generate([prompt], SamplingParams(max_tokens=256))[0].outputs[0].text
```

Image, GPU class, autoscaling, and the function all live in one file. Run modal deploy and it's serving. (Production usage loads the model once via @modal.enter rather than per call; the snapshotting section below describes that pattern.)

Containers run on gVisor (runsc), a syscall-interception sandbox that provides stronger isolation than a default Docker container. Billing is per-second with scale-to-zero and elastic GPU access (T4 through B200, multi-GPU). Beyond inference and training, Modal offers Sandboxes: ephemeral containers for running untrusted code (50,000+ concurrent), purpose-built for coding agents. Anthropic recently named Modal as one of the managed service providers for its self-hosted sandboxes, alongside Cloudflare, Daytona, and Vercel. Modal's acquisition of Jamsocket adds session-scoped compute via Plane (an open-source orchestrator) and ForeverVM (stateful Python REPLs for AI code execution).

Layer 1 — Cloud buffers: taking instance allocation off the hot path

Provisioning a cloud instance and health-checking its GPUs takes minutes to tens of minutes. Modal maintains a shared buffer of idle, healthy GPUs across multiple clouds, so new replicas schedule onto pre-warmed machines immediately while the buffer refills asynchronously. A resource solver (linear programming with Google's GLOP) manages allocation based on scraped prices and observed supply. Active checks on boot, passive monitoring, and weekly deep diagnostics (dcgmi diag) keep the fleet healthy. Paying for idle capacity in the buffer is the price of robustness.
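To make the hot-path/slow-path split concrete, here is a minimal sketch of a warm buffer in plain Python. It is not Modal's implementation: the WarmBuffer class, the node names, and the instantaneous refill are all assumptions for illustration, and it deliberately leaves out the linear-programming solver and the health checks.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WarmBuffer:
    """Toy pool of pre-provisioned, already health-checked machines."""
    target_size: int
    idle: deque = field(default_factory=deque)
    provisioned: int = 0  # total machines ever provisioned

    def refill(self) -> int:
        """Top the buffer back up to target_size and return how many were added.

        In a real system this is the slow, asynchronous path (provisioning
        plus GPU health checks). Here it completes instantly.
        """
        added = 0
        while len(self.idle) < self.target_size:
            self.provisioned += 1
            self.idle.append(f"gpu-node-{self.provisioned}")
            added += 1
        return added

    def acquire(self) -> Optional[str]:
        """Hot path: hand out an idle machine without provisioning anything."""
        return self.idle.popleft() if self.idle else None


buf = WarmBuffer(target_size=3)
buf.refill()               # slow path, done ahead of demand
first = buf.acquire()      # hot path: immediate
second = buf.acquire()
remaining = len(buf.idle)  # buffer has drained to 1
added = buf.refill()       # slow path again, off the request's critical path
```

The request only ever touches acquire(); everything expensive happens in refill(), which is why the buffer size (and its idle cost) is the knob that trades money for latency.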

Layer 2 — ImageFS: a lazy-loading, content-addressed FUSE filesystem

docker run pulls an entire image (often GBs) sequentially, layer by layer, which takes minutes. Modal separates the container launcher from image delivery. The runtime is gVisor (runsc); image delivery goes through ImageFS, a custom FUSE filesystem written in Rust. Instead of pulling a full image, ImageFS loads a ~5MB metadata index (1–100ms), mounts via FUSE (~2ms), and fetches file contents on demand. This works because most image bytes are never touched: research shows only 6.4% of the data in a pulled image is actually needed.

Behind the mount sits a multi-tier content-addressed cache: page cache (µs, 10–40 GiB/s) → local SSD (100µs, 4 GiB/s) → regional CDN (100ms) → blob storage (200ms). Content-addressing deduplicates shared bytes across images regardless of layer structure. Lazy loading itself is not unique to Modal — AWS Lambda uses a similar approach. The innovation is combining it with gVisor, the custom FUSE filesystem, and the multi-tier cache.
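The two ideas in this section, a small metadata index with lazy reads and a content-addressed cache with several tiers, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ImageFS: TieredStore, LazyImage, the in-memory dicts standing in for each tier, and the file paths are all hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content address: the SHA-256 of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


class TieredStore:
    """Content-addressed blobs behind an ordered list of cache tiers."""

    def __init__(self, tier_names):
        # Fastest tier first; the last tier is the source of truth.
        self.tiers = [(name, {}) for name in tier_names]

    def put_origin(self, data: bytes) -> str:
        key = digest(data)
        self.tiers[-1][1][key] = data
        return key

    def get(self, key: str):
        """Return (data, name of the tier that served it), promoting on a hit."""
        for i, (name, blobs) in enumerate(self.tiers):
            if key in blobs:
                data = blobs[key]
                for _, faster in self.tiers[:i]:
                    faster[key] = data  # copy into every faster tier
                return data, name
        raise KeyError(key)


class LazyImage:
    """Metadata index (path -> digest); contents are fetched on first read."""

    def __init__(self, index, store):
        self.index = index
        self.store = store

    def read(self, path: str):
        return self.store.get(self.index[path])


store = TieredStore(["page_cache", "local_ssd", "regional_cdn", "blob_storage"])
shared = store.put_origin(b"libcuda bytes")

# Two different images, different paths, same underlying bytes.
image_a = LazyImage({"/usr/lib/libcuda.so": shared}, store)
image_b = LazyImage({"/opt/lib/libcuda.so": shared}, store)

_, first_tier = image_a.read("/usr/lib/libcuda.so")   # served from blob_storage
_, second_tier = image_b.read("/opt/lib/libcuda.so")  # served from page_cache
```

Because the key is derived from content rather than from a layer or path, the second image gets a cache hit on bytes that a different image pulled in, which is the deduplication property described above.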

Layer 3 — CPU memory snapshotting: checkpoint/restore for process state

Even with a fast filesystem, Python's module loading is brutal — import torch alone executes 26,000 syscalls. The insight: a running process is just a heap, threads, and file descriptors. Serialize it to disk and restore directly, skipping re-execution.

Modal uses gVisor's built-in checkpoint/restore (not Linux CRIU). The checkpoint is a single pages.img file (100MB to multi-GB), delivered through the same ImageFS FUSE machinery. The user-facing API: @app.cls(enable_memory_snapshot=True) with @modal.enter(snap=True) for pre-snapshot work (model loading, torch imports) and @modal.enter() for post-snapshot setup (e.g., moving weights to GPU). Caveat: snapshots are sensitive to host CPU instruction set — a multi-cloud platform needs multiple snapshots per deployment.

  • import torch: 5s → 1.05s (p50), 0.69s (p0)
  • Stable Diffusion cold start: 13s → 3.5s
  • General speedup: ~2.5x
  • Source: Memory Snapshots blog post

Layer 4 — GPU memory snapshotting: CUDA checkpoint/restore

Host-side snapshots don't capture GPU state — CUDA graphs, torch.compile artifacts, loaded kernels. For vLLM and SGLang, CUDA graph capture and JIT compilation alone take tens of seconds to minutes. NVIDIA's CUDA checkpoint/restore API (basic support from driver 550; full feature set from 570+) checkpoints device memory into host memory, which the existing CPU snapshot machinery persists to disk. Restore reverses the flow. Not everything should be snapshotted: model weights are better reloaded (throughput-bottlenecked, not compute-bottlenecked), and the KV cache is faster to recreate fresh. Multi-GPU adds complexity — NCCL programs can deadlock when one peer pauses.

  • 4–10x speedup over baseline
  • Parakeet (NeMo): 20s → 2s
  • vLLM (Qwen 3 0.6B): 95.6s → 13.8s (mean)
  • SGLang (Qwen 3 0.6B): 83.7s → 17.5s (mean)
  • Scale (Feb–Apr 2026): ~35M CPU + ~15M GPU snapshot restores
  • Source: GPU Memory Snapshots blog post
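As a quick sanity check, the per-workload speedups implied by the before/after numbers in the list can be computed directly. The figures below are copied from that list; nothing else is assumed.

```python
# (before_seconds, after_seconds) as reported in the list above.
cold_starts = {
    "Parakeet (NeMo)": (20.0, 2.0),
    "vLLM (Qwen 3 0.6B)": (95.6, 13.8),
    "SGLang (Qwen 3 0.6B)": (83.7, 17.5),
}

speedups = {name: before / after for name, (before, after) in cold_starts.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# Parakeet (NeMo): 10.0x
# vLLM (Qwen 3 0.6B): 6.9x
# SGLang (Qwen 3 0.6B): 4.8x
```

All three fall inside the quoted 4–10x range, with Parakeet at the top and SGLang near the bottom.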

The compound effect — from 2000s to 50s

No single layer is revolutionary. What makes Modal's stack interesting is that the layers compound: each removes a different bottleneck on the same critical path, so the gains stack rather than overlap.

  • Instance allocation: minutes → removed from hot path (buffer)
  • Filesystem: minutes → seconds (lazy loading + cache)
  • CPU init: seconds → sub-second (checkpoint/restore)
  • GPU init: minutes → seconds (CUDA checkpoint/restore)
  • Total: inference server cold start from ~2000s → ~50s

The bigger picture — instant infrastructure for instant agents

Modal is not an agent platform. It solves the compute layer — containers, GPUs, code execution — but provides no orchestration, no memory system, no tool registry. You bring your own agent framework (LangGraph, OpenAI Agents SDK, your own code) and Modal provides the fast, isolated compute underneath. Companies like Lovable and DoorDash are already doing this. But the interesting question isn't whether Modal can run agents — it's what happens to agent architecture when the compute layer is this fast.

Today's agents don't have a cold start problem. Most are long-running processes — coding assistants, research loops, multi-step pipelines orchestrated by Temporal or LangGraph — where you pay the startup cost once and the session runs for minutes or hours. The workarounds are good enough, and the industry has built around them. Frameworks like Restate are designed for durable execution. Google ADK markets the ability to "build long-running agents that pause, resume, and never lose context". The prevailing wisdom is that agents need long-running, stateful processes.

That wisdom rests on an assumption: agents are long-running because the infrastructure makes starting them expensive. What if they don't need to be?

Combine instant compute (Modal: sub-second containers, GPU snapshots) with instant inference and externalized state, and agents become ephemeral actions — spin up, load context, execute, write results, die. The "agent" is the workflow orchestration, not the process. Modal is already moving here with Sandboxes (50,000+ concurrent, sub-second startup, filesystem and memory snapshots for instant resume), and the inference side is shifting just as fast. Cerebras is hitting 2,600+ tok/s on Llama 4 Scout (19x faster than GPU solutions), 2,500+ tok/s on Maverick (~2.4x Blackwell), and 3,000 tok/s on OpenAI gpt-oss-120B — all driven by their WSE-3's 44GB on-chip SRAM at 21 PB/s bandwidth, which eliminates the HBM bottleneck that limits GPU inference. Taalas takes a more radical approach: their HC1 chip etches the model weights directly into silicon (mask ROM, no HBM, TSMC 6nm), hitting ~17,000 tok/s per user on Llama 3.1 8B — at the cost of running only that one model. Different paths, same direction: inference latency is collapsing.

This isn't just about inference. Every layer in the agent stack is converging on the same "instant" target: container startup, model loading, token generation, and now the agent runtime itself. Agent Substrate, a new project from Kubernetes co-founder Tim Hockin, multiplexes hundreds of stateful agent sessions onto a small pool of Pods with sub-second activation and full state snapshots, reaching 30× oversubscription in early demos. It's solving the same problem as Modal Sandboxes, but natively on Kubernetes. The pattern is the same everywhere: reduce startup time, externalize state, treat the process as disposable.

When every part of the pipeline is instant, the cost model flips. Long-running agents pay for idle time between actions; ephemeral agents pay only for execution and scale to zero between steps. A "session" becomes a chain of ephemeral invocations with shared external state.
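A session as "a chain of ephemeral invocations with shared external state" can be sketched like this. Everything here is a stand-in: STATE_STORE is a plain dict playing the role of an external store (something like modal.Dict or a database), and the uppercase transform is a placeholder for the real work an agent step would do.

```python
import json

# Stand-in for an external store such as modal.Dict or a database.
STATE_STORE: dict[str, str] = {}


def run_step(session_id: str, action: str) -> int:
    """One ephemeral invocation: hydrate state, act, persist, exit."""
    raw = STATE_STORE.get(session_id)
    state = json.loads(raw) if raw else {"history": []}

    # Placeholder for the real work (an LLM call, a tool call, ...).
    state["history"].append(action.upper())

    STATE_STORE[session_id] = json.dumps(state)
    return len(state["history"])  # no in-process state survives the call


# Each call could run in a different, freshly started container.
for action in ["plan", "search", "summarize"]:
    steps_done = run_step("session-42", action)

final_history = json.loads(STATE_STORE["session-42"])["history"]
```

Nothing in run_step depends on the process that ran the previous step, so the compute can scale to zero between steps and the "session" exists only as a key in the store.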

The obvious counter-argument: state reload becomes the new bottleneck. If an ephemeral agent hydrates context from external storage on every invocation, that I/O cost replaces the cold start cost. But context is orders of magnitude smaller than a full process image (KBs–low MBs vs. GBs), it can be pre-fetched or streamed, and Modal's own storage layer (modal.Volume, modal.Dict) is co-located for exactly this pattern. Ephemeral agents trade one bottleneck (process startup) for another (state hydration) — but the ratio is shifting fast, and we're likely at the beginning of a 12–18 month transition.
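The trade-off can be put into a small cost model. The function names and every input value below are hypothetical, picked only to keep the arithmetic easy to follow; they are not measurements from Modal or anyone else.

```python
def long_running_cost(session_s: float, rate_per_s: float) -> float:
    """Pay for the whole session, idle gaps between actions included."""
    return session_s * rate_per_s


def ephemeral_cost(steps: int, exec_s: float, startup_s: float,
                   hydrate_s: float, rate_per_s: float) -> float:
    """Pay per step: startup + state hydration + actual execution."""
    return steps * (startup_s + hydrate_s + exec_s) * rate_per_s


# Hypothetical inputs, for illustration only.
rate = 0.001           # dollars per second of compute
session_s = 600.0      # a ten-minute session, mostly idle
steps, exec_s = 20, 3.0

baseline = long_running_cost(session_s, rate)
slow = ephemeral_cost(steps, exec_s, startup_s=50.0, hydrate_s=0.5, rate_per_s=rate)
fast = ephemeral_cost(steps, exec_s, startup_s=0.5, hydrate_s=0.5, rate_per_s=rate)
# baseline = 0.60, slow = 1.07, fast = 0.08 (dollars)
```

With a slow startup, the ephemeral design costs more than just keeping the process alive; once startup and hydration are both sub-second, the per-step overhead is small next to the idle time it eliminates. The break-even point moves with the ratio of overhead to idle time, which is the ratio the paragraph above says is shifting.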

Where this leaves us

The infrastructure thesis is simple: fast-enough startup changes what's architecturally possible. We're moving from "agents are long-running processes" to "agents are instant actions triggered by events." The companies that figure out the orchestration layer on top of instant infrastructure will define the next wave.

What does agent architecture look like when startup is free?