Agent Runtime Isolation: Docker, Firecracker, VM Sandbox — How to Choose

Q: Is Docker with a seccomp profile enough?

No. seccomp only filters system calls; Docker's default runtime (runc) still shares the host's Linux kernel, so a single exploitable kernel vulnerability (CVE) can let an attacker break out of the container onto the host. AWS reached a similar conclusion for Lambda: shared-kernel containers could not meet its multi-tenant isolation requirements and QEMU's attack surface was too large, so it built Firecracker (released in 2018). The right approach is to match the isolation strategy to the agent's risk level: low risk (personal tools) can use Docker + seccomp + hardened configuration; medium risk (internal agents) needs at least gVisor for userspace kernel interception; high risk (multi-tenant / LLM-generated code execution) should strongly prefer Firecracker or Kata for hardware-level VM isolation.

Q: How to choose between Firecracker and gVisor?

The core difference is the isolation mechanism: gVisor intercepts system calls through a userspace kernel (Sentry), while Firecracker provides a true VM boundary through KVM hardware virtualization. Selection criteria: (1) If your agent needs GPU passthrough, choose gVisor (Firecracker does not support GPU passthrough); (2) For compute-intensive, low-I/O workloads, gVisor offers the best price-performance ratio (CPU overhead <5%, I/O overhead 10–30%); (3) For multi-tenant platforms or executing untrusted LLM-generated code, Firecracker is the recommended baseline: a hardware VM boundary is familiar to auditors and independently verifiable, so it is easier to justify in audits; (4) If startup speed is the highest priority (<10ms), neither is suitable; consider WASM or containers. Performance comparison: Firecracker cold start ~125ms, memory ~5MB; gVisor cold start ~100ms, memory ~20MB.

Q: What impact do microVMs have on CI/CD pipelines?

The impact of microVMs on CI/CD pipelines is manageable and centers on three areas: 1) Image building: Firecracker uses a minimal rootfs (Alpine ~63MB vs Ubuntu ~300MB), and the build process differs from traditional Docker, but it can be integrated into existing CI pipelines; 2) Startup latency: a single cold start of ~125ms (Firecracker) to ~200ms (Kata) has minimal impact on CI tasks, which typically last seconds to minutes; 3) Resource cleanup: microVMs are destroyed on exit, leaving no residual processes or files, which actually simplifies CI environment cleanup. Recommended approach: use a warm pool to hide cold-start latency. Google GKE Agent Sandbox and AWS Lambda SnapStart have both validated this pattern, bringing sandbox dispatch down to sub-second levels.

Q: Are there any ready-to-use agent sandbox services?

Yes, divided into self-hosted vs. managed categories. Managed services: 1) E2B (open source, GitHub 12K+ stars) — a Firecracker-based agent sandbox platform, cold start 80–200ms, one of the hosted sandbox clients/providers listed in the OpenAI Agents SDK documentation, with MCP server support; 2) Docker Sandboxes — Docker's official MicroVM sandbox service launched in 2026, with an independent Docker daemon per sandbox; 3) Northflank — based on Kata Containers, with GPU support. Self-hosted options: 1) Use the kubernetes-sigs/agent-sandbox controller to deploy on GKE, supporting SandboxTemplate + WarmPool for sub-second sandbox creation; 2) Build a custom sandbox service with the Firecracker Go SDK, referencing E2B's open-source architecture. Selection principle: if your team has Kubernetes operational expertise, self-hosting offers more flexibility; if you need to get started quickly, E2B is the most mature open-source option.

2026-05-21 · Difficulty: Intermediate-Advanced · AI Agent Production Engineering Series (Part 4 of 6)

⚡ TL;DR — 30 Seconds

The Docker default runtime (runc) shares the host kernel — a single kernel CVE could lead to agent escape; it cannot be used for untrusted code execution
Production choices: gVisor (lightweight, userspace kernel interception) → Firecracker (hardware VM, ~125ms startup) → Kata Containers (Kubernetes native)
Decision rule of thumb: LLM-generated code → at minimum gVisor; multi-tenant → Firecracker/Kata; text-only agents → hardened Docker is sufficient

1. Why Docker Isn't Enough

Multiple agent-security practices and OWASP GenAI risk discussions treat untrusted code execution as a high-risk scenario and recommend isolated execution environments for LLM-generated code. Docker is most teams' first instinct — but it also happens to be the most easily misused solution.

The isolation mechanism of standard Docker containers (runc runtime) is Linux namespaces + cgroups. The core problem with this mechanism: containers share the same Linux kernel with the host. Namespaces provide view isolation for processes, networks, and filesystems, but they are not security boundaries — they are management boundaries.

"Docker containers are not virtual machines. They are process wrappers that share a kernel."

What does this mean? A single exploitable kernel vulnerability (CVE) can let malicious code inside a container break out to the host. The AWS Lambda team faced both sides of this tradeoff: shared-kernel containers were not an acceptable tenant boundary, while QEMU, the conventional VM option, carried 1.4 million lines of code, hundreds of emulated devices, and a continuous stream of CVEs. Their answer, released in 2018, was Firecracker: an ultra-minimal VMM of about 50,000 lines of Rust. AWS's choice sent a clear message: for untrusted code execution, even inside a cloud provider, shared-kernel containers are not enough.

Docker itself has acknowledged this. In April 2026, Docker launched the Docker Sandboxes product — Docker Sandboxes use microVM isolation, isolated networking, and a per-sandbox Docker Engine.

This is the first key insight this article aims to establish: Docker is an excellent packaging and distribution tool, but its default runtime (runc) is not a security boundary for agent code execution. For AI agent scenarios — where agents can be induced by prompt injection to execute arbitrary code — we need stronger isolation.

The Agent Execution Security Chain — Review

In the first article of this series (Agent Code Sandbox Design), we built a five-boundary sandbox architecture — from process isolation to network isolation. In the second article (Agent Tool Permission Control), we defined which tools the agent can invoke. In the third article (Agent Command Execution Safety), we implemented a Policy Engine to review every command. All three of these layers operate within the same runtime environment.

This article answers the next-level question: what technology should isolate the runtime environment itself? When the Policy Engine approves a command, when the agent executes code inside the sandbox — how hard is the sandbox's underlying isolation boundary? Is it a container boundary sharing the host kernel, or a standalone hardware virtualization boundary?

2. The Isolation Spectrum: From WASM to Full VMs

Runtime isolation isn't a binary choice (isolated vs. not isolated) — it's a continuous spectrum. From ultra-lightweight language-level sandboxes to full hardware virtualization, each tier makes different tradeoffs among startup speed, memory overhead, and security strength.

Six Isolation Technologies at a Glance

Technology	Isolation Mechanism	Cold Start	Memory Overhead	Host Kernel Exposure	Escape Blast Radius
WASI / WASM	Capability-based runtime	<1ms	<1MB	None	Within WASI interface
Cloudflare V8 Isolates	V8 runtime isolation	<1ms	~1MB	None	Within V8 sandbox
Docker (runc)	Linux namespaces + cgroups	~10–50ms	~10MB	Fully shared	Host kernel CVE
gVisor (runsc)	Userspace kernel (Sentry) intercepts syscalls	~100ms	~20MB	Indirect (only via Sentry)	Sentry + kernel
Firecracker	KVM hardware virtualization microVM	~125ms	~5MB	None (KVM hardware)	Hypervisor CVE
Kata Containers	Lightweight VM (KVM) wrapping OCI interface	~200ms	~30MB	None (KVM hardware)	Hypervisor CVE

(Data sources: Zylos Research 2026-04 systematic comparison; NumaVM 2026-03 Firecracker end-to-end benchmarks; NextKick Labs 2026-01 Firecracker vs Docker security comparison.)

Note: the numbers in this article come from public papers, vendor documentation, and third-party benchmarks. Treat them as order-of-magnitude guidance; real latency and overhead vary by host hardware, kernel version, image size, storage layer, network model, and workload type.
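
One way to make the table actionable is to encode it as data and filter on constraints. The sketch below assumes the order-of-magnitude figures from the table (upper end of each range); the `Isolation` dataclass and the `candidates` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isolation:
    name: str
    cold_start_ms: float       # upper end of the range quoted in the table
    memory_mb: float           # approximate per-instance overhead
    shares_host_kernel: bool   # True only for plain runc containers
    full_linux: bool           # can run shell commands, pip, git, etc.


SPECTRUM = [
    Isolation("WASI / WASM", 1, 1, False, False),
    Isolation("V8 Isolates", 1, 1, False, False),
    Isolation("Docker (runc)", 50, 10, True, True),
    Isolation("gVisor (runsc)", 100, 20, False, True),
    Isolation("Firecracker", 125, 5, False, True),
    Isolation("Kata Containers", 200, 30, False, True),
]


def candidates(max_cold_start_ms, need_full_linux, allow_shared_kernel=False):
    """Return the names of technologies that satisfy all three constraints."""
    return [
        t.name
        for t in SPECTRUM
        if t.cold_start_ms <= max_cold_start_ms
        and (t.full_linux or not need_full_linux)
        and (allow_shared_kernel or not t.shares_host_kernel)
    ]


# An agent that needs a full Linux environment and tolerates 150ms of startup:
print(candidates(150, need_full_linux=True))
```

With these figures, the query leaves gVisor and Firecracker as the candidates, which matches the shortlist the rest of the article works from.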

Key Dividing Lines on the Spectrum

There are two critical dividing lines on this spectrum:

The first dividing line: shared kernel vs. independent kernel. Docker (runc) sits on the left side — it shares the host kernel. WASM and V8 Isolates, while having a higher security grade than Docker (they don't directly expose Linux syscalls), also have an extremely narrow capability scope — WASM can't execute shell commands, can't access the filesystem (unless explicitly granted). gVisor, Firecracker, and Kata sit on the right side — they all provide some form of independent kernel boundary. This is the baseline requirement for agent production deployments.

The second dividing line: software interception vs. hardware isolation. gVisor intercepts syscalls through software (a userspace Go program called Sentry), while Firecracker and Kata rely on hardware virtualization (Intel VT-x / AMD-V). Software interception is more flexible and has lower overhead, but the attack surface exists at the software layer (Sentry itself could be compromised); hardware isolation provides a stronger boundary, but requires KVM support and additional memory overhead.

Why WASM and V8 Isolates Sit on the Far Left

WASM (WebAssembly) and V8 Isolates provide the highest-performance isolation — sub-millisecond startup time, under 1MB of memory overhead. But their capability scope is severely limited: WASM modules cannot directly invoke system calls; all external interactions (filesystem, network) must go through the WASI interface with explicit authorization. This makes them ideal for pure-computation sandboxes — for example, an agent calling a Python function for math or JSON processing — but unsuitable for agent tool invocations that require a full Linux execution environment.

For agent scenarios requiring full Linux capabilities — shell commands, pip installs, git operations, etc. — WASM is too narrow. What we need is: retain full Linux capability while providing stronger isolation than Docker. That's where gVisor and Firecracker sit.

3. MicroVMs in Practice: Firecracker and Kata Containers

If you could pick just one "sweet spot" from the technology stack — VM-level isolation strength combined with near-container startup speed — the answer is microVMs. Firecracker is the category definer; Kata Containers is its Kubernetes-native cousin.

Firecracker: 50,000 Lines of Rust Minimalism

AWS's motivation for releasing Firecracker in 2018 was straightforward: Lambda needed to run thousands of tenants' functions simultaneously. QEMU was too heavy (1.4 million lines of code, hundreds of emulated devices, continuous CVE disclosures), but Docker's shared kernel couldn't satisfy multi-tenant isolation requirements. Firecracker's design philosophy: only do the absolute minimum needed to create a microVM; cut everything else.

This is reflected in several hard numbers:

50,000 lines of Rust code (4% of QEMU), minimal attack surface
Only 3 virtual devices: virtio-block (block storage), virtio-net (networking), serial console — compared to QEMU's hundreds
One independent VMM process per microVM — no shared daemon, a single point of failure doesn't cascade to other instances
150 microVM creations per second (per host), cold start ~125ms
VMM process only ~5MB memory, supports 20× overcommit (tested), 10× overcommit (production)

Firecracker Security Architecture: Dual Concentric Rings

Firecracker's security design can be understood as two concentric rings:

┌────────────────────────────────────────────────────────────┐
│  Outer Ring: CPU Hardware Boundary                         │
│  Intel VT-x / AMD-V virtualization extensions              │
│  Any VM escape must first breach the hardware boundary     │
│  ┌──────────────────────────────────────┐                  │
│  │  Inner Ring: Jailer Sandbox          │                  │
│  │  • chroot filesystem isolation       │                  │
│  │  • seccomp limits to 24 syscalls     │                  │
│  │  • cgroup resource limits            │                  │
│  │  • isolated network namespace        │                  │
│  └──────────────────────────────────────┘                  │
└────────────────────────────────────────────────────────────┘

The outer ring is CPU hardware virtualization — physical isolation between different VMs on the same CPU. The inner ring is the Jailer — Firecracker's built-in secondary sandbox process that further constrains the VMM process itself through chroot, seccomp (only 24 syscalls allowed), and cgroups. Even if an attacker breaks out of the microVM boundary, they still need to pierce through the Jailer's isolation to reach the host.
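
To make the inner ring concrete, here is a minimal sketch of how a sandbox daemon might assemble a jailer invocation. The flag names (`--id`, `--exec-file`, `--uid`, `--gid`, `--chroot-base-dir`, `--netns`) follow the Firecracker jailer documentation as far as I know it, so verify them against the version you deploy. The paths, IDs, and the `jailer_argv` helper are assumptions made for illustration, and the code only builds the command line; it does not launch anything.

```python
import shlex


def jailer_argv(vm_id, uid, gid, netns=None,
                firecracker_bin="/usr/bin/firecracker",  # assumed install path
                chroot_base="/srv/jailer"):              # assumed chroot root
    """Build (but do not run) a jailer command line for one microVM."""
    argv = [
        "jailer",
        "--id", vm_id,                     # unique microVM identifier
        "--exec-file", firecracker_bin,    # VMM binary to confine
        "--uid", str(uid),                 # unprivileged user the VMM runs as
        "--gid", str(gid),
        "--chroot-base-dir", chroot_base,  # each VM gets its own chroot below this
    ]
    if netns:
        argv += ["--netns", netns]         # join a pre-created network namespace
    return argv


print(shlex.join(jailer_argv("agent-vm-01", 10001, 10001,
                             netns="/var/run/netns/agent-vm-01")))
```

Giving each microVM its own unprivileged uid/gid and its own chroot means that a compromised VMM process still lands in an empty, unprivileged corner of the host.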

This dual-isolation philosophy comes from AWS's Lambda operational experience. Lambda handles trillions of function invocations per month, and Firecracker's multi-tenant isolation has been validated at hyperscale.

Real-World Adoption

Platform	Isolation Technology	Cold Start	Session Limit	Notes
AWS Lambda	Firecracker microVM	~125ms (cold) / <50ms (SnapStart)	15 min	Trillions of monthly invocations, 10× overcommit in production
E2B	Firecracker microVM	~80–200ms	24 hours	Open source (12K+ stars), one of the hosted sandbox providers listed in the OpenAI Agents SDK documentation
Anthropic Computer Use	Firecracker (via E2B)	~150ms	Per session	Desktop sandbox, provides graphical isolated environment for computer-use agents
Fly.io Machines	Firecracker microVM	~125ms	Per machine	Global anycast network
Sprites.dev	Firecracker microVM	~150ms	Unlimited	Checkpoint/rollback support
Docker Sandboxes	MicroVM (custom VMM)	~200ms	Per session	Independent Docker daemon per sandbox + MITM TLS filtering proxy

(Data sources: E2B GitHub & docs 2023–2026; AWS Compute Blog 2025-08; Docker Blog 2026-04-16.)

Kata Containers: Kubernetes-Native VM-Level Isolation

If Firecracker is "the microVM born in the Lambda scenario," Kata Containers is "the secure container grown in the Kubernetes ecosystem." Kata's core design: each container runs inside a dedicated lightweight VM, but exposes a standard OCI interface externally. To Kubernetes, Kata is just another RuntimeClass — you can use the same Pod spec and switch to a different isolation level.

# Kubernetes RuntimeClass isolation tiers (the default tier, runc, needs no RuntimeClass)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: isolated        # gVisor — medium isolation
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: highly-isolated # Kata — strong isolation
handler: kata
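
Switching isolation level then comes down to one field in the Pod spec, `runtimeClassName`. The sketch below generates a Pod manifest per risk level; the risk-to-class mapping and the `pod_manifest` helper are assumptions for illustration, reusing the RuntimeClass names defined above.

```python
import json

# Hypothetical mapping from risk level to the RuntimeClass names defined above.
RUNTIME_CLASS_BY_RISK = {
    "low": None,                # default runtime (runc): leave the field unset
    "medium": "isolated",       # gVisor (handler: runsc)
    "high": "highly-isolated",  # Kata (handler: kata)
}


def pod_manifest(name, image, risk):
    """Return a Pod manifest whose runtime matches the given risk level."""
    spec = {"containers": [{"name": "agent", "image": image}]}
    runtime_class = RUNTIME_CLASS_BY_RISK[risk]
    if runtime_class is not None:
        spec["runtimeClassName"] = runtime_class
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


print(json.dumps(pod_manifest("code-runner", "python:3.12-slim", "high"), indent=2))
```

The container image and the rest of the spec stay identical across tiers, which is what makes RuntimeClass a low-friction upgrade path.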

Kata 3.0 (2025–2026) is a major rewrite — migrating from Go to Rust, with significant performance improvements. The Kata 3.x ecosystem can use backends such as QEMU, Cloud Hypervisor, Firecracker, and Dragonball; some distributions and cloud-provider offerings use Dragonball for deep optimization.

Compared to Kata 2.0: 90% overhead reduction, 3× faster startup, 10× density improvement
Supports multiple VMM backends: QEMU, Cloud Hypervisor, Firecracker, Dragonball
Built-in GPU passthrough support — a capability Firecracker lacks

Kata's unique value lies in bridging the gap between the "container ecosystem" and "VM isolation." If you're already running agent workloads on Kubernetes, Kata is the smoothest path to upgrading isolation — just change the RuntimeClass.

Firecracker vs. Kata: How to Choose?

Dimension	Firecracker	Kata Containers
Startup Speed	~125ms (no snapshot) / <50ms (snapshot restore)	~200ms (Dragonball backend)
Memory Overhead	~5MB (VMM) + guest kernel	~30MB (VMM + kata-agent)
K8s Integration	Requires custom controller	Native RuntimeClass, community-maintained
GPU Support	Not supported	GPU passthrough supported
OCI Compatibility	Non-OCI (independent API)	Fully OCI-compatible
Best For	Custom agent sandbox platforms, serverless patterns	Kubernetes-native agent clusters

Selection principle: building a custom agent sandbox platform → Firecracker; already on Kubernetes → Kata Containers. If you need both — for example, the Kubernetes ecosystem with Firecracker's extreme performance — Kata can use Firecracker as a VMM backend (Kata + Firecracker combination).

4. gVisor: The Art of Userspace Kernel Tradeoffs

gVisor takes a different path: it doesn't isolate at the hardware level (unlike Firecracker with KVM), nor does it merely filter syscalls at the kernel entry point (unlike seccomp). Instead, it reimplements a large part of the Linux kernel's system call interface in userspace.

Sentry: The Userspace Syscall Interceptor

gVisor's core component is called Sentry — a userspace kernel written in Go. When a process running inside a gVisor sandbox issues a system call:

The process issues a system call (e.g., openat)
The kernel's seccomp redirects the system call to Sentry (a userspace process)
Sentry processes the system call in its own memory space — validates parameters, checks permissions, executes logic
Sentry issues only the genuinely necessary operations to the host kernel through a restricted set of system calls

The key: applications inside the sandbox never issue system calls directly to the host kernel; every call is handled by Sentry in userspace. A typical kernel exploit works by feeding a crafted system call to a vulnerable kernel code path. Inside gVisor that call reaches Sentry's own implementation instead, so to get at the host kernel an attacker must first compromise Sentry and then escape the restricted syscall set that Sentry itself runs under.
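
The flow above can be illustrated with a toy model. This is not how Sentry is implemented (the real one is a Go program with its own memory management, filesystem, and network stack); the `ToySentry` class and its tiny syscall set are invented purely to show the proxying idea: guest calls land in userspace handlers, and only a small allowlisted set ever reaches the host.

```python
import errno


class ToySentry:
    """Toy stand-in for a userspace kernel: every guest syscall lands here."""

    HOST_ALLOWLIST = {"read"}  # the only operation this toy ever forwards

    def __init__(self):
        self.host_calls = []   # record of what actually reached the "host"
        self.files = {"/workspace/data.txt": b"hello"}  # sandbox's file view

    def _host(self, name):
        # Stand-in for the restricted syscalls Sentry makes to the host kernel.
        if name not in self.HOST_ALLOWLIST:
            raise PermissionError(name)
        self.host_calls.append(name)

    def syscall(self, name, *args):
        handler = getattr(self, "_sys_" + name, None)
        if handler is None:
            # Unimplemented: the guest gets an error, the host never sees it.
            return -errno.ENOSYS
        return handler(*args)

    def _sys_openat(self, path):
        # Path validation happens entirely inside the userspace kernel.
        return 3 if path in self.files else -errno.ENOENT

    def _sys_read(self, path):
        if path not in self.files:
            return -errno.ENOENT
        self._host("read")     # only now is a host operation issued
        return self.files[path]


sentry = ToySentry()
print(sentry.syscall("init_module"))                  # rejected inside Sentry
print(sentry.syscall("read", "/workspace/data.txt"))  # proxied to the host
print(sentry.host_calls)
```

The point to notice is that the unimplemented and invalid calls never appear in `host_calls`: they are answered inside the userspace kernel.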

Platform Modes: Systrap vs. KVM

gVisor offers two platform modes:

Mode	Mechanism	Performance	Security	Recommended For
Systrap	seccomp redirects syscalls	10–30% I/O overhead	Relies on seccomp	Recommended default since 2024, no KVM required
KVM	Hardware virtualization + Sentry in guest ring 0	Lower I/O overhead (~10%)	Hardware + Sentry dual-layer	Environments needing extreme performance with KVM support

gVisor's Performance Profile: Asymmetric Overhead

gVisor's performance overhead isn't evenly distributed — it shows significant asymmetry between compute and I/O (data from Zylos Research 2026-04 and Safeguard.sh 2023-12):

Pure compute (CPU-bound): overhead <5%, near-native performance. Sentry only intervenes when a system call is actually triggered.
File I/O: overhead 10–30%. Reason: every file operation traverses the "sandbox process → Sentry → Gofer (file proxy process) → host filesystem" path, adding 2 userspace round-trips compared to a direct system call.
Network I/O: throughput drops 20–40%. Reason: Sentry implements its own TCP/IP stack (based on Google's netstack) rather than using the host's network stack.
Syscall-intensive workloads: performance drops 2–5×. Reason: every system call requires context switching and processing inside Sentry.

What this means: gVisor is the best price-performance choice for compute-intensive agents (e.g., model inference, data processing), but for agents requiring heavy file I/O (e.g., code compilation, large-scale file processing), I/O overhead may become a bottleneck.
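
A back-of-the-envelope estimate helps decide whether a given workload falls on the cheap or the expensive side. The sketch below assumes a simple additive model (total time is the weighted sum of per-category time multipliers) and takes the worst-case end of each range quoted above; the workload profiles are made-up examples, and syscall-intensive workloads (2–5×) are deliberately left out.

```python
# Worst-case time multipliers derived from the ranges quoted above.
TIME_FACTOR = {
    "compute": 1.05,            # <5% CPU overhead
    "file_io": 1.30,            # 10-30% file I/O overhead, upper end
    "network": 1 / (1 - 0.40),  # a 40% throughput drop means ~1.67x the time
}


def estimated_slowdown(profile):
    """profile maps category -> fraction of native runtime (must sum to 1)."""
    if abs(sum(profile.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(share * TIME_FACTOR[kind] for kind, share in profile.items())


compute_heavy = {"compute": 0.80, "file_io": 0.15, "network": 0.05}
io_heavy = {"compute": 0.20, "file_io": 0.70, "network": 0.10}

print(f"{estimated_slowdown(compute_heavy):.2f}x")  # 1.12x
print(f"{estimated_slowdown(io_heavy):.2f}x")       # 1.29x
```

Under these assumptions a compute-heavy agent pays roughly 12% in the worst case, while a file-heavy one approaches 30%, which is where the I/O bottleneck warning comes from.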

If GPU is required, verify current platform support first. gVisor has GPU support documentation, and Kata also has GPU passthrough paths; Firecracker is generally not the default choice for GPU passthrough.

gVisor's Syscall Coverage

A practical limitation of gVisor: it implements only about 200 Linux system calls (roughly 70% of the Linux syscall interface). Most agent workloads use only a small subset of these, but certain edge cases may hit unimplemented syscalls:

✅ Well supported: file operations (openat, read, write, close), process management (fork, execve, wait4), networking (socket, connect, sendto)
⚠️ Partially supported: certain ioctl subcommands, advanced networking features
❌ Not supported: kernel module operations (init_module), direct hardware access (iopl), certain special filesystem operations

Practical impact: the vast majority of Python/Node.js/Go agent workloads run well on gVisor. A third-party reverse-engineering analysis reported signs of gVisor in the OpenAI Code Interpreter environment.

gVisor Security Model: One Wall in Defense in Depth

From a security perspective, gVisor doesn't provide the hardest boundary (hardware virtualization), but rather the narrowest attack surface:

Applications cannot directly call the host kernel — all syscalls are proxied by Sentry
Sentry is a pure userspace Go program — a memory-safe language, avoiding the buffer overflow class of vulnerabilities common in C
Gofer (the file proxy) runs with minimal privileges — can only access allowlisted directories
Even if Sentry is compromised, the attacker is still inside a seccomp sandbox — with a limited set of syscalls

gVisor's positioning: a substantially stronger boundary than Docker's shared kernel, without Firecracker's dependency on KVM. It's best suited for scenarios where you need stronger isolation than "shared host kernel" but your infrastructure doesn't support KVM (e.g., certain cloud environments, CI platforms), or where you need GPU support.

5. Agent Runtime Integration Patterns

Once you've chosen an isolation technology, the next engineering question is: how does the agent's tool invocation connect to the isolated runtime? When the LLM decides to invoke a tool (e.g., executing Python code or a shell command), the actual execution should happen inside the isolated sandbox. There are five main patterns for this "handoff" process.

Five Integration Patterns Compared

Pattern	Mechanism	Latency	Security Strength	Best For
API-managed	Call sandbox provider API (E2B, Modal)	+50–100ms RTT	Strong (provider-managed)	Cloud-hosted agents
exec local	Direct subprocess execution	~0ms	Weakest	Local dev, trusted code
gRPC sidecar	Sidecar sandbox daemon	+1–5ms	Strong (local VM)	Self-hosted Firecracker clusters
OCI runtime	Docker/Podman + runsc/kata	+10–200ms	Medium–Strong	Kubernetes native
vsock	Inter-kernel communication (no network interface)	<1ms	Strong (no network exposure)	Firecracker host↔guest

Pattern 1: API-Managed (E2B Pattern)

This is the most frictionless approach: use a managed sandbox service. The agent framework creates sandboxes, executes code, and retrieves results via API. E2B is the representative of this pattern — it provides Python/TypeScript SDKs; the agent just needs a single line: sandbox.run_code(code), while underneath it's a Firecracker microVM.

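# E2B API pattern: agent code example
# Note: in current E2B SDKs, run_code() ships in the e2b-code-interpreter package
# (pip install e2b-code-interpreter); verify names against the version you install.
from e2b_code_interpreter import Sandbox

# Create a Firecracker microVM sandbox (~80-200ms).
# "python-3.12" is assumed to be a template that exists in your E2B account.
sandbox = Sandbox.create(template="python-3.12")

try:
    # Agent-generated code executes inside the isolated microVM
    result = sandbox.run_code("""
import os
import subprocess
# Even dangerous operations inside the sandbox can't affect the host
subprocess.run(["ls", "-la"])
print(os.getcwd())
""")

    print(result.logs)  # Captured stdout/stderr from inside the sandbox
finally:
    sandbox.kill()      # Destroy the microVM even if execution raised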

The advantages of this pattern are clear — zero ops, fast integration, and validated at scale (E2B is one of the hosted sandbox clients/providers listed in the OpenAI Agents SDK documentation). The cost is 50–100ms of network RTT per invocation, and data produced inside the sandbox must be transmitted back via API.

Pattern 2: gRPC Sidecar (Self-Hosted Firecracker)

If you need to self-host a sandbox platform (for compliance, cost, or customization reasons), the gRPC sidecar pattern provides low-latency local Firecracker integration. The architecture looks like this:

┌──────────────────────────────────────────────────────────────────┐
│                               Host                               │
│  ┌───────────────┐     gRPC      ┌────────────────────────────┐  │
│  │ Agent Process │ ←───────────→ │ Sandbox Daemon             │  │
│  │ (Python/Go)   │  (localhost)  │ (manages microVM lifecycle)│  │
│  └───────────────┘               │ ┌───────────────────┐      │  │
│                                  │ │ Firecracker VMM   │      │  │
│                                  │ │  ┌─────────────┐  │      │  │
│                                  │ │  │ microVM     │  │      │  │
│                                  │ │  │ (agent code)│  │      │  │
│                                  │ │  └─────────────┘  │      │  │
│                                  │ └───────────────────┘      │  │
│                                  └────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

The Sandbox Daemon maintains a warm pool of pre-started microVMs to eliminate cold-start latency. When an agent requests code execution, the Daemon pulls a pre-started microVM from the pool, injects the code, executes it, returns the result, then returns the microVM to the pool or destroys and recreates it.
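
The pool logic itself is small. Below is a minimal single-process sketch, assuming a hypothetical `boot_vm` callable that starts one microVM and returns a handle; a real daemon would also need health checks, refilling in the background, and teardown of used VMs.

```python
import itertools
import queue


class WarmPool:
    """Keep pre-started sandboxes ready; fall back to a cold start when empty."""

    def __init__(self, boot_vm, size):
        self._boot_vm = boot_vm      # hypothetical callable that boots one VM
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(boot_vm())

    def acquire(self):
        try:
            return self._pool.get_nowait()  # warm path: no boot latency
        except queue.Empty:
            return self._boot_vm()          # pool exhausted: pay the cold start

    def release(self, vm, reuse=False):
        # Default is destroy-and-recreate, so one request's state never leaks
        # into the next. (Destroying the old VM is omitted in this sketch.)
        if not reuse:
            vm = self._boot_vm()
        self._pool.put(vm)


# Demo with fake VM handles: integers stand in for real microVMs.
counter = itertools.count()
pool = WarmPool(lambda: f"vm-{next(counter)}", size=2)
vm = pool.acquire()
print(vm)         # vm-0, served from the pool
pool.release(vm)  # vm-0 is retired; a fresh vm-2 takes its place
```

Recreating rather than reusing costs one boot per request, but that boot happens off the request path, which is the whole point of the pool.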

Abhishek Dadwal validated this pattern's performance in a January 2026 real-world report: through VM pooling, per-request latency dropped from 8,700ms to 500ms — a 17× improvement. With a goroutine concurrency pool (10 concurrent workers), throughput can reach dozens of agent code execution requests per second.

Pattern 3: OCI Runtime (Kubernetes Native)

If your agent infrastructure already runs on Kubernetes, the OCI runtime pattern is the most natural integration. Google's kubernetes-sigs/agent-sandbox controller (released November 2025) productized this pattern:

apiVersion: agentsandbox.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: python-agent
spec:
  runtimeClass: kata       # or runsc (gVisor)
  image: python:3.12-slim
  resources:
    cpu: "1"
    memory: "512Mi"
---
apiVersion: agentsandbox.io/v1alpha1
kind: WarmPool
metadata:
  name: agent-warm-pool
spec:
  templateRef: python-agent
  minSize: 5              # Always maintain 5 warm sandboxes
  maxSize: 20

The agent framework requests execution through the sandbox.run() API; underneath, the controller pulls a pre-warmed Pod from the WarmPool, injects code, executes it, and returns results. The advantage of this pattern: full integration into the Kubernetes ecosystem — monitoring, logging, resource limits, and scaling all use native K8s mechanisms.

Pattern 4: vsock (Zero Network Exposure)

Firecracker's vsock (virtio-vsock) device provides a host-guest communication channel that bypasses the network stack entirely. Traditional TCP/IP communication between host and guest must traverse virtual NICs, network namespaces, and network policies; vsock skips all of them.

Security value: vsock creates no network interface, so even if the agent inside the microVM attempts network scanning or outbound connections, it cannot traverse vsock (vsock is a strictly point-to-point channel). The host side can precisely control what is received and sent over vsock.
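
On the host, Firecracker exposes the guest's vsock through a Unix domain socket: the host connects, sends `CONNECT <port>\n`, and waits for an `OK <host_port>\n` acknowledgement before exchanging data. The sketch below implements that handshake as I understand it from Firecracker's vsock documentation; the socket path and port number are whatever you configured for the device, and the helper names are hypothetical.

```python
import socket


def vsock_handshake(sock, guest_port):
    """Ask Firecracker to connect this stream to guest_port inside the VM."""
    sock.sendall(f"CONNECT {guest_port}\n".encode())
    ack = b""
    while not ack.endswith(b"\n"):
        chunk = sock.recv(1)  # byte-wise, so no payload is consumed by accident
        if not chunk:
            raise ConnectionError("peer closed before acknowledging")
        ack += chunk
    if not ack.startswith(b"OK "):
        raise ConnectionError(f"unexpected vsock reply: {ack!r}")
    return int(ack.split()[1])  # host-side port assigned to this connection


def connect_guest(uds_path, guest_port):
    """Open a stream to a service listening on guest_port inside the microVM."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(uds_path)      # uds_path: the vsock device's configured socket
    vsock_handshake(sock, guest_port)
    return sock
```

Usage would look like `conn = connect_guest("/srv/jailer/.../v.sock", 5000)` followed by ordinary `sendall`/`recv` calls; the guest side listens on an `AF_VSOCK` socket, and no IP address is involved on either end.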

OpenAI's Harness/Sandbox Separation Pattern

OpenAI Agents SDK introduces an architectural concept worth discussing separately: separation of harness (control plane) from sandbox (compute plane).

Harness: control plane — LLM invocation, tool routing, user approval, security policy evaluation. This is "the smart part."
Sandbox: compute plane — code execution, file operations, shell commands. This is "the potentially dangerous part."

Key design decision: credentials are injected into the sandbox as runtime configuration, not as prompt content. This means even if an attacker reads the agent's context (system prompt, conversation history) through prompt injection, they cannot obtain API keys or database passwords — this information is only injected at sandbox creation time via environment variables and is not present in the prompts.

This is an important security pattern: separate the credential channel from the data channel. The data channel (prompts, LLM output, tool call parameters) may be observed or manipulated by attackers; the credential channel (environment variable injection, secret manager mounts) flows one way, from the harness into the sandbox, and never passes through the model's context.
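
A harness can enforce this separation mechanically. The following is a minimal sketch, not the OpenAI Agents SDK API: `SandboxRequest`, `assert_no_secret_leak`, and `prepare` are hypothetical names, and the check is a plain substring match that would need hardening in practice.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxRequest:
    code: str
    # repr=False keeps secret values out of logs and debug output.
    env: dict = field(default_factory=dict, repr=False)


def assert_no_secret_leak(text, secrets):
    """Refuse to send text down the data channel if it contains a secret."""
    for name, value in secrets.items():
        if value and value in text:
            raise ValueError(f"secret {name!r} leaked into the data channel")


def prepare(prompt, code, secrets):
    """Split one tool call into a data channel and a credential channel."""
    assert_no_secret_leak(prompt, secrets)  # what the LLM will see
    assert_no_secret_leak(code, secrets)    # what the LLM produced
    # Credentials travel only here, as sandbox runtime configuration.
    return prompt, SandboxRequest(code=code, env=dict(secrets))


secrets = {"DB_PASSWORD": "s3cr3t-value"}
prompt, request = prepare(
    "Summarize yesterday's orders.",
    "import os; pw = os.environ['DB_PASSWORD']",
    secrets,
)
print(request)  # the env mapping is omitted from the repr
```

The generated code refers to the credential by name (`os.environ['DB_PASSWORD']`) and only resolves it inside the sandbox, so the value never needs to appear in a prompt.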

6. Performance Tradeoffs at Scale

"Stronger isolation means worse performance" — this is most people's intuition about security technology. In the context of agent runtime isolation, this intuition is broadly correct, but the magnitude may be much smaller than you think.

The Real Composition of Cold Start Latency

Cold start is the core performance metric for isolation technologies — it determines the wait time between an agent "deciding to execute code" and "code starting to run." But "cold start" means completely different things across different technologies:

Technology	Cold Start (single)	Cold Start (warm pool)	Per-Instance Memory	CPU Overhead
Docker (runc)	10–50ms	N/A	~10MB	~0%
gVisor	~100ms	N/A	~20MB	10–30% (I/O), <5% (compute)
Firecracker	~125ms	<50ms (snapshot)	~5MB	3–11%
Kata (QEMU)	~500ms	N/A	~50MB	5–15%
Kata (Firecracker)	~125–200ms	N/A	~5MB	3–11%
Kata (Dragonball)	~200ms	<100ms	~30MB	~5%
Traditional VM (QEMU)	3–60s	N/A	GB-scale	5–20%

(Data sources: NumaVM 2026-03 Firecracker end-to-end benchmarks; arXiv:2602.15214 Docker startup analysis; NextKick Labs 2026-01; Alibaba Cloud Kata 3.0 release announcement.)

An Important Correction: Docker's Actual Cold Start

Many people believe Docker containers are "instant-start" (~10ms). That is true only for a small part of the startup path. The arXiv:2602.15214 study was the first to systematically decompose Docker container startup latency: kernel namespace creation takes only 8–10ms (less than 1.5% of total time). The real bottleneck is storage-layer work: image layer mounting and filesystem preparation consume 300–800ms.

What this means: from the user's perspective, a real Docker cold start and a Firecracker microVM cold start (~125ms, plus snapshot loading when restoring from a snapshot) are within the same order of magnitude. A Firecracker snapshot restore (176ms) can even be faster than the cold start of certain Docker images.

Warm Pool: The Silver Bullet for Eliminating Cold Start

A warm pool is the most effective technique for solving cold-start problems — pre-start a set of sandbox instances, and when an agent request arrives, directly allocate an already-running instance. Its effect is dramatic:

AWS Lambda SnapStart: Java function cold start dropped from 6,100ms to 1,400ms (4.4× improvement); for Firecracker microVMs, snapshot restore takes only 176ms (of which snapshot loading is just 25ms, achieved via mmap)
VM pooling (Abhishek Dadwal 2026-01): per-request latency dropped from 8,700ms to 500ms (17× improvement)
Google GKE Agent Sandbox: through SandboxTemplate + WarmPool, sub-second sandbox dispatching
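
The quoted improvement factors are easy to verify, and the same arithmetic shows why pool sizing matters. In the sketch below, the 95% hit rate is an assumed figure chosen for illustration; the latencies are the ones cited above.

```python
def speedup(before_ms, after_ms):
    return before_ms / after_ms


def expected_latency_ms(hit_rate, warm_ms, cold_ms):
    """Average latency when a fraction hit_rate of requests find a warm VM."""
    return hit_rate * warm_ms + (1 - hit_rate) * cold_ms


print(round(speedup(6100, 1400), 1))  # 4.4  (Lambda SnapStart, Java)
print(round(speedup(8700, 500), 1))   # 17.4 (VM pooling)

# Assumed: 95% of requests hit the pool, 5% fall through to a cold start.
print(round(expected_latency_ms(0.95, 500, 8700)))  # 910
```

Even a 5% miss rate nearly doubles the average (910ms vs. 500ms), so the pool's minimum size should track peak concurrency rather than the mean.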

Instance Density: Memory Is the Real Bottleneck

A factor easily overlooked when selecting an isolation technology is instance density — how many sandboxes can run simultaneously on a single host. This directly determines infrastructure cost. NextKick Labs' January 2026 measured data (80GB host memory):

Technology	Per-Instance Memory	Instances on 80GB Host
Docker (runc)	~40MB	~2,000
Firecracker	~45MB	~1,778
Kata (QEMU)	~165MB	~485
Kata (Dragonball)	~80MB	~1,000

Key finding: Firecracker's instance density is nearly equivalent to Docker's (1,778 vs. 2,000). Kata 3.0, through the Dragonball VMM, doubled density (485 → 1,000). For AI agent workloads, this means: using Firecracker instead of Docker does not significantly increase infrastructure costs.
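
The density column is simply host memory divided by per-instance memory. The sketch below reproduces the table, assuming decimal units (1GB = 1,000MB), which is what the published figures are consistent with, and ignoring memory reserved for the host itself.

```python
def instances_per_host(host_gb, per_instance_mb):
    # Decimal units (1GB = 1,000MB) match the figures in the table.
    return round(host_gb * 1000 / per_instance_mb)


PER_INSTANCE_MB = {
    "Docker (runc)": 40,
    "Firecracker": 45,
    "Kata (QEMU)": 165,
    "Kata (Dragonball)": 80,
}

for name, mb in PER_INSTANCE_MB.items():
    print(f"{name}: {instances_per_host(80, mb)}")

# Extra hosts needed to run the same fleet on Firecracker instead of Docker:
extra = PER_INSTANCE_MB["Firecracker"] / PER_INSTANCE_MB["Docker (runc)"] - 1
print(f"{extra:.1%}")  # 12.5%
```

On memory alone, moving from Docker to Firecracker costs about 12.5% more hosts for the same number of sandboxes.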

When Is the Performance Overhead Not Worth It?

Isolation is not free. In the following scenarios, the extra isolation overhead may not be justified:

Text-only agents (no tool calls): If the agent doesn't execute code, invoke shells, or access filesystems, the extra isolation overhead is wasted. Docker + seccomp is sufficient.
Trusted code execution: If the agent only executes team-authored, code-reviewed code (e.g., internal automation scripts), Docker + hardened seccomp + dropped capabilities provides adequate protection.
I/O-intensive batch processing: If the agent's core work is file processing (e.g., large-scale ETL), gVisor's 10–30% I/O overhead may become a bottleneck. In this case, containers (zero I/O overhead) or Firecracker (~3–11% CPU overhead, near-native I/O) are more suitable.
Sub-10ms latency requirements: If agent execution latency must be below 10ms, only WASM or containers can meet this. But note: in most agent scenarios, LLM inference latency (hundreds of milliseconds to seconds) far exceeds sandbox startup latency.

7. Decision Framework: Choose Isolation by Risk Level

After six chapters of technical analysis, the final question returns to an engineering decision: which isolation should my agent use? This isn't a technology question — it's a risk-matching question. The framework below maps agent capability scenarios to recommended isolation strategies.

Risk Level × Capability Scenario

Agent Capability	Risk Level	Recommended Isolation	Rationale
Text-only, no tools	Low	Docker + seccomp	Minimal attack surface, no extra isolation cost needed
Trusted code execution (internal scripts)	Medium	Docker + hardened seccomp + drop all caps	Known code, controlled dependencies, hardened container sufficient
LLM-generated code execution	High	gVisor (minimum) / Firecracker (recommended)	Unpredictable syscall patterns, requires syscall-level interception or hardware isolation
Multi-tenant code execution	Critical	Firecracker / Kata Containers	Must provide an independent kernel boundary for each tenant
Finance / Healthcare / PII	Critical	Firecracker + egress allowlist + secret injection	Compliance requires VM-level boundary
GPU-accelerated AI Agent	High	gVisor (GPU support) or Kata	Firecracker lacks GPU passthrough
Plugin / extension system	High	WASM or Firecracker	Capability confinement or hardware isolation
Browser-side agent	Low–Medium	WASM (inherits browser sandbox)	Browser built-in isolation

Seven Decision Rules

Here are seven hard decision rules — each directly corresponds to a yes/no judgment, helping you quickly narrow down choices in specific scenarios:

Is the code LLM-generated? → Yes: at minimum gVisor; for production multi-tenant scenarios use Firecracker. Never use bare Docker.
Do tenants share infrastructure? → Yes: independent kernel boundary required → Firecracker or Kata.
Is GPU passthrough needed? → Yes: exclude Firecracker → gVisor (added GPU support 2024–2025) or Kata.
Is Kubernetes the orchestration layer? → Yes: use the kubernetes-sigs/agent-sandbox controller; switch between Kata or gVisor via RuntimeClass.
Is sub-10ms startup required? → Yes: containers or WASM; Firecracker snapshot restore still requires ~176ms.
Is the workload compute-intensive with low I/O? → Yes: gVisor provides the best "performance-to-isolation ratio."
Are there compliance audit requirements? → Yes: VM boundaries (Firecracker/Kata) are standard isolation that auditors can understand and verify.
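
For the Kubernetes rule, switching runtimes comes down to one field in the Pod spec. The sketch below builds a plain Pod manifest with `runtimeClassName` set; it does not use the agent-sandbox controller's own resources. The RuntimeClass names `gvisor` and `kata`, the pod name, and the image are assumptions that depend on how a given cluster is set up.

```python
import json

# Assumed RuntimeClass names; the real ones depend on the cluster installation.
RUNTIME_CLASSES = {"gvisor": "gvisor", "kata": "kata"}


def sandbox_pod(name, image, isolation):
    """Build a Pod manifest that selects the sandbox runtime via runtimeClassName."""
    if isolation not in RUNTIME_CLASSES:
        raise ValueError(f"unknown isolation backend: {isolation}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": RUNTIME_CLASSES[isolation],
            "automountServiceAccountToken": False,  # no cluster credentials in the sandbox
            "restartPolicy": "Never",
            "containers": [{
                "name": "agent",
                "image": image,
                "securityContext": {
                    "allowPrivilegeEscalation": False,
                    "capabilities": {"drop": ["ALL"]},
                },
            }],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = sandbox_pod("agent-run-1", "registry.example.com/agent:latest", "gvisor")
print(json.dumps(manifest, indent=2))
```

Changing the third argument to `"kata"` is the only difference between the two backends from the workload's point of view, which is what makes RuntimeClass a convenient switch.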

Decision Tree

Does the agent execute code?
├── No  → Docker + seccomp
└── Yes → Code source?
    ├── Human-written → Docker + hardened
    └── LLM-generated → Multi-tenant?
        ├── Yes → Firecracker or Kata
        └── No  → Need GPU?
            ├── Yes → gVisor or Kata
            └── No  → Firecracker or gVisor
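
The same tree can be written as a small function, which makes it easy to put into a provisioning script or a policy check. This is a direct transcription of the branches above; the function and parameter names are illustrative.

```python
def choose_isolation(executes_code, llm_generated=False, multi_tenant=False, needs_gpu=False):
    """Walk the decision tree from the root and return the recommended isolation."""
    if not executes_code:
        return "Docker + seccomp"
    if not llm_generated:          # human-written code
        return "Docker + hardened"
    if multi_tenant:
        return "Firecracker or Kata"
    if needs_gpu:
        return "gVisor or Kata"
    return "Firecracker or gVisor"


print(choose_isolation(executes_code=True, llm_generated=True, needs_gpu=True))
```

The tree checks multi-tenancy before GPU needs, so a multi-tenant GPU workload lands on "Firecracker or Kata"; combined with the GPU rule, that leaves Kata as the practical choice.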

Open Source Tool Quick Reference

Tool	Category	Language	Stars	Description
Firecracker	VMM (MicroVM)	Rust	~33.8K	AWS-built ultra-minimal VMM for KVM microVMs
gVisor	Userspace kernel	Go	~18.1K	Google's OCI-compatible syscall interceptor
Kata Containers	OCI runtime + VM	Rust	~7.8K	CNCF project, supports multiple VMM backends
Cloud Hypervisor	VMM	Rust	~5.4K	Intel-led modern VMM
youki	Container runtime	Rust	~6K	Rust rewrite of runc
Dragonball	VMM	Rust	—	Alibaba VMM, Kata 3.0 default backend

Alibaba Cloud Secure Sandbox: Kata 3.0 Validated in Practice

Alibaba Cloud Container Service ACK's secure sandbox runtime (based on Kata Containers + Dragonball VMM) provides production data validated at massive scale: compared to community Kata 2.0, Alibaba Cloud Secure Sandbox v2 achieved 90% overhead reduction, 3× faster startup, and 10× density improvement. This indicates that with the right VMM choice (Dragonball replacing QEMU) and deep optimization, microVM isolation can approach container efficiency in large-scale production environments.

Frequently Asked Questions

1. Is Docker with a seccomp profile enough?

No. seccomp can only filter system calls — it's an interception layer at the syscall entry point. But seccomp cannot change a fundamental fact: Docker containers (runc runtime) share the same Linux kernel with the host.

Attackers can bypass seccomp-only defenses through the following paths:

Kernel exploit: A Linux kernel CVE (e.g., Dirty Pipe, Dirty COW) can be triggered from within a container whenever the syscalls it relies on are allowed, since the container talks directly to the host kernel. seccomp only decides whether a syscall may enter the kernel; once an allowed call is inside, the bug is in kernel code that runs after seccomp's check, and seccomp cannot stop it.
Allowed syscall combination attacks: seccomp allowlists typically permit 100–200 syscalls. Even after excluding obviously dangerous calls (mount, ptrace), attackers can still construct attacks through combinations of allowed calls. For example, using openat + write to overwrite sensitive files.
seccomp configuration gaps: Docker's default seccomp profile blocks around 44 syscalls out of 300+, so the large majority remain available. The attack surface is still substantial.

The correct approach: Use seccomp as one layer in defense in depth, not as the sole defense. Low-risk scenarios (personal tool agents): Docker + seccomp + drop all capabilities + read-only rootfs + AppArmor. Medium-to-high-risk scenarios (LLM-generated code execution): gVisor or Firecracker — they provide an independent execution kernel, not just syscall filtering.
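
As a rough sketch of the low-risk configuration, the function below assembles a `docker run` command line with those layers applied. It only builds the argument list and does not start a container. The image, the seccomp profile path, and the added `no-new-privileges` option are assumptions for illustration; `--runtime runsc` only works on hosts where gVisor is installed and registered with Docker.

```python
import shlex


def hardened_docker_argv(image, command, seccomp_profile, runtime=None):
    """Build (but do not run) a `docker run` argv with the hardening layers applied."""
    argv = [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                                # drop all capabilities
        "--read-only",                                   # read-only root filesystem
        "--security-opt", f"seccomp={seccomp_profile}",  # custom seccomp profile
        "--security-opt", "apparmor=docker-default",     # Docker's default AppArmor profile
        "--security-opt", "no-new-privileges",           # extra layer, not from the text above
    ]
    if runtime:                                          # e.g. "runsc" for gVisor
        argv += ["--runtime", runtime]
    return argv + [image] + list(command)


# Hypothetical image and profile path.
argv = hardened_docker_argv(
    "python:3.12-alpine",
    ["python", "-c", "print('hi')"],
    seccomp_profile="./agent-seccomp.json",
)
print(shlex.join(argv))
```

Keeping the flags in one function means the medium-to-high-risk path can reuse the same hardening and change only the runtime, for example `hardened_docker_argv(..., runtime="runsc")`.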

2. How to choose between Firecracker and gVisor?

The core difference is the isolation mechanism: gVisor intercepts system calls through a userspace kernel (Sentry) — software-level isolation; Firecracker provides a true VM boundary through KVM hardware virtualization — hardware-level isolation.

Choose gVisor when:

GPU passthrough is needed (Firecracker does not support GPU passthrough)
Compute-intensive workloads with low I/O — gVisor CPU overhead <5%, best price-performance ratio
Infrastructure does not support KVM (e.g., certain cloud environments, CI platforms) — gVisor can use Systrap mode without KVM
Single-tenant scenarios that don't require hardware isolation for compliance

Choose Firecracker when:

Multi-tenant platforms — each tenant must have an independent kernel boundary
Executing untrusted LLM-generated code — the hardware VM boundary is easier to justify in audits because VM boundaries are familiar and independently verifiable
Finance/healthcare/PII data processing: compliance frameworks such as SOC 2 and HIPAA do not name a specific isolation technology, but a VM-level boundary is typically the easiest one to justify to auditors
Extremely low memory overhead is needed (Firecracker VMM ~5MB vs gVisor ~20MB)

When neither is suitable: sub-10ms startup latency required → use containers or WASM; full Linux compatibility needed with acceptable GB-level memory → traditional VMs.

3. What impact do microVMs have on CI/CD pipelines?

The impact of microVMs on CI/CD pipelines is manageable and centers on three areas:

1. Image build process changes: Firecracker uses a minimal rootfs (Alpine Linux ~63MB vs Ubuntu ~300MB), built differently from traditional Docker images. You'll need to maintain a rootfs build pipeline (using debootstrap or buildroot), but this can be integrated into existing CI — make rootfs building a CI pipeline stage, with artifacts uploaded to object storage.

2. Real-world impact of startup latency: A single microVM cold start takes 125–200ms. For a CI task that runs 10 seconds or longer, that is less than 2% of the total; only for tasks of a few seconds does it become noticeable. If your CI pipeline uses warm pools, the latency is negligible. Note: a Docker container's real cold start (300–800ms of storage-layer operations) can exceed Firecracker's.

3. Simplified resource cleanup: microVMs auto-destroy on exit — no residual processes, files, or network state. This actually simplifies CI environment cleanup. No need for docker rm -f or worrying about dangling volumes.
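
The startup-latency point can be checked with a few lines of arithmetic. The sketch below computes the share of total wall-clock time spent on a cold start; the task durations are arbitrary examples, and the 125ms and 200ms figures are the cold-start numbers quoted in this article.

```python
def startup_share(startup_ms, task_seconds):
    """Fraction of total wall-clock time spent on sandbox startup."""
    startup_s = startup_ms / 1000
    return startup_s / (startup_s + task_seconds)


for task_s in (2, 10, 60):
    low = startup_share(125, task_s)
    high = startup_share(200, task_s)
    print(f"{task_s:>2}s task: {low:.1%} to {high:.1%} of total time")
```

For a 10-second task the share stays at or below about 2%, while for a 2-second task it is several percent, which is where a warm pool starts to pay off.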

Recommended approach: Use VM pooling (see Abhishek Dadwal's reported 17× speedup): pre-allocate microVMs when the CI agent starts, then return each one to the pool, or destroy and recreate it, after execution. Google GKE Agent Sandbox's WarmPool pattern can be reused directly.

4. Are there any ready-to-use agent sandbox services?

Yes. They fall into two categories, managed and self-hosted:

Managed services (fastest time to value):

E2B (recommended) — open source (GitHub 12K+ stars, 480 releases), Firecracker-based agent sandbox platform. Cold start 80–200ms, supports 24h persistent sessions. One of the hosted sandbox providers listed in the OpenAI Agents SDK documentation. Provides Python/TypeScript SDK and MCP server support. Free tier available for trial.
Docker Sandboxes — Docker's official MicroVM sandbox service launched in 2026. Independent Docker daemon per sandbox, natively supports macOS/Windows/Linux. Best for teams already in the Docker ecosystem.
Northflank — Kata Containers-based agent sandbox platform, supports GPU and BYOC (bring your own container).

Self-hosted options (maximum control):

GKE Agent Sandbox — Google Cloud's kubernetes-sigs/agent-sandbox controller. Supports SandboxTemplate + WarmPool, switch between gVisor or Kata via RuntimeClass. Best for teams already running GKE clusters.
Self-hosted Firecracker cluster — use Firecracker Go SDK + gRPC sidecar pattern. Reference E2B's open-source architecture. Best for teams needing complete control over sandbox behavior and security policies.
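
For the self-hosted Firecracker route, it helps to see how small the control surface is: a microVM is configured through a REST API served on a Unix socket. The sketch below only assembles the request sequence for a minimal boot and does not send anything. The kernel and rootfs paths, the boot arguments, and the sizing defaults are placeholders, not values from this article.

```python
import json


def firecracker_boot_requests(kernel_path, rootfs_path, vcpus=1, mem_mib=128, read_only=False):
    """Return the (method, path, body) API calls for a minimal microVM boot, in order."""
    return [
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        }),
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_to_host_file": rootfs_path,
            "is_root_device": True,
            "is_read_only": read_only,
        }),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


# Placeholder paths for illustration.
for method, path, body in firecracker_boot_requests("/srv/vmlinux", "/srv/rootfs.ext4"):
    print(method, path, json.dumps(body))
```

Each entry would be sent to the Firecracker process's API socket, for example with `curl --unix-socket` or an HTTP client that supports Unix sockets; a pool manager then wraps this sequence behind its own boot function.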

Selection principle: if the team can operate Kubernetes and needs deep customization, self-host GKE Agent Sandbox or a Firecracker cluster; if it needs rapid time-to-market and accepts provider management, use E2B (the most mature open-source option); if it is already in the Docker ecosystem, use Docker Sandboxes.
