Agent Runtime Isolation: Docker, Firecracker, VM Sandbox — How to Choose
⚡ TL;DR — 30 Seconds
- The Docker default runtime (runc) shares the host kernel, so a single kernel CVE can lead to agent escape; it should not be used to execute untrusted code
- Production choices: gVisor (lightweight, userspace kernel interception) → Firecracker (hardware VM, ~125ms startup) → Kata Containers (Kubernetes native)
- Decision rule of thumb: LLM-generated code → at minimum gVisor; multi-tenant → Firecracker/Kata; text-only agents → hardened Docker is sufficient
1. Why Docker Isn't Enough
Multiple agent-security practices and OWASP GenAI risk discussions treat untrusted code execution as a high-risk scenario and recommend isolated execution environments for LLM-generated code. Docker is most teams' first instinct — but it also happens to be the most easily misused solution.
The isolation mechanism of standard Docker containers (runc runtime) is Linux namespaces + cgroups. The core problem with this mechanism: containers share the same Linux kernel with the host. Namespaces provide view isolation for processes, networks, and filesystems, but they are not security boundaries — they are management boundaries.
"Docker containers are not virtual machines. They are process wrappers that share a kernel."
What does this mean? A kernel vulnerability (CVE) that is reachable from inside a container can let malicious code break out to the host. AWS Lambda's design makes the same point from the other direction: Docker's shared kernel could not satisfy its multi-tenant isolation requirements, while QEMU (1.4 million lines of code, hundreds of emulated devices, and a continuous stream of CVEs) was too large an attack surface. AWS built Firecracker instead, an ultra-minimal VMM of about 50,000 lines of Rust code, and released it in 2018. AWS's choice sent a clear message: for scenarios involving untrusted code execution, even within a cloud provider, shared-kernel containers are not enough.
Docker itself has acknowledged this. In April 2026, Docker launched the Docker Sandboxes product — Docker Sandboxes use microVM isolation, isolated networking, and a per-sandbox Docker Engine.
This is the first key insight this article aims to establish: Docker is an excellent packaging and distribution tool, but its default runtime (runc) is not a security boundary for agent code execution. For AI agent scenarios — where agents can be induced by prompt injection to execute arbitrary code — we need stronger isolation.
The Agent Execution Security Chain — Review
In the first article of this series (Agent Code Sandbox Design), we built a five-boundary sandbox architecture — from process isolation to network isolation. In the second article (Agent Tool Permission Control), we defined which tools the agent can invoke. In the third article (Agent Command Execution Safety), we implemented a Policy Engine to review every command. All three of these layers operate within the same runtime environment.
This article answers the next-level question: what technology should isolate the runtime environment itself? When the Policy Engine approves a command, when the agent executes code inside the sandbox — how hard is the sandbox's underlying isolation boundary? Is it a container boundary sharing the host kernel, or a standalone hardware virtualization boundary?
2. The Isolation Spectrum: From WASM to Full VMs
Runtime isolation isn't a binary choice (isolated vs. not isolated) — it's a continuous spectrum. From ultra-lightweight language-level sandboxes to full hardware virtualization, each tier makes different tradeoffs among startup speed, memory overhead, and security strength.
Six Isolation Technologies at a Glance
| Technology | Isolation Mechanism | Cold Start | Memory Overhead | Host Kernel Exposure | Escape Blast Radius |
|---|---|---|---|---|---|
| WASI / WASM | Capability-based runtime | <1ms | <1MB | None | Within WASI interface |
| Cloudflare V8 Isolates | V8 runtime isolation | <1ms | ~1MB | None | Within V8 sandbox |
| Docker (runc) | Linux namespaces + cgroups | ~10–50ms | ~10MB | Fully shared | Host kernel CVE |
| gVisor (runsc) | Userspace kernel (Sentry) intercepts syscalls | ~100ms | ~20MB | Indirect (only via Sentry's restricted syscalls) | Sentry + kernel |
| Firecracker | KVM hardware virtualization microVM | ~125ms | ~5MB | None (KVM hardware) | Hypervisor CVE |
| Kata Containers | Lightweight VM (KVM) wrapping OCI interface | ~200ms | ~30MB | None (KVM hardware) | Hypervisor CVE |
(Data sources: Zylos Research 2026-04 systematic comparison; NumaVM 2026-03 Firecracker end-to-end benchmarks; NextKick Labs 2026-01 Firecracker vs Docker security comparison.)
Note: the numbers in this article come from public papers, vendor documentation, and third-party benchmarks. Treat them as order-of-magnitude guidance; real latency and overhead vary by host hardware, kernel version, image size, storage layer, network model, and workload type.
Key Dividing Lines on the Spectrum
There are two critical dividing lines on this spectrum:
The first dividing line: shared kernel vs. independent kernel. Docker (runc) sits on the left side — it shares the host kernel. WASM and V8 Isolates, while having a higher security grade than Docker (they don't directly expose Linux syscalls), also have an extremely narrow capability scope — WASM can't execute shell commands, can't access the filesystem (unless explicitly granted). gVisor, Firecracker, and Kata sit on the right side — they all provide some form of independent kernel boundary. This is the baseline requirement for agent production deployments.
The second dividing line: software interception vs. hardware isolation. gVisor intercepts syscalls through software (a userspace Go program called Sentry), while Firecracker and Kata rely on hardware virtualization (Intel VT-x / AMD-V). Software interception is more flexible and has lower overhead, but the attack surface exists at the software layer (Sentry itself could be compromised); hardware isolation provides a stronger boundary, but requires KVM support and additional memory overhead.
Why WASM and V8 Isolates Sit on the Far Left
WASM (WebAssembly) and V8 Isolates provide the highest-performance isolation — sub-millisecond startup time, under 1MB of memory overhead. But their capability scope is severely limited: WASM modules cannot directly invoke system calls; all external interactions (filesystem, network) must go through the WASI interface with explicit authorization. This makes them ideal for pure-computation sandboxes — for example, an agent calling a Python function for math or JSON processing — but unsuitable for agent tool invocations that require a full Linux execution environment.
For agent scenarios requiring full Linux capabilities — shell commands, pip installs, git operations, etc. — WASM is too narrow. What we need is: retain full Linux capability while providing stronger isolation than Docker. That's where gVisor and Firecracker sit.
3. MicroVMs in Practice: Firecracker and Kata Containers
If you could pick just one "sweet spot" from the technology stack — VM-level isolation strength combined with near-container startup speed — the answer is microVMs. Firecracker is the category definer; Kata Containers is its Kubernetes-native cousin.
Firecracker: 50,000 Lines of Rust Minimalism
AWS's motivation for releasing Firecracker in 2018 was straightforward: Lambda needed to run thousands of tenants' functions simultaneously. QEMU was too heavy (1.4 million lines of code, hundreds of emulated devices, continuous CVE disclosures), but Docker's shared kernel couldn't satisfy multi-tenant isolation requirements. Firecracker's design philosophy: only do the absolute minimum needed to create a microVM; cut everything else.
This is reflected in several hard numbers:
- 50,000 lines of Rust code (4% of QEMU), minimal attack surface
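- Only 5 emulated devices: virtio-block (block storage), virtio-net (networking), virtio-vsock (host-guest sockets), serial console, and a minimal keyboard controller used only to stop the microVM, compared to QEMU's hundreds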
- One independent VMM process per microVM — no shared daemon, a single point of failure doesn't cascade to other instances
- 150 microVM creations per second (per host), cold start ~125ms
- VMM process only ~5MB memory, supports 20× overcommit (tested), 10× overcommit (production)
Firecracker Security Architecture: Dual Concentric Rings
Firecracker's security design can be understood as two concentric rings:
┌──────────────────────────────────┐
│ Outer Ring: CPU Hardware Boundary │
│ Intel VT-x / AMD-V virtualization extensions │
│ Any VM escape must first breach the hardware boundary │
│ ┌────────────────────────────┐ │
│ │ Inner Ring: Jailer Sandbox │ │
│ │ • chroot filesystem isolation │ │
│ │ • seccomp limits to 24 syscalls │ │
│ │ • cgroup resource limits │ │
│ │ • isolated network namespace │ │
│ └────────────────────────────┘ │
└──────────────────────────────────┘
The outer ring is CPU hardware virtualization: hardware-enforced isolation between different VMs on the same CPU. The inner ring is the Jailer, a companion process shipped with Firecracker that further constrains the VMM process itself through chroot, seccomp (only 24 syscalls allowed), and cgroups. Even if an attacker breaks out of the microVM boundary into the VMM process, they still need to pierce through the Jailer's isolation to reach the host.
This dual-isolation philosophy comes from AWS's Lambda operational experience. Lambda handles trillions of function invocations per month, and Firecracker's multi-tenant isolation has been validated at hyperscale.
Real-World Adoption
| Platform | Isolation Technology | Cold Start | Session Limit | Notes |
|---|---|---|---|---|
| AWS Lambda | Firecracker microVM | ~125ms (cold) / <50ms (SnapStart) | 15 min | Trillions of monthly invocations, 10× overcommit in production |
| E2B | Firecracker microVM | ~80–200ms | 24 hours | Open source (12K+ stars), one of the hosted sandbox providers listed in the OpenAI Agents SDK documentation |
| Anthropic Computer Use | Firecracker (via E2B) | ~150ms | Per session | Desktop sandbox, provides graphical isolated environment for computer-use agents |
| Fly.io Machines | Firecracker microVM | ~125ms | Per machine | Global anycast network |
| Sprites.dev | Firecracker microVM | ~150ms | Unlimited | Checkpoint/rollback support |
| Docker Sandboxes | MicroVM (custom VMM) | ~200ms | Per session | Independent Docker daemon per sandbox + MITM TLS filtering proxy |
(Data sources: E2B GitHub & docs 2023–2026; AWS Compute Blog 2025-08; Docker Blog 2026-04-16.)
Kata Containers: Kubernetes-Native VM-Level Isolation
If Firecracker is "the microVM born in the Lambda scenario," Kata Containers is "the secure container grown in the Kubernetes ecosystem." Kata's core design: each container runs inside a dedicated lightweight VM, but exposes a standard OCI interface externally. To Kubernetes, Kata is just another RuntimeClass — you can use the same Pod spec and switch to a different isolation level.
# Kubernetes RuntimeClass three-tier isolation strategy
# (tier 1 is the default runc runtime, which needs no RuntimeClass)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: isolated # gVisor: medium isolation
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: highly-isolated # Kata: strong isolation
handler: kata
Kata 3.0 (released in 2022) was a major rework: it introduced a Rust-based runtime alongside the original Go runtime, together with the built-in Dragonball VMM. Some distributions and cloud-provider offerings use Dragonball for deep optimization.
- Alibaba Cloud reports, for its Dragonball-based secure sandbox compared to community Kata 2.0: 90% overhead reduction, 3× faster startup, 10× density improvement
- Supports multiple VMM backends: QEMU, Cloud Hypervisor, Firecracker, Dragonball
- GPU passthrough support, a capability Firecracker lacks
Kata's unique value lies in bridging the gap between the "container ecosystem" and "VM isolation." If you're already running agent workloads on Kubernetes, Kata is the smoothest path to upgrading isolation — just change the RuntimeClass.
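To make "just change the RuntimeClass" concrete, here is a minimal sketch that builds a Pod manifest as a plain Python dictionary. The helper name `sandbox_pod` and the Pod names are hypothetical; `runtimeClassName` is the standard Pod spec field, and `highly-isolated` is the RuntimeClass name from the YAML above.

```python
import json
from typing import Optional


def sandbox_pod(name: str, image: str, runtime_class: Optional[str]) -> dict:
    """Build a minimal Pod manifest.

    runtime_class=None leaves the field out, so the cluster default
    (typically runc) is used.
    """
    spec = {
        "containers": [{"name": "agent", "image": image}],
        "restartPolicy": "Never",
    }
    if runtime_class is not None:
        # The only line that differs between isolation tiers.
        spec["runtimeClassName"] = runtime_class
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


default_pod = sandbox_pod("agent-default", "python:3.12-slim", None)
kata_pod = sandbox_pod("agent-kata", "python:3.12-slim", "highly-isolated")
print(json.dumps(kata_pod, indent=2))
```

The rest of the Pod spec is identical across tiers, which is why the isolation level can be treated as a deployment-time setting rather than an application change.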
Firecracker vs. Kata: How to Choose?
| Dimension | Firecracker | Kata Containers |
|---|---|---|
| Startup Speed | ~125ms (no snapshot) / <50ms (snapshot restore) | ~200ms (Dragonball backend) |
| Memory Overhead | ~5MB (VMM) + guest kernel | ~30MB (VMM + kata-agent) |
| K8s Integration | Requires custom controller | Native RuntimeClass, community-maintained |
| GPU Support | Not supported | GPU passthrough supported |
| OCI Compatibility | Non-OCI (independent API) | Fully OCI-compatible |
| Best For | Custom agent sandbox platforms, serverless patterns | Kubernetes-native agent clusters |
Selection principle: building a custom agent sandbox platform → Firecracker; already on Kubernetes → Kata Containers. If you need both — for example, the Kubernetes ecosystem with Firecracker's extreme performance — Kata can use Firecracker as a VMM backend (Kata + Firecracker combination).
4. gVisor: The Art of Userspace Kernel Tradeoffs
gVisor takes a unique path: it doesn't isolate at the hardware level (unlike Firecracker with KVM), nor does it merely filter at the syscall level (unlike a seccomp filter at the kernel entry point). Instead, it reimplements a large part of the Linux kernel's system call interface in userspace.
Sentry: The Userspace Syscall Interceptor
gVisor's core component is called Sentry — a userspace kernel written in Go. When a process running inside a gVisor sandbox issues a system call:
- The process issues a system call (e.g., openat)
- The kernel's seccomp redirects the system call to Sentry (a userspace process)
- Sentry processes the system call in its own memory space — validates parameters, checks permissions, executes logic
- Sentry issues only the genuinely necessary operations to the host kernel through a restricted set of system calls
The key: applications inside the sandbox never issue system calls directly to the host kernel. All system calls are proxied by Sentry in userspace. An exploit for a host kernel CVE needs the vulnerable host code path to be reachable from the attacker's code; inside gVisor, the application's system calls terminate in Sentry, so the attacker must first compromise Sentry and then work within the restricted set of host system calls that Sentry itself is allowed to make.
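The proxying idea can be illustrated with a toy model. This is a conceptual sketch only, not how gVisor is implemented: `ToySentry`, `HOST_ALLOWED`, and the `/sandbox/` path rule are all invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist of operations the proxy may request from the host.
HOST_ALLOWED = {"read_file"}


@dataclass
class ToySentry:
    """Toy stand-in for a userspace kernel that handles app syscalls itself."""

    host_calls: list = field(default_factory=list)  # what actually reached the host

    def _host(self, op: str, *args) -> None:
        # The proxy itself is confined to a restricted set of host operations.
        if op not in HOST_ALLOWED:
            raise PermissionError(f"host operation {op!r} not allowed")
        self.host_calls.append((op, *args))

    def syscall(self, name: str, *args):
        if name == "getpid":
            return 1  # answered entirely in userspace, host never involved
        if name == "openat":
            path = args[0]
            if not path.startswith("/sandbox/"):
                raise PermissionError(f"{path} is outside the sandbox")
            self._host("read_file", path)  # only the necessary host operation
            return len(self.host_calls) + 2  # fake file descriptor
        raise NotImplementedError(f"syscall {name!r} is not implemented")


sentry = ToySentry()
pid = sentry.syscall("getpid")
fd = sentry.syscall("openat", "/sandbox/data.json")
print(pid, fd, sentry.host_calls)
```

The application sees a normal-looking syscall interface, while the host only ever sees the short list recorded in `host_calls`.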
Platform Modes: Systrap vs. KVM
gVisor offers two platform modes:
| Mode | Mechanism | Performance | Security | Recommended For |
|---|---|---|---|---|
| Systrap | seccomp redirects syscalls | 10–30% I/O overhead | Relies on seccomp | Recommended default since 2024, no KVM required |
| KVM | Hardware virtualization + Sentry in guest ring 0 | Lower I/O overhead (~10%) | Hardware + Sentry dual-layer | Environments needing extreme performance with KVM support |
gVisor's Performance Profile: Asymmetric Overhead
gVisor's performance overhead isn't evenly distributed — it shows significant asymmetry between compute and I/O (data from Zylos Research 2026-04 and Safeguard.sh 2023-12):
- Pure compute (CPU-bound): overhead <5%, near-native performance. Sentry only intervenes when a system call is actually triggered.
- File I/O: overhead 10–30%. Reason: every file operation traverses the "sandbox process → Sentry → Gofer (file proxy process) → host filesystem" path, adding 2 userspace round-trips compared to a direct system call.
- Network I/O: throughput drops 20–40%. Reason: Sentry implements its own TCP/IP stack (based on Google's netstack) rather than using the host's network stack.
- Syscall-intensive workloads: performance drops 2–5×. Reason: every system call requires context switching and processing inside Sentry.
What this means: gVisor is the best price-performance choice for compute-intensive agents (e.g., model inference, data processing), but for agents requiring heavy file I/O (e.g., code compilation, large-scale file processing), I/O overhead may become a bottleneck.
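A rough way to reason about this asymmetry is to weight each overhead by the share of time a workload spends in that category. The sketch below is a back-of-the-envelope model, not a benchmark: the multipliers are midpoints (or upper bounds) of the ranges quoted above, and the two workload mixes are invented examples.

```python
# Time multipliers derived from the ranges above (assumed values, not measured).
MULTIPLIER = {
    "compute": 1.05,               # <5% overhead, upper bound
    "file_io": 1.20,               # 10-30% overhead, midpoint
    "network": 1 / (1 - 0.30),     # 20-40% throughput drop, midpoint
    "syscall_heavy": 3.5,          # 2-5x slowdown, midpoint
}


def estimated_slowdown(mix: dict) -> float:
    """Weighted slowdown for a workload described as time fractions per category."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1.0")
    return sum(fraction * MULTIPLIER[kind] for kind, fraction in mix.items())


# Hypothetical workload mixes, expressed as fractions of native run time.
inference_mix = {"compute": 0.9, "file_io": 0.1}
build_mix = {"compute": 0.3, "file_io": 0.5, "syscall_heavy": 0.2}

for label, mix in [("inference-like", inference_mix), ("build-like", build_mix)]:
    print(f"{label}: ~{estimated_slowdown(mix):.2f}x native run time")
```

Measuring the actual time split of a representative agent task and plugging it in gives a first estimate before running a real benchmark under runsc.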
If GPU is required, verify current platform support first. gVisor has GPU support documentation, and Kata also has GPU passthrough paths; Firecracker is generally not the default choice for GPU passthrough.
gVisor's Syscall Coverage
A practical limitation of gVisor: it only implements about 200 Linux system calls (roughly 70% of the Linux system call interface). Most agent workloads only use a small subset of these, but certain edge cases may encounter unimplemented syscalls:
- ✅ Well supported: file operations (openat, read, write, close), process management (fork, execve, wait4), networking (socket, connect, sendto)
- ⚠️ Partially supported: certain ioctl subcommands, advanced networking features
- ❌ Not supported: kernel module operations (init_module), direct hardware access (iopl), certain special filesystem operations
Practical impact: the vast majority of Python/Node.js/Go agent workloads run well on gVisor. A third-party reverse-engineering analysis reported signs of gVisor in the OpenAI Code Interpreter environment.
gVisor Security Model: One Wall in Defense in Depth
From a security perspective, gVisor doesn't provide the hardest boundary (hardware virtualization), but rather the narrowest attack surface:
- Applications cannot directly call the host kernel — all syscalls are proxied by Sentry
- Sentry is a pure userspace Go program — a memory-safe language, avoiding the buffer overflow class of vulnerabilities common in C
- Gofer (the file proxy) runs with minimal privileges — can only access allowlisted directories
- Even if Sentry is compromised, the attacker is still inside a seccomp sandbox — with a limited set of syscalls
gVisor's positioning: substantially stronger than Docker's shared-kernel isolation, without requiring the hardware virtualization that Firecracker depends on. It's best suited for scenarios where you need stronger isolation than "shared host kernel," but your infrastructure doesn't support KVM (e.g., certain cloud environments, CI platforms), or you need GPU support.
5. Agent Runtime Integration Patterns
Once you've chosen an isolation technology, the next engineering question is: how does the agent's tool invocation connect to the isolated runtime? When the LLM decides to invoke a tool (e.g., executing Python code or a shell command), the actual execution should happen inside the isolated sandbox. There are five main patterns for this "handoff" process.
Five Integration Patterns Compared
| Pattern | Mechanism | Latency | Security Strength | Best For |
|---|---|---|---|---|
| API-managed | Call sandbox provider API (E2B, Modal) | +50–100ms RTT | Strong (provider-managed) | Cloud-hosted agents |
| exec local | Direct subprocess execution | ~0ms | Weakest | Local dev, trusted code |
| gRPC sidecar | Sidecar sandbox daemon | +1–5ms | Strong (local VM) | Self-hosted Firecracker clusters |
| OCI runtime | Docker/Podman + runsc/kata | +10–200ms | Medium–Strong | Kubernetes native |
| vsock | Inter-kernel communication (no network interface) | <1ms | Strong (no network exposure) | Firecracker host↔guest |
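One way to keep these patterns interchangeable is to put a small interface between the agent's tool dispatcher and the runtime. The sketch below is an assumption about how such a seam could look: `SandboxBackend`, `ExecResult`, and `execute_tool_call` are hypothetical names, and only the weakest pattern ("exec local") is implemented so that the example runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ExecResult:
    stdout: str
    stderr: str
    exit_code: int


class SandboxBackend(Protocol):
    """Anything that can run code and hand back its output."""

    def run(self, code: str, timeout_s: float = 10.0) -> ExecResult: ...


class LocalExecBackend:
    """'exec local' pattern: a plain subprocess with no isolation. Dev use only."""

    def run(self, code: str, timeout_s: float = 10.0) -> ExecResult:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: Python's isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return ExecResult(proc.stdout, proc.stderr, proc.returncode)


def execute_tool_call(backend: SandboxBackend, code: str) -> ExecResult:
    """The agent framework only depends on the interface, not the runtime."""
    return backend.run(code)


result = execute_tool_call(LocalExecBackend(), "print(1 + 1)")
print(result)
```

An API-managed, gRPC sidecar, or OCI-runtime backend would implement the same `run` method, so moving from local development to a stronger isolation tier becomes a configuration change.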
Pattern 1: API-Managed (E2B Pattern)
This is the most frictionless approach: use a managed sandbox service. The agent framework creates sandboxes, executes code, and retrieves results via API. E2B is the representative of this pattern — it provides Python/TypeScript SDKs; the agent just needs a single line: sandbox.run_code(code), while underneath it's a Firecracker microVM.
# E2B API pattern: agent code example
# run_code() is provided by the e2b-code-interpreter package; method names vary across SDK versions
from e2b_code_interpreter import Sandbox
# Create a Firecracker microVM sandbox (~80-200ms)
sandbox = Sandbox.create(template="python-3.12")
# Agent-generated code executes inside the isolated microVM
result = sandbox.run_code("""
import os
import subprocess
# Even dangerous operations inside the sandbox can't affect the host
subprocess.run(["ls", "-la"])
print(os.getcwd())
""")
print(result.logs) # Only stdout/stderr can be retrieved
sandbox.kill() # Destroy the microVM
The advantages of this pattern are clear — zero ops, fast integration, and validated at scale (E2B is one of the hosted sandbox clients/providers listed in the OpenAI Agents SDK documentation). The cost is 50–100ms of network RTT per invocation, and data produced inside the sandbox must be transmitted back via API.
Pattern 2: gRPC Sidecar (Self-Hosted Firecracker)
If you need to self-host a sandbox platform (for compliance, cost, or customization reasons), the gRPC sidecar pattern provides low-latency local Firecracker integration. The architecture looks like this:
┌──────────────────────────────────────────────────────┐
│ Host │
│ ┌─────────────┐ gRPC ┌─────────────────────┐ │
│ │ Agent Process │ ←─────────→ │ Sandbox Daemon │ │
│ │ (Python/Go) │ (localhost) │ (manages microVM lifecycle)│ │
│ └─────────────┘ │ ┌─────────────────┐ │ │
│ │ │ Firecracker VMM │ │ │
│ │ │ ┌───────────┐ │ │ │
│ │ │ │ microVM │ │ │ │
│ │ │ │ (agent code)│ │ │ │
│ │ │ └───────────┘ │ │ │
│ │ └─────────────────┘ │ │
│ └─────────────────────┘ │
└──────────────────────────────────────────────────────┘
The Sandbox Daemon maintains a warm pool of pre-started microVMs to eliminate cold-start latency. When an agent requests code execution, the Daemon pulls a pre-started microVM from the pool, injects the code, executes it, returns the result, then returns the microVM to the pool or destroys and recreates it.
Abhishek Dadwal validated this pattern's performance in a January 2026 real-world report: through VM pooling, per-request latency dropped from 8,700ms to 500ms — a 17× improvement. With a goroutine concurrency pool (10 concurrent workers), throughput can reach dozens of agent code execution requests per second.
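The warm-pool mechanism itself is small. Below is a minimal sketch under stated assumptions: `FakeSandbox` stands in for a pre-started microVM, the class and method names are hypothetical, and the pool uses the destroy-and-recreate option described above rather than reusing a sandbox that has run untrusted code.

```python
import itertools
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeSandbox:
    """Placeholder for a pre-started microVM."""

    sandbox_id: int
    dirty: bool = False


class WarmPool:
    def __init__(self, factory: Callable[[], FakeSandbox], size: int):
        self._factory = factory
        self._idle: "queue.Queue[FakeSandbox]" = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())  # pay the cold-start cost up front

    def acquire(self) -> FakeSandbox:
        try:
            return self._idle.get_nowait()  # warm path: already running
        except queue.Empty:
            return self._factory()  # cold path: pool exhausted

    def release(self, used: FakeSandbox) -> None:
        # A real daemon would tear down `used` here; we only replace it
        # with a fresh instance so the next request never sees old state.
        self._idle.put(self._factory())


_ids = itertools.count(1)
pool = WarmPool(lambda: FakeSandbox(next(_ids)), size=2)

sb = pool.acquire()
sb.dirty = True  # pretend agent code ran inside it
pool.release(sb)
```

A production daemon would refill the pool in the background and cap its size, but the acquire/release contract stays the same.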
Pattern 3: OCI Runtime (Kubernetes Native)
If your agent infrastructure already runs on Kubernetes, the OCI runtime pattern is the most natural integration. Google's kubernetes-sigs/agent-sandbox controller (released November 2025) productized this pattern:
# Simplified illustration; check the agent-sandbox CRD reference for the exact API group and field names
apiVersion: agentsandbox.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: python-agent
spec:
  runtimeClass: kata # or runsc (gVisor)
  image: python:3.12-slim
  resources:
    cpu: "1"
    memory: "512Mi"
---
apiVersion: agentsandbox.io/v1alpha1
kind: WarmPool
metadata:
  name: agent-warm-pool
spec:
  templateRef: python-agent
  minSize: 5 # Always maintain 5 warm sandboxes
  maxSize: 20
The agent framework requests execution through the sandbox.run() API; underneath, the controller pulls a pre-warmed Pod from the WarmPool, injects code, executes it, and returns results. The advantage of this pattern: full integration into the Kubernetes ecosystem — monitoring, logging, resource limits, and scaling all use native K8s mechanisms.
Pattern 4: vsock (Zero Network Exposure)
Firecracker supports vsock (virtio-vsock), a host-guest communication channel that bypasses the network stack entirely. Traditional TCP/IP communication between host and guest must traverse virtual NICs, network namespaces, and network policies; vsock bypasses all of these.
Security value: vsock creates no network interface. If the microVM is configured without a network device, an agent inside it that attempts network scanning or outbound connections has nothing to send packets over, and it cannot route traffic through vsock (vsock is a strictly point-to-point channel). The host side can precisely control what is received and sent over vsock.
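Because vsock is a plain byte stream between two endpoints, the host and guest need to agree on message framing. The sketch below shows one common choice, length-prefixed JSON. It is an illustration only: `socket.socketpair()` stands in for the real channel so the example runs without a microVM (in a real deployment the guest end would be an `AF_VSOCK` socket and the host end the Unix socket Firecracker exposes), and the message fields are invented.

```python
import json
import socket
import struct


def send_msg(sock: socket.socket, payload: dict) -> None:
    """Send one JSON message prefixed with its 4-byte big-endian length."""
    body = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack("!I", len(body)) + body)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the channel")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock: socket.socket, max_len: int = 1 << 20) -> dict:
    """Receive one framed message, rejecting oversized frames."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    if length > max_len:
        raise ValueError("message too large")
    return json.loads(_recv_exact(sock, length))


# Stand-in for the host<->guest channel.
host, guest = socket.socketpair()
send_msg(host, {"op": "run", "code": "print(6 * 7)"})
request = recv_msg(guest)
send_msg(guest, {"ok": True, "echo": request["op"]})
reply = recv_msg(host)
host.close()
guest.close()
```

The size cap in `recv_msg` matters on the host side: the guest is untrusted, so the host should never allocate memory based on an unchecked length field.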
OpenAI's Harness/Sandbox Separation Pattern
OpenAI Agents SDK introduces an architectural concept worth discussing separately: separation of harness (control plane) from sandbox (compute plane).
- Harness: control plane — LLM invocation, tool routing, user approval, security policy evaluation. This is "the smart part."
- Sandbox: compute plane — code execution, file operations, shell commands. This is "the potentially dangerous part."
Key design decision: credentials are injected into the sandbox as runtime configuration, not as prompt content. Even if an attacker reads the agent's context (system prompt, conversation history) through prompt injection, they cannot obtain API keys or database passwords, because this information is injected only at sandbox creation time via environment variables and never appears in the prompts.
This is an important security pattern: separate the credential channel from the data channel. The data channel (prompts, LLM output, tool call parameters) may be observed or manipulated by attackers; the credential channel (environment variable injection, secret manager mounts) flows one way, from the harness into the sandbox, and is never exposed in the model's context. Code running inside the sandbox can still read what was injected, so credentials should be scoped to the minimum the task needs.
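The separation can be expressed as two builders that never share inputs. This is a minimal sketch, not the OpenAI Agents SDK API: `SandboxSpec`, `build_prompt`, `build_sandbox_spec`, and `leaked_secret_names` are hypothetical names, and the secret value is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SandboxSpec:
    image: str
    env: dict = field(default_factory=dict)  # credential channel


def build_prompt(system_prompt: str, user_message: str) -> str:
    """Data channel: takes no secrets as input, so none can end up in it."""
    return f"{system_prompt}\n\nUser: {user_message}"


def build_sandbox_spec(image: str, secrets: dict) -> SandboxSpec:
    """Credential channel: secrets go only into the sandbox environment."""
    return SandboxSpec(image=image, env=dict(secrets))


def leaked_secret_names(prompt: str, secrets: dict) -> list:
    """Defensive check: names of secrets whose values appear in the prompt."""
    return [name for name, value in secrets.items() if value in prompt]


secrets = {"DB_PASSWORD": "example-not-a-real-secret"}
prompt = build_prompt("You are a data analysis agent.", "Summarize last week's orders.")
spec = build_sandbox_spec("python:3.12-slim", secrets)
leaked = leaked_secret_names(prompt, secrets)
print(leaked)
```

Running a check like `leaked_secret_names` on every outgoing prompt is a cheap guard against a later refactor accidentally merging the two channels.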
6. Performance Tradeoffs at Scale
"Stronger isolation means worse performance" — this is most people's intuition about security technology. In the context of agent runtime isolation, this intuition is broadly correct, but the magnitude may be much smaller than you think.
The Real Composition of Cold Start Latency
Cold start is the core performance metric for isolation technologies — it determines the wait time between an agent "deciding to execute code" and "code starting to run." But "cold start" means completely different things across different technologies:
| Technology | Cold Start (single) | Cold Start (warm pool) | Per-Instance Memory | CPU Overhead |
|---|---|---|---|---|
| Docker (runc) | 10–50ms | N/A | ~10MB | ~0% |
| gVisor | ~100ms | N/A | ~20MB | 10–30% (I/O), <5% (compute) |
| Firecracker | ~125ms | <50ms (snapshot) | ~5MB | 3–11% |
| Kata (QEMU) | ~500ms | N/A | ~50MB | 5–15% |
| Kata (Firecracker) | ~125–200ms | N/A | ~5MB | 3–11% |
| Kata (Dragonball) | ~200ms | <100ms | ~30MB | ~5% |
| Traditional VM (QEMU) | 3–60s | N/A | GB-scale | 5–20% |
(Data sources: NumaVM 2026-03 Firecracker end-to-end benchmarks; arXiv:2602.15214 Docker startup analysis; NextKick Labs 2026-01; Alibaba Cloud Kata 3.0 release announcement.)
An Important Correction: Docker's Actual Cold Start
Many people believe Docker containers are "instant-start" (~10ms). That figure is accurate, but only for a small part of the startup path. The arXiv:2602.15214 study was the first to systematically decompose Docker container startup latency: kernel namespace creation takes only 8–10ms (less than 1.5% of total time). The real bottleneck is storage-layer operations: image layer mounting and filesystem preparation consume 300–800ms.
What this means: from a user's perspective, the end-to-end cold start of a Docker container and that of a Firecracker microVM (~125ms boot, or a snapshot restore) do not differ by an order of magnitude. Firecracker snapshot restore (176ms) can even be faster than the cold start of certain Docker images.
Warm Pool: The Silver Bullet for Eliminating Cold Start
A warm pool is the most effective technique for solving cold-start problems — pre-start a set of sandbox instances, and when an agent request arrives, directly allocate an already-running instance. Its effect is dramatic:
- AWS Lambda SnapStart: Java function cold start dropped from 6,100ms to 1,400ms (4.4× improvement); for Firecracker microVMs, snapshot restore takes only 176ms (of which snapshot loading is just 25ms, achieved via mmap)
- VM pooling (Abhishek Dadwal 2026-01): per-request latency dropped from 8,700ms to 500ms (17× improvement)
- Google GKE Agent Sandbox: through SandboxTemplate + WarmPool, sub-second sandbox dispatching
Instance Density: Memory Is the Real Bottleneck
A factor easily overlooked when selecting an isolation technology is instance density — how many sandboxes can run simultaneously on a single host. This directly determines infrastructure cost. NextKick Labs' January 2026 measured data (80GB host memory):
| Technology | Per-Instance Memory | Instances on 80GB Host |
|---|---|---|
| Docker (runc) | ~40MB | ~2,000 |
| Firecracker | ~45MB | ~1,778 |
| Kata (QEMU) | ~165MB | ~485 |
| Kata (Dragonball) | ~80MB | ~1,000 |
Key finding: Firecracker's instance density is close to Docker's (1,778 vs. 2,000, about 11% lower). Kata 3.0, through the Dragonball VMM, roughly doubled density relative to the QEMU backend (485 → 1,000). For AI agent workloads, this means that using Firecracker instead of Docker does not significantly increase infrastructure costs.
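The density column is simple division, which makes it easy to redo for a different host size. The sketch below reproduces the table's figures, assuming 80GB is counted as 80,000MB and ignoring memory reserved for the host itself.

```python
HOST_MEMORY_MB = 80_000  # 80GB host, decimal megabytes (assumption)

# Per-instance memory from the table above, in MB.
PER_INSTANCE_MB = {
    "Docker (runc)": 40,
    "Firecracker": 45,
    "Kata (QEMU)": 165,
    "Kata (Dragonball)": 80,
}


def instances_per_host(per_instance_mb: float, host_mb: float = HOST_MEMORY_MB) -> int:
    """Memory-bound instance count, rounded to the nearest whole instance."""
    return round(host_mb / per_instance_mb)


density = {tech: instances_per_host(mb) for tech, mb in PER_INSTANCE_MB.items()}
relative_to_docker = density["Firecracker"] / density["Docker (runc)"]

for tech, count in density.items():
    print(f"{tech}: {count}")
print(f"Firecracker vs. Docker density: {relative_to_docker:.1%}")
```

In practice the usable figure is lower, since the host OS, the orchestrator, and headroom for bursts all take memory before any sandbox does.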
When Is the Performance Overhead Not Worth It?
Isolation is not free. In the following scenarios, the extra isolation overhead may not be justified:
- Text-only agents (no tool calls): If the agent doesn't execute code, invoke shells, or access filesystems, the extra isolation overhead is wasted. Docker + seccomp is sufficient.
- Trusted code execution: If the agent only executes team-authored, code-reviewed code (e.g., internal automation scripts), Docker + hardened seccomp + dropped capabilities provides adequate protection.
- I/O-intensive batch processing: If the agent's core work is file processing (e.g., large-scale ETL), gVisor's 10–30% I/O overhead may become a bottleneck. In this case, containers (zero I/O overhead) or Firecracker (~3–11% CPU overhead, near-native I/O) are more suitable.
- Sub-10ms latency requirements: If agent execution latency must be below 10ms, only WASM or containers can meet this. But note: in most agent scenarios, LLM inference latency (hundreds of milliseconds to seconds) far exceeds sandbox startup latency.
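For the cases above where a hardened container is enough, "hardened" can be made explicit as a fixed set of `docker run` options. The sketch below only assembles the argument list and does not invoke Docker; the helper name and the specific limits (128 PIDs, 512MB) are illustrative choices, while the flags themselves are standard Docker CLI options.

```python
import shlex
from typing import List, Optional


def hardened_docker_argv(
    image: str,
    command: List[str],
    runtime: Optional[str] = None,
    seccomp_profile: Optional[str] = None,
) -> List[str]:
    """Build a `docker run` command line with a restrictive baseline."""
    argv = [
        "docker", "run", "--rm",
        "--cap-drop", "ALL",                      # drop all Linux capabilities
        "--security-opt", "no-new-privileges",    # block setuid privilege gains
        "--read-only",                            # read-only root filesystem
        "--network", "none",                      # no network access
        "--pids-limit", "128",                    # cap process count
        "--memory", "512m",                       # cap memory
    ]
    if seccomp_profile is not None:
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    if runtime is not None:
        argv += ["--runtime", runtime]            # e.g. "runsc" for gVisor
    argv.append(image)
    argv.extend(command)
    return argv


argv = hardened_docker_argv(
    "python:3.12-slim", ["python", "-c", "print('hi')"], runtime="runsc"
)
print(shlex.join(argv))
```

Passing `runtime="runsc"` shows how the same command line moves from hardened runc to gVisor once the workload starts executing LLM-generated code.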
7. Decision Framework: Choose Isolation by Risk Level
After six chapters of technical analysis, the final question returns to an engineering decision: which isolation should my agent use? This isn't a technology question — it's a risk-matching question. The framework below maps agent capability scenarios to recommended isolation strategies.
Risk Level × Capability Scenario
| Agent Capability | Risk Level | Recommended Isolation | Rationale |
|---|---|---|---|
| Text-only, no tools | Low | Docker + seccomp | Minimal attack surface, no extra isolation cost needed |
| Trusted code execution (internal scripts) | Medium | Docker + hardened seccomp + drop all caps | Known code, controlled dependencies, hardened container sufficient |
| LLM-generated code execution | High | gVisor (minimum) / Firecracker (recommended) | Unpredictable syscall patterns, requires syscall-level interception or hardware isolation |
| Multi-tenant code execution | Critical | Firecracker / Kata Containers | Must provide an independent kernel boundary for each tenant |
| Finance / Healthcare / PII | Critical | Firecracker + egress allowlist + secret injection | Compliance requires VM-level boundary |
| GPU-accelerated AI Agent | High | gVisor (GPU support) or Kata | Firecracker lacks GPU passthrough |
| Plugin / extension system | High | WASM or Firecracker | Capability confinement or hardware isolation |
| Browser-side agent | Low–Medium | WASM (inherits browser sandbox) | Browser built-in isolation |
Seven Decision Rules
Here are seven hard decision rules — each directly corresponds to a yes/no judgment, helping you quickly narrow down choices in specific scenarios:
- Is the code LLM-generated? → Yes: at minimum gVisor; for production multi-tenant scenarios use Firecracker. Never use bare Docker.
- Do tenants share infrastructure? → Yes: independent kernel boundary required → Firecracker or Kata.
- Is GPU passthrough needed? → Yes: exclude Firecracker → gVisor (added GPU support 2024–2025) or Kata.
- Is Kubernetes the orchestration layer? → Yes: use the kubernetes-sigs/agent-sandbox controller; switch between Kata or gVisor via RuntimeClass.
- Is sub-10ms startup required? → Yes: containers or WASM; Firecracker snapshot restore still requires ~176ms.
- Is the workload compute-intensive with low I/O? → Yes: gVisor provides the best "performance-to-isolation ratio."
- Are there compliance audit requirements? → Yes: VM boundaries (Firecracker/Kata) are standard isolation that auditors can understand and verify.
Decision Tree
Does the agent execute code?
├─ No  → Docker + seccomp
└─ Yes → Code source?
   ├─ Human-written → Docker + hardened
   └─ LLM-generated → Multi-tenant?
      ├─ Yes → Firecracker or Kata
      └─ No  → Need GPU?
         ├─ Yes → gVisor or Kata
         └─ No  → Firecracker or gVisor
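The same decision tree can be written as a small function, which is convenient when the choice has to be made per workload at deploy time. This is a direct transcription of the tree, with hypothetical names (`AgentProfile`, `recommend_isolation`); it follows the tree's branch order, so the multi-tenant question is asked before the GPU question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    executes_code: bool
    llm_generated: bool = False
    multi_tenant: bool = False
    needs_gpu: bool = False


def recommend_isolation(profile: AgentProfile) -> str:
    """Walk the decision tree from the root and return the leaf label."""
    if not profile.executes_code:
        return "Docker + seccomp"
    if not profile.llm_generated:
        return "Docker + hardened"
    if profile.multi_tenant:
        return "Firecracker or Kata"
    if profile.needs_gpu:
        return "gVisor or Kata"
    return "Firecracker or gVisor"


chatbot = AgentProfile(executes_code=False)
code_agent = AgentProfile(executes_code=True, llm_generated=True)
saas_agent = AgentProfile(executes_code=True, llm_generated=True, multi_tenant=True)

for profile in (chatbot, code_agent, saas_agent):
    print(profile, "->", recommend_isolation(profile))
```

Compliance or latency requirements from the rules above would be additional filters applied on top of this result.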
Open Source Tool Quick Reference
| Tool | Category | Language | Stars | Description |
|---|---|---|---|---|
| Firecracker | VMM (MicroVM) | Rust | ~33.8K | AWS-built ultra-minimal VMM for KVM microVMs |
| gVisor | Userspace kernel | Go | ~18.1K | Google's OCI-compatible syscall interceptor |
| Kata Containers | OCI runtime + VM | Rust | ~7.8K | CNCF project, supports multiple VMM backends |
| Cloud Hypervisor | VMM | Rust | ~5.4K | Intel-led modern VMM |
| youki | Container runtime | Rust | ~6K | Rust rewrite of runc |
| Dragonball | VMM | Rust | — | Alibaba VMM, Kata 3.0 default backend |
Alibaba Cloud Secure Sandbox: Kata 3.0 Validated in Practice
Alibaba Cloud Container Service (ACK) runs a secure sandbox runtime based on Kata Containers and the Dragonball VMM, and it reports production results at massive scale: compared to community Kata 2.0, Alibaba Cloud Secure Sandbox v2 achieved a 90% overhead reduction, 3× faster startup, and a 10× density improvement. This suggests that with the right VMM choice (Dragonball replacing QEMU) and deep optimization, microVM isolation can approach container-level efficiency in large-scale production environments.
Frequently Asked Questions
1. Is Docker with a seccomp profile enough?
No. seccomp can only filter system calls — it's an interception layer at the syscall entry point. But seccomp cannot change a fundamental fact: Docker containers (runc runtime) share the same Linux kernel with the host.
Attackers can bypass seccomp-only defenses through the following paths:
- Kernel exploit: Any Linux kernel CVE (e.g., Dirty Pipe, Dirty COW) can be triggered from within a container, since the container directly accesses the host kernel. seccomp cannot defend against kernel vulnerabilities — the bug is in the kernel code, executing after seccomp's check.
- Allowed syscall combination attacks: seccomp allowlists typically permit 100–200 syscalls. Even after excluding obviously dangerous calls (mount, ptrace), attackers can still construct attacks through combinations of allowed calls, for example using openat + write to overwrite sensitive files.
- seccomp configuration gaps: Docker's default seccomp profile blocks around 44 of the 300+ Linux syscalls, so the large majority remain available. The attack surface is still substantial.
The correct approach: Use seccomp as one layer in defense in depth, not as the sole defense. Low-risk scenarios (personal tool agents): Docker + seccomp + drop all capabilities + read-only rootfs + AppArmor. Medium-to-high-risk scenarios (LLM-generated code execution): gVisor or Firecracker — they provide an independent execution kernel, not just syscall filtering.
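The low-risk hardening stack listed above maps onto a handful of `docker run` flags. The sketch below only assembles the command line; it does not run anything. The seccomp profile path and the image are placeholders, and `docker-default` is assumed to be the AppArmor profile loaded on the host. The optional `runtime` argument shows where gVisor's `runsc` would be selected if it is installed and registered with Docker.

```python
import shlex
from typing import List, Optional


def hardened_docker_cmd(
    image: str,
    command: List[str],
    seccomp_profile: str = "seccomp.json",      # placeholder path
    apparmor_profile: str = "docker-default",   # assumed to exist on the host
    runtime: Optional[str] = None,              # e.g. "runsc" for gVisor
) -> List[str]:
    """Assemble a `docker run` argv with the hardening layers applied."""
    argv = [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                                 # drop all capabilities
        "--read-only",                                    # read-only root filesystem
        "--security-opt", f"seccomp={seccomp_profile}",   # custom seccomp profile
        "--security-opt", f"apparmor={apparmor_profile}", # AppArmor confinement
    ]
    if runtime:
        argv.append(f"--runtime={runtime}")
    argv.append(image)
    argv.extend(command)
    return argv


if __name__ == "__main__":
    cmd = hardened_docker_cmd("python:3.12-slim", ["python", "-c", "print('hi')"])
    print(shlex.join(cmd))
```

Returning an argv list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting issues.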
2. How to choose between Firecracker and gVisor?
The core difference is the isolation mechanism: gVisor intercepts system calls through a userspace kernel (Sentry) — software-level isolation; Firecracker provides a true VM boundary through KVM hardware virtualization — hardware-level isolation.
Choose gVisor when:
- GPU passthrough is needed (Firecracker does not support GPU passthrough)
- Compute-intensive workloads with low I/O — gVisor CPU overhead <5%, best price-performance ratio
- Infrastructure does not support KVM (e.g., certain cloud environments, CI platforms) — gVisor can use Systrap mode without KVM
- Single-tenant scenarios that don't require hardware isolation for compliance
Choose Firecracker when:
- Multi-tenant platforms — each tenant must have an independent kernel boundary
- Executing untrusted LLM-generated code — the hardware VM boundary is easier to justify in audits because VM boundaries are familiar and independently verifiable
- Finance/healthcare/PII data processing: compliance frameworks (SOC 2, HIPAA) do not mandate a specific isolation technology, but VM-level isolation is typically the easiest control to evidence in an audit
- Extremely low memory overhead is needed (Firecracker VMM ~5MB vs gVisor ~20MB)
When neither is suitable: sub-10ms startup latency required → use containers or WASM; full Linux compatibility needed with acceptable GB-level memory → traditional VMs.
3. What impact do microVMs have on CI/CD pipelines?
The impact of microVMs on CI/CD pipelines is manageable and centers on three areas:
1. Image build process changes: Firecracker uses a minimal rootfs (Alpine Linux ~63MB vs Ubuntu ~300MB), built differently from traditional Docker images. You'll need to maintain a rootfs build pipeline (using debootstrap or buildroot), but this can be integrated into existing CI — make rootfs building a CI pipeline stage, with artifacts uploaded to object storage.
2. Real-world impact of startup latency: A single microVM cold start is 125–200ms. For a CI task that runs 10 seconds or longer, that is under 2% of wall-clock time; shorter tasks pay proportionally more. If your CI pipeline uses warm pools, the latency is negligible. Note: a Docker container's real cold start (storage-layer operations take 300–800ms) can exceed Firecracker's.
3. Simplified resource cleanup: microVMs auto-destroy on exit — no residual processes, files, or network state. This actually simplifies CI environment cleanup. No need for docker rm -f or worrying about dangling volumes.
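The startup-latency point in item 2 is simple arithmetic, and it is worth running against your own task durations. A minimal sketch, using the 125–200ms cold-start range quoted above; the task durations are illustrative:

```python
def startup_share(startup_s: float, task_s: float) -> float:
    """Fraction of total wall-clock time spent on sandbox startup."""
    if startup_s < 0 or task_s <= 0:
        raise ValueError("startup_s must be >= 0 and task_s must be > 0")
    return startup_s / (startup_s + task_s)


if __name__ == "__main__":
    for task_s in (1, 10, 60):            # illustrative task durations, seconds
        for startup_s in (0.125, 0.2):    # cold-start range quoted in the text
            share = startup_share(startup_s, task_s)
            print(f"task={task_s:>2}s startup={startup_s * 1000:.0f}ms share={share:.1%}")
```

The share falls below 2% once tasks run for about 10 seconds, while a 1-second task with a 200ms cold start spends roughly a sixth of its wall-clock time starting up. That gap is where warm pools pay off.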
Recommended approach: use VM pooling (Abhishek Dadwal reports a 17× speedup with this practice). Pre-allocate microVMs when the CI agent starts; after each run, either return the VM to the pool or destroy and recreate it. Google GKE Agent Sandbox's WarmPool pattern can be directly reused.
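A minimal sketch of the pooling pattern, assuming the destroy-and-recreate variant (the safer choice when the VM has run untrusted code). The `create` and `destroy` callables are hypothetical hooks standing in for whatever boots and tears down a microVM in your setup; this is not the GKE WarmPool API.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keep `size` pre-booted sandboxes ready; never reuse a used one."""

    def __init__(self, create: Callable[[], T], destroy: Callable[[T], None], size: int):
        self._create = create
        self._destroy = destroy
        self._size = size
        self._ready: "queue.Queue[T]" = queue.Queue()
        for _ in range(size):
            self._ready.put(create())   # pay the cold-start cost up front

    def acquire(self) -> T:
        """Hand out a warm sandbox, falling back to a cold start if empty."""
        try:
            return self._ready.get_nowait()
        except queue.Empty:
            return self._create()

    def release(self, vm: T) -> None:
        """Destroy the used sandbox and top the pool back up."""
        self._destroy(vm)
        if self._ready.qsize() < self._size:
            self._ready.put(self._create())


if __name__ == "__main__":
    import itertools

    ids = itertools.count(1)
    destroyed = []
    # Fake sandboxes: strings instead of real microVM handles.
    pool = WarmPool(create=lambda: f"vm-{next(ids)}", destroy=destroyed.append, size=2)
    vm = pool.acquire()
    pool.release(vm)
    print(vm, destroyed)
```

In a real deployment the refill in `release` would usually run in the background so the caller does not wait for a boot; it is kept synchronous here for brevity.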
4. Are there any ready-to-use agent sandbox services?
Yes, split into self-hosted vs. managed categories:
Managed services (fastest time to value):
- E2B (recommended) — open source (GitHub 12K+ stars, 480 releases), Firecracker-based agent sandbox platform. Cold start 80–200ms, supports 24h persistent sessions. One of the hosted sandbox providers listed in the OpenAI Agents SDK documentation. Provides Python/TypeScript SDK and MCP server support. Free tier available for trial.
- Docker Sandboxes — Docker's official MicroVM sandbox service launched in 2026. Independent Docker daemon per sandbox, natively supports macOS/Windows/Linux. Best for teams already in the Docker ecosystem.
- Northflank: Kata Containers-based agent sandbox platform, supports GPU and BYOC (bring your own cloud).
Self-hosted options (maximum control):
- GKE Agent Sandbox: Google Cloud's kubernetes-sigs/agent-sandbox controller. Supports SandboxTemplate + WarmPool, switch between gVisor or Kata via RuntimeClass. Best for teams already running GKE clusters.
- Self-hosted Firecracker cluster: use Firecracker Go SDK + gRPC sidecar pattern. Reference E2B's open-source architecture. Best for teams needing complete control over sandbox behavior and security policies.
Selection principle: team has Kubernetes operational ability and needs deep customization → self-host GKE Agent Sandbox or Firecracker cluster; need rapid time-to-market and accept provider management → E2B (most mature open-source option); already in the Docker ecosystem → Docker Sandboxes.