Agent Human Approval Workflow: When Agents Should Pause, Ask, and Continue

30-Second Takeaway

  • Core Problem: Production agents execute irreversible actions without circuit breakers. Teams need a human approval layer above tool permissions — without creating approval fatigue or human bottlenecks.
  • The Solution: Four-level risk-based gating (AUTO → LOW_RISK → HIGH_RISK → CRITICAL) + formal approval state machine + structured ApprovalRequest schema + timeout escalation chains. Framework-agnostic, integrates with any agent architecture.
  • Key Implementation: State-machine-driven approval lifecycle (IDLE→PROPOSING→WAITING→APPROVED/REJECTED/MODIFIED/EXPIRED), idempotency protection, approval request context enrichment, and the Autonomy Ladder for progressive trust building.
  • What You'll Walk Away With: A production-ready human-in-the-loop design — when to pause an agent, how to structure approval requests, how to handle timeouts and escalation, and how to test approval workflows before deploying.

1. Why Full Autonomy Is a Trap

At 2 AM on a weekday in 2025, an ops agent completed what looked like a routine maintenance task: purging temp tables older than 30 days. It identified a table named tmp_backup_2025_04_01 and issued DROP TABLE. Everything appeared normal.

Then it kept going. The agent had discovered a pattern that worked, so it scanned the database for every table prefixed with tmp_ and deleted them one by one. Among them: tmp_prod_migration — a table a DBA had manually created the previous day for an actively running production data migration. The data hadn't finished moving. Four hours of downtime followed.

No brakes. No “are you sure?” dialog. The agent was designed under the assumption that all its operations are safe and reversible — an assumption that holds in a sandbox and fails the moment you point the agent at production.

Core Insight: The right question isn't “How do we remove humans?” — it's “Where do humans add the most value?” Full autonomy is not the goal. Controlled autonomy is.

Compounding Errors: What Happens Without a Circuit Breaker

A single bad agent action is bad enough. But the most dangerous property of autonomous agent systems is compounding error: the output of one operation becomes the input for the next, and errors amplify along the chain.

In the database incident above, the agent's first DROP TABLE succeeded, and that success reinforced its confidence that subsequent drops were safe. There was no mechanism to check whether a table was still in use. Every successful deletion made the next one feel more justified, right up until catastrophe. Autonomous vehicle programs track “disengagements”: the moments when a human driver takes control back from the system. An agent with no approval gate has no equivalent takeover point, so it self-reinforces without external validation.

This is why circuit breakers matter: they insert an external judgment point — a human approval gate — at specific nodes in the operation chain. Not every git commit needs a manager's sign-off. But at defined risk thresholds, the system should pause, ask, and continue — not barrel through.

The Trust Dynamic: Start Tight, Loosen Over Time

When teams first introduce human-in-the-loop approval, the instinctive objection is: “This will make everything too slow. The whole point of an agent is to automate.” There's some truth to that instinct — if every operation waits on a human, the agent's value proposition collapses.

The answer isn't a binary “approve everything or approve nothing.” It's a progressive trust curve — what the Mindra framework calls the Autonomy Ladder:

This pattern mirrors how database query optimizers work: when statistics are fresh and reliable, the optimizer makes aggressive automatic decisions. When statistics are stale or missing, it falls back to conservative defaults. Your approval architecture should behave the same way.

Anti-Pattern: The System Prompt “Soft Brake”

A temptingly common approach: just add “Ask the user before deleting anything” to the system prompt. No code to write, one line of text — what could go wrong?

Everything. LLMs are not rule engines:

What this article builds instead is runtime enforcement: not constraints inside the LLM's “brain,” but an independent approval gate sitting above the tool execution layer. The LLM can reason, suggest, and propose — but for high-risk operations, execution must pass through a separate approval channel. That channel is not LLM-dependent. It's driven by a deterministic state machine and structured rules.
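
A minimal sketch of that idea, assuming a hypothetical requires_gate decorator and HIGH_RISK_TOOLS set (neither comes from a real framework). The check runs in ordinary code before the tool body, so no prompt wording can talk its way past it:

```python
from functools import wraps

# Hypothetical registry of tool names that must not run without approval.
HIGH_RISK_TOOLS = {"drop_table", "send_customer_email"}


class ApprovalRequired(Exception):
    """Raised instead of executing when a gated tool has no approval."""


def requires_gate(tool_name):
    """Wrap a tool so the approval check runs outside the LLM."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, approved=False, **kwargs):
            # Deterministic rule: model output cannot change this branch.
            if tool_name in HIGH_RISK_TOOLS and not approved:
                raise ApprovalRequired(tool_name)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@requires_gate("drop_table")
def drop_table(name):
    return "dropped %s" % name


@requires_gate("list_tables")
def list_tables():
    return ["tmp_backup_2025_04_01"]
```

In a real deployment the approved flag would be set by the approval channel, never by the agent itself; it is a plain keyword argument here only to keep the sketch short.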

For the risk taxonomy that underpins tool-level gating, see Agent Command Execution Safety — which defines risk categories from safe queries to dangerous writes. This article builds the human decision layer on top of that foundation.

2. The Decision Framework — When Should Agents Pause?

Every approval architecture begins with one question: How dangerous is this operation? But “dangerous” isn't something you can sense intuitively at scale. It needs to be quantified and structured. Below is a four-level risk-gating model that turns fuzzy judgment into deterministic classification.

Four-Level Risk Gating

| Level | Definition | Approval Strategy | Typical Operations |
|---|---|---|---|
| AUTO | Read-only queries, harmless information retrieval | Fully automatic, no approval | ls, cat, query logs, check API status, read metrics |
| LOW_RISK | Non-destructive writes, internal API calls | Approve on first occurrence; subsequent similar operations can be memory-demoted to AUTO | Create file, open PR, write to internal wiki, send internal Slack message |
| HIGH_RISK | External API calls, configuration changes, data modifications | Mandatory approval; supports batch approval (similar operations bundled) | Modify nginx config, UPDATE production DB, call third-party API, send customer email |
| CRITICAL | Deletion, financial transactions, credential changes, user data access | Mandatory approval + two-person rule (dual authorization) | DROP TABLE, process payment, modify IAM policy, rotate API keys, access PII |
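
The LOW_RISK row's "approve once, then demote" behavior could be tracked with a small memory of approved operation signatures. This is a sketch under one stated assumption: "similar" is taken to mean the same tool in the same environment. The class and method names are hypothetical.

```python
class FirstUseMemory:
    """Remembers which LOW_RISK operation signatures a human has approved."""

    def __init__(self):
        self._approved = set()

    @staticmethod
    def signature(tool, environment):
        # Assumption: two operations are "similar" when the tool and the
        # environment match. A stricter definition may be needed in practice.
        return (tool, environment)

    def needs_approval(self, tool, environment):
        return self.signature(tool, environment) not in self._approved

    def record_approval(self, tool, environment):
        self._approved.add(self.signature(tool, environment))


memory = FirstUseMemory()
first = memory.needs_approval("create_file", "staging")   # first occurrence
memory.record_approval("create_file", "staging")
second = memory.needs_approval("create_file", "staging")  # demoted to AUTO
```

Only LOW_RISK operations should pass through this memory; HIGH_RISK and CRITICAL operations keep their mandatory approval regardless of history.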

This isn't an arbitrary taxonomy. The underlying logic is blast radius × irreversibility:

Risk Scoring Matrix

Manual risk labeling doesn't scale. A more engineered approach is a risk scoring matrix that derives the level from three dimensions: action type, target environment, and blast radius.

┌────────────────┬──────────┬──────────┬──────────┐
│                │ DEV      │ STAGING  │ PROD     │
├────────────────┼──────────┼──────────┼──────────┤
│ READ           │ AUTO     │ AUTO     │ AUTO     │
│ WRITE (new)    │ AUTO     │ LOW      │ LOW      │
│ WRITE (modify) │ LOW      │ LOW      │ HIGH     │
│ DELETE         │ LOW      │ HIGH     │ CRITICAL │
│ EXTERNAL API   │ LOW      │ HIGH     │ HIGH     │
│ FINANCIAL      │ HIGH     │ CRITICAL │ CRITICAL │
│ CREDENTIALS    │ HIGH     │ CRITICAL │ CRITICAL │
└────────────────┴──────────┴──────────┴──────────┘

Three dimensions drive this matrix:

  1. Action type: READ / WRITE_NEW / WRITE_MODIFY / DELETE / EXTERNAL_API / FINANCIAL / CREDENTIALS
  2. Target environment: DEV / STAGING / PROD
  3. Blast radius: SINGLE_RESOURCE / SERVICE / ACCOUNT

Final risk level = max(matrix_lookup_value, blast_radius_modifier). Because the rule is a max(), blast radius can raise a level but never lower it. Example: a DELETE in PROD scores CRITICAL from the matrix and stays CRITICAL even when the blast radius is SINGLE_RESOURCE (deleting one log file). A WRITE (modify) in PROD scores HIGH from the matrix; if the blast radius is ACCOUNT (changing every resource under an AWS account), it is raised to CRITICAL and requires two-person approval.
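
The max() formula can be sketched with an ordered enum. The per-radius floors below (single, service, account) are illustrative assumptions, not values taken from the matrix:

```python
from enum import IntEnum


class Risk(IntEnum):
    # Ordered so that max() always picks the stricter level.
    AUTO = 0
    LOW_RISK = 1
    HIGH_RISK = 2
    CRITICAL = 3


# Hypothetical floors: the minimum level each blast radius allows.
BLAST_RADIUS_FLOOR = {
    "single": Risk.AUTO,
    "service": Risk.HIGH_RISK,
    "account": Risk.CRITICAL,
}


def final_risk(matrix_value, blast_radius):
    """Final level = max(matrix lookup, blast-radius floor)."""
    return max(matrix_value, BLAST_RADIUS_FLOOR[blast_radius])


print(final_risk(Risk.HIGH_RISK, "account").name)  # CRITICAL
print(final_risk(Risk.CRITICAL, "single").name)    # CRITICAL
print(final_risk(Risk.LOW_RISK, "service").name)   # HIGH_RISK
```

A max() can only raise a level. Any downgrade for a small blast radius would have to be a separate, explicit rule.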

For compliance-oriented teams: this matrix structure maps naturally to SOC 2 CC6.1 (logical access controls) and HIPAA §164.312(a)(1) (access control). Each cell in the matrix becomes a documented control — the auditor sees why a specific action×environment combination produces a specific risk level. The matrix is the control rationale.

Decision Tree

If a full matrix implementation is overkill for your current stage, this decision tree covers ~90% of production scenarios:


Operation type?
  ├── Read-only? (list, get, read, describe, cat, grep)
  │     └── YES → AUTO ✓ (no approval, execute immediately)
  │
  ├── Write new resource? (create, put, new, write)
  │     ├── Target = production?
  │     │     ├── YES → LOW_RISK ○ (approve first time, auto thereafter)
  │     │     └── NO  → AUTO ✓
  │     └── Involves external API?
  │           └── YES → HIGH_RISK ▲ (mandatory approval)
  │
  ├── Modify existing resource? (update, modify, patch, change)
  │     └── Target = production?
  │           ├── YES → HIGH_RISK ▲ (mandatory approval)
  │           └── NO  → LOW_RISK ○
  │
  ├── Delete? (delete, drop, remove, rm)
  │     └── Target = production?
  │           ├── YES → CRITICAL ■ (two-person rule)
  │           └── NO  → HIGH_RISK ▲ (mandatory approval)
  │
  ├── Financial? (charge, transfer, payment, bill)
  │     └── CRITICAL ■ (two-person rule, regardless of environment)
  │
  └── Credential operation? (rotate, generate_key, change_password)
        └── CRITICAL ■ (two-person rule, regardless of environment)
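
The tree can be walked top to bottom as a single function. This is a sketch: the classify name is hypothetical, the verb sets are copied from the tree, and two choices are assumptions of this example. For new writes the external-API check is applied before the production check, and unknown verbs fail closed to HIGH_RISK.

```python
READ = {"list", "get", "read", "describe", "cat", "grep"}
WRITE_NEW = {"create", "put", "new", "write"}
MODIFY = {"update", "modify", "patch", "change"}
DELETE = {"delete", "drop", "remove", "rm"}
FINANCIAL = {"charge", "transfer", "payment", "bill"}
CREDENTIAL = {"rotate", "generate_key", "change_password"}


def classify(verb, production, external_api=False):
    """Walk the decision tree and return a risk level name."""
    verb = verb.lower()
    if verb in READ:
        return "AUTO"
    if verb in WRITE_NEW:
        if external_api:
            return "HIGH_RISK"
        return "LOW_RISK" if production else "AUTO"
    if verb in MODIFY:
        return "HIGH_RISK" if production else "LOW_RISK"
    if verb in DELETE:
        return "CRITICAL" if production else "HIGH_RISK"
    if verb in FINANCIAL or verb in CREDENTIAL:
        return "CRITICAL"  # regardless of environment
    # Assumption of this sketch: anything unrecognized needs a human.
    return "HIGH_RISK"


print(classify("drop", production=True))     # CRITICAL
print(classify("create", production=False))  # AUTO
```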

Approval Gate — Code Implementation

Here's a Python implementation that turns the risk matrix and blast-radius modifier into a callable gate function. Note that this class is framework-agnostic: it takes a structured ActionContext and returns a RiskLevel. How you integrate it (middleware, decorator, pre-execution hook) depends on your architecture.

from enum import Enum
from dataclasses import dataclass
from typing import Optional

class RiskLevel(Enum):
    AUTO = "auto"           # fully automatic
    LOW_RISK = "low_risk"   # approve first occurrence
    HIGH_RISK = "high_risk" # mandatory approval
    CRITICAL = "critical"   # two-person rule

class ActionType(Enum):
    READ = "read"
    WRITE_NEW = "write_new"
    WRITE_MODIFY = "write_modify"
    DELETE = "delete"
    EXTERNAL_API = "external_api"
    FINANCIAL = "financial"
    CREDENTIALS = "credentials"

class Environment(Enum):
    DEV = "dev"
    STAGING = "staging"
    PROD = "prod"

@dataclass
class ActionContext:
    action_type: ActionType
    environment: Environment
    resource: str
    # blast radius: single / service / account
    blast_radius: str = "single"

class ApprovalGate:
    """Risk gating — maps action + environment + blast radius to risk level"""

    # Risk matrix: rows=action type, cols=environment
    RISK_MATRIX = {
        ActionType.READ: {
            Environment.DEV: RiskLevel.AUTO,
            Environment.STAGING: RiskLevel.AUTO,
            Environment.PROD: RiskLevel.AUTO,
        },
        ActionType.WRITE_NEW: {
            Environment.DEV: RiskLevel.AUTO,
            Environment.STAGING: RiskLevel.LOW_RISK,
            Environment.PROD: RiskLevel.LOW_RISK,
        },
        ActionType.WRITE_MODIFY: {
            Environment.DEV: RiskLevel.LOW_RISK,
            Environment.STAGING: RiskLevel.LOW_RISK,
            Environment.PROD: RiskLevel.HIGH_RISK,
        },
        ActionType.DELETE: {
            Environment.DEV: RiskLevel.LOW_RISK,
            Environment.STAGING: RiskLevel.HIGH_RISK,
            Environment.PROD: RiskLevel.CRITICAL,
        },
        ActionType.EXTERNAL_API: {
            Environment.DEV: RiskLevel.LOW_RISK,
            Environment.STAGING: RiskLevel.HIGH_RISK,
            Environment.PROD: RiskLevel.HIGH_RISK,
        },
        ActionType.FINANCIAL: {
            Environment.DEV: RiskLevel.HIGH_RISK,
            Environment.STAGING: RiskLevel.CRITICAL,
            Environment.PROD: RiskLevel.CRITICAL,
        },
        ActionType.CREDENTIALS: {
            Environment.DEV: RiskLevel.HIGH_RISK,
            Environment.STAGING: RiskLevel.CRITICAL,
            Environment.PROD: RiskLevel.CRITICAL,
        },
    }

    @classmethod
    def evaluate_risk(cls, ctx: ActionContext) -> RiskLevel:
        """Evaluate risk level for an action context"""
        base_risk = cls.RISK_MATRIX[ctx.action_type][ctx.environment]

        # Blast radius modifier
        if ctx.blast_radius == "account" and base_risk != RiskLevel.CRITICAL:
            return RiskLevel.CRITICAL
        elif ctx.blast_radius == "service" and base_risk == RiskLevel.HIGH_RISK:
            return RiskLevel.CRITICAL

        return base_risk

    @classmethod
    def requires_approval(cls, ctx: ActionContext) -> bool:
        """Does this action require human approval?"""
        return cls.evaluate_risk(ctx) != RiskLevel.AUTO


# --- Usage ---
ctx = ActionContext(
    action_type=ActionType.DELETE,
    environment=Environment.PROD,
    resource="tmp_backup_2025_04_01",
    blast_radius="single"
)

risk = ApprovalGate.evaluate_risk(ctx)
print(f"Risk level: {risk.value}")              # critical
print(f"Requires approval: {ApprovalGate.requires_approval(ctx)}")  # True

Note the relationship with Agent Tool Permission Control: tool permissions govern whether the agent can invoke a tool at all (pre-invocation check). The approval gate here governs whether the invocation's result should be executed (post-invocation, pre-execution). They're complementary layers — tool permissions fire at tool registration time; the approval gate fires on the execution path. When tool permission control marks an operation as ASK_APPROVAL, this article's approval state machine takes over the rest of the flow.

3. Approval State Machine Design

Once you've identified when to request approval, the next question is how — the lifecycle management of an approval request. This requires a formal state machine, not a pile of if-else branches.

State Definitions and Transition Diagram

The approval state machine has ten states and well-defined transition paths:


                       ┌───────────┐
                       │   IDLE    │  ← Agent's normal working state
                       └─────┬─────┘
                             │ High-risk operation triggered
                             ▼
                       ┌───────────┐
                       │ PROPOSING │  ← Agent builds approval request
                       └─────┬─────┘
                             │ Proposal ready
                             ▼
      ┌─────────────────────────────────────────────┐
      │              WAITING_APPROVAL               │  ← Waiting for human decision
      └─┬─────────────┬─────────────┬─────────────┬─┘
        ▼             ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
  │ APPROVED  │ │ REJECTED  │ │ MODIFIED  │ │  EXPIRED  │
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
        │             │             │             │
        ▼             ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
  │ RESUMING  │ │ ROLLBACK  │ │RE-PROPOSE │ │ ESCALATE  │
  └─────┬─────┘ └───────────┘ └───────────┘ └───────────┘
        │
        ▼
  ┌───────────┐
  │ EXECUTING │  ← Agent resumes execution
  └─────┬─────┘
        │ Execution complete
        ▼
  ┌───────────┐
  │ COMPLETED │  ← Returns to IDLE
  └───────────┘

One design principle governs this machine: every intermediate state has a clear exit path. No state can permanently trap the agent. Even WAITING_APPROVAL has EXPIRED as a guaranteed escape hatch — timeout isn't a system failure; it's a designed control path.

Compliance angle: The state machine is the audit evidence. Every state transition is a logged event with timestamp and principal — who proposed, who approved, who modified, when the timeout fired. For SOC 2, this produces the control evidence that answers “who did what and when.” For HIPAA, it provides the access decision trail for any operation touching ePHI. Don't bolt audit logging on afterward — bake it into the state transition itself.
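
One way to bake the audit record into the transition itself, sketched with a hypothetical AuditedState class. The sink is assumed to be anything with an append() method (a list here, a log writer in practice):

```python
import json
import time


class AuditedState:
    """Holds one request's state and logs every transition it makes."""

    def __init__(self, request_id, sink):
        self.request_id = request_id
        self.state = "IDLE"
        self._sink = sink  # any object with .append()

    def transition(self, new_state, principal, now=None):
        event = {
            "request_id": self.request_id,
            "from": self.state,
            "to": new_state,
            "principal": principal,  # who caused the transition
            "timestamp": time.time() if now is None else now,
        }
        # Write the record first, then change state, so no transition
        # can exist without its audit entry.
        self._sink.append(json.dumps(event, sort_keys=True))
        self.state = new_state
        return event


audit_log = []
request = AuditedState("req-1", audit_log)
request.transition("PROPOSING", principal="agent-01", now=1.0)
request.transition("WAITING_APPROVAL", principal="agent-01", now=2.0)
```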

State Transition Rules

| Current State | Trigger Event | Target State | Guard Condition |
|---|---|---|---|
| IDLE | tool_call_risk ≥ HIGH_RISK | PROPOSING | risk_level != AUTO |
| PROPOSING | request_ready | WAITING_APPROVAL | Request payload is valid |
| WAITING_APPROVAL | human_approve | APPROVED | Approver is in authorized list |
| WAITING_APPROVAL | human_reject | REJECTED | Approver has reject authority |
| WAITING_APPROVAL | human_modify | MODIFIED | Request allows modification |
| WAITING_APPROVAL | timeout | EXPIRED | Past deadline |
| APPROVED | resume_agent | RESUMING | Checkpoint state is valid |
| MODIFIED | re_propose | PROPOSING | Re-propose with modified params |
| EXPIRED | escalate | WAITING_APPROVAL | Escalation chain has next approver |
| EXPIRED | fallback_deny | REJECTED | Escalation chain exhausted |
| RESUMING | check_idempotent | EXECUTING or IDLE | Check idempotency key |
| EXECUTING | execution_done | COMPLETED | Operation completed (success or failure) |
| COMPLETED | cleanup | IDLE | Release resources |
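
The rules above can be held as data rather than branches. A minimal sketch: the event names come from the table, the guard is reduced to a boolean the caller computes, and the RESUMING row keeps only the EXECUTING target (the duplicate-detected path back to IDLE is left out for brevity).

```python
TRANSITIONS = {
    ("IDLE", "tool_call_risk"): "PROPOSING",
    ("PROPOSING", "request_ready"): "WAITING_APPROVAL",
    ("WAITING_APPROVAL", "human_approve"): "APPROVED",
    ("WAITING_APPROVAL", "human_reject"): "REJECTED",
    ("WAITING_APPROVAL", "human_modify"): "MODIFIED",
    ("WAITING_APPROVAL", "timeout"): "EXPIRED",
    ("APPROVED", "resume_agent"): "RESUMING",
    ("MODIFIED", "re_propose"): "PROPOSING",
    ("EXPIRED", "escalate"): "WAITING_APPROVAL",
    ("EXPIRED", "fallback_deny"): "REJECTED",
    ("RESUMING", "check_idempotent"): "EXECUTING",
    ("EXECUTING", "execution_done"): "COMPLETED",
    ("COMPLETED", "cleanup"): "IDLE",
}


class InvalidTransition(Exception):
    """The event is not allowed from the current state, or its guard failed."""


def next_state(state, event, guard=True):
    """Return the target state for (state, event) or raise InvalidTransition."""
    target = TRANSITIONS.get((state, event))
    if target is None or not guard:
        raise InvalidTransition("%s cannot handle %s" % (state, event))
    return target


print(next_state("WAITING_APPROVAL", "human_approve"))  # APPROVED
```

Any event missing from the dictionary is rejected by construction, which is what makes a late or out-of-order approval an invalid event rather than a special case.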

Edge Case: Late Approval After Timeout

Here's a scenario that's easy to overlook but happens regularly in production: the approval request times out (EXPIRED), the system triggers escalation or auto-denial, and then — seconds later — the original approver clicks “Approve.”

The correct behavior: reject the late approval. The state machine has already left WAITING_APPROVAL. A late approval is an invalid event — the system should log it and notify the approver: “This operation timed out before your response. Your approval was not applied.”

The same principle applies to any state mismatch: if the world has changed since the approval request was issued (the user canceled the operation, the resource was deleted by another process, the agent's task was aborted), the approval should be invalid. This is why you need a state machine rather than a simple callback — linear state progression ensures approval events are only valid within the correct time window and world state.
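
A sketch of how a late or stale approval event might be handled. PendingRequest and apply_approval are hypothetical names; the timeout message is the one quoted above.

```python
from dataclasses import dataclass

TIMEOUT_MESSAGE = ("This operation timed out before your response. "
                   "Your approval was not applied.")


@dataclass
class PendingRequest:
    request_id: str
    state: str       # "WAITING_APPROVAL", "EXPIRED", "REJECTED", ...
    deadline: float  # Unix timestamp


def apply_approval(req, approver, now):
    """Apply an approval only while the request is waiting and in time.

    Returns (applied, message_for_approver). A stale request is never changed.
    """
    timed_out = req.state == "EXPIRED" or (
        req.state == "WAITING_APPROVAL" and now > req.deadline)
    if timed_out:
        return False, TIMEOUT_MESSAGE
    if req.state != "WAITING_APPROVAL":
        # Cancelled, already decided, or otherwise moved on.
        return False, "Request %s is no longer awaiting approval." % req.request_id
    req.state = "APPROVED"
    return True, "Approved by %s" % approver


late = PendingRequest("req-7", state="EXPIRED", deadline=100.0)
print(apply_approval(late, "dba", now=105.0))
```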

Checkpoint Persistence: The Pause-and-Resume Foundation

Before entering PROPOSING, the agent must save its current state — a checkpoint. Without a checkpoint, the agent can't accurately resume after approval.

A checkpoint must capture at minimum:

from dataclasses import dataclass

@dataclass
class AgentCheckpoint:
    """Agent pause point — resumes from here after approval"""
    checkpoint_id: str
    agent_id: str
    pending_action: dict          # action awaiting approval
    conversation_snapshot: list   # conversation history snapshot
    tool_call_stack: list         # tool call stack
    task_state: dict              # task progress
    created_at: float             # Unix timestamp
    ttl_seconds: int = 600        # checkpoint validity window

On resume, the agent loads state from the checkpoint and does not ask the LLM to re-reason — it jumps directly to the execution step. This avoids the problem of “the LLM was approved, then it got confused and re-considered whether it should do the thing.” The resume path is deterministic.

Production note: the checkpoint must be persisted to durable storage (database, persistent volume), not in-memory. If your agent process restarts while waiting for approval, the checkpoint must survive.
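
As one possible shape for durable storage, here is a sketch using SQLite from the standard library. The table layout and the function names are assumptions of this example; the TTL check mirrors the ttl_seconds field on the checkpoint.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    # Pass a file path in real use; ":memory:" keeps the sketch self-contained.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "checkpoint_id TEXT PRIMARY KEY, payload TEXT NOT NULL, "
        "created_at REAL NOT NULL, ttl_seconds INTEGER NOT NULL)"
    )
    return conn


def save_checkpoint(conn, checkpoint_id, payload, ttl_seconds=600, now=None):
    created = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (checkpoint_id, json.dumps(payload), created, ttl_seconds),
        )


def load_checkpoint(conn, checkpoint_id, now=None):
    """Return the saved payload, or None if it is missing or past its TTL."""
    row = conn.execute(
        "SELECT payload, created_at, ttl_seconds FROM checkpoints "
        "WHERE checkpoint_id = ?", (checkpoint_id,)
    ).fetchone()
    if row is None:
        return None
    payload, created_at, ttl_seconds = row
    current = time.time() if now is None else now
    if current > created_at + ttl_seconds:
        return None  # expired: the agent must not resume from stale state
    return json.loads(payload)


store = open_store()
save_checkpoint(store, "cp-1", {"pending_action": {"tool": "DROP TABLE"}},
                ttl_seconds=600, now=1000.0)
```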

Idempotency: One Approval, One Execution

Idempotency is table stakes for any distributed system, and approval workflows are no exception. Scenario: action A is approved and executed. Due to network retry or a duplicate event on the approval channel, the system receives a second “approved” event for the same action.

Idempotency key design:

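import hashlib

def generate_action_id(agent_id, action, params):
    # Sort params so logically identical calls hash to the same key
    canonical = "%s:%s:%s" % (agent_id, action, sorted(params.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# In-memory stand-in for the persistent execution log described below
execution_log = set()

def execute_once(action_id, run):
    """Run the action unless this action_id was already executed."""
    if action_id in execution_log:
        print("Action %s already executed, skipping" % action_id)
        return False  # idempotency protection
    run()
    execution_log.add(action_id)
    return True

# Usage
action_id = generate_action_id(
    "agent-01", "DROP TABLE",
    {"table": "tmp_backup_2025_04_01"}
)
execute_once(action_id, lambda: print("executing DROP TABLE"))
execute_once(action_id, lambda: print("executing DROP TABLE"))  # skipped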

The execution log is a persistent set (Redis keys written with SET ... NX for atomicity, or a database table with a unique key). Check before execution, record after. These two operations should happen in the same transaction or use an atomic primitive. For production: Redis SET action_id "executing" NX EX 3600 sets the key only if it doesn't exist, with a 1-hour TTL as a safety net against leaked locks. (The older SETNX command takes no options, so NX and EX belong on SET.)
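
For the database-table option, the check and the record can collapse into one atomic insert guarded by a primary key. A sketch using SQLite; the table and function names are assumptions of this example.

```python
import sqlite3


def open_log(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS executed (action_id TEXT PRIMARY KEY)")
    return conn


def claim(conn, action_id):
    """Atomically claim an action_id. Only the first caller gets True."""
    try:
        with conn:  # one transaction: the insert either lands or it doesn't
            conn.execute(
                "INSERT INTO executed (action_id) VALUES (?)", (action_id,))
        return True
    except sqlite3.IntegrityError:
        # Primary-key conflict: someone already claimed this action.
        return False


log = open_log()
print(claim(log, "a1b2c3"))  # True
print(claim(log, "a1b2c3"))  # False
```

Claiming before execution means a crash mid-action leaves the key set, so a retry is skipped rather than repeated; whether that is the right failure mode depends on the action.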

Complete State Machine Implementation

from enum import Enum
from datetime import datetime, timedelta
from typing import Optional
import uuid
import hashlib

class ApprovalState(Enum):
    IDLE = "idle"
    PROPOSING = "proposing"
    WAITING_APPROVAL = "waiting_approval"
    APPROVED = "approved"
    REJECTED = "rejected"
    MODIFIED = "modified"
    EXPIRED = "expired"
    RESUMING = "resuming"
    EXECUTING = "executing"
    COMPLETED = "completed"


def generate_action_id(agent_id: str, action: str, params: dict) -> str:
    canonical = f"{agent_id}:{action}:{sorted(params.items())}"
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


class ExecutionLog:
    def __init__(self):
        self._executed: set[str] = set()
    def contains(self, action_id: str) -> bool:
        return action_id in self._executed
    def record(self, action_id: str):
        self._executed.add(action_id)

execution_log = ExecutionLog()


class ApprovalStateMachine:

    def __init__(self, request_id: str = None):
        self.request_id = request_id or str(uuid.uuid4())[:8]
        self.state = ApprovalState.IDLE
        self.checkpoint: Optional[dict] = None
        self.action_id: Optional[str] = None
        self.deadline: Optional[datetime] = None
        self.modified_params: Optional[dict] = None
        self.escalation_chain: list[str] = []
        self.current_escalation: int = 0

    def propose(self, action: dict, deadline_seconds: int = 300,
                escalation: list[str] = None):
        assert self.state == ApprovalState.IDLE
        self.checkpoint = {
            "action": action,
            "timestamp": datetime.now().isoformat(),
        }
        self.deadline = datetime.now() + timedelta(seconds=deadline_seconds)
        self.escalation_chain = escalation or []
        self.current_escalation = 0
        self.action_id = generate_action_id(
            "agent-01", action["tool"], action["params"])
        self.state = ApprovalState.PROPOSING
        self.state = ApprovalState.WAITING_APPROVAL
        return self

    def approve(self, approver: str) -> bool:
        if self.state != ApprovalState.WAITING_APPROVAL:
            return False
        if self._is_expired():
            return self._handle_expired()
        self.state = ApprovalState.APPROVED
        return True

    def reject(self, approver: str, reason: str = "") -> bool:
        if self.state != ApprovalState.WAITING_APPROVAL:
            return False
        self.state = ApprovalState.REJECTED
        return True

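    def modify(self, approver: str, new_params: dict) -> bool:
        if self.state != ApprovalState.WAITING_APPROVAL:
            return False
        if self._is_expired():
            return self._handle_expired()
        self.modified_params = new_params
        # Re-propose with the modified params: update the checkpointed action
        # and its idempotency key so resume() executes what was reviewed.
        action = {**self.checkpoint["action"], "params": new_params}
        self.checkpoint["action"] = action
        self.action_id = generate_action_id(
            "agent-01", action["tool"], new_params)
        # MODIFIED -> PROPOSING -> WAITING_APPROVAL with a fresh deadline
        self.state = ApprovalState.MODIFIED
        self.state = ApprovalState.PROPOSING
        self.state = ApprovalState.WAITING_APPROVAL
        self.deadline = datetime.now() + timedelta(seconds=300)
        return True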

    def handle_timeout(self) -> bool:
        if self.state != ApprovalState.WAITING_APPROVAL:
            return False
        if not self._is_expired():
            return False
        self.state = ApprovalState.EXPIRED
        if self.current_escalation < len(self.escalation_chain):
            next_approver = self.escalation_chain[self.current_escalation]
            self.current_escalation += 1
            self.deadline = datetime.now() + timedelta(seconds=300)
            self.state = ApprovalState.WAITING_APPROVAL
            return True
        else:
            self.state = ApprovalState.REJECTED
            return False

    def resume(self) -> Optional[dict]:
        if self.state != ApprovalState.APPROVED:
            return None
        if execution_log.contains(self.action_id):
            self.state = ApprovalState.IDLE
            return None
        self.state = ApprovalState.RESUMING
        self.state = ApprovalState.EXECUTING
        return self.checkpoint["action"]

    def complete(self):
        assert self.state == ApprovalState.EXECUTING
        execution_log.record(self.action_id)
        self.state = ApprovalState.COMPLETED
        self.state = ApprovalState.IDLE

    def _is_expired(self) -> bool:
        return datetime.now() > self.deadline

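    def _handle_expired(self) -> bool:
        # A decision arrived after the deadline. Run the timeout path while
        # still in WAITING_APPROVAL (handle_timeout escalates or denies),
        # and report the late decision as not applied.
        self.handle_timeout()
        return False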


sm = ApprovalStateMachine()
sm.propose(
    action={"tool": "DROP TABLE",
            "params": {"table": "tmp_backup_2025_04_01"}},
    deadline_seconds=300,
    escalation=["dba-oncall@example.com", "eng-manager@example.com"]  # placeholder addresses
)
sm.approve("dba-oncall@example.com")
action = sm.resume()
if action:
    print(f"Executing: {action}")
    sm.complete()

This implementation covers the full lifecycle. In production, the state machine's state should be persisted to a database or Redis rather than held in memory; this enables cross-process approval and survives process restarts. The handle_timeout() method should be driven by a scheduled job (cron, Celery beat, Temporal timer) rather than checked only at approve() time. Otherwise, a request whose approver never responds never reaches the timeout path.
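
A sketch of what one tick of such a scheduled job could do. Pending is a simplified stand-in for a persisted request, not the class above, and the 300-second extension is an assumption that matches the example deadlines.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Pending:
    request_id: str
    deadline: float
    escalation_chain: list = field(default_factory=list)
    next_index: int = 0
    state: str = "WAITING_APPROVAL"


def sweep(pending, now=None, extension=300):
    """One scheduler tick: escalate or deny every request past its deadline.

    Returns (request_id, approver) pairs that need a fresh notification.
    """
    now = time.time() if now is None else now
    to_notify = []
    for req in pending:
        if req.state != "WAITING_APPROVAL" or now <= req.deadline:
            continue  # not waiting, or still within its window
        if req.next_index < len(req.escalation_chain):
            to_notify.append(
                (req.request_id, req.escalation_chain[req.next_index]))
            req.next_index += 1
            req.deadline = now + extension
        else:
            req.state = "REJECTED"  # chain exhausted: deny by default
    return to_notify


a = Pending("r1", deadline=100.0, escalation_chain=["team-lead"])
b = Pending("r2", deadline=100.0)
c = Pending("r3", deadline=500.0)
first_tick = sweep([a, b, c], now=200.0)
```

Because the sweep reads deadlines rather than waiting on approver input, a request nobody answers still moves forward on the next tick.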

Design tradeoff — state machine vs. workflow engine: For teams already running Temporal, Cadence, or AWS Step Functions, you might ask: “Why build a custom state machine instead of using the workflow engine?” The answer depends on your approval granularity. Workflow engines excel at orchestrating multi-step business processes with human tasks (order approvals, document reviews). But agent approval happens at tool-execution frequency — potentially dozens of decisions per minute. A workflow engine's task-polling overhead (~100ms+) and per-decision cost model don't fit this granularity. Use the workflow engine for cross-agent orchestration; use a lightweight state machine like this for per-tool approval gating.

4. The Approval Request Schema — What the Human Sees

The first half of the approval workflow is about when to stop the agent. This section is about what happens after the stop. The approval request is not a dumb yes/no dialog — it's the human decision interface. Design it well, and the approver makes fast, accurate calls. Design it poorly, and you get blind approvals or decision paralysis.

Why Structured Schema Matters

Imagine you receive two approval notifications:

Notification A: “Delete tmp_backup_2025_04_01? [Approve / Reject]”

Notification B: “Agent-01 requests DROP TABLE tmp_backup_2025_04_01 on production database db-prod-1. This table was created yesterday by a DBA (data migration in progress, ~2 hours remaining). Risk level: CRITICAL. The agent inferred this table is 30+ days old based on its name pattern, matching cleanup policy — but actual creation time is yesterday. Recommendation: REJECT and update cleanup policy. If no response within 5 minutes, this escalates to oncall-dba. Modification allowed. [Approve / Reject / Modify Parameters]”

Notification A forces the approver to rely entirely on their own knowledge. If they don't know the table was created yesterday, they'll approve. Notification B gives the approver enough context to make the correct decision. The gap between the two is the power of structured schema.

ApprovalRequest JSON Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://xslyl.com/schemas/approval-request-v1.json",
  "title": "Agent Approval Request",
  "description": "Structured data submitted when an agent requests human approval",
  "type": "object",
  "required": [
    "action",
    "parameters",
    "risk_level",
    "proposed_by",
    "context_window",
    "deadline",
    "idempotency_key"
  ],
  "properties": {
    "action": {
      "type": "object",
      "description": "The operation the agent requests to execute",
      "required": ["tool_name", "description"],
      "properties": {
        "tool_name": {
          "type": "string",
          "description": "Tool name, e.g. DROP_TABLE, SEND_EMAIL, CREATE_PR"
        },
        "description": {
          "type": "string",
          "description": "Human-readable description"
        }
      }
    },
    "parameters": {
      "type": "object",
      "description": "Tool invocation parameters (key-value)"
    },
    "risk_level": {
      "type": "string",
      "enum": ["AUTO", "LOW_RISK", "HIGH_RISK", "CRITICAL"],
      "description": "Risk level"
    },
    "proposed_by": {
      "type": "object",
      "description": "Proposing agent information",
      "required": ["agent_id", "agent_name"],
      "properties": {
        "agent_id": {"type": "string"},
        "agent_name": {"type": "string"},
        "agent_version": {"type": "string"}
      }
    },
    "context_window": {
      "type": "object",
      "description": "Context to help the approver make an informed decision",
      "required": ["reasoning", "relevant_tool_outputs"],
      "properties": {
        "reasoning": {
          "type": "string",
          "description": "Agent's reasoning chain"
        },
        "relevant_tool_outputs": {
          "type": "array",
          "description": "Summaries of relevant tool outputs collected during the execution path",
          "items": {"type": "object"}
        },
        "risk_justification": {
          "type": "string",
          "description": "Why this specific risk level was assigned"
        },
        "conversation_snippet": {
          "type": "string",
          "description": "The user conversation fragment that triggered this action"
        }
      }
    },
    "deadline": {
      "type": "string",
      "format": "date-time",
      "description": "Approval deadline (ISO 8601)"
    },
    "idempotency_key": {
      "type": "string",
      "description": "Idempotency key — prevents duplicate execution"
    },
    "fallback_on_timeout": {
      "type": "string",
      "enum": ["auto_deny", "auto_approve", "escalate"],
      "default": "escalate",
      "description": "Default behavior on timeout"
    },
    "fallback_on_reject": {
      "type": "string",
      "enum": ["rollback", "notify", "log_only"],
      "default": "notify",
      "description": "Agent's follow-up action after rejection"
    },
    "escalation_chain": {
      "type": "array",
      "items": {"type": "string"},
      "description": "Escalation chain: ordered list of approver identifiers"
    },
    "modification_allowed": {
      "type": "boolean",
      "default": true,
      "description": "Can the approver modify parameters before approving?"
    }
  }
}
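
Before a request reaches a human, it can be checked against the schema's required fields. The sketch below is a hand-rolled check using only the standard library, not a full JSON Schema validator; a dedicated validator library would be the more complete route. The function name and error strings are assumptions of this example.

```python
REQUIRED_TOP = ["action", "parameters", "risk_level", "proposed_by",
                "context_window", "deadline", "idempotency_key"]
REQUIRED_NESTED = {
    "action": ["tool_name", "description"],
    "proposed_by": ["agent_id", "agent_name"],
    "context_window": ["reasoning", "relevant_tool_outputs"],
}
RISK_LEVELS = {"AUTO", "LOW_RISK", "HIGH_RISK", "CRITICAL"}


def validation_errors(request):
    """Return a list of problems; an empty list means the request passes."""
    errors = ["missing field: %s" % key
              for key in REQUIRED_TOP if key not in request]
    for parent, children in REQUIRED_NESTED.items():
        if parent not in request:
            continue  # already reported as missing
        value = request[parent]
        if not isinstance(value, dict):
            errors.append("%s must be an object" % parent)
            continue
        errors += ["missing field: %s.%s" % (parent, child)
                   for child in children if child not in value]
    level = request.get("risk_level")
    if level is not None and level not in RISK_LEVELS:
        errors.append("invalid risk_level: %r" % level)
    return errors


incomplete = {"action": {"tool_name": "DROP_TABLE"}, "risk_level": "SEVERE"}
problems = validation_errors(incomplete)
```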

Core Field Breakdown

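context_window: This is the most important field in the entire schema. Without it, the approver cannot make a meaningful decision. It contains four sub-fields:

  • reasoning: the agent's reasoning chain (required)
  • relevant_tool_outputs: summaries of relevant tool outputs collected during the execution path (required)
  • risk_justification: why this specific risk level was assigned
  • conversation_snippet: the user conversation fragment that triggered this action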

fallback fields: These define “what happens when nobody responds” and “what happens after rejection.” The schema does not list them as required, but neither should be left to chance: each carries a default so that the exception path never exhibits undefined behavior.

Safe defaults principle: Timeout fallback defaults to escalate, and an exhausted escalation chain ends in denial; auto_approve is never the default. Rejection fallback defaults to notify (log the event and instruct the agent to terminate the current task). Security-safe defaults beat convenience defaults: better to deny once too often than execute once too many. For regulated environments (SOC 2, HIPAA), auto_approve should require explicit justification and an audit log entry explaining why automatic approval on timeout was configured for this specific risk level.
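
One way to make the timeout behavior explicit in code. This is a sketch: resolve_timeout is a hypothetical helper, and refusing auto_approve without a recorded justification is this example's reading of the principle above.

```python
def resolve_timeout(fallback, chain_remaining, justification=""):
    """Return the action to take when an approval deadline passes."""
    if fallback == "escalate":
        # Escalate while approvers remain, otherwise deny.
        return "escalate" if chain_remaining > 0 else "auto_deny"
    if fallback == "auto_approve":
        # Assumption of this sketch: auto_approve is honored only when a
        # justification has been recorded for the audit log.
        return "auto_approve" if justification.strip() else "auto_deny"
    # "auto_deny" and any unrecognized value both fail closed.
    return "auto_deny"


print(resolve_timeout("escalate", chain_remaining=0))      # auto_deny
print(resolve_timeout("auto_approve", chain_remaining=0))  # auto_deny
```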

Python Implementation: ApprovalRequest Data Structure

from dataclasses import dataclass, field, asdict
from datetime import datetime, timedelta
from typing import Any, Optional
import json
import hashlib

def generate_action_id(agent_id, action, params):
    canonical = "%s:%s:%s" % (agent_id, action, sorted(params.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


@dataclass
class ApprovalRequest:
    """Approval request — structured data for human decision-making"""

    # Required fields
    action: dict
    parameters: dict
    risk_level: str
    proposed_by: dict
    context_window: dict
    idempotency_key: str

    # Optional fields with safe defaults
    deadline: str = field(default_factory=lambda:
        (datetime.now() + timedelta(seconds=300)).isoformat())
    fallback_on_timeout: str = "auto_deny"  # safe default: deny on timeout
    fallback_on_reject: str = "notify"
    escalation_chain: list = field(default_factory=list)
    modification_allowed: bool = True

    def to_json(self):
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)

    def to_notification(self, platform="slack"):
        # Truncate long reasoning and add an ellipsis only when text was cut
        reasoning = self.context_window.get("reasoning", "N/A")
        if len(reasoning) > 200:
            reasoning = reasoning[:200] + "..."
        lines = [
            "*Agent Approval Request*",
            "*Action:* %s" % self.action.get("description", "N/A"),
            # Show the actual level so LOW_RISK is never labelled as HIGH
            "*Risk Level:* [%s]" % self.risk_level,
            "*Proposed By:* %s" % self.proposed_by.get("agent_name", "N/A"),
            "*Deadline:* %s" % self.deadline,
            "*Reasoning:* %s" % reasoning,
        ]
        if self.escalation_chain:
            chain = " -> ".join(self.escalation_chain)
            lines.append("*Escalation Chain:* %s" % chain)
        return "\n".join(lines)  # real newline, not a literal backslash-n

    @classmethod
    def build(cls, action, parameters, risk_level, agent_info,
              reasoning, tool_outputs, risk_justification="",
              conversation_snippet="", **kwargs):
        context = dict(
            reasoning=reasoning,
            relevant_tool_outputs=tool_outputs,
            risk_justification=risk_justification or (
                "Evaluated by ApprovalGate: %s" % risk_level),
            conversation_snippet=conversation_snippet,
        )
        idemp_key = generate_action_id(
            agent_info.get("agent_id", "unknown"),
            action.get("tool_name", ""),
            parameters
        )
        return cls(
            action=action,
            parameters=parameters,
            risk_level=risk_level,
            proposed_by=agent_info,
            context_window=context,
            idempotency_key=idemp_key,
            **kwargs
        )


# Usage example
req = ApprovalRequest.build(
    action=dict(
        tool_name="DROP_TABLE",
        description="Drop tmp_backup_2025_04_01 in production"
    ),
    parameters=dict(database="db-prod-1", table="tmp_backup_2025_04_01"),
    risk_level="CRITICAL",
    agent_info=dict(
        agent_id="agent-01",
        agent_name="Ops Cleanup Agent",
        agent_version="v2.3.1"
    ),
    reasoning=(
        "1. User requested cleanup of temp tables older than 30 days. "
        "2. Scanned DB, found table matching tmp_ pattern. "
        "3. Inferred from name date that table is 30+ days old. "
        "NOTE: Did NOT check actual creation time via DESCRIBE TABLE."
    ),
    tool_outputs=[
        dict(tool="SCAN_TABLES",
             output=["tmp_backup_2025_04_01", "tmp_prod_migration"]),
        dict(tool="DESCRIBE_TABLE",
             output=dict(created_at="yesterday",
                         description="data migration in progress"))
    ],
    risk_justification=(
        "DELETE x PROD = CRITICAL. "
        "Mitigation: single table only, "
        "but table is actively in use -- maintain CRITICAL."
    ),
    escalation_chain=["[email protected]", "[email protected]"]
)

print(req.to_json())
print(req.to_notification())

Context Enrichment Strategies for Production

The example above demonstrates the core value of context enrichment: the agent's reasoning chain contains a visible logical error (inferred age from name pattern instead of querying actual creation time), and the DESCRIBE TABLE output explicitly shows the table was created yesterday and is still in use. The approval request surfaces all of this, enabling the approver to spot the problem at a glance.

Strategies that make context genuinely useful rather than noisy:

Anti-Pattern: Sending Too Little Context

The most dangerous approval request is the one that says “Delete this? [Y/n].” Psychologically, when information is scarce, humans default to approval — because the cognitive cost of rejection (figuring out why you should reject) is higher than the cost of approval (clicking “yes”). This is approval fatigue in its most destructive form.

The only defense against approval fatigue is to pre-digest complexity: the agent does the reasoning and analysis upfront, extracts the key signals, and presents them to the approver. The approver's job isn't “figure out what's happening” — it's “judge a pre-assembled case file.” The cognitive load difference between these two tasks is an order of magnitude.

UX principle — the 30-second rule: A well-structured approval request should enable the approver to reach a decision in 30 seconds or less. If it takes longer, the context isn't rich enough — or there's too much noise. Measure this. If your median approval time exceeds 60 seconds, your schema needs work. The approval request should tell a story: what happened, why the agent wants to do this, what happens if approved, what happens if rejected.
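
Measuring the 30-second rule needs only two timestamps per request. The sketch below assumes each approval event records when the approver opened the request and when they decided; the field layout and the function name median_decision_seconds are illustrative, not part of the schema.

```python
from datetime import datetime
from statistics import median


def median_decision_seconds(events):
    """Median seconds between opening a request and deciding on it.

    events: list of (opened_at, decided_at) ISO-8601 strings. Requests that
    never received a human decision (decided_at is None) are skipped.
    """
    durations = [
        (datetime.fromisoformat(decided) - datetime.fromisoformat(opened)).total_seconds()
        for opened, decided in events
        if decided is not None
    ]
    return median(durations) if durations else None


sample = [
    ("2025-06-01T10:00:00", "2025-06-01T10:00:20"),
    ("2025-06-01T11:00:00", "2025-06-01T11:01:30"),
    ("2025-06-01T12:00:00", "2025-06-01T12:00:40"),
    ("2025-06-01T13:00:00", None),  # expired without a human decision
]
m = median_decision_seconds(sample)
print(m, "schema needs work" if m > 60 else "within budget")
```

The median is used instead of the mean so that a single approver who left a request open over lunch does not mask an otherwise healthy distribution.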

The approval request is the bridge between agent autonomy and human judgment. Sections 5-8 will cover what happens after this bridge: timeout and escalation strategies, how different agent frameworks implement HITL (LangGraph, AutoGen, CrewAI comparison), multi-agent approval patterns where one agent signs off on another, and a production testing checklist with FAQs.

5. Timeout, Escalation & Fallback Strategies

Approvals aren't instant. Humans miss Slack notifications, go on vacation, change teams, or simply aren't at their desk when an agent needs a decision. A production-grade approval system must handle the case where nobody responds — and it must do so in a way that doesn't deadlock the agent or create security gaps.

Approval TTLs Per Risk Level

Not all approvals deserve the same thinking time. A CRITICAL operation (DROP TABLE on production) warrants a shorter timeout than a LOW_RISK one (create a file in staging). The timeout should match the operational urgency:

| Risk Level | Default TTL | Rationale | Timeout Action |
| --- | --- | --- | --- |
| LOW_RISK | 24 hours | Low blast radius; agent can proceed with most work while waiting. 24h covers overnight + next business morning. | Escalate → Deny |
| HIGH_RISK | 4 hours | Service-impacting potential. 4h covers a half-day window; if the approver is unavailable for 4+ hours, there's likely a larger operational issue. | Escalate → Escalate → Deny |
| CRITICAL | 30 minutes | Irreversible or compliance-impacting. If the agent can't get approval within 30 minutes, the operation likely shouldn't proceed. Short timeout forces human engagement. | Escalate → Escalate → Escalate → Deny |

These TTLs are starting defaults, not universal laws. A financial services team processing overnight batch payments might configure CRITICAL at 5 minutes. A data engineering team cleaning up stale resources might configure LOW_RISK at 48 hours. The key principle: shorter timeouts for higher blast radius. The timeout itself is a signal to the approver — if you see "Expires in 15 minutes" on a CRITICAL request, that urgency is by design.

Compliance note: Every TTL configuration change is a control modification. In SOC 2 audited environments, TTL values should be documented with a rationale (why 30 minutes, not 10?) and approved through a change control process. The rationale is the control evidence.
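
The TTL defaults above translate directly into a small configuration table. This is a minimal sketch: the names DEFAULT_TTL_SECONDS, compute_deadline, and ttl_ordering_ok, and the overrides parameter, are assumptions introduced for illustration.

```python
from datetime import datetime, timedelta, timezone

# Default TTLs from the table above, in seconds.
DEFAULT_TTL_SECONDS = {
    "LOW_RISK": 24 * 3600,
    "HIGH_RISK": 4 * 3600,
    "CRITICAL": 30 * 60,
}


def ttl_ordering_ok(ttls):
    """Check the key principle: shorter timeouts for higher blast radius."""
    return ttls["CRITICAL"] <= ttls["HIGH_RISK"] <= ttls["LOW_RISK"]


def compute_deadline(risk_level, now=None, overrides=None):
    """Return the approval deadline as a timezone-aware UTC datetime."""
    ttls = {**DEFAULT_TTL_SECONDS, **(overrides or {})}
    if risk_level not in ttls:
        raise ValueError("no TTL configured for %s" % risk_level)
    if not ttl_ordering_ok(ttls):
        raise ValueError("TTL overrides break the risk ordering")
    now = now or datetime.now(timezone.utc)
    return now + timedelta(seconds=ttls[risk_level])


# A team that wants CRITICAL at 5 minutes overrides a single value.
print(compute_deadline("CRITICAL", overrides={"CRITICAL": 300}).isoformat())
```

Keeping overrides as data rather than code makes each TTL change a reviewable diff, which fits the change-control requirement in the compliance note.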

Escalation Chain Pattern

The escalation chain is an ordered list of approvers. When the primary approver doesn't respond within the TTL, the system moves to the next approver in the chain. Each transition is an audit event. The chain exhausts at a terminal state — auto-deny, with a full audit log explaining why no human made the decision.


Primary Approver (TTL: 30min)
  │
  ├─ Timeout ──► Team Lead (TTL: 15min)
  │                │
  │                ├─ Timeout ──► On-Call Engineer (TTL: 15min)
  │                │                │
  │                │                ├─ Timeout ──► Auto-Deny + Audit + Notify
  │                │                │
  │                │                └─ Responds ──► Decision recorded
  │                │
  │                └─ Responds ──► Decision recorded
  │
  └─ Responds ──► Decision recorded

Key design decisions in this pattern:

  1. Shrinking TTLs: No escalation step gets a longer timeout than the one before it. The primary approver gets 30 minutes; the team lead and the on-call engineer get 15 each. By the time you reach the last person in the chain, the system is in "this needs a decision now" mode, and the shorter window reflects that.
  2. Terminal action is always deny: If nobody responds, the system denies the operation. Auto-approve on timeout is a security anti-pattern — it converts "nobody is paying attention" into "execute anyway." For CRITICAL operations, auto-approve must never be the default.
  3. Every step is audited: The escalation isn't just a routing mechanism; it's an audit trail of organizational responsiveness. If an incident occurs because an approval was auto-denied after exhausting the chain, the audit log shows exactly who was notified, when, and that nobody responded.

Dead Letter Queue: Expired Approvals as Audit Artifacts

When an approval request reaches its terminal state (auto-denied after escalation chain exhaustion), the request doesn't disappear — it moves to a dead letter queue. This queue serves three purposes:

Circuit Breaker Patterns: Fail-Open vs. Fail-Closed

When the approval system itself fails — the notification service is down, the state machine persistence layer is unreachable, the escalation chain can't be resolved — the system faces a circuit breaker decision:

| Strategy | Behavior | Risk | When to Use |
| --- | --- | --- | --- |
| Fail-Open | When the approval system is unavailable, allow all operations to proceed without approval. Heavily audit every bypass. | Agent executes dangerous operations unchecked. Acceptable only if the system has other guardrails (tool-level ACLs, resource quotas, read-only by default). | Non-critical environments (staging, dev). Business-continuity scenarios where "the system must keep running" outweighs "every action needs a human." |
| Fail-Closed | When the approval system is unavailable, block all operations that require approval. The agent stalls, but nothing dangerous executes. | Agent is non-functional for high-risk operations. May block critical business workflows. Acceptable for environments where safety trumps availability. | Production. Regulated environments (SOC 2, HIPAA, PCI DSS). Financial transaction systems. |

The circuit breaker should be configured per environment, not per application. Production defaults to fail-closed. Development and staging environments may use fail-open to avoid blocking developer workflows, but every fail-open bypass must generate an immutable audit log entry and fire an alert to the on-call channel.

Implementation note: A common mistake is coupling the circuit breaker directly to the approval state machine's health check. Instead, use a separate circuit breaker service that monitors the approval system's health independently. If the approval persistence layer (PostgreSQL/Redis) is unreachable for > 60 seconds, the circuit breaker trips and all approval gates enter their configured failure mode. This separation prevents a cascading failure where the approval system's own health check is part of the failure domain.
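
A minimal sketch of such a standalone breaker follows. It assumes an independent probe reports backend health through record_health; the class name, the FAILURE_MODE mapping, and the in-memory bypass_log (a stand-in for an immutable audit sink) are all hypothetical.

```python
import time

# Failure mode is configured per environment, as recommended above.
FAILURE_MODE = {"production": "fail_closed", "staging": "fail_open", "dev": "fail_open"}


class ApprovalCircuitBreaker:
    """Trips when the approval backend has been unreachable for too long."""

    def __init__(self, environment, trip_after_seconds=60, clock=time.monotonic):
        # Unknown environments fall back to the safe mode.
        self.mode = FAILURE_MODE.get(environment, "fail_closed")
        self.trip_after_seconds = trip_after_seconds
        self.clock = clock
        self.unhealthy_since = None
        self.bypass_log = []

    def record_health(self, healthy):
        """Called by an external health probe, not by the approval system itself."""
        if healthy:
            self.unhealthy_since = None
        elif self.unhealthy_since is None:
            self.unhealthy_since = self.clock()

    def is_tripped(self):
        return (self.unhealthy_since is not None
                and self.clock() - self.unhealthy_since > self.trip_after_seconds)

    def decide(self, request_id):
        """Return 'gate', 'block', or 'bypass' for an operation that needs approval."""
        if not self.is_tripped():
            return "gate"   # approval system usable: go through normal approval
        if self.mode == "fail_closed":
            return "block"  # stall the operation, nothing dangerous executes
        # fail_open: let it through, but record every bypass for audit and alerting
        self.bypass_log.append({"request_id": request_id, "at": self.clock()})
        return "bypass"
```

The clock is injectable so the trip threshold can be tested without waiting 60 real seconds.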

Approver Offboarding: Reassigning Pending Approvals

When an approver leaves the team or the organization, their pending approvals can't simply evaporate. The system must:

  1. Detect the offboarding event: Integrate with your identity provider (Okta, Azure AD) or HR system to receive user deactivation events.
  2. Reassign pending approvals: All WAITING_APPROVAL requests assigned to the departing approver are reassigned to the next person in the escalation chain. This is a state machine transition: WAITING_APPROVAL → WAITING_APPROVAL (same state, new approver).
  3. Reset TTL: The new approver gets a fresh TTL — they shouldn't be penalized with a shortened window because the original approver left.
  4. Log the reassignment: Audit entry: {"event": "approver_reassigned", "from": "[email protected]", "to": "[email protected]", "reason": "user_deprovisioned", "pending_requests": 3}.

This requires team membership tracking. The escalation chain shouldn't be a static list of email addresses hardcoded in configuration — it should resolve dynamically against the current team roster. If the escalation chain is ["team-lead-dba", "oncall-dba"], the system resolves team-lead-dba to the current person holding that role at approval-request time, not at configuration time.
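
The four offboarding steps can be sketched as one function. The PendingApproval record, the roster mapping (role name to current holder), and the function name are assumptions for illustration; the audit entry follows the shape shown in step 4.

```python
from dataclasses import dataclass


@dataclass
class PendingApproval:
    request_id: str
    chain: list          # role names, e.g. ["team-lead-dba", "oncall-dba"]
    step: int            # index into chain of the role currently being asked
    approver: str        # person resolved for that role
    ttl_remaining: int   # seconds left for the current approver
    ttl_full: int        # full TTL granted to a fresh approver


def reassign_on_offboarding(pending, departed, roster, audit_log):
    """Move the departed approver's waiting requests to the next role in the chain.

    roster is resolved at call time, so it reflects the current team.
    Returns the number of requests reassigned.
    """
    moved = {}  # new approver -> number of requests reassigned to them
    for req in pending:
        if req.approver != departed:
            continue
        for i in range(req.step + 1, len(req.chain)):
            holder = roster.get(req.chain[i])
            if holder and holder != departed:
                req.step, req.approver = i, holder
                req.ttl_remaining = req.ttl_full  # fresh TTL for the new approver
                moved[holder] = moved.get(holder, 0) + 1
                break
        # If no later role resolves, the request stays put and the normal
        # timeout path (escalate, then deny) handles it.
    for holder, count in moved.items():
        audit_log.append({"event": "approver_reassigned", "from": departed,
                          "to": holder, "reason": "user_deprovisioned",
                          "pending_requests": count})
    return sum(moved.values())
```

The state stays WAITING_APPROVAL throughout; only the approver, step, and TTL change, which matches the self-transition described in step 2.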

On-Call Rotation Integration

Production approval routing should integrate with your existing on-call infrastructure (PagerDuty, Opsgenie, VictorOps). The pattern:

  1. Escalation chain resolves to on-call schedules: Instead of hardcoding individuals, the chain specifies schedule names: ["primary-dba", "secondary-dba", "oncall-engineering"].
  2. At approval-request time, resolve to current on-call: Query the PagerDuty/Opsgenie API for the person currently on call for each schedule. Cache the resolution for the TTL window.
  3. Notification routing: Send approval requests to the on-call engineer's preferred notification channel (SMS, push notification, phone call) as configured in the on-call platform — don't assume Slack or email availability.
  4. Incident correlation: If an active incident exists in the on-call platform that matches the same service, the approval request should reference that incident. The approver can see: "This request relates to active incident INC-00421 (database migration in progress)."
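
Steps 1 and 2 can be sketched without committing to a specific on-call vendor. The lookup callable below is a placeholder for whatever client queries your on-call platform; it is not a real PagerDuty or Opsgenie API, and the class name is hypothetical.

```python
import time


class OnCallResolver:
    """Resolves schedule names to the person currently on call, with a TTL cache."""

    def __init__(self, lookup, clock=time.monotonic):
        self.lookup = lookup   # hypothetical callable: schedule name -> user id
        self.clock = clock
        self._cache = {}       # schedule name -> (user id, cache expiry time)

    def resolve(self, schedule, cache_seconds):
        """Return the on-call user, reusing a cached answer while it is fresh."""
        cached = self._cache.get(schedule)
        if cached and cached[1] > self.clock():
            return cached[0]
        user = self.lookup(schedule)
        self._cache[schedule] = (user, self.clock() + cache_seconds)
        return user

    def resolve_chain(self, schedules, cache_seconds):
        """Resolve an escalation chain of schedule names, preserving order."""
        return [self.resolve(s, cache_seconds) for s in schedules]


# Usage with a stubbed lookup in place of a real on-call API call.
resolver = OnCallResolver(lambda schedule: "user-for-" + schedule)
print(resolver.resolve_chain(["primary-dba", "secondary-dba"], cache_seconds=1800))
```

Caching for the approval TTL keeps the approver stable for the lifetime of one request even if the rotation changes mid-window; whether that is desirable is a policy choice.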

EscalationPolicy Implementation

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional, Callable
import threading
import time

@dataclass
class EscalationStep:
    """One step in an escalation chain"""
    approver_id: str
    ttl_seconds: int
    notified_at: Optional[float] = None

# WARNING: threading.Timer is not production-ready — timers are lost on process restart.
# In production, persist deadlines to a database or message queue and use a scheduler.
@dataclass
class EscalationPolicy:
    """Manages TTL timers and escalation chain traversal"""
    request_id: str
    steps: list      # list of EscalationStep
    on_expire: Callable  # called when chain is exhausted
    on_escalate: Callable  # called on each step transition
    current_step: int = 0
    timer: Optional[threading.Timer] = None
    cancelled: bool = False
    # Reentrant lock: guards against cancel() racing with a firing timer,
    # and still lets a callback call cancel() without deadlocking.
    _lock: object = field(default_factory=threading.RLock, repr=False)

    def start(self):
        """Begin the first escalation step"""
        with self._lock:
            self._notify_current()

    def _on_timeout(self):
        """Current step timed out: move to the next one"""
        with self._lock:
            if self.cancelled:
                return  # a human responded while this timer was firing
            self.current_step += 1
            self._notify_current()

    def _notify_current(self):
        """Notify the current approver, or expire if the chain is exhausted.

        Caller must hold self._lock.
        """
        if self.current_step >= len(self.steps):
            self.on_expire(self.request_id)  # chain exhausted: terminal action
            return
        step = self.steps[self.current_step]
        step.notified_at = time.time()
        self.on_escalate(self.request_id, step.approver_id, step.ttl_seconds)
        self.timer = threading.Timer(step.ttl_seconds, self._on_timeout)
        self.timer.start()

    def cancel(self):
        """Human responded: stop the timer and block any further escalation"""
        with self._lock:
            self.cancelled = True
            if self.timer:
                self.timer.cancel()

# --- Usage ---
def escalate(request_id, approver, ttl):
    print(f"[ESCALATE] {request_id} → {approver} (TTL: {ttl}s)")

def expire(request_id):
    print(f"[EXPIRE] {request_id} — escalation chain exhausted, auto-deny")

policy = EscalationPolicy(
    request_id="req-01",
    steps=[
        EscalationStep("alice", ttl_seconds=30),
        EscalationStep("bob", ttl_seconds=15),
        EscalationStep("oncall-eng", ttl_seconds=10),
    ],
    on_escalate=escalate,
    on_expire=expire,
)
policy.start()  # begins 30s timer for alice
# After 30s: escalates to bob (15s timer)
# After 45s total: escalates to oncall-eng (10s timer)
# After 55s total: on_expire fires → auto-deny

In production, replace threading.Timer with a durable timer (Redis keyspace notifications, a database-backed polling job, or a Temporal timer). In-memory timers don't survive process restarts: a CRITICAL approval whose timer was pending when the process restarted would never escalate or expire. See Agent Error Recovery for patterns on handling timeout interruptions, and Agent Observability for monitoring escalation rate and queue health metrics.
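
As one illustration of the database-polling option, the sketch below persists each step's deadline in SQLite and lets a periodic job collect whatever is due. The table name, column layout, and function names are assumptions; the in-memory path is only for the demo, and a file path or a shared database is what makes the deadlines survive a restart.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (and create if needed) the deadline store."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS escalation_deadlines (
                      request_id TEXT PRIMARY KEY,
                      step       INTEGER NOT NULL,
                      due_at     REAL NOT NULL)""")
    return db


def schedule(db, request_id, step, ttl_seconds, now=None):
    """Persist (or replace) the deadline for the current escalation step."""
    now = time.time() if now is None else now
    db.execute("INSERT OR REPLACE INTO escalation_deadlines VALUES (?, ?, ?)",
               (request_id, step, now + ttl_seconds))
    db.commit()


def cancel(db, request_id):
    """A human responded: remove the pending deadline."""
    db.execute("DELETE FROM escalation_deadlines WHERE request_id = ?", (request_id,))
    db.commit()


def poll_due(db, now=None):
    """Return and remove every deadline that has passed; run this on an interval."""
    now = time.time() if now is None else now
    rows = db.execute("SELECT request_id, step FROM escalation_deadlines "
                      "WHERE due_at <= ? ORDER BY due_at", (now,)).fetchall()
    db.executemany("DELETE FROM escalation_deadlines WHERE request_id = ?",
                   [(row[0],) for row in rows])
    db.commit()
    return rows


db = open_store()
schedule(db, "req-01", step=0, ttl_seconds=30, now=1000)
print(poll_due(db, now=1031))  # the caller escalates or expires each returned row
```

For each row returned by poll_due, the caller would advance to the next escalation step and call schedule again, or fire the terminal deny when the chain is exhausted. With several workers polling the same store, the select-then-delete would need to run in a single transaction or use row claiming.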

6. Framework Comparison — LangGraph, AgentGraph, AutoGen, CrewAI

Human-in-the-loop isn't a feature you bolt onto a framework — it's a capability that's either deeply integrated into the framework's execution model or hacked around its edges. This section compares four major agent frameworks on their HITL primitives, not on their overall feature sets.

HITL Capabilities Matrix

| Capability | LangGraph | AgentGraph | AutoGen | CrewAI |
| --- | --- | --- | --- | --- |
| Interrupt mechanism | interrupt() + Command(resume) | graph.Interrupt() + ResumeCommand | UserProxyAgent(human_input_mode) | human_input flag |
| Checkpoint persistence | SqliteSaver / PostgresSaver | Built-in (Redis/DB) | No built-in | No built-in |
| Approval timeout | Manual (thread timer or external scheduler) | Context deadline | Manual timeout context | Manual |
| Multi-node approval | At any graph node | At any graph node | One UserProxyAgent | Per-task flag |
| Escalation support | None built-in | None built-in | None built-in | None built-in |
| Compliance readiness | Partial (checkpoint logs) | Partial | Low | Low |
| Ecosystem maturity | High (v0.2+ community) | Growing (Tencent) | High (Microsoft) | High (simple API) |

The most revealing row is Escalation support: None built-in across all four frameworks. No major agent framework ships with timeout escalation chains, dead letter queues, or circuit breaker patterns. This is the gap this article fills — you'll build the escalation layer yourself regardless of which framework you choose. The framework comparison therefore focuses on what each framework does provide: the interrupt-and-resume primitive that makes human-in-the-loop possible at all.

Framework-Specific HITL Patterns

LangGraph: interrupt() + Command(resume)

LangGraph's HITL model is the most mature of the four. The interrupt() function pauses graph execution at any node, serializes the full graph state to a checkpoint, and waits for external input via Command(resume=...). The checkpoint is persisted to a Saver backend (SQLite for development, PostgreSQL for production).

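from langgraph.graph import StateGraph
from langgraph.types import Command, interrupt

def dangerous_node(state):
    # Pause here: state is checkpointed and the graph waits for a human.
    # On resume, this node re-runs from the top and interrupt() returns
    # the value passed to Command(resume=...).
    approval = interrupt({
        "action": "DROP_TABLE",
        "params": state["pending_table"],
        "risk": "CRITICAL"
    })
    if approval.get("decision") == "approved":
        execute_drop(state["pending_table"])  # execute_drop: your own tool function
    return {}  # no state update

# External resume. `graph` must be compiled with a checkpointer, and `config`
# must carry the thread_id of the interrupted run, for example:
# config = {"configurable": {"thread_id": "approval-demo"}}
graph.invoke(Command(resume={"decision": "approved"}), config)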

The strength here is that interrupt() is a first-class framework primitive, not a wrapper bolted on afterwards. Once a checkpointer is configured, checkpointing is automatic, so you don't manually snapshot state. The weakness: no built-in timeout. You implement the approval TTL externally (a cron job or scheduler that watches pending deadlines) and push a Command(resume={"decision": "expired"}) when the deadline passes.

AgentGraph: graph.Interrupt() + ResumeCommand

AgentGraph (Tencent's agent framework) follows a similar pattern to LangGraph but uses graph.Interrupt() with a ResumeCommand primitive. The key difference is that AgentGraph supports context deadlines natively — you can attach a deadline to an interrupt, and the framework will automatically trigger the timeout path if no ResumeCommand arrives in time.

import agentgraph as ag

def approval_node(ctx):
    result = ag.graph.Interrupt(
        ctx,
        message="DROP TABLE approval required",
        deadline_seconds=1800  # 30-minute context deadline
    )
    if result.action == "resume":
        execute(ctx.params)
    # Context deadline triggers fallback path automatically

AgentGraph's built-in context deadline is a meaningful advantage over LangGraph for approval workflows — it eliminates the external scheduler requirement. However, the ecosystem is substantially smaller, and production tooling (monitoring, alerting, operational dashboards) is less mature.

AutoGen: UserProxyAgent with human_input_mode

AutoGen takes a different architectural approach. Instead of interrupting graph execution, it designates a UserProxyAgent that acts as the human's proxy in the agent conversation. The human_input_mode parameter controls when the proxy pauses for human input:

from autogen import UserProxyAgent, AssistantAgent

user_proxy = UserProxyAgent(
    name="human_approver",
    human_input_mode="ALWAYS",  # or "TERMINATE" or "NEVER"
    code_execution_config=False
)
assistant = AssistantAgent(name="agent")

# When assistant proposes a dangerous action, user_proxy
# pauses and prompts the human via terminal input
user_proxy.initiate_chat(assistant, message="Clean up old DB tables")

The limitation: AutoGen's HITL model is tied to a single UserProxyAgent. If you have a multi-agent topology with five agents, only the one designated as the human proxy can request approval — other agents must route through it. This works for simple workflows but becomes a bottleneck in complex multi-agent orchestration. There's also no built-in checkpoint persistence or timeout mechanism; both must be implemented manually.

CrewAI: Task(human_input=True)

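CrewAI offers the simplest HITL model: a boolean flag on a Task. When human_input=True, the agent prompts the user in the terminal for feedback before it delivers the task's final answer. Tool calls made while working on the task may already have run by then, so a dangerous tool still needs its own gate.

from crewai import Task, Agent, Crew

# db_agent is assumed to be an Agent defined elsewhere
delete_task = Task(
    description="Delete tmp_backup_2025_04_01 from production DB",
    expected_output="Confirmation that the table was removed",
    agent=db_agent,
    human_input=True  # ask for human feedback before the final answer
)

crew = Crew(agents=[db_agent], tasks=[delete_task])
crew.kickoff()  # prompts the user via terminal before delete_task is finalized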

CrewAI's strength is simplicity — the API surface is minimal, and the mental model is "check a box on the task." The weakness: it's effectively a synchronous CLI prompt. No checkpoint persistence, no timeout, no escalation, no structured approval schema. For production systems with compliance requirements, you'd need to wrap CrewAI's task execution with the full approval state machine described in Section 3.

Decision Guide: Which Framework for Which HITL Requirements

| Requirement | Recommended Framework | Why |
| --- | --- | --- |
| Production compliance (SOC 2 / HIPAA) | LangGraph + custom escalation | Checkpoint persistence (PostgresSaver) provides the audit trail foundation. interrupt() is a clean pause point. Build escalation externally. |
| Rapid prototyping, internal tools | CrewAI or AutoGen | Minimal setup. human_input=True or human_input_mode gets you a working HITL flow in minutes. |
| Multi-agent with approval routing | LangGraph | Any graph node can call interrupt(), so approval isn't centralized in one agent. Supports multi-node approval topologies. |
| Built-in timeout support | AgentGraph | Context deadline on Interrupt() eliminates the need for an external timeout scheduler. Good fit if you're in the Tencent ecosystem. |
| Microsoft / Azure ecosystem | AutoGen | Deep integration with Azure services. If your approval backend is Azure Logic Apps or Teams, AutoGen has the shortest integration path. |

The framework you choose determines where you place the approval gate, not whether you need one. Every framework in this comparison requires you to build escalation, timeout management, and audit trail yourself. The ApprovalStateMachine from Section 3 is designed to plug into any of these frameworks at the interrupt point, giving you the escalation and compliance layer that none of them provide natively. For more on multi-agent topologies, see Multi-Agent Orchestration. For how approval requests interact with tool design, see Agent Tool Design.

7. Multi-Agent Approval & Notification Integration

Sections 1-6 assume a single agent requesting human approval. But production systems rarely have just one agent. When Agent A delegates to Agent B, and B encounters a CRITICAL action, the approval request must carry the full delegation context — the human needs to understand not just what B wants to do, but why B is doing it and who asked for it.

Multi-Agent Scenario: Delegation Chain in Approval Context

Consider this topology:


User: "Optimize our cloud spending"
  │
  ▼
Agent A (Orchestrator) — analyzes billing data, identifies candidates
  │
  ├──► Agent B (DB Cleanup) — proposes DROP TABLE on old backups
  │      │
  │      └──► CRITICAL: DROP TABLE tmp_prod_backup_2025_03
  │             │
  │             └──► Who requests approval? What context goes to the human?
  │
  └──► Agent C (Compute) — proposes downsizing underutilized instances

The rule: the agent performing the dangerous action issues the approval request. Agent B requests approval for the DROP TABLE, not Agent A. But the request carries the full delegation chain:

{
  "action": {"tool_name": "DROP_TABLE", "description": "..."},
  "proposed_by": {
    "agent_id": "agent-b-db-cleanup",
    "agent_name": "DB Cleanup Agent"
  },
  "delegation_chain": [
    {
      "delegated_by": "agent-a-orchestrator",
      "task": "Optimize cloud spending — identify removable DB resources",
      "timestamp": "2026-05-31T14:32:00Z"
    }
  ],
  "context_window": {
    "reasoning": "Agent A asked me to find removable DB resources...",
    "upstream_task": "Optimize cloud spending (Agent A → Agent B)"
  }
}

The delegation_chain field is an ordered list showing every hop from the original user request to the current agent. The human approver sees the full journey: the user asked for cost optimization → Agent A delegated DB cleanup to Agent B → Agent B found this specific table → Agent B now wants to drop it. Without this context, the approver sees "Agent B wants to drop a table" with no understanding of why — and most humans will approve a request they don't understand just to clear the queue.
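
One way to keep the chain accurate is to append a hop whenever an agent hands work to another. The helpers below are a sketch; the function names delegate and describe_journey are hypothetical, while the hop fields match the JSON example above.

```python
from datetime import datetime, timezone


def delegate(parent_chain, delegated_by, task, now=None):
    """Return a new delegation chain with one more hop; the input is not mutated."""
    now = now or datetime.now(timezone.utc)
    hop = {
        "delegated_by": delegated_by,
        "task": task,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return list(parent_chain) + [hop]


def describe_journey(user_request, chain, proposer, action):
    """Render the full path from user request to proposed action in one line."""
    parts = ['User asked: "%s"' % user_request]
    for hop in chain:
        parts.append("%s delegated: %s" % (hop["delegated_by"], hop["task"]))
    parts.append("%s proposes: %s" % (proposer, action))
    return " -> ".join(parts)


chain = delegate([], "agent-a-orchestrator",
                 "Optimize cloud spending: identify removable DB resources")
print(describe_journey("Optimize our cloud spending", chain,
                       "agent-b-db-cleanup", "DROP TABLE tmp_prod_backup_2025_03"))
```

Returning a new list instead of mutating the parent's chain means Agent B and Agent C each carry their own history even though both were delegated from Agent A.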

Centralized vs. Distributed Approval Queue

| Architecture | How It Works | Tradeoffs | Best For |
| --- | --- | --- | --- |
| Centralized | Single approval service. All agents submit requests to one queue. One dashboard, one notification system. | Pro: Unified view, single audit log, consistent notification. Con: Single point of failure. If approval service is down, all agents are blocked. | Small teams (≤ 5 agents), single-tenant systems. Compliance environments where a single audit trail is mandatory. |
| Distributed | Each agent (or agent cluster) runs its own approval queue. Queues federate via a shared notification layer. | Pro: No single point of failure. Each agent's queue is independently scalable. Con: Harder to get a unified view of all pending approvals. Multiple audit logs to correlate. | Large-scale systems (10+ agents), multi-tenant platforms. Systems where different agents have different approver pools. |

A hybrid pattern works well in practice: distributed queues with a centralized read replica. Each agent maintains its own state machine and approval queue (distributed), but all queues replicate their state to a central read-only database for the operations dashboard and compliance reporting. The central database is not in the approval critical path — if it goes down, agents continue processing approvals from their local queues.
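
The property that matters in the hybrid pattern is that replication failures never block a local decision. A minimal sketch, assuming a replicate callable that writes one event to the central replica (the class and parameter names are hypothetical):

```python
class LocalApprovalQueue:
    """Per-agent approval queue that mirrors state to a central read replica."""

    def __init__(self, agent_id, replicate):
        self.agent_id = agent_id
        self.replicate = replicate  # hypothetical callable: writes one event centrally
        self.requests = {}          # request_id -> current state
        self.unreplicated = []      # events to retry once the replica is reachable

    def _publish(self, state_event):
        try:
            self.replicate(state_event)
        except Exception:
            # The replica is outside the approval critical path: buffer and carry on.
            self.unreplicated.append(state_event)

    def submit(self, request_id):
        self.requests[request_id] = "WAITING_APPROVAL"
        self._publish({"agent": self.agent_id, "request_id": request_id,
                       "state": "WAITING_APPROVAL"})

    def decide(self, request_id, decision):
        self.requests[request_id] = decision
        self._publish({"agent": self.agent_id, "request_id": request_id,
                       "state": decision})

    def flush(self):
        """Retry buffered events, for example from a periodic background job."""
        pending, self.unreplicated = self.unreplicated, []
        for state_event in pending:
            self._publish(state_event)
```

Events that fail again during flush simply return to the buffer, so the dashboard may lag during an outage but catches up in order once the replica recovers.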

CLI Agent Approval UX Patterns

Not all agents run in a web dashboard. CLI-based coding agents (Claude Code, OpenCode, Cursor agent mode) need a different approval UX. Three patterns have emerged:

Pattern 1: Inline Prompt (Synchronous)

Agent: I need to drop table tmp_backup_2025_04_01 [CRITICAL]
       Reason: Matches cleanup pattern for tables > 30 days old
       Risk: DROP TABLE on production db-prod-1
       ┌──────────────────────────────────────────┐
       │ Approve? [Y]es / [N]o / [M]odify / [D]etails  │
       └──────────────────────────────────────────┘
       █

Best for interactive CLI sessions where the developer is present and watching. The agent blocks until the human responds. Simple, low-latency. The downside: the human must be present — if they step away, the agent is frozen.

Pattern 2: Background Notification (Asynchronous)

Agent: [APPROVAL-REQ#42] DROP TABLE tmp_backup_2025_04_01 → queued.
       Notification sent to Slack #agent-approvals.
       I'll continue with other tasks while waiting.
       To check status: /approvals list

Best for long-running agents or batch operations. The agent doesn't block — it continues with non-dependent work while the approval is pending. The human receives a Slack/email notification and can approve asynchronously. The agent polls or receives a webhook callback when the decision is made.
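
The "continue with non-dependent work" part can be sketched as a simple partition of the agent's task list. The task and approval shapes below are assumptions for illustration, as is the function name plan_next_tasks.

```python
def plan_next_tasks(tasks, approvals):
    """Split tasks into runnable now, waiting on approval, and cancelled.

    tasks: list of dicts with "name" and an optional "needs_approval" request id.
    approvals: dict of request id -> "pending" | "approved" | "rejected" | "expired".
    """
    runnable, waiting, cancelled = [], [], []
    for task in tasks:
        request_id = task.get("needs_approval")
        if request_id is None:
            runnable.append(task["name"])   # no gate on this task
            continue
        status = approvals.get(request_id, "pending")
        if status == "approved":
            runnable.append(task["name"])
        elif status == "pending":
            waiting.append(task["name"])    # re-check on the next poll or webhook
        else:
            cancelled.append(task["name"])  # rejected or expired: never run
    return runnable, waiting, cancelled


tasks = [
    {"name": "analyze billing data"},
    {"name": "drop tmp_backup_2025_04_01", "needs_approval": "req-42"},
    {"name": "write cleanup report"},
]
print(plan_next_tasks(tasks, {"req-42": "pending"}))
```

An unknown request id is treated as pending rather than approved, so a lost approval record stalls the task instead of executing it.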

Pattern 3: Web Dashboard (Rich Context)

┌─────────────────────────────────────────────────────────┐
│ Pending Approvals (3)                          [Dashboard] │
├─────────────────────────────────────────────────────────┤
│ ⬤ CRITICAL  DROP TABLE tmp_backup_2025_04_01    [View Diff] │
│   Agent: DB Cleanup Agent  •  14 min remaining             │
│   Delegated by: Orchestrator Agent                         │
│   [Approve]  [Reject]  [Modify Parameters]  [View Context]  │
│                                                            │
│ ⬤ HIGH      UPDATE nginx config (rate limiting)  [View Diff] │
│   Agent: Config Agent  •  3h 12min remaining               │
│   [Approve]  [Reject]  [Modify Parameters]  [View Context]  │
│                                                            │
│ ⬤ HIGH      SEND customer email (billing alert)  [View Diff] │
│   Agent: Notification Agent  •  1h 45min remaining          │
│   [Approve]  [Reject]  [Modify Parameters]  [View Context]  │
└─────────────────────────────────────────────────────────┘

Best for operations teams managing multiple agents. The dashboard provides batch approve/reject, rich diff views (showing exactly what will change), and queue health metrics. This is the pattern to build toward for production systems — the inline prompt is a development convenience; the dashboard is the production interface.

Notification Integration: Slack, Email, PagerDuty

The approval request is only as effective as its delivery channel. A Slack message is useless if the approver has notifications muted. An email is useless if it lands in the "Updates" tab. The notification strategy must match the approval urgency:

| Risk Level | Primary Channel | Escalation Channel | Format |
| --- | --- | --- | --- |
| LOW_RISK | Slack / Teams message | Email digest | Daily summary or batched notification |
| HIGH_RISK | Slack / Teams with @mention | Email + mobile push | Individual notification with 30-second decision context |
| CRITICAL | PagerDuty / Opsgenie alert | SMS + phone call | Incident-level alert, interrupts current workflow |

Here's a Slack webhook handler for approval notifications — note the structured message format designed for rapid human decision-making:

import requests
import json
from datetime import datetime

class SlackApprovalNotifier:
    """Sends structured approval requests to Slack via webhook"""

    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    def send_approval_request(self, request: dict) -> str:
        """Send an approval request to Slack. Returns message_ts for threading."""
        risk_emoji = {"CRITICAL": "🔴", "HIGH_RISK": "🟠", "LOW_RISK": "🟡"}
        emoji = risk_emoji.get(request["risk_level"], "⚪")

        blocks = [
            {
                "type": "header",
                "text": {"type": "plain_text",
                         "text": f"{emoji} Agent Approval Required — {request['risk_level']}"}
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Action:*\n{request['action']['description']}"},
                    {"type": "mrkdwn", "text": f"*Agent:*\n{request['proposed_by']['agent_name']}"},
                    {"type": "mrkdwn", "text": f"*Deadline:*\n{request['deadline']}"},
                    {"type": "mrkdwn", "text": f"*ID:*\n`{request['idempotency_key']}`"},
                ]
            },
            {
                "type": "section",
                "text": {"type": "mrkdwn",
                         "text": f"*Reasoning:*\n{request['context_window']['reasoning'][:500]}"}
            },
            {
                "type": "actions",
                "elements": [
                    {"type": "button", "text": {"type": "plain_text", "text": "✅ Approve"},
                     "style": "primary", "value": request["idempotency_key"],
                     "action_id": "approval_approve"},
                    {"type": "button", "text": {"type": "plain_text", "text": "❌ Reject"},
                     "style": "danger", "value": request["idempotency_key"],
                     "action_id": "approval_reject"},
                    {"type": "button", "text": {"type": "plain_text", "text": "📝 Modify"},
                     "value": request["idempotency_key"],
                     "action_id": "approval_modify"},
                ]
            }
        ]

        if request.get("delegation_chain"):
            chain_text = " → ".join(
                d["delegated_by"] for d in request["delegation_chain"])
            blocks.insert(2, {
                "type": "context",
                "elements": [{"type": "mrkdwn",
                              "text": f"*Delegation chain:* {chain_text} → {request['proposed_by']['agent_name']}"}]
            })

        payload = {
            "text": f"Agent approval required: {request['action']['description']}",
            "blocks": blocks
        }

        resp = requests.post(self.webhook_url, json=payload, timeout=10)
        resp.raise_for_status()
        return resp.text  # Slack returns "ok"


# Usage
notifier = SlackApprovalNotifier(
    "https://hooks.slack.com/services/T00/B00/xxxx"
)
notifier.send_approval_request(request_dict)

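The buttons in the Slack message trigger interactive component callbacks: your approval service receives the action_id (approval_approve, approval_reject, or approval_modify) and the value (the idempotency key), then calls the corresponding state machine transition. For compliance environments, verify the X-Slack-Signature header on every callback (an HMAC-SHA256 over the request timestamp and raw body, keyed with your app's signing secret) and log the acting Slack user ID and the trigger_id as part of the audit trail. The response_url and trigger_id are correlation identifiers, not cryptographic proof; the verified signature is what links the callback to Slack.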
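The callback side of that flow could look roughly like the sketch below. It assumes Slack's documented request-signing scheme (the `X-Slack-Signature` and `X-Slack-Request-Timestamp` headers, with base string `v0:{timestamp}:{body}`) and the form-encoded `payload` field used for interactive components. The function names, the `ACTION_TO_TRANSITION` mapping, and the five-minute skew window are illustrative choices, and the returned transition name still has to be wired to your own state machine.

```python
import hashlib
import hmac
import json
import time
from typing import Optional
from urllib.parse import parse_qs

MAX_SKEW_SECONDS = 60 * 5  # reject stale requests to limit replay (assumed window)

# Maps the action_id values used in the notifier to state machine transitions.
ACTION_TO_TRANSITION = {
    "approval_approve": "approve",
    "approval_reject": "reject",
    "approval_modify": "modify",
}


def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, now: Optional[float] = None) -> bool:
    """Check the HMAC-SHA256 signature Slack attaches to each callback."""
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > MAX_SKEW_SECONDS:
            return False
    except ValueError:
        return False
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())


def parse_interaction(body: str) -> dict:
    """Extract the decision from a form-encoded interactive payload."""
    payload = json.loads(parse_qs(body)["payload"][0])
    action = payload["actions"][0]
    return {
        "transition": ACTION_TO_TRANSITION[action["action_id"]],
        "idempotency_key": action["value"],
        "slack_user_id": payload["user"]["id"],
    }
```

Verify the signature against the raw request body before parsing it; re-serializing the parsed form would change the bytes and break the check.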

Audit Trail with Cryptographic Evidence

Every approval decision must be not just logged, but tamper-evident. A simple database row with approved_by=alice can be modified by anyone with database access. The production-grade approach is a hash chain of approval decisions:

import hashlib
import json
from datetime import datetime, timezone

class ApprovalEvidenceChain:
    """Cryptographic hash chain: a tamper-evident audit trail"""

    def __init__(self, request_id: str):
        self.request_id = request_id
        # Genesis hash: derived from the request ID so every chain starts from a known value
        self.chain_hash = hashlib.sha256(request_id.encode()).hexdigest()
        self.events: list = []

    def append_event(self, event_type: str, data: dict, principal: str):
        """Append an event to the chain. Each event's hash depends on the previous hash."""
        event = {
            "type": event_type,
            "data": data,
            "principal": principal,
            # Timezone-aware UTC keeps timestamps comparable across hosts
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": self.chain_hash
        }
        event_bytes = json.dumps(event, sort_keys=True).encode()
        self.chain_hash = hashlib.sha256(event_bytes).hexdigest()
        event["chain_hash"] = self.chain_hash
        self.events.append(event)
        return event

    def verify(self) -> bool:
        """Verify the entire chain hasn't been tampered with."""
        expected = hashlib.sha256(self.request_id.encode()).hexdigest()
        for event in self.events:
            event_copy = {k: v for k, v in event.items() if k != "chain_hash"}
            event_copy["prev_hash"] = expected
            expected = hashlib.sha256(
                json.dumps(event_copy, sort_keys=True).encode()
            ).hexdigest()
            if expected != event["chain_hash"]:
                return False
        return True


# Usage during approval lifecycle
chain = ApprovalEvidenceChain("req-01")
chain.append_event("proposed", {"risk": "CRITICAL"}, "agent-b")
chain.append_event("escalated", {"from": "alice", "to": "bob"}, "system")
chain.append_event("approved", {"decision": "approved"}, "bob")

# Verify before presenting audit evidence
assert chain.verify()  # True — chain is intact
# If any event is modified, verify() returns False

This hash chain pattern lets an auditor verify the entire approval decision trail without trusting your database, provided the head of the chain is anchored somewhere a database administrator cannot rewrite. Store the final chain_hash in an append-only ledger (or periodically publish it to a transparency log), and any modification to an earlier event becomes detectable. Without that anchor, someone with write access could alter an event and recompute every later hash. For SOC 2 CC7.2 (monitoring system components for anomalies) and HIPAA §164.312(b) (audit controls), this supports the claim that the audit trail hasn't been tampered with. See Agent Audit Log Design for a complete treatment of agent audit trail architecture, and Agent Context Protocol Design for how approval context flows between agents.
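To make the role of the external anchor concrete, here is a small standalone sketch. It does not reuse the class above; `append`, `recompute_head`, and `verify_against_anchor` are hypothetical helper names, and the event layout is simplified to keep the example short. It shows that a fully rewritten chain can be internally consistent and is only caught by comparing against the anchored head.

```python
import hashlib
import json
from typing import List, Optional


def genesis_hash(request_id: str) -> str:
    return hashlib.sha256(request_id.encode()).hexdigest()


def append(events: List[dict], request_id: str, event_type: str, principal: str) -> str:
    """Append an event and return the new head hash."""
    prev = events[-1]["chain_hash"] if events else genesis_hash(request_id)
    event = {"type": event_type, "principal": principal, "prev_hash": prev}
    # The hash covers the event body only; chain_hash is added afterwards.
    event["chain_hash"] = hashlib.sha256(
        json.dumps(event, sort_keys=True).encode()).hexdigest()
    events.append(event)
    return event["chain_hash"]


def recompute_head(events: List[dict], request_id: str) -> Optional[str]:
    """Return the head hash if the chain is internally consistent, else None."""
    head = genesis_hash(request_id)
    for event in events:
        body = {k: v for k, v in event.items() if k != "chain_hash"}
        if body["prev_hash"] != head:
            return None
        head = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if head != event["chain_hash"]:
            return None
    return head


def verify_against_anchor(events: List[dict], request_id: str, anchored_head: str) -> bool:
    """Internal consistency plus equality with the externally stored head."""
    head = recompute_head(events, request_id)
    return head is not None and head == anchored_head


# Honest chain; its head is what would be written to the append-only ledger.
honest: List[dict] = []
append(honest, "req-01", "proposed", "agent-b")
anchored = append(honest, "req-01", "approved", "bob")

# An attacker with database write access rebuilds the whole chain.
forged: List[dict] = []
append(forged, "req-01", "proposed", "agent-b")
append(forged, "req-01", "approved", "mallory")
```

Here `recompute_head(forged, "req-01")` succeeds because the forged chain is self-consistent, while `verify_against_anchor(forged, "req-01", anchored)` fails. The same comparison also catches truncation, where trailing events are simply deleted.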

8. Testing Approval Workflows & Production Checklist

Approval workflows sit on the critical path between agent autonomy and production safety. A bug here doesn't just break functionality: it either blocks legitimate operations (false positives) or lets dangerous ones execute without human sign-off (false negatives). Testing must cover both paths.

Unit Testing Approval Flows

The approval state machine is deterministic — given an initial state and an event, the next state is predictable. This makes it an ideal candidate for table-driven unit testing:

| Test Case | Initial State | Event | Expected State | Guard Condition |
|---|---|---|---|---|
| Normal approval flow | IDLE | tool_call_risk=HIGH | PROPOSING | risk != AUTO |
| Human approves | WAITING_APPROVAL | human_approve (valid approver) | APPROVED | Approver in authorized list |
| Human rejects | WAITING_APPROVAL | human_reject | REJECTED | (none) |
| Human modifies | WAITING_APPROVAL | human_modify (valid mod) | MODIFIED → PROPOSING | Modification allowed |
| Timeout + escalate | WAITING_APPROVAL | timeout (chain has next) | EXPIRED → WAITING_APPROVAL | Escalation chain not exhausted |
| Timeout + auto-deny | WAITING_APPROVAL | timeout (chain exhausted) | EXPIRED → REJECTED | Escalation chain exhausted |
| Late approval rejected | EXPIRED | human_approve | No transition (invalid) | State != WAITING_APPROVAL |
| Idempotency: double-approve | EXECUTING | duplicate_approve | No transition (idempotent) | Action ID already in execution log |
| Resume with valid checkpoint | APPROVED | check_idempotent (clean) | RESUMING → EXECUTING | Action ID not in execution log |
| Resume after execution | APPROVED | check_idempotent (dirty) | RESUMING → IDLE | Action ID already executed |
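A table like this translates almost directly into a table-driven test. The sketch below uses a deliberately tiny stand-in for the state machine (the `State` enum, `TRANSITIONS` map, and `next_state` function are hypothetical and cover only a few rows, without guard conditions); in a real suite the same `CASES` loop would drive the article's actual state machine.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    PROPOSING = "PROPOSING"
    WAITING_APPROVAL = "WAITING_APPROVAL"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    EXPIRED = "EXPIRED"
    EXECUTING = "EXECUTING"


# Stand-in transition map: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "tool_call_high_risk"): State.PROPOSING,
    (State.WAITING_APPROVAL, "human_approve"): State.APPROVED,
    (State.WAITING_APPROVAL, "human_reject"): State.REJECTED,
    (State.WAITING_APPROVAL, "timeout"): State.EXPIRED,
}


def next_state(state: State, event: str) -> State:
    """Unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


# Each row: (name, initial state, event, expected state).
CASES = [
    ("normal approval flow", State.IDLE, "tool_call_high_risk", State.PROPOSING),
    ("human approves", State.WAITING_APPROVAL, "human_approve", State.APPROVED),
    ("human rejects", State.WAITING_APPROVAL, "human_reject", State.REJECTED),
    ("late approval rejected", State.EXPIRED, "human_approve", State.EXPIRED),
    ("double approve is a no-op", State.EXECUTING, "human_approve", State.EXECUTING),
]


def run_cases() -> list:
    """Return the names of failing cases; an empty list means all passed."""
    failures = []
    for name, initial, event, expected in CASES:
        if next_state(initial, event) is not expected:
            failures.append(name)
    return failures
```

With pytest, the same rows can be passed to `pytest.mark.parametrize` so each case is reported individually.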

Use a MockApprover that returns predefined responses (approve, reject, timeout) on demand. This isolates the state machine from the actual human interaction channel during testing:

class MockApprover:
    """Configurable mock — returns approve/reject/timeout on demand"""

    def __init__(self, responses: list):
        self.responses = responses  # ["approve", "timeout", "reject"]
        self.call_count = 0

    def decide(self, request) -> str:
        if self.call_count >= len(self.responses):
            return "timeout"  # default: timeout if no more responses
        decision = self.responses[self.call_count]
        self.call_count += 1
        return decision

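# Test: approve then verify execution
# Assumes `sm`, `ApprovalState`, and `request_dict` come from the state machine
# setup shown earlier, with `sm` already driven into WAITING_APPROVAL.
mock = MockApprover(["approve"])
assert sm.state == ApprovalState.WAITING_APPROVAL
decision = mock.decide(request_dict)
if decision == "approve":
    sm.approve("test-approver")
assert sm.state == ApprovalState.APPROVED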

Integration Testing: Timeout Simulation

Unit tests verify state transitions. Integration tests verify that the escalation chain actually fires when the clock runs out:

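  1. Short TTL test: Configure a CRITICAL approval with a 2-second TTL and a 3-step escalation chain with 1-second steps. Submit the request, then wait for the full 2 + 3 × 1 = 5 seconds plus a margin (or poll with a deadline a few seconds longer, since a fixed 5-second sleep races the final timeout), and assert that the escalation chain is exhausted and the final state is REJECTED.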
  2. Rejection recovery test: Agent proposes action → human rejects → agent receives REJECTED state → agent logs the rejection, notifies the user, and continues with the next task (does not retry the rejected action unless explicitly instructed).
  3. Modification flow test: Agent proposes action → human modifies parameters → agent re-proposes with modified parameters → human approves modified version → agent executes with modified params. Assert that the executed parameters match the modified version, not the original.
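The short TTL scenario can also be exercised without real sleeping by injecting a clock. The following sketch is an assumption-laden stand-in, not the article's state machine: `FakeClock` and `EscalatingApproval` are hypothetical, approver names are placeholders, and the intermediate EXPIRED state is collapsed so that a timeout either escalates or ends in REJECTED.

```python
from typing import List


class FakeClock:
    """Manually advanced clock so tests do not depend on wall time."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


class EscalatingApproval:
    def __init__(self, clock: FakeClock, ttl: float,
                 escalation_chain: List[str], step_ttl: float) -> None:
        self.clock = clock
        self.deadline = clock.now + ttl
        self.remaining = list(escalation_chain)
        self.step_ttl = step_ttl
        self.notified = ["primary"]
        self.state = "WAITING_APPROVAL"

    def tick(self) -> None:
        """Process every deadline that has passed since the last tick."""
        while self.state == "WAITING_APPROVAL" and self.clock.now >= self.deadline:
            if self.remaining:
                # Escalate to the next approver and extend the deadline.
                self.notified.append(self.remaining.pop(0))
                self.deadline += self.step_ttl
            else:
                # Chain exhausted: fail closed.
                self.state = "REJECTED"

    def approve(self) -> bool:
        """Late approvals are refused once the request has left WAITING_APPROVAL."""
        self.tick()
        if self.state != "WAITING_APPROVAL":
            return False
        self.state = "APPROVED"
        return True


clock = FakeClock()
approval = EscalatingApproval(clock, ttl=2.0,
                              escalation_chain=["lead", "manager", "director"],
                              step_ttl=1.0)
clock.advance(4.9)   # all three escalations have fired, final deadline not yet hit
approval.tick()
state_before_final_timeout = approval.state
clock.advance(0.2)   # past the 5-second total
approval.tick()
```

A fake clock keeps this test fast and deterministic; a separate end-to-end test with real timers is still useful for catching scheduler bugs.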

Chaos Testing: Kill the Approval Service Mid-Flight

The approval system is part of your infrastructure — it can crash. Chaos testing answers the question: what happens when it does?

  1. Kill approval service during WAITING_APPROVAL: The persistence layer (PostgreSQL/Redis) must survive. After restart, all WAITING_APPROVAL requests must be recovered with their correct TTL and escalation state. No duplicate notifications should fire.
  2. Kill approval service during EXECUTING: The idempotency check must survive. After restart, if the operation completed (execution log has the action ID), the system must not re-execute. If the operation was in-flight (execution log has no record), the system should either safely re-execute or detect the ambiguity and request human intervention.
  3. Network partition between agent and approval service: The agent should handle timeout on its side — if it can't reach the approval service within a configured window, it should treat the situation as a circuit-breaker event (fail-closed: stop; fail-open: proceed with audit).
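The restart decision in the second scenario can be captured as a small pure function, which makes it easy to test separately from the crash itself. This sketch assumes the execution log records both a "started" marker (written before the side effect) and a "completed" marker; that write-ahead marker, the `Recovery` enum, and the `recover` function are assumptions introduced here for illustration.

```python
from enum import Enum
from typing import Set


class Recovery(Enum):
    SKIP = "skip"                # already done, never re-execute
    EXECUTE = "execute"          # safe to run (or re-run)
    NEEDS_HUMAN = "needs_human"  # outcome unknown, ask a person


def recover(action_id: str, started: Set[str], completed: Set[str],
            idempotent: bool) -> Recovery:
    """Decide what to do with an approved action after a restart."""
    if action_id in completed:
        return Recovery.SKIP
    if action_id not in started:
        # The crash happened before the side effect began.
        return Recovery.EXECUTE
    # Started but no completion record: the side effect may or may not
    # have happened. Only repeat it if the operation itself is idempotent.
    return Recovery.EXECUTE if idempotent else Recovery.NEEDS_HUMAN
```

A chaos test can then kill the process at each point (before start, mid-flight, after completion) and assert on the value this function returns after recovery.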

Production Readiness Checklist

Before deploying agent approval workflows to production, verify every item on this list:

Alerting Thresholds

Operational metrics to monitor with recommended alert thresholds:

| Metric | Warning Threshold | Critical Threshold | What It Means |
|---|---|---|---|
| Approval queue depth | > 10 pending | > 50 pending | Approvers are falling behind; risk of stale approvals and timeout auto-denials. |
| MTTD (Mean Time To Decide) | > 5 minutes | > 30 minutes | Approvers are taking too long; escalation chains are likely firing. Review TTL values or approver availability. |
| Escalation rate | > 5% of requests | > 20% of requests | Primary approvers are consistently unresponsive; check notification delivery, team composition, or TTL values. |
| Auto-deny rate | > 1% of requests | > 5% of requests | Escalation chains are exhausting; the system is making decisions without humans. Investigate immediately. |
| Dead letter queue depth | > 0 (any) | > 5 in rolling hour | Approvals are auto-denying. Someone should be reviewing these decisions. |
| Circuit breaker trips | > 0 (any) | > 2 in 24 hours | The approval infrastructure is unstable; investigate persistence layer health. |
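These thresholds can be kept as data and evaluated by one function, which keeps the alert rules reviewable next to the table. The sketch below is illustrative only: the metric keys and the `severity` helper are hypothetical names, rates are expressed as fractions, and the caller is assumed to compute each value over the window the table specifies (for example, a rolling hour for the dead letter queue).

```python
from typing import Dict, Tuple

# metric name -> (warning threshold, critical threshold); values above trigger.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "queue_depth": (10, 50),
    "mttd_minutes": (5, 30),
    "escalation_rate": (0.05, 0.20),
    "auto_deny_rate": (0.01, 0.05),
    "dead_letter_depth": (0, 5),
    "circuit_breaker_trips_24h": (0, 2),
}


def severity(metric: str, value: float) -> str:
    """Classify a metric reading as 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

Thresholds are strict inequalities, matching the table: a queue depth of exactly 10 is still "ok".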

These metrics should be part of your agent observability dashboard. For guidance on building that dashboard, see Agent Observability. For security testing of the approval channel itself (injection attacks, bypass attempts), see Agent Security Evaluation.

FAQ

How does this relate to agent-tool-permission-control?

Tool permission control is the front layer: it decides whether an action is ALLOW (execute immediately), DENY (block), or ASK_APPROVAL (needs human sign-off). This article picks up where ASK_APPROVAL leaves off: how to construct the approval request, manage the state machine, and handle timeouts and escalation. The two sit upstream and downstream of each other; permission control decides that approval is needed, and this workflow decides how it is obtained.

Does the approval state machine need to be persistent? What happens if it isn't?

Without persistence, all pending approvals evaporate when the agent process restarts. The agent forgets what it was waiting for; the human has nothing to approve. Production systems MUST persist approval state — PostgreSQL, Redis, or framework checkpoints (LangGraph SqliteSaver/PostgresSaver).

How do you avoid approval fatigue? Doesn't every action need approval?

No. The four-level risk gating is designed to minimize unnecessary approvals. AUTO and LOW_RISK actions skip approval entirely. Only HIGH_RISK and CRITICAL trigger the human-in-the-loop. Further optimizations: sampling (routing only a fraction, such as 10%, of HIGH_RISK requests to a human reviewer) and progressive trust (auto-promoting agents that make consistently good LOW_RISK decisions).
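One way to implement the sampling idea is to derive the decision from a hash of the idempotency key, so the same request always gets the same answer across retries and restarts. This is a sketch of one possible approach, not a method the article prescribes; `needs_human_review` is a hypothetical function name.

```python
import hashlib


def needs_human_review(idempotency_key: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of requests for review."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(idempotency_key.encode()).digest()
    # Map the first 4 bytes to a bucket in [0, 1); 2**32 is exact in a float.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Requests that are not sampled should still be recorded in the audit trail so that reviewers can inspect them after the fact.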

Should expired approvals default to deny or approve?

Default to deny (fail-safe). CRITICAL operations must NEVER auto-approve. LOW_RISK operations MAY be configured for auto-approve on timeout (fail-open) in specific business contexts, but every auto-approve must be audit-logged and the approver notified post-facto. Each RiskLevel should have a configurable timeout_action.
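A per-level timeout_action might be configured along these lines. The enums below are stand-ins for the article's RiskLevel type, and `build_timeout_policy` is a hypothetical helper. As a conservative reading of the rule above, this sketch permits fail-open only for LOW_RISK and rejects any attempt to configure it for HIGH_RISK or CRITICAL.

```python
from enum import Enum
from typing import Dict, Optional


class RiskLevel(Enum):
    LOW_RISK = "LOW_RISK"
    HIGH_RISK = "HIGH_RISK"
    CRITICAL = "CRITICAL"


class TimeoutAction(Enum):
    DENY = "deny"        # fail-safe default
    APPROVE = "approve"  # fail-open, must be audit-logged


DEFAULT_POLICY: Dict[RiskLevel, TimeoutAction] = {
    level: TimeoutAction.DENY for level in RiskLevel
}


def build_timeout_policy(
    overrides: Optional[Dict[RiskLevel, TimeoutAction]] = None,
) -> Dict[RiskLevel, TimeoutAction]:
    """Merge overrides into the deny-by-default policy, refusing unsafe ones."""
    policy = dict(DEFAULT_POLICY)
    for level, action in (overrides or {}).items():
        if action is TimeoutAction.APPROVE and level is not RiskLevel.LOW_RISK:
            raise ValueError(f"{level.value} must not auto-approve on timeout")
        policy[level] = action
    return policy
```

Validating the policy when it is loaded, rather than at timeout time, means a misconfiguration fails at deploy instead of during an incident.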

In a multi-agent workflow, which agent issues the approval request?

The agent performing the dangerous action issues the request. If Agent A delegates to Agent B and B's action is HIGH_RISK, Agent B requests approval. The request includes the delegation chain (delegated_by: Agent A) so the human reviewer sees the full context — why B is doing this and who asked for it.

Can this approval design integrate with existing enterprise approval systems (ServiceNow, Jira, custom OA)?

Yes. The ApprovalRequest schema is a framework-agnostic data structure. Implement an enterprise adapter: translate ApprovalRequest into the target system's ticket format, and map ticket resolution (approved/rejected) back to state machine transitions. Handle bidirectional sync as well: a timeout, reassignment, or withdrawal in the ticketing or OA (office automation) system maps to the state machine's existing timeout/reassign/expire transitions.
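An adapter of this kind can be quite thin. The sketch below reads the request fields used by the Slack notifier earlier (risk_level, action.description, idempotency_key, deadline, proposed_by.agent_name); the ticket field names and status strings are hypothetical and would need to be replaced with whatever the target system actually uses.

```python
# Hypothetical ticket statuses -> state machine transition names.
TICKET_STATUS_TO_TRANSITION = {
    "approved": "approve",
    "rejected": "reject",
    "timed_out": "timeout",
    "reassigned": "reassign",
    "withdrawn": "expire",
}


def to_ticket(request: dict) -> dict:
    """Translate an approval request into a generic ticket payload."""
    return {
        "title": f"[{request['risk_level']}] {request['action']['description']}",
        # Reusing the idempotency key lets ticket updates be matched back
        # to the pending approval without a separate lookup table.
        "external_id": request["idempotency_key"],
        "due": request["deadline"],
        "requested_by": request["proposed_by"]["agent_name"],
    }


def transition_for(ticket_status: str) -> str:
    """Map a ticket resolution to a transition; unknown statuses are errors."""
    try:
        return TICKET_STATUS_TO_TRANSITION[ticket_status.lower()]
    except KeyError:
        raise ValueError(f"unmapped ticket status: {ticket_status!r}") from None
```

Raising on unmapped statuses is deliberate: silently ignoring a new ticket state could leave an approval waiting forever or, worse, treated as approved.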

This article is part of the Agent Communication and Protocols series. Recommended reading path:

New to AI Agent engineering? Start with What Is an AI Agent.