Agent Release Gate Design: From QA to VERIFIED Production Releases

⚡ 30-Second Takeaway

  • Core Problem: An AI Agent can modify dozens of files, generate hundreds of lines of new code, and rewrite critical configs — all in under a minute. Traditional CI/CD pipelines (lint → test → review) cannot handle this speed or volume. Agents have unique failure modes — hallucinated code that looks correct but is logically absurd, context drift that derails long tasks, and output velocity that outpaces human review capacity.
  • The Solution: A Layer 0 responsibility-chain foundation plus eight executable gates — Research → Author → QA → Review → Conformity → READY → Deploy → VERIFIED. Each gate has independent pass conditions, failure responses, and audit evidence. No gate can be skipped. No gate can be silently degraded. Only changes that pass every gate reach production.
  • Key Design: Gates are not CI/CD stages — they are a responsibility chain. Each gate is an independent processor that owns one dimension of quality. Deterministic checks (lint, schema, tests) are fully automated. Semantic checks (logic, hallucination) are routed to a different agent than the writer. The chain is enforced by a state machine — state transitions are driven by gate pass/fail events, not LLM text output.
  • What You Will Walk Away With: A complete gate configuration schema you can adapt to your own agent pipeline — YAML/JSON definitions for every layer, pass conditions you can automate, failure responses that are actionable (not just "failed"), and a responsibility chain model that separates writing, checking, and deploying into independently verifiable stages.

1. Why Agent Output Needs Release Gates

A traditional CI/CD pipeline assumes a human developer pushes a commit — three to five files changed, a few hundred lines of diff, one coherent unit of work. The pipeline reacts: lint, build, test, deploy. This model works because the commit boundary is meaningful and the change volume is reviewable.

An AI Agent breaks every assumption in that model. In a single execution session, an agent can modify dozens of files, generate hundreds of lines of new code, rewrite configuration, and restructure directory layouts — all in under a minute. Worse, the agent does not produce a single clean commit. It iterates: write, revert, rewrite, split, merge — generating a stream of intermediate states, none of which is a "commit" in the traditional sense. Firing a CI/CD pipeline on every intermediate save would drown the pipeline in noise. Waiting for the agent to "finish" before running anything means minutes of unchecked generation — and by the time the pipeline catches an error, the agent has already drifted through five more faulty iterations built on top of it.

Beyond speed, agents have unique failure modes that traditional CI/CD cannot detect:

  1. Hallucinated Code: The agent calls agent_registry.get_active_agents(), but no module named agent_registry exists anywhere in the repository. The call is syntactically valid Python, so a syntax check passes, and lint and type checking can miss it too when import resolution is relaxed (for example, mypy run with --ignore-missing-imports). The error is semantic: the agent fabricated an API that looks correct but does not exist. Traditional CI/CD rarely treats "does this module actually exist" as an explicit check because human developers do not invent libraries whole.
  2. Context Drift: During a long task, the agent gradually shifts from the original intent. Each individual diff looks reasonable in isolation — a renamed variable here, an extracted method there — but the cumulative result implements a different feature than what was requested. No single commit is "wrong" enough to trigger a test failure; the error is in the trajectory, not any one step.
  3. Premature Completion Bias: LLM agents are trained to be helpful and complete tasks. Under ambiguity, they default to "this is probably correct" rather than "I should verify this." The agent finishes, declares success, and the output looks plausible — but crucial edge cases, error handling, and cross-file consistency checks were never performed.
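
The first failure mode is the one most amenable to a mechanical check. The following is a minimal sketch, not the pipeline's actual detector: it assumes the draft is Python source and that "exists" means "resolvable in the current environment". The function name unresolved_imports is hypothetical.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be located."""
    tree = ast.parse(source)
    top_level: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            top_level.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped: they need package context.
            top_level.add(node.module.split(".")[0])
    # find_spec returns None when no finder can locate a top-level module.
    return sorted(name for name in top_level if importlib.util.find_spec(name) is None)


draft = "import json\nimport agent_registry\n\nagents = agent_registry.get_active_agents()\n"
print(unresolved_imports(draft))
```

Provided no module named agent_registry is installed, the draft above is flagged for that import while json resolves normally. This only covers fabricated modules; fabricated functions on real modules need symbol-level cross-referencing.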

Human review cannot scale to this volume. An active agent can produce more diff in a day than a small team writes in a week. Asking a human reviewer to inspect every line of agent-generated code is not just impractical — it defeats the purpose of having the agent in the first place. The reviewer becomes the bottleneck, and under time pressure, review degrades to a mechanical approval gesture.

📌 The Core Insight: Agent output needs a fundamentally different verification model than human output — not because agents are worse, but because their failure modes are structurally different. You cannot fix hallucination with more lint rules. You cannot fix context drift with more unit tests. You need layered, independent verification gates — each gate owning one dimension of quality, each gate making a binary pass/fail decision, each gate producing auditable evidence.

The solution is a Layer 0 responsibility-chain foundation plus eight executable release gates: Research → Author → QA → Review → Conformity → READY → Deploy → VERIFIED. Each gate is an independent processor. Deterministic checks (lint, schema validation, test execution) are fully automated and run in seconds. Semantic checks (hallucination detection, logic review, requirement alignment) are routed to a different agent than the writer. No gate can be skipped. No gate can silently degrade its checks. Only changes that pass every gate reach production.

Here is the gate chain configuration schema — a single JSON document that defines every gate, its checks, its pass conditions, and its failure behavior:


{
  "pipeline": "agent-release-gate",
  "version": "2.0",
  "gates": [
    {
      "name": "RESEARCH",
      "order": 1,
      "executor": "automated",
      "responsibility": "information gathering only — no code or file modifications",
      "checks": [
        {"name": "keyword_coverage", "min_ratio": 0.8, "blocking": true, "action": "auto_retry_search"},
        {"name": "reference_sources_found", "min_count": 3, "blocking": true},
        {"name": "faq_candidates", "min_count": 2, "blocking": false, "action": "warn"},
        {"name": "internal_link_suggestions", "min_count": 2, "blocking": false}
      ],
      "on_failure": "REDIRECT",
      "timeout_seconds": 120,
      "degraded_allowed": true
    },
    {
      "name": "AUTHOR",
      "order": 2,
      "executor": "writer_agent",
      "responsibility": "generate draft outputs — all files are draft until cleared by downstream gates",
      "checks": [
        {"name": "all_expected_files_exist", "match_field": "expected_outputs", "blocking": true},
        {"name": "no_empty_files", "blocking": true},
        {"name": "format_parse", "tool": "ast_parse|html_parse|md_parse", "blocking": true}
      ],
      "on_failure": "RETURN_TO_AUTHOR",
      "writer_agent": "claude-code",
      "reviewer_agent": "opencode",
      "note": "Writer and Reviewer MUST be different agents"
    },
    {
      "name": "QA",
      "order": 3,
      "executor": "automated",
      "responsibility": "deterministic structural checks — lint, tests, schema validation",
      "checks": [
        {"name": "lint", "tool": "ruff", "severity": "error", "blocking": true},
        {"name": "type_check", "tool": "mypy", "blocking": true},
        {"name": "unit_test", "tool": "pytest", "min_coverage": 0.80, "blocking": true},
        {"name": "security_scan", "tool": "bandit", "severity": "medium", "blocking": true},
        {"name": "schema_valid", "schema_path": "./schemas/", "blocking": true},
        {"name": "dead_link", "tool": "lychee", "blocking": false, "action": "warn"},
        {"name": "file_size", "max_lines": 500, "blocking": false, "action": "warn"}
      ],
      "on_failure": "AUTO_RETRY",
      "max_retries": 3,
      "timeout_seconds": 300,
      "sandbox": "qa-sandbox:v2.1"
    },
    {
      "name": "REVIEW",
      "order": 4,
      "executor": "reviewer_agent",
      "responsibility": "semantic correctness — hallucination, requirement alignment, design quality",
      "checks": [
        {"name": "hallucination_detect", "method": "cross_reference_symbols", "blocking": true},
        {"name": "requirement_alignment", "match_field": "acceptance_criteria", "blocking": true},
        {"name": "internal_link_semantics", "method": "embedding_similarity", "threshold": 0.3, "blocking": true},
        {"name": "design_review", "reviewer": "human", "blocking": true}
      ],
      "on_failure": "REQUEST_CHANGES",
      "timeout_hours": 4,
      "auto_approve_if": {"changed_files": 1, "max_lines": 50, "only": ["docs/*", "*.md"]}
    },
    {
      "name": "CONFORMITY",
      "order": 5,
      "executor": "automated",
      "responsibility": "cross-project consistency — SEO, i18n, template compliance, internal link coverage",
      "checks": [
        {"name": "hreflang_bidirectional", "auto_fix": true, "blocking": true},
        {"name": "jsonld_completeness", "auto_fix": true, "blocking": true},
        {"name": "seo_tags_complete", "auto_fix": true, "blocking": true},
        {"name": "internal_link_outbound", "min_count": 3, "blocking": true},
        {"name": "internal_link_inbound", "min_count": 1, "blocking": true},
        {"name": "html_structure", "blocking": true},
        {"name": "bilingual_sync", "blocking": true}
      ],
      "on_failure": "BLOCK",
      "timeout_seconds": 120
    },
    {
      "name": "READY",
      "order": 6,
      "executor": "automated",
      "responsibility": "meta-audit — verify all upstream evidence is complete, consistent, and untampered",
      "checks": [
        {"name": "evidence_chain_complete", "required_gates": ["RESEARCH","AUTHOR","QA","REVIEW","CONFORMITY"], "blocking": true},
        {"name": "evidence_checksum_match", "blocking": true},
        {"name": "no_unresolved_critical", "blocking": true},
        {"name": "seo_requirements", "blocking": true},
        {"name": "user_approval_valid", "blocking": true},
        {"name": "audit_log_integrity", "blocking": true}
      ],
      "on_failure": "BLOCK",
      "on_pass": "MARK_READY_TO_PUBLISH"
    },
    {
      "name": "DEPLOY",
      "order": 7,
      "executor": "automated",
      "responsibility": "controlled merge to main, build, and deploy — with rollback on failure",
      "checks": [
        {"name": "user_deploy_approval", "blocking": true},
        {"name": "ready_gate_pass", "blocking": true},
        {"name": "target_branch_clean", "blocking": true},
        {"name": "build_success", "tool": "hugo", "blocking": true},
        {"name": "cdn_deploy_success", "blocking": true}
      ],
      "on_failure": "ROLLBACK",
      "timeout_seconds": 600,
      "rollback_strategy": "git_revert_and_rebuild"
    },
    {
      "name": "VERIFIED",
      "order": 8,
      "executor": "automated",
      "responsibility": "post-deploy verification — live site integrity, no regressions",
      "checks": [
        {"name": "live_page_accessible", "blocking": true},
        {"name": "meta_tags_present", "blocking": true},
        {"name": "no_broken_links", "blocking": true},
        {"name": "canonical_correct", "blocking": true}
      ],
      "on_failure": "ALERT_ONCALL",
      "timeout_seconds": 300,
      "on_pass": "MARK_VERIFIED"
    }
  ],
  "global_rules": {
    "skip_gate_order": false,
    "require_gate_evidence": true,
    "gate_transition_log": ".agent-workspace/log/gate-transitions.jsonl",
    "evidence_checksum_algorithm": "sha256",
    "max_pipeline_duration_minutes": 480
  }
}
  

This schema is not abstract — it is a runnable configuration. Each gate defines exactly what it checks, how it checks, what constitutes a pass, and what happens on failure. The global_rules enforce three non-negotiable constraints: gates execute in strict order (no skipping), every gate must produce evidence (no silent passes), and every pass/fail decision is written to an append-only transition log (full audit trail).
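
Because the configuration is meant to be executed, it is worth checking it before any gate runs. The sketch below is one possible validator under stated assumptions: the rules it enforces (consecutive order values, unique names, an on_failure per gate, an explicit blocking flag per check) are inferred from the schema above, and validate_pipeline is a hypothetical helper name.

```python
def validate_pipeline(config: dict) -> list[str]:
    """Return a list of problems found in a gate-chain config (empty means OK)."""
    problems: list[str] = []
    gates = config.get("gates", [])

    # Gates must be numbered 1..N in the order they are listed.
    orders = [gate.get("order") for gate in gates]
    if orders != list(range(1, len(gates) + 1)):
        problems.append(f"gate orders must be 1..{len(gates)} in sequence, got {orders}")

    names = [gate.get("name") for gate in gates]
    if len(set(names)) != len(names):
        problems.append("gate names must be unique")

    for gate in gates:
        label = gate.get("name", "<unnamed>")
        if not gate.get("on_failure"):
            problems.append(f"{label}: missing on_failure")
        for check in gate.get("checks", []):
            if "blocking" not in check:
                problems.append(f"{label}.{check.get('name')}: missing 'blocking' flag")

    if not config.get("global_rules", {}).get("require_gate_evidence", False):
        problems.append("global_rules.require_gate_evidence must be true")
    return problems


example = {
    "gates": [
        {"name": "QA", "order": 1, "on_failure": "AUTO_RETRY",
         "checks": [{"name": "lint", "blocking": True}]},
    ],
    "global_rules": {"require_gate_evidence": True},
}
print(validate_pipeline(example))  # []
```

Returning a list of problems rather than raising on the first one mirrors the gates' own style: a failure report should say everything that is wrong, not just the first thing.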

In one sentence: Release gates turn "the agent said it's done" into "eight independent processors confirmed it's done, each for a different reason, each with evidence you can audit."

2. Layer 0: Gates Are Not CI/CD — They're Agent Responsibility Chains

Before diving into each gate's internal design, there is a foundational concept that must be established clearly: Agent release gates operate on an entirely different execution model than CI/CD pipelines. Treating them as if they were CI/CD stages leads to systematic mismatches — gates that fire at the wrong time, decisions that take too long, and failures that mean the wrong thing.

Traditional CI/CD makes three core assumptions — all of which fail in agent workflows:

  1. Assumption: The change is a single commit. CI/CD triggers on git push — a commit represents one complete, self-consistent change. The pipeline runs against that commit exactly once: build → test → deploy. An agent's work pattern is a continuous stream: in a single session, the agent may write, revert, rewrite, split, and merge files through a dozen intermediate versions. There is no clear "commit moment" — and if you fire a pipeline on every intermediate save, the gate system is overwhelmed by noise.
  2. Assumption: Stages run in isolated environments with generous time budgets. CI/CD spins up fresh containers per stage, each with minutes to tens of minutes of execution time. Agent gates must decide in the same execution context, within seconds. The agent does not "wait for the pipeline to finish and then continue" — it needs immediate feedback to adjust subsequent actions. A 5-minute CI pipeline is normal; a 5-minute gate decision is a dead agent.
  3. Assumption: Failure means broken build. Stop everything. CI/CD failure → build halts, notification sent, human engineer fixes the issue and resubmits. Agent gate failure means something radically different: workflow redirection. QA gate fails → agent auto-fixes lint errors and retries. Review gate requests changes → agent reads the review comments and regenerates. Conformity gate blocks → task enters a terminal state requiring human intervention. Each failure type maps to a different redirect path.

Based on these differences, the correct model for agent gates is a Chain of Responsibility. Each gate is an independent processor on the chain. It receives the output of the previous gate, applies its own checks, makes a binary decision, and passes the result (with evidence) to the next gate. The chain enforces strict ordering and non-bypassability. Each gate's failure handling is self-contained — the chain does not "crash," it redirects.

📌 Gate = Independent Processor: A gate is defined by four components — pass_condition (what must be true to proceed), on_failure (what action to take if not true), evidence (auditable proof of the decision), and next (which gate to pass to on success). This is a responsibility chain, not a build stage. Gates own their quality dimension and nothing else.
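
The four components map naturally onto a small data structure. What follows is an illustrative sketch of the chain-of-responsibility idea, assuming pass conditions can be expressed as predicates over an artifact dictionary; the Gate class and run_chain function are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Gate:
    name: str
    pass_condition: Callable[[dict], bool]  # what must be true to proceed
    on_failure: str                         # action name, e.g. "AUTO_RETRY"
    evidence: str                           # evidence file this gate writes
    next: Optional[str]                     # next gate name, or None at the end


def run_chain(gates: dict[str, Gate], start: str, artifact: dict) -> list[tuple[str, str]]:
    """Walk the chain from `start`; stop at the first gate that does not pass."""
    trail: list[tuple[str, str]] = []
    current: Optional[str] = start
    while current is not None:
        gate = gates[current]
        if gate.pass_condition(artifact):
            trail.append((gate.name, "PASS"))
            current = gate.next
        else:
            # The chain does not crash: it records the redirect action and stops.
            trail.append((gate.name, gate.on_failure))
            break
    return trail


gates = {
    "AUTHOR": Gate("AUTHOR", lambda a: bool(a.get("files")),
                   "RETURN_TO_AUTHOR", "author-evidence.json", "QA"),
    "QA": Gate("QA", lambda a: a.get("coverage", 0.0) >= 0.80,
               "AUTO_RETRY", "qa-report.json", None),
}
print(run_chain(gates, "AUTHOR", {"files": ["a.html"], "coverage": 0.75}))
# [('AUTHOR', 'PASS'), ('QA', 'AUTO_RETRY')]
```

Note that a failing gate yields an action name rather than an exception, which is the practical difference between "redirect" and "broken build".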

Here is the same gate system expressed as a responsibility chain in YAML — same semantics as the JSON schema above, but the perspective shifts from "what is configured" to "who is responsible for what":


# Gate Responsibility Chain: each gate is an independent processor
# Chain enforcement: gates execute in strict order, no skipping
# Modeled as a state machine — gate pass/fail events drive transitions

chain:
  - gate: RESEARCH
    responsibility: >
      Gather information and reference sources. Write nothing to the source tree.
      All output stays in .agent-workspace/. Produces source-pack.md and
      research-evidence.json.
    pass_condition: keyword_coverage >= 0.8 AND reference_sources >= 3
    on_failure: REDIRECT        # Expand search, retry, degrade if timeout
    evidence: research-evidence.json
    next: AUTHOR

  - gate: AUTHOR
    responsibility: >
      Generate draft output files based on brief.md and source-pack.md.
      All outputs are marked draft — not deployable until cleared by downstream gates.
      Writer agent (claude-code) and Reviewer agent (opencode) MUST be different.
    pass_condition: all_expected_files_exist AND no_empty_files AND format_parse_pass
    on_failure: RETURN_TO_AUTHOR  # Fix missing files or format errors
    evidence: author-evidence.json
    next: QA

  - gate: QA
    responsibility: >
      Deterministic structural validation. Lint, type-check, unit tests,
      schema validation, security scan. This gate answers "is it structurally
      sound?" — not "is it correct?"
    pass_condition: all_blocking_checks_pass AND test_coverage >= 0.80
    on_failure: AUTO_RETRY      # Agent fixes issues, retries up to 3 times
    evidence: qa-report.json
    next: REVIEW

  - gate: REVIEW
    responsibility: >
      Semantic correctness review. Detect hallucinated APIs, verify requirement
      alignment, validate internal link semantics, assess design quality.
      This is the only gate whose checks include a human design review.
    pass_condition: no_critical_issues AND design_review_approved
    on_failure: REQUEST_CHANGES  # Return to Author with structured feedback
    evidence: review-report.json
    next: CONFORMITY

  - gate: CONFORMITY
    responsibility: >
      Cross-project consistency. SEO tags, hreflang bidirectional links,
      JSON-LD completeness, template structure, internal link coverage,
      bilingual synchronization (en/zh parity).
    pass_condition: all_checks_pass (auto_fix applied where possible)
    on_failure: BLOCK           # Non-auto-fixable issues → human intervention
    evidence: conformity-report.json
    next: READY

  - gate: READY
    responsibility: >
      Meta-audit of the entire evidence chain. Verify every upstream gate
      produced a PASS verdict, checksums match (no tampering), no unresolved
      critical issues remain, user approval is valid.
    pass_condition: all_evidence_valid AND no_tampering AND user_approved
    on_failure: BLOCK           # Point to the specific gate that needs re-run
    evidence: ready-report.json
    next: DEPLOY

  - gate: DEPLOY
    responsibility: >
      Controlled deployment. Merge to main, build, deploy to CDN.
      Every step has pre-condition checks and post-condition verification.
      Any failure triggers full rollback.
    pass_condition: merge_ok AND build_ok AND deploy_ok
    on_failure: ROLLBACK        # git revert, rebuild, verify rollback
    evidence: deploy-report.json
    next: VERIFIED

  - gate: VERIFIED
    responsibility: >
      Post-deploy live verification. Confirm pages are accessible, meta tags
      present, no broken links, canonical URLs correct. This is the final
      confirmation that the change is live and correct.
    pass_condition: all_live_checks_pass
    on_failure: ALERT_ONCALL    # Live site issue — immediate human attention
    evidence: verified-report.json
    next: null                  # End of chain — task is VERIFIED
  

The responsibility chain model maps directly to the Agent State Machine: each gate's pass event triggers a state transition (RESEARCH_PASSED → AUTHOR_PASSED → QA_PASSED → REVIEW_PASSED → CONFORMITY_PASSED → READY_TO_PUBLISH → DEPLOYED → VERIFIED), and each failure event triggers the corresponding rollback transition. The state machine guarantees that gates execute in strict order — no skipping, no reordering — enforced by the transition table, not by convention.
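
One way to make "enforced by the transition table" concrete is to generate the table from the gate order and reject any event that is not in it. This is a sketch under assumptions: the state names come from the paragraph above, while the initial "START" state and the "<GATE>_PASS" event names are hypothetical labels.

```python
# Gate order and pass states follow the article; "START" and the event
# names are hypothetical labels chosen for this sketch.
GATE_ORDER = ["RESEARCH", "AUTHOR", "QA", "REVIEW", "CONFORMITY", "READY", "DEPLOY", "VERIFIED"]
PASS_STATE = {
    "RESEARCH": "RESEARCH_PASSED",
    "AUTHOR": "AUTHOR_PASSED",
    "QA": "QA_PASSED",
    "REVIEW": "REVIEW_PASSED",
    "CONFORMITY": "CONFORMITY_PASSED",
    "READY": "READY_TO_PUBLISH",
    "DEPLOY": "DEPLOYED",
    "VERIFIED": "VERIFIED",
}


def build_transitions() -> dict[tuple[str, str], str]:
    """Build (state, event) -> next state for the strict forward path."""
    table: dict[tuple[str, str], str] = {}
    state = "START"
    for gate in GATE_ORDER:
        table[(state, f"{gate}_PASS")] = PASS_STATE[gate]
        state = PASS_STATE[gate]
    return table


TRANSITIONS = build_transitions()


def advance(state: str, event: str) -> str:
    """Apply a pass event; anything not in the table is an illegal transition."""
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}")
    return TRANSITIONS[(state, event)]


state = "START"
for gate in ("RESEARCH", "AUTHOR", "QA"):
    state = advance(state, f"{gate}_PASS")
print(state)  # QA_PASSED
```

Failure transitions are omitted here for brevity; they would be additional table entries keyed on each gate's failure event.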

The Agent Context Protocol solves the orthogonal problem of inter-gate context passing. When QA produces a lint report, Review needs to reference it. When Review flags a hallucinated API, Conformity needs to know which files were affected. These gate-to-gate messages are not ad-hoc file dumps — they are structured context protocol messages with typed fields, versioned schemas, and checksums, passed through the evidence files that each gate produces.
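
As a rough illustration of what "typed fields, versioned schemas, and checksums" could look like in practice, here is a small sketch. The field names, the GateMessage class, and the seal/verify helpers are all assumptions made for this example; the actual protocol's message format is not specified in this article.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class GateMessage:
    schema_version: str
    from_gate: str
    to_gate: str
    payload: dict


def payload_checksum(payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable
    # regardless of key insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(message: GateMessage) -> dict:
    """Serialize a message into an envelope that carries its payload checksum."""
    return {
        "schema_version": message.schema_version,
        "from_gate": message.from_gate,
        "to_gate": message.to_gate,
        "payload": message.payload,
        "checksum": payload_checksum(message.payload),
    }


def verify(envelope: dict) -> bool:
    """Recompute the checksum and compare it with the one in the envelope."""
    return envelope.get("checksum") == payload_checksum(envelope["payload"])


envelope = seal(GateMessage("1.0", "QA", "REVIEW", {"lint_issues": 0}))
print(verify(envelope))  # True
```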

In one sentence: CI/CD says "run these stages against the commit." Agent gates say "each processor in this chain takes responsibility for one dimension of quality — pass it on, or redirect it back." The former is designed for deterministic builds of human-authored code. The latter is designed for non-deterministic generation where the output needs independent verification across multiple quality dimensions before it can be trusted.

3. Layer 1: Research Gate — Information Enters, Code Does Not

The Research Gate is the first checkpoint in the chain, and the only gate where the agent is explicitly forbidden from writing code or modifying files in the source tree. Its sole purpose is to gather information and produce a structured research artifact. No source files are changed. No directories are created outside of .agent-workspace/. The boundary is absolute: Research establishes what is known (know-what), not how to build it (know-how).

Input and output. The gate receives a task description: a natural language specification of what content is needed, what domain it covers, and what format it should take. Its primary output is a single file, source-pack.md, a structured research artifact that bundles all gathered information into one document; the gate also writes its evidence record, research-evidence.json. The source pack becomes the single source of truth for downstream gates: the Author Gate reads it, the Review Gate cross-references it, and the READY Gate audits that it exists and passes its checks.

Pass conditions. The gate measures coverage, not correctness. Four dimensions are checked: (1) keyword coverage: the fraction of required search terms that actually appear in the reference sources, with a minimum threshold of 0.8 (80%); (2) minimum reference sources: at least three distinct, verifiable sources must be found and cited; (3) FAQ candidates: at least two question-answer pairs inferred from the research material, serving as seed content for future FAQ pages; (4) internal link suggestions: at least two existing site pages that are contextually related and should be linked from the eventual article. The first two are blocking (failure halts progression); the latter two are non-blocking warnings.
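
The four thresholds can be evaluated in a few lines. This is a minimal sketch that follows the prose above (the first two checks block, the last two only warn); evaluate_research and its return shape are hypothetical.

```python
def evaluate_research(
    required: list[str],
    found: list[str],
    source_count: int,
    faq_count: int,
    link_count: int,
) -> dict:
    """Apply the Research Gate thresholds and separate blocking failures from warnings."""
    # Coverage = required terms that were actually found / all required terms.
    coverage = len(set(required) & set(found)) / len(required) if required else 1.0

    blocking_failures: list[str] = []
    warnings: list[str] = []
    if coverage < 0.8:
        blocking_failures.append("keyword_coverage")
    if source_count < 3:
        blocking_failures.append("reference_sources_found")
    if faq_count < 2:
        warnings.append("faq_candidates")
    if link_count < 2:
        warnings.append("internal_link_suggestions")

    return {
        "verdict": "PASS" if not blocking_failures else "FAIL",
        "keyword_coverage": coverage,
        "blocking_failures": blocking_failures,
        "warnings": warnings,
    }


result = evaluate_research(
    required=["agent release gate", "AI code review", "CI/CD vs agent"],
    found=["agent release gate", "AI code review", "CI/CD vs agent"],
    source_count=4,
    faq_count=1,
    link_count=4,
)
print(result["verdict"], result["warnings"])  # PASS ['faq_candidates']
```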

📌 Why "know-what" only: Premature solutioning is one of the most common agent failure modes. An agent given a research task will often start writing code "because it knows where this is going." The Research Gate's file-system sandbox — write access only to .agent-workspace/ — physically prevents this. If the agent cannot touch the source tree, it cannot drift into implementation. The gate is a constraint that creates focus.

Failure response. When research is insufficient, the gate does not block permanently — it redirects. The agent auto-retries searches with expanded query terms, degrades to available sources if external APIs are unreachable, and flags risk areas (topics where source coverage is thin) so downstream reviewers know which sections need extra scrutiny. The timeout is generous (120 seconds) because external search and crawling are inherently slow. If all retries fail, the gate transitions to a human-approval state where an operator can manually supply missing sources or override the requirement.

Here is the evidence file produced by the Research Gate — research-evidence.json:


{
  "gate": "RESEARCH",
  "task_id": "agent-release-gate-design",
  "timestamp": "2026-06-07T10:23:00Z",
  "verdict": "PASS",
  "agent": "claude-code",
  "session_id": "sess_8f3a2",
  "output": ".agent-workspace/source-pack.md",
  "evidence": {
    "keyword_coverage": {
      "required": ["agent release gate", "AI code review", "CI/CD vs agent", "hallucination detection", "gate responsibility chain"],
      "found": ["agent release gate", "AI code review", "CI/CD vs agent", "hallucination detection", "gate responsibility chain"],
      "ratio": 1.0,
      "passed": true
    },
    "reference_sources": [
      {"title": "Building Reliable AI Agents", "url": "https://example.com/reliable-ai-agents", "type": "article"},
      {"title": "CI/CD for LLM Pipelines", "url": "https://example.com/cicd-llm", "type": "whitepaper"},
      {"title": "Hallucination Detection in Code Generation", "url": "https://example.com/hallucination-code", "type": "paper"},
      {"title": "State Machine Driven Agent Workflows", "url": "https://example.com/agent-state-machines", "type": "blog"}
    ],
    "reference_count": 4,
    "reference_minimum_met": true,
    "faq_candidates": [
      {"question": "How do agent gates differ from CI/CD stages?", "source_line": 42},
      {"question": "Can agent gates prevent hallucinated code from reaching production?", "source_line": 128},
      {"question": "What happens when a gate fails — does the whole pipeline stop?", "source_line": 215}
    ],
    "faq_count": 3,
    "faq_minimum_met": true,
    "internal_link_suggestions": [
      "/en/posts/agent-observability.html",
      "/en/posts/agent-message-schema-design.html",
      "/en/posts/agent-human-approval-workflow.html",
      "/en/posts/agent-state-machine-design.html"
    ],
    "internal_link_count": 4,
    "internal_link_minimum_met": true,
    "risk_areas": [],
    "degraded_sources": false
  },
  "duration_ms": 45230,
  "retry_count": 0
}
  

The Research Gate produces evidence that is quantitative, not qualitative. It does not judge whether the sources are "good" — it judges whether the research effort met minimum coverage thresholds. Semantic quality assessment belongs to the Review Gate (Layer 4), not here. This separation of concerns is fundamental to the gate design: each gate owns exactly one dimension of verification and defers all others to downstream processors. For deeper integration with agent monitoring, see Agent Observability; for structured gate-to-gate communication, see Agent Message Schema Design; and for when a gate deadlocks and needs human intervention, see Agent Human Approval Workflow.

4. Layer 2: Author Gate — Writing Space, Changes Are Drafts

The Author Gate is where content becomes real — but only provisionally. Every file produced by this gate is marked as a draft. The gate's governing principle is that writing and reviewing must be performed by different entities: the Writer Agent produces, the Reviewer Agent judges. This separation is not optional — it is the structural equivalent of "you cannot proofread your own work."

Input and output. The Author Gate receives brief.md — a condensed writing specification distilled from the Research Gate's source-pack.md — and a task definition that specifies expected output files, their formats (HTML, Markdown, JSON), and their target directories. The gate produces the actual output files: for a content site like xslyl.com, this means en/posts/<slug>.html and zh/posts/<slug>.html. Importantly, the gate also produces author-evidence.json, which records every file created, its checksum, its draft status, and the writer agent's session identifier.

Pass conditions. Three checks determine whether the Author Gate passes: (1) all expected files exist — the gate compares the actual file list against the expected_outputs field from the task spec; if any declared output is missing, the gate fails immediately (blocking); (2) no empty files — every output file must have non-zero content, catching partial writes and truncated generations (blocking); (3) format parse check — each file is passed through a format-specific parser (ast_parse for Python, html_parse for HTML, md_parse for Markdown) to confirm it is syntactically well-formed (blocking). These checks are deliberately shallow — the gate confirms that output exists and is parseable, not that it is correct. Correctness is the Review Gate's concern.
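
The three checks are shallow enough to sketch directly. The example below is an approximation under assumptions: it only parses Python and JSON (HTML and Markdown parsers are left out), it treats whitespace-only files as empty, and check_author_outputs is a hypothetical helper.

```python
import ast
import json
import tempfile
from pathlib import Path

# Format-specific parsers keyed by file suffix. HTML and Markdown are omitted
# in this sketch; files with other suffixes skip the parse check.
PARSERS = {
    ".py": ast.parse,
    ".json": json.loads,
}


def check_author_outputs(root: Path, expected_outputs: list[str]) -> dict[str, list[str]]:
    """Return the files that fail each of the three Author Gate checks."""
    failures: dict[str, list[str]] = {
        "all_expected_files_exist": [],
        "no_empty_files": [],
        "format_parse": [],
    }
    for rel in expected_outputs:
        path = root / rel
        if not path.is_file():
            failures["all_expected_files_exist"].append(rel)
            continue
        text = path.read_text(encoding="utf-8")
        if not text.strip():  # whitespace-only counts as empty here
            failures["no_empty_files"].append(rel)
            continue
        parser = PARSERS.get(path.suffix)
        if parser is not None:
            try:
                parser(text)
            except (SyntaxError, ValueError):  # JSONDecodeError is a ValueError
                failures["format_parse"].append(rel)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ok.json").write_text('{"a": 1}', encoding="utf-8")
    (root / "bad.py").write_text("def broken(:\n", encoding="utf-8")
    print(check_author_outputs(root, ["ok.json", "bad.py", "missing.html"]))
```

The gate passes only when all three lists are empty; each non-empty list names the files to hand back to the writer agent.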

📌 Writer ≠ Reviewer: The agent that writes the content (claude-code) must be different from the agent that reviews it (opencode). This is not about capability — both agents are perfectly capable of either task. It is about fresh perspective. A writer agent sees what it intended to write; a different reviewer agent sees what was actually written. Hallucinated APIs and context drift are far more visible to an agent that did not generate the text. This principle applies across the entire chain: no gate's executor may be the same agent instance that produced the input it is checking.

Failure response. When the Author Gate fails — missing files, empty files, or unparseable output — the response is RETURN_TO_AUTHOR. The writer agent receives structured feedback specifying exactly which file failed, which check triggered the failure, and a suggested fix path. The agent regenerates the affected files and resubmits. This loop can repeat, but after three return trips without a pass, the gate escalates to human review.

Here is the author-evidence.json produced by this gate, showing draft status and checksums:


{
  "gate": "AUTHOR",
  "task_id": "agent-release-gate-design",
  "timestamp": "2026-06-07T10:45:12Z",
  "verdict": "PASS",
  "writer_agent": "claude-code",
  "writer_session": "sess_8f3a2",
  "reviewer_agent": "opencode",
  "reviewer_session": null,
  "files": [
    {
      "path": "en/posts/agent-release-gate-design.html",
      "checksum": "sha256:a1b2c3d4e5f6...",
      "size_bytes": 28450,
      "status": "draft",
      "format": "html",
      "parse_result": "PASS"
    },
    {
      "path": "zh/posts/agent-release-gate-design.html",
      "checksum": "sha256:f6e5d4c3b2a1...",
      "size_bytes": 29120,
      "status": "draft",
      "format": "html",
      "parse_result": "PASS"
    },
    {
      "path": ".agent-workspace/tasks/agent-release-gate-design/brief.md",
      "checksum": "sha256:9a8b7c6d...",
      "size_bytes": 3420,
      "status": "draft",
      "format": "markdown",
      "parse_result": "PASS"
    }
  ],
  "checks": {
    "all_expected_files_exist": {"expected": 3, "actual": 3, "passed": true},
    "no_empty_files": {"empty_count": 0, "passed": true},
    "format_parse": {"total": 3, "passed_count": 3, "failed": 0, "passed": true}
  },
  "retry_count": 0,
  "duration_ms": 142300
}
  

Notice that every file is explicitly marked "status": "draft". This status does not change until the READY Gate (Layer 6) confirms that all upstream evidence is valid and the content is approved for deployment. The checksums serve a dual purpose: they enable the READY Gate to detect post-generation tampering, and they allow the Author Gate to detect whether a file changed between write and parse (catching race conditions in agent file I/O). For the full protocol that governs how evidence files like this one are passed between gates, see Agent Context Protocol Design. For the sandbox environment that constrains what the writer agent can modify during this gate, see Agent Code Sandbox Design.
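
Tamper detection with these checksums amounts to rehashing each file and comparing against the recorded value. The sketch below assumes the evidence uses the "sha256:<hex>" format shown above and that paths are relative to a known root; sha256_of and tampered_files are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's bytes in the 'sha256:<hex>' form used by the evidence file."""
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()


def tampered_files(root: Path, evidence: dict) -> list[str]:
    """Return paths whose current content no longer matches the recorded checksum."""
    changed: list[str] = []
    for entry in evidence["files"]:
        path = root / entry["path"]
        # A missing file counts as a mismatch, the same as modified content.
        if not path.is_file() or sha256_of(path) != entry["checksum"]:
            changed.append(entry["path"])
    return changed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    draft = root / "draft.html"
    draft.write_text("<p>v1</p>", encoding="utf-8")
    evidence = {"files": [{"path": "draft.html", "checksum": sha256_of(draft)}]}
    print(tampered_files(root, evidence))  # []
    draft.write_text("<p>edited after the gate ran</p>", encoding="utf-8")
    print(tampered_files(root, evidence))  # ['draft.html']
```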

5. Layer 3: QA Gate — Rule-Based Verification, Not Human Review

The QA Gate is the first fully automated verification gate — and it is designed to be deterministic, fast, and semantically blind. Every check in this gate is a rule that can be evaluated by a machine without any understanding of what the code or content means. The gate asks one question: "Does this output satisfy every structural rule we can define mechanically?" It does not ask "Is this output correct?" — that question belongs to the Review Gate.

The QA mindset. This is the crucial distinction that separates an effective QA Gate from a cargo-cult pipeline: QA checks are chosen because they can be automated with zero semantic understanding. File existence, schema validation, lint rules, unit test execution, security scans, dead link detection, and file size limits — every one of these can be evaluated by a deterministic program that has no idea what the code is supposed to do. This is a feature, not a limitation. Deterministic checks are fast (milliseconds to seconds), reproducible (same input → same output, always), and produce unambiguous pass/fail results. There is no "maybe." There is no "this looks suspicious." There is only "rule R violated at location L."

Checks in detail. The gate runs seven categories of verification: (1) Lint: language-specific style and error detection via ruff (Python) or equivalent, configured to error severity only (warnings are informational, not blocking); (2) Type Checking: static type analysis via mypy, catching argument mismatches, missing imports, and type inconsistencies that lint alone cannot detect; (3) Unit Tests: pytest execution with a minimum 80% coverage threshold, so that most of the generated code is exercised by at least one test; (4) Security Scan: bandit static security analysis at medium severity and above, catching SQL injection vectors, hardcoded secrets, and unsafe deserialization; (5) Schema Validation: JSON/YAML/HTML output validated against project schemas, ensuring every generated file conforms to the expected structure; (6) Dead Link Detection: the lychee link checker verifies that all hyperlinks in generated HTML resolve to live pages (non-blocking, generates a warning report); (7) File Size: a maximum line count per file, preventing runaway generation from producing unreadable monoliths (non-blocking warning).

📌 Auto-retry is not a CI/CD feature: In traditional CI/CD, a lint failure means "push a fix commit." In agent gates, it means "the agent fixes the issues in the same session and resubmits." The QA Gate supports automatic retry up to three times — the agent reads the structured bug report, fixes each issue at the specified location, and re-submits. After three failures, the gate escalates to human review. This auto-retry loop turns the QA Gate from a passive quality checkpoint into an active quality feedback loop.
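The retry loop described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual implementation: `run_checks`, `apply_fixes`, and the `ESCALATE_TO_HUMAN` / `NEXT_GATE` action names are hypothetical, and it assumes "up to three times" means three resubmissions after the initial attempt (attempt 0).

```python
from typing import Callable


def run_qa_with_auto_retry(
    run_checks: Callable[[], list],
    apply_fixes: Callable[[list], None],
    max_retries: int = 3,
) -> dict:
    """Run QA checks, letting the agent fix and resubmit up to max_retries times."""
    issues = run_checks()  # attempt 0: the initial submission
    retry_attempt = 0
    while issues and retry_attempt < max_retries:
        apply_fixes(issues)  # the agent fixes each issue at its reported location
        retry_attempt += 1
        issues = run_checks()  # resubmit in the same session
    if issues:
        # Retries exhausted: hand over to a human (hypothetical action name).
        return {"verdict": "FAIL", "retry_attempt": retry_attempt,
                "action": "ESCALATE_TO_HUMAN", "issues": issues}
    return {"verdict": "PASS", "retry_attempt": retry_attempt,
            "action": "NEXT_GATE", "issues": []}
```

Passing the check runner and the fixer in as callables keeps the loop itself deterministic: the gate decides when to stop, and the agent only decides how to fix.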

Structured failure response. When the QA Gate fails, it does not simply say "lint failed." It produces a structured bug report where every issue includes a location (file path and line number), a rule (the specific check that triggered, e.g., F841 local variable assigned but never used), and a fix_hint (actionable guidance the agent can follow, e.g., Remove the unused variable or prefix with _ to signal intentional non-use). This structured format is critical — it allows the agent to programmatically locate and fix issues without re-parsing human-readable error messages.
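As a rough sketch of how raw tool output could become such a report, the snippet below converts linter lines into entries with location, rule, and fix_hint. It assumes the linter prints one finding per line in the common `path:line:col: CODE message` form; the `FIX_HINTS` table and the function name are hypothetical.

```python
import re

# Assumed line format: path:line:col: CODE message
ISSUE_LINE = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<col>\d+): (?P<rule>[A-Z]+\d+) (?P<message>.+)$"
)

# Hypothetical hint table; a real gate would maintain one entry per enabled rule.
FIX_HINTS = {
    "F841": "Remove the unused variable or prefix with _ to signal intentional non-use",
}


def to_structured_issues(linter_output: str) -> list:
    """Turn raw linter lines into location / rule / fix_hint records."""
    issues = []
    for raw in linter_output.splitlines():
        match = ISSUE_LINE.match(raw.strip())
        if match is None:
            continue  # summary lines and blank lines carry no finding
        rule = match["rule"]
        issues.append({
            "location": f"{match['path']}:{match['line']}",
            "rule": rule,
            "message": match["message"],
            "fix_hint": FIX_HINTS.get(rule, "No hint registered; see the rule documentation"),
        })
    return issues
```

The agent then iterates over the returned list and never has to interpret free-form terminal output.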

Here is a complete qa-report.json showing the full check results:


{
  "gate": "QA",
  "task_id": "agent-release-gate-design",
  "timestamp": "2026-06-07T10:48:05Z",
  "verdict": "PASS",
  "sandbox": "qa-sandbox:v2.1",
  "retry_attempt": 0,
  "duration_ms": 48200,
  "checks": {
    "lint": {
      "tool": "ruff",
      "severity": "error",
      "total_issues": 0,
      "passed": true,
      "details": []
    },
    "type_check": {
      "tool": "mypy",
      "total_issues": 0,
      "passed": true,
      "details": []
    },
    "unit_test": {
      "tool": "pytest",
      "total": 24,
      "passed_count": 24,
      "failed": 0,
      "skipped": 0,
      "coverage_pct": 87.5,
      "coverage_minimum": 80.0,
      "coverage_passed": true,
      "passed": true
    },
    "security_scan": {
      "tool": "bandit",
      "severity": "medium",
      "total_issues": 0,
      "passed": true,
      "details": []
    },
    "schema_valid": {
      "schema_path": "./schemas/",
      "files_checked": 3,
      "valid": 3,
      "invalid": 0,
      "passed": true,
      "details": []
    },
    "dead_link": {
      "tool": "lychee",
      "total_links": 14,
      "alive": 14,
      "dead": 0,
      "timeout": 0,
      "passed": true,
      "blocking": false,
      "details": []
    },
    "file_size": {
      "max_lines": 500,
      "files": [
        {"path": "en/posts/agent-release-gate-design.html", "lines": 348, "within_limit": true},
        {"path": "zh/posts/agent-release-gate-design.html", "lines": 356, "within_limit": true}
      ],
      "all_within_limit": true,
      "passed": true,
      "blocking": false
    }
  },
  "blocking_checks": ["lint", "type_check", "unit_test", "security_scan", "schema_valid"],
  "all_blocking_passed": true,
  "next_gate": "REVIEW"
}
  

The QA Gate's output feeds directly into the Review Gate, but the two gates ask fundamentally different questions. QA asks "Does this satisfy every mechanical rule?" Review asks "Does this mean what it's supposed to mean?" A file can pass QA with perfect lint scores, 100% test coverage, and zero security issues, and still contain hallucinated APIs, misread requirements, and broken logic. That is by design: QA catches what machines can catch, freeing Review to focus on what only semantic analysis can catch. For the security dimension of QA checks in depth, see Agent Security Evaluation. For how QA evidence is structured and passed to downstream gates via the context protocol, see Agent Message Schema Design. For the sandbox environment that isolates QA tool execution from the source tree, see Agent Code Sandbox Design.
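The relationship between the per-check results and the top-level all_blocking_passed field in qa-report.json can be expressed directly. The sketch below only assumes the report layout shown above (a checks map whose entries carry a boolean passed, plus a blocking_checks list); the function names are illustrative.

```python
def all_blocking_passed(report: dict) -> bool:
    """True only if every check named in blocking_checks reports passed == True."""
    checks = report["checks"]
    return all(checks[name]["passed"] is True for name in report["blocking_checks"])


def qa_verdict(report: dict) -> str:
    """Non-blocking checks (dead_link, file_size) never change the verdict."""
    return "PASS" if all_blocking_passed(report) else "FAIL"
```

A failing dead_link check therefore produces a warning in the report but leaves the verdict at PASS, while a single failing lint check flips it to FAIL.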

6. Layer 4: Review Gate — Semantic Inspection Beyond Rule Checks

The Review Gate is where the release gate system transitions from deterministic verification to semantic judgment. It is the only gate that asks "Does this output mean what it is supposed to mean?" — a question that no lint rule, type checker, or unit test can answer. Review Gate ≠ QA Gate. QA checks format; Review checks substance. The distinction is absolute and non-negotiable: a file that scores 100% on every QA check can still be semantically wrong, and only the Review Gate is designed to catch those failures.

What Review catches that QA cannot. The Review Gate inspects four categories of semantic defect. (1) Hallucinated APIs: the agent invokes registration_service.register_agent(), but no such function exists in the codebase. Lint passes — the syntax is valid. Type checking passes — the call signature is internally consistent. Only cross-referencing every symbol against the actual codebase reveals the fabrication. (2) Design quality: the generated code solves the problem, but through a convoluted three-class inheritance chain where a single function would suffice. QA sees correct syntax; Review sees unnecessary complexity. (3) Requirement misinterpretation: the task asked for a read-only dashboard, but the agent built a full CRUD interface. Every file is well-formed, every test passes, and the output is completely wrong. (4) Hallucinated internal links: the agent inserts a link to /en/posts/nonexistent-page.html because it "feels relevant." The link is syntactically valid HTML; it is semantically a broken promise to the reader.
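The symbol cross-referencing step for hallucinated APIs can be partly mechanized before the reviewer agent applies judgment. The sketch below uses Python's standard ast module to list dotted calls that do not appear in a set of known symbols. How that set is built (for example by indexing the repository) is left out, and `known_symbols` and the function names are hypothetical.

```python
import ast
from typing import Optional


def dotted_name(node: ast.AST) -> Optional[str]:
    """Rebuild 'module.function' from an attribute chain, or None if not a plain chain."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_unknown_calls(source: str, known_symbols: set) -> list:
    """Return (symbol, line) pairs for dotted calls missing from known_symbols."""
    unknown = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None and "." in name and name not in known_symbols:
                unknown.append((name, node.lineno))
    return sorted(unknown, key=lambda item: item[1])
```

A hit from this scan is evidence, not a verdict: method calls on local objects also show up as dotted names, so the reviewer still has to decide whether a flagged symbol is a fabrication.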

Pass condition and evidence. The Review Gate passes when: no critical semantic defects are found, all internal links resolve to real pages on the site, and the technical approach is judged reasonable by the reviewer. The gate produces review-report.json, which catalogs every issue found, its severity (critical/major/minor), its location, and the reviewer's recommended fix. On pass, the report confirms that every blocking check was satisfied. On failure, the report becomes a structured change request — annotated problem areas with specific notes the writer agent can act on programmatically.

📌 Fresh Eyes Principle: The reviewer MUST be a different agent instance from the writer. This is not about capability: the two agents may be equally competent. It is about perspective asymmetry. A writer agent sees what it intended to write; a fresh reviewer agent sees what was actually written. Hallucinated APIs and context drift tend to be far more visible to an agent that did not generate the text, because it has no internal narrative about "what this function was supposed to do." This principle is enforced structurally: the writer_agent and reviewer_agent fields in the gate configuration must reference different agent identities, and the gate transition log records both to ensure auditability.

Failure response and structured feedback. When the Review Gate fails, it does not return a vague "needs improvement." It produces a review-failure-report.json with machine-actionable annotations:


{
  "gate": "REVIEW",
  "verdict": "FAIL",
  "reviewer_agent": "opencode",
  "reviewer_session": "sess_b7d4e",
  "timestamp": "2026-06-07T11:15:00Z",
  "failure_reasons": [
    {
      "severity": "critical",
      "category": "hallucinated_symbol",
      "location": "src/dashboard.py:142",
      "symbol": "agent_registry.get_active_agents",
      "evidence": "No module named 'agent_registry' exists in repository. Closest match: 'task_registry.list_tasks' in src/registry.py",
      "fix_hint": "Replace with task_registry.list_tasks() or declare agent_registry module first"
    },
    {
      "severity": "major",
      "category": "requirement_misalignment",
      "location": "src/dashboard.py:1-300",
      "requirement": "Read-only task status dashboard",
      "deviation": "Implements write operations: create_task(), update_task(), delete_task()",
      "fix_hint": "Remove all write methods. Dashboard must be read-only per requirement spec §3.1"
    },
    {
      "severity": "major",
      "category": "hallucinated_link",
      "location": "en/posts/agent-release-gate-design.html:405",
      "url": "/en/posts/nonexistent-page.html",
      "fix_hint": "Remove link or replace with verified existing page from sitemap"
    }
  ],
  "action": "REQUEST_CHANGES",
  "retry_allowed": true,
  "max_retries": 2
}
  

Each failure entry includes a fix_hint — not a generic suggestion, but a specific, actionable instruction the writer agent can follow without ambiguity. The severity field determines the gate's behavior: critical issues are always blocking; major issues may allow degradation if explicitly approved; minor issues are recorded but non-blocking. When the Review Gate fails repeatedly (after max_retries), the pipeline escalates to human review via the Agent Human Approval Workflow — an operator inspects the review report, decides whether the failure is genuine or a false positive, and either overrides the gate or confirms the rejection.
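The severity rules above amount to a small decision function. The following is a sketch under stated assumptions: `approved_degradations` (a set of categories an operator has explicitly allowed to degrade) and the `ESCALATE_TO_HUMAN` action name are hypothetical, while REQUEST_CHANGES and max_retries come from the report shown earlier.

```python
def decide_review_action(
    failure_reasons: list,
    approved_degradations: frozenset = frozenset(),
    retries_used: int = 0,
    max_retries: int = 2,
) -> str:
    """Map review findings to PASS, REQUEST_CHANGES, or ESCALATE_TO_HUMAN."""
    blocking = []
    for reason in failure_reasons:
        severity = reason["severity"]
        if severity == "critical":
            blocking.append(reason)  # critical issues always block
        elif severity == "major" and reason["category"] not in approved_degradations:
            blocking.append(reason)  # major issues block unless degradation is approved
        # minor issues are recorded in the report but never block
    if not blocking:
        return "PASS"
    if retries_used >= max_retries:
        return "ESCALATE_TO_HUMAN"
    return "REQUEST_CHANGES"
```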

7. Layer 5: Conformity Gate — Upstream Standard Verification

The Conformity Gate answers a question no other gate asks: "Does this output fit the project?" QA checks universal rules — valid HTML, no lint errors, schema compliance. Review checks semantic correctness — no hallucination, requirements met, design sound. Conformity checks project-specific standards: the SEO tag conventions unique to xslyl.com, the hreflang structure that must mirror across language variants, the internal link coverage requirements that maintain the site's knowledge graph, and the bilingual synchronization rules that keep English and Chinese content structurally identical. These are not universal truths about code quality; they are local truths about this project's quality standards.

Checks in detail. The Conformity Gate verifies seven project-specific dimensions. (1) Hreflang bidirectional correctness: for every English page that declares hreflang="zh", the corresponding Chinese page must declare hreflang="en" pointing back — and the URLs must match exactly. A one-directional hreflang declaration is a broken internationalization signal to search engines. (2) JSON-LD completeness: every page must carry both Article and BreadcrumbList structured data, with all required fields present and valid. Missing dateModified or malformed itemListElement is a hard block. (3) SEO tag completeness: title, meta description, og:title, og:description, og:image, twitter:card, canonical URL — all must be present and populated. Empty fields are autofixed from the template; missing fields block. (4) Internal link outbound coverage: every new page must link to at least three existing site pages, establishing it in the site's navigation graph. (5) Internal link inbound coverage: at least one existing page must be updated to link to the new page — orphan pages are rejected. (6) HTML structure compliance: the page must match the project template (breadcrumb nav → main → article → footer), include the 30-second takeaway box, and use the correct GoatCounter analytics snippet. (7) Bilingual synchronization: if an English page exists, a Chinese page must also exist with identical section count and structure, and vice versa.
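Check (1), hreflang bidirectionality, lends itself to a short illustration. The sketch below parses two pages with Python's standard html.parser and confirms that each one points at the other with an exactly matching URL. The class and function names are hypothetical, and it assumes alternates are declared as link rel="alternate" elements.

```python
from html.parser import HTMLParser


class HreflangCollector(HTMLParser):
    """Collect hreflang -> href from <link rel="alternate"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("rel") == "alternate" and attributes.get("hreflang"):
            self.alternates[attributes["hreflang"]] = attributes.get("href")


def hreflang_map(html: str) -> dict:
    collector = HreflangCollector()
    collector.feed(html)
    return collector.alternates


def is_bidirectional(en_html: str, en_url: str, zh_html: str, zh_url: str) -> bool:
    """The English page must point at the Chinese URL and vice versa, exactly."""
    return (hreflang_map(en_html).get("zh") == zh_url
            and hreflang_map(zh_html).get("en") == en_url)
```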

Pass condition and auto-fix. The Conformity Gate is unique in its use of auto-fix: for deterministic, template-driven fixes (adding a missing meta tag, correcting a hreflang URL, inserting the analytics snippet), the gate modifies the file in place before recording the check result. If all issues are auto-fixable, the gate passes and the fixed files carry forward. If any issue is not auto-fixable (a missing Chinese translation, an orphan page with no natural inbound link candidate), the gate hard-blocks and requires human intervention. The pass condition is: all template fields correct, minimum internal links met, SEO tags complete and valid, bilingual parity confirmed.

📌 Conformity ≠ QA redux: A common design mistake is merging Conformity checks into the QA Gate. Resist this. QA checks are project-agnostic: lint rules, type checks, and schema validators work identically on every Python project. Conformity checks are project-specific: hreflang structure, internal link minimums, and bilingual sync rules are unique to xslyl.com. Separating them means: (a) QA can be reused unchanged across projects, (b) Conformity can evolve independently as project standards change, and (c) failures are routed to different handlers — QA failures trigger auto-retry; Conformity failures trigger auto-fix-or-block.

Failure response. When the Conformity Gate encounters a non-auto-fixable issue, it produces a conformity-report.json with precisely categorized failures and blocks the pipeline. Here is a shell script snippet showing how a Conformity check is executed — note the separation of auto-fixable checks from hard-block checks:


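#!/bin/bash
# Conformity Gate: project-specific standard verification
# Phase 1: Auto-fix checks (modify files in place; GNU sed assumed)
# Phase 2: Hard-block checks (only report, cannot fix)
# Usage: conformity-gate.sh <page-path> <en|zh>

if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <page-path> <en|zh>" >&2
  exit 2
fi

FAILED=0
PAGE="$1"
PAGE_LANG="$2"  # en or zh (not named LANG, which would override the shell locale)
OTHER_LANG=$( [ "$PAGE_LANG" = "en" ] && echo "zh" || echo "en" )
TODAY=$(date -u +%F)

# ASSUMPTION: the inserted tags were lost from the original listing. The markup
# below is a reconstruction; the GoatCounter site code is a placeholder.
SITE_ORIGIN="https://xslyl.com"
GOATCOUNTER_SNIPPET='<script data-goatcounter="https://YOUR-CODE.goatcounter.com/count" async src="//gc.zgo.at/count.js"></script>'

echo "=== Conformity Gate: $PAGE ($PAGE_LANG) ==="

# --- Phase 1: Auto-fixable checks ---
# 1.1: Ensure the page declares its own hreflang alternate
if ! grep -q "hreflang=\"$PAGE_LANG\"" "$PAGE"; then
  echo "  [AUTO-FIX] Adding missing hreflang=$PAGE_LANG"
  # Assumes PAGE is a site-relative path such as en/posts/example.html
  sed -i "/<\/head>/i <link rel=\"alternate\" hreflang=\"$PAGE_LANG\" href=\"$SITE_ORIGIN/$PAGE\">" "$PAGE"
fi

# 1.2: Ensure JSON-LD completeness
if ! grep -q '"dateModified"' "$PAGE"; then
  echo "  [AUTO-FIX] Adding dateModified to JSON-LD"
  # Keep the existing datePublished value and append dateModified after it
  sed -i -E "s/(\"datePublished\"[[:space:]]*:[[:space:]]*\"[^\"]*\")/\1,\n  \"dateModified\": \"$TODAY\"/" "$PAGE"
fi

# 1.3: Ensure goatcounter snippet
if ! grep -q "goatcounter" "$PAGE"; then
  echo "  [AUTO-FIX] Adding GoatCounter analytics snippet"
  sed -i "/<\/body>/i $GOATCOUNTER_SNIPPET" "$PAGE"
fi

# --- Phase 2: Hard-block checks ---
# 2.1: Internal link outbound count (links within the page's own language tree)
OUTBOUND=$(grep -oP "href=\"/$PAGE_LANG/posts/[^\"]+" "$PAGE" | sort -u | wc -l)
if [ "$OUTBOUND" -lt 3 ]; then
  echo "  [BLOCK] Internal outbound links: $OUTBOUND (minimum: 3)"
  FAILED=1
fi

# 2.2: Bilingual parity
# Swap only the language path segment, so "en" inside a word such as "content" is untouched
OTHER_FILE=$(printf '%s\n' "$PAGE" | sed -E "s#(^|/)$PAGE_LANG/#\1$OTHER_LANG/#")
if [ "$OTHER_FILE" = "$PAGE" ] || [ ! -f "$OTHER_FILE" ]; then
  echo "  [BLOCK] Missing bilingual counterpart: $OTHER_FILE"
  FAILED=1
fi

if [ "$FAILED" -eq 1 ]; then
  echo "=== Conformity: BLOCKED (non-auto-fixable issues) ==="
  exit 1
else
  echo "=== Conformity: PASS ==="
  exit 0
fi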
  

The script illustrates the two-phase design: Phase 1 applies fixes deterministically (sed replacements for missing tags), and Phase 2 detects conditions whose fixes require human judgment (which pages to link to, whether a full translation is needed). The Conformity Gate's evidence is consumed downstream by the READY Gate, which verifies that every auto-fix was applied correctly and no hard-block issues remain unresolved. For how Conformity evidence is structured in the inter-gate message protocol, see Agent Message Schema Design.

8. Layer 6: READY Gate — Combined Pre-Release Checkpoint

The READY Gate is the final pre-release checkpoint — the gate that does not perform its own quality checks but instead verifies that every upstream gate performed theirs correctly. It is a meta-audit: an independent processor that reads the entire evidence chain, confirms that every gate produced a PASS verdict, validates that no evidence has been tampered with, and only then declares the output ready for deployment. If any upstream gate failed, was skipped, or produced corrupted evidence, the READY Gate blocks — no exceptions, no bypasses.

Input: the full evidence chain. The READY Gate receives one evidence file from each prior gate: research-evidence.json (Research Gate), author-evidence.json (Author Gate), qa-report.json (QA Gate), review-report.json (Review Gate), and conformity-report.json (Conformity Gate). Each file carries a checksum generated at the moment of gate completion. The READY Gate's first action is to recompute every checksum and compare against the recorded value — if any checksum mismatches, the evidence chain has been tampered with post-generation, and the gate blocks immediately. This is not a theoretical concern: in long-running agent sessions, intermediate files can be accidentally modified by concurrent processes, tool runs, or even the agent itself during retry loops. The checksum chain provides end-to-end integrity from Research through Conformity.
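The recompute-and-compare step can be sketched with Python's standard hashlib. The mapping of file path to recorded checksum is an assumed input shape, the function names are illustrative, and the "sha256:" prefix mirrors the notation used in the reports in this article.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Stream the file in chunks so large evidence files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_evidence_chain(recorded: dict) -> list:
    """Return the files that are missing or whose current checksum differs."""
    mismatched = []
    for file_path, expected in recorded.items():
        if not Path(file_path).is_file() or sha256_of(file_path) != expected:
            mismatched.append(file_path)
    return mismatched
```

An empty return value means the chain is intact; any entry in the list is grounds for an immediate block.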

Pass conditions. Six sub-checks determine the READY Gate's verdict: (1) Evidence chain complete: all five upstream evidence files exist, are valid JSON, and contain the expected gate field; (2) Evidence checksums match: every file's current sha256 matches the checksum recorded at gate completion; (3) No unresolved critical defects: scanning all upstream reports, no P0 or P1 defect remains in an open state; (4) SEO requirements met: meta description is present, non-empty, and between 120 and 160 characters long; title is present and descriptive; canonical URL is valid and resolves to a real page path; (5) User approval valid: if the pipeline required human approval at any gate, the approval token must be present, non-expired, and signed by an authorized approver; (6) Audit log integrity: the gate transition log (.agent-workspace/log/gate-transitions.jsonl) contains exactly one entry per gate in the expected order, with no gaps, duplicates, or out-of-order transitions. Only when all six sub-checks pass does the READY Gate emit its verdict.
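Sub-check (6) is easy to make concrete. The sketch below reads a JSONL transition log and reports gaps, duplicates, and ordering. It assumes each log line is a JSON object with an `event` field holding the transition name; that field name is hypothetical, since the log's schema is not specified here.

```python
import json


def check_transition_log(jsonl_text: str, expected_order: list) -> dict:
    """Compare the logged transitions against the expected gate order."""
    actual = [
        json.loads(line)["event"]  # "event" is an assumed field name
        for line in jsonl_text.splitlines()
        if line.strip()
    ]
    duplicates = sorted({event for event in actual if actual.count(event) > 1})
    gaps = [event for event in expected_order if event not in actual]
    order_correct = actual == expected_order
    return {
        "entries": len(actual),
        "actual_order": actual,
        "gaps": gaps,
        "duplicates": duplicates,
        "order_correct": order_correct,
        "passed": order_correct,
    }
```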

📌 The READY Gate is the accountability anchor: Every other gate can be retried, auto-fixed, or degraded — but the READY Gate is binary and final. It does not "mostly pass." It does not "pass with warnings." A single missing evidence file, a single checksum mismatch, or a single unresolved critical defect means BLOCK. This is by design: the READY Gate is the last line of defense before deployment, and ambiguity at this stage means an untrusted change reaches production. If this gate blocks, the pipeline rolls back to the specific failed upstream gate — re-running Research, Author, QA, Review, or Conformity as needed — and re-enters READY only after the upstream issue is resolved.

Failure response. When the READY Gate blocks, it does not simply say "something failed." It identifies exactly which upstream gate needs re-execution, what evidence is missing or corrupted, and what specific condition was not met. The on_failure action is BLOCK — the pipeline halts, and the task returns to the exact gate that failed. This is fundamentally different from the auto-retry or request-changes patterns of earlier gates: READY Gate failure means the evidence chain itself is broken, not that the output has a fixable defect.

Here is the ready-report.json — the final audit document produced by the READY Gate:


{
  "gate": "READY",
  "task_id": "agent-release-gate-design",
  "timestamp": "2026-06-07T11:30:00Z",
  "verdict": "PASS",
  "sub_checks": {
    "evidence_chain_complete": {
      "required_gates": ["RESEARCH", "AUTHOR", "QA", "REVIEW", "CONFORMITY"],
      "present": ["RESEARCH", "AUTHOR", "QA", "REVIEW", "CONFORMITY"],
      "missing": [],
      "passed": true
    },
    "evidence_checksums": {
      "algorithm": "sha256",
      "files": [
        {"file": "research-evidence.json", "recorded": "sha256:a1b2...", "computed": "sha256:a1b2...", "match": true},
        {"file": "author-evidence.json",   "recorded": "sha256:c3d4...", "computed": "sha256:c3d4...", "match": true},
        {"file": "qa-report.json",         "recorded": "sha256:e5f6...", "computed": "sha256:e5f6...", "match": true},
        {"file": "review-report.json",     "recorded": "sha256:g7h8...", "computed": "sha256:g7h8...", "match": true},
        {"file": "conformity-report.json", "recorded": "sha256:i9j0...", "computed": "sha256:i9j0...", "match": true}
      ],
      "all_match": true,
      "passed": true
    },
    "no_unresolved_critical": {
      "p0_issues": 0,
      "p1_issues": 0,
      "all_resolved": true,
      "passed": true
    },
    "seo_requirements": {
      "meta_description": {"present": true, "length": 147, "within_range": true},
      "title": {"present": true, "descriptive": true},
      "canonical_url": {"present": true, "valid": true, "resolves": true},
      "passed": true
    },
    "user_approval_valid": {
      "approval_required": false,
      "approval_token_present": false,
      "passed": true,
      "note": "No human approval required for this task"
    },
    "audit_log_integrity": {
      "log_path": ".agent-workspace/log/gate-transitions.jsonl",
      "entries": 6,
      "expected_order": ["TASK_CREATED", "RESEARCH_PASSED", "AUTHOR_PASSED", "QA_PASSED", "REVIEW_PASSED", "CONFORMITY_PASSED"],
      "actual_order": ["TASK_CREATED", "RESEARCH_PASSED", "AUTHOR_PASSED", "QA_PASSED", "REVIEW_PASSED", "CONFORMITY_PASSED"],
      "gaps": [],
      "duplicates": [],
      "order_correct": true,
      "passed": true
    }
  },
  "all_sub_checks_passed": true,
  "action": "MARK_READY_TO_PUBLISH",
  "next_gate": "DEPLOY"
}
  

The READY Gate's MARK_READY_TO_PUBLISH action is the transition event that the Agent State Machine consumes to move the task from CONFORMITY_PASSED to READY_TO_PUBLISH. This state transition carries the full ready-report.json as its context payload — meaning the Deploy Gate can independently verify every sub-check before executing its own deployment sequence. The audit log integrity check within the READY report ties directly into the Agent Audit Log Design, which ensures that every gate transition is recorded immutably and can be replayed for compliance verification. And when human approval is required — for high-risk changes, first-time deployments, or gate overrides — the approval token validation sub-check integrates with the Agent Human Approval Workflow to cryptographically verify that the approver's signature is valid and within its validity window.

The READY Gate is the final firewall. Every gate before it — Research, Author, QA, Review, Conformity — verifies one dimension of quality. The READY Gate verifies that they all did their jobs. And only after that verification is the output marked ready for the world to see.

9. Layer 7: Deploy Gate — Gatekeeper from Repository to Production

The Deploy Gate is the last checkpoint before code leaves the repository for production. Every gate before it — Research through READY — operated entirely within the workspace. The Deploy Gate crosses the boundary: it reads the READY verdict, confirms that all prior evidence is complete, checks for explicit user approval to proceed, and then executes the deployment sequence. If any step in that sequence fails, the gate triggers a full rollback — reverting main to its previous state, rebuilding, and confirming the rollback was successful before reporting failure.

Input and precondition. The Deploy Gate requires two inputs to proceed: the complete gate evidence chain (all reports from Research through READY, with checksums validated by the READY Gate), and a user deploy approval token. The approval is not implicit — the user must explicitly issue the 发布 (publish) command, which generates a signed approval token with a short validity window (typically 5 minutes). Without this token, the Deploy Gate blocks regardless of how clean the evidence chain is. This is the one gate where human intent is structurally required: deployment is an irreversible action, and no automated system should take it without explicit human instruction.
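One way a short-lived signed approval token could work is sketched below. The signing scheme is an assumption: the article does not say how tokens are signed, so this example uses an HMAC-SHA256 over the token fields with a shared secret, and all function names are hypothetical. The five-minute window matches the validity period described above.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

VALIDITY = timedelta(minutes=5)  # short validity window, as described above


def _sign(secret: bytes, token_id: str, issued_by: str,
          issued_at: datetime, expires_at: datetime) -> str:
    payload = "|".join([token_id, issued_by, issued_at.isoformat(), expires_at.isoformat()])
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(secret: bytes, token_id: str, issued_by: str, now: datetime) -> dict:
    """Create a token whose signature covers identity and both timestamps."""
    expires_at = now + VALIDITY
    return {
        "token_id": token_id,
        "issued_by": issued_by,
        "issued_at": now,
        "expires_at": expires_at,
        "signature": _sign(secret, token_id, issued_by, now, expires_at),
    }


def token_is_valid(secret: bytes, token: dict, now: datetime) -> bool:
    """Valid only if the signature matches and 'now' falls inside the window."""
    expected = _sign(secret, token["token_id"], token["issued_by"],
                     token["issued_at"], token["expires_at"])
    signature_ok = hmac.compare_digest(expected, token["signature"])
    return signature_ok and token["issued_at"] <= now < token["expires_at"]
```

Because the expiry is part of the signed payload, extending a token's lifetime after issuance invalidates its signature.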

Pass conditions. Five sequential checks control the deployment, and each runs only if the previous one succeeded: (1) user deploy approval valid: the approval token exists, is not expired, and carries a valid signature; (2) READY Gate passed: the ready-report.json verdict is PASS with all sub-checks green; (3) target branch clean: the main branch has no uncommitted changes and no merge conflicts; (4) build success: the static site build (Hugo) completes without error, producing the full public/ directory; (5) CDN deploy success: the built assets are synced to the CDN and the CDN responds with 200 for the new pages. The checks are strictly sequential: if the branch is dirty, do not attempt a build; if the build fails, do not attempt a CDN sync. Each check functions as a pre-condition for the next, minimizing blast radius on failure.

📌 Deploy is the one irreversible action: Research can be rerun. Author output can be regenerated. QA and Review can be retried. Conformity auto-fixes. READY audits. But Deploy pushes code to production — a live site visible to real users. This is why the Deploy Gate requires explicit human approval (the 发布 command) and why its failure response is not "retry" but ROLLBACK. The gate is designed on the assumption that deployment will fail sometimes, and when it does, the system's first responsibility is to restore the previous known-good state, not to keep trying until something sticks.

Failure response: rollback, not retry. On any failure, the Deploy Gate executes a full rollback: git revert to the previous commit on main, a fresh Hugo build to regenerate the previous version's static files, a CDN sync of the rolled-back assets, and a final verification that the rollback succeeded (live pages return 200 for the previous version's content). The rollback is recorded in deploy-report.json alongside the original failure, providing a complete audit trail. If the rollback itself fails — a rare but possible scenario — the gate escalates to ALERT_ONCALL for immediate human intervention.
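The control flow of "stop at the first failing step, roll back, and escalate only if the rollback itself fails" can be sketched as follows. The step and rollback callables stand in for the real git, Hugo, and CDN operations, and the function name and result fields are hypothetical; ROLLBACK and ALERT_ONCALL are the actions named above.

```python
def run_deploy(steps: list, rollback) -> dict:
    """steps: ordered (name, callable -> bool) pairs; rollback: callable -> bool."""
    completed = []
    for name, step in steps:
        if step():
            completed.append(name)
            continue
        # First failure: later steps are never attempted.
        rolled_back = rollback()
        return {
            "verdict": "FAIL",
            "failed_step": name,
            "completed": completed,
            "rollback": "success" if rolled_back else "failed",
            "action": "ROLLBACK" if rolled_back else "ALERT_ONCALL",
        }
    return {"verdict": "PASS", "failed_step": None, "completed": completed,
            "rollback": None, "action": "NEXT_GATE"}
```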

Here is the deploy-report.json — capturing the full deployment sequence with timestamps and a file manifest:


{
  "gate": "DEPLOY",
  "task_id": "agent-release-gate-design",
  "timestamp": "2026-06-07T12:05:00Z",
  "verdict": "PASS",
  "approval": {
    "token_id": "dep_9f4a2b",
    "issued_by": "xslyl",
    "issued_at": "2026-06-07T12:04:30Z",
    "expires_at": "2026-06-07T12:09:30Z",
    "valid": true
  },
  "pre_conditions": {
    "ready_gate_pass": true,
    "ready_report": "ready-report.json",
    "target_branch": "main",
    "target_branch_clean": true,
    "merge_conflicts": 0,
    "passed": true
  },
  "deploy_sequence": [
    {"step": 1, "action": "merge_to_main", "commit": "a7f3e91", "result": "success", "duration_ms": 320},
    {"step": 2, "action": "hugo_build", "result": "success", "duration_ms": 8400, "pages_built": 247},
    {"step": 3, "action": "cdn_sync", "target": "s3://xslyl-prod", "files_synced": 512, "result": "success", "duration_ms": 12400}
  ],
  "file_manifest": [
    {"path": "en/posts/agent-release-gate-design.html", "size_bytes": 31200, "checksum": "sha256:k1l2m3..."},
    {"path": "zh/posts/agent-release-gate-design.html", "size_bytes": 31850, "checksum": "sha256:n4o5p6..."},
    {"path": "sitemap.xml", "updated": true, "checksum": "sha256:q7r8s9..."}
  ],
  "rollback": null,
  "duration_ms": 21120,
  "next_gate": "VERIFIED"
}
  

The file manifest provides a post-deploy record of exactly what was shipped — every file, its size, and its checksum. This manifest is consumed by the VERIFIED Gate to confirm that what was deployed matches what was built, closing the loop between the build artifact and the live site. For the broader topic of how agent commands (including 发布) are validated and executed safely, see Agent Command Execution Safety. For the sandbox and isolation model that separates the Deploy Gate's execution environment from the authoring workspace, see Agent Runtime Isolation.

10. Layer 8: VERIFIED Gate — Final Verification Archive

The VERIFIED Gate closes the loop. Deployment is not the end — it is the beginning of the live verification window. Code on the CDN does not guarantee code rendered correctly in a browser. A deployment can succeed in every technical sense — merge clean, build green, sync complete — and still produce broken pages: a 404 on the Chinese variant, a missing canonical tag, a homepage that no longer lists the new article. The VERIFIED Gate exists to catch these post-deploy regressions before they become user-facing incidents.

Checks in detail. The VERIFIED Gate runs five live-site verifications: (1) Both language URLs return HTTP 200: the English and Chinese pages are fetched from the production CDN and confirmed accessible; a 404 or 500 on either variant is a hard block. (2) Hreflang tags correct: the live HTML is parsed to confirm that hreflang="en" points to the English URL, hreflang="zh" points to the Chinese URL, and the x-default is set correctly; a unidirectional or mismatched hreflang is a hard block. (3) Canonical URL correct: the live page's <link rel="canonical"> must be self-referencing, matching the expected canonical URL for that language variant; a canonical that points to the other language variant is a hard block. (4) Sitemap updated: the production sitemap.xml is fetched and checked for the new page URL; a missing sitemap entry makes the page harder for search engines to discover, so it is a hard block. (5) Homepage updated: if the new article should appear in the homepage listing (as most do), the live homepage HTML is checked for a link to the new page; a missing listing is a warning (non-blocking), since some articles may intentionally skip the homepage.
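Check (4) can be illustrated without any network access by operating on sitemap XML that has already been fetched. The sketch below uses the standard xml.etree.ElementTree module and the namespace defined by the sitemaps.org protocol; fetching the sitemap from production is left out, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_contains(sitemap_xml: str, page_url: str) -> bool:
    """True if page_url appears as a <loc> entry in the given sitemap XML."""
    root = ET.fromstring(sitemap_xml)
    locations = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return page_url in locations
```

Matching on parsed loc elements rather than a substring search avoids false positives when one URL is a prefix of another.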

📌 VERIFIED is not a second Deploy check — it's a live reality check: The Deploy Gate confirms that the push succeeded — files on the CDN, build green, sync complete. The VERIFIED Gate confirms that those files actually work from a user's perspective. A CDN propagation delay, a misconfigured redirect rule, a stale cache layer — none of these are deploy failures, but all of them break the live site. The VERIFIED Gate is the only check that tests the system as a user experiences it, not as an operator configured it.

Failure response. When the VERIFIED Gate fails, the situation is urgent: a live page is returning errors, or the sitemap is missing entries. The failure response is ALERT_ONCALL — immediate notification to the human operator with the specific URLs that failed and the nature of the failure. The gate does not attempt automatic remediation because live-site failures can have multiple root causes (CDN cache, DNS propagation, build artifact corruption) and incorrect automated fixes can compound the damage. The operator diagnoses the issue, corrects it, and re-runs the VERIFIED check.

Here is the verify-report.json — the final evidence document that closes the gate chain:


{
  "gate": "VERIFIED",
  "task_id": "agent-release-gate-design",
  "timestamp": "2026-06-07T12:10:30Z",
  "verdict": "PASS",
  "live_checks": {
    "en_page_accessible": {
      "url": "https://xslyl.com/en/posts/agent-release-gate-design.html",
      "status": 200,
      "content_length": 28450,
      "passed": true
    },
    "zh_page_accessible": {
      "url": "https://xslyl.com/zh/posts/agent-release-gate-design.html",
      "status": 200,
      "content_length": 29120,
      "passed": true
    },
    "hreflang_correct": {
      "en_page_hreflang_zh": "https://xslyl.com/zh/posts/agent-release-gate-design.html",
      "zh_page_hreflang_en": "https://xslyl.com/en/posts/agent-release-gate-design.html",
      "bidirectional": true,
      "x_default_set": true,
      "passed": true
    },
    "canonical_correct": {
      "en_canonical": "https://xslyl.com/en/posts/agent-release-gate-design.html",
      "zh_canonical": "https://xslyl.com/zh/posts/agent-release-gate-design.html",
      "both_correct": true,
      "passed": true
    },
    "sitemap_updated": {
      "sitemap_url": "https://xslyl.com/sitemap.xml",
      "new_page_present": true,
      "entry": "<loc>https://xslyl.com/en/posts/agent-release-gate-design.html</loc><lastmod>2026-06-07</lastmod>",
      "passed": true
    },
    "homepage_updated": {
      "homepage_url": "https://xslyl.com/en/",
      "article_link_present": true,
      "link_found": "/en/posts/agent-release-gate-design.html",
      "passed": true
    }
  },
  "all_live_checks_passed": true,
  "action": "MARK_VERIFIED",
  "artifacts": [
    "verify-report.json",
    "live-review.md",
    "final-report.md"
  ],
  "duration_ms": 18500,
  "next_gate": null
}
  

Only after the VERIFIED Gate emits its MARK_VERIFIED action is an article truly "published" in the system's internal state model. Before this point, the task state is DEPLOYED — live, but unconfirmed. After VERIFIED passes, the state transitions to VERIFIED — live and confirmed correct. This distinction matters for monitoring and alerting: a task stuck in DEPLOYED for more than 5 minutes triggers a warning (the VERIFIED check may have stalled); a task in VERIFIED is closed and silent. The live-review.md and final-report.md artifacts provide human-readable summaries of the entire gate journey from Research through VERIFIED — useful for compliance audits and retrospectives. For deeper integration of live-site checks into agent monitoring dashboards, see Agent Observability.

11. Failure Paths: Gate Retry, Degradation, and Rollback Strategies

Not all gate failures are equal. A lint error is trivial — the agent fixes a typo and resubmits. A broken evidence chain is catastrophic — the entire pipeline must halt until a human diagnoses the corruption. The release gate system defines four distinct failure strategies, each mapped to the severity and recoverability of the failure that triggered it. Understanding these strategies is essential to operating the gate pipeline — misapplying a retry to a block-worthy failure can mask systemic issues, while blocking on a transient network blip can stall the pipeline unnecessarily.

Strategy 1: Retry (transient, automatically recoverable). Transient failures — network timeouts, temporary file lock contention, API rate-limit backoffs — trigger automatic retry with exponential backoff. The gate retries up to three times: first retry after 2 seconds, second after 4 seconds, third after 8 seconds. If all three retries fail, the failure escalates to the next strategy level (typically Degrade or Block, depending on the gate). Retry is used by the Research Gate (search API flakiness), QA Gate (test environment spin-up delays), and the Deploy Gate (CDN sync timeouts). The key constraint: retry is only safe for idempotent operations — operations that produce the same result whether executed once or three times. A merge to main is not retry-safe; a lint check is.
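The retry schedule above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: `TransientError`, `RetriesExhausted`, and `run_with_retry` are hypothetical names, and the `sleep` parameter is injectable so the example runs without actually waiting. Only the 2, 4, and 8 second delays come from the text.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


class RetriesExhausted(Exception):
    """Raised so the caller can escalate to Degrade or Block."""


def run_with_retry(
    check: Callable[[], T],
    backoff_s: Sequence[float] = (2.0, 4.0, 8.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run an idempotent check, waiting backoff_s[i] seconds before retry i+1."""
    delays = list(backoff_s)
    while True:
        try:
            return check()
        except TransientError:
            if not delays:
                raise RetriesExhausted(f"failed after {len(backoff_s)} retries") from None
            sleep(delays.pop(0))


# Demonstration: a check that times out twice, then succeeds.
calls = {"n": 0}


def flaky_lint() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")
    return "pass"


waits: list = []
print(run_with_retry(flaky_lint, sleep=waits.append), waits)  # pass [2.0, 4.0]
```

Only `TransientError` is retried. Any other exception propagates immediately, which keeps non-transient failures from being masked by the retry loop.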

Strategy 2: Degrade (non-critical, warn but pass). Non-critical failures — dead links in a secondary reference, a file slightly over the recommended size limit, an optional check timing out — trigger degradation: the gate emits a warning and passes, but records the degradation in its evidence file. Degradation is never silent. Every degraded pass is tagged with "degraded": true and a "degradation_reason" field in the gate's report. The READY Gate inspects these tags and can escalate a pattern of degradation (e.g., three consecutive degraded passes in the same check) to a hard block. A concrete example: the Review Gate's design_review check normally routes to a human reviewer, but if no human is available within the timeout window, the gate degrades to a ChatGPT-powered fallback review — flagged as degraded, but sufficient to unblock progress on low-risk changes.
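As a rough sketch of how a pattern of degradation could escalate, the helper below counts consecutive degraded passes per check. `DegradationTracker` is a hypothetical name, and the sketch assumes the block fires on the third consecutive degraded pass; the "degraded" and "degradation_reason" fields come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict

# Threshold taken from the text: three consecutive degraded passes.
MAX_CONSECUTIVE_DEGRADATIONS = 3


@dataclass
class DegradationTracker:
    """Hypothetical helper that counts consecutive degraded passes per check."""

    streaks: Dict[str, int] = field(default_factory=dict)

    def record(self, check: str, degraded: bool, reason: str = "") -> dict:
        """Return the report entry for one run of `check`."""
        # A clean pass resets the streak; a degraded pass extends it.
        streak = (self.streaks.get(check, 0) + 1) if degraded else 0
        self.streaks[check] = streak
        blocked = streak >= MAX_CONSECUTIVE_DEGRADATIONS
        entry = {"check": check, "passed": not blocked, "degraded": degraded}
        if degraded:
            entry["degradation_reason"] = reason  # degradation is never silent
        if blocked:
            entry["action"] = "BLOCK"
        return entry


tracker = DegradationTracker()
why = "no human reviewer before timeout"
for _ in range(2):
    tracker.record("design_review", degraded=True, reason=why)
third = tracker.record("design_review", degraded=True, reason=why)
print(third["passed"], third.get("action"))  # False BLOCK
```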

Strategy 3: Block (critical, human intervention required). Critical failures (evidence chain broken, checksum mismatch, user approval invalid, non-auto-fixable Conformity defect) trigger a hard block. The pipeline halts. No further gates execute. The task enters a blocked state that only explicit human action can resolve: re-run the failed gate, override the block with a signed approval, or cancel the task entirely. Block is the default failure strategy for the READY, Deploy, and Conformity gates, whose failures indicate a structural integrity problem, not a fixable defect. A block is not a punishment; it is a safety lock. The system refuses to proceed because proceeding would mean deploying an untrusted or broken change.

Strategy 4: Rollback (post-deploy, restore previous state). When a deployment succeeds technically but post-deploy verification fails — a CDN propagation glitch causes a 404 on the new page, or the homepage fails to update within the verification window — the response is rollback. The Deploy Gate reverses the merge, rebuilds the previous version, syncs the rolled-back assets to the CDN, and verifies that the rollback restored the site to its pre-deployment state. Rollback is not a "fix"; it is a "restore to known-good." The defect that caused the verification failure must be diagnosed and corrected before re-deployment is attempted. The rollback execution is recorded in deploy-report.json with the full revert sequence — every git command, every build step, every CDN sync — for auditability.
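The rollback sequence can be modeled as an ordered list of steps whose outcomes are all recorded. The sketch below is illustrative only: `run_rollback` and the step names are hypothetical, the real git, build, and CDN commands are replaced by stand-in callables, and `ALERT_ONCALL` is the value the configuration uses for `on_rollback_failure`.

```python
from typing import Callable, List, Tuple

# Each step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_rollback(steps: List[Step]) -> dict:
    """Run rollback steps in order, logging each one; stop at the first failure."""
    log = []
    for name, action in steps:
        ok = action()
        log.append({"step": name, "ok": ok})
        if not ok:
            return {"rolled_back": False, "action": "ALERT_ONCALL", "steps": log}
    return {"rolled_back": True, "action": None, "steps": log}


# Stand-in actions; a real pipeline would invoke git, the build, and the CDN sync here.
report = run_rollback([
    ("revert_merge", lambda: True),
    ("rebuild_previous_version", lambda: True),
    ("sync_cdn", lambda: True),
    ("verify_rollback", lambda: True),
])
print(report["rolled_back"], len(report["steps"]))  # True 4
```

The returned dictionary has the shape of something that could be written into deploy-report.json, so the full revert sequence stays auditable.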

Here is the failure strategy configuration — mapping each gate to its strategy with retry limits and degradation rules:


{
  "failure_strategies": {
    "RETRY": {
      "max_attempts": 3,
      "backoff": "exponential",
      "backoff_intervals_ms": [2000, 4000, 8000],
      "on_exhausted": "DEGRADE_OR_BLOCK",
      "applies_to": ["RESEARCH", "QA", "DEPLOY"]
    },
    "DEGRADE": {
      "mode": "warn_and_pass",
      "requires_flag": true,
      "max_consecutive_degradations": 3,
      "on_threshold_exceeded": "BLOCK",
      "applies_to": ["REVIEW"]
    },
    "BLOCK": {
      "mode": "halt_pipeline",
      "requires_human_override": true,
      "override_window_hours": 24,
      "applies_to": ["CONFORMITY", "READY", "DEPLOY"]
    },
    "ROLLBACK": {
      "mode": "restore_previous_version",
      "verify_rollback": true,
      "on_rollback_failure": "ALERT_ONCALL",
      "applies_to": ["DEPLOY"]
    }
  }
}
  

The strategy configuration is not theoretical — it is referenced by the state machine at every gate transition. When a gate reports a failure verdict, the state machine reads the failure strategy, executes the prescribed action (retry, degrade, block, or rollback), and writes the outcome to the transition log. This ensures that failure handling is deterministic and auditable — the same input always produces the same recovery action, and every recovery action is recorded. For the full error recovery protocol that governs how agents handle failures within and across gates, see Agent Error Recovery. For the audit log design that ensures every retry, degradation, block, and rollback is immutably recorded, see Agent Audit Log Design.
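A minimal sketch of that lookup, assuming the configuration has been loaded into a Python dictionary. The function names and the log shape are hypothetical, and the dictionary below is an excerpt of the configuration above with RETRY omitted for brevity.

```python
from typing import Dict, List

# Excerpt of the failure strategy configuration (RETRY omitted for brevity).
CONFIG = {
    "failure_strategies": {
        "DEGRADE": {"mode": "warn_and_pass", "applies_to": ["REVIEW"]},
        "BLOCK": {"mode": "halt_pipeline", "applies_to": ["CONFORMITY", "READY", "DEPLOY"]},
        "ROLLBACK": {"mode": "restore_previous_version", "applies_to": ["DEPLOY"]},
    }
}


def strategies_for(gate: str, config: dict) -> List[str]:
    """List the strategies whose applies_to includes the gate, sorted for determinism."""
    return sorted(
        name
        for name, spec in config["failure_strategies"].items()
        if gate in spec["applies_to"]
    )


def record_failure(gate: str, strategy: str, config: dict, log: List[Dict[str, str]]) -> Dict[str, str]:
    """Append the chosen recovery action to the transition log and return the entry."""
    if strategy not in strategies_for(gate, config):
        raise ValueError(f"{strategy} is not configured for gate {gate}")
    mode = config["failure_strategies"][strategy]["mode"]
    entry = {"gate": gate, "strategy": strategy, "mode": mode}
    log.append(entry)
    return entry


transition_log: List[Dict[str, str]] = []
print(strategies_for("READY", CONFIG))  # ['BLOCK']
print(record_failure("REVIEW", "DEGRADE", CONFIG, transition_log)["mode"])  # warn_and_pass
```

Rejecting a strategy that is not configured for the gate is what keeps recovery deterministic: the state machine cannot improvise an action the configuration does not list.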

12. Summary: From Freefall to Layered Responsibility

An AI agent without release gates operates in freefall — it generates, it saves, it declares success, and no independent processor verifies any of it. The output reaches production on the agent's own authority. This is not safe. It is not auditable. It is not reliable. The 8-gate release system described in this article transforms that freefall into a layered responsibility chain where every quality dimension is owned by an independent processor that produces auditable evidence of its decision.

The 8-gate mnemonic. The gates form a memorable chain: Research → Author → QA → Review → Conformity → READY → Deploy → VERIFIED. Each gate's name encodes its responsibility: Research gathers information without touching source code. Author generates draft content. QA runs deterministic structural checks. Review performs semantic inspection. Conformity enforces project-specific standards. READY audits the evidence chain. Deploy controls the production release. VERIFIED confirms live correctness. The names deliberately describe a responsibility, or the state that responsibility produces, rather than a tool. A gate named "Lint Gate" tells you the tool; a gate named "QA Gate" tells you the responsibility.

Four core design principles. The entire system rests on four principles that apply at every layer. (1) Deterministic first — automate everything that can be checked mechanically (lint, type checking, schema validation, dead links), and reserve human attention for what cannot (design judgment, semantic correctness, requirement interpretation). The QA Gate runs in seconds; the Review Gate may take hours. This is correct: fast checks should be fast, and slow checks should be worth the wait. (2) Artifact handoff — every gate produces a structured evidence file (JSON) that the next gate consumes. No gate reads "the code"; every gate reads the previous gate's report plus the output files. This creates a formal interface between quality dimensions and prevents gate implementations from coupling to each other's internal logic. (3) Agent separation — the agent that writes is never the agent that reviews. The agent that deploys is never the agent that verifies. Fresh perspective is the cheapest and most effective defense against hallucination and context drift — and it costs nothing but a second agent instance. (4) Evidence = gate — a gate that produces no evidence did not execute. The evidence files (research-evidence.json, qa-report.json, ready-report.json, etc.) are not optional logs; they are the output of the gate. If the evidence file is missing or corrupted, the gate's verdict is null — and the READY Gate treats null as failure.
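The "Evidence = gate" principle can be illustrated with a three-valued verdict reader. This is a sketch under assumptions: the top-level "passed" field and both function names are hypothetical, and only the file names and the rule that a null verdict counts as failure come from the text.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def read_verdict(evidence_path: Path) -> Optional[bool]:
    """Return the gate's verdict, or None when the evidence is missing or corrupted."""
    try:
        report = json.loads(evidence_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return None
    if not isinstance(report, dict):
        return None
    passed = report.get("passed")  # assumed top-level field name
    return passed if isinstance(passed, bool) else None


def ready_gate_passes(evidence_paths: Iterable[Path]) -> bool:
    """READY passes only when every gate left a readable, passing verdict."""
    return all(read_verdict(path) is True for path in evidence_paths)


with tempfile.TemporaryDirectory() as tmp:
    qa = Path(tmp) / "qa-report.json"
    qa.write_text(json.dumps({"passed": True}), encoding="utf-8")
    missing = Path(tmp) / "research-evidence.json"  # never written
    print(read_verdict(qa), read_verdict(missing))  # True None
    print(ready_gate_passes([qa, missing]))  # False
```

Comparing with `is True` rather than relying on truthiness means a null verdict can never be mistaken for a pass.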

📌 Well-designed gates make agents faster, not slower: The intuitive objection to a multi-gate pipeline is "this will slow everything down." In most cases the opposite holds. Without gates, the agent self-reviews, and self-review is slow, unreliable, and produces no evidence. With gates, the agent focuses entirely on generation while independent processors handle verification in parallel or immediately after. QA checks complete in tens of seconds. Review may take longer, but it runs while the agent moves to the next task, so the added latency stays off the generation path. A gated agent pipeline can therefore complete end-to-end faster than an ungated one, because errors are caught at the point of generation (QA Gate, auto-retry) rather than discovered days later by a debugging human. The gates are not speed bumps; they are guardrails that keep the agent on the road and out of the ditch.

The release gate system is not a CI/CD pipeline. It is not a checklist. It is a responsibility architecture — a formal declaration that no single agent, no single check, and no single quality dimension is sufficient to trust generated output. Trust is built across layers, verified by independent processors, and evidenced by immutable audit records. That is the core idea: trust nothing, verify everything, and record every verification.

Frequently Asked Questions

Q1: How is a gate different from a CI/CD stage?

A CI/CD stage is a build step — it runs against a commit, produces artifacts, and either passes or fails the build. A gate is a responsibility processor — it receives output from the previous gate, applies independent verification on one quality dimension, produces auditable evidence, and either passes the output forward or redirects it back. CI/CD stages are ordered build steps. Gates are ordered responsibility handoffs. The key difference: CI/CD assumes a clean, self-consistent commit as input; gates assume a continuous stream of agent output that may contain intermediate, incomplete, or hallucinated content. Gates are designed for the reality of agent workflows; CI/CD is designed for the reality of human-authored commits.

Q2: Do all gates need human approval?

No. Of the eight gates, only the Deploy Gate structurally requires human approval (the 发布 command, Chinese for "publish"). The Review Gate may route to a human reviewer for design assessment on complex changes, but for routine changes with small diffs, Review can auto-approve. The remaining gates (Research, Author, QA, Conformity, READY, VERIFIED) are fully automated. The system is designed so that human attention is reserved for the two highest-leverage decisions: semantic correctness (Review) and the irreversible action (Deploy). Everything else is deterministic and automated.

Q3: What happens when a gate fails?

It depends on the failure strategy assigned to that gate. Transient failures (network timeouts, API rate limits) trigger automatic retry with exponential backoff — up to 3 attempts. Non-critical failures (optional checks, soft warnings) trigger degradation — the gate passes but flags the issue. Critical failures (evidence corruption, checksum mismatch, missing bilingual content) trigger a hard block — the pipeline halts until a human resolves the issue. Post-deploy verification failures trigger a full rollback — the deployment is reversed and the site returns to its previous state. Every failure, regardless of strategy, is recorded in the gate transition log with full context for auditability.

Q4: Can gates be merged or simplified for smaller projects?

Yes, but with care. The 8-gate system is designed for production agent workflows where the cost of a bad deployment is high. For smaller projects, you can merge adjacent gates that share similar execution profiles: Research + Author can run sequentially in the same agent session; QA + Conformity can share a single execution environment; READY + VERIFIED can be collapsed into a single "Pre/Post Deploy Audit" gate. What you should not merge is the writer/reviewer separation — the Author gate and Review gate must always use different agent instances, regardless of project size. And the Deploy Gate must always require explicit human approval. These two constraints are non-negotiable safety properties.

Q5: How do you test the gates themselves?

Gates are tested at three levels. Unit tests verify each check function independently — given a known-bad input, does the lint check correctly report the error at the expected location? Integration tests feed a complete evidence chain through the gate and verify the verdict — given a research-evidence.json with keyword_coverage of 0.5, does the READY Gate correctly identify the Research Gate as the failure source? End-to-end tests run a full task through the entire pipeline with a known-good output and verify every gate passes; then inject controlled failures (a missing file, a hallucinated symbol) and verify the correct gate blocks, degrades, or retries as specified. The test suite for the gate system is itself run through the gate system — gates verify the code that implements the gates, a meta-verification that catches regressions in the verification logic.
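The integration-level example above might look like the following with Python's built-in unittest module. This is a sketch, not the article's test suite: the 0.8 coverage threshold, the gate list, and both helper functions are hypothetical stand-ins for the real gate logic.

```python
import unittest
from typing import Dict, Optional

GATE_ORDER = ["RESEARCH", "AUTHOR", "QA", "REVIEW", "CONFORMITY"]
MIN_KEYWORD_COVERAGE = 0.8  # hypothetical threshold, not taken from the article


def research_passed(evidence: dict) -> bool:
    """Hypothetical Research check: coverage must reach the threshold."""
    return evidence.get("keyword_coverage", 0.0) >= MIN_KEYWORD_COVERAGE


def find_failure_source(verdicts: Dict[str, bool]) -> Optional[str]:
    """Return the first gate in chain order without a passing verdict, else None."""
    for gate in GATE_ORDER:
        if verdicts.get(gate) is not True:
            return gate
    return None


class ReadyGateIntegrationTest(unittest.TestCase):
    def test_low_keyword_coverage_blames_research(self):
        verdicts = {gate: True for gate in GATE_ORDER}
        verdicts["RESEARCH"] = research_passed({"keyword_coverage": 0.5})
        self.assertEqual(find_failure_source(verdicts), "RESEARCH")

    def test_clean_chain_has_no_failure_source(self):
        self.assertIsNone(find_failure_source({gate: True for gate in GATE_ORDER}))


if __name__ == "__main__":
    unittest.main(argv=["gate-tests"], exit=False)
```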

Q6: Can gates be bypassed in emergencies?

Yes, but the bypass is itself a gate action — it is never invisible or silent. An emergency bypass requires a signed approval token with the "emergency": true flag, issued by an authorized operator. The bypass is recorded in the gate transition log with the operator's identity, the reason for the bypass, and the gates that were skipped. The bypassed task is flagged as VERIFIED_WITH_BYPASS rather than VERIFIED, and all downstream monitoring treats bypassed tasks as elevated risk — they trigger additional post-deploy verification checks and faster alerting on anomalies. Bypass is not a loophole; it is a controlled, auditable, accountable exception mechanism for situations where the cost of delay exceeds the risk of the unverified change.
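One way to make a bypass verifiable is an HMAC-signed token. The sketch below is an assumption about how such a token could work, not the system's actual token format: the claim names other than "emergency", the function names, and the demo key are all hypothetical.

```python
import hashlib
import hmac
import json
from typing import List


def sign_token(claims: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the claims with HMAC-SHA256."""
    payload = json.dumps(claims, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def apply_bypass(claims: dict, signature: str, key: bytes, log: List[dict]) -> str:
    """Verify the token, record the bypass, and return the resulting task state."""
    if not hmac.compare_digest(sign_token(claims, key), signature):
        raise PermissionError("bypass token signature is invalid")
    if claims.get("emergency") is not True:
        raise PermissionError("token does not carry the emergency flag")
    log.append({
        "event": "BYPASS",
        "operator": claims["operator"],
        "reason": claims["reason"],
        "skipped_gates": claims["skipped_gates"],
    })
    return "VERIFIED_WITH_BYPASS"


key = b"demo-key-for-illustration-only"
claims = {
    "emergency": True,
    "operator": "oncall-operator",
    "reason": "production outage hotfix",
    "skipped_gates": ["REVIEW"],
}
audit_log: List[dict] = []
print(apply_bypass(claims, sign_token(claims, key), key, audit_log))  # VERIFIED_WITH_BYPASS
```

Because the log entry is written before the state is returned, a bypass cannot take effect without leaving a record of who approved it, why, and which gates were skipped.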

Next Steps: Articles That Extend This Architecture

The release gate system is the backbone — but it integrates with several other architectural components that together form a complete agent reliability framework. The articles below dive into each adjacent concern. Read them in any order; each stands alone but references the gate system as context.

Agent State Machine Design

The state machine that enforces gate ordering — state transitions are driven by gate pass/fail events, not LLM text output. Every RESEARCH_PASSED → AUTHOR_PASSED → QA_PASSED transition is a deterministic state change with full audit trail.

Agent Context Protocol Design

How evidence files (research-evidence.json, qa-report.json, etc.) are structured, versioned, checksummed, and passed between gates. Defines the formal interface that keeps gate implementations decoupled.

Agent Human Approval Workflow

The approval token system that gates the Deploy Gate and Review Gate's human-in-the-loop path. Covers token issuance, signature verification, expiry windows, and emergency bypass protocols.

Agent Audit Log Design

The append-only transition log that records every gate pass, fail, retry, degradation, and bypass. Ensures the entire gate pipeline is replayable for compliance audits and root-cause analysis.

Agent Observability

Monitoring and alerting for the gate pipeline — live dashboards showing gate status per task, VERIFIED Gate latency tracking, degradation pattern detection, and alerting integration for BLOCK and ALERT_ONCALL events.

Agent Security Evaluation

Security scanning within the QA Gate — bandit configuration for agent-generated Python, dependency vulnerability checks, secret detection, and sandbox policy enforcement that prevents agents from exfiltrating data during gate execution.