Structured Debate Protocol

In the previous article, we used two agents challenging each other to tackle the cognitive bias problem in single models. That debate.py code works — but it has an obvious flaw:

It has no structure. Pro speaks, then Con speaks, then Pro speaks again — it's essentially alternating monologues with no inherent logical framework. You could loop them 10 rounds, but after round 5, both sides are mostly just repeating themselves and going in circles.

Real debate — whether academic peer review, courtroom cross-examination, or presidential debates — has strict phase structure. Each phase has a clear objective and constraints. That's what this article brings to the multi-agent debate system.

Four Problems with Free-Form Debate

Let's be precise about what's wrong with L1's "free-form" debate. It's not useless — it's far better than a single answer. But if you want to use it for genuinely important decisions (tech stack choices, investment strategy, product direction), here are the problems you need to know.

Problem 1: Topic Drift

A debate without structural constraints is like a meeting without an agenda. Pro is talking about cost, Con suddenly pivots to security; Pro is still responding about security, Con jumps to team skill set. By round three, you've forgotten what the core proposition was even about.

A debate without anchors is a boxing match without a referee. Both sides throw punches, but nobody knows what this round is actually about.

Problem 2: False Consensus

In free-form debate, Con might say "on the cost issue, I agree with Pro's analysis" — then both move on. A judge reading this might think "cost is settled — consensus reached." But Con may have only politely acknowledged the point without genuine concession. Or Con may have misunderstood but skipped it to keep the debate moving.

Without structured consensus tracking, you can't distinguish "actual agreement" from "underexamined potential disagreement."

Problem 3: Shallow Depth

Free-form debate encourages breadth — each round covers multiple arguments. But important decisions often require deep drilling into a single argument. Pro says "this approach has better performance," Con says "performance isn't the bottleneck" — and the topic moves on. Nobody asks: "Better by what definition? P99 latency? Throughput? Under what test conditions and benchmarks?"

Problem 4: Unreproducible Judging

L1's judge used a prompt: "Pro strengths, Con strengths, areas of agreement, uncertain areas, overall recommendation." Better than "who won," but still flawed: run it twice on the same debate and the conclusions can differ substantially. Because the prompt gives the judge no specific scoring dimensions — it's just improvising.

| Problem | Free-Form Behavior | Structure Needed |
|---|---|---|
| Topic Drift | Topics shift arbitrarily; core proposition blurs | Each round has an explicit theme and constraints |
| False Consensus | Polite concessions mistaken for agreement | Structured consensus tracking and confirmation |
| Shallow Depth | Broad coverage but surface-level on each point | Cross-exam phase forces deep drilling on key arguments |
| Unreproducible Judging | Improvised judgment; two runs, two results | Explicit scoring dimensions and decision rules |

The 3-Round Structured Debate Protocol

Here's the protocol designed in this article. Three rounds, each with explicit objectives, inputs, outputs, and termination conditions.

Protocol Overview

| Round | Phase | Objective | Deliverable |
|---|---|---|---|
| R1 | Opening Statement | Present full position, list core arguments | 3-5 structured arguments (Claim + Reasoning + Evidence) |
| R2 | Cross-Examination | Challenge each opponent argument, drill into weak points | Challenge-response pair for each argument |
| R3 | Closing Statement | Synthesize the debate, concede valid rebuttals, state final position | Structured summary + positions held + concessions made |

Round 1: Opening Statement

Objective: Both sides present their full position without interruption.

This isn't "say whatever comes to mind." Opening statements have strict formatting requirements:

Here's an example. Not this:

❌ "Microservices improve development efficiency because teams can work independently, and many big companies use them."

But this:

Argument 1: Independent deployment shortens release cycles
Claim: Microservices architecture can significantly shorten the cycle from code commit to production deployment.
Reasoning: In monolithic architecture, any code change requires full build, full test, full deployment. For a 5-person team, each deployment cycle is approximately 4 hours. Microservices allow independent deployment — each service can be independently built, tested, and released without affecting others.
Evidence: Our benchmark testing shows that the same feature change takes an average of 3.8 hours to deploy in monolithic architecture vs. 0.7 hours in microservices (P95: 6.2 hours monolithic, 1.4 hours microservices). That's a 5.4x difference.

See the difference? The latter is cross-examinable — the opponent can ask: "Under what conditions were your benchmarks run? Are the test cases representative? Why does the P95 improvement drop to only 4.4x?"

💡 Design principle: The quality of the opening statement sets the ceiling for the entire debate. If one side can't present defensible arguments up front, the cross-examination phase will only expose more problems. Our code validates argument format and flags non-conforming submissions.

Round 2: Cross-Examination

Objective: Both sides must respond to each and every one of the opponent's opening arguments, with no skipping and no vague hand-waving. This is the most critical phase of the entire protocol.

Cross-examination has three mandatory requirements:

  1. Point-by-point response: For each opponent argument, exactly one of four response types: Refute (points out a logical or factual error), Challenge (argues insufficient evidence or unmet conditions), Concede (accepts the argument as valid), Partial (accepts the core but disputes degree or scope).
  2. Targeted follow-up question: For each argument, at least one specific, sharp follow-up question. E.g., "You cite data under condition X, which doesn't apply in our scenario — how do you demonstrate transferability?"
  3. No new arguments: The cross-examination phase prohibits introducing brand-new arguments. This is crucial — if new arguments were allowed in cross-exam, the opponent would never get a chance to respond, and the judge couldn't evaluate them. All new arguments must be raised in R1 opening statements.
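The point-by-point rule is mechanically checkable before the debate moves on. Here's a minimal sketch; the helper name `check_point_by_point` is my own, not part of the article's `debate_protocol.py`:

```python
def check_point_by_point(opponent_arg_ids: list[str],
                         responses: list[dict]) -> tuple[list[str], list[str]]:
    """Return (skipped_ids, duplicated_ids) for one cross-examination round."""
    answered = [r["target_arg_id"] for r in responses]
    # Arguments the agent never responded to (violates requirement 1)
    skipped = [a for a in opponent_arg_ids if a not in answered]
    # Arguments answered more than once (usually a formatting slip)
    duplicated = sorted({a for a in answered if answered.count(a) > 1})
    return skipped, duplicated
```

If `skipped` comes back non-empty, you can re-prompt the agent for the missing responses before proceeding to R3.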

This phase mirrors the academic peer review process — a reviewer must address specific claims in the paper, not say "overall this direction feels wrong" and reject it.

⚠️ Why ban new arguments? This is the key constraint for structural integrity. Allowing new arguments in cross-exam would degenerate the debate back to L1's free-form — new arguments get thrown in, old ones forgotten, the judge can't track which points both sides had a chance to respond to.

Round 3: Closing Statement

Objective: After seeing the opponent's cross-examination, both sides make their final case. This is not "repeat the opening statement."

The closing statement has three components: the concessions made (opponent points accepted during cross-examination), the arguments not effectively rebutted (the side's own opening arguments that survived the opponent's challenge), and the final position, including whether it shifted and why.

The closing statement doesn't need to be long. 100-200 words is plenty; the judge needs the distilled final state, not a third repetition.

Judge Agent Deep Design

L1's judge was already better than "who won" — it used 5 dimensions. But we need a more rigorous design where judgments are reproducible, auditable, and quantifiable.

Multi-Dimensional Scoring Rubric

The judge doesn't give a vague "Pro wins" or "Con wins." It independently scores each opening argument on both sides, then aggregates:

| Scoring Dimension | Criteria | Weight |
|---|---|---|
| Logic (1-10) | Is the reasoning chain self-consistent? Any gaps, circular reasoning, or equivocation? | 30% |
| Evidence Quality (1-10) | Is the evidence specific, verifiable, and relevant? Or just vague generalities? | 30% |
| Responsiveness (1-10) | Did they respond directly to cross-examination? Point-by-point rebuttal or evasion? | 25% |
| Honesty (1-10) | Did they concede when warranted? Any exaggeration or distortion? | 15% |

Each opening argument's final score = Logic × 0.3 + Evidence × 0.3 + Responsiveness × 0.25 + Honesty × 0.15. The average across all arguments is that side's total score.
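As a worked example (the 8/7/6/8 scores are made up for illustration), an argument scored that way across the four dimensions comes out to 7.2:

```python
# Weighted score for one argument, using the article's rubric weights
weights = {"logic": 0.30, "evidence": 0.30, "responsiveness": 0.25, "honesty": 0.15}
scores  = {"logic": 8,    "evidence": 7,    "responsiveness": 6,    "honesty": 8}

total = sum(scores[k] * weights[k] for k in weights)
# 8*0.30 + 7*0.30 + 6*0.25 + 8*0.15 = 2.4 + 2.1 + 1.5 + 1.2 = 7.2
print(round(total, 2))  # 7.2
```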

Note: This scoring rubric measures debate quality, not "whose position is more correct." A position could be wrong, but if it's argued rigorously with strong evidence and honest responses — it deserves a high score. Conversely, a position could be right, but argued poorly — it deserves a low score.

📌 Important distinction: Debate quality and position correctness are separate things. In real decision-making, you ultimately judge based on "which arguments survived cross-examination unrebutted" — that's the topic of the next article (L3: Scoring & Consensus).

Logical Fallacy Detection

Beyond scoring, the judge also detects common logical fallacies. We embed a fallacy checklist in the judge's system prompt:

| Fallacy | Definition | Example |
|---|---|---|
| Straw Man | Distorting the opponent's argument to attack something they didn't say | "Pro thinks monoliths are worthless" — Pro never said that |
| Appeal to Authority | Using an authority's name in place of actual reasoning | "Google uses microservices, so you should too" — lacks context |
| Slippery Slope | Assuming one action triggers an uncontrollable chain reaction | "If you pick microservices, you'll need K8s, then a DevOps team, then you'll be 20 people…" |
| False Dilemma | Reducing a complex issue to an either-or choice | "You're either all microservices or all monolithic" — ignoring modular monoliths |
| Anecdotal Evidence | Using isolated cases instead of systematic evidence | "I saw one project collapse after adopting microservices" — N=1 |

The judge doesn't need to make final determinations on every suspected fallacy — it just needs to flag suspicious reasoning patterns and list them in the report.

⚠️ Fallacy flags ≠ verdicts: Judge-flagged fallacies are hints for the human decision-maker. Final judgment rests with the human. The automated system's role is "help you spot problems" — not "decide for you."
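Surfacing those flags for human review is a one-liner against the `scores` array the judge returns (each entry may carry a `fallacies` list, per the output schema in the code below). The helper name `list_flagged_fallacies` is my own:

```python
def list_flagged_fallacies(scores: list[dict]) -> list[tuple[str, str]]:
    """Flatten the judge's per-argument fallacy flags into (arg_id, fallacy) pairs."""
    return [
        (s["argument_id"], fallacy)
        for s in scores
        for fallacy in s.get("fallacies", [])
    ]
```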

Argument Trace Table

This is the most practical part of the judge's output. It's a table tracking every opening argument's fate from introduction to conclusion:

| Argument ID | Proposer | Core Claim | Cross-Exam Result | Final Status |
|---|---|---|---|---|
| PRO-1 | Pro | Independent deployment shortens release cycles 5.4x | Con challenged test conditions | Partially Upheld — test environment idealized |
| CON-2 | Con | Distributed systems increase operational complexity | Pro failed to effectively rebut | Upheld — ops cost will indeed increase |
| PRO-3 | Pro | Team tech stack flexibility improves | Con partially conceded | Upheld — both sides agree |

The value of this table: you won't drown in a long debate transcript. At a glance, you see which arguments stood, which were refuted, and which need more information to judge. This is the core advantage of structured debate over free-form.
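Once the judge returns its JSON, rendering that table is a few lines. A sketch (the field names follow the judge's output schema in the code below; `render_trace` is my own helper name):

```python
def render_trace(trace: list[dict]) -> str:
    """Render the judge's argument_trace_table as aligned plain-text rows."""
    rows = [f"{'ID':<8} {'Standing':<18} Claim"]
    for entry in trace:
        rows.append(
            f"{entry['argument_id']:<8} {entry['standing']:<18} {entry['claim']}"
        )
    return "\n".join(rows)
```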

Code Implementation

Below is the complete implementation. It inherits L1's debate.py architecture (DebateAgent + JudgeAgent + engine function) and adds protocol round management, argument format validation, multi-dimensional scoring, and fallacy detection.

Save it as debate_protocol.py in the same directory as debate.py and you're ready to run.

"""
Structured Debate Protocol — 3-Round Debate + Multi-Dimensional Judge Scoring
Extends L1's debate.py with protocol round management.

Requires: pip install openai
"""

import os
import json
import re
from enum import Enum
from dataclasses import dataclass, field
from openai import OpenAI

# ──────────────────────────────────────────────
# 1. Initialize LLM client (placeholder credentials)
# ──────────────────────────────────────────────
client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.example.com/v1"
)


# ──────────────────────────────────────────────
# 2. Data structures
# ──────────────────────────────────────────────
class RoundType(Enum):
    OPENING = "opening"           # Opening statement
    CROSS_EXAM = "cross_exam"     # Cross-examination
    CLOSING = "closing"           # Closing statement


@dataclass
class Argument:
    """A structured argument"""
    id: str                      # Argument ID, e.g. PRO-1, CON-3
    claim: str                   # Core assertion
    reasoning: str               # Reasoning chain
    evidence: str                # Supporting evidence

    def to_text(self) -> str:
        return (
            f"[{self.id}] Claim: {self.claim}\n"
            f"Reasoning: {self.reasoning}\n"
            f"Evidence: {self.evidence}"
        )


@dataclass
class CrossExamResponse:
    """Response to a single argument during cross-examination"""
    target_arg_id: str           # Target argument ID
    response_type: str           # refute | challenge | concede | partial
    reasoning: str               # Response reasoning
    follow_up_question: str      # Follow-up question

    def to_text(self) -> str:
        return (
            f"Response to [{self.target_arg_id}] [{self.response_type}]:\n"
            f"{self.reasoning}\n"
            f"Follow-up: {self.follow_up_question}"
        )


@dataclass
class ScoringResult:
    """Judge's scoring of a single argument"""
    argument_id: str
    logic_score: int             # 1-10
    evidence_score: int          # 1-10
    responsiveness_score: int    # 1-10
    honesty_score: int           # 1-10
    fallacies_detected: list[str] = field(default_factory=list)
    notes: str = ""

    @property
    def weighted_score(self) -> float:
        return (
            self.logic_score * 0.30 +
            self.evidence_score * 0.30 +
            self.responsiveness_score * 0.25 +
            self.honesty_score * 0.15
        )


# ──────────────────────────────────────────────
# 3. Structured Debate Agent (extends L1's DebateAgent)
# ──────────────────────────────────────────────
class StructuredDebateAgent:
    """
    A debate agent that holds a specific stance and generates
    structured output per protocol round.

    Key differences from L1's DebateAgent:
    - Outputs are structured as Arguments with explicit IDs
    - Separate methods per round (opening / cross_examine / closing)
    - Maintains its own argument list for cross-examination reference
    """

    def __init__(self, name: str, stance: str, system_prompt: str):
        self.name = name
        self.stance = stance
        self.system_prompt = system_prompt
        self.history: list[dict] = []
        self.arguments: list[Argument] = []  # This side's opening arguments

    def _prefix(self) -> str:
        """Return argument ID prefix"""
        return "PRO" if "For" in self.stance else "CON"

    def _call_llm(self, user_prompt: str, temperature: float = 0.7,
                  max_tokens: int = 1000) -> str:
        """Unified LLM call wrapper"""
        messages = [{"role": "system", "content": self.system_prompt}]
        for entry in self.history:
            messages.append(entry)
        messages.append({"role": "user", "content": user_prompt})

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )
        reply = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def opening_statement(self, topic: str) -> list[Argument]:
        """
        Round 1: Opening Statement.
        Output 3-5 structured arguments, each with claim/reasoning/evidence.
        """
        prefix = self._prefix()

        prompt = (
            f"Topic: \"{topic}\"\n"
            f"Your position: {self.stance}\n\n"
            f"Deliver your opening statement. Output 3-5 arguments "
            f"as a JSON array.\n"
            f"Each argument must follow this JSON format:\n"
            f'{{"id": "{prefix}-N", "claim": "core assertion", '
            f'"reasoning": "reasoning chain", "evidence": "evidence or data"}}\n\n'
            f"Requirements:\n"
            f"1. ID format: {prefix}-1, {prefix}-2, incrementing\n"
            f"2. claim must be a specific, falsifiable proposition\n"
            f"3. reasoning must contain a causal chain, not mere assertion\n"
            f"4. evidence must be verifiable specific facts/data, "
            f"not vague phrases like \"many companies\"\n"
            f"5. Output pure JSON array only, no other text"
        )

        reply = self._call_llm(prompt, temperature=0.6, max_tokens=1200)

        # Parse JSON response
        try:
            # Strip possible markdown code block markers
            cleaned = re.sub(r'```(?:json)?\s*', '', reply).strip()
            data = json.loads(cleaned)
            self.arguments = [
                Argument(
                    id=item["id"],
                    claim=item["claim"],
                    reasoning=item["reasoning"],
                    evidence=item["evidence"]
                )
                for item in data
            ]
            return self.arguments
        except (json.JSONDecodeError, KeyError) as e:
            # JSON parse failed — fallback to free text
            print(f"Warning: {self.name} JSON parse failed ({e}), "
                  f"using free text fallback.")
            self.arguments = [
                Argument(
                    id=f"{prefix}-1",
                    claim="Opening statement (JSON parse failed, "
                          "see raw response)",
                    reasoning=reply,
                    evidence=""
                )
            ]
            return self.arguments

    def cross_examine(
        self, opponent_args: list[Argument]
    ) -> list[CrossExamResponse]:
        """
        Round 2: Cross-Examination.
        Respond to each of the opponent's opening arguments point by point.
        """
        opponent_args_text = "\n\n".join(
            arg.to_text() for arg in opponent_args
        )

        prompt = (
            f"Below are your opponent's opening arguments. "
            f"Respond to each one.\n\n"
            f"{opponent_args_text}\n\n"
            f"For each argument, output a JSON object:\n"
            f'{{"target_arg_id": "opponent argument ID", '
            f'"response_type": "refute|challenge|concede|partial", '
            f'"reasoning": "your reasoning", '
            f'"follow_up_question": "a sharp follow-up question"}}\n\n'
            f"response_type meanings:\n"
            f"- refute: you believe the argument has factual/logical errors\n"
            f"- challenge: you believe evidence is insufficient or "
            f"conditions aren't met\n"
            f"- concede: you accept the argument as valid\n"
            f"- partial: you accept the core but dispute degree or scope\n\n"
            f"Requirements:\n"
            f"1. Must respond to ALL opponent arguments — do not skip any\n"
            f"2. Do NOT introduce new arguments (this is cross-examination, "
            f"only respond to existing arguments)\n"
            f"3. Output pure JSON array only, no other text"
        )

        reply = self._call_llm(prompt, temperature=0.5, max_tokens=1500)

        try:
            cleaned = re.sub(r'```(?:json)?\s*', '', reply).strip()
            data = json.loads(cleaned)
            return [
                CrossExamResponse(
                    target_arg_id=item["target_arg_id"],
                    response_type=item["response_type"],
                    reasoning=item["reasoning"],
                    follow_up_question=item["follow_up_question"]
                )
                for item in data
            ]
        except (json.JSONDecodeError, KeyError) as e:
            print(f"Warning: {self.name} cross-exam JSON parse failed "
                  f"({e}), using free text.")
            return [
                CrossExamResponse(
                    target_arg_id=arg.id,
                    response_type="challenge",
                    reasoning=f"JSON parse failed. Raw response:\n{reply}",
                    follow_up_question="Please clarify the above."
                )
                for arg in opponent_args
            ]

    def closing_statement(self) -> str:
        """
        Round 3: Closing Statement.
        Includes concessions made, arguments not rebutted, final position.
        """
        my_args_text = "\n".join(arg.to_text() for arg in self.arguments)

        prompt = (
            f"Review of your opening arguments:\n{my_args_text}\n\n"
            f"Deliver your closing statement. Structure as follows:\n\n"
            f"## Concessions Made\n"
            f"List the opponent arguments or partial arguments you accepted "
            f"during cross-examination.\n\n"
            f"## Arguments Not Effectively Rebutted\n"
            f"Restate the core arguments from your opening that the opponent "
            f"failed to effectively challenge.\n\n"
            f"## Final Position\n"
            f"Based on the above, what is your current overall position on "
            f"the topic? If it has shifted (strengthened, weakened, "
            f"partially adjusted), explain why.\n\n"
            f"Requirements: total length under 200 words. Concise and sharp."
        )

        return self._call_llm(prompt, temperature=0.5, max_tokens=600)


# ──────────────────────────────────────────────
# 4. Structured Judge Agent (multi-dimension scoring + fallacy detection)
# ──────────────────────────────────────────────
class StructuredJudge:
    """
    Judge Agent — multi-dimensional scoring, fallacy detection,
    argument tracing.

    Key differences from L1's JudgeAgent:
    - Independently scores each argument (logic/evidence/response/honesty)
    - Built-in logical fallacy detection checklist
    - Generates argument trace table
    - Outputs JSON-structured conclusions instead of free text
    """

    FALLACY_CHECKLIST = [
        ("Straw Man",
         "Is it distorting the opponent's argument — attacking something "
         "they didn't say?"),
        ("Appeal to Authority",
         "Is it using \"big company X uses it\" in place of actual reasoning?"),
        ("Slippery Slope",
         "Is it assuming one action triggers an uncontrollable "
         "chain reaction?"),
        ("False Dilemma",
         "Is it reducing a complex issue to an either-or choice?"),
        ("Anecdotal Evidence",
         "Is it using isolated cases instead of systematic evidence?"),
        ("Circular Reasoning",
         "Is the conclusion already contained in the premise?"),
        ("Ad Hominem",
         "Is it attacking the opponent rather than their argument?"),
    ]

    def evaluate(
        self,
        topic: str,
        pro_args: list[Argument],
        con_args: list[Argument],
        pro_cross: list[CrossExamResponse],
        con_cross: list[CrossExamResponse],
        pro_closing: str,
        con_closing: str
    ) -> dict:
        """
        Comprehensive evaluation of the entire debate.
        Outputs structured conclusions.
        """

        # Build the full evaluation request
        pro_args_text = "\n\n".join(a.to_text() for a in pro_args)
        con_args_text = "\n\n".join(a.to_text() for a in con_args)
        pro_cross_text = "\n\n".join(r.to_text() for r in pro_cross)
        con_cross_text = "\n\n".join(r.to_text() for r in con_cross)

        evaluation_prompt = (
            f"## Topic\n{topic}\n\n"
            f"## Pro Opening Arguments\n{pro_args_text}\n\n"
            f"## Con Opening Arguments\n{con_args_text}\n\n"
            f"## Pro Cross-Examination of Con\n{pro_cross_text}\n\n"
            f"## Con Cross-Examination of Pro\n{con_cross_text}\n\n"
            f"## Pro Closing Statement\n{pro_closing}\n\n"
            f"## Con Closing Statement\n{con_closing}\n\n"
        )

        fallacy_rules = "\n".join(
            f"  - {name}: {desc}"
            for name, desc in self.FALLACY_CHECKLIST
        )

        system_prompt = (
            "You are a strictly impartial debate judge. Your task is to "
            "evaluate the entire debate according to the standardized "
            "process below.\n\n"
            "### Scoring Rules\n"
            "For each opening argument (Pro's PRO-1, PRO-2... and Con's "
            "CON-1, CON-2...) score on these four dimensions "
            "(1-10, must be integer):\n"
            "1. logic_score: Is the reasoning chain self-consistent? "
            "1=full of logical gaps, 10=flawless\n"
            "2. evidence_score: Is evidence specific and verifiable? "
            "1=all vague generalities, 10=each piece independently verifiable\n"
            "3. responsiveness_score: How well did they respond to "
            "cross-examination? 1=evaded all questions, "
            "10=point-by-point direct response\n"
            "4. honesty_score: Did they concede when warranted? Any "
            "exaggeration? 1=full of sophistry and distortion, "
            "10=honest and fair\n\n"
            "### Fallacy Detection\n"
            "For each argument, check for the following logical fallacies. "
            "If detected, list them in the fallacies array:\n"
            f"{fallacy_rules}\n\n"
            "### Output Format\n"
            "Output strictly as the following JSON format, "
            "no other text:\n"
            '{\n'
            '  "scores": [\n'
            '    {\n'
            '      "argument_id": "PRO-1",\n'
            '      "logic_score": 8,\n'
            '      "evidence_score": 7,\n'
            '      "responsiveness_score": 6,\n'
            '      "honesty_score": 8,\n'
            '      "fallacies": ["empty list if none detected"],\n'
            '      "notes": "Brief comment on this argument"\n'
            '    }\n'
            '  ],\n'
            '  "argument_trace_table": [\n'
            '    {\n'
            '      "argument_id": "PRO-1",\n'
            '      "claim": "Summary of core claim",\n'
            '      "standing": "UPHELD|PARTIALLY_UPHELD|REFUTED|'
            'UNCERTAIN",\n'
            '      "reason": "Brief explanation"\n'
            '    }\n'
            '  ],\n'
            '  "overall_assessment": {\n'
            '    "pro_total_score": 0.0,\n'
            '    "con_total_score": 0.0,\n'
            '    "key_insight": "The single most important finding from '
            'this debate (1-2 sentences)",\n'
            '    "unresolved_questions": ["Unresolved points of contention"],\n'
            '    "recommendation": "Based on the debate results, what '
            'specific advice for the decision-maker?"\n'
            '  }\n'
            '}'
        )

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": (
                    f"Please evaluate the following debate.\n\n"
                    f"{evaluation_prompt}"
                )}
            ],
            temperature=0.2,  # Very low temp for consistency & reproducibility
            max_tokens=3000
        )

        reply = response.choices[0].message.content

        try:
            cleaned = re.sub(r'```(?:json)?\s*', '', reply).strip()
            result = json.loads(cleaned)
            return result
        except json.JSONDecodeError as e:
            print(f"Warning: Judge JSON parse failed ({e}), "
                  f"returning raw text.")
            return {
                "error": "JSON parse failed",
                "raw_response": reply,
                "scores": [],
                "argument_trace_table": [],
                "overall_assessment": {
                    "pro_total_score": 0,
                    "con_total_score": 0,
                    "key_insight": "Evaluation failed — see raw_response",
                    "unresolved_questions": [],
                    "recommendation": ""
                }
            }


# ──────────────────────────────────────────────
# 5. Debate Engine — orchestrates the 3-round protocol
# ──────────────────────────────────────────────
def run_structured_debate(topic: str) -> dict:
    """
    Run a complete 3-round structured debate.

    Returns:
        dict: Contains all round records, scores, and final conclusions
    """

    # ── Create Pro Agent ──
    pro_agent = StructuredDebateAgent(
        name="Pro",
        stance="For",
        system_prompt=(
            f"You are a logically rigorous debater. "
            f"Your position is [FOR] the following proposition:\n"
            f"\"{topic}\"\n\n"
            f"Core rules:\n"
            f"1. All arguments must be specific, verifiable — use data "
            f"and facts\n"
            f"2. Each argument must contain a clear causal reasoning chain\n"
            f"3. Honesty is the highest principle — when faced with "
            f"irrefutable challenges, concede rather than resort to sophistry\n"
            f"4. Strictly follow the format and constraints of each round"
        )
    )

    # ── Create Con Agent ──
    con_agent = StructuredDebateAgent(
        name="Con",
        stance="Against",
        system_prompt=(
            f"You are a logically rigorous debater. "
            f"Your position is [AGAINST] the following proposition:\n"
            f"\"{topic}\"\n\n"
            f"Core rules:\n"
            f"1. All arguments must be specific, verifiable — use data "
            f"and facts\n"
            f"2. Each argument must contain a clear causal reasoning chain\n"
            f"3. Honesty is the highest principle — when faced with "
            f"irrefutable challenges, concede rather than resort to sophistry\n"
            f"4. Strictly follow the format and constraints of each round"
        )
    )

    result = {"topic": topic, "rounds": {}}

    print(f"\n{'=' * 60}")
    print(f"Structured Debate: {topic}")
    print(f"{'=' * 60}")

    # ── R1: Opening Statements ──
    print(f"\n{'─' * 60}")
    print(f"Round 1: Opening Statements")
    print(f"{'─' * 60}")

    pro_args = pro_agent.opening_statement(topic)
    print(f"\nPro — {len(pro_args)} arguments")
    for arg in pro_args:
        print(f"  {arg.id}: {arg.claim[:80]}...")

    con_args = con_agent.opening_statement(topic)
    print(f"\nCon — {len(con_args)} arguments")
    for arg in con_args:
        print(f"  {arg.id}: {arg.claim[:80]}...")

    result["rounds"]["opening"] = {
        "pro_arguments": [
            {"id": a.id, "claim": a.claim,
             "reasoning": a.reasoning, "evidence": a.evidence}
            for a in pro_args
        ],
        "con_arguments": [
            {"id": a.id, "claim": a.claim,
             "reasoning": a.reasoning, "evidence": a.evidence}
            for a in con_args
        ]
    }

    # ── R2: Cross-Examination ──
    print(f"\n{'─' * 60}")
    print(f"Round 2: Cross-Examination")
    print(f"{'─' * 60}")

    pro_cross = pro_agent.cross_examine(con_args)
    print(f"\nPro cross-examining Con — {len(pro_cross)} responses")
    for r in pro_cross:
        print(f"  [{r.response_type}] -> {r.target_arg_id}")

    con_cross = con_agent.cross_examine(pro_args)
    print(f"\nCon cross-examining Pro — {len(con_cross)} responses")
    for r in con_cross:
        print(f"  [{r.response_type}] -> {r.target_arg_id}")

    result["rounds"]["cross_examination"] = {
        "pro_cross": [
            {"target": r.target_arg_id, "type": r.response_type,
             "reasoning": r.reasoning,
             "follow_up": r.follow_up_question}
            for r in pro_cross
        ],
        "con_cross": [
            {"target": r.target_arg_id, "type": r.response_type,
             "reasoning": r.reasoning,
             "follow_up": r.follow_up_question}
            for r in con_cross
        ]
    }

    # ── R3: Closing Statements ──
    print(f"\n{'─' * 60}")
    print(f"Round 3: Closing Statements")
    print(f"{'─' * 60}")

    pro_closing = pro_agent.closing_statement()
    print(f"\nPro closing:\n{pro_closing[:200]}...")

    con_closing = con_agent.closing_statement()
    print(f"\nCon closing:\n{con_closing[:200]}...")

    result["rounds"]["closing"] = {
        "pro_closing": pro_closing,
        "con_closing": con_closing
    }

    # ── Judge Evaluation ──
    print(f"\n{'=' * 60}")
    print(f"Judge Evaluation")
    print(f"{'=' * 60}")

    judge = StructuredJudge()
    evaluation = judge.evaluate(
        topic=topic,
        pro_args=pro_args,
        con_args=con_args,
        pro_cross=pro_cross,
        con_cross=con_cross,
        pro_closing=pro_closing,
        con_closing=con_closing
    )

    result["evaluation"] = evaluation

    # Print score summary
    if "overall_assessment" in evaluation:
        oa = evaluation["overall_assessment"]
        print(f"\nPro total score: {oa.get('pro_total_score', 'N/A')}")
        print(f"Con total score: {oa.get('con_total_score', 'N/A')}")
        print(f"\nKey insight: {oa.get('key_insight', 'N/A')}")

    # Print argument trace table
    if "argument_trace_table" in evaluation:
        print(f"\nArgument Trace Table:")
        for entry in evaluation["argument_trace_table"]:
            print(f"  {entry['argument_id']}: {entry['standing']} — "
                  f"{entry.get('claim', '')[:60]}...")

    return result


# ──────────────────────────────────────────────
# 6. Helper: format validation
# ──────────────────────────────────────────────
def validate_opening_args(
    args: list[Argument], expected_prefix: str
) -> list[str]:
    """
    Validate opening argument format completeness.
    Returns list of warnings; empty list means format is acceptable.
    """
    warnings = []
    for arg in args:
        if not arg.id.startswith(expected_prefix):
            warnings.append(
                f"{arg.id}: ID prefix should be {expected_prefix}"
            )
        if len(arg.claim) < 10:
            warnings.append(f"{arg.id}: claim too short (min 10 chars)")
        if len(arg.reasoning) < 20:
            warnings.append(f"{arg.id}: reasoning too short (min 20 chars)")
        if len(arg.evidence) < 5:
            warnings.append(f"{arg.id}: missing evidence")
    return warnings


def print_briefing(result: dict):
    """Print a briefing for the human decision-maker"""
    ev = result.get("evaluation", {})
    oa = ev.get("overall_assessment", {})

    print(f"\n{'=' * 60}")
    print(f"Decision Briefing")
    print(f"{'=' * 60}")
    print(f"\nTopic: {result['topic']}")
    print(f"\nKey Insight:\n  {oa.get('key_insight', 'N/A')}")
    print(f"\nRecommendation:\n  {oa.get('recommendation', 'N/A')}")

    unresolved = oa.get('unresolved_questions', [])
    if unresolved:
        print(f"\nUnresolved Questions:")
        for q in unresolved:
            print(f"  - {q}")


# ──────────────────────────────────────────────
# 7. Run the example
# ──────────────────────────────────────────────
if __name__ == "__main__":
    result = run_structured_debate(
        topic="Should a small startup (under 10 people) "
              "adopt microservices architecture from day one?"
    )

    # Print decision briefing
    print_briefing(result)

    # Save results
    with open("/tmp/structured_debate_result.json",
              "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    print("\nFull debate record saved to "
          "/tmp/structured_debate_result.json")

Code Structure Breakdown

Compared to L1's debate.py (~180 lines, 3 classes), the L2 code is heavier — but that weight comes from structure and auditability, not pointless complexity:

| Component | L1 Equivalent | New Capability in L2 |
| --- | --- | --- |
| StructuredDebateAgent | DebateAgent | Round-aware: opening_statement() / cross_examine() / closing_statement() as separate methods; arguments structured as Argument objects; JSON output for machine readability |
| StructuredJudge | JudgeAgent | Multi-dimensional scoring (logic/evidence/responsiveness/honesty, weighted); built-in 7-type fallacy detection; argument trace table; JSON structured output |
| RoundType | (none) | Enumeration of the 3 round types; the engine schedules by round |
| Argument / CrossExamResponse / ScoringResult | (free text) | Structured dataclasses giving strongly typed I/O for the debate |
| validate_opening_args() | (none) | Format validation enforcing a quality floor on arguments |
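For orientation, here is a minimal sketch of the enum and dataclasses the table refers to, reconstructed from the fields the code above actually reads. The article's real definitions live earlier in the file and may carry extra fields; the response_type values in the comment are illustrative, not an enumeration from the source.

```python
from dataclasses import dataclass
from enum import Enum


class RoundType(Enum):
    """The three phases the engine schedules in order."""
    OPENING = "opening"
    CROSS_EXAMINATION = "cross_examination"
    CLOSING = "closing"


@dataclass
class Argument:
    """One opening argument; ids are prefixed per side (e.g. P1, C1)."""
    id: str
    claim: str
    reasoning: str
    evidence: str


@dataclass
class CrossExamResponse:
    """One cross-examination response aimed at an opponent's argument."""
    target_arg_id: str
    response_type: str          # illustrative: challenge / concede / clarify
    reasoning: str
    follow_up_question: str = ""
```

Keeping these as plain dataclasses (rather than raw dicts) is what lets the engine and judge exchange strongly typed rounds.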
💡 Running tip: Replace your-api-key and api.example.com with your actual API credentials. Each 3-round structured debate run triggers about 8 LLM calls (3 rounds each for both sides + judge evaluation + possible format corrections), so budget your API quota accordingly.
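A hypothetical way to wire validate_opening_args into the opening round: run it immediately after each agent's opening_statement() and treat a non-empty warning list as a signal to retry rather than a hard failure. The Argument stub and validator are repeated here only so the example runs standalone.

```python
from dataclasses import dataclass


@dataclass
class Argument:  # stub mirroring the fields the validator reads
    id: str
    claim: str
    reasoning: str
    evidence: str


def validate_opening_args(args, expected_prefix):
    """Same checks as in the article's helper, repeated for a standalone demo."""
    warnings = []
    for arg in args:
        if not arg.id.startswith(expected_prefix):
            warnings.append(f"{arg.id}: ID prefix should be {expected_prefix}")
        if len(arg.claim) < 10:
            warnings.append(f"{arg.id}: claim too short (min 10 chars)")
        if len(arg.reasoning) < 20:
            warnings.append(f"{arg.id}: reasoning too short (min 20 chars)")
        if len(arg.evidence) < 5:
            warnings.append(f"{arg.id}: missing evidence")
    return warnings


# A malformed argument trips all four checks:
bad = Argument(id="X1", claim="too short", reasoning="thin", evidence="")
for w in validate_opening_args([bad], expected_prefix="P"):
    print(w)
```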

Extracting Actionable Information from Results

The code runs and you get a JSON result, but how do you read it? Here are three levels of reading, from coarse to fine.

Level 1: Total Scores

overall_assessment.pro_total_score and con_total_score give you a quantitative comparison of debate quality. But don't just look at who's higher — a gap under 1 point means the sides are evenly matched; a gap over 3 points indicates a significant difference.
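Those thresholds can be made explicit with a small helper (hypothetical, not part of the article's code) that reads the overall_assessment dict from the saved JSON:

```python
def interpret_score_gap(overall_assessment: dict) -> str:
    """Classify the Pro/Con score gap using the thresholds from the text."""
    pro = overall_assessment.get("pro_total_score")
    con = overall_assessment.get("con_total_score")
    if pro is None or con is None:
        return "scores unavailable"
    gap = abs(pro - con)
    if gap < 1:
        return f"evenly matched (gap {gap:.1f})"
    if gap > 3:
        return f"significant difference (gap {gap:.1f})"
    return f"moderate difference (gap {gap:.1f})"


print(interpret_score_gap({"pro_total_score": 7.5, "con_total_score": 7.1}))
# evenly matched (gap 0.4)
```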

Level 2: Argument Trace Table

This is the most practical part. argument_trace_table records the final status of each opening argument: every entry carries the argument's ID, its claim, and its final standing, so you can see at a glance which arguments survived cross-examination and which were refuted.
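A hypothetical helper makes refuted arguments jump out by grouping the trace table by standing (the standing labels themselves depend on what your judge prompt emits; "upheld"/"refuted" below are illustrative):

```python
from collections import defaultdict


def group_by_standing(trace_table: list[dict]) -> dict[str, list[str]]:
    """Map each standing label to the argument IDs that ended up there."""
    groups: dict[str, list[str]] = defaultdict(list)
    for entry in trace_table:
        groups[entry["standing"]].append(entry["argument_id"])
    return dict(groups)


trace = [
    {"argument_id": "P1", "standing": "upheld"},
    {"argument_id": "P2", "standing": "refuted"},
    {"argument_id": "C1", "standing": "upheld"},
]
print(group_by_standing(trace))
# {'upheld': ['P1', 'C1'], 'refuted': ['P2']}
```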

Level 3: Unresolved Questions

unresolved_questions lists the issues left hanging by the debate. These are information gaps you must verify yourself before making a decision. AI debate can't do everything for you — but it can precisely locate what you still need to do.
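One practical move is to turn unresolved_questions into a verification checklist for your own follow-up work. A sketch (the dict keys match the evaluation structure used earlier; the checklist format is my choice, not the article's):

```python
def to_checklist(evaluation: dict) -> str:
    """Render the judge's unresolved questions as a markdown checklist."""
    questions = (evaluation.get("overall_assessment", {})
                           .get("unresolved_questions", []))
    if not questions:
        return "(no unresolved questions recorded)"
    return "\n".join(f"- [ ] {q}" for q in questions)


ev = {"overall_assessment": {"unresolved_questions": [
    "What is the team's actual ops capacity?",
]}}
print(to_checklist(ev))
# - [ ] What is the team's actual ops capacity?
```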

⚠️ Don't blindly trust the scores: The judge is also an LLM and may have its own biases. Scores and trace tables are decision aids, not final verdicts. For truly critical decisions, you should read the debate transcript yourself and apply your own judgment. The AI debate system's role is to improve the organization and coverage of information — not to replace human judgment.

Protocol Limitations (An Honest Assessment)

No protocol is perfect. Here are the known limitations of this 3-round framework:

  1. Sensitive to model capability: When both agents use the same model, they share the same knowledge boundary and reasoning patterns. Two GPT-4o instances debating still can't see what GPT-4o doesn't know. The solution is to use different models for each agent (e.g., GPT-4o vs Claude), but that's not yet implemented in this article's code — saved for a future article.
  2. JSON parsing is fragile: LLM JSON output occasionally has errors (extra comma, missing quote). We've added graceful fallback to free text, but in production you may want more robust parsing (Schema-constrained generation or multiple retries).
  3. Cross-examination may be "surface-level": The opponent raises a challenge, Pro responds — but the judge may not assess whether the response is truly effective. The judge can only evaluate the surface quality of the response (directness, logical consistency), not factual accuracy.
  4. No external verification: The entire debate happens inside the LLM's "mind." If both sides cite a non-existent study, the judge can't detect it. Future articles will introduce RAG and tool calling to address this.
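On limitation 2, a common mitigation (a sketch of one standard approach, not what this article's code does) is to try progressively looser recovery strategies before falling back to free text: direct parse, then stripping markdown fences, then grabbing the outermost brace span.

```python
import json
import re


def parse_llm_json(raw: str):
    """Try to recover a JSON object from typical LLM output quirks.

    Returns the parsed object, or None if nothing salvageable is found
    (the caller can then fall back to treating the text as free prose).
    """
    # 1. Direct parse
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Strip ```json ... ``` fences
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Grab the outermost {...} span
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            pass
    return None


print(parse_llm_json('Here you go:\n```json\n{"score": 8}\n```'))
# {'score': 8}
```

For production use, schema-constrained generation (where the API supports it) is still preferable to after-the-fact repair, since it prevents malformed output rather than patching it.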

Key Takeaways

  1. Structure = reliability: Free-form debate easily falls into topic drift, false consensus, and shallow depth. The 3-round protocol (Opening → Cross-Exam → Closing) solves these with structural constraints.
  2. Cross-examination is the core of debate: Round 2's cross-examination is the most critical phase — it forces both sides to drill deep into each other's reasoning chains, exposing logical flaws and evidence gaps.
  3. Judges need a ruler, not a feeling: A multi-dimensional scoring rubric (logic/evidence/responsiveness/honesty) is more reliable and reproducible than a vague "who won."
  4. The argument trace table is the decision-maker's map: It compresses a lengthy debate into "which arguments stood and which were refuted" — the crucial bridge from debate to decision.
  5. AI debate is a decision aid, not a decision-maker: The judge's scores and trace tables are input for the human decision-maker — not a replacement.

📎 Series note: This is article 2 of the Multi-Agent Debate series. The previous article, L1: Why Debate Beats a Single Answer, introduced cognitive biases and the fundamentals of adversarial collaboration. Reading the series in order is recommended.

📖 Next: Debate Scoring & Consensus — score calibration, multi-judge systems, weighted voting, consensus metrics (coming soon)