In the previous article, we set two agents against each other to counter the cognitive biases of a single model. That debate.py code works, but it has an obvious flaw:
It has no structure. Pro speaks, then Con speaks, then Pro speaks again — it's essentially alternating monologues with no inherent logical framework. You could loop them 10 rounds, but after round 5, both sides are mostly just repeating themselves and going in circles.
Real debate — whether academic peer review, courtroom cross-examination, or presidential debates — has strict phase structure. Each phase has a clear objective and constraints. That's what this article brings to the multi-agent debate system.
Let's be precise about what's wrong with L1's "free-form" debate. It's not useless — it's far better than a single answer. But if you want to use it for genuinely important decisions (tech stack choices, investment strategy, product direction), here are the problems you need to know.
A debate without structural constraints is like a meeting without an agenda. Pro is talking about cost, Con suddenly pivots to security; Pro is still responding about security, Con jumps to team skill set. By round three, you've forgotten what the core proposition was even about.
A debate without anchors is a boxing match without a referee: both sides throw punches, but nobody knows what this round is actually about.
In free-form debate, Con might say "on the cost issue, I agree with Pro's analysis" — then both move on. A judge reading this might think "cost is settled — consensus reached." But Con may have only politely acknowledged the point without genuine concession. Or Con may have misunderstood but skipped it to keep the debate moving.
Without structured consensus tracking, you can't distinguish "actual agreement" from "underexamined potential disagreement."
Free-form debate encourages breadth — each round covers multiple arguments. But important decisions often require deep drilling into a single argument. Pro says "this approach has better performance," Con says "performance isn't the bottleneck" — and the topic moves on. Nobody asks: "Better by what definition? P99 latency? Throughput? Under what test conditions and benchmarks?"
L1's judge used a prompt: "Pro strengths, Con strengths, areas of agreement, uncertain areas, overall recommendation." Better than "who won," but still flawed: run it twice on the same debate and the conclusions can differ substantially, because the prompt gives the judge no specific scoring dimensions; it's just improvising.
| Problem | Free-Form Behavior | Structure Needed |
|---|---|---|
| Topic Drift | Topics shift arbitrarily; core proposition blurs | Each round has an explicit theme and constraints |
| False Consensus | Polite concessions mistaken for agreement | Structured consensus tracking and confirmation |
| Shallow Depth | Broad coverage but surface-level on each point | Cross-exam phase forces deep drilling on key arguments |
| Unreproducible | Improvised judgment; two runs, two results | Explicit scoring dimensions and decision rules |
Here's the protocol designed in this article. Three rounds, each with explicit objectives, inputs, outputs, and termination conditions.
| Round | Phase | Objective | Deliverable |
|---|---|---|---|
| R1 | Opening Statement | Present full position, list core arguments | 3-5 structured arguments (Claim + Reasoning + Evidence) |
| R2 | Cross-Examination | Challenge each opponent argument, drill into weak points | Challenge-response pair for each argument |
| R3 | Closing Statement | Synthesize the debate, concede valid rebuttals, state final position | Structured summary + positions held + concessions made |
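Before the full implementation, it helps to see the protocol expressed as data. The sketch below is illustrative only; `ROUND_SPEC` is a hypothetical name, and the actual code later in this article drives each round through a dedicated agent method rather than a lookup table.

```python
# Hypothetical summary of the 3-round protocol as data.
# The engine in this article calls one agent method per round instead.
ROUND_SPEC = [
    ("opening",    "3-5 structured arguments (claim + reasoning + evidence)"),
    ("cross_exam", "one challenge-response pair per opponent argument"),
    ("closing",    "concessions, surviving arguments, final position (<200 words)"),
]
```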
Objective: Both sides present their full position without interruption.
This isn't "say whatever comes to mind." Opening statements have strict formatting requirements:
Here's what that looks like in practice. Not this:
❌ "Microservices improve development efficiency because teams can work independently, and many big companies use them."
But this:
✅ Argument 1: Independent deployment shortens release cycles
Claim: Microservices architecture can significantly shorten the cycle from code commit to production deployment.
Reasoning: In monolithic architecture, any code change requires full build, full test, full deployment. For a 5-person team, each deployment cycle is approximately 4 hours. Microservices allow independent deployment — each service can be independently built, tested, and released without affecting others.
Evidence: Our benchmark testing shows that the same feature change takes an average of 3.8 hours to deploy in monolithic architecture vs. 0.7 hours in microservices (P95: 6.2 hours monolithic, 1.4 hours microservices). That's a 5.4x difference.
See the difference? The latter is cross-examinable — the opponent can ask: "Under what conditions were your benchmarks run? Are the test cases representative? Why does the P95 improvement drop to only 4.4x?"
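For reference, here is the same argument in the machine-readable shape the Round 1 prompt (later in this article) asks for. The values are condensed from the example above; the field names match the `Argument` dataclass in the implementation.

```python
# The well-formed argument above, in the JSON shape requested in Round 1.
example_argument = {
    "id": "PRO-1",
    "claim": "Microservices significantly shorten the cycle from code commit "
             "to production deployment.",
    "reasoning": "A monolith needs a full build, full test run, and full deploy "
                 "for any change; independently deployable services do not.",
    "evidence": "Benchmark: 3.8 h average deploy time (monolith) vs 0.7 h "
                "(microservices); P95 6.2 h vs 1.4 h.",
}
```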
Objective: This is the most critical phase of the entire protocol. Both sides must respond to each and every one of the opponent's opening arguments — no skipping, no vague hand-waving.
Cross-examination has three mandatory requirements:

1. Respond to every one of the opponent's opening arguments; skipping is not allowed.
2. Classify each response as refute, challenge, concede, or partial, backed by explicit reasoning and a sharp follow-up question.
3. Introduce no new arguments; cross-examination only engages with what is already on the table.

(A concrete example of a single entry is shown after the next paragraph.)
This phase mirrors the academic peer review process — a reviewer must address specific claims in the paper, not say "overall this direction feels wrong" and reject it.
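To make the deliverable concrete, here is one cross-examination entry in the format the Round 2 prompt requests. The field names match the `CrossExamResponse` dataclass in the implementation; the content itself is an invented illustration.

```python
# One illustrative cross-examination entry (content is hypothetical).
example_cross_exam = {
    "target_arg_id": "PRO-1",
    "response_type": "challenge",   # refute | challenge | concede | partial
    "reasoning": "The 5.4x figure comes from a single benchmark; it is unclear "
                 "whether the test cases reflect real release traffic.",
    "follow_up_question": "Under what load profile and team size were the "
                          "deployment benchmarks measured?",
}
```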
Objective: After seeing the opponent's cross-examination, both sides make their final case. This is not "repeat the opening statement."
The closing statement has three components:

1. Concessions made: which opponent arguments (or parts of them) you accepted during cross-examination.
2. Arguments not effectively rebutted: which of your opening arguments survived the opponent's challenges.
3. Final position: your current overall stance, and if it shifted (strengthened, weakened, partially adjusted), why.
The closing statement doesn't need to be long. 100-200 words is plenty — the judge needs the distilled final state, not a third repetition.
L1's judge was already better than "who won" — it used 5 dimensions. But we need a more rigorous design where judgments are reproducible, auditable, and quantifiable.
The judge doesn't give a vague "Pro wins" or "Con wins." It independently scores each opening argument on both sides, then aggregates:
| Scoring Dimension | Criteria | Weight |
|---|---|---|
| Logic (1-10) | Is the reasoning chain self-consistent? Any gaps, circular reasoning, or equivocation? | 30% |
| Evidence Quality (1-10) | Is the evidence specific, verifiable, and relevant? Or just vague generalities? | 30% |
| Responsiveness (1-10) | Did they respond directly to cross-examination? Point-by-point rebuttal or evasion? | 25% |
| Honesty (1-10) | Did they concede when warranted? Any exaggeration or distortion? | 15% |
Each opening argument's final score = Logic × 0.3 + Evidence × 0.3 + Responsiveness × 0.25 + Honesty × 0.15. The average across all arguments is that side's total score.
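A quick worked example of the weighting; the scores here are purely illustrative:

```python
# Illustrative scores for one argument.
logic, evidence, responsiveness, honesty = 8, 7, 6, 8
weighted = logic * 0.30 + evidence * 0.30 + responsiveness * 0.25 + honesty * 0.15
print(weighted)  # 7.2
```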
Note: This scoring rubric measures debate quality, not "whose position is more correct." A position could be wrong, but if it's argued rigorously with strong evidence and honest responses — it deserves a high score. Conversely, a position could be right, but argued poorly — it deserves a low score.
Beyond scoring, the judge also detects common logical fallacies. We embed a fallacy checklist in the judge's system prompt:
| Fallacy | Definition | Example |
|---|---|---|
| Straw Man | Distorting the opponent's argument to attack something they didn't say | "Pro thinks monoliths are worthless" — Pro never said that |
| Appeal to Authority | Using an authority's name in place of actual reasoning | "Google uses microservices, so you should too" — lacks context |
| Slippery Slope | Assuming one action triggers an uncontrollable chain reaction | "If you pick microservices, you'll need K8s, then a DevOps team, then you'll be 20 people…" |
| False Dilemma | Reducing a complex issue to an either-or choice | "You're either all microservices or all monolithic" — ignoring modular monoliths |
| Anecdotal Evidence | Using isolated cases instead of systematic evidence | "I saw one project collapse after adopting microservices" — N=1 |
The judge doesn't need to make final determinations on every suspected fallacy — it just needs to flag suspicious reasoning patterns and list them in the report.
This is the most practical part of the judge's output. It's a table tracking every opening argument's fate from introduction to conclusion:
| Argument ID | Proposer | Core Claim | Cross-Exam Result | Final Status |
|---|---|---|---|---|
| PRO-1 | Pro | Independent deployment shortens release cycles 5.4x | Con challenged test conditions | Partially Upheld — test environment idealized |
| CON-2 | Con | Distributed systems increase operational complexity | Pro failed to effectively rebut | Upheld — ops cost will indeed increase |
| PRO-3 | Pro | Team tech stack flexibility improves | Con partially conceded | Upheld — both sides agree |
The value of this table: you won't drown in a long debate transcript. At a glance, you see which arguments stood, which were refuted, and which need more information to judge. This is the core advantage of structured debate over free-form.
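In the JSON the judge returns (see the implementation below), each row of this table is one entry of `argument_trace_table`. The values in this sketch are illustrative:

```python
# One illustrative argument_trace_table entry from the judge's JSON output.
trace_entry = {
    "argument_id": "PRO-1",
    "claim": "Independent deployment shortens release cycles 5.4x",
    "standing": "PARTIALLY_UPHELD",  # UPHELD | PARTIALLY_UPHELD | REFUTED | UNCERTAIN
    "reason": "Con showed the benchmark conditions were idealized.",
}
```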
Below is the complete implementation. It inherits L1's debate.py architecture (DebateAgent + JudgeAgent + engine function) and adds protocol round management, argument format validation, multi-dimensional scoring, and fallacy detection.
Save it as debate_protocol.py in the same directory as debate.py and you're ready to run.
"""
Structured Debate Protocol — 3-Round Debate + Multi-Dimensional Judge Scoring
Extends L1's debate.py with protocol round management.
Requires: pip install openai
"""
import os
import json
import re
from enum import Enum
from dataclasses import dataclass, field
from openai import OpenAI
# ──────────────────────────────────────────────
# 1. Initialize LLM client (placeholder credentials)
# ──────────────────────────────────────────────
client = OpenAI(
api_key="your-api-key",
base_url="https://api.example.com/v1"
)
# ──────────────────────────────────────────────
# 2. Data structures
# ──────────────────────────────────────────────
class RoundType(Enum):
OPENING = "opening" # Opening statement
CROSS_EXAM = "cross_exam" # Cross-examination
CLOSING = "closing" # Closing statement
@dataclass
class Argument:
"""A structured argument"""
id: str # Argument ID, e.g. PRO-1, CON-3
claim: str # Core assertion
reasoning: str # Reasoning chain
evidence: str # Supporting evidence
def to_text(self) -> str:
return (
f"[{self.id}] Claim: {self.claim}\n"
f"Reasoning: {self.reasoning}\n"
f"Evidence: {self.evidence}"
)
@dataclass
class CrossExamResponse:
"""Response to a single argument during cross-examination"""
target_arg_id: str # Target argument ID
response_type: str # refute | challenge | concede | partial
reasoning: str # Response reasoning
follow_up_question: str # Follow-up question
def to_text(self) -> str:
return (
f"Response to [{self.target_arg_id}] [{self.response_type}]:\n"
f"{self.reasoning}\n"
f"Follow-up: {self.follow_up_question}"
)
@dataclass
class ScoringResult:
"""Judge's scoring of a single argument"""
argument_id: str
logic_score: int # 1-10
evidence_score: int # 1-10
responsiveness_score: int # 1-10
honesty_score: int # 1-10
fallacies_detected: list[str] = field(default_factory=list)
notes: str = ""
@property
def weighted_score(self) -> float:
return (
self.logic_score * 0.30 +
self.evidence_score * 0.30 +
self.responsiveness_score * 0.25 +
self.honesty_score * 0.15
)
# ──────────────────────────────────────────────
# 3. Structured Debate Agent (extends L1's DebateAgent)
# ──────────────────────────────────────────────
class StructuredDebateAgent:
"""
A debate agent that holds a specific stance and generates
structured output per protocol round.
Key differences from L1's DebateAgent:
- Outputs are structured as Arguments with explicit IDs
- Separate methods per round (opening / cross_examine / closing)
- Maintains its own argument list for cross-examination reference
"""
def __init__(self, name: str, stance: str, system_prompt: str):
self.name = name
self.stance = stance
self.system_prompt = system_prompt
self.history: list[dict] = []
self.arguments: list[Argument] = [] # This side's opening arguments
def _prefix(self) -> str:
"""Return argument ID prefix"""
return "PRO" if "For" in self.stance else "CON"
    def _call_llm(self, user_prompt: str, temperature: float = 0.7,
                  max_tokens: int = 1000) -> str:
        """Unified LLM call wrapper; records both sides of the exchange"""
        # Store the user prompt so later rounds see the full conversation,
        # not just this agent's own previous replies.
        self.history.append({"role": "user", "content": user_prompt})
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend(self.history)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )
        reply = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply
def opening_statement(self, topic: str) -> list[Argument]:
"""
Round 1: Opening Statement.
Output 3-5 structured arguments, each with claim/reasoning/evidence.
"""
prefix = self._prefix()
prompt = (
f"Topic: \"{topic}\"\n"
f"Your position: {self.stance}\n\n"
f"Deliver your opening statement. Output 3-5 arguments "
f"as a JSON array.\n"
f"Each argument must follow this JSON format:\n"
f'{{"id": "{prefix}-N", "claim": "core assertion", '
f'"reasoning": "reasoning chain", "evidence": "evidence or data"}}\n\n'
f"Requirements:\n"
f"1. ID format: {prefix}-1, {prefix}-2, incrementing\n"
f"2. claim must be a specific, falsifiable proposition\n"
f"3. reasoning must contain a causal chain, not mere assertion\n"
f"4. evidence must be verifiable specific facts/data, "
f"not vague phrases like \"many companies\"\n"
f"5. Output pure JSON array only, no other text"
)
reply = self._call_llm(prompt, temperature=0.6, max_tokens=1200)
# Parse JSON response
try:
# Strip possible markdown code block markers
cleaned = re.sub(r'```(?:json)?\s*', '', reply).strip()
data = json.loads(cleaned)
self.arguments = [
Argument(
id=item["id"],
claim=item["claim"],
reasoning=item["reasoning"],
evidence=item["evidence"]
)
for item in data
]
return self.arguments
except (json.JSONDecodeError, KeyError) as e:
# JSON parse failed — fallback to free text
print(f"Warning: {self.name} JSON parse failed ({e}), "
f"using free text fallback.")
self.arguments = [
Argument(
id=f"{prefix}-1",
claim="Opening statement (JSON parse failed, "
"see raw response)",
reasoning=reply,
evidence=""
)
]
return self.arguments
def cross_examine(
self, opponent_args: list[Argument]
) -> list[CrossExamResponse]:
"""
Round 2: Cross-Examination.
Respond to each of the opponent's opening arguments point by point.
"""
opponent_args_text = "\n\n".join(
arg.to_text() for arg in opponent_args
)
prompt = (
f"Below are your opponent's opening arguments. "
f"Respond to each one.\n\n"
f"{opponent_args_text}\n\n"
f"For each argument, output a JSON object:\n"
f'{{"target_arg_id": "opponent argument ID", '
f'"response_type": "refute|challenge|concede|partial", '
f'"reasoning": "your reasoning", '
f'"follow_up_question": "a sharp follow-up question"}}\n\n'
f"response_type meanings:\n"
f"- refute: you believe the argument has factual/logical errors\n"
f"- challenge: you believe evidence is insufficient or "
f"conditions aren't met\n"
f"- concede: you accept the argument as valid\n"
f"- partial: you accept the core but dispute degree or scope\n\n"
f"Requirements:\n"
f"1. Must respond to ALL opponent arguments — do not skip any\n"
f"2. Do NOT introduce new arguments (this is cross-examination, "
f"only respond to existing arguments)\n"
f"3. Output pure JSON array only, no other text"
)
reply = self._call_llm(prompt, temperature=0.5, max_tokens=1500)
try:
cleaned = re.sub(r'```(?:json)?\s*', '', reply).strip()
data = json.loads(cleaned)
return [
CrossExamResponse(
target_arg_id=item["target_arg_id"],
response_type=item["response_type"],
reasoning=item["reasoning"],
follow_up_question=item["follow_up_question"]
)
for item in data
]
except (json.JSONDecodeError, KeyError) as e:
print(f"Warning: {self.name} cross-exam JSON parse failed "
f"({e}), using free text.")
return [
CrossExamResponse(
target_arg_id=arg.id,
response_type="challenge",
reasoning=f"JSON parse failed. Raw response:\n{reply}",
follow_up_question="Please clarify the above."
)
for arg in opponent_args
]
def closing_statement(self) -> str:
"""
Round 3: Closing Statement.
Includes concessions made, arguments not rebutted, final position.
"""
my_args_text = "\n".join(arg.to_text() for arg in self.arguments)
prompt = (
f"Review of your opening arguments:\n{my_args_text}\n\n"
f"Deliver your closing statement. Structure as follows:\n\n"
f"## Concessions Made\n"
f"List the opponent arguments or partial arguments you accepted "
f"during cross-examination.\n\n"
f"## Arguments Not Effectively Rebutted\n"
f"Restate the core arguments from your opening that the opponent "
f"failed to effectively challenge.\n\n"
f"## Final Position\n"
f"Based on the above, what is your current overall position on "
f"the topic? If it has shifted (strengthened, weakened, "
f"partially adjusted), explain why.\n\n"
f"Requirements: total length under 200 words. Concise and sharp."
)
return self._call_llm(prompt, temperature=0.5, max_tokens=600)
# ──────────────────────────────────────────────
# 4. Structured Judge Agent (multi-dimension scoring + fallacy detection)
# ──────────────────────────────────────────────
class StructuredJudge:
"""
Judge Agent — multi-dimensional scoring, fallacy detection,
argument tracing.
Key differences from L1's JudgeAgent:
- Independently scores each argument (logic/evidence/response/honesty)
- Built-in logical fallacy detection checklist
- Generates argument trace table
- Outputs JSON-structured conclusions instead of free text
"""
FALLACY_CHECKLIST = [
("Straw Man",
"Is it distorting the opponent's argument — attacking something "
"they didn't say?"),
("Appeal to Authority",
"Is it using \"big company X uses it\" in place of actual reasoning?"),
("Slippery Slope",
"Is it assuming one action triggers an uncontrollable "
"chain reaction?"),
("False Dilemma",
"Is it reducing a complex issue to an either-or choice?"),
("Anecdotal Evidence",
"Is it using isolated cases instead of systematic evidence?"),
("Circular Reasoning",
"Is the conclusion already contained in the premise?"),
("Ad Hominem",
"Is it attacking the opponent rather than their argument?"),
]
def evaluate(
self,
topic: str,
pro_args: list[Argument],
con_args: list[Argument],
pro_cross: list[CrossExamResponse],
con_cross: list[CrossExamResponse],
pro_closing: str,
con_closing: str
) -> dict:
"""
Comprehensive evaluation of the entire debate.
Outputs structured conclusions.
"""
# Build the full evaluation request
pro_args_text = "\n\n".join(a.to_text() for a in pro_args)
con_args_text = "\n\n".join(a.to_text() for a in con_args)
pro_cross_text = "\n\n".join(r.to_text() for r in pro_cross)
con_cross_text = "\n\n".join(r.to_text() for r in con_cross)
evaluation_prompt = (
f"## Topic\n{topic}\n\n"
f"## Pro Opening Arguments\n{pro_args_text}\n\n"
f"## Con Opening Arguments\n{con_args_text}\n\n"
f"## Pro Cross-Examination of Con\n{pro_cross_text}\n\n"
f"## Con Cross-Examination of Pro\n{con_cross_text}\n\n"
f"## Pro Closing Statement\n{pro_closing}\n\n"
f"## Con Closing Statement\n{con_closing}\n\n"
)
fallacy_rules = "\n".join(
f" - {name}: {desc}"
for name, desc in self.FALLACY_CHECKLIST
)
system_prompt = (
"You are a strictly impartial debate judge. Your task is to "
"evaluate the entire debate according to the standardized "
"process below.\n\n"
"### Scoring Rules\n"
"For each opening argument (Pro's PRO-1, PRO-2... and Con's "
"CON-1, CON-2...) score on these four dimensions "
"(1-10, must be integer):\n"
"1. logic_score: Is the reasoning chain self-consistent? "
"1=full of logical gaps, 10=flawless\n"
"2. evidence_score: Is evidence specific and verifiable? "
"1=all vague generalities, 10=each piece independently verifiable\n"
"3. responsiveness_score: How well did they respond to "
"cross-examination? 1=evaded all questions, "
"10=point-by-point direct response\n"
"4. honesty_score: Did they concede when warranted? Any "
"exaggeration? 1=full of sophistry and distortion, "
"10=honest and fair\n\n"
"### Fallacy Detection\n"
"For each argument, check for the following logical fallacies. "
"If detected, list them in the fallacies array:\n"
f"{fallacy_rules}\n\n"
"### Output Format\n"
"Output strictly as the following JSON format, "
"no other text:\n"
'{\n'
' "scores": [\n'
' {\n'
' "argument_id": "PRO-1",\n'
' "logic_score": 8,\n'
' "evidence_score": 7,\n'
' "responsiveness_score": 6,\n'
' "honesty_score": 8,\n'
' "fallacies": ["empty list if none detected"],\n'
' "notes": "Brief comment on this argument"\n'
' }\n'
' ],\n'
' "argument_trace_table": [\n'
' {\n'
' "argument_id": "PRO-1",\n'
' "claim": "Summary of core claim",\n'
' "standing": "UPHELD|PARTIALLY_UPHELD|REFUTED|'
'UNCERTAIN",\n'
' "reason": "Brief explanation"\n'
' }\n'
' ],\n'
' "overall_assessment": {\n'
' "pro_total_score": 0.0,\n'
' "con_total_score": 0.0,\n'
' "key_insight": "The single most important finding from '
'this debate (1-2 sentences)",\n'
' "unresolved_questions": ["Unresolved points of contention"],\n'
' "recommendation": "Based on the debate results, what '
'specific advice for the decision-maker?"\n'
' }\n'
'}'
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": (
f"Please evaluate the following debate.\n\n"
f"{evaluation_prompt}"
)}
],
temperature=0.2, # Very low temp for consistency & reproducibility
max_tokens=3000
)
reply = response.choices[0].message.content
try:
cleaned = re.sub(r'```(?:json)?\s*', '', reply).strip()
result = json.loads(cleaned)
return result
except json.JSONDecodeError as e:
print(f"Warning: Judge JSON parse failed ({e}), "
f"returning raw text.")
return {
"error": "JSON parse failed",
"raw_response": reply,
"scores": [],
"argument_trace_table": [],
"overall_assessment": {
"pro_total_score": 0,
"con_total_score": 0,
"key_insight": "Evaluation failed — see raw_response",
"unresolved_questions": [],
"recommendation": ""
}
}
# ──────────────────────────────────────────────
# 5. Debate Engine — orchestrates the 3-round protocol
# ──────────────────────────────────────────────
def run_structured_debate(topic: str) -> dict:
"""
Run a complete 3-round structured debate.
Returns:
dict: Contains all round records, scores, and final conclusions
"""
# ── Create Pro Agent ──
pro_agent = StructuredDebateAgent(
name="Pro",
stance="For",
system_prompt=(
f"You are a logically rigorous debater. "
f"Your position is [FOR] the following proposition:\n"
f"\"{topic}\"\n\n"
f"Core rules:\n"
f"1. All arguments must be specific, verifiable — use data "
f"and facts\n"
f"2. Each argument must contain a clear causal reasoning chain\n"
f"3. Honesty is the highest principle — when faced with "
f"unrefutable challenges, concede rather than sophize\n"
f"4. Strictly follow the format and constraints of each round"
)
)
# ── Create Con Agent ──
con_agent = StructuredDebateAgent(
name="Con",
stance="Against",
system_prompt=(
f"You are a logically rigorous debater. "
f"Your position is [AGAINST] the following proposition:\n"
f"\"{topic}\"\n\n"
f"Core rules:\n"
f"1. All arguments must be specific, verifiable — use data "
f"and facts\n"
f"2. Each argument must contain a clear causal reasoning chain\n"
f"3. Honesty is the highest principle — when faced with "
f"unrefutable challenges, concede rather than sophize\n"
f"4. Strictly follow the format and constraints of each round"
)
)
result = {"topic": topic, "rounds": {}}
print(f"\n{'=' * 60}")
print(f"Structured Debate: {topic}")
print(f"{'=' * 60}")
# ── R1: Opening Statements ──
print(f"\n{'─' * 60}")
print(f"Round 1: Opening Statements")
print(f"{'─' * 60}")
pro_args = pro_agent.opening_statement(topic)
print(f"\nPro — {len(pro_args)} arguments")
for arg in pro_args:
print(f" {arg.id}: {arg.claim[:80]}...")
con_args = con_agent.opening_statement(topic)
print(f"\nCon — {len(con_args)} arguments")
for arg in con_args:
print(f" {arg.id}: {arg.claim[:80]}...")
result["rounds"]["opening"] = {
"pro_arguments": [
{"id": a.id, "claim": a.claim,
"reasoning": a.reasoning, "evidence": a.evidence}
for a in pro_args
],
"con_arguments": [
{"id": a.id, "claim": a.claim,
"reasoning": a.reasoning, "evidence": a.evidence}
for a in con_args
]
}
# ── R2: Cross-Examination ──
print(f"\n{'─' * 60}")
print(f"Round 2: Cross-Examination")
print(f"{'─' * 60}")
pro_cross = pro_agent.cross_examine(con_args)
print(f"\nPro cross-examining Con — {len(pro_cross)} responses")
for r in pro_cross:
print(f" [{r.response_type}] -> {r.target_arg_id}")
con_cross = con_agent.cross_examine(pro_args)
print(f"\nCon cross-examining Pro — {len(con_cross)} responses")
for r in con_cross:
print(f" [{r.response_type}] -> {r.target_arg_id}")
result["rounds"]["cross_examination"] = {
"pro_cross": [
{"target": r.target_arg_id, "type": r.response_type,
"reasoning": r.reasoning,
"follow_up": r.follow_up_question}
for r in pro_cross
],
"con_cross": [
{"target": r.target_arg_id, "type": r.response_type,
"reasoning": r.reasoning,
"follow_up": r.follow_up_question}
for r in con_cross
]
}
# ── R3: Closing Statements ──
print(f"\n{'─' * 60}")
print(f"Round 3: Closing Statements")
print(f"{'─' * 60}")
pro_closing = pro_agent.closing_statement()
print(f"\nPro closing:\n{pro_closing[:200]}...")
con_closing = con_agent.closing_statement()
print(f"\nCon closing:\n{con_closing[:200]}...")
result["rounds"]["closing"] = {
"pro_closing": pro_closing,
"con_closing": con_closing
}
# ── Judge Evaluation ──
print(f"\n{'=' * 60}")
print(f"Judge Evaluation")
print(f"{'=' * 60}")
judge = StructuredJudge()
evaluation = judge.evaluate(
topic=topic,
pro_args=pro_args,
con_args=con_args,
pro_cross=pro_cross,
con_cross=con_cross,
pro_closing=pro_closing,
con_closing=con_closing
)
result["evaluation"] = evaluation
# Print score summary
if "overall_assessment" in evaluation:
oa = evaluation["overall_assessment"]
print(f"\nPro total score: {oa.get('pro_total_score', 'N/A')}")
print(f"Con total score: {oa.get('con_total_score', 'N/A')}")
print(f"\nKey insight: {oa.get('key_insight', 'N/A')}")
# Print argument trace table
if "argument_trace_table" in evaluation:
print(f"\nArgument Trace Table:")
for entry in evaluation["argument_trace_table"]:
print(f" {entry['argument_id']}: {entry['standing']} — "
f"{entry.get('claim', '')[:60]}...")
return result
# ──────────────────────────────────────────────
# 6. Helper: format validation
# ──────────────────────────────────────────────
def validate_opening_args(
args: list[Argument], expected_prefix: str
) -> list[str]:
"""
Validate opening argument format completeness.
Returns list of warnings; empty list means format is acceptable.
"""
warnings = []
for arg in args:
if not arg.id.startswith(expected_prefix):
warnings.append(
f"{arg.id}: ID prefix should be {expected_prefix}"
)
if len(arg.claim) < 10:
warnings.append(f"{arg.id}: claim too short (min 10 chars)")
if len(arg.reasoning) < 20:
warnings.append(f"{arg.id}: reasoning too short (min 20 chars)")
if len(arg.evidence) < 5:
warnings.append(f"{arg.id}: missing evidence")
return warnings
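# Usage sketch (not wired into the engine above): call validate_opening_args()
# right after each opening statement to surface format problems early, e.g.:
#   issues = validate_opening_args(pro_args, expected_prefix="PRO")
#   for w in issues:
#       print("format warning:", w)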
def print_briefing(result: dict):
"""Print a briefing for the human decision-maker"""
ev = result.get("evaluation", {})
oa = ev.get("overall_assessment", {})
print(f"\n{'=' * 60}")
print(f"Decision Briefing")
print(f"{'=' * 60}")
print(f"\nTopic: {result['topic']}")
print(f"\nKey Insight:\n {oa.get('key_insight', 'N/A')}")
print(f"\nRecommendation:\n {oa.get('recommendation', 'N/A')}")
unresolved = oa.get('unresolved_questions', [])
if unresolved:
print(f"\nUnresolved Questions:")
for q in unresolved:
print(f" - {q}")
# ──────────────────────────────────────────────
# 7. Run the example
# ──────────────────────────────────────────────
if __name__ == "__main__":
result = run_structured_debate(
topic="Should a small startup (under 10 people) "
"adopt microservices architecture from day one?"
)
# Print decision briefing
print_briefing(result)
# Save results
with open("/tmp/structured_debate_result.json",
"w", encoding="utf-8") as f:
json.dump(result, f, ensure_ascii=False, indent=2)
print("\nFull debate record saved to "
"/tmp/structured_debate_result.json")
Compared to L1's debate.py (~180 lines, 3 classes), the L2 code is heavier — but that weight comes from structure and auditability, not pointless complexity:
| Component | L1 Equivalent | New Capability in L2 |
|---|---|---|
| `StructuredDebateAgent` | `DebateAgent` | Round-aware: `opening_statement()` / `cross_examine()` / `closing_statement()` as separate methods; arguments structured as `Argument` objects; JSON output for machine readability |
| `StructuredJudge` | `JudgeAgent` | Multi-dimensional scoring (logic / evidence / responsiveness / honesty, weighted); built-in 7-type fallacy detection; argument trace table; JSON structured output |
| `RoundType` | (none) | Enumeration of the 3 round types; the engine schedules by round |
| `Argument` / `CrossExamResponse` / `ScoringResult` | (free text) | Structured dataclasses; strongly typed I/O for the debate |
| `validate_opening_args()` | (none) | Format validation function that sets a quality floor for arguments |
Replace `your-api-key` and `https://api.example.com/v1` with your actual API credentials. Each 3-round structured debate triggers seven LLM calls (three rounds for each side plus one judge evaluation), so budget your API quota accordingly.
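If you prefer not to hardcode credentials, a common pattern is to read them from environment variables. This is a minimal sketch, assuming you export `OPENAI_API_KEY` (and optionally `OPENAI_BASE_URL`) before running; these variable names are conventions, not something the script above requires:

```python
import os
from openai import OpenAI

# Minimal sketch: read credentials from the environment instead of hardcoding.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
)
```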
Code runs, you get a JSON result. But how do you read it? Here are three levels of reading.
overall_assessment.pro_total_score and con_total_score give you a quantitative comparison of debate quality. But don't just look at who's higher — a gap under 1 point means the sides are evenly matched; a gap over 3 points indicates a significant difference.
This is the most practical part. argument_trace_table gives the final status of each opening argument: UPHELD, PARTIALLY_UPHELD, REFUTED, or UNCERTAIN, each with a one-line reason.
unresolved_questions lists the issues left hanging by the debate. These are information gaps you must verify yourself before making a decision. AI debate can't do everything for you — but it can precisely locate what you still need to do.
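If you want to revisit a run after the fact, the saved file can be reloaded and reduced to exactly these three levels. A minimal sketch, using the output schema produced by the judge above and the file path from the example run:

```python
import json

# Reload a saved run and pull out the decision-relevant parts.
with open("/tmp/structured_debate_result.json", encoding="utf-8") as f:
    saved = json.load(f)

ev = saved.get("evaluation", {})
oa = ev.get("overall_assessment", {})
print("Pro:", oa.get("pro_total_score"), "| Con:", oa.get("con_total_score"))
for row in ev.get("argument_trace_table", []):
    print(f"{row['argument_id']}: {row['standing']}")
for q in oa.get("unresolved_questions", []):
    print("Verify before deciding:", q)
```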
No protocol is perfect. Here are the known limitations of this 3-round framework:

- The round count is fixed: there is no second cross-examination, so a weak answer to a follow-up question never gets pressed further.
- Everything hinges on well-formed JSON; when parsing fails, the agents fall back to free text and much of the structure is lost.
- The judge is still a single model run; a low temperature improves consistency but does not guarantee identical scores across runs.
- Cost scales with structure: roughly seven LLM calls per debate, before any retries or validation you add.
📎 Series note: This is article 2 of the Multi-Agent Debate series. The previous article L1: Why Debate Beats a Single Answer introduced cognitive biases and adversarial collaboration fundamentals. Recommended reading order.
📖 Next: Debate Scoring & Consensus — score calibration, multi-judge systems, weighted voting, consensus metrics (coming soon)