The Debate Protocol — Structured Adversarial Cross-Examination Engine for 8 Agents

2026-05-15 · Advanced

30-Second Takeaway

Problem solved: Article 1 built the data pipeline, but data doesn't debate. You need a protocol engine that transforms a structured knowledge base into competing analyses, cross-examination, and synthesized conclusions — with every agent output being comparable, evidence-backed, and structured rather than free-form opinion.
Core method: a 3-round structured debate protocol. Opening Statements (parallel, each agent argues from its analytical lens), Cross-Examination (paired, counterpart agents challenge each other's logic gaps and data blind spots), and Closing Statements (parallel, agents refine arguments based on criticism, conceding or doubling down). All outputs follow a strict JSON format: claim + evidence + confidence + counterpoints.
Key insight: The 4-dimension judge scoring (Logic 30% + Evidence 30% + Clarity 20% + Persuasiveness 20%) starts as intuition-driven weights — Article 3 will calibrate them with data. Prompt engineering is the debate quality control layer: each agent's system prompt defines not just stance, but constrains output format, evidence citation requirements, and reasoning boundaries.
What you can do after reading: Run debate_protocol.py (~300 lines) to connect Article 1's knowledge base JSON to the full debate engine. You'll get a structured debate transcript, with every argument timestamped, attributed, and scored, ready for the judge agent to synthesize.

The knowledge base is ready. Seven global indices, ten sector ETFs, ten macro indicators—all structured and stored, each agent with its own custom data slice. But data is just fuel. Now we need the engine: the debate protocol engine.

Before diving into the code, here is a preview of what we are building. This is the output from the tech_bull agent during the opening round of a debate on ExampleIndex (a fictional tech-heavy index):

{
  "agent": "tech_bull",
  "round": "opening",
  "arguments": [
    {
      "claim": "ExampleIndex is in a confirmed uptrend with price above all key moving averages",
      "evidence": "Price 4,850 sits above MA20(4,720), MA50(4,510), and MA200(4,120). All three MAs are in bullish alignment: MA20 > MA50 > MA200.",
      "confidence": 0.85,
      "counterpoints": ["RSI 14 is at 62, approaching overbought territory but not yet triggered"]
    },
    {
      "claim": "MACD generated a bullish crossover 3 days ago, momentum is accelerating",
      "evidence": "MACD line 58.3 crossed above signal line 42.1. Histogram is positive and expanding. Volume on signal day was 1.4× the 20-day average, confirming breakout validity.",
      "confidence": 0.78,
      "counterpoints": ["A single technical signal does not constitute an independent trading thesis; must be confirmed with fundamentals"]
    },
    {
      "claim": "Sector breadth is healthy: 70% of constituent sectors show positive 20-day returns",
      "evidence": "7 out of 10 constituent sectors show positive 20-day returns. Technology and Industrials are leading, defensives lagging — classic risk-on rotation pattern.",
      "confidence": 0.72,
      "counterpoints": ["Breadth is narrowing — down from 9/10 positive 30 days ago"]
    }
  ]
}

This is not free-form conversation. This is structured adversarial debate—every argument carries a claim, evidence, confidence score, and counterpoints, precisely referenceable and challengeable by opponents. The tech_bear will attack these three arguments point by point in the cross-examination round. The macro_bear will attack from an entirely different dimension. And a judge will score every single argument across four dimensions.

This is Article 2 of the Multi-Agent Debate × Market Analysis series. In Article 1, we built the data pipeline and system architecture. Now we build the debate engine itself—how eight specialized agents tear each other's arguments apart across three structured rounds, and how an impartial judge synthesizes this conflict into an unhedged market analysis.

Why Structured Debate Beats Free-Form Discussion

You might wonder: why not just dump the knowledge base into eight agents and let them chat freely about market direction?

Because free-form discussion has three fatal flaws for market analysis:

Flaw	In Free-Form Discussion	Harm in Market Context
Argument Drift	Conversation slides from "what do technicals say?" to "what will the Fed do?" to "geopolitical risk"—no anchor point.	The technical analyst starts commenting on CPI data (beyond their data slice). The fundamental analyst starts drawing candlestick patterns. All agents spill into domains they are not equipped for.
Echo Chamber Effect	Eight agents may unintentionally converge on the same narrative—e.g., all discussing the "AI bubble" while ignoring technical momentum signals.	Cognitive diversity is passively lost in free discussion. Eight agents effectively degrade into a single agent consuming 8× the tokens.
Incommensurable Comparisons	The tech bull says "trend is up." The macro bear says "yield curve is inverted." How do you compare these two arguments? Which carries more weight? There is no common yardstick.	The judge can only synthesize by feel, producing unauditable conclusions. Five runs may yield five different verdicts.

The structured protocol solves these problems through a single principle: constraints create quality. Each agent only speaks within its analytical lens (no argument drift). Each agent must face an opponent that attacks it within its own domain (no echo chamber). Every argument uses a uniform JSON format and judge scoring rubric (cross-perspective comparability).

💡 Core Design Principle: The debate protocol does not restrict agent creativity—it builds an arena for adversarial reasoning. You need boundaries. You need rules. You need a scoreboard. Otherwise, you do not have an 8-agent debate; you have 8 agents producing noise.

The 3-Round Protocol: Opening → Cross-Examination → Closing

The protocol adapts the general debate framework (see Multi-Agent Debate L2: Structured Debate Protocol) but is specifically tuned for market analysis: 8 agents instead of 2, within-domain paired cross-examination instead of free-for-all attacks.

Protocol Overview

Round	Objective	Agent Behavior	Execution	Output
R1: Opening Statements	Each agent independently builds arguments from its analytical lens, undisturbed by opponents	Each agent reads its own knowledge base slice and generates 2–4 structured arguments. Bulls argue for upside; bears argue for downside.	✅ 8 agents all parallel	Per argument: {claim, evidence, confidence, counterpoints}
R2: Cross-Examination	Within the same analytical domain, bulls and bears attack each other's opening arguments	Tech bull attacks tech bear's arguments (and vice versa). Fund bull attacks fund bear. Each pair digs deep within its domain—no cross-domain attacks.	⚠️ Per-pair serial, 4 pairs parallel	Per challenge: {target_id, challenge_type, reasoning, new_evidence}
R3: Closing Arguments	Based on received cross-examination, refine or abandon original arguments, state final position	Each agent reads all challenges directed at them. Decides: which arguments still hold? What must be conceded? What is the final stance?	✅ 8 agents all parallel	Per closing: {refined_claims, concessions, final_stance, conviction_change}
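
The Execution column can be sketched with the standard library. This is a minimal sketch, not the article's orchestrator: `call_agent` is a hypothetical stand-in for the real LLM call, and the `run_pair` and `run_three_rounds` helpers are illustrative names. Only the agent names and the round order come from the table above.

```python
from concurrent.futures import ThreadPoolExecutor

DOMAINS = ["tech", "fund", "macro", "senti"]
AGENTS = [f"{d}_{side}" for d in DOMAINS for side in ("bull", "bear")]


def call_agent(agent, round_name, context):
    """Hypothetical stand-in for the LLM call; records whose output the agent saw."""
    return {"agent": agent, "round": round_name, "saw": sorted(context)}


def run_pair(domain, openings):
    """R2 for one domain: the two challenges run one after the other (serial)."""
    bull, bear = f"{domain}_bull", f"{domain}_bear"
    bull_out = call_agent(bull, "cross_exam", {bear: openings[bear]})
    bear_out = call_agent(bear, "cross_exam", {bull: openings[bull]})
    return {bull: bull_out, bear: bear_out}


def run_three_rounds():
    with ThreadPoolExecutor(max_workers=8) as pool:
        # R1: all 8 agents in parallel, with no view of each other.
        r1 = dict(zip(AGENTS, pool.map(lambda a: call_agent(a, "opening", {}), AGENTS)))

        # R2: the 4 domain pairs run in parallel; each pair is serial inside run_pair.
        r2 = {}
        for pair_result in pool.map(lambda d: run_pair(d, r1), DOMAINS):
            r2.update(pair_result)

        # R3: all 8 agents in parallel; each reads only its opponent's challenges.
        def closing(agent):
            domain, side = agent.rsplit("_", 1)
            opponent = f"{domain}_{'bear' if side == 'bull' else 'bull'}"
            return call_agent(agent, "closing", {opponent: r2[opponent]})

        r3 = dict(zip(AGENTS, pool.map(closing, AGENTS)))
    return {"opening": r1, "cross_exam": r2, "closing": r3}


transcript = run_three_rounds()
```

Threads are a reasonable fit here because the real work is network-bound LLM calls. An asyncio version would follow the same shape.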

Why Paired Cross-Examination Instead of Free-for-All?

If every agent could attack every other agent, one round would generate 8×7 = 56 attack vectors. Most would be superficial—wide but shallow. By pairing agents within the same analytical domain:

Tech Bull vs. Tech Bear: Both agree technical analysis is valid. They debate what the current technical data means. This is genuine expert debate.
Fund Bull vs. Fund Bear: Both are looking at valuation data. They debate whether the current PE is justified—not one talking PE and the other talking candlestick patterns.
Macro Bull vs. Macro Bear + Senti Bull vs. Senti Bear: Same logic.

⚠️ Paired cross-examination is not simplification—it is focus: Each agent still only attacks within its knowledge slice. The tech bull will not challenge the macro bear's arguments (it cannot see macro data). This means every cross-examination is conducted by a domain expert. Depth within domain > breadth across domains.
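
The difference between the two designs is easy to count. The sketch below enumerates every attacker/target combination and then keeps only same-domain ones. The agent naming scheme (`tech_bull`, `tech_bear`, ...) follows the article; the `domain_of` helper is an illustrative assumption.

```python
from itertools import permutations

DOMAINS = ["tech", "fund", "macro", "senti"]
AGENTS = [f"{d}_{side}" for d in DOMAINS for side in ("bull", "bear")]


def domain_of(agent):
    """'tech_bull' -> 'tech' (hypothetical helper based on the naming scheme)."""
    return agent.rsplit("_", 1)[0]


# Free-for-all: every ordered (attacker, target) pair of distinct agents.
free_for_all = list(permutations(AGENTS, 2))

# Paired: an agent may only attack the opposite stance within its own domain.
paired = [(a, t) for a, t in free_for_all if domain_of(a) == domain_of(t)]

print(len(free_for_all))  # 56
print(len(paired))        # 8
```

Eight directed attacks instead of 56 means each agent has exactly one opponent to read closely.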

JSON Argument Format Specification

All agents output structured JSON across all 3 rounds—not free text. This is the foundation that enables the entire system to automatically process, compare, and score arguments.

Opening Argument Format

Field	Type	Requirement	Example
`claim`	string	A single, clearly falsifiable proposition. Under 100 characters.	"ExampleIndex is in a confirmed uptrend with price above all key moving averages"
`evidence`	string	Concrete data supporting the claim. Must reference data points from the knowledge base. No vague claims like "many" or "generally."	"Price 4,850 → MA20(4,720), MA50(4,510), MA200(4,120). All three in bullish alignment."
`confidence`	float (0.0–1.0)	Agent's confidence in this argument. Must be calibrated against evidence quality—strong evidence = high confidence. Not a guess.	0.85
`counterpoints`	string[]	Counterpoints or limitations the agent itself recognizes. Even while arguing a side, must honestly acknowledge weaknesses.	["RSI 14 is at 62, approaching overbought but not yet triggered"]

💡 Why require counterpoints? This inherits from the L2 debate protocol's "honesty" dimension. An agent willing to proactively expose its argument's limitations is more trustworthy than one pretending perfection. The judge rewards this honesty directly in the scoring.
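
A format like this is only useful if it is enforced before an argument enters the transcript. The following is a minimal validation sketch for the four fields above. The function name `validate_opening_argument` is hypothetical, and the "evidence must contain at least one digit" rule is an assumed heuristic for "concrete data", not a rule stated in the protocol.

```python
def validate_opening_argument(arg):
    """Return a list of rule violations for one opening argument (empty list = valid)."""
    errors = []

    claim = arg.get("claim")
    if not isinstance(claim, str) or not claim.strip():
        errors.append("claim must be a non-empty string")
    elif len(claim) >= 100:
        errors.append("claim must be under 100 characters")

    evidence = arg.get("evidence")
    # Assumed heuristic: concrete evidence cites at least one number.
    if not isinstance(evidence, str) or not any(ch.isdigit() for ch in evidence):
        errors.append("evidence must be a string citing at least one number")

    conf = arg.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be a number in [0.0, 1.0]")

    counterpoints = arg.get("counterpoints")
    if not isinstance(counterpoints, list) or not all(isinstance(c, str) for c in counterpoints):
        errors.append("counterpoints must be a list of strings")

    return errors


good = {
    "claim": "ExampleIndex is in a confirmed uptrend with price above all key moving averages",
    "evidence": "Price 4,850 sits above MA20(4,720), MA50(4,510), and MA200(4,120).",
    "confidence": 0.85,
    "counterpoints": ["RSI 14 is at 62, approaching overbought territory"],
}
bad = {
    "claim": "Trend is up",
    "evidence": "many indicators generally agree",
    "confidence": 1.4,
    "counterpoints": "none",
}
print(validate_opening_argument(good))  # []
print(len(validate_opening_argument(bad)))  # 3
```

An argument that fails validation could be sent back to the agent with the error list appended to the prompt, rather than being silently dropped.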

Cross-Examination Response Format

Field	Description
`target_id`	The argument ID being attacked (e.g., `tech_bull_arg_0`)
`challenge_type`	`refute` (points out logical or factual error) | `question_evidence` (challenges evidence sufficiency or conditions) | `concede` (accepts the argument) | `partial` (partially accepts but disputes degree or scope)
`reasoning`	Detailed reasoning chain for the challenge
`new_evidence`	New evidence supporting the challenge (must come from the knowledge base; no external information permitted)
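
The pairing rule can be checked mechanically from `target_id` alone. The sketch below assumes that IDs follow the pattern of the example (`<domain>_<stance>_arg_<zero-based index>`); the function name `check_challenge` and the `opening_counts` argument are hypothetical.

```python
import re

CHALLENGE_TYPES = {"refute", "question_evidence", "concede", "partial"}
# Assumed ID pattern, inferred from the example "tech_bull_arg_0".
TARGET_ID = re.compile(r"^(tech|fund|macro|senti)_(bull|bear)_arg_(\d+)$")


def check_challenge(challenge, attacker, opening_counts):
    """Validate one R2 challenge.

    opening_counts maps agent name -> number of opening arguments it produced.
    Returns "ok" or a short description of the first problem found.
    """
    match = TARGET_ID.match(str(challenge.get("target_id", "")))
    if not match:
        return "malformed target_id"
    domain, side, index = match.group(1), match.group(2), int(match.group(3))

    attacker_domain, attacker_side = attacker.rsplit("_", 1)
    if domain != attacker_domain or side == attacker_side:
        return "target is not the attacker's paired opponent"
    if index >= opening_counts.get(f"{domain}_{side}", 0):
        return "target argument does not exist"
    if challenge.get("challenge_type") not in CHALLENGE_TYPES:
        return "unknown challenge_type"
    return "ok"


counts = {"tech_bull": 3, "tech_bear": 2}
challenge = {"target_id": "tech_bull_arg_0", "challenge_type": "question_evidence"}
print(check_challenge(challenge, "tech_bear", counts))   # ok
print(check_challenge(challenge, "macro_bear", counts))  # target is not the attacker's paired opponent
```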

Closing Statement Format

The closing statement does not re-state complete arguments. It concisely structures the agent's state change after cross-examination:

refined_claims: Which original claims survived cross-examination? How were they improved? (JSON array)
concessions: What opponent arguments does the agent explicitly concede? (JSON array)
final_stance: Final position statement (string, 50–100 words)
conviction_change: Confidence shift—strengthened | weakened | unchanged
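
Because `conviction_change` is self-reported, it is worth checking against the numbers the agent also reports. The sketch below is one possible consistency check, assuming each refined claim carries a `confidence_adjusted` value (as in the closing example later in this article) and that the caller supplies the opening confidences of those same claims. The function names and the 0.02 tolerance are assumptions.

```python
CONVICTION_VALUES = {"strengthened", "weakened", "unchanged"}


def expected_conviction(before, after, tolerance=0.02):
    """Label the shift in mean confidence between opening and closing."""
    delta = sum(after) / len(after) - sum(before) / len(before)
    if delta > tolerance:
        return "strengthened"
    if delta < -tolerance:
        return "weakened"
    return "unchanged"


def conviction_is_consistent(closing, opening_confidences):
    """True if the declared conviction_change agrees with the confidence numbers."""
    declared = closing.get("conviction_change")
    adjusted = [c["confidence_adjusted"] for c in closing.get("refined_claims", [])]
    if declared not in CONVICTION_VALUES or not adjusted:
        return False
    return declared == expected_conviction(opening_confidences, adjusted)


closing = {
    "refined_claims": [{"confidence_adjusted": 0.78}],
    "conviction_change": "weakened",
}
print(conviction_is_consistent(closing, [0.85]))  # True
```

A mismatch (for example, "strengthened" with lower adjusted confidence) is a cheap signal that the closing statement deserves a closer look from the judge.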

Eight Agent Specialized Prompt Design

Each agent's system prompt contains four layers, ensuring specialization without losing structural consistency:

Role Layer: Defines the analytical lens (technical/fundamental/macro/sentiment) and directional stance (bull/bear)
Data Layer: Explicitly tells the agent which data modules exist in its knowledge base slice
Format Layer: Enforces the JSON output format specification (claim/evidence/confidence/counterpoints)
Behavior Layer: Debate rules—honesty requirements, evidence citation rules, hallucination prevention
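
The four layers can be assembled by a single function so that only the role and data layers vary between agents. This is a sketch under assumptions: `build_system_prompt`, the `LENSES` wording, and the module list passed for `tech_bull` are illustrative and not the article's actual prompts. The behavior text reuses the anti-fabrication wording quoted in Pitfall 1.

```python
LENSES = {
    "tech": "technical analysis",
    "fund": "fundamental analysis",
    "macro": "macroeconomic analysis",
    "senti": "market sentiment analysis",
}


def build_system_prompt(agent, data_modules):
    """Compose the role, data, format, and behavior layers into one system prompt."""
    lens_key, stance = agent.rsplit("_", 1)
    direction = "upside" if stance == "bull" else "downside"

    role = f"ROLE: You are the {stance} analyst for {LENSES[lens_key]}. Argue the {direction} case."
    data = "DATA: Your knowledge base slice contains these modules: " + ", ".join(data_modules) + "."
    fmt = ("FORMAT: Respond with JSON only. Each argument has the fields "
           "claim, evidence, confidence, counterpoints.")
    behavior = ("BEHAVIOR: All evidence MUST come from the knowledge base. Do not fabricate data. "
                "If data is insufficient for a high-quality argument, lower confidence rather than inventing.")
    return "\n".join([role, data, fmt, behavior])


# The module names for this agent are an assumption for illustration.
prompt = build_system_prompt("tech_bull", ["indices", "technicals"])
print(prompt)
```

Keeping the format and behavior layers identical across all eight agents is what makes their outputs comparable; only the first two lines change per agent.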

Agent Specialization Matrix

Dict[str, Any]:
    """Debate engine entry point.

    Args:
        knowledge_base: KnowledgeBase from market_data_pipeline.py
        topic: Debate topic

    Returns:
        Complete debate transcript including 3 rounds and judge evaluation
    """
    orchestrator = DebateOrchestrator(knowledge_base)
    return orchestrator.run_debate(topic)


# ═══════════════════════════════════════════════════════════
# Demo with Synthetic Data
# ═══════════════════════════════════════════════════════════

if __name__ == "__main__":
    # Synthetic knowledge base: fictional ExampleIndex data
    synthetic_kb = {
        "meta": {
            "generated_at": "2026-05-15T12:00:00Z",
            "market_status": "open",
            "data_sources": ["synthetic"],
            "warnings": ["This is synthetic demo data — not for real analysis"],
        },
        "indices": {
            "EXI": {
                "ticker": "EXI",
                "name": "ExampleIndex",
                "price": 4850.0,
                "change_pct": 0.8,
                "returns": {"5d": 2.1, "20d": 4.8, "50d": 12.3, "200d": 28.5},
                "vs_52w_high_pct": -3.2,
                "vs_52w_low_pct": 42.1,
                "volume_ratio": 1.3,
            },
        },
        "technicals": {
            "EXI": {
                "ticker": "EXI",
                "ma_status": {"ma20": "above", "ma50": "above", "ma200": "above"},
                "rsi_14": 62.0,
                "macd_signal": "bullish",
                "atr_14": 45.2,
                "volume_trend": "increasing",
            },
        },
        "sectors": {
            "XTECH": {"ticker": "XTECH", "name": "Technology", "price": 520.0,
                      "change_5d_pct": 3.2, "change_20d_pct": 7.5, "relative_strength_vs_spx": 2.7},
            "XFIN": {"ticker": "XFIN", "name": "Financials", "price": 180.0,
                     "change_5d_pct": -0.5, "change_20d_pct": 1.2, "relative_strength_vs_spx": -3.6},
            "XIND": {"ticker": "XIND", "name": "Industrials", "price": 210.0,
                     "change_5d_pct": 1.8, "change_20d_pct": 4.1, "relative_strength_vs_spx": -0.7},
        },
        "fundamentals": {
            "sp500_pe_approx": {"current_pe_approx": 28.5, "long_term_avg_pe": 24.0},
            "sp500_earnings_yield_approx": 3.51,
            "sector_rotation_signal": "cyclical_rotation",
        },
        "macro": {
            "GDP": {"indicator": "GDP", "trend": "rising", "latest_value": 3.2, "yoy_change_pct": 0.5},
            "CPI": {"indicator": "CPI", "trend": "falling", "latest_value": 3.1, "yoy_change_pct": -0.3},
            "FEDFUNDS": {"indicator": "FEDFUNDS", "trend": "flat", "latest_value": 4.25},
            "UNRATE": {"indicator": "UNRATE", "trend": "flat", "latest_value": 3.8},
            "T10Y2Y": {"indicator": "T10Y2Y", "trend": "flat", "latest_value": -0.35},
        },
        "sentiment": {
            "vix_level": 18.2,
            "vix_regime": "normal",
            "volume_signal": "high_volume_rally",
            "sector_breadth": {"positive_5d": "7/10", "positive_20d": "7/10",
                               "breadth_regime": "broad_strength"},
        },
        "global_markets": {
            "HSI": {"ticker": "HSI", "name": "Hang Seng", "price": 21800,
                    "change_pct": 1.2, "returns": {"5d": 2.8, "20d": 6.3}},
            "N225": {"ticker": "N225", "name": "Nikkei 225", "price": 39200,
                     "change_pct": 0.5, "returns": {"5d": 1.3, "20d": 3.7}},
        },
    }

    print("=" * 60)
    print("📊 Multi-Agent Debate × Market Analysis — Protocol Engine")
    print("=" * 60)

    result = run_debate(synthetic_kb, topic="One-month market outlook for ExampleIndex")

    # Save full transcript
    output_path = "debate_transcript.json"
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False, default=str)
    print(f"\n📁 Full debate transcript saved to: {output_path}")

    # Print summary
    ev = result.get("judge_evaluation", {})
    if "synthesis" in ev:
        s = ev["synthesis"]
        print(f"\n{'─' * 40}")
        print("📋 Debate Summary")
        print(f"{'─' * 40}")
        print(f"Bull: {s.get('bull_case_summary', 'N/A')[:200]}")
        print(f"Bear: {s.get('bear_case_summary', 'N/A')[:200]}")
        print(f"Key Insight: {s.get('key_insight', 'N/A')}")

    print("\n✅ Debate engine run complete.")

Running the Debate Engine

Save the code as debate_protocol.py, install dependencies, and run:

# Install dependencies
pip install openai

# Configure: Edit api_key and base_url in LLM_CONFIG
# Replace with your actual API endpoint

# Run demo debate (uses built-in synthetic data)
python debate_protocol.py

In production, import the knowledge base from Article 1's pipeline:

from market_data_pipeline import build_knowledge_base
from debate_protocol import run_debate

kb = build_knowledge_base()        # fetch real market data
result = run_debate(kb)            # run 8-agent debate
# result["judge_evaluation"] contains the synthesis

Practical Example: Full ExampleIndex Debate Flow

Below is the typical output when running the debate engine with the synthetic data above. We show key excerpts rather than a full JSON dump.

Round 1: Opening Argument Highlights

Agent	Key Claim	Confidence
🐂 tech_bull	Price above all MAs, bullish alignment confirms uptrend	0.85
🐻 tech_bear	RSI 14 at 62 near overbought, only -3.2% from 52-week high — pullback risk	0.72
🐂 fund_bull	Earnings yield 3.51% vs long bond ~4.2% — spread narrowing but still positive, equities retain relative value	0.68
🐻 fund_bear	Current PE 28.5 is significantly above the long-term average of 24.0 (+18.8%): valuation at elevated levels	0.80
🐂 macro_bull	GDP growth 3.2%, CPI falling to 3.1% — soft landing scenario supports risk assets	0.75
🐻 macro_bear	Yield curve remains inverted (10Y-2Y = -0.35%) — historically a recession precursor	0.82
🐂 senti_bull	Sector breadth healthy (7/10 positive), VIX 18.2 in normal range — no panic signals	0.70
🐻 senti_bear	Volume ratio 1.3 — high-volume rally may signal near-term overheating and profit-taking pressure	0.65

Round 2: Key Cross-Examination Exchange

Here is a representative exchange from the tech_bull vs. tech_bear pair—the tech_bear's challenge to the tech_bull's first argument:

// tech_bear → attacking tech_bull argument #1
{
  "target_id": "tech_bull_arg_0",
  "challenge_type": "question_evidence",
  "reasoning": "MA bullish alignment does appear in uptrends, but it is a lagging indicator: it confirms a trend after it has already materialized. The current price is approximately 730 points (+17.7%) above MA200. Historically, such deviation is often followed by mean reversion. The 200-day return of +28.5% indicates the index has already risen substantially. MA alignment merely reflects what has already happened, not what will happen next.",
  "new_evidence": "Price deviation from MA200: +17.7% (indices.EXI.price 4,850 vs. MA200 ≈ 4,120, a gap of 730 points). The 200-day return (indices.EXI.returns.200d) is +28.5%, at approximately the 85th percentile historically."
}

The tech_bull's closing statement, responding to this challenge:

// tech_bull — closing statement excerpt
{
  "refined_claims": [{
    "original": "MA bullish alignment confirms uptrend",
    "refinement": "MA bullish alignment confirms an established uptrend, but should be supplemented with RSI and MACD as leading indicators of trend strength. MA's lag is partially offset by MACD's earlier crossover signal.",
    "confidence_adjusted": 0.78
  }],
  "concessions": [
    "Accept tech_bear's point that MAs are lagging indicators. MA alignment should be used for trend confirmation, not trend prediction.",
    "Accept that the current deviation from historical mean may statistically increase pullback probability—but this does not constitute a reversal signal."
  ],
  "final_stance": "Technical outlook remains bullish, but confidence has been downgraded from very high to moderately high. RSI 62 and MA alignment point to sustained momentum, but the probability of a pullback after such a large rally cannot be ignored. A pullback to MA20 is the level to watch: whether support holds there would need confirmation.",
  "conviction_change": "weakened"
}

💡 Note the value of concessions: The tech_bull does not pretend its argument is perfect. It explicitly accepts the opponent's criticism (MAs are lagging). The judge rewards this honesty in the "logic" and "persuasiveness" dimensions. An agent willing to concede is more trustworthy than one that never admits weakness.

Judge Synthesis Evaluation

Here is the judge's scoring and trace for the 8 arguments (simplified):

Argument	Logic	Evidence	Clarity	Persuasiveness	Weighted	Status
tech_bull #1 (MA alignment)	7	8	8	7	7.5	WEAKENED
tech_bear #1 (RSI overbought)	6	6	7	5	6.0	WEAKENED
fund_bull #1 (earnings yield)	8	7	7	7	7.3	UPHELD
fund_bear #1 (PE elevated)	9	8	9	8	8.5	UPHELD
macro_bull #1 (soft landing)	7	7	7	6	6.8	WEAKENED
macro_bear #1 (curve inversion)	8	8	8	8	8.0	UPHELD
senti_bull #1 (healthy breadth)	7	7	8	6	7.0	UPHELD
senti_bear #1 (volume overheating)	6	5	7	5	5.7	REFUTED

Judge Synthesis: "The bears have stronger arguments on fundamentals (elevated PE) and macro (yield curve inversion). The bulls have reliable but more moderate arguments on technicals (uptrend) and sentiment (no panic signals). The critical disagreement lies in time horizon—bear arguments center on intermediate-term risks (3–12 months), while bull arguments primarily reflect short-term (1–2 week) momentum. The debate did not reach a clear directional consensus, but revealed a structural asymmetry in the risk/reward profile—upside potential exists (technical trend), but downside risks are well-argued (valuation + macro)."
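
The Weighted column in the table above follows directly from the 30/30/20/20 weights. A small sketch reproduces it; the weights are held as integer percentages to keep the arithmetic free of floating-point drift, and the `weighted_score` helper is an illustrative name.

```python
# Judge weights from the article, expressed as integer percentages.
WEIGHTS = {"logic": 30, "evidence": 30, "clarity": 20, "persuasiveness": 20}


def weighted_score(scores):
    """Combine four 0-10 dimension scores into one weighted score."""
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return round(total / 100, 1)


tech_bull = {"logic": 7, "evidence": 8, "clarity": 8, "persuasiveness": 7}
fund_bear = {"logic": 9, "evidence": 8, "clarity": 9, "persuasiveness": 8}
senti_bear = {"logic": 6, "evidence": 5, "clarity": 7, "persuasiveness": 5}

print(weighted_score(tech_bull))   # 7.5
print(weighted_score(fund_bear))   # 8.5
print(weighted_score(senti_bear))  # 5.7
```

Keeping the weights in one dictionary makes the later calibration step a data change rather than a code change.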

Common Pitfalls and Design Decisions

Pitfall 1: Agent Hallucination — Citing Nonexistent Knowledge Base Data

Problem: LLMs have a persistent tendency to fabricate: when prompted to "cite data from the knowledge base" but unable to find a suitable data point, they may invent one.

Solution: We explicitly state in the prompt: "All evidence MUST come from the knowledge base. Do not fabricate data. If data is insufficient for a high-quality argument, lower confidence rather than inventing." Additionally, the judge agent receives the full knowledge base and can cross-verify agent-cited data during scoring—if an agent claims PE is 15 when the KB says 28.5, the judge will flag an evidence quality issue.
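
Part of that cross-check does not need an LLM at all. The sketch below is a rough heuristic, not the article's judge logic: it pulls numeric tokens out of an evidence string and reports those that match no numeric value anywhere in the knowledge base. The names `kb_numbers` and `unverified_numbers` are hypothetical. Signs are ignored, and indicator parameters written as plain numbers (the "14" in "RSI 14") will be flagged, so the result is a list for review rather than a verdict.

```python
import re

# A number not glued to a preceding letter, digit, dot, or comma (so "MA200" is skipped).
NUMBER = re.compile(r"(?<![A-Za-z\d.,])(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def kb_numbers(node):
    """Collect the magnitude of every numeric leaf in a nested knowledge base."""
    if isinstance(node, bool):
        return set()
    if isinstance(node, (int, float)):
        return {abs(float(node))}
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        found = set()
        for child in node:
            found |= kb_numbers(child)
        return found
    return set()


def unverified_numbers(evidence, kb):
    """Return numbers cited in the evidence that appear nowhere in the knowledge base."""
    known = kb_numbers(kb)
    cited = [float(token.replace(",", "")) for token in NUMBER.findall(evidence)]
    return [n for n in cited if n not in known]


kb = {"fundamentals": {"sp500_pe_approx": {"current_pe_approx": 28.5, "long_term_avg_pe": 24.0}}}
print(unverified_numbers("Current PE is 15, below the long-term average of 24.0", kb))  # [15.0]
```

Exact matching works for the synthetic data here; a real pipeline with unrounded values would likely need a tolerance or a rounding convention.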

Pitfall 2: Paired Cross-Examination Degeneration — Both Sides "Agree to Disagree"

Problem: At certain temperature settings, the tech_bull and tech_bear may spend the cross-exam round mutually conceding each other's points—"I accept your view," "I also accept yours"—turning debate into a handshake session.

Solution: We lower the cross-examination temperature (0.4) and explicitly require in the prompt: "You MUST find weaknesses in your opponent's arguments. Do not concede to all arguments—challenge or refute at least half."
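
The prompt instruction can be backed by a post-hoc check on each agent's cross-examination output. This is a minimal sketch, assuming that `partial` counts as a challenge and that an empty response is treated as a failure; the function name `too_agreeable` and the retry idea are illustrative, not part of the article's code.

```python
def too_agreeable(challenges, max_concede_ratio=0.5):
    """True if an agent conceded more than the allowed share of its responses.

    challenges is the list of R2 responses produced by one agent.
    Only challenge_type == "concede" counts as a concession (assumption).
    """
    if not challenges:
        return True  # nothing challenged at all: treat as degenerate
    conceded = sum(1 for c in challenges if c.get("challenge_type") == "concede")
    return conceded / len(challenges) > max_concede_ratio


handshake = [
    {"challenge_type": "concede"},
    {"challenge_type": "concede"},
    {"challenge_type": "partial"},
]
genuine = [
    {"challenge_type": "refute"},
    {"challenge_type": "concede"},
]
print(too_agreeable(handshake))  # True
print(too_agreeable(genuine))    # False
```

An orchestrator could rerun the round once for an agent that trips this check, or simply record the flag in the transcript for the judge.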

Pitfall 3: Judge Score Inflation — All Arguments Score 7–9

Problem: An untuned judge tends to produce averaged-out high scores. If all arguments score between 7–9, the scoring system loses discriminative power.

Solution: Add distribution guidance to the judge prompt: "Good arguments should score 6–8, not 9–10. Reserve 9–10 for near-perfect reasoning—logic that is watertight and evidence that is irrefutable." Additionally, Article 3 will introduce score calibration—using historical backtest data to tune judge weights.
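
Prompt guidance is easier to trust when the resulting distribution is measured. The sketch below summarizes a batch of judge scores and flags likely inflation. The thresholds (`min_spread`, `max_top_share`) and the function name `inflation_report` are assumptions chosen for illustration, not calibrated values.

```python
from statistics import mean, pstdev


def inflation_report(scores, top_band=9, min_spread=1.0, max_top_share=0.2):
    """Summarize a batch of 0-10 judge scores and flag likely inflation.

    A batch is flagged when scores are tightly clustered (low standard deviation)
    or when too many of them fall in the top band.
    """
    top_share = sum(1 for s in scores if s >= top_band) / len(scores)
    spread = pstdev(scores)
    return {
        "mean": round(mean(scores), 2),
        "spread": round(spread, 2),
        "top_share": round(top_share, 2),
        "inflated": spread < min_spread or top_share > max_top_share,
    }


clustered = [8, 8, 9, 9, 8, 9, 8, 9]
varied = [3, 5, 7, 8, 6, 4, 9, 6]
print(inflation_report(clustered)["inflated"])  # True
print(inflation_report(varied)["inflated"])     # False
```

Running this per dimension across many debates shows whether the distribution guidance in the judge prompt is having any effect.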

Design Decision: Why 4 Dimensions (Instead of the L2 Debate's logic/evidence/responsiveness/honesty)?

The L2 general debate protocol used logic/evidence/responsiveness/honesty. The market debate protocol adapts them to logic/evidence/clarity/persuasiveness. Two reasons:

"Responsiveness" is harder to assess in market analysis: In a general debate, you can judge "did the pro side directly respond to the con side's challenge?" But in market analysis, many responses are "I acknowledge your data, but I interpret it differently"—this is not evasion, it is legitimate viewpoint divergence. Clarity and persuasiveness more accurately capture the quality of market arguments.
Persuasiveness implicitly captures the benefits of honesty: An argument willing to expose weaknesses in its counterpoints is naturally more persuasive—because the reader feels the author is honest. The persuasiveness dimension implicitly subsumes the honesty consideration.

Key Takeaways

Structured protocol = auditable debate: Every argument has an ID, claim, evidence, and confidence score, so the judge can precisely reference and score each one. A free-form conversation offers no such handles, which makes its conclusions far harder to audit.
Specialized prompts drive genuine diversity: The 8 agents differ not because "some are bullish and some are bearish"—but because of analytical lens and data slice differences. A technical analyst and a macro economist looking at the same market see different things—that is the source of diversity's value.
Paired cross-examination constrains quality: Within-domain pairing (tech vs. tech, fundamental vs. fundamental) ensures every cross-examination is conducted by a domain expert. Cross-domain attacks only produce noise.
Judge scoring ≠ directional prediction: The judge evaluates debate quality—not who "wins." Backtesting and validation is the subject of Article 3.
The argument trace table is the most practical output: From the chaos of 8 agents × 3 rounds × N arguments, the trace table extracts the only information that matters: which arguments held up, which were refuted, and which remain uncertain. This is the bridge from debate to decision.

Article 3 Teaser: Backtesting & Validation

You now have a working debate engine. It produces high-quality adversarial analysis. But one critical question remains: does the debate actually improve market analysis accuracy?

In Article 3, we will answer this question with hard data:

100 historical debate backtests: Run the debate engine across 100 historical periods with known outcomes—how well does the debate output correlate with actual market direction?
Single-agent baseline comparison: Same data, a single LLM's "analyze the market" output vs. the 8-agent debate judge synthesis—which is more accurate? What is the gap?
Judge weight calibration: The current 4-dimension weights (30/30/20/20) are intuition-based. Using backtest data, we can calibrate to optimal weights—which dimensions are most predictive of market direction?
Confidence validation: Do agent-reported confidence scores correlate with actual accuracy? Are high-confidence arguments more likely to hold up than low-confidence ones?

But before then, run today's code. Pull real data with Article 1's pipeline. Run a debate with today's engine. Read the argument trace table. See which arguments held up and which were refuted. Then ask yourself: if I were looking at this data alone, would I have noticed all of this?

📖 Previous: Multi-Agent Debate × Market Analysis — Architecture & Data Pipeline (building the knowledge base)
📖 Debate Theory: Multi-Agent Debate L2: Structured Debate Protocol · L3: Scoring & Consensus
📖 Next: Article 3 — Backtesting & Validation

⚠️ Disclaimer: This article is a technical workflow demonstration, not financial advice. All market data in this article is synthetic/fictional (ExampleIndex is not a real index). The debate engine's output cannot and should not be used as the basis for actual investment decisions. Financial markets carry inherent risk, and any automated analysis system may produce incorrect conclusions. Consult a licensed financial professional before making any investment decisions.

Next Steps

📖 Next in series: Backtesting & Validation — Accuracy Measurement and Judge Weight Calibration Across 100 Historical Debates — Put this debate protocol through rigorous backtesting: measure directional accuracy, calibrate judge weights with data (replacing intuition), and run McNemar significance tests.
📖 Debate theory foundation: Multi-Agent Debate L2: Structured Debate Protocols — The theoretical origin of the 3-round protocol: why three rounds rather than two or four? The design space and tradeoffs of cross-examination.
📖 Foundational skill: Building an Agent Framework from Scratch — Prompts, Tools, and Orchestration — Deep dive into the systematic methodology behind the prompt engineering in this article: role definition, output constraints, and reasoning boundary design patterns.

Frequently Asked Questions

Q: Why is cross-examination paired rather than free-for-all?

A: A free-for-all (every agent attacks every other agent) produces 8×7=56 attack vectors — that's noise, not signal. The paired design (Tech Bull vs. Tech Bear, Fund Bull vs. Fund Bear, etc.) ensures each cross-examination is focused, deep, and comparable across pairs. The Tech Bull and Tech Bear aren't arguing about whether technical analysis works — they both agree it does. They're arguing about what the technical data means *right now*. This follows the same design principle as the L2 series: "one judge evaluates one debate dimension" — constraint creates quality. Additionally, 4 parallel pairs keep total round time at ~12 seconds.

Q: What fields does the agent output JSON Schema include? Why structured output?

A: Each agent's argument output follows a strict JSON format: claim (a single falsifiable proposition, under 100 characters), evidence (concrete data points cited from the agent's knowledge base slice), confidence (a 0–1 score calibrated against evidence quality), and counterpoints (limitations or opposing considerations the agent itself acknowledges). Structured output isn't about "looking clean"; it's about machine readability and cross-agent comparability. The judge can parse each argument's evidence and confidence level directly rather than relying on natural language understanding to guess.

Q: Why these four judge scoring dimensions — Logic, Evidence, Clarity, Persuasiveness?

A: These four dimensions come from the L3 debate theory multi-dimensional scoring framework, adapted for market analysis. Logic evaluates whether the reasoning chain is internally consistent. Evidence evaluates the specificity and verifiability of cited data ("SPX RSI is 58, in neutral territory" vs. "the market looks strong"). Clarity evaluates whether the argument can be accurately understood and responded to by opponent agents. Persuasiveness evaluates how well the argument survives cross-examination. The 30/30/20/20 weights are initial intuition values — Article 3's backtesting will calibrate optimal weights via grid search.

Q: What role does prompt engineering play in the debate protocol? How do you prevent agent hallucination?

A: Prompts are the core quality control layer for the debate. Each agent's system prompt contains four constraints: role definition (analytical lens + directional stance), output format constraint (must follow JSON Schema), evidence citation requirement (must reference specific data from the knowledge base slice; if data isn't in the KB, it must be marked as "inferred" not "fact"), and reasoning boundary (explicitly state what the agent should NOT comment on outside its analytical scope). Anchoring evidence to knowledge base data (rather than the agent's training knowledge) suppresses hallucination but does not eliminate it: an agent can still misquote or invent a figure, which is why the judge receives the full knowledge base and cross-checks cited data during scoring.

Q: How does the debate protocol integrate with Article 1's data pipeline?

A: The data pipeline (market_data_pipeline.py) outputs market_knowledge_base.json. The debate protocol (debate_protocol.py) takes that knowledge base and calls slice_for_agent() to extract each agent's dedicated data slice. Integration flow: kb = build_knowledge_base() (data pipeline) → orchestrator = DebateOrchestrator(kb) (debate protocol) → transcript = orchestrator.run_debate(topic). The run_debate(kb, topic) entry point wraps the last two steps. The dependency is one-directional: the pipeline is a module, the protocol is its consumer. The caching layer (Article 4) will add TTL on top so we don't re-fetch data for every debate.