Dict[str, Any]:
"""Debate engine entry point.
Args:
knowledge_base: KnowledgeBase from market_data_pipeline.py
topic: Debate topic
Returns:
Complete debate transcript including 3 rounds and judge evaluation
"""
orchestrator = DebateOrchestrator(knowledge_base)
return orchestrator.run_debate(topic)
# ═══════════════════════════════════════════════════════════
# Demo with Synthetic Data
# ═══════════════════════════════════════════════════════════
if __name__ == "__main__":
# Synthetic knowledge base — fictional ExampleIndex data
synthetic_kb = {
"meta": {
"generated_at": "2026-05-15T12:00:00Z",
"market_status": "open",
"data_sources": ["synthetic"],
"warnings": ["This is synthetic demo data — not for real analysis"],
},
"indices": {
"EXI": {
"ticker": "EXI", "name": "ExampleIndex",
"price": 4850.0, "change_pct": 0.8,
"returns": {"5d": 2.1, "20d": 4.8, "50d": 12.3, "200d": 28.5},
"vs_52w_high_pct": -3.2, "vs_52w_low_pct": 42.1,
"volume_ratio": 1.3,
},
},
"technicals": {
"EXI": {
"ticker": "EXI",
"ma_status": {"ma20": "above", "ma50": "above", "ma200": "above"},
"rsi_14": 62.0, "macd_signal": "bullish",
"atr_14": 45.2, "volume_trend": "increasing",
},
},
"sectors": {
"XTECH": {"ticker": "XTECH", "name": "Technology", "price": 520.0,
"change_5d_pct": 3.2, "change_20d_pct": 7.5,
"relative_strength_vs_spx": 2.7},
"XFIN": {"ticker": "XFIN", "name": "Financials", "price": 180.0,
"change_5d_pct": -0.5, "change_20d_pct": 1.2,
"relative_strength_vs_spx": -3.6},
"XIND": {"ticker": "XIND", "name": "Industrials", "price": 210.0,
"change_5d_pct": 1.8, "change_20d_pct": 4.1,
"relative_strength_vs_spx": -0.7},
},
"fundamentals": {
"sp500_pe_approx": {"current_pe_approx": 28.5,
"long_term_avg_pe": 24.0},
"sp500_earnings_yield_approx": 3.51,
"sector_rotation_signal": "cyclical_rotation",
},
"macro": {
"GDP": {"indicator": "GDP", "trend": "rising",
"latest_value": 3.2, "yoy_change_pct": 0.5},
"CPI": {"indicator": "CPI", "trend": "falling",
"latest_value": 3.1, "yoy_change_pct": -0.3},
"FEDFUNDS": {"indicator": "FEDFUNDS", "trend": "flat",
"latest_value": 4.25},
"UNRATE": {"indicator": "UNRATE", "trend": "flat",
"latest_value": 3.8},
"T10Y2Y": {"indicator": "T10Y2Y", "trend": "flat",
"latest_value": -0.35},
},
"sentiment": {
"vix_level": 18.2, "vix_regime": "normal",
"volume_signal": "high_volume_rally",
"sector_breadth": {"positive_5d": "7/10",
"positive_20d": "7/10",
"breadth_regime": "broad_strength"},
},
"global_markets": {
"HSI": {"ticker": "HSI", "name": "Hang Seng", "price": 21800,
"change_pct": 1.2,
"returns": {"5d": 2.8, "20d": 6.3}},
"N225": {"ticker": "N225", "name": "Nikkei 225", "price": 39200,
"change_pct": 0.5,
"returns": {"5d": 1.3, "20d": 3.7}},
},
}
print("=" * 60)
print("📊 Multi-Agent Debate × Market Analysis — Protocol Engine")
print("=" * 60)
result = run_debate(synthetic_kb,
topic="One-month market outlook for ExampleIndex")
# Save full transcript
output_path = "debate_transcript.json"
with open(output_path, "w", encoding="utf-8") as f:
json.dump(result, f, indent=2, ensure_ascii=False, default=str)
print(f"\n📁 Full debate transcript saved to: {output_path}")
# Print summary
ev = result.get("judge_evaluation", {})
if "synthesis" in ev:
s = ev["synthesis"]
print(f"\n{'─' * 40}")
print(f"📋 Debate Summary")
print(f"{'─' * 40}")
print(f"Bull: {s.get('bull_case_summary', 'N/A')[:200]}")
print(f"Bear: {s.get('bear_case_summary', 'N/A')[:200]}")
print(f"Key Insight: {s.get('key_insight', 'N/A')}")
print(f"\n✅ Debate engine run complete.")
Running the Debate Engine
Save the code as debate_protocol.py, install dependencies, and run:
# Install dependencies
pip install openai
# Configure: Edit api_key and base_url in LLM_CONFIG
# Replace with your actual API endpoint
# Run demo debate (uses built-in synthetic data)
python debate_protocol.py
In production, import the knowledge base from Article 1's pipeline:
from market_data_pipeline import build_knowledge_base
from debate_protocol import run_debate
kb = build_knowledge_base() # fetch real market data
result = run_debate(kb) # run 8-agent debate
# result["judge_evaluation"] contains the synthesis
Practical Example: Full ExampleIndex Debate Flow
Below is the typical output when running the debate engine with the synthetic data above. We show key excerpts rather than a full JSON dump.
Round 1: Opening Argument Highlights
| Agent | Key Claim | Confidence |
|---|---|---|
| 🐂 tech_bull | Price above all MAs, bullish alignment confirms uptrend | 0.85 |
| 🐻 tech_bear | RSI 14 at 62 near overbought, only -3.2% from 52-week high: pullback risk | 0.72 |
| 🐂 fund_bull | Earnings yield 3.51% vs long bond ~4.2%: spread narrowing but still positive, equities retain relative value | 0.68 |
| 🐻 fund_bear | Current PE 28.5 significantly above long-term average of 24.0 (+18.8%): valuation at elevated levels | 0.80 |
| 🐂 macro_bull | GDP growth 3.2%, CPI falling to 3.1%: soft landing scenario supports risk assets | 0.75 |
| 🐻 macro_bear | Yield curve remains inverted (10Y-2Y = -0.35%), historically a recession precursor | 0.82 |
| 🐂 senti_bull | Sector breadth healthy (7/10 positive), VIX 18.2 in normal range: no panic signals | 0.70 |
| 🐻 senti_bear | Volume ratio 1.3: high-volume rally may signal near-term overheating and profit-taking pressure | 0.65 |
Round 2: Key Cross-Examination Exchange
Here is a representative exchange from the tech_bull vs. tech_bear pair—the tech_bear's challenge to the tech_bull's first argument:
// tech_bear → attacking tech_bull argument #1
{
"target_id": "tech_bull_arg_0",
"challenge_type": "question_evidence",
"reasoning": "MA bullish alignment does appear in uptrends, but it is a lagging indicator—it confirms a trend after it has already materialized. The current price is approximately 730 points (+17.7%) above MA200. Historically, such deviation is often followed by mean reversion. The 200-day return of +28.5% indicates the index has already risen substantially—MA alignment merely reflects what has already happened, not what will happen next.",
"new_evidence": "Price deviation from MA200: +17.7% (derived from indices.EXI.price 4,850 vs. indices.EXI.returns.200d 28.5% implying MA200 ≈ 4,120). 200-day rolling return is at approximately the 85th percentile historically."
}
The tech_bull's closing statement, responding to this challenge:
// tech_bull — closing statement excerpt
{
"refined_claims": [{
"original": "MA bullish alignment confirms uptrend",
"refinement": "MA bullish alignment confirms an established uptrend, but should be supplemented with RSI and MACD as leading indicators of trend strength. MA's lag is partially offset by MACD's earlier crossover signal.",
"confidence_adjusted": 0.78
}],
"concessions": [
"Accept tech_bear's point that MAs are lagging indicators. MA alignment should be used for trend confirmation, not trend prediction.",
"Accept that the current deviation from historical mean may statistically increase pullback probability—but this does not constitute a reversal signal."
],
"final_stance": "Technical outlook remains bullish, but confidence has been downgraded from very high to moderately high. RSI 62 and MA alignment point to sustained momentum, but the probability of a pullback after such a large rally cannot be ignored. A pullback to MA20 would constitute a technical observation condition: whether support holds would require confirmation.",
"conviction_change": "weakened"
}
💡 Note the value of concessions: The tech_bull does not pretend its argument is perfect. It explicitly accepts the opponent's criticism (MAs are lagging). The judge rewards this honesty in the "logic" and "persuasiveness" dimensions. An agent willing to concede is more trustworthy than one that never admits weakness.
Judge Synthesis Evaluation
Here is the judge's scoring and trace for the 8 arguments (simplified):
| Argument | Logic | Evidence | Clarity | Persuasive | Weighted | Status |
|---|---|---|---|---|---|---|
| tech_bull #1 (MA alignment) | 7 | 8 | 8 | 7 | 7.5 | WEAKENED |
| tech_bear #1 (RSI overbought) | 6 | 6 | 7 | 5 | 6.0 | WEAKENED |
| fund_bull #1 (earnings yield) | 8 | 7 | 7 | 7 | 7.3 | UPHELD |
| fund_bear #1 (PE elevated) | 9 | 8 | 9 | 8 | 8.5 | UPHELD |
| macro_bull #1 (soft landing) | 7 | 7 | 7 | 6 | 6.8 | WEAKENED |
| macro_bear #1 (curve inversion) | 8 | 8 | 8 | 8 | 8.0 | UPHELD |
| senti_bull #1 (healthy breadth) | 7 | 7 | 8 | 6 | 7.0 | UPHELD |
| senti_bear #1 (volume overheating) | 6 | 5 | 7 | 5 | 5.7 | REFUTED |
Judge Synthesis: "The bears have stronger arguments on fundamentals (elevated PE) and macro (yield curve inversion). The bulls have reliable but more moderate arguments on technicals (uptrend) and sentiment (no panic signals). The critical disagreement lies in time horizon—bear arguments center on intermediate-term risks (3–12 months), while bull arguments primarily reflect short-term (1–2 week) momentum. The debate did not reach a clear directional consensus, but revealed a structural asymmetry in the risk/reward profile—upside potential exists (technical trend), but downside risks are well-argued (valuation + macro)."
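The Weighted column is consistent with the 30/30/20/20 weights (logic/evidence/clarity/persuasiveness) this article cites for the judge. Here is a minimal sketch of that arithmetic, plus a grouping of the rows by status. It assumes a plain weighted average rounded to one decimal; `weighted_score`, `ROWS`, and `by_status` are illustrative names, not the engine's actual code.

```python
from collections import defaultdict

# Percent weights per dimension (the article's stated initial values).
WEIGHTS = {"logic": 30, "evidence": 30, "clarity": 20, "persuasiveness": 20}


def weighted_score(logic, evidence, clarity, persuasiveness):
    """Weighted average of the four dimension scores, rounded to one decimal."""
    parts = {
        "logic": logic,
        "evidence": evidence,
        "clarity": clarity,
        "persuasiveness": persuasiveness,
    }
    total = sum(WEIGHTS[dim] * parts[dim] for dim in WEIGHTS)
    return round(total / 100, 1)


# Rows copied from the table above: agent, four scores, status.
ROWS = [
    ("tech_bull", 7, 8, 8, 7, "WEAKENED"),
    ("tech_bear", 6, 6, 7, 5, "WEAKENED"),
    ("fund_bull", 8, 7, 7, 7, "UPHELD"),
    ("fund_bear", 9, 8, 9, 8, "UPHELD"),
    ("macro_bull", 7, 7, 7, 6, "WEAKENED"),
    ("macro_bear", 8, 8, 8, 8, "UPHELD"),
    ("senti_bull", 7, 7, 8, 6, "UPHELD"),
    ("senti_bear", 6, 5, 7, 5, "REFUTED"),
]

# Group arguments by final status so the trace can be read at a glance.
by_status = defaultdict(list)
for agent, lo, ev, cl, pe, status in ROWS:
    by_status[status].append((agent, weighted_score(lo, ev, cl, pe)))

for status in ("UPHELD", "WEAKENED", "REFUTED"):
    print(status, by_status[status])
```

Integer percent weights keep the arithmetic exact, so the computed values can be compared directly against the table.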
Common Pitfalls and Design Decisions
Pitfall 1: Agent Hallucination — Citing Nonexistent Knowledge Base Data
Problem: LLMs have a persistent tendency to fabricate. When prompted to "cite data from the knowledge base" and the data is not there, they may invent it.
Solution: We explicitly state in the prompt: "All evidence MUST come from the knowledge base. Do not fabricate data. If data is insufficient for a high-quality argument, lower confidence rather than inventing." Additionally, the judge agent receives the full knowledge base and can cross-verify agent-cited data during scoring—if an agent claims PE is 15 when the KB says 28.5, the judge will flag an evidence quality issue.
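The cross-verification step can also be done mechanically before the judge runs. The sketch below assumes each evidence item carries a dotted `source` path into the knowledge base and a numeric claimed value. That shape, and the names `lookup` and `check_citation`, are assumptions for illustration, not the article's actual implementation.

```python
import math
from typing import Any, Optional


def lookup(kb: dict, dotted_path: str) -> Optional[Any]:
    """Walk a dotted path such as 'indices.EXI.price'; None if any key is absent."""
    node: Any = kb
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_citation(kb: dict, source: str, claimed: float,
                   rel_tol: float = 0.01) -> str:
    """Classify a cited number as 'ok', 'mismatch', or 'not_in_kb'.

    'not_in_kb' also covers paths that resolve to a non-numeric value.
    """
    actual = lookup(kb, source)
    if not isinstance(actual, (int, float)):
        return "not_in_kb"
    return "ok" if math.isclose(actual, claimed, rel_tol=rel_tol) else "mismatch"


# Slice of the synthetic knowledge base from the demo.
kb = {"fundamentals": {"sp500_pe_approx": {"current_pe_approx": 28.5,
                                           "long_term_avg_pe": 24.0}}}
path = "fundamentals.sp500_pe_approx.current_pe_approx"

print(check_citation(kb, path, 15.0))   # agent claims PE is 15
print(check_citation(kb, path, 28.5))   # agent cites the KB value
print(check_citation(kb, "fundamentals.dividend_yield", 1.4))  # no such key
```

A deterministic check like this complements the judge rather than replacing it: it only catches numeric mismatches, not misleading interpretation.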
Pitfall 2: Paired Cross-Examination Degeneration — Both Sides "Agree to Disagree"
Problem: At certain temperature settings, the tech_bull and tech_bear may spend the cross-exam round mutually conceding each other's points—"I accept your view," "I also accept yours"—turning debate into a handshake session.
Solution: We lower the cross-examination temperature (0.4) and explicitly require in the prompt: "You MUST find weaknesses in your opponent's arguments. Do not concede to all arguments—challenge or refute at least half."
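The "challenge at least half" rule can also be enforced after generation as a guard against prompt drift. This is a sketch under stated assumptions: each cross-examination response is a dict with a `challenge_type` field (as in the exchange shown earlier), and a hypothetical value `"concede"` marks a concession. The function name and threshold parameter are illustrative.

```python
def is_degenerate(responses: list, min_challenge_ratio: float = 0.5) -> bool:
    """Return True when fewer than `min_challenge_ratio` of responses challenge.

    Any challenge_type other than the hypothetical "concede" counts as a challenge.
    """
    if not responses:
        return True  # an empty round challenged nothing
    challenges = sum(1 for r in responses if r.get("challenge_type") != "concede")
    return challenges / len(responses) < min_challenge_ratio


handshake = [
    {"target_id": "tech_bull_arg_0", "challenge_type": "concede"},
    {"target_id": "tech_bull_arg_1", "challenge_type": "concede"},
    {"target_id": "tech_bull_arg_2", "challenge_type": "concede"},
    {"target_id": "tech_bull_arg_3", "challenge_type": "question_evidence"},
]
healthy = [
    {"target_id": "tech_bull_arg_0", "challenge_type": "question_evidence"},
    {"target_id": "tech_bull_arg_1", "challenge_type": "concede"},
]

print(is_degenerate(handshake))
print(is_degenerate(healthy))
```

A round flagged as degenerate could be regenerated or excluded from scoring. Which response is appropriate depends on the orchestrator and is left open here.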
Pitfall 3: Judge Score Inflation — All Arguments Score 7–9
Problem: An untuned judge tends to produce uniformly high scores. If every argument scores between 7 and 9, the scoring system loses discriminative power.
Solution: Add distribution guidance to the judge prompt: "Good arguments should score 6–8, not 9–10. Reserve 9–10 for near-perfect reasoning—logic that is watertight and evidence that is irrefutable." Additionally, Article 3 will introduce score calibration—using historical backtest data to tune judge weights.
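Score inflation is easy to detect once scores are collected. The sketch below flags a batch when every score lands in the 7 to 9 band or when the spread between best and worst is small. The function name and the `min_spread` threshold are illustrative assumptions, not calibrated values.

```python
from statistics import pstdev


def inflation_report(scores, band=(7.0, 9.0), min_spread=1.0):
    """Summarize how concentrated a batch of judge scores is."""
    low, high = band
    in_band = sum(1 for s in scores if low <= s <= high)
    share = in_band / len(scores)
    spread = max(scores) - min(scores)
    return {
        "share_in_band": share,
        "spread": spread,
        "stdev": pstdev(scores),
        # Inflated if everything sits in the band or the scores barely differ.
        "inflated": share == 1.0 or spread < min_spread,
    }


# Weighted scores from the judge table in this article.
table_scores = [7.5, 6.0, 7.3, 8.5, 6.8, 8.0, 7.0, 5.7]
# A hypothetical batch from an untuned judge.
untuned_scores = [7.0, 8.0, 8.0, 9.0, 7.5]

print(inflation_report(table_scores)["inflated"])
print(inflation_report(untuned_scores)["inflated"])
```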
Design Decision: Why 4 Dimensions (Instead of the L2 Debate's logic/evidence/responsiveness/honesty)?
The L2 general debate protocol used logic/evidence/responsiveness/honesty. The market debate protocol adapts them to logic/evidence/clarity/persuasiveness. Two reasons:
- "Responsiveness" is harder to assess in market analysis: In a general debate, you can judge "did the pro side directly respond to the con side's challenge?" But in market analysis, many responses are "I acknowledge your data, but I interpret it differently"—this is not evasion, it is legitimate viewpoint divergence. Clarity and persuasiveness more accurately capture the quality of market arguments.
- Persuasiveness implicitly captures the benefits of honesty: An argument that openly acknowledges its own weak points is more persuasive, because the reader trusts the author. The persuasiveness dimension therefore subsumes the honesty consideration.
Key Takeaways
- Structured protocol = auditable debate: Every argument has an ID, claim, evidence, and confidence score, so the judge can reference and score each one precisely. A free-form conversation offers no stable unit to point at, which makes the same audit far harder.
- Specialized prompts drive genuine diversity: The 8 agents differ not because "some are bullish and some are bearish"—but because of analytical lens and data slice differences. A technical analyst and a macro economist looking at the same market see different things—that is the source of diversity's value.
- Paired cross-examination constrains quality: Within-domain pairing (tech vs. tech, fundamental vs. fundamental) ensures every cross-examination is conducted by a domain expert. Cross-domain attacks tend to produce noise rather than signal.
- Judge scoring ≠ directional prediction: The judge evaluates debate quality—not who "wins." Backtesting and validation is the subject of Article 3.
- The argument trace table is the most practical output: From the chaos of 8 agents × 3 rounds × N arguments, the trace table extracts the only information that matters: which arguments held up, which were refuted, and which remain uncertain. This is the bridge from debate to decision.
Article 3 Teaser: Backtesting & Validation
You now have a working debate engine. It produces high-quality adversarial analysis. But one critical question remains: does the debate actually improve market analysis accuracy?
In Article 3, we will answer this question with hard data:
- 100 historical debate backtests: Run the debate engine across 100 historical periods with known outcomes—how well does the debate output correlate with actual market direction?
- Single-agent baseline comparison: Same data, a single LLM's "analyze the market" output vs. the 8-agent debate judge synthesis—which is more accurate? What is the gap?
- Judge weight calibration: The current 4-dimension weights (30/30/20/20) are intuition-based. Using backtest data, we can calibrate to optimal weights—which dimensions are most predictive of market direction?
- Confidence validation: Do agent-reported confidence scores correlate with actual accuracy? Are high-confidence arguments more likely to hold up than low-confidence ones?
But before then, run today's code. Pull real data with Article 1's pipeline. Run a debate with today's engine. Read the argument trace table. See which arguments held up and which were refuted. Then ask yourself: if I were looking at this data alone, would I have noticed all of this?
📖 Previous: Multi-Agent Debate × Market Analysis — Architecture & Data Pipeline (building the knowledge base)
📖 Debate Theory: Multi-Agent Debate L2: Structured Debate Protocol · L3: Scoring & Consensus
📖 Next: Article 3 — Backtesting & Validation
⚠️ Disclaimer: This article is a technical workflow demonstration, not financial advice. All market data in this article is synthetic/fictional (ExampleIndex is not a real index). The debate engine's output cannot and should not be used as the basis for actual investment decisions. Financial markets carry inherent risk, and any automated analysis system may produce incorrect conclusions. Consult a licensed financial professional before making any investment decisions.
Frequently Asked Questions
Q: Why is cross-examination paired rather than free-for-all?
A: A free-for-all (every agent attacks every other agent) produces 8×7=56 attack vectors — that's noise, not signal. The paired design (Tech Bull vs. Tech Bear, Fund Bull vs. Fund Bear, etc.) ensures each cross-examination is focused, deep, and comparable across pairs. The Tech Bull and Tech Bear aren't arguing about whether technical analysis works — they both agree it does. They're arguing about what the technical data means *right now*. This follows the same design principle as the L2 series: "one judge evaluates one debate dimension" — constraint creates quality. Additionally, 4 parallel pairs keep total round time at ~12 seconds.
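The 56-versus-4 comparison can be checked by enumeration. This sketch uses the agent names from the Round 1 table; the variable names are illustrative.

```python
from itertools import permutations

DOMAINS = ["tech", "fund", "macro", "senti"]
agents = [f"{domain}_{side}" for domain in DOMAINS for side in ("bull", "bear")]

# Free-for-all: every ordered (attacker, target) combination of distinct agents.
free_for_all = list(permutations(agents, 2))

# Paired design: one bull/bear pair per analytical domain.
paired = [(f"{domain}_bull", f"{domain}_bear") for domain in DOMAINS]

print(len(agents), len(free_for_all), len(paired))
```

Each pair attacks in both directions, so the paired design yields 8 directed attack vectors instead of 56.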
Q: What fields does the agent output JSON Schema include? Why structured output?
A: Each agent's argument output follows a strict JSON format: claim (core thesis, 1-2 sentences), evidence (list of evidence items, each containing data_point — specific values from the knowledge base — and source — module reference), confidence (0-1 score), and assumptions (list of key assumptions — explicitly stating which premises would invalidate the argument if overturned). Structured output isn't about "looking clean" — it's about machine readability and cross-agent comparability. The judge can automatically parse each argument's evidence sources and confidence levels rather than relying on natural language understanding to guess.
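As a rough illustration of the fields listed above, the sketch below parses and validates one argument with plain dataclasses. The class names, the validation rules, and the sample assumption text are hypothetical; the actual engine may validate differently (for example, with a JSON Schema library).

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    data_point: str  # specific value taken from the knowledge base
    source: str      # module reference, e.g. a dotted KB path


@dataclass
class Argument:
    claim: str
    evidence: List[Evidence]
    confidence: float
    assumptions: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject arguments a judge could not score meaningfully.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.evidence:
            raise ValueError("at least one evidence item is required")


def parse_argument(raw: str) -> Argument:
    """Parse one agent's JSON output into a validated Argument."""
    data = json.loads(raw)
    evidence = [Evidence(item["data_point"], item["source"])
                for item in data["evidence"]]
    return Argument(
        claim=data["claim"],
        evidence=evidence,
        confidence=float(data["confidence"]),
        assumptions=list(data.get("assumptions", [])),
    )


raw = json.dumps({
    "claim": "Price above all MAs, bullish alignment confirms uptrend",
    "evidence": [{"data_point": "ma20, ma50, ma200 all 'above'",
                  "source": "technicals.EXI.ma_status"}],
    "confidence": 0.85,
    "assumptions": ["Trend persists over the one-month horizon"],  # illustrative
})
arg = parse_argument(raw)
print(arg.claim, arg.confidence, arg.evidence[0].source)
```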
Q: Why these four judge scoring dimensions — Logic, Evidence, Clarity, Persuasiveness?
A: These four dimensions come from the L3 debate theory multi-dimensional scoring framework, adapted for market analysis. Logic evaluates whether the reasoning chain is internally consistent. Evidence evaluates the specificity and verifiability of cited data ("SPX RSI is 58, in neutral territory" vs. "the market looks strong"). Clarity evaluates whether the argument can be accurately understood and responded to by opponent agents. Persuasiveness evaluates how well the argument survives cross-examination. The 30/30/20/20 weights are initial intuition values — Article 3's backtesting will calibrate optimal weights via grid search.
Q: What role does prompt engineering play in the debate protocol? How do you prevent agent hallucination?
A: Prompts are the core quality control layer for the debate. Each agent's system prompt contains four constraints: role definition (analytical lens + directional stance), output format constraint (must follow JSON Schema), evidence citation requirement (must reference specific data from the knowledge base slice; anything not in the KB must be marked as "inferred", not "fact"), and reasoning boundary (explicitly state what the agent should NOT comment on outside its analytical scope). Anchoring evidence to knowledge base data (rather than the agent's training knowledge) reduces hallucination but does not eliminate it: an agent can still invent a figure, but an invented figure will not match the KB, so the judge's cross-check can flag it.
Q: How does the debate protocol integrate with Article 1's data pipeline?
A: The data pipeline (market_data_pipeline.py) outputs market_knowledge_base.json. The debate protocol (debate_protocol.py) takes that knowledge base and calls slice_for_agent() to extract each agent's dedicated data slice. Integration flow: kb = build_knowledge_base() (data pipeline) → orchestrator = DebateOrchestrator(kb) (debate protocol) → transcript = orchestrator.run_debate(topic). The dependency is one-directional: the pipeline is the provider and the protocol is its consumer. The caching layer (Article 4) will add a TTL on top so data is not re-fetched for every debate.