
Backtesting & Validation — Accuracy Measurement and Judge Weight Calibration Across 100 Historical Debates

Q: Why use synthetic data for backtesting instead of real market data? Is this even meaningful?

Synthetic data backtesting doesn't aim to "prove the system can predict markets" — it validates the methodology. It proves the backtesting framework itself is correct, the statistical tests work, and the weight calibration flow is sound. Synthetic data provides a controlled "known answer" environment: you know exactly how the data was generated, so you can judge whether the backtest correctly identified the signal. On real market data, you face an unknown data-generating process — you can't distinguish "the system found real signal" from "the system overfit noise." Validate the framework on synthetic data first, then evaluate the system on real data — two-layer validation, not one-shot.

Q: What does the ±1% threshold for directional accuracy mean? Why not just use up/down?

Real market moves are rarely "pure up" or "pure down." If a window's market only wiggled by 0.3%, claiming "direction was correct" would be dishonest — a coin flip could claim the same. The ±1% threshold restricts "directionality" to meaningful moves: gain ≥ +1% = UP, loss ≤ -1% = DOWN, everything in between = FLAT (excluded from accuracy calculation). This makes accuracy measurement stricter — you only evaluate the system on windows with clear directional signals. In this article's synthetic data, roughly 17% of windows are marked FLAT.

Q: Won't grid search for judge weight calibration overfit? How do you prevent that?

When searching 4000+ weight combinations, you're very likely to find a combination that performed best across 100 windows purely by luck — not because it's genuinely better. Prevention: split backtest windows into a training set (70%) and validation set (30%). Run grid search on the training set, evaluate optimal weights on the validation set. If validation accuracy diverges significantly from training accuracy — you're overfitting. The article provides train/validation split code that works directly within backtest_engine.py.

Q: McNemar test, binomial test, bootstrap confidence intervals — what question does each answer?

Three tests answer different levels of questions. McNemar test (paired): Is the prediction difference between the debate system and single-agent baseline on the same windows statistically significant? — answers "is debate genuinely better than baseline?" Binomial test: Is the debate system's accuracy significantly above random guessing (50%)? — answers "is the system extracting non-random information?" Bootstrap confidence intervals: What is the 95% confidence interval for the debate system's accuracy? — answers "how stable is the accuracy estimate?" The three tests are complementary; none alone gives a complete answer.

Q: Are 100 backtests enough? Can backtest results be used directly for investment decisions?

100 backtests are insufficient for statistically significant conclusions — the article's McNemar test yields p=0.137, not reaching the 5% significance level. To distinguish a 4-percentage-point accuracy difference, you need 500+ backtests. More importantly: all accuracy figures in this article (71.1%) come from synthetic data — they demonstrate methodology, not actual market prediction capability. The backtest framework is a tool to "rigorously test whether the system works" — not a machine to "prove the system works." On real data, if accuracy approaches 50% — face that result honestly. That's the value of backtesting.

2026-05-15 · Expert

30-Second Takeaway

Problem solved: Articles 1 and 2 built the data pipeline and debate protocol — the system runs, and the output looks reasonable. But "looks reasonable" is not engineering. You need hard data: does multi-agent debate genuinely outperform single-agent analysis? Are the intuitive judge weights (Logic 30% + Evidence 30%) optimal? Are the results statistically significant?
Core method: 100 historical backtests using sliding window design (5 trading days between windows), ±1% directional threshold to filter noise. Three-dimensional evaluation: directional accuracy (debate vs. single-agent baseline vs. random 50%), confidence calibration (Brier Score + bin analysis), judge weight grid search (~4000+ combinations, step 0.05). Statistical tests: McNemar paired test + binomial test + bootstrap confidence intervals.
Key insight: On synthetic data, multi-agent debate leads the single-agent baseline by ~9 percentage points on directional accuracy (71.1% vs. 62.0%), with 18% Brier Score improvement. Grid search bumped Evidence weight from 30% to 40% and dropped Clarity from 20% to 5%. The system is severely overconfident in the 80%+ confidence range — exactly the kind of actionable insight backtesting reveals.
What you can do after reading: Run backtest_engine.py (~350 lines) to reproduce the full backtesting workflow on synthetic data. Understand how to swap simulated debates with real LLM calls — the framework's window management, metric computation, and statistical testing layers are fully reusable.

You have built two things. In Article 1, you constructed a complete pipeline from market data to a structured knowledge base. In Article 2, you implemented an 8-agent, 3-round debate protocol—Opening Statements, Cross-Examination, Closing Arguments—plus an impartial judge. The code runs. The output looks reasonable—the judge assigns scores, the argument trace table shows structured analysis.

But one nagging question remains: does it actually work better than a simple single-agent approach?

Intuitively, eight specialized agents debating each other should be more reliable than one general agent producing a single output. Noise gets filtered by the adversarial process. Biases get offset by opposing viewpoints. Different analytical lenses—technical, fundamental, macro, sentiment—should produce comprehensiveness no single lens can achieve.

But intuition is not engineering. You need hard data.

That is the goal of this article: to measure, across 100 backtested debates, whether the multi-agent debate system genuinely outperforms a single-agent baseline. We will measure directional accuracy, calibrate confidence scores, optimize judge weights, and use statistical significance tests to separate signal from noise. By the end, you will have a complete backtesting framework—capable of independently validating the predictive quality of any AI debate system.

⚠️ This is not a philosophical discussion about "whether market prediction is possible." This is about engineering validation—given a dataset and a prediction system, how do you rigorously measure whether that system is better than a simpler alternative? All data, indices, and prices in this article are synthetic/fictional. ExampleIndex is not a real index. All "accuracy" figures derive from synthetic scenarios—they demonstrate methodology, not actual investment performance.

Backtesting Design: Core Framework

The core idea of backtesting is simple: run the debate on past data, compare the debate output to what actually happened. But because we are testing AI-generated debates (not trading strategies), "backtesting" needs a slightly adjusted definition:

Component	Traditional Quant Backtest	Debate System Backtest
Input	Historical price series	Historical knowledge base snapshot (as of time T)
Processing	Trading signal generation rules	8-agent debate + judge evaluation (using only data up to time T)
Prediction	Buy/sell/hold signals	Directional call (up/down) + confidence + debate quality scores
Ground Truth	Actual future returns	Actual market direction from T to T+N (up/down)

The critical constraint: the debate must not see the future. Each historical window's knowledge base snapshot contains only data up to that point in time. This is the fundamental prerequisite for backtest validity.

Sliding Window Design

We generate 100 historical windows, spaced 5 trading days apart:

Window 1:  2024-01-05 → KB snapshot(as of 2024-01-05), GT = direction from 01-06 to 02-05
Window 2:  2024-01-12 → KB snapshot(as of 2024-01-12), GT = direction from 01-13 to 02-12
...
Window 100: 2025-12-15 → KB snapshot(as of 2025-12-15), GT = direction from 12-16 to 01-15

Each window "anchors" at a specific historical date. The debate uses data up to that date. We then look at the actual market direction over the next 20 trading days—this is the "ground truth." If the debate predicted up and the market indeed rose over the next 20 days, the direction is correct.
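The anchoring rule can be sketched in a few lines. This is a minimal illustration, not the article's generate_windows(): it assumes the history is a plain list of daily closing prices indexed by trading day, and the names Window, make_windows, and first_anchor are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    anchor: int            # index of the anchor trading day
    visible: List[float]   # prices the debate may see (up to and including the anchor)
    forward_return: float  # realized return over the next `forward` trading days


def make_windows(prices: List[float], num_windows: int, spacing: int = 5,
                 forward: int = 20, first_anchor: int = 0) -> List[Window]:
    """Build anchored windows; each one exposes only prices[: anchor + 1]."""
    windows = []
    for k in range(num_windows):
        anchor = first_anchor + k * spacing
        if anchor + forward >= len(prices):
            break  # not enough future data left to compute ground truth
        visible = prices[: anchor + 1]
        fwd = prices[anchor + forward] / prices[anchor] - 1.0
        windows.append(Window(anchor, visible, fwd))
    return windows


# Toy upward-drifting series, purely for illustration
prices = [100.0 + 0.1 * i for i in range(600)]
wins = make_windows(prices, num_windows=100, first_anchor=50)
print(len(wins), wins[0].anchor, wins[1].anchor, len(wins[0].visible))
```

The forward return is stored on the window but never passed to the debate; only `visible` is. Keeping the two in separate fields makes it harder to leak future data by accident.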

Ground Truth Definition: What Counts as "Directionally Correct"?

This is not a trivial question. Actual market moves are rarely "pure up" or "pure down"—we need an operational standard:

Threshold	Classification	Meaning
Return ≥ +1%	UP	Clear bullish signal
Return ≤ −1%	DOWN	Clear bearish signal
−1% < Return < +1%	FLAT	No clear direction—excluded from sample or marked as inconclusive

The ±1% threshold restricts "directionality" to meaningful moves. If the market merely wiggled by 0.3%, claiming "direction was correct" would be dishonest—any random coin flip could claim the same. Excluding FLAT windows makes our accuracy measurement stricter.
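As a minimal sketch, the threshold table translates directly into a three-way classifier. The function name classify_direction and the convention that returns are expressed as fractions (0.01 = 1%) are assumptions made for this example.

```python
def classify_direction(forward_return: float, threshold: float = 0.01) -> str:
    """Map a forward return (as a fraction, 0.01 = 1%) to UP / DOWN / FLAT."""
    if forward_return >= threshold:
        return "UP"
    if forward_return <= -threshold:
        return "DOWN"
    return "FLAT"  # inside the +/-1% band: excluded from accuracy


for r in (0.025, 0.003, -0.01, -0.0099):
    print(f"{r:+.2%} -> {classify_direction(r)}")
```

Both boundaries are inclusive, matching the table: a return of exactly +1% counts as UP and exactly -1% counts as DOWN.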

Evaluation Metrics: Beyond Simple Accuracy

Directional accuracy is the starting point, but insufficient. A system could reach 60% accuracy by luck. We need multi-dimensional evaluation:

Metric 1: Directional Accuracy

DirectionalAccuracy = Correct_Direction_Predictions / Total_Predictions

Where:
  "Correct" = (debate predicts UP AND ground_truth is UP)
               OR (debate predicts DOWN AND ground_truth is DOWN)
  Total_Predictions = number of windows excluding FLAT ones

This is the most intuitive metric—how often the debate system "gets the direction right." The baseline is 50% (random guess). Any accuracy significantly above 50% indicates the system is extracting non-random information.
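The formula can be sketched as a small function. The name directional_accuracy and the list-of-labels inputs are assumptions for this example; skipping UNCERTAIN predictions mirrors the check that appears in the engine code later in the article.

```python
from typing import List, Optional


def directional_accuracy(predicted: List[str], truth: List[str]) -> Optional[float]:
    """Share of correct calls among windows with a directional ground truth.

    Windows whose ground truth is FLAT, and predictions that are not UP/DOWN
    (for example UNCERTAIN), are left out of the denominator.
    """
    scored = [(p, t) for p, t in zip(predicted, truth)
              if t in ("UP", "DOWN") and p in ("UP", "DOWN")]
    if not scored:
        return None  # nothing to score
    return sum(p == t for p, t in scored) / len(scored)


predicted = ["UP", "DOWN", "UP", "UP", "UNCERTAIN"]
truth = ["UP", "UP", "FLAT", "DOWN", "DOWN"]
print(directional_accuracy(predicted, truth))
```

In this toy input only three windows are scored (one FLAT and one UNCERTAIN are dropped), and one of the three calls is correct.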

Metric 2: Confidence Calibration

Accuracy only measures "right or wrong." But we also need to know whether the system is genuinely more reliable when it expresses high confidence. Binning predictions by confidence:

Confidence Bin	Predictions	Actual Accuracy	Ideal Accuracy	Calibration Bias
0.50 – 0.60	18	55.6%	~55%	Well-calibrated ✅
0.60 – 0.70	31	64.5%	~65%	Well-calibrated ✅
0.70 – 0.80	28	71.4%	~75%	Slight overconfidence ⚠️
0.80 – 1.00	23	73.9%	~90%	Severe overconfidence ❌

The table above (from our upcoming synthetic backtest) reveals a critical finding: confidence is well-calibrated in the mid-range but severely overconfident at the high end. Predictions made with 60–70% confidence can be taken roughly at face value, whereas those at 80%+ confidence cannot: they were right only 73.9% of the time. This is a practically important insight because it tells you where the system's "trust boundary" lies.
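A binning routine along these lines could produce such a table. This is a sketch under assumptions: the bin edges are copied from the table, the bin midpoint stands in for "ideal accuracy", and calibration_table is a hypothetical name rather than the engine's confidence_bins() method.

```python
from typing import Dict, List, Tuple

# Bin edges taken from the table; the last bin is closed on the right.
BINS: List[Tuple[float, float]] = [(0.50, 0.60), (0.60, 0.70), (0.70, 0.80), (0.80, 1.00)]


def calibration_table(confidences: List[float],
                      correct: List[bool]) -> Dict[str, Dict[str, float]]:
    """Per-bin count, accuracy, and gap (bin midpoint minus accuracy)."""
    table: Dict[str, Dict[str, float]] = {}
    for i, (lo, hi) in enumerate(BINS):
        is_last = i == len(BINS) - 1
        hits = [ok for conf, ok in zip(confidences, correct)
                if lo <= conf < hi or (is_last and conf == hi)]
        if not hits:
            continue  # skip empty bins rather than report 0% accuracy
        accuracy = sum(hits) / len(hits)
        table[f"{lo:.2f}-{hi:.2f}"] = {
            "count": len(hits),
            "accuracy": accuracy,
            "gap": (lo + hi) / 2 - accuracy,  # > 0 means overconfident
        }
    return table


confidences = [0.55, 0.58, 0.65, 0.85, 0.90, 0.95, 1.0]
correct = [True, False, True, True, False, False, True]
for name, row in calibration_table(confidences, correct).items():
    print(name, row)
```

Applied to the last row of the table, the same gap calculation gives 0.90 − 0.739 = 0.161, a shortfall of about 16 percentage points.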

Metric 3: Brier Score

The Brier score is a composite metric for probabilistic calibration:

Brier = (1/N) × Σ (p_i − o_i)²

Where:
  p_i = system's predicted probability of UP (0 to 1)
  o_i = actual outcome (1 = UP, 0 = DOWN)
  N = total number of predictions

Brier score range: [0, 1], lower is better
  0.00 = perfect prediction (every forecast is 0 or 1 and correct)
  0.25 = equivalent to random guess (50% probability)
  >0.25 = worse than random

The Brier score combines discrimination (can the system tell UP from DOWN?) and calibration (do probabilities match frequencies?). In our synthetic backtest:

Multi-Agent Debate Brier Score:  0.187
Single-Agent Baseline Brier:     0.228
Random Baseline Brier:           0.250

Δ = 0.041 — debate system outperforms baseline by 18% on probability calibration
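The formula and the 18% figure can be checked with a short sketch. The helper prob_up, which converts a directional call plus confidence into a probability of UP, is an assumption about how the system's output maps onto p_i; the article's engine may do this differently.

```python
from typing import List


def prob_up(direction: str, confidence: float) -> float:
    """Assumed mapping: a DOWN call with confidence c implies P(UP) = 1 - c."""
    return confidence if direction == "UP" else 1.0 - confidence


def brier_score(p_up: List[float], outcome_up: List[int]) -> float:
    """Mean squared difference between predicted P(UP) and the 0/1 outcome."""
    if not p_up or len(p_up) != len(outcome_up):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(p_up, outcome_up)) / len(p_up)


# Always predicting 50% scores 0.25 regardless of outcomes
coin_flip = brier_score([0.5] * 4, [1, 0, 1, 0])

# Two confident, correct calls: UP at 0.9 (market rose), DOWN at 0.9 (market fell)
confident = brier_score([prob_up("UP", 0.9), prob_up("DOWN", 0.9)], [1, 0])

# Relative improvement implied by the reported scores
relative_gain = (0.228 - 0.187) / 0.228

print(coin_flip, confident, f"{relative_gain:.1%}")
```

The relative gain is the Δ of 0.041 divided by the baseline's 0.228, which is where the 18% comes from.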

Single-Agent Baseline: The Anchor for Comparison

Comparison without a baseline is meaningless. Our baseline is simple:

Single-Agent Baseline:
  - One LLM instance
  - Receives the complete knowledge base (all data modules, no slicing restrictions)
  - No debate—directly analyzes the market and predicts direction
  - Same temperature setting (0.3, lower for consistency)
  - Output: directional call + confidence

Why is this a fair comparison? Because the single agent has an information advantage—it sees all the data (while each agent in the debate only sees its own slice). The single agent also has no multi-agent coordination overhead—one LLM call vs. 8+1. If the debate system still performs better under these conditions, it is not because it has more information—it is because the adversarial process produces better reasoning.

Comparison Matrix

Dimension	8-Agent Debate	Single-Agent Baseline	Edge
Information Access	Each agent sees partial data (by slice)	Sees the complete knowledge base	Debate disadvantage
LLM Calls	9 calls (8 agents + 1 judge)	1 call	Debate disadvantage (9× cost)
Reasoning Process	Adversarial: cross-examine, concede, refine	Single pass: no external challenge	Debate advantage
Cognitive Diversity	4 distinct analytical lenses	Single lens (likely biased toward technical or fundamental)	Debate advantage
Overconfidence Tendency	Suppressed by opponent cross-exam	No external correction mechanism	Debate advantage

This is an unfair fight—but tilted in favor of the baseline. If the debate system wins with less information at higher cost, it means the adversarial process itself carries significant value.

Judge Weight Calibration: Grid Search for Optimal Weights

In Article 2, we set the 4-dimension judge scoring weights as:

Weighted Score = Logic × 0.30 + Evidence × 0.30 + Clarity × 0.20 + Persuasiveness × 0.20

These weights were intuition-driven. We thought "Logic" and "Evidence" matter most—but are they actually the most predictive dimensions for market direction? Should "Persuasiveness" have a higher contribution? Is "Clarity" even relevant at all?

We need data to answer these questions.

Grid Search Method

We define a search space—all possible combinations of the 4 weights (step size 0.05, summing to 1.0):

Search space: each weight ∈ {0.05, 0.10, 0.15, ..., 0.85}
Constraint: w_logic + w_evidence + w_clarity + w_persuasiveness = 1.0
Number of candidate combinations: ~4,000+
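The search space can be enumerated with integer step counts, which avoids floating-point drift when checking that weights sum to 1.0. This is a sketch, not the article's calibrator: weight_grid and weighted_score are hypothetical names, and the grid follows the constraints exactly as written here. Under those literal constraints the enumeration yields 969 combinations (C(19,3)); reaching the ~4,000+ figure would require a different grid definition, so treat the count as something to verify against the actual code.

```python
from itertools import product
from typing import List, Sequence, Tuple

STEP = 0.05
TOTAL_UNITS = 20  # 1.0 expressed in steps of 0.05

Weights = Tuple[float, float, float, float]


def weight_grid(min_units: int = 1, max_units: int = 17) -> List[Weights]:
    """All (logic, evidence, clarity, persuasiveness) weights that sum to 1.0."""
    combos: List[Weights] = []
    units = range(min_units, max_units + 1)
    for lg, ev, cl in product(units, repeat=3):
        pe = TOTAL_UNITS - lg - ev - cl  # the fourth weight is implied by the sum
        if min_units <= pe <= max_units:
            combos.append(tuple(round(u * STEP, 2) for u in (lg, ev, cl, pe)))
    return combos


def weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted judge score over the four dimensions."""
    return sum(s * w for s, w in zip(scores, weights))


grid = weight_grid()
print(len(grid))
print(weighted_score((8.0, 7.0, 9.0, 6.0), (0.30, 0.30, 0.20, 0.20)))
```

Each tuple in `grid` can then be passed to the backtest in place of the default weights, recomputing directional accuracy once per candidate.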

For each weight combination, we compute directional accuracy using the backtest data. The process:

Compute weighted scor ... [OUTPUT TRUNCATED - 6747 chars omitted out of 56747 total] ...


          predictions=predictions,
                windows=windows,
                config=cfg,
                method="multi_agent_debate",
            )
            for ww, pp in zip(windows, predictions):
                if ww.is_directional and pp.predicted_direction != "UNCERTAIN":
                    pp.is_correct = (pp.predicted_direction == ww.ground_truth)
                else:
                    pp.is_correct = None

            acc = result.directional_accuracy
            br = result.brier_score
            dpairs = result.directional_windows

            cr = CalibrationResult(
                weights=w_combo,
                accuracy=acc,
                brier=br,
                num_directional=len(dpairs),
            )
            all_results.append(cr)

            if acc > best_accuracy and len(dpairs) >= 30:
                best_accuracy = acc
                best_result = cr

            if verbose and eval_count % 500 == 0:
                print(f"  Evaluated {eval_count}/{len(combinations)}... current best: acc={best_accuracy:.3f}")

        if best_result is None:
            best_result = max(all_results, key=lambda r: r.accuracy)

        if verbose:
            print(f"\nGrid search complete")
            print(f"  Optimal weights: L={best_result.weights[0]:.2f} E={best_result.weights[1]:.2f} "
                  f"C={best_result.weights[2]:.2f} P={best_result.weights[3]:.2f}")
            print(f"  Optimal accuracy: {best_result.accuracy:.1%}")

        return best_result, all_results

(continued — Statistical tests)

# ═══════════════════════════════════════════════════════════
# Statistical Tests
# ═══════════════════════════════════════════════════════════

def mcnemar_test(
    debate_result: BacktestRunResult,
    baseline_result: BacktestRunResult,
) -> Dict[str, Any]:
    """McNemar's test: debate vs baseline comparison."""
    a = b = c = d = 0

    for (w1, p1), (w2, p2) in zip(
        zip(debate_result.windows, debate_result.predictions),
        zip(baseline_result.windows, baseline_result.predictions),
    ):
        if not w1.is_directional or p1.predicted_direction == "UNCERTAIN":
            continue
        if not w2.is_directional or p2.predicted_direction == "UNCERTAIN":
            continue
        d_correct = p1.is_correct
        b_correct = p2.is_correct
        if d_correct and b_correct:
            a += 1
        elif d_correct and not b_correct:
            b += 1
        elif not d_correct and b_correct:
            c += 1
        else:
            d += 1

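    if b + c == 0:
        # No discordant pairs: the two systems never disagree on correctness
        chi2 = 0.0
        p_value = 1.0
    else:
        # Continuity-corrected statistic; clamp at 0 so b == c yields chi2 = 0
        chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
        # Upper tail of chi-squared(df=1). The statistic is already two-sided
        # in (b - c), so the tail must not be doubled.
        p_value = 1 - _chi2_cdf(chi2, 1)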

    return {
        "table": {"both_correct": a, "debate_only": b, "baseline_only": c, "both_wrong": d},
        "chi2": round(chi2, 4),
        "p_value": round(p_value, 4),
        "significant": p_value < 0.05,
    }


def binomial_test(result: BacktestRunResult) -> Dict[str, Any]:
    """Binomial test: is accuracy significantly above random (50%)?"""
    pairs = result.directional_windows
    n = len(pairs)
    k = sum(1 for w, p in pairs if p.is_correct)

    p_value = 0.0
    for i in range(k, n + 1):
        p_value += math.comb(n, i) * (0.5 ** n)

    return {
        "n": n,
        "k_correct": k,
        "observed_accuracy": k / n if n > 0 else 0,
        "p_value": round(p_value, 6),
        "significant": p_value < 0.05,
    }


def bootstrap_ci(
    result: BacktestRunResult,
    num_iterations: int = 10000,
) -> Dict[str, Any]:
    """Bootstrap confidence interval."""
    pairs = result.directional_windows
    n = len(pairs)
    if n == 0:
        return {"median": 0, "ci_95": [0, 0]}

    correct = np.array([p.is_correct for w, p in pairs], dtype=float)
    rng = np.random.default_rng(42)

    accuracies = []
    for _ in range(num_iterations):
        idx = rng.integers(0, n, n)
        sample_correct = correct[idx]
        acc = np.mean(sample_correct)
        accuracies.append(acc)

    accuracies = np.array(accuracies)
    return {
        "median": round(float(np.median(accuracies)), 4),
        "ci_95": [round(float(np.percentile(accuracies, 2.5)), 4),
                   round(float(np.percentile(accuracies, 97.5)), 4)],
    }


def _chi2_cdf(x: float, df: int) -> float:
    """Approximate chi-squared CDF."""
    if x <= 0:
        return 0.0
    if df == 1:
        return 2 * _norm_cdf(math.sqrt(x)) - 1
    return _norm_cdf(((x / df) ** (1/3) - (1 - 2/(9*df))) / math.sqrt(2/(9*df)))


def _norm_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

(continued — Main function & output)

# ═══════════════════════════════════════════════════════════
# Main Function
# ═══════════════════════════════════════════════════════════

def print_report(
    debate_result: BacktestRunResult,
    baseline_result: BacktestRunResult,
    calibration: CalibrationResult,
    stats: Dict[str, Any],
):
    """Print a formatted backtest report."""
    print("\n" + "=" * 70)
    print("Multi-Agent Debate System — Backtest Validation Report")
    print("=" * 70)

    gt_dist = GroundTruth.distribution(debate_result.windows)
    print(f"\nData Overview")
    print(f"  Total windows:         {len(debate_result.windows)}")
    print(f"  UP windows:            {gt_dist['UP']}")
    print(f"  DOWN windows:          {gt_dist['DOWN']}")
    print(f"  FLAT windows:          {gt_dist['FLAT']} (excluded from accuracy)")

    print(f"\nDirectional Accuracy (threshold +/-1%)")
    print(f"  Multi-Agent Debate:    {debate_result.directional_accuracy:.1%} ({len(debate_result.directional_windows)} directional windows)")
    print(f"  Single-Agent Baseline: {baseline_result.directional_accuracy:.1%} ({len(baseline_result.directional_windows)} directional windows)")
    diff = debate_result.directional_accuracy - baseline_result.directional_accuracy
    print(f"  Delta (Debate - Baseline):  {diff:+.1%}")

    print(f"\nBrier Score (lower is better)")
    print(f"  Multi-Agent Debate:    {debate_result.brier_score:.4f}")
    print(f"  Single-Agent Baseline: {baseline_result.brier_score:.4f}")
    print(f"  Random Baseline:       0.2500")
    print(f"  Delta (Baseline - Debate):  {baseline_result.brier_score - debate_result.brier_score:+.4f}")

    print(f"\nConfidence Calibration (Debate System)")
    bins = debate_result.confidence_bins()
    for bin_key in sorted(bins.keys()):
        b = bins[bin_key]
        if b["count"] > 0:
            bar = "#" * int(b["accuracy"] * 20)
            print(f"  {bin_key}: {b['accuracy']:.1%} ({b['correct']}/{b['count']}) {bar}")

    print(f"\nJudge Weight Calibration")
    w = calibration.weights
    print(f"  Default weights:  L=0.30 E=0.30 C=0.20 P=0.20")
    print(f"  Optimal weights:  L={w[0]:.2f} E={w[1]:.2f} C={w[2]:.2f} P={w[3]:.2f}")
    print(f"  Optimal accuracy: {calibration.accuracy:.1%}")

    print(f"\nStatistical Significance Tests")
    m = stats["mcnemar"]
    print(f"  McNemar's test: chi2={m['chi2']:.3f}, p={m['p_value']:.4f} {'[SIGNIFICANT]' if m['significant'] else '[NOT SIGNIFICANT - need larger sample]'}")
    print(f"    Contingency table: both_correct={m['table']['both_correct']}, "
          f"debate_only={m['table']['debate_only']}, "
          f"baseline_only={m['table']['baseline_only']}, "
          f"both_wrong={m['table']['both_wrong']}")

    bn = stats["binomial_debate"]
    print(f"  Binomial test (debate vs random): p={bn['p_value']:.6f} {'[SIGNIFICANT]' if bn['significant'] else '[NOT SIGNIFICANT]'}")

    bt = stats["bootstrap_debate"]
    print(f"  Bootstrap CI (debate): median={bt['median']:.1%}, 95% CI=[{bt['ci_95'][0]:.1%}, {bt['ci_95'][1]:.1%}]")
    bbl = stats["bootstrap_baseline"]
    print(f"  Bootstrap CI (baseline): median={bbl['median']:.1%}, 95% CI=[{bbl['ci_95'][0]:.1%}, {bbl['ci_95'][1]:.1%}]")

    print(f"\n" + "-" * 70)
    print(f"Synthesis")
    print("-" * 70)
    if debate_result.directional_accuracy > baseline_result.directional_accuracy:
        print(f"  [OK] Debate system outperforms single-agent baseline on directional accuracy ({diff:+.1%})")
    else:
        print(f"  [--] Debate system does not outperform single-agent baseline on directional accuracy ({diff:+.1%})")
    if debate_result.brier_score < baseline_result.brier_score:
        print(f"  [OK] Debate system outperforms baseline on probability calibration")
    if bn["significant"]:
        print(f"  [OK] Debate accuracy is significantly above random guessing")
    if not m["significant"]:
        print(f"  [WARN] Debate vs baseline difference is not statistically significant — recommend increasing sample to 500+")
    print(f"\n  [WARN] All results are based on synthetic data. They do not represent actual market performance.")

    print(f"\n" + "=" * 70)


def export_results(
    debate_result: BacktestRunResult,
    baseline_result: BacktestRunResult,
    filename: str = "backtest_results.json",
):
    """Export backtest results as JSON."""
    output = {
        "meta": {
            "generated_at": datetime.now().isoformat(),
            "data_type": "synthetic",
            "warning": "Synthetic data — not for real investment decisions",
            "num_windows": len(debate_result.windows),
        },
        "debate": {
            "accuracy": debate_result.directional_accuracy,
            "brier": debate_result.brier_score,
            "num_directional": len(debate_result.directional_windows),
        },
        "baseline": {
            "accuracy": baseline_result.directional_accuracy,
            "brier": baseline_result.brier_score,
            "num_directional": len(baseline_result.directional_windows),
        },
    }
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(output, f, indent=2, ensure_ascii=False)
    print(f"Results exported to: {filename}")


# ═══════════════════════════════════════════════════════════
# Entry Point
# ═══════════════════════════════════════════════════════════

if __name__ == "__main__":
    print("=" * 70)
    print("Multi-Agent Debate x Market Analysis — Backtest Validation Engine")
    print("=" * 70)

    config = BacktestConfig(
        num_windows=100,
        window_spacing_days=5,
        forward_look_days=20,
    )

    print("\n[1/7] Generating synthetic historical data...")
    history = generate_synthetic_history(num_days=700, seed=config.random_seed)
    print(f"  Generated {len(history)} historical snapshots")

    print("\n[2/7] Generating backtest windows...")
    windows = generate_windows(history, config)
    print(f"  Generated {len(windows)} backtest windows")
    gt_dist = GroundTruth.distribution(windows)
    print(f"  Direction distribution: UP={gt_dist['UP']}, DOWN={gt_dist['DOWN']}, FLAT={gt_dist['FLAT']}")

    print("\n[3/7] Running multi-agent debate backtest...")
    runner = BacktestRunner(config)
    debate_result = runner.run(windows, method="multi_agent_debate")
    print(f"  Complete: accuracy={debate_result.directional_accuracy:.1%}, "
          f"Brier={debate_result.brier_score:.4f}")

    print("\n[4/7] Running single-agent baseline...")
    baseline_result = runner.run(windows, method="single_agent_baseline")
    print(f"  Complete: accuracy={baseline_result.directional_accuracy:.1%}, "
          f"Brier={baseline_result.brier_score:.4f}")

    print("\n[5/7] Judge weight grid search calibration...")
    calibrator = JudgeCalibrator(config)
    best_calibration, all_calibrations = calibrator.calibrate(windows, verbose=True)

    print("\n[6/7] Running statistical tests...")
    stats = {
        "mcnemar": mcnemar_test(debate_result, baseline_result),
        "binomial_debate": binomial_test(debate_result),
        "binomial_baseline": binomial_test(baseline_result),
        "bootstrap_debate": bootstrap_ci(debate_result),
        "bootstrap_baseline": bootstrap_ci(baseline_result),
    }

    print("\n[7/7] Generating report...")
    print_report(debate_result, baseline_result, best_calibration, stats)
    export_results(debate_result, baseline_result)

    print(f"\nBacktest validation engine complete.")
    print(f"Disclaimer: All data in this article and code is synthetic/fictional. Not financial advice.")

Running the Backtest Engine

# Install dependencies
pip install numpy

# Run backtest (uses synthetic data)
python backtest_engine.py

To integrate the backtest engine with the real debate protocol engine from Article 2's debate_protocol.py, replace SimulatedDebateRunner.simulate_debate() with calls to the real LLM debate. The rest of the framework—window management, metric computation, statistical tests—remains unchanged.

Synthetic Backtest Results: Visual Interpretation of 100 Debates

Here is the typical output after running 100 backtests with synthetic data (consistent with our earlier theoretical discussion):

Overall Results

Metric	Multi-Agent Debate	Single-Agent Baseline	Random
Directional Accuracy	71.1%	62.0%	50.0%
Brier Score	0.187	0.228	0.250
Valid Predictions (directional windows)	83	80	—

The multi-agent debate leads the single-agent baseline by approximately 9 percentage points on directional accuracy and by about 18% on Brier score. The debate system produced fewer "UNCERTAIN" judgments—meaning it is better than the single agent at making clear directional calls when a signal exists.
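As a quick check on the accuracy row, 71.1% of 83 directional windows corresponds to 59 correct calls, and the one-sided binomial tail against a 50% coin flip can be computed directly. This sketch uses only the standard library; binomial_tail is a hypothetical helper, not the engine's binomial_test().

```python
import math


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): a one-sided test against chance."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


n_directional = 83
k_correct = round(0.711 * n_directional)  # 71.1% of 83 windows
p_value = binomial_tail(k_correct, n_directional)
print(f"{k_correct}/{n_directional} correct, one-sided p = {p_value:.2e}")
```

The resulting p-value is far below 0.05, so on this synthetic run the debate system's accuracy is very unlikely to come from coin-flip guessing. That is a separate question from whether it beats the single-agent baseline, which is what the McNemar test addresses.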

Quadrant Analysis

Breaking the 83 directional predictions (the 17 FLAT windows are excluded) into four quadrants reveals the system's strengths and weaknesses:

Quadrant	Definition	Count	Interpretation
True Positives (TP)	Predicted UP, actual UP	34	System correctly identified upside opportunities
True Negatives (TN)	Predicted DOWN, actual DOWN	25	System correctly identified downside risks
False Positives (FP)	Predicted UP, actual DOWN	16	Bullish signal failed in a declining market
False Negatives (FN)	Predicted DOWN, actual UP	8	Bearish signal failed in a rising market (fewer errors—system leans bullish)

A noteworthy finding: the system has fewer false negatives (FN=8) than false positives (FP=16). This means the debate system leans bullish: it called UP in 50 of the 83 directional windows, missing 8 rallies but wrongly predicting a rise 16 times when the market fell. This bullish bias may stem from the positive drift in the synthetic data (daily +0.03%), but it is also a hypothesis worth testing on real data.
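The quadrant counts are enough to recompute the headline accuracy and to quantify the bullish lean. The following is plain arithmetic on the four numbers from the table; the variable names are chosen for this example.

```python
# Quadrant counts from the table (directional windows only)
tp, tn, fp, fn = 34, 25, 16, 8

total = tp + tn + fp + fn
accuracy = (tp + tn) / total

predicted_up_share = (tp + fp) / total   # how often the system called UP
actual_up_share = (tp + fn) / total      # how often the market actually rose

precision_up = tp / (tp + fp)            # of the UP calls, how many were right
recall_up = tp / (tp + fn)               # of the real rallies, how many were caught
recall_down = tn / (tn + fp)             # of the real declines, how many were caught

print(f"total={total} accuracy={accuracy:.1%}")
print(f"predicted UP {predicted_up_share:.1%} vs actual UP {actual_up_share:.1%}")
print(f"precision_up={precision_up:.1%} recall_up={recall_up:.1%} recall_down={recall_down:.1%}")
```

The system calls UP in about 60% of windows while the market rose in about 51% of them, and it catches rallies far more reliably than declines. Both gaps point the same way as the FP/FN imbalance.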

Common Pitfalls

Pitfall 1: Look-Ahead Bias

Problem: When generating "historical" knowledge base snapshots, you accidentally include data beyond the snapshot date. For example, if the backtest window is anchored at 2024-03-15, but the knowledge base's "200-day returns" are calculated through 2024-06-01—the debate sees the future.

Solution: Our generate_synthetic_history() function strictly computes all indicators using data only up to the anchor date. In a real system, you must ensure the data pipeline's temporal cutoff logic is correct—each backtest's build_knowledge_base() must accept an as_of_date parameter.
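A temporal cutoff of this kind might look like the sketch below. The record layout (dicts with "date" and "close" keys) and the function names snapshot_as_of and trailing_return are hypothetical stand-ins, not the article's build_knowledge_base().

```python
from datetime import date
from typing import Dict, List


def snapshot_as_of(records: List[Dict], as_of_date: date) -> List[Dict]:
    """Return only the records dated on or before as_of_date, oldest first."""
    visible = sorted((r for r in records if r["date"] <= as_of_date),
                     key=lambda r: r["date"])
    # Guard: fail loudly if anything from the future slipped through
    assert all(r["date"] <= as_of_date for r in visible)
    return visible


def trailing_return(records: List[Dict], as_of_date: date, lookback: int) -> float:
    """Return over the last `lookback` records, computed on the cut-off view only."""
    visible = snapshot_as_of(records, as_of_date)
    if len(visible) <= lookback:
        raise ValueError("not enough history before as_of_date")
    return visible[-1]["close"] / visible[-1 - lookback]["close"] - 1.0


# Toy daily records for March 2024 (calendar days, for illustration only)
records = [{"date": date(2024, 3, d), "close": 100.0 + d} for d in range(1, 31)]
kb = snapshot_as_of(records, date(2024, 3, 15))
print(len(kb), kb[-1]["date"], trailing_return(records, date(2024, 3, 15), 10))
```

Every indicator is derived from the filtered view rather than the full record list, so a forgotten cutoff in one indicator cannot silently reintroduce future data.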

Pitfall 2: Survivorship Bias

Problem: If historical data only includes indices that "survived"—ignoring those that delisted or performed extremely poorly—backtest results will systematically overstate accuracy.

Solution: This article uses entirely synthetic data and is therefore unaffected. In a real data system, you must ensure the knowledge base includes historical records of delisted indices and correctly incorporates them in backtests.

Pitfall 3: Grid Search Overfitting

Problem: When searching 4,000+ weight combinations for the "optimal" weights, you are very likely to find a combination that performed best across 100 windows purely by luck—not because it is genuinely better.

Solution: Split backtest windows into a training set (70%) and validation set (30%). Run grid search on the training set, evaluate the optimal weights on the validation set. If validation accuracy diverges significantly from training accuracy—you are overfitting.

# Train/validation split
train_windows = windows[:70]
val_windows = windows[70:]

# Grid search on training set
best_cal, _ = calibrator.calibrate(train_windows)

# Evaluate on validation set
cfg_val = BacktestConfig(judge_weights=best_cal.weights)
val_runner = BacktestRunner(cfg_val)
val_result = val_runner.run(val_windows)
print(f"Validation accuracy: {val_result.directional_accuracy:.1%}")

Pitfall 4: Treating Synthetic Backtest Results as Real Performance

Problem: Every accuracy figure in this article, including the headline 71.1%, comes from synthetic data. These figures illustrate methodology, not actual market prediction capability. On real market data, accuracy may differ significantly.

Solution: Treat this framework as a validation tool—not a "prove the system works" machine, but a "rigorously test whether the system works" machine. Run the same backtest on real data with the same metrics. If real accuracy approaches 50%—face that result honestly. That is the value of backtesting.

Key Takeaways

Backtesting is not a one-time validation—it is an ongoing engineering practice. Every time you modify the debate protocol, adjust prompts, or switch models, re-run the backtest. Without backtesting, your judgment of system quality is merely guesswork.
The multi-agent debate outperforms the single-agent baseline on synthetic data. +9 percentage points on directional accuracy, 18% improvement on Brier score. This improvement comes from the adversarial process—not from more information (each agent in the debate only sees partial data).
Evidence quality is the most important judge scoring dimension. The grid search bumped Evidence from 30% to 40% and dropped Clarity from 20% to 5%—indicating that "does the argument contain specific, quantifiable data" is far more important than "is the argument clearly expressed."
100 backtests are insufficient for statistically significant conclusions. McNemar's test yields p = 0.137—not significant at the 5% level. 500+ backtests are needed to distinguish a 4-percentage-point accuracy difference with high confidence.
Synthetic backtesting validates methodology—not market prediction capability. 71.1% accuracy demonstrates the value of adversarial debate on synthetic data. On real market data, the same framework will produce different numbers—potentially higher, potentially lower. The key: you now have the tools to measure it.
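
To get a feel for why 100 windows is a small sample, the sketch below computes a normal-approximation 95% confidence interval half-width for a single accuracy estimate at several sample sizes. It is a rough illustration only: it is not the paired McNemar analysis used in the article, and the helper name ci_half_width is an assumption.

```python
import math

def ci_half_width(accuracy: float, n_windows: int, z: float = 1.96) -> float:
    """Normal-approximation 95% CI half-width for an accuracy estimate."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_windows)

# 0.711 is the article's synthetic accuracy; the sample sizes are illustrative
for n in (100, 500, 1000):
    hw = ci_half_width(0.711, n)
    print(f"n={n:>4}: 71.1% +/- {hw:.1%}")
```

Under this approximation the interval is roughly ±9 points at 100 windows and roughly ±4 points at 500, which is why small accuracy differences are hard to resolve on a 100-window backtest.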

Article 4 Teaser: Production Deployment

You now have a backtest-validated debate system. You know it outperforms a single-agent baseline. You know the calibrated judge weights. You know which confidence ranges are reliable and which are not.

But validation is only the first step. Next: put it into production.

In Article 4, we will tackle every engineering problem involved in real deployment:

Scheduled execution: Every morning before market open, automatically fetch the prior day's data, build the knowledge base, and run the debate—no manual triggers needed.
Persistent storage: Write each debate's complete transcript, judge scores, and prediction results to a database. Build a queryable history of analyses.
Monitoring dashboard: Track accuracy trends, confidence distribution shifts, and judge score drift in real time, and get alerted when system performance begins to degrade.
Cost optimization: 9 LLM calls, repeated daily, add up to a substantial API bill. Explore caching strategies, model tiering (mini for daily runs, full for critical days), and batching optimizations.
Error recovery: When an agent's API call fails, how do you handle it? Is a partial debate still valid? What is the fallback strategy?

But before then—run this article's code. Generate 100 backtests with synthetic data first. Read the accuracy report. Ask yourself: if I change the judge's scoring dimensions, how does accuracy change? If I change the debate temperature settings? If I add more agents?

The backtest framework is not just for "validation"—it is your experimental platform for optimizing the system.

📖 Previous: Multi-Agent Debate × Market Analysis — The Debate Protocol (8-agent debate engine)
📖 Architecture: Multi-Agent Debate × Market Analysis — Architecture & Data Pipeline
📖 Debate Theory: Multi-Agent Debate L3: Scoring & Consensus Theory
📖 Next: Article 4 — Production Deployment (coming soon)

⚠️ Disclaimer: This article is a technical workflow demonstration, not financial advice. All market data, index names (ExampleIndex), prices, returns, and accuracy figures in this article are synthetic/fictional. Backtest results cannot and should not be used as the basis for actual investment decisions. The multi-agent debate system is an engineering technology demonstration—its output should under no circumstances be considered market prediction or trading advice. Financial markets carry inherent risk. Consult a licensed financial professional before making any investment decisions.

Next Steps

📖 Previous in series: The Debate Protocol — How 8 AI Agents Conduct Structured Adversarial Cross-Examination — Review the debate engine that this article's backtesting validates: 3-round protocol, JSON argument format, and 4-dimension judge scoring implementation.
📖 Debate theory foundation: Multi-Agent Debate L4: Production Deployment & Monitoring — The next step after backtesting validation: pushing the validated system into production with scheduled execution, persistent storage, monitoring dashboards, cost optimization, and error recovery.
📖 Foundational skill: Multi-Agent Orchestration — From Single Agent to Agent Teams — Understand the async execution foundation beneath the backtesting framework: how to coordinate 100 parallel debate executions, result collection, and aggregate analysis.

Frequently Asked Questions

Q: Why use synthetic data for backtesting instead of real market data? Is this even meaningful?

A: Synthetic data backtesting doesn't aim to "prove the system can predict markets" — it validates the methodology. It proves the backtesting framework itself is correct, the statistical tests work, and the weight calibration flow is sound. Synthetic data provides a controlled "known answer" environment: you know exactly how the data was generated, so you can judge whether the backtest correctly identified the signal. On real market data, you face an unknown data-generating process — you can't distinguish "the system found real signal" from "the system overfit noise." Validate the framework on synthetic data first, then evaluate the system on real data — two-layer validation, not one-shot.

Q: What does the ±1% threshold for directional accuracy mean? Why not just use up/down?

A: Real market moves are rarely "pure up" or "pure down." If a window's market only wiggled by 0.3%, claiming "direction was correct" would be dishonest — a coin flip could claim the same. The ±1% threshold restricts "directionality" to meaningful moves: gain ≥ +1% = UP, loss ≤ -1% = DOWN, everything in between = FLAT (excluded from accuracy calculation). This makes accuracy measurement stricter — you only evaluate the system on windows with clear directional signals. In this article's synthetic data, roughly 17% of windows are marked FLAT.
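
The thresholding rule can be sketched in a few lines. This is a minimal illustration rather than the code in backtest_engine.py: the function names label_direction and directional_accuracy are hypothetical, and it assumes predictions are the strings "UP" or "DOWN".

```python
from typing import Optional, Sequence

def label_direction(ret: float, threshold: float = 0.01) -> str:
    """Map a window return to UP / DOWN / FLAT using a symmetric threshold."""
    if ret >= threshold:
        return "UP"
    if ret <= -threshold:
        return "DOWN"
    return "FLAT"

def directional_accuracy(
    predictions: Sequence[str],
    returns: Sequence[float],
    threshold: float = 0.01,
) -> Optional[float]:
    """Accuracy over non-FLAT windows only; None if every window is FLAT."""
    scored = 0
    correct = 0
    for pred, ret in zip(predictions, returns):
        actual = label_direction(ret, threshold)
        if actual == "FLAT":
            continue  # excluded from the accuracy calculation
        scored += 1
        correct += int(pred == actual)
    return correct / scored if scored else None

# Made-up example: the third window (+0.3%) is FLAT and is not scored
preds = ["UP", "DOWN", "UP", "UP"]
rets = [0.023, -0.015, 0.003, -0.012]
acc = directional_accuracy(preds, rets)
print(acc)
```

Because FLAT windows drop out of the denominator, the number of scored windows is smaller than the number of backtests. Keep that in mind when reading significance tests.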

Q: Won't grid search for judge weight calibration overfit? How do you prevent that?

A: When searching 4000+ weight combinations, you're very likely to find a combination that performed best across 100 windows purely by luck — not because it's genuinely better. Prevention: split backtest windows into a training set (70%) and validation set (30%). Run grid search on the training set, evaluate optimal weights on the validation set. If validation accuracy diverges significantly from training accuracy — you're overfitting. The article provides train/validation split code that works directly within backtest_engine.py.

Q: McNemar test, binomial test, bootstrap confidence intervals — what question does each answer?

A: Three tests answer different levels of questions. McNemar test (paired): Is the prediction difference between the debate system and single-agent baseline on the same windows statistically significant? — answers "is debate genuinely better than baseline?" Binomial test: Is the debate system's accuracy significantly above random guessing (50%)? — answers "is the system extracting non-random information?" Bootstrap confidence intervals: What is the 95% confidence interval for the debate system's accuracy? — answers "how stable is the accuracy estimate?" The three tests are complementary; none alone gives a complete answer.
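
For readers who want to see the mechanics, here is a standard-library sketch of all three tests. It is an illustration under assumptions, not the article's implementation: the function names are hypothetical, the binomial test is one-sided (accuracy above 50%), the McNemar test is the exact two-sided version on discordant pairs, and the example counts are made up.

```python
import math
import random
from typing import Sequence, Tuple

def binom_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def binomial_test_above_chance(correct: int, scored: int) -> float:
    """One-sided p-value: is accuracy above 50% random guessing?"""
    return binom_tail(correct, scored, 0.5)

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value on discordant pairs.

    b = windows where debate was right and the baseline wrong,
    c = windows where the baseline was right and debate wrong.
    """
    n = b + c
    if n == 0:
        return 1.0
    return min(1.0, 2.0 * binom_tail(max(b, c), n, 0.5))

def bootstrap_ci(
    outcomes: Sequence[int], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy; outcomes are 1 (correct) or 0 (wrong)."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical counts for illustration only
outcomes = [1] * 59 + [0] * 24          # 59 correct out of 83 scored windows
p_chance = binomial_test_above_chance(59, 83)
p_paired = mcnemar_exact_p(b=15, c=8)
ci_low, ci_high = bootstrap_ci(outcomes)
print(f"binomial p={p_chance:.4f}, McNemar p={p_paired:.3f}, "
      f"95% CI=({ci_low:.1%}, {ci_high:.1%})")
```

Note that the McNemar test uses only the windows where the two systems disagree, so its effective sample size can be much smaller than the number of backtests.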

Q: Are 100 backtests enough? Can backtest results be used directly for investment decisions?

A: 100 backtests are insufficient for statistically significant conclusions: the article's McNemar test yields p = 0.137, which does not reach the 5% significance level. To distinguish a 4-percentage-point accuracy difference, you need 500+ backtests. More importantly, every accuracy figure in this article (including the headline 71.1%) comes from synthetic data. These figures demonstrate methodology, not actual market prediction capability, so they cannot be used directly for investment decisions. The backtest framework is a tool to "rigorously test whether the system works," not a machine to "prove the system works." On real data, if accuracy approaches 50%, face that result honestly. That is the value of backtesting.