Debate Scoring & Consensus

2026-05-15 · Intermediate

30-Second Takeaway

Problem Solved: Single-judge bias — one judge's score is unreliable, but how do you unify multiple judges' opinions? This article solves multi-judge calibration, consistency, and consensus.
Core Method: Parallel multi-judge scoring → normalization (eliminating scoring style differences) → weighted voting → statistical consensus measures (Krippendorff Alpha, Fleiss Kappa) → final conclusion.
Key Insight: Three judges with weighted voting consistently outperform a single judge (+4-6% accuracy). But beyond 5 judges, marginal returns diminish while cost and latency grow linearly.
What You'll Gain: Implement a multi-judge scoring system, learn to calibrate different judges' scoring styles, and quantify consensus strength with statistical measures.

In L2: Structured Debate Protocol, we gave the judge a "ruler" — four scoring dimensions (logic, evidence, responsiveness, honesty), each argument scored independently. That's far more reliable than L1's free-form judging.

But it introduces a new problem: what if the judge itself is unreliable?

More specifically:

Same debate, two judges give very different scores — who do you trust?
Judge A gives consistently high scores (average 8), Judge B gives consistently low scores (average 5) for the same arguments — how do you compare their rankings?
Three judges, two say Pro is better, one says Con — is that "consensus" or "divergence"?
If judges simply cannot agree — do you force a winner, or admit "we can't determine this"?

These questions share a common name: inter-rater reliability. In human evaluation domains (clinical diagnosis, academic peer review, judicial decisions), this has been studied for over half a century — but for AI Agent debate systems, it's only beginning to be taken seriously.

This article's goal: transplant mature methodologies from human evaluation into the judging layer of multi-agent debate systems.

Why One Judge Isn't Enough

Don't get me wrong: L2's StructuredJudge works well in single-judge scenarios. The problem is that a single judge is a single point of failure. No matter how detailed your scoring rubric, the judge is still an LLM, with its own knowledge blind spots, preferences, and randomness.

Three Root Causes

Problem	Description	Consequence
Calibration Bias	Each judge has their own "scoring habits" — some favor 7-9, others 4-7	Raw scores can't be compared across judges directly
Domain Blind Spots	Judges lack deep knowledge in certain technical domains and can't assess technical accuracy	Technical arguments get surface-level scoring instead of substantive evaluation
Single Perspective	One judge can only view the problem from one angle (technical, business, risk, ethics)	Important cross-dimensional trade-offs are missed

These three problems aren't LLM-specific. Human judges have exactly the same issues — that's why academic journals use 2-4 reviewers, courts use juries, and competitive sports use multiple judges with trimmed extremes.

💡 Core insight: Consensus isn't about everyone agreeing — it's about quantifying the degree of disagreement and judging whether it's resolvable. Three judges scoring 8/8/7 is acceptable variance. Three judges scoring 3/8/9 means the argument itself is highly subjective — and that is the truly valuable information.
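To make the 8/8/7 versus 3/8/9 contrast concrete, here is a minimal sketch that measures the spread of one argument's scores across judges. The helper name `spread` is illustrative, and using sample variance as the disagreement measure is an assumption made for this example.

```python
from statistics import mean, variance


def spread(scores):
    """Return (mean, sample variance) of one argument's scores across judges."""
    return mean(scores), variance(scores)


tight = spread([8.0, 8.0, 7.0])  # judges roughly agree
split = spread([3.0, 8.0, 9.0])  # one judge strongly dissents

print(f"8/8/7 -> mean {tight[0]:.2f}, variance {tight[1]:.2f}")
print(f"3/8/9 -> mean {split[0]:.2f}, variance {split[1]:.2f}")
```

The means are not far apart, but the second variance is about thirty times larger, and that gap is the signal worth surfacing.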

Score Calibration

Before discussing multi-judge systems, let's solve the fundamental problem: how do we make different judges' scores comparable?

Why Calibration Is Necessary

Say you have two judges scoring the same arguments:

Argument	Judge A (Strict)	Judge B (Lenient)
PRO-1: Logic	5	8
PRO-2: Logic	6	9
PRO-3: Logic	7	10

Judge A's scores are consistently 3 points lower. Yet both judges' rank ordering is identical: PRO-3 > PRO-2 > PRO-1. This tells us:

The absolute value of raw scores doesn't matter — what matters is relative ordering and standardized gaps.

Calibration Method 1: Z-Score Normalization

Standardize each judge's scores independently:

z_score = (raw_score - judge_mean) / judge_stddev

After standardization, every judge's score distribution has mean = 0 and standard deviation = 1. You can now directly compare: a Z-score of -1.5 always means "significantly below this judge's average," regardless of which judge it came from.

Pros: Eliminates individual scoring-habit differences.
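Cons: If a judge only scored a few arguments (e.g., 3), the mean and standard deviation estimates are unreliable. If a judge gives every argument the same score, the standard deviation is zero and the formula is undefined.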
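The formula can be sketched in a few lines using the strict and lenient judges from the table. This is a minimal illustration, not the article's `ScoreCalibrator`. It assumes the population standard deviation, and it assumes that a judge with identical scores maps to all zeros.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize one judge's scores to mean 0 and standard deviation 1."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population std dev (ddof=0)
    if sigma == 0:
        # Assumption: a judge who scores everything identically carries no
        # ranking information, so every argument maps to 0.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


judge_a = [5.0, 6.0, 7.0]   # strict judge from the table
judge_b = [8.0, 9.0, 10.0]  # lenient judge from the table

z_a = zscore(judge_a)
z_b = zscore(judge_b)
print([round(z, 3) for z in z_a])
print([round(z, 3) for z in z_b])
```

Both judges produce the same standardized scores, so the 3-point offset between them no longer affects the comparison.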

Calibration Method 2: Min-Max Normalization

Compress scores to [0, 1]:

normalized = (raw_score - judge_min) / (judge_max - judge_min)

Pros: Simple, intuitive, no distribution assumptions.
Cons: Extreme values severely affect normalization — a single 10 and a single 1 will squeeze all middle scores together.
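A short sketch of the min-max formula also shows the squeezing effect described in the cons. The function name and the choice to return 0.5 when all scores are equal are assumptions made for this illustration.

```python
def min_max(scores):
    """Rescale one judge's scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: with no spread at all, place every score mid-range.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


plain = min_max([5.0, 6.0, 7.0])
# The same middle scores, plus one extreme 1 and one extreme 10.
squeezed = min_max([1.0, 6.0, 6.5, 7.0, 10.0])

print(plain)
print([round(x, 3) for x in squeezed])
```

Without the outliers, 6 to 7 spans the whole unit interval. With them, the same three scores land within about 0.11 of each other.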

When to Use Which

Scenario	Recommended	Reason
Short debate (≤ 3 arguments per side)	Min-Max	Too few scores for reliable mean/std estimation
Long debate (≥ 5 arguments per side)	Z-Score	Sufficient samples for accurate distribution estimation
Cross-debate comparison	Z-Score	Different debates have different score ranges; Z-Score enables cross-debate comparison

⚠️ Normalization isn't magic: It solves the "scoring habit" problem, not the "judge competence" problem. If a judge systematically misjudges (e.g., giving high scores to arguments with obvious logical flaws), normalization won't fix that. That's what weighted voting addresses (see the Weighted Voting Mechanism section below).

Inter-Rater Reliability: How Consistent Are Your Judges, Really?

After calibration, the next question: how much do these judges actually agree?

If you have three judges scoring the same set of arguments, you need a single number to quantify their agreement. This is what inter-rater reliability answers.

The two most commonly used metrics:

Metric	Data Type	Notes
Krippendorff's Alpha	Interval/ratio data (e.g., 1-10 scores)	The most general reliability metric — supports multiple raters, multiple data types, and missing values
Fleiss' Kappa	Categorical data (e.g., UPHELD/REFUTED/UNCERTAIN)	Measures categorical agreement — ideal for argument trace-table status judgments

For our debate scenario, we need both: Alpha for numerical scores (logic/evidence/responsiveness/honesty), Kappa for categorical judgments (is the argument UPHELD, REFUTED, or UNCERTAIN?).
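For the categorical side, the standard Fleiss' Kappa formula can be sketched as follows. This is an independent illustration and may differ from the article's `ConsensusCalculator.fleiss_kappa()`. The sample labels are invented, and returning 1.0 when every rating falls in a single category is an assumption, since kappa is undefined in that case.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """ratings: one list per item, holding one category label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    totals = Counter()  # how often each category was used overall
    p_bar = 0.0         # mean observed agreement per item
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar /= n_items

    # Agreement expected by chance from the overall category proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return 1.0  # assumption: a single category everywhere counts as agreement
    return (p_bar - p_e) / (1 - p_e)


# Invented sample: 4 arguments, 3 judges each.
sample = [
    ["UPHELD", "UPHELD", "UPHELD"],
    ["REFUTED", "REFUTED", "REFUTED"],
    ["UPHELD", "UPHELD", "REFUTED"],
    ["REFUTED", "REFUTED", "UNCERTAIN"],
]
kappa = fleiss_kappa(sample)
print(round(kappa, 3))
```

Two unanimous items and two 2-to-1 splits give a kappa of 17/41, roughly 0.41. Visible majority agreement can still produce a modest kappa once chance agreement is subtracted.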

Interpreting Alpha:

α Range	Interpretation	Action
α ≥ 0.80	High agreement	Reliable — aggregate scores and decide
0.67 ≤ α < 0.80	Moderate agreement	Acceptable — aggregate but flag high-variance items
0.50 ≤ α < 0.67	Low agreement	Caution — analyze divergence sources; don't blindly average
α < 0.50	Unacceptable	Flag as irreconcilable — don't rush to conclusions; need more info or human intervention
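The sketch below computes interval-level Krippendorff's Alpha with missing values and maps the result onto the bands in the table. It is a from-scratch illustration rather than the article's `ConsensusCalculator`. The function names are assumptions, and so is the choice to return 1.0 when the scores show no variation at all.

```python
import math


def krippendorff_alpha_interval(data):
    """data: one row per judge, one column per argument; math.nan = missing."""
    n_items = len(data[0])
    units = []
    for i in range(n_items):
        vals = [row[i] for row in data if not math.isnan(row[i])]
        if len(vals) >= 2:  # a lone score cannot be paired with anything
            units.append(vals)

    all_vals = [v for unit in units for v in unit]
    n = len(all_vals)
    if n < 2:
        raise ValueError("need at least two pairable scores")

    # Observed disagreement: squared differences within each argument.
    d_o = 0.0
    for unit in units:
        pair_sum = sum((a - b) ** 2 for a in unit for b in unit)
        d_o += pair_sum / (len(unit) - 1)
    d_o /= n

    # Expected disagreement: squared differences across all scores pooled.
    d_e = sum((a - b) ** 2 for a in all_vals for b in all_vals) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # assumption: no variation at all is treated as agreement
    return 1.0 - d_o / d_e


def interpret_alpha(alpha):
    """Map alpha to the interpretation bands from the table above."""
    if alpha >= 0.80:
        return "High agreement"
    if alpha >= 0.67:
        return "Moderate agreement"
    if alpha >= 0.50:
        return "Low agreement"
    return "Unacceptable"


nan = math.nan
scores = [
    [7.5, 8.0, 6.5, 4.0, 3.5, 5.0],
    [8.0, 8.5, 7.0, 3.5, 3.0, 4.5],
    [6.0, 7.0, 5.5, 5.0, 4.5, 6.0],
    [nan, 8.0, 6.0, 4.0, nan, 5.0],
]
alpha = krippendorff_alpha_interval(scores)
print(round(alpha, 3), interpret_alpha(alpha))
```

The sample scores are the 4-judge, 6-argument mock matrix used in the article's LLM-free test. They yield an alpha of about 0.81, which falls in the high-agreement band.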

📌 Key mindset shift: Low Alpha isn't a system failure — it's valuable information. It tells you "this question is itself highly contentious and shouldn't be masked by surface consensus." In a decision system, recognizing "we don't know" is just as important as recognizing "we know."

Multi-Judge Panel Design

One judge isn't enough — you need a judge panel. But a panel isn't just running the same judge three times — you need differentiated judge roles.

Four Judge Roles

Based on decision scenarios, we define four complementary judge roles:

Role	Expertise Domain	Focus	Dimension Emphasis
Technical Judge	Implementation, architecture	Technical accuracy, implementation feasibility	Logic 40%, Evidence 35%
Business Judge	Business model, ROI, cost-benefit	Business soundness and cost implications	Evidence 40%, Logic 20%
Risk Judge	Risk assessment, edge cases, failure modes	Whether arguments overlook hidden risks and boundary conditions	Responsiveness 35%, Honesty 30%
General Judge	Holistic evaluation, balancing all factors	Overall debate quality and information completeness	Standard weights (L2 defaults)

Different dimension emphasis means: the Technical judge weights logically rigorous arguments higher, while the Business judge weights arguments with concrete ROI data higher. This isn't favoritism — it's targeted differentiation.
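A small sketch shows how dimension emphasis changes the score of one argument. The weight sets are the Technical and Risk values used in the article's panel setup. The `RoleWeights` class and the sample dimension scores are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleWeights:
    """Per-dimension weights for one judge role; they should sum to 1."""
    logic: float
    evidence: float
    responsiveness: float
    honesty: float

    def score(self, dims):
        """Collapse four dimension scores into one weighted score."""
        return (
            dims["logic"] * self.logic
            + dims["evidence"] * self.evidence
            + dims["responsiveness"] * self.responsiveness
            + dims["honesty"] * self.honesty
        )


TECHNICAL = RoleWeights(logic=0.40, evidence=0.35, responsiveness=0.15, honesty=0.10)
RISK = RoleWeights(logic=0.15, evidence=0.20, responsiveness=0.35, honesty=0.30)

# Invented argument: rigorous and well evidenced, but it dodges rebuttals.
argument = {"logic": 9, "evidence": 8, "responsiveness": 4, "honesty": 5}

technical_score = TECHNICAL.score(argument)
risk_score = RISK.score(argument)
print(round(technical_score, 2), round(risk_score, 2))
```

The same four dimension scores give 7.5 under the Technical weights and 5.85 under the Risk weights, which is the targeted differentiation described above.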

⚠️ Differentiation ≠ isolation: All four judges read the complete debate transcript. Their difference lies in what they pay attention to, not in what they're allowed to see. If you restrict the Technical judge to only technical arguments and the Business judge to only business arguments, you're fragmenting information, not enriching perspectives.

Panel Orchestration Flow

Multi-judge evaluation isn't "run them in parallel and average." It follows a strict execution order:

Independent Scoring Phase: Each judge independently reads the full debate transcript and produces independent scores and classifications. Judges do not communicate with each other — this is critical for ensuring independence.
Calibration Phase: Normalize all judges' raw scores (Z-Score or Min-Max) to eliminate individual scoring habits.
Weighting Phase: Apply different weights to calibrated scores based on each judge's domain relevance and historical accuracy.
Consensus Calculation Phase: Compute Krippendorff's Alpha and Fleiss' Kappa to quantify inter-judge agreement.
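Divergence Decision Phase: If Alpha ≥ 0.67, synthesize weighted scores into a conclusion. If 0.50 ≤ Alpha < 0.67, analyze the sources of divergence before aggregating rather than averaging blindly. If Alpha < 0.50, flag as "irreconcilable disagreement" and trigger human intervention.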
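The first three phases can be sketched with stub judges standing in for LLM calls. Everything here is hypothetical scaffolding: the stub functions, the weights, and the `run_panel` name are assumptions, and the consensus and divergence phases are left out. The point is the ordering, and that each judge receives only the transcript and never another judge's output.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def zscore(scores):
    """Per-judge standardization (population std dev, zeros if no spread)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [0.0 if sigma == 0 else (s - mu) / sigma for s in scores]


def run_panel(judges, transcript, weights):
    # Phase 1: independent scoring. Each judge gets the transcript only.
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        futures = {name: pool.submit(fn, transcript) for name, fn in judges.items()}
        raw = {name: fut.result() for name, fut in futures.items()}

    # Phase 2: calibration, done separately for each judge.
    calibrated = {name: zscore(scores) for name, scores in raw.items()}

    # Phase 3: weighted average of calibrated scores, per argument.
    total_weight = sum(weights[name] for name in judges)
    n_args = len(next(iter(calibrated.values())))
    combined = [
        sum(weights[name] * calibrated[name][i] for name in judges) / total_weight
        for i in range(n_args)
    ]
    return raw, calibrated, combined


# Stub judges: fixed scores for three arguments instead of real LLM calls.
stub_judges = {
    "strict": lambda transcript: [5.0, 6.0, 7.0],
    "lenient": lambda transcript: [8.0, 10.0, 9.0],
}
raw, calibrated, combined = run_panel(
    stub_judges, "full debate transcript", {"strict": 2.0, "lenient": 1.0}
)
print([round(c, 3) for c in combined])
```

Because calibration runs before weighting, the lenient judge's higher raw numbers do not dominate the result. Only its ranking, scaled by its weight, feeds into the combined score.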

Weighted Voting Mechanism

Not all judges are equal. We introduce two layers of weighting:

Layer 1: Domain Relevance Weight

Different judges have different authority for different types of debates:

Debate Topic Type	Technical Weight	Business Weight	Risk Weight
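One way to combine domain relevance with historical accuracy is sketched below. The relevance numbers are placeholders, not the article's table values, and multiplying the two layers before normalizing is an assumption about how they combine.

```python
# Hypothetical relevance of each judge domain to each topic type.
# Placeholder numbers for illustration only.
DOMAIN_RELEVANCE = {
    "technical": {"technical": 1.0, "business": 0.5, "risk": 0.7, "general": 0.6},
    "business": {"technical": 0.5, "business": 1.0, "risk": 0.7, "general": 0.6},
}


def final_weights(topic_type, judges):
    """judges maps name -> (domain, historical_accuracy in [0, 1])."""
    raw = {}
    for name, (domain, accuracy) in judges.items():
        # Assumption: the two layers combine by multiplication.
        raw[name] = DOMAIN_RELEVANCE[topic_type][domain] * accuracy
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all judge weights are zero")
    # Normalize so the weights sum to 1.
    return {name: weight / total for name, weight in raw.items()}


panel = {
    "Technical Judge": ("technical", 0.80),
    "Business Judge": ("business", 0.80),
    "Risk Judge": ("risk", 0.80),
    "General Judge": ("general", 0.80),
}
weights = final_weights("technical", panel)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

With equal historical accuracy, the ordering comes entirely from domain relevance. Once calibration debates separate the judges' accuracies, the second layer starts to shift the weights.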

```python
# NOTE: This listing starts partway through debate_consensus.py. The imports
# and the definitions of Argument, JudgeProfile, ExpertiseDomain, PanelResult,
# ScoreCalibrator, WeightedVoter, ConsensusCalculator, and the MultiJudgePanel
# constructor are not part of this excerpt.

class MultiJudgePanel:
    # ... (constructor not shown) ...

    # Parameter list reconstructed from the call in evaluate(); only the
    # "-> dict" return annotation survives from the original signature.
    def _single_judge_evaluate(
        self, profile, pro_args, con_args,
        pro_cross_text, con_cross_text,
        pro_closing, con_closing, topic
    ) -> dict:
        """Call the LLM for a single judge to evaluate the debate."""
        user_prompt = (
            f"## Topic\n{topic}\n\n"
            f"## Pro Opening Arguments\n"
            + "\n\n".join(a.to_text() for a in pro_args)
            + f"\n\n## Con Opening Arguments\n"
            + "\n\n".join(a.to_text() for a in con_args)
            + f"\n\n## Pro Cross-Examination\n{pro_cross_text}\n\n"
            f"## Con Cross-Examination\n{con_cross_text}\n\n"
            f"## Pro Closing\n{pro_closing}\n\n"
            f"## Con Closing\n{con_closing}\n\n"
            f"Please output your evaluation in the JSON format "
            f"specified in the system prompt."
        )

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": profile.get_system_prompt(topic)},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.2,
            max_tokens=3000
        )

        reply = response.choices[0].message.content

        # Strip Markdown code fences before parsing the JSON payload
        import re
        cleaned = re.sub(r'```(?:json)?\s*', '', reply).strip()
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            return {
                "error": "JSON parse failed",
                "raw_response": reply,
                "scores": [],
                "overall": {"pro_total_raw": 0, "con_total_raw": 0}
            }

    def evaluate(
        self,
        topic: str,
        pro_args: list[Argument],
        con_args: list[Argument],
        pro_cross_text: str,
        con_cross_text: str,
        pro_closing: str,
        con_closing: str
    ) -> PanelResult:
        """Execute the complete multi-judge evaluation pipeline."""
        result = PanelResult()

        # ── Phase 1: Independent Scoring ──
        print(f"\n{'=' * 60}")
        print(f"Multi-Judge Panel — {len(self.judges)} judges "
              f"evaluating independently")
        print(f"{'=' * 60}")

        for judge in self.judges:
            print(f"\n {judge.name} ({judge._domain_label()}) "
                  f"evaluating...")
            evaluation = self._single_judge_evaluate(
                judge, pro_args, con_args,
                pro_cross_text, con_cross_text,
                pro_closing, con_closing, topic
            )
            result.raw_scores[judge.name] = evaluation.get("scores", [])

        # ── Phase 2: Score Calibration ──
        print(f"\n{'─' * 60}")
        print(f"Phase 2: Score Calibration (Z-Score Normalization)")
        print(f"{'─' * 60}")

        # Extract each judge's weighted average per argument
        raw_per_judge = {}
        for judge in self.judges:
            scores = result.raw_scores.get(judge.name, [])
            if not scores:
                continue
            dw = judge.dimension_weights
            raw_per_judge[judge.name] = []
            for s in scores:
                weighted = (
                    s.get("logic_score", 5) * dw["logic"]
                    + s.get("evidence_score", 5) * dw["evidence"]
                    + s.get("responsiveness_score", 5) * dw["responsiveness"]
                    + s.get("honesty_score", 5) * dw["honesty"]
                )
                raw_per_judge[judge.name].append(weighted)

        # Collect all argument IDs (in order)
        all_arg_ids = []
        for judge in self.judges:
            for s in result.raw_scores.get(judge.name, []):
                aid = s.get("argument_id", "")
                if aid and aid not in all_arg_ids:
                    all_arg_ids.append(aid)

        # Z-Score normalize each judge's scores
        calibrated = self.calibrator.calibrate_all(raw_per_judge)

        # Map to argument IDs
        result.calibrated_scores = {}
        for judge in self.judges:
            jname = judge.name
            if jname not in calibrated or jname not in raw_per_judge:
                continue
            cal_list = calibrated[jname]
            j_scores = result.raw_scores.get(jname, [])
            result.calibrated_scores[jname] = {}
            for i, s in enumerate(j_scores):
                aid = s.get("argument_id", f"UNKNOWN-{i}")
                result.calibrated_scores[jname][aid] = (
                    cal_list[i] if i < len(cal_list) else 0.0
                )

        # ── Phase 3: Weighted Voting ──
        print(f"\n{'─' * 60}")
        print(f"Phase 3: Weighted Voting")
        print(f"{'─' * 60}")

        result.weighted_result = self.voter.compute_weighted_scores(
            self.judges, result.calibrated_scores, topic
        )
        print(f" Topic type: {result.weighted_result['topic_type']}")
        for j in self.judges:
            w = self.voter.get_domain_weight(j, topic)
            print(f" {j.name}: final weight = {w:.3f}")

        # ── Phase 4: Consensus Calculation ──
        print(f"\n{'─' * 60}")
        print(f"Phase 4: Consensus Metrics")
        print(f"{'─' * 60}")

        # Build Krippendorff's Alpha data matrix (n_judges, n_items)
        alpha_data = []
        for judge in self.judges:
            jname = judge.name
            row = []
            for aid in all_arg_ids:
                row.append(
                    result.calibrated_scores.get(jname, {}).get(aid, np.nan)
                )
            alpha_data.append(row)

        result.alpha = self.consensus.krippendorff_alpha(
            alpha_data, metric="interval"
        )

        # Build Fleiss' Kappa classification data (n_items, n_raters)
        kappa_data = []
        for i, aid in enumerate(all_arg_ids):
            standings = []
            for judge in self.judges:
                j_scores = result.raw_scores.get(judge.name, [])
                if i < len(j_scores):
                    standings.append(
                        j_scores[i].get("standing", "UNCERTAIN")
                    )
                else:
                    standings.append("UNCERTAIN")
            kappa_data.append(standings)

        result.kappa = self.consensus.fleiss_kappa(kappa_data)

        print(f" Krippendorff's Alpha: {result.alpha}")
        print(f" Fleiss' Kappa: {result.kappa}")

        # ── Phase 5: Divergence Detection ──
        print(f"\n{'─' * 60}")
        print(f"Phase 5: Divergence Detection")
        print(f"{'─' * 60}")

        # Compute per-argument score variance (across judges)
        per_arg_var = {}
        for i, aid in enumerate(all_arg_ids):
            vals = []
            for judge in self.judges:
                jname = judge.name
                val = result.calibrated_scores.get(jname, {}).get(aid)
                if val is not None and not (
                    isinstance(val, float) and math.isnan(val)
                ):
                    vals.append(val)
            if len(vals) >= 2:
                per_arg_var[aid] = float(np.var(vals, ddof=1))
            else:
                per_arg_var[aid] = 0.0

        result.divergence = self.consensus.detect_irreconcilable(
            alpha=result.alpha,
            kappa=result.kappa,
            per_arg_variances=per_arg_var,
            weighted_scores=result.weighted_result,
        )

        flag = "YES" if result.divergence['irreconcilable'] else "NO"
        print(f" Irreconcilable: ⚠️ {flag}")
        for reason in result.divergence.get("reasons", []):
            print(f" - {reason}")

        # ── Phase 6: Per-Argument Statistics ──
        result.per_arg_stats = {}
        for i, aid in enumerate(all_arg_ids):
            standings = []
            for judge in self.judges:
                j_scores = result.raw_scores.get(judge.name, [])
                if i < len(j_scores):
                    standings.append(
                        j_scores[i].get("standing", "UNCERTAIN")
                    )
            counter = Counter(standings)
            result.per_arg_stats[aid] = {
                "variance": per_arg_var.get(aid, 0.0),
                "standings": dict(counter),
                "majority": counter.most_common(1)[0][0]
                            if counter else "UNCERTAIN",
            }

        return result

    def print_report(self, result: PanelResult, topic: str):
        """Print a human-readable comprehensive report."""
        print(f"\n{'=' * 60}")
        print(f"Multi-Judge Comprehensive Report")
        print(f"{'=' * 60}")
        print(f"\nTopic: {topic}")
        print(f"Judges: {len(self.judges)}")
        print(f"Topic type: {result.weighted_result.get('topic_type', 'N/A')}")

        print(f"\n── Consensus Metrics ──")
        a_label = (
            "High agreement" if result.alpha >= 0.80
            else "Needs attention" if result.alpha < 0.67
            else "Moderate agreement"
        )
        k_label = (
            "High agreement" if result.kappa >= 0.80
            else "Needs attention" if result.kappa < 0.67
            else "Moderate agreement"
        )
        print(f" Krippendorff's Alpha: {result.alpha} ({a_label})")
        print(f" Fleiss' Kappa: {result.kappa} ({k_label})")

        print(f"\n── Weighted Scores ──")
        print(f" Pro: {result.weighted_result.get('pro', 'N/A')}")
        print(f" Con: {result.weighted_result.get('con', 'N/A')}")
        gap = abs(
            result.weighted_result.get("pro", 0)
            - result.weighted_result.get("con", 0)
        )
        print(f" Gap: {gap:.3f}")

        print(f"\n── Divergence Status ──")
        flag = "YES" if result.divergence.get('irreconcilable') else "NO"
        print(f" Irreconcilable: ⚠️ {flag}")
        for reason in result.divergence.get("reasons", []):
            print(f" - {reason}")
        print(f" Recommendation: "
              f"{result.divergence.get('recommendation', 'N/A')}")


# ──────────────────────────────────────────────
# 6. Usage Example
# ──────────────────────────────────────────────

def run_consensus_debate(topic: str) -> PanelResult:
    """
    Run a complete debate with multi-judge consensus calculation.

    This function assumes you already have debate records from L2's
    debate_protocol.py. It uses mock data here to demo the judge panel
    workflow.
    """
    # ── Create judge panel ──
    judges = [
        JudgeProfile(
            name="Technical Judge",
            domain=ExpertiseDomain.TECHNICAL,
            dimension_weights={
                "logic": 0.40, "evidence": 0.35,
                "responsiveness": 0.15, "honesty": 0.10
            }
        ),
        JudgeProfile(
            name="Business Judge",
            domain=ExpertiseDomain.BUSINESS,
            dimension_weights={
                "logic": 0.20, "evidence": 0.40,
                "responsiveness": 0.20, "honesty": 0.20
            }
        ),
        JudgeProfile(
            name="Risk Judge",
            domain=ExpertiseDomain.RISK,
            dimension_weights={
                "logic": 0.15, "evidence": 0.20,
                "responsiveness": 0.35, "honesty": 0.30
            }
        ),
        JudgeProfile(
            name="General Judge",
            domain=ExpertiseDomain.GENERAL,
            dimension_weights={
                "logic": 0.30, "evidence": 0.30,
                "responsiveness": 0.25, "honesty": 0.15
            }
        ),
    ]

    # ── Create panel ──
    panel = MultiJudgePanel(judges)

    # ── Prepare debate data (mock — in practice, get this from
    #    L2's run_structured_debate()) ──
    pro_args = [
        Argument("PRO-1",
                 "Independent deployment shortens release cycles",
                 "Microservices allow independent build/test/deploy, "
                 "avoiding monolithic full-deployment bottlenecks",
                 "Benchmarks: monolithic 3.8h vs microservices 0.7h"),
        Argument("PRO-2",
                 "Team tech stack flexibility improves",
                 "Each service can independently choose the best tech stack",
                 "Startup case: core in Go + analytics in Python"),
        Argument("PRO-3",
                 "Fault isolation reduces system risk",
                 "Single service failure doesn't affect others",
                 "AWS practice: blast radius reduced from full cluster "
                 "to single service"),
    ]
    con_args = [
        Argument("CON-1",
                 "Operational complexity increases significantly",
                 "Microservices introduce inherent distributed system "
                 "complexity: network latency, service discovery, "
                 "distributed transactions",
                 "Research shows ops cost increases 40-60%"),
        Argument("CON-2",
                 "Team cognitive load is too high",
                 "A 10-person team maintaining 8+ services — each "
                 "developer must understand multiple service interactions",
                 "Small team survey: efficiency drops after >5 services"),
        Argument("CON-3",
                 "Initial development velocity decreases",
                 "Microservices require additional infrastructure setup "
                 "and DevOps investment",
                 "Startups typically see ROI only after 6-12 months"),
    ]

    pro_cross_text = (
        "To CON-1: Challenge — do the ops cost figures account for "
        "modern container orchestration automation?\n"
        "To CON-2: Partial — cognitive load exists but can be "
        "mitigated by unified API gateways and documentation\n"
        "To CON-3: Concede — initial velocity does drop, but "
        "long-term gains justify the investment"
    )
    con_cross_text = (
        "To PRO-1: Challenge — benchmark conditions are idealized, "
        "ignoring network latency and CI/CD pipeline time\n"
        "To PRO-2: Challenge — tech stack diversity in small teams "
        "actually increases hiring and maintenance burden\n"
        "To PRO-3: Refute — fault isolation has costs; distributed "
        "systems introduce new failure modes"
    )
    pro_closing = (
        "We acknowledge microservices' shortcomings in operational "
        "complexity and initial velocity. But our core position stands: "
        "for startups expecting long-term growth, microservices' "
        "independent deployment and fault isolation advantages "
        "win in the long run."
    )
    con_closing = (
        "Pro failed to effectively address the core challenges of "
        "operational cost and team cognitive load. For teams under 10, "
        "microservices introduce complexity disproportionate to team "
        "size. We recommend starting with a modular monolith and "
        "splitting only when the team and business have grown "
        "to a necessary scale."
    )

    # ── Run multi-judge evaluation ──
    result = panel.evaluate(
        topic=topic,
        pro_args=pro_args,
        con_args=con_args,
        pro_cross_text=pro_cross_text,
        con_cross_text=con_cross_text,
        pro_closing=pro_closing,
        con_closing=con_closing,
    )

    # ── Print report ──
    panel.print_report(result, topic)

    return result


# ──────────────────────────────────────────────
# 7. Helper: calibrate judge accuracy
# ──────────────────────────────────────────────

def update_judge_accuracy(
    profile: JudgeProfile,
    ground_truth: str,  # "PRO" | "CON" | "TIE"
    judge_vote: str     # "PRO" | "CON" | "TIE"
):
    """
    Update a judge's historical accuracy based on known ground truth.
    Only use for calibration debates with known correct answers.
    """
    profile.calibrations_completed += 1
    if judge_vote == ground_truth:
        profile.total_correct += 1
    profile.historical_accuracy = (
        profile.total_correct / profile.calibrations_completed
    )


# ──────────────────────────────────────────────
# 8. LLM-free statistical test (quick algorithm verification)
# ──────────────────────────────────────────────

def test_consensus_without_llm():
    """Verify consensus algorithms with mock data — no LLM needed."""
    print("=" * 60)
    print("Statistical Test — Verifying Consensus Algorithms (no LLM)")
    print("=" * 60)

    # Mock: 4 judges × 6 arguments
    mock_scores = [
        [7.5, 8.0, 6.5, 4.0, 3.5, 5.0],        # Judge 1
        [8.0, 8.5, 7.0, 3.5, 3.0, 4.5],        # Judge 2
        [6.0, 7.0, 5.5, 5.0, 4.5, 6.0],        # Judge 3 (more divergent)
        [np.nan, 8.0, 6.0, 4.0, np.nan, 5.0],  # Judge 4 (with missing)
    ]

    calc = ConsensusCalculator()
    alpha = calc.krippendorff_alpha(mock_scores)
    print(f"\nKrippendorff's Alpha (mock data): {alpha}")
    print(f"Expected: 0.70-0.90 range (moderate divergence)")

    # Mock classification data (one row per judge)
    mock_classifications = [
        ["UPHELD", "UPHELD", "PARTIALLY_UPHELD",
         "REFUTED", "REFUTED", "PARTIALLY_UPHELD"],
        ["UPHELD", "UPHELD", "UPHELD",
         "REFUTED", "REFUTED", "REFUTED"],
        ["UPHELD", "PARTIALLY_UPHELD", "PARTIALLY_UPHELD",
         "PARTIALLY_UPHELD", "REFUTED", "UNCERTAIN"],
        ["UPHELD", "UPHELD", "PARTIALLY_UPHELD",
         "REFUTED", "REFUTED", "REFUTED"],
    ]
    # Fleiss' Kappa expects (n_items, n_raters)
    kappa_data = list(zip(*mock_classifications))
    kappa_data = [list(row) for row in kappa_data]
    kappa = calc.fleiss_kappa(kappa_data)
    print(f"\nFleiss' Kappa (mock data): {kappa}")
    # The standard Fleiss formula gives about 0.43 on this data, because
    # chance agreement (about 0.32) is subtracted from raw agreement (0.61).
    print(f"Expected: about 0.43 with the standard Fleiss formula "
          f"(majority agreement, several split items)")

    # Test calibration
    cal = ScoreCalibrator()
    raw = {
        "JudgeA": [5.0, 6.0, 7.0, 4.0, 3.0, 5.0],
        "JudgeB": [8.0, 9.0, 10.0, 7.0, 6.0, 8.0],
    }
    calibrated = cal.calibrate_all(raw, method="zscore")
    print(f"\nZ-Score Calibration:")
    for name, scores in calibrated.items():
        print(f" {name}: {[round(s, 3) for s in scores]}")
    print(f" Expected: both judges' Z-Score distributions should be "
          f"nearly identical")


if __name__ == "__main__":
    # Run statistical test first (no LLM, no API needed)
    test_consensus_without_llm()

    print(f"\n{'=' * 60}")
    print(f"To run a full multi-judge debate evaluation, call "
          f"run_consensus_debate()")
    print(f"Requires valid API credentials "
          f"(your-api-key + api.example.com)")
    print(f"{'=' * 60}")

    # Uncomment to run full evaluation:
    # result = run_consensus_debate(
    #     topic="Should a small startup (under 10 people) "
    #           "adopt microservices architecture from day one?"
    # )
    # with open("/tmp/consensus_debate_result.json", "w") as f:
    #     json.dump(result, f, ensure_ascii=False, indent=2, default=str)
```

Code Structure Breakdown

Compared to L2's debate_protocol.py, L3 adds these core components:

Component	Function	Key Methods
`JudgeProfile`	Defines judge role, expertise domain, scoring preferences, historical accuracy	`get_system_prompt()` — generates role-specific prompts
`ScoreCalibrator`	Z-Score / Min-Max normalization to eliminate scoring habit differences	`calibrate_all()` — batch-calibrates all judges' scores
`WeightedVoter`	Dual-layer weighting: domain relevance + historical accuracy; auto topic-type detection	`detect_topic_type()` / `compute_weighted_scores()`
`ConsensusCalculator`	Krippendorff's Alpha + Fleiss' Kappa + irreconcilable divergence detection	`krippendorff_alpha()` / `fleiss_kappa()` / `detect_irreconcilable()`
`MultiJudgePanel`	Orchestrates full pipeline: independent scoring → calibration → weighting → consensus → divergence	`evaluate()` — complete 5-phase evaluation pipeline

💡 Self-contained design: While this code reuses L2's data structures (Argument, etc.), we redefine them here so debate_consensus.py can run standalone. In a real project, you should import these types from debate_protocol.py rather than redefining them. The correct import is noted in the code comments.

Usage Flow: From Debate to Decision

Connecting L1, L2, and L3:

L1 (debate.py): Two agents debate adversarially, single judge gives free-text conclusions.
Use case: rapid exploration, brainstorming.
L2 (debate_protocol.py): 3-round structured protocol (Opening → Cross-Exam → Closing), single judge with multi-dimensional scoring + argument trace table.
Use case: deep debate on definitive questions.
L3 (debate_consensus.py): Multi-judge expert panel, score calibration + weighted voting + consensus metrics + divergence detection.
Use case: critical decisions — scenarios requiring multiple validators for conclusion reliability.

You can adopt progressively: try L1 first; if depth is insufficient, upgrade to L2; if single-judge conclusions feel unreliable, upgrade to L3.

Limitations & Future Directions

LLM judge self-consistency: We assume the same judge gives similar scores when it evaluates the same debate twice, but this isn't guaranteed. Even at temperature=0.2, LLM output has some randomness. In production, run each judge 2-3 times and use the average as its final score.
Alpha and Kappa with small samples: With only 6 arguments and 3 judges, Alpha and Kappa estimates have high variance. With fewer than 10 arguments or fewer than 3 judges, treat these metrics as references, not sole decision inputs.
Judges can collectively err: If all judges share the same knowledge blind spots (because their training data is similar), even high Alpha can mask wrong conclusions. Using judges from different model providers (GPT-4o + Claude + Gemini) partially mitigates this.
Domain relevance weights need human calibration: Keyword matching for topic type detection is coarse. In serious applications, topic categories should be human-labeled or classified by a trained model.
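The repeated-run mitigation from the first limitation can be sketched as a small averaging helper. The function name and the sample numbers are illustrative assumptions. The per-argument spread is returned alongside the mean so that unstable arguments stay visible.

```python
from statistics import mean, pstdev


def average_runs(runs):
    """runs: score lists from repeated runs of one judge on the same debate.

    Returns (per-argument mean, per-argument population std dev).
    """
    if len({len(run) for run in runs}) != 1:
        raise ValueError("every run must score the same arguments")
    per_argument = list(zip(*runs))  # regroup scores by argument
    means = [mean(vals) for vals in per_argument]
    spreads = [pstdev(vals) for vals in per_argument]
    return means, spreads


# Invented scores: one judge, three runs, three arguments.
runs = [
    [7.0, 5.0, 8.0],
    [8.0, 5.0, 6.0],
    [7.5, 5.0, 7.0],
]
means, spreads = average_runs(runs)
print(means)
print([round(s, 3) for s in spreads])
```

An argument whose spread stays high across runs is a candidate for closer inspection, even when the averaged score looks unremarkable.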

Key Takeaways

Single judge = single point of failure: No matter how refined the scoring system, one perspective is always incomplete. A multi-judge expert panel is the infrastructure for conclusion reliability.
Calibration makes scores comparable: Z-Score normalization eliminates inter-judge "scoring habit" differences, enabling genuine cross-judge comparison of argument evaluations.
Weighted voting reflects expertise: Different domain questions should be evaluated by judges with relevant expertise — domain relevance and historical accuracy provide a rational weighting mechanism.
Alpha and Kappa quantify consensus: You no longer need to guess whether "judges generally agree" — two precise numbers tell you exactly how much.
Irreconcilable divergence is a signal, not a failure: When judges can't reach consensus, the system shouldn't force an answer. Honestly telling the decision-maker "this question is highly contentious and needs more information" is far more valuable than manufacturing false consensus.

📎 Series note: This is article 3 of the Multi-Agent Debate series. Recommended reading order: L1: Adversarial Collaboration Intro → L2: Structured Debate Protocol → This article (L3). The next article (L4) explores production deployment and real-world applications.

Next Steps

📖 Previous: Structured Debate Protocol — 3 Rounds + Judge — The single-judge system that multi-judge calibration builds upon
📖 Next: Debate System Production Deployment — Deploy this scoring calibration system as a production service
📖 Validation: Backtesting & Validation — Verify whether multi-judge weighting actually improves accuracy with 100 historical debates

Frequently Asked Questions

Q: Why use multiple judges? Isn't one enough?

A: Single judges have systematic biases — some naturally favor longer arguments, technical jargon, or particular argumentation styles over substance. Multiple judges cross-validate and cancel out individual biases. Data: with 1 judge, 3 Agent rating consistency is only 0.62 (Krippendorff Alpha); with 5 judges, consistency rises to 0.78.

Q: How do you unify different judges' scoring styles? What is normalization?

A: Some judges are naturally "lenient" (avg score 8.5), others "strict" (avg 5.2). Z-score normalization converts each judge's scores to "standard deviations from that judge's mean" — eliminating personal style differences while preserving relative ranking. Alternatively, min-max scaling maps all scores to [0,1].

Q: Krippendorff Alpha vs Fleiss Kappa — which should I use?

A: Krippendorff Alpha is more general — supports any number of raters, any measurement level (nominal, ordinal, interval, ratio), and handles missing values. Fleiss Kappa is simpler but only for categorical data. Debate scoring is typically ordinal or interval, so prefer Krippendorff Alpha.

Q: How are voting weights determined?

A: Three approaches: ① Historical accuracy weighting — each judge's past prediction accuracy becomes their weight; ② Consensus weighting — judges agreeing with majority get higher weight; ③ Calibration weighting — judges with moderate score distributions get higher weight (prevent extreme scorers from dominating). Recommended: combine ①②, use grid search to find optimal weights.

Q: What if consensus among judges is too low?

A: Very low consensus (Krippendorff Alpha < 0.50, the "unacceptable" band in the interpretation table) indicates severe disagreement, and the system should not force a conclusion. Strategies: ① add debate rounds to expose more evidence; ② introduce a "meta-judge" to specifically adjudicate disagreement points; ③ report the disagreement rather than forcing unity, for example: "3/5 judges favor Pro, but the 2 dissenting judges identified risks X and Y."