Validation Lab Governance Verdict Completed RollingWithHoldout

Campaign Minutes Calibration Sprint 01 Run 5

Walk-forward validation audit, holdout protection, and final promotion readiness for challenger governance.

Status

Completed

Current lifecycle state of this validation run.

Primary Metric

MinutesMae

Governing metric used by the recommendation engine.

Recommendation

Reject

Policy-engine recommendation based on evaluation, holdout, and guardrails.

Window Win Rate

0.0%

Share of comparable windows won by the challenger.

Governance Verdict

Summary of the challenger’s posture after evaluation windows, holdout checks, and promotion policy gates.

Recommendation: Reject Final: Reject

Decision Reason

Comparable evaluation windows=2 (required>=2).
Comparable evaluation games=168 (required>=50).
Comparable evaluation player rows=3354 (required>=300).
Holdout comparable=True (required=True).
Holdout games=112 (required>=20).
Holdout player rows=2239 (required>=120).
RequireHoldoutWin=True.
PrimaryMetricImprovementThreshold=0.05.
HoldoutRegressionTolerance=0.03.
Evaluation delta=1.729747.
Holdout delta=0.32548.
PointsMae delta=1.907037 (max regression=0.15).
MinutesMae delta=1.729747 (max regression=0).
FairSpreadMae delta=0.135613 (max regression=n/a).
FairTotalMae delta=0.013534 (max regression=n/a).
Failed holdout win gate.
Failed holdout regression tolerance gate.
Failed primary metric improvement gate.
Failed PointsMae guard.
Failed MinutesMae guard.
Recommendation=Reject because challenger failed performance or guardrail gates.
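The decision reason above can be read as a set of sample gates followed by performance gates. The sketch below is one possible reading, not the policy engine's actual code: the `Policy` class, the `evaluate_gates` function, the gate names, and the "Promote" label are hypothetical. It assumes deltas are challenger minus baseline MAE (so positive means worse), that the improvement threshold is an absolute MAE reduction, and that a holdout "win" means a strictly lower challenger MAE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # Thresholds copied from the decision reason on this page.
    min_windows: int = 2
    min_games: int = 50
    min_player_rows: int = 300
    min_holdout_games: int = 20
    min_holdout_player_rows: int = 120
    require_holdout_win: bool = True
    improvement_threshold: float = 0.05
    holdout_regression_tolerance: float = 0.03


def evaluate_gates(policy, windows, games, player_rows,
                   holdout_games, holdout_player_rows,
                   eval_delta, holdout_delta):
    """Return gate name -> passed. Deltas are challenger minus baseline MAE."""
    return {
        "comparable_windows": windows >= policy.min_windows,
        "games": games >= policy.min_games,
        "player_rows": player_rows >= policy.min_player_rows,
        "holdout_games": holdout_games >= policy.min_holdout_games,
        "holdout_player_rows": holdout_player_rows >= policy.min_holdout_player_rows,
        # ASSUMPTION: the threshold is an absolute MAE improvement.
        "primary_metric_improvement": eval_delta <= -policy.improvement_threshold,
        # ASSUMPTION: a holdout win means a strictly lower challenger MAE.
        "holdout_win": (not policy.require_holdout_win) or holdout_delta < 0,
        "holdout_regression_tolerance": holdout_delta <= policy.holdout_regression_tolerance,
    }


# Values reported for this run.
gates = evaluate_gates(
    Policy(),
    windows=2, games=168, player_rows=3354,
    holdout_games=112, holdout_player_rows=2239,
    eval_delta=1.729747, holdout_delta=0.32548,
)
failed = [name for name, ok in gates.items() if not ok]
# "Promote" is a placeholder label; the page only shows the Reject outcome.
recommendation = "Reject" if failed else "Promote"
print(failed, recommendation)
```

Under these assumptions the sample gates pass and the three performance gates fail, which matches the failures listed in the decision reason.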

Governance Interpretation

A strong recommendation should still be read alongside holdout performance, guardrail posture, and the final human governance decision.

Decision Support

Audit

Validation Standard

Challenger approval should require both target-metric strength and acceptable regression behavior across protected secondary metrics.

Best Practice

Treat this page as the governance verdict surface. Promotion should remain deliberate and supported by both policy outcome and reviewer judgment.

Traceability

This run preserves immutable validation evidence, explicit gates, holdout results, and final-decision history.

Model Comparison

Baseline production model versus challenger candidate under governance review.

BaselineVsChallenger

Baseline

structured-0010-0003

Model 45

Challenger

structured-0014-0003

Model 75

Evaluation Evidence

Sample Quality

Comparable Windows

2

Insufficient Windows

Comparable Games

168

Comparable Player Rows

3354

Min Games / Window

Run-Level Sample Gate

Pass

Min Total Games

50

Min Player Rows

300

Run Definition

Config

Comparison Kind

BaselineVsChallenger

Window Strategy

RollingWithHoldout

Window Count

Gap Days

Window Size Days

Holdout Size Days

Start

2026-02-07

End

2026-03-23
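As an illustration of the RollingWithHoldout strategy, the sketch below lays out evaluation windows forward from the start date and pins the holdout window to the end date. The function name `rolling_with_holdout` and its parameters are hypothetical. The values 2 windows, 14 window days, and 14 holdout days are not taken from the config fields, which show no values on this page; they are inferred from the date ranges in the Window Results table. The gap before the holdout is simply whatever remains under this assumed layout.

```python
from datetime import date, timedelta


def rolling_with_holdout(start, end, window_count, window_days, holdout_days):
    """Return (role, first_day, last_day) tuples with inclusive date ranges."""
    windows = []
    cursor = start
    for _ in range(window_count):
        last = cursor + timedelta(days=window_days - 1)
        windows.append(("Evaluation", cursor, last))
        cursor = last + timedelta(days=1)
    # ASSUMPTION: the holdout is anchored to the end of the run range.
    holdout_start = end - timedelta(days=holdout_days - 1)
    if holdout_start < cursor:
        raise ValueError("holdout overlaps the evaluation windows")
    windows.append(("Holdout", holdout_start, end))
    return windows


layout = rolling_with_holdout(date(2026, 2, 7), date(2026, 3, 23), 2, 14, 14)
for role, first, last in layout:
    print(role, first.isoformat(), last.isoformat())
```

With these inputs the layout reproduces the three ranges shown in the Window Results table.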

Evaluation Summary

Main Set

Baseline Mean

9.88

Challenger Mean

11.61

Delta

1.73

Stability Score

0.02

Baseline Window Wins

2

Challenger Window Wins

0

Holdout Summary

Final Check

Baseline Holdout

8.91

Challenger Holdout

9.23

Holdout Delta

0.33

Source Experiment

Created

2026-03-24 11:36

Completed

2026-03-24 11:37

Promotion Gates

Core gate outcomes used by the policy engine before guardrails and final governance review.

Policy Engine

Comparable Window Gate

Pass

Games Gate

Pass

Player Rows Gate

Pass

Primary Metric Gate

Fail

Holdout Gate

Fail

Secondary Metric Guardrails

Protected regression checks that prevent narrow gains from degrading adjacent model quality.

Guardrails

Points MAE Guard

Fail

Delta: 1.91

Minutes MAE Guard

Fail

Delta: 1.73

Fair Spread MAE Guard

Pass

Delta: 0.14

Fair Total MAE Guard

Pass

Delta: 0.01
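The guardrail outcomes above are consistent with a simple rule: a guard fails when the metric's delta exceeds its maximum allowed regression. The sketch below assumes that rule, and also assumes that a maximum regression of n/a means the guard is not enforced. The `guard_passes` helper is hypothetical; the deltas and limits are the ones quoted in the decision reason.

```python
from typing import Optional


def guard_passes(delta: float, max_regression: Optional[float]) -> bool:
    """Deltas are challenger minus baseline MAE, so positive means worse."""
    # ASSUMPTION: "n/a" (None here) means the guard is not enforced.
    if max_regression is None:
        return True
    return delta <= max_regression


# (delta, max regression) per protected metric, as reported for this run.
guards = {
    "PointsMae": (1.907037, 0.15),
    "MinutesMae": (1.729747, 0.0),
    "FairSpreadMae": (0.135613, None),
    "FairTotalMae": (0.013534, None),
}

results = {name: guard_passes(delta, limit) for name, (delta, limit) in guards.items()}
for name, ok in results.items():
    print(name, "Pass" if ok else "Fail")
```

Under these assumptions the Points MAE and Minutes MAE guards fail and the two fair-line guards pass, as shown on this page.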

Final Decision Governance

Human governance decision layer on top of advisor recommendation and immutable validation evidence.

Final Review

Advisor Recommendation

Reject

Final Decision

Reject

Overridden

Decision By

automation-dashboard

Decision Time

2026-03-24 11:37

Final Decision Reason

Auto-applied by automation workflow

Governance Action

Record the final decision with explicit reasoning. This creates the last decision layer above the policy engine’s recommendation.

Window Results

Comparable and holdout window outcomes across the full walk-forward validation timeline.

RollingWithHoldout 3 window(s)

Window Reading Guide

Use comparable windows to judge repeated challenger strength, and the holdout window to test whether the improvement survives outside the main evaluation set.

#	Role	Status	Range	Baseline	Challenger	Delta	Winner	Games	Player Rows	Notes
1	Evaluation	Comparable	2026-02-07 → 2026-02-20	10.49	12.20	1.71	Baseline	B: 64 C: 64	B: 1276 C: 1279	—
2	Evaluation	Comparable	2026-02-21 → 2026-03-06	9.26	11.02	1.75	Baseline	B: 104 C: 104	B: 2078 C: 2078	—
3	Holdout	Holdout	2026-03-10 → 2026-03-23	8.91	9.23	0.33	Baseline	B: 112 C: 112	B: 2239 C: 2239	—
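The summary figures can be approximately re-derived from the table. The sketch below assumes that the lower MAE wins a window, that ties go to the baseline, and that the evaluation means are unweighted averages over the evaluation windows, which is consistent with the displayed 9.88 and 11.61. It uses the rounded values shown in the table, so results can differ from the page in the last digit.

```python
# (role, baseline MAE, challenger MAE) as displayed (rounded) in the table.
windows = [
    ("Evaluation", 10.49, 12.20),
    ("Evaluation", 9.26, 11.02),
    ("Holdout", 8.91, 9.23),
]


def winner(baseline: float, challenger: float) -> str:
    # Lower MAE wins. ASSUMPTION: ties go to the baseline.
    return "Challenger" if challenger < baseline else "Baseline"


evaluation = [(b, c) for role, b, c in windows if role == "Evaluation"]

challenger_wins = sum(winner(b, c) == "Challenger" for b, c in evaluation)
win_rate = challenger_wins / len(evaluation)

# ASSUMPTION: unweighted mean across evaluation windows.
baseline_mean = sum(b for b, _ in evaluation) / len(evaluation)
challenger_mean = sum(c for _, c in evaluation) / len(evaluation)
delta = challenger_mean - baseline_mean

print(f"win rate={win_rate:.1%}")
print(f"baseline mean={baseline_mean:.3f} challenger mean={challenger_mean:.3f}")
print(f"delta={delta:.3f}")
```

The challenger wins neither evaluation window, so the window win rate is 0.0%, and the holdout row shows the same direction with a smaller gap.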

Notes

Context

Description

Automated campaign validation

Internal Notes

Final decision set to Reject by automation-dashboard at 2026-03-24T18:37:57.6008703Z. Reason: Auto-applied by automation workflow