Validation Lab
Governance Verdict
Completed
RollingWithHoldout
Validate Active random-0007-0006 vs Challenger 45
Walk-forward validation audit, holdout protection, and final promotion readiness for challenger governance.
Status
Completed
Current lifecycle state of this validation run.
Primary Metric
MinutesMae
Governing metric used by the recommendation engine.
Recommendation
Reject
Policy-engine recommendation based on evaluation, holdout, and guardrails.
Window Win Rate
100.0%
Share of comparable windows won by the challenger.
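The win rate above can be reproduced from the per-window values in the Window Results table. This is a minimal sketch, not the lab's actual implementation: `WindowResult` and `challenger_win_rate` are hypothetical names, and it assumes that a lower MAE wins and that the holdout window is excluded from the count.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WindowResult:
    """Hypothetical container for one evaluation window."""
    baseline_mae: float
    challenger_mae: float
    comparable: bool


def challenger_win_rate(windows: Sequence[WindowResult]) -> Optional[float]:
    """Share of comparable windows where the challenger has the lower MAE."""
    comparable = [w for w in windows if w.comparable]
    if not comparable:
        return None  # no comparable windows, so no rate can be reported
    wins = sum(1 for w in comparable if w.challenger_mae < w.baseline_mae)
    return wins / len(comparable)


# Evaluation windows 1-3 as displayed on this page (holdout excluded by assumption).
windows = [
    WindowResult(12.89, 12.86, True),
    WindowResult(11.85, 11.78, True),
    WindowResult(11.31, 11.16, True),
]
rate = challenger_win_rate(windows)
print(f"{rate:.1%}")  # 100.0%
```

A win rate says nothing about the size of each win, which is why the page reports the deltas and the holdout result separately.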
Governance Verdict
Summary of the challenger’s posture after evaluation windows, holdout checks, and promotion policy gates.
Recommendation: Reject
Final: Promote
Decision Reason
Comparable evaluation windows=3 (required>=2).
Comparable evaluation games=224 (required>=40).
Comparable evaluation player rows=4470 (required>=400).
Holdout comparable=True (required=True).
Holdout games=36 (required>=6).
Holdout player rows=719 (required>=120).
RequireHoldoutWin=False.
PrimaryMetricImprovementThreshold=0.05.
HoldoutRegressionTolerance=0.05.
Evaluation delta=null.
Holdout delta=null.
PointsMae delta=0.073333 (max regression=0.15).
MinutesMae delta=-0.085028 (max regression=0.5).
FairSpreadMae delta=0.018794 (max regression=0.25).
FairTotalMae delta=0.002617 (max regression=0.25).
Failed holdout regression tolerance gate.
Failed primary metric improvement gate.
Recommendation=Reject because challenger failed performance or guardrail gates.
Governance Interpretation
A strong recommendation should still be read alongside holdout performance, guardrail posture, and the final human governance decision.
Decision Support
Audit
Validation Standard
Challenger approval should require both target-metric strength and acceptable regression behavior across protected secondary metrics.
Best Practice
Treat this page as the governance verdict surface. Promotion should remain deliberate and supported by both policy outcome and reviewer judgment.
Traceability
This run preserves immutable validation evidence, explicit gates, holdout results, and final-decision history.
Model Comparison
Baseline production model versus challenger candidate under governance review.
BaselineVsChallenger
Baseline
random-0007-0006
Model 22
Challenger
structured-0010-0003
Model 45
Evaluation Evidence
Sample Quality
Comparable Windows
3
Insufficient Windows
0
Comparable Games
224
Comparable Player Rows
4470
Min Games / Window
10
Run-Level Sample Gate
Pass
Min Total Games
40
Min Player Rows
400
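The sample-quality figures above can be checked with a few comparisons. The sketch below is illustrative only: the function-free layout and variable names are hypothetical, the minimum of 2 comparable windows is taken from the Decision Reason text, and treating a window as comparable when it has at least "Min Games / Window" games is an assumption.

```python
# Thresholds as shown on this page.
MIN_GAMES_PER_WINDOW = 10   # "Min Games / Window"
MIN_WINDOWS = 2             # "required>=2" in the Decision Reason
MIN_TOTAL_GAMES = 40        # "Min Total Games"
MIN_PLAYER_ROWS = 400       # "Min Player Rows"

# Per-window counts for evaluation windows 1-3 (from the Window Results table).
games_per_window = [76, 72, 76]
rows_per_window = [1517, 1438, 1515]

# Assumption: a window is comparable when it reaches the per-window game minimum.
comparable = [g for g in games_per_window if g >= MIN_GAMES_PER_WINDOW]
insufficient = len(games_per_window) - len(comparable)

gates = {
    "comparable_window_gate": len(comparable) >= MIN_WINDOWS,
    "games_gate": sum(games_per_window) >= MIN_TOTAL_GAMES,
    "player_rows_gate": sum(rows_per_window) >= MIN_PLAYER_ROWS,
}
run_level_sample_gate = all(gates.values())

print(sum(games_per_window), sum(rows_per_window), insufficient)
print(gates, run_level_sample_gate)
```

With the page's numbers the totals come to 224 games and 4470 player rows, matching the Comparable Games and Comparable Player Rows values shown above.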
Run Definition
Config
Comparison Kind
BaselineVsChallenger
Window Strategy
RollingWithHoldout
Window Count
3
Gap Days
0
Window Size Days
10
Holdout Size Days
10
Start
2026-01-14
End
2026-03-04
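The run definition above is enough to lay out the window schedule. The following is a minimal sketch of a rolling-with-holdout layout under stated assumptions: `rolling_windows_with_holdout` is a hypothetical helper, window ranges are inclusive, and the gap is applied between every pair of consecutive windows, including before the holdout. It does not model the run's End date.

```python
from datetime import date, timedelta


def rolling_windows_with_holdout(start, window_count, window_size_days,
                                 gap_days, holdout_size_days):
    """Return (index, role, first_day, last_day) tuples with inclusive ranges."""
    windows = []
    cursor = start
    for index in range(1, window_count + 1):
        last_day = cursor + timedelta(days=window_size_days - 1)
        windows.append((index, "Evaluation", cursor, last_day))
        # The next window starts the day after this one ends, plus the gap.
        cursor = last_day + timedelta(days=1 + gap_days)
    holdout_last = cursor + timedelta(days=holdout_size_days - 1)
    windows.append((window_count + 1, "Holdout", cursor, holdout_last))
    return windows


schedule = rolling_windows_with_holdout(
    start=date(2026, 1, 14),
    window_count=3,
    window_size_days=10,
    gap_days=0,
    holdout_size_days=10,
)
for index, role, first_day, last_day in schedule:
    print(index, role, first_day.isoformat(), "->", last_day.isoformat())
```

Under these assumptions the four ranges match the Range column of the Window Results table, with the holdout covering 2026-02-13 through 2026-02-22.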
Evaluation Summary
Main Set
Baseline Mean
12.02
Challenger Mean
11.93
Delta
-0.09
Stability Score
0.05
Baseline Window Wins
0
Challenger Window Wins
3
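The summary means above appear consistent with an unweighted average of the three evaluation windows. The sketch below checks that reading against the displayed window values; it is an assumption about how the summary is computed, not a statement of the lab's method, and it works from rounded page values.

```python
from statistics import fmean

# MinutesMae per evaluation window, as displayed in the Window Results table.
baseline_by_window = [12.89, 11.85, 11.31]
challenger_by_window = [12.86, 11.78, 11.16]

baseline_mean = fmean(baseline_by_window)
challenger_mean = fmean(challenger_by_window)
# Negative delta means the challenger has the lower (better) MAE.
delta = challenger_mean - baseline_mean

print(f"baseline={baseline_mean:.2f} challenger={challenger_mean:.2f} delta={delta:.4f}")
```

The means round to 12.02 and 11.93, as shown on the page. The delta computed from rounded inputs is about -0.083, slightly off the -0.085028 reported in the Decision Reason, which presumably uses unrounded window values.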
Holdout Summary
Final Check
Baseline Holdout
10.19
Challenger Holdout
10.00
Holdout Delta
-0.20
Source Experiment
10
Created
2026-03-11 07:52
Completed
2026-03-11 07:52
Promotion Gates
Core gate outcomes used by the policy engine before guardrails and final governance review.
Policy Engine
Comparable Window Gate
Pass
Games Gate
Pass
Player Rows Gate
Pass
Primary Metric Gate
Pass
Holdout Gate
Pass
Secondary Metric Guardrails
Protected regression checks that prevent narrow gains from degrading adjacent model quality.
Guardrails
Points MAE Guard
Pass
Delta: 0.07
Minutes MAE Guard
Pass
Delta: -0.09
Fair Spread MAE Guard
Pass
Delta: 0.02
Fair Total MAE Guard
Pass
Delta: 0.00
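Each guard above compares a metric's delta with the maximum regression listed in the Decision Reason. A minimal sketch of that comparison follows; it assumes delta means challenger minus baseline (so a positive MAE delta is a regression) and that a guard passes when the delta does not exceed the limit. `GUARDRAILS` and `guard_passes` are hypothetical names.

```python
# metric -> (delta, max regression), copied from the Decision Reason on this page.
GUARDRAILS = {
    "PointsMae": (0.073333, 0.15),
    "MinutesMae": (-0.085028, 0.5),
    "FairSpreadMae": (0.018794, 0.25),
    "FairTotalMae": (0.002617, 0.25),
}


def guard_passes(delta: float, max_regression: float) -> bool:
    """Assumed rule: a guard passes while the regression stays within its limit."""
    return delta <= max_regression


results = {name: guard_passes(d, limit) for name, (d, limit) in GUARDRAILS.items()}
for name, passed in results.items():
    delta, limit = GUARDRAILS[name]
    print(f"{name}: delta={delta:.2f} limit={limit} -> {'Pass' if passed else 'Fail'}")
```

Under this rule all four guards pass, which agrees with the statuses displayed above. PointsMae regresses slightly (0.07) but stays well inside its 0.15 limit.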
Final Decision Governance
Human governance decision layer on top of advisor recommendation and immutable validation evidence.
Final Review
Advisor Recommendation
Reject
Final Decision
Promote
Overridden
Yes
Decision By
ui
Decision Time
2026-03-11 08:05
Final Decision Reason
This model is better and guardrails are failing right now
Governance Action
Record the final decision with explicit reasoning. This creates the last decision layer above the policy engine’s recommendation.
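One way to picture the decision layer described above is as an immutable record that sits on top of the advisor recommendation. This is a hypothetical sketch: the `FinalDecision` class and its fields are illustrative, the requirement for a non-empty reason is an assumption drawn from "explicit reasoning", and the UTC timestamp is taken from the Internal Notes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FinalDecision:
    """Hypothetical record of the human decision layered over the policy engine."""
    advisor_recommendation: str
    final_decision: str
    decided_by: str
    decided_at: datetime
    reason: str

    def __post_init__(self):
        # Assumption: a decision without explicit reasoning is rejected.
        if not self.reason.strip():
            raise ValueError("a final decision needs an explicit reason")

    @property
    def overridden(self) -> bool:
        """True when the reviewer departs from the advisor recommendation."""
        return self.final_decision != self.advisor_recommendation


decision = FinalDecision(
    advisor_recommendation="Reject",
    final_decision="Promote",
    decided_by="ui",
    decided_at=datetime(2026, 3, 11, 15, 5, 8, tzinfo=timezone.utc),
    reason="This model is better and guardrails are failing right now",
)
print(decision.overridden)  # True
```

Freezing the record keeps the decision history append-only in spirit: a changed verdict would be a new record rather than an edit to this one.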
Window Results
Comparable and holdout window outcomes across the full walk-forward validation timeline.
RollingWithHoldout
4 window(s)
Window Reading Guide
Use comparable windows to judge repeated challenger strength, and the holdout window to test whether the improvement survives outside the main evaluation set.
| # | Role | Status | Range | Baseline | Challenger | Delta | Winner | Games | Player Rows | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Evaluation | Comparable | 2026-01-14 → 2026-01-23 | 12.89 | 12.86 | -0.03 | Challenger | B: 76, C: 76 | B: 1517, C: 1517 | — |
| 2 | Evaluation | Comparable | 2026-01-24 → 2026-02-02 | 11.85 | 11.78 | -0.07 | Challenger | B: 72, C: 72 | B: 1438, C: 1438 | — |
| 3 | Evaluation | Comparable | 2026-02-03 → 2026-02-12 | 11.31 | 11.16 | -0.15 | Challenger | B: 76, C: 76 | B: 1515, C: 1515 | — |
| 4 | Holdout | Holdout | 2026-02-13 → 2026-02-22 | 10.19 | 10.00 | -0.20 | Challenger | B: 36, C: 36 | B: 719, C: 719 | — |
Notes
Context
Description
Direction Memory Validation V1
Internal Notes
Final decision set to Promote by ui at 2026-03-11T15:05:08.7062861Z. Reason: This model is better and guardrails are failing right now