Validation Lab Governance Verdict Completed RollingWithHoldout

Validate Active random-0007-0006 vs Challenger 45

Walk-forward validation audit, holdout protection, and final promotion readiness for challenger governance.
Status
Completed
Current lifecycle state of this validation run.
Primary Metric
MinutesMae
Governing metric used by the recommendation engine.
Recommendation
Reject
Policy-engine recommendation based on evaluation, holdout, and guardrails.
Window Win Rate
100.0%
Share of comparable windows won by the challenger.

Governance Verdict

Summary of the challenger’s posture after evaluation windows, holdout checks, and promotion policy gates.
Recommendation: Reject Final: Promote
Decision Reason
Comparable evaluation windows=3 (required>=2).
Comparable evaluation games=224 (required>=40).
Comparable evaluation player rows=4470 (required>=400).
Holdout comparable=True (required=True).
Holdout games=36 (required>=6).
Holdout player rows=719 (required>=120).
RequireHoldoutWin=False.
PrimaryMetricImprovementThreshold=0.05.
HoldoutRegressionTolerance=0.05.
Evaluation delta=null.
Holdout delta=null.
PointsMae delta=0.073333 (max regression=0.15).
MinutesMae delta=-0.085028 (max regression=0.5).
FairSpreadMae delta=0.018794 (max regression=0.25).
FairTotalMae delta=0.002617 (max regression=0.25).
Failed holdout regression tolerance gate.
Failed primary metric improvement gate.
Recommendation=Reject because challenger failed performance or guardrail gates.
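The decision reason reports Evaluation delta=null and Holdout delta=null together with two failed gates. One possible reading, offered as an assumption and not as the policy engine's confirmed behavior, is that a missing delta cannot satisfy a threshold comparison, so the gate fails closed. The Promotion Gates panel on this page lists both gates as Pass, so this reading may not match the real engine. The function names, the sign convention (delta = challenger minus baseline, lower MAE is better), the absolute-threshold interpretation, and the "Promote" label are all hypothetical.

```python
from typing import List, Optional, Tuple

def primary_metric_gate(eval_delta: Optional[float], improvement_threshold: float) -> bool:
    """Pass only if the challenger improves the primary metric by at least the threshold.

    Assumption: a missing delta (None) fails the gate.
    """
    if eval_delta is None:
        return False
    return -eval_delta >= improvement_threshold

def holdout_regression_gate(holdout_delta: Optional[float], tolerance: float) -> bool:
    """Pass only if the holdout metric does not regress by more than the tolerance.

    Assumption: a missing delta (None) fails the gate.
    """
    if holdout_delta is None:
        return False
    return holdout_delta <= tolerance

def recommend(
    eval_delta: Optional[float],
    holdout_delta: Optional[float],
    improvement_threshold: float = 0.05,  # PrimaryMetricImprovementThreshold from the decision reason
    tolerance: float = 0.05,              # HoldoutRegressionTolerance from the decision reason
) -> Tuple[str, List[str]]:
    """Return a hypothetical recommendation label plus the list of failed gates."""
    failures: List[str] = []
    if not holdout_regression_gate(holdout_delta, tolerance):
        failures.append("holdout regression tolerance")
    if not primary_metric_gate(eval_delta, improvement_threshold):
        failures.append("primary metric improvement")
    return ("Reject" if failures else "Promote", failures)

# With null deltas, as logged in the decision reason, both gates fail.
null_result = recommend(None, None)
# With the MinutesMae deltas shown elsewhere on this page, both gates would pass in this sketch.
numeric_result = recommend(-0.085028, -0.20)
```

Under these assumptions the null deltas alone are enough to produce a Reject, which is why the logged deltas are worth checking before reading the recommendation as a performance judgment.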
Governance Interpretation
A strong recommendation should still be read alongside holdout performance, guardrail posture, and the final human governance decision.

Decision Support

Audit
Validation Standard
Challenger approval should require both target-metric strength and acceptable regression behavior across protected secondary metrics.
Best Practice
Treat this page as the governance verdict surface. Promotion should remain deliberate and supported by both policy outcome and reviewer judgment.
Traceability
This run preserves immutable validation evidence, explicit gates, holdout results, and final-decision history.

Model Comparison

Baseline production model versus challenger candidate under governance review.
BaselineVsChallenger
Baseline
random-0007-0006
Model 22
Challenger
structured-0010-0003
Model 45

Evaluation Evidence

Sample Quality
Comparable Windows
3
Insufficient Windows
0
Comparable Games
224
Comparable Player Rows
4470
Min Games / Window
10
Run-Level Sample Gate
Pass
Min Total Games
40
Min Player Rows
400
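The run-level sample gate above can be sketched as a set of minimum-count checks. This is an illustration only: the class and function names are hypothetical, the window minimum of 2 comes from the decision reason (required>=2), and treating "Min Games / Window" as the cutoff for a comparable window is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleRequirements:
    min_windows: int = 2           # from the decision reason: required>=2
    min_total_games: int = 40      # Min Total Games
    min_player_rows: int = 400     # Min Player Rows
    min_games_per_window: int = 10 # Min Games / Window

def window_is_comparable(games: int, req: SampleRequirements = SampleRequirements()) -> bool:
    """Assumption: a window counts as comparable once it has enough games."""
    return games >= req.min_games_per_window

def run_level_sample_gate(
    comparable_windows: int,
    comparable_games: int,
    comparable_player_rows: int,
    req: SampleRequirements = SampleRequirements(),
) -> bool:
    """Pass only when every run-level minimum is met."""
    return (
        comparable_windows >= req.min_windows
        and comparable_games >= req.min_total_games
        and comparable_player_rows >= req.min_player_rows
    )

# Values reported on this page: 3 windows, 224 games, 4470 player rows.
sample_gate_passed = run_level_sample_gate(3, 224, 4470)
```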

Run Definition

Config
Comparison Kind
BaselineVsChallenger
Window Strategy
RollingWithHoldout
Window Count
3
Gap Days
0
Window Size Days
10
Holdout Size Days
10
Start
2026-01-14
End
2026-03-04
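As a rough illustration of how a RollingWithHoldout configuration could expand into date ranges, the sketch below lays out consecutive inclusive windows from the start date and appends one holdout window. The function name is hypothetical, applying the gap before the holdout window is an assumption (it has no effect with Gap Days = 0), and the configured End date is not used here.

```python
from datetime import date, timedelta
from typing import List, Tuple

def rolling_windows_with_holdout(
    start: date,
    window_count: int,
    window_size_days: int,
    gap_days: int,
    holdout_size_days: int,
) -> List[Tuple[str, date, date]]:
    """Return (role, first_day, last_day) tuples with inclusive date ranges."""
    windows: List[Tuple[str, date, date]] = []
    cursor = start
    for _ in range(window_count):
        last_day = cursor + timedelta(days=window_size_days - 1)
        windows.append(("Evaluation", cursor, last_day))
        # The next window starts the day after this one ends, plus any gap.
        cursor = last_day + timedelta(days=1 + gap_days)
    holdout_last_day = cursor + timedelta(days=holdout_size_days - 1)
    windows.append(("Holdout", cursor, holdout_last_day))
    return windows

# Configuration values from this run.
schedule = rolling_windows_with_holdout(date(2026, 1, 14), 3, 10, 0, 10)
for role, first_day, last_day in schedule:
    print(role, first_day.isoformat(), "->", last_day.isoformat())
```

With this run's settings the sketch yields the same four ranges listed under Window Results, ending with a holdout of 2026-02-13 through 2026-02-22.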

Evaluation Summary

Main Set
Baseline Mean
12.02
Challenger Mean
11.93
Delta
-0.09
Stability Score
0.05
Baseline Window Wins
0
Challenger Window Wins
3

Holdout Summary

Final Check
Baseline Holdout
10.19
Challenger Holdout
10.00
Holdout Delta
-0.20
Source Experiment
10
Created
2026-03-11 07:52
Completed
2026-03-11 07:52

Promotion Gates

Core gate outcomes used by the policy engine before guardrails and final governance review.
Policy Engine
Comparable Window Gate
Pass
Games Gate
Pass
Player Rows Gate
Pass
Primary Metric Gate
Pass
Holdout Gate
Pass

Secondary Metric Guardrails

Protected regression checks that prevent narrow gains from degrading adjacent model quality.
Guardrails
Points MAE Guard
Pass
Delta: 0.07
Minutes MAE Guard
Pass
Delta: -0.09
Fair Spread MAE Guard
Pass
Delta: 0.02
Fair Total MAE Guard
Pass
Delta: 0.00
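A guardrail of this kind can be read as a simple bound on each secondary-metric delta. The sketch below uses the unrounded deltas and max-regression limits quoted in the decision reason; the function name, the inclusive boundary, and the sign convention (positive delta means the challenger's error increased) are assumptions.

```python
from typing import Dict, Tuple

# metric name -> (delta, max regression), copied from the decision reason
GUARDRAILS: Dict[str, Tuple[float, float]] = {
    "PointsMae": (0.073333, 0.15),
    "MinutesMae": (-0.085028, 0.5),
    "FairSpreadMae": (0.018794, 0.25),
    "FairTotalMae": (0.002617, 0.25),
}

def guardrail_passes(delta: float, max_regression: float) -> bool:
    """Pass when the metric has not regressed by more than the allowed amount."""
    return delta <= max_regression

guardrail_results = {
    name: guardrail_passes(delta, limit) for name, (delta, limit) in GUARDRAILS.items()
}
for name, passed in guardrail_results.items():
    print(f"{name}: {'Pass' if passed else 'Fail'}")
```

Under these assumptions all four guards pass, matching the statuses shown above.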

Final Decision Governance

Human governance decision layer on top of advisor recommendation and immutable validation evidence.
Final Review
Advisor Recommendation
Reject
Final Decision
Promote
Overridden
Yes
Decision By
ui
Decision Time
2026-03-11 08:05
Final Decision Reason
This model is better and guardrails are failing right now
Governance Action
Record the final decision with explicit reasoning. This creates the last decision layer above the policy engine’s recommendation.
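One way to model this final decision layer is an immutable record that requires a reason and derives the Overridden flag from the two decisions. This is a sketch, not the application's actual data model: the class and field names are hypothetical, and the timestamp is the UTC value from the Internal Notes truncated to whole seconds.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class FinalDecision:
    advisor_recommendation: str
    final_decision: str
    decided_by: str
    decided_at: datetime
    reason: str

    def __post_init__(self) -> None:
        # Explicit reasoning is required, per the Governance Action note.
        if not self.reason.strip():
            raise ValueError("A final decision must include a reason.")

    @property
    def overridden(self) -> bool:
        """True when the human decision differs from the advisor recommendation."""
        return self.final_decision != self.advisor_recommendation

decision = FinalDecision(
    advisor_recommendation="Reject",
    final_decision="Promote",
    decided_by="ui",
    decided_at=datetime(2026, 3, 11, 15, 5, 8, tzinfo=timezone.utc),
    reason="This model is better and guardrails are failing right now",
)
```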

Window Results

Comparable and holdout window outcomes across the full walk-forward validation timeline.
RollingWithHoldout 4 window(s)
Window Reading Guide
Use comparable windows to judge repeated challenger strength, and the holdout window to test whether the improvement survives outside the main evaluation set.
# | Role | Status | Range | Baseline | Challenger | Delta | Winner | Games | Player Rows | Notes
1 | Evaluation | Comparable | 2026-01-14 → 2026-01-23 | 12.89 | 12.86 | -0.03 | Challenger | B: 76, C: 76 | B: 1517, C: 1517 |
2 | Evaluation | Comparable | 2026-01-24 → 2026-02-02 | 11.85 | 11.78 | -0.07 | Challenger | B: 72, C: 72 | B: 1438, C: 1438 |
3 | Evaluation | Comparable | 2026-02-03 → 2026-02-12 | 11.31 | 11.16 | -0.15 | Challenger | B: 76, C: 76 | B: 1515, C: 1515 |
4 | Holdout | Holdout | 2026-02-13 → 2026-02-22 | 10.19 | 10.00 | -0.20 | Challenger | B: 36, C: 36 | B: 719, C: 719 |
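The headline figures (Baseline Mean, Challenger Mean, Window Win Rate) can be approximately reproduced from the three evaluation windows. The sketch below assumes the means are unweighted averages across comparable windows and that a window is won by the model with the lower MinutesMae; both are assumptions, and the inputs are the two-decimal values displayed on this page, not the unrounded ones the system holds.

```python
# (baseline MinutesMae, challenger MinutesMae) for evaluation windows 1-3
evaluation_windows = [
    (12.89, 12.86),
    (11.85, 11.78),
    (11.31, 11.16),
]

n = len(evaluation_windows)
baseline_mean = sum(b for b, _ in evaluation_windows) / n
challenger_mean = sum(c for _, c in evaluation_windows) / n

# Assumption: lower error wins the window.
challenger_wins = sum(1 for b, c in evaluation_windows if c < b)
win_rate = challenger_wins / n

print(f"Baseline mean:   {baseline_mean:.2f}")
print(f"Challenger mean: {challenger_mean:.2f}")
print(f"Window win rate: {win_rate:.1%}")
```

Because the inputs are already rounded, a delta recomputed from them can differ in the last digit from the reported -0.09, which corresponds to the unrounded MinutesMae delta of -0.085028 in the decision reason.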

Notes

Context
Description
Direction Memory Validation V1
Internal Notes
Final decision set to Promote by ui at 2026-03-11T15:05:08.7062861Z. Reason: This model is better and guardrails are failing right now