
Validation Lab Governance Verdict Completed RollingWithHoldout

Validate Active random-0007-0006 vs Challenger 45

Walk-forward validation audit, holdout protection, and final promotion readiness for challenger governance.


Status

Completed

Current lifecycle state of this validation run.

Primary Metric

MinutesMae

Governing metric used by the recommendation engine.

Recommendation

Reject

Policy-engine recommendation based on evaluation, holdout, and guardrails.

Window Win Rate

100.0%

Share of comparable windows won by the challenger.

Governance Verdict

Summary of the challenger’s posture after evaluation windows, holdout checks, and promotion policy gates.

Recommendation: Reject Final: Promote

Decision Reason

Comparable evaluation windows=3 (required>=2).
Comparable evaluation games=224 (required>=40).
Comparable evaluation player rows=4470 (required>=400).
Holdout comparable=True (required=True).
Holdout games=36 (required>=6).
Holdout player rows=719 (required>=120).
RequireHoldoutWin=False.
PrimaryMetricImprovementThreshold=0.05.
HoldoutRegressionTolerance=0.05.
Evaluation delta=null.
Holdout delta=null.
PointsMae delta=0.073333 (max regression=0.15).
MinutesMae delta=-0.085028 (max regression=0.5).
FairSpreadMae delta=0.018794 (max regression=0.25).
FairTotalMae delta=0.002617 (max regression=0.25).
Failed holdout regression tolerance gate.
Failed primary metric improvement gate.
Recommendation=Reject because challenger failed performance or guardrail gates.
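The sample-size portion of the decision reason can be read as a set of "observed >= required" checks. The sketch below is a minimal illustration of that reading, not the policy engine's actual code; the `SampleGate` class and the gate names are hypothetical, while the observed and required values are copied from the decision reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleGate:
    """Hypothetical container for one sample-size gate."""
    name: str
    observed: int
    required_min: int

    @property
    def passed(self) -> bool:
        # A gate passes when the observed count meets the stated minimum.
        return self.observed >= self.required_min


# Values taken from the decision reason on this page.
gates = [
    SampleGate("comparable_windows", 3, 2),
    SampleGate("comparable_games", 224, 40),
    SampleGate("comparable_player_rows", 4470, 400),
    SampleGate("holdout_games", 36, 6),
    SampleGate("holdout_player_rows", 719, 120),
]

sample_gates_pass = all(g.passed for g in gates)

for g in gates:
    status = "Pass" if g.passed else "Fail"
    print(f"{g.name}: {g.observed} >= {g.required_min} -> {status}")
```

Under this reading every sample-size gate passes, so the Reject recommendation would have to come from the performance gates rather than from sample size.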

Governance Interpretation

A strong recommendation should still be read alongside holdout performance, guardrail posture, and the final human governance decision.

Decision Support

Audit

Validation Standard

Challenger approval should require both target-metric strength and acceptable regression behavior across protected secondary metrics.

Best Practice

Treat this page as the governance verdict surface. Promotion should remain deliberate and supported by both policy outcome and reviewer judgment.

Traceability

This run preserves immutable validation evidence, explicit gates, holdout results, and final-decision history.

Model Comparison

Baseline production model versus challenger candidate under governance review.

BaselineVsChallenger

Baseline

random-0007-0006

Model 22

Challenger

structured-0010-0003

Model 45

Evaluation Evidence

Sample Quality

Comparable Windows

Insufficient Windows

Comparable Games

224

Comparable Player Rows

4470

Min Games / Window

Run-Level Sample Gate

Pass

Min Total Games

Min Player Rows

400

Run Definition

Config

Comparison Kind

BaselineVsChallenger

Window Strategy

RollingWithHoldout

Window Count

Gap Days

Window Size Days

Holdout Size Days

Start

2026-01-14

End

2026-03-04
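The window size and gap values are not shown on this page, so the following is only a sketch of how a rolling-with-holdout schedule could be laid out. It assumes 10-day windows, no gap, three evaluation windows, and one trailing holdout window, chosen because they reproduce the ranges listed under Window Results. The function name `rolling_windows` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def rolling_windows(start, window_days, eval_count, gap_days=0):
    """Build consecutive evaluation windows followed by one holdout window.

    Hypothetical helper: the last window is labelled Holdout, all earlier
    ones Evaluation. Ranges are inclusive on both ends.
    """
    windows = []
    cursor = start
    for i in range(eval_count + 1):
        end = cursor + timedelta(days=window_days - 1)
        role = "Evaluation" if i < eval_count else "Holdout"
        windows.append((role, cursor, end))
        # The next window starts the day after this one ends, plus any gap.
        cursor = end + timedelta(days=1 + gap_days)
    return windows


# Assumed parameters (not shown on the page).
windows = rolling_windows(date(2026, 1, 14), window_days=10, eval_count=3)

for role, first_day, last_day in windows:
    print(role, first_day.isoformat(), "->", last_day.isoformat())
```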

Evaluation Summary

Main Set

Baseline Mean

12.02

Challenger Mean

11.93

Delta

-0.09

Stability Score

0.05

Baseline Window Wins

Challenger Window Wins

Holdout Summary

Final Check

Baseline Holdout

10.19

Challenger Holdout

10.00

Holdout Delta

-0.20

Source Experiment

Created

2026-03-11 07:52

Completed

2026-03-11 07:52

Promotion Gates

Core gate outcomes used by the policy engine before guardrails and final governance review.

Policy Engine

Comparable Window Gate

Pass

Games Gate

Pass

Player Rows Gate

Pass

Primary Metric Gate

Pass

Holdout Gate

Pass
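The decision reason reports "Evaluation delta=null" and "Holdout delta=null" together with failed primary-metric and holdout gates, while the summaries on this page show deltas of -0.085028 (MinutesMae) and -0.20 (holdout). The sketch below shows one possible way such gates could be defined, under the assumptions that lower MAE is better and that a missing delta fails the gate. Both function names and both rules are hypothetical and are not confirmed by this page.

```python
from typing import Optional


def primary_metric_gate(evaluation_delta: Optional[float], threshold: float) -> bool:
    """Assumed rule: the challenger must lower the metric by at least `threshold`."""
    if evaluation_delta is None:
        # Assumption: a delta that was never computed cannot satisfy the gate.
        return False
    return evaluation_delta <= -threshold


def holdout_gate(holdout_delta: Optional[float], tolerance: float) -> bool:
    """Assumed rule: the holdout metric may worsen by at most `tolerance`."""
    if holdout_delta is None:
        return False
    return holdout_delta <= tolerance


# Thresholds from the decision reason; deltas from the page summaries.
print(primary_metric_gate(-0.085028, 0.05))  # delta shown for MinutesMae
print(holdout_gate(-0.20, 0.05))             # delta shown in Holdout Summary
print(primary_metric_gate(None, 0.05))       # null delta, as in the decision reason
print(holdout_gate(None, 0.05))
```

Under these assumed rules the displayed deltas would pass both gates and the null deltas would fail them, which is one way to reconcile the two readings on the page.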

Secondary Metric Guardrails

Protected regression checks that prevent narrow gains from degrading adjacent model quality.

Guardrails

Points MAE Guard

Pass

Delta: 0.07

Minutes MAE Guard

Pass

Delta: -0.09

Fair Spread MAE Guard

Pass

Delta: 0.02

Fair Total MAE Guard

Pass

Delta: 0.00
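A guardrail of this kind can be sketched as a comparison of each secondary-metric delta against its maximum allowed regression. The example assumes that delta means challenger minus baseline and that lower MAE is better, so a positive delta is a regression. The helper `guard_passes` is hypothetical; the deltas and limits are the unrounded values from the decision reason.

```python
# (delta, max_regression) pairs copied from the decision reason.
guardrails = {
    "PointsMae": (0.073333, 0.15),
    "MinutesMae": (-0.085028, 0.5),
    "FairSpreadMae": (0.018794, 0.25),
    "FairTotalMae": (0.002617, 0.25),
}


def guard_passes(delta: float, max_regression: float) -> bool:
    """Assumed rule: a guard passes while the regression stays within its limit."""
    return delta <= max_regression


results = {
    name: guard_passes(delta, limit)
    for name, (delta, limit) in guardrails.items()
}

for name, ok in results.items():
    delta, limit = guardrails[name]
    print(f"{name}: delta={delta:+.2f} limit={limit} -> {'Pass' if ok else 'Fail'}")
```

With these inputs all four guards pass, which matches the Pass labels shown in this section.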

Final Decision Governance

Human governance decision layer on top of advisor recommendation and immutable validation evidence.

Final Review

Advisor Recommendation

Reject

Final Decision

Promote

Overridden

Yes

Decision By

Decision Time

2026-03-11 08:05

Final Decision Reason

This model is better and guardrails are failing right now

Governance Action

Record the final decision with explicit reasoning. This creates the last decision layer above the policy engine’s recommendation.
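One way to model this decision layer is a record that stores the advisor recommendation next to the human decision and derives the override flag from the two. The sketch below is illustrative only: the `FinalDecision` class and its field names are hypothetical, and the rule that a reason must be non-empty is an assumption based on the "explicit reasoning" wording above. The sample values come from this page and its internal notes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FinalDecision:
    """Hypothetical record of a human governance decision."""
    advisor_recommendation: str
    final_decision: str
    decided_by: str
    decided_at: datetime
    reason: str

    def __post_init__(self):
        # Assumption: a decision without explicit reasoning is rejected.
        if not self.reason.strip():
            raise ValueError("a final decision requires an explicit reason")

    @property
    def overridden(self) -> bool:
        # The decision is an override when it differs from the advisor.
        return self.advisor_recommendation != self.final_decision


decision = FinalDecision(
    advisor_recommendation="Reject",
    final_decision="Promote",
    decided_by="ui",
    decided_at=datetime(2026, 3, 11, 15, 5, 8, tzinfo=timezone.utc),
    reason="This model is better and guardrails are failing right now",
)

print("Overridden:", "Yes" if decision.overridden else "No")
```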

Window Results

Comparable and holdout window outcomes across the full walk-forward validation timeline.

RollingWithHoldout 4 window(s)

Window Reading Guide

Use comparable windows to judge repeated challenger strength, and the holdout window to test whether the improvement survives outside the main evaluation set.

#	Role	Status	Range	Baseline	Challenger	Delta	Winner	Games	Player Rows	Notes
1	Evaluation	Comparable	2026-01-14 → 2026-01-23	12.89	12.86	-0.03	Challenger	B: 76 C: 76	B: 1517 C: 1517	—
2	Evaluation	Comparable	2026-01-24 → 2026-02-02	11.85	11.78	-0.07	Challenger	B: 72 C: 72	B: 1438 C: 1438	—
3	Evaluation	Comparable	2026-02-03 → 2026-02-12	11.31	11.16	-0.15	Challenger	B: 76 C: 76	B: 1515 C: 1515	—
4	Holdout	Holdout	2026-02-13 → 2026-02-22	10.19	10.00	-0.20	Challenger	B: 36 C: 36	B: 719 C: 719	—
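The window win rate and the evaluation means reported earlier on this page can be recomputed from the table. The sketch below assumes that a window is won by the challenger when its value is lower than the baseline's (lower MAE is better) and that the means are simple unweighted averages over the evaluation windows. Variable names are illustrative.

```python
# (role, baseline, challenger) copied from the table, rounded as displayed.
windows = [
    ("Evaluation", 12.89, 12.86),
    ("Evaluation", 11.85, 11.78),
    ("Evaluation", 11.31, 11.16),
    ("Holdout", 10.19, 10.00),
]

# Only evaluation windows count toward the win rate and the main-set means.
evaluation = [(b, c) for role, b, c in windows if role == "Evaluation"]

challenger_wins = sum(1 for b, c in evaluation if c < b)
win_rate = challenger_wins / len(evaluation)

baseline_mean = sum(b for b, _ in evaluation) / len(evaluation)
challenger_mean = sum(c for _, c in evaluation) / len(evaluation)

print(f"Window win rate: {win_rate:.1%}")
print(f"Baseline mean: {baseline_mean:.2f}")
print(f"Challenger mean: {challenger_mean:.2f}")
```

Because the table values are rounded to two decimals, a delta recomputed from these means can differ slightly from the unrounded MinutesMae delta of -0.085028 given in the decision reason.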

Notes

Context

Description

Direction Memory Validation V1

Internal Notes

Final decision set to Promote by ui at 2026-03-11T15:05:08.7062861Z. Reason: This model is better and guardrails are failing right now