
Validation Lab Governance Verdict Completed RollingWithHoldout

Validate Active v1 vs Challenger 22

Walk-forward validation audit, holdout protection, and final promotion readiness for challenger governance.


Status

Completed

Current lifecycle state of this validation run.

Primary Metric

PraMae

Governing metric used by the recommendation engine.

Recommendation

NeedsReview

Policy-engine recommendation based on evaluation, holdout, and guardrails.

Window Win Rate

0.0%

Share of comparable windows won by the challenger.

Governance Verdict

Summary of the challenger’s posture after evaluation windows, holdout checks, and promotion policy gates.

Recommendation: NeedsReview

Decision Reason

Comparable evaluation windows: 1. Insufficient evaluation windows: 3. Comparable evaluation games: 144. Comparable player rows: 2875.

Thresholds: MinimumGamesPerWindow=10. MinimumTotalGames=40. MinimumPlayerRows=400.

Evaluation mean PraMae: baseline=10.532744, challenger=10.609572, delta=0.076828.

Holdout PraMae: baseline=9.530819, challenger=9.322832, delta=-0.207987.

Challenger won 0 of 1 comparable evaluation windows (0.00%).
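
The deltas and win rate quoted in the decision reason can be reproduced with simple arithmetic. The sketch below assumes the delta is challenger minus baseline (which matches the quoted figures) and that PraMae is an error metric where lower is better. The function names are illustrative and are not taken from the policy engine.

```python
from decimal import Decimal


def metric_delta(baseline: str, challenger: str) -> Decimal:
    """Return challenger minus baseline, computed exactly on decimal strings."""
    return Decimal(challenger) - Decimal(baseline)


def window_win_rate(challenger_wins: int, comparable_windows: int) -> float:
    """Share of comparable windows won by the challenger (0.0 when there are none)."""
    if comparable_windows == 0:
        return 0.0
    return challenger_wins / comparable_windows


# Figures copied from the decision reason.
evaluation_delta = metric_delta("10.532744", "10.609572")
holdout_delta = metric_delta("9.530819", "9.322832")
win_rate = window_win_rate(0, 1)

print(evaluation_delta)   # 0.076828
print(holdout_delta)      # -0.207987
print(f"{win_rate:.2%}")  # 0.00%
```

Under the lower-is-better assumption, the positive evaluation delta means the challenger's error was higher on the main set, while the negative holdout delta means its error was lower on the holdout window.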

Governance Interpretation

Even a strong recommendation should be read alongside holdout performance, guardrail posture, and the final human governance decision.

Decision Support

Audit

Validation Standard

Challenger approval should require both target-metric strength and acceptable regression behavior across protected secondary metrics.

Best Practice

Treat this page as the governance verdict surface. Promotion should remain deliberate and supported by both policy outcome and reviewer judgment.

Traceability

This run preserves immutable validation evidence, explicit gates, holdout results, and final-decision history.

Model Comparison

Baseline production model versus challenger candidate under governance review.

BaselineVsChallenger

Baseline

Model 1

Challenger

random-0007-0006

Model 22

Evaluation Evidence

Sample Quality

Comparable Windows

1

Insufficient Windows

3

Comparable Games

144

Comparable Player Rows

2875

Min Games / Window

10

Run-Level Sample Gate

Fail

Min Total Games

40

Min Player Rows

400

Run Definition

Config

Comparison Kind

BaselineVsChallenger

Window Strategy

RollingWithHoldout

Window Count

Gap Days

Window Size Days

Holdout Size Days

Start

2025-11-10

End

2026-03-09
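
A RollingWithHoldout layout can be sketched as a sequence of back-to-back evaluation windows followed by a gap and a final holdout window. The parameter values used here (4 evaluation windows of 21 days, a 3-day gap before the holdout, a 21-day holdout) are inferred from the date ranges in the Window Results table, not read from the run configuration, and placing the gap only before the holdout is an assumption. `Window` and `build_windows` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class Window:
    index: int
    role: str
    start: date
    end: date  # inclusive end date


def build_windows(start: date, window_count: int, window_size_days: int,
                  gap_days: int, holdout_size_days: int) -> List[Window]:
    """Build contiguous evaluation windows, then a gap, then one holdout window."""
    windows = []
    cursor = start
    for i in range(1, window_count + 1):
        end = cursor + timedelta(days=window_size_days - 1)
        windows.append(Window(i, "Evaluation", cursor, end))
        cursor = end + timedelta(days=1)  # next window starts the following day
    holdout_start = cursor + timedelta(days=gap_days)
    holdout_end = holdout_start + timedelta(days=holdout_size_days - 1)
    windows.append(Window(window_count + 1, "Holdout", holdout_start, holdout_end))
    return windows


# Assumed parameters, chosen to match the ranges shown in Window Results.
windows = build_windows(date(2025, 11, 10), window_count=4,
                        window_size_days=21, gap_days=3, holdout_size_days=21)
for w in windows:
    print(w.index, w.role, w.start.isoformat(), "→", w.end.isoformat())
```

With these assumed values the generated ranges line up with the five rows of the Window Results table, ending with a holdout window of 2026-02-05 → 2026-02-25.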

Evaluation Summary

Main Set

Baseline Mean

10.53

Challenger Mean

10.61

Delta

0.08

Stability Score

—

Baseline Window Wins

Challenger Window Wins

Holdout Summary

Final Check

Baseline Holdout

9.53

Challenger Holdout

9.32

Holdout Delta

-0.21

Source Experiment

Created

2026-03-09 17:41

Completed

2026-03-09 17:41

Promotion Gates

Core gate outcomes used by the policy engine before guardrails and final governance review.

Policy Engine

Comparable Window Gate

Fail

Games Gate

Fail

Player Rows Gate

Fail

Primary Metric Gate

Fail

Holdout Gate

Fail

Secondary Metric Guardrails

Protected regression checks that prevent narrow gains from degrading adjacent model quality.

Guardrails

Points MAE Guard

Pass

Delta: —

Minutes MAE Guard

Pass

Delta: —

Fair Spread MAE Guard

Pass

Delta: —

Fair Total MAE Guard

Pass

Delta: —
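
A secondary-metric guardrail of this kind can be sketched as a tolerance check on the regression of each protected metric. Everything in the block is an assumption for illustration: the tolerance, the MAE values, and the rule that a missing delta counts as a pass (chosen only because the page shows Pass next to an empty delta). It is not the policy engine's actual logic.

```python
from typing import Optional


def guardrail_passes(baseline_mae: Optional[float],
                     challenger_mae: Optional[float],
                     max_regression: float) -> bool:
    """Pass when the challenger's MAE is not worse than baseline by more than the tolerance."""
    if baseline_mae is None or challenger_mae is None:
        # Assumption: with no measured delta there is nothing to flag.
        return True
    return (challenger_mae - baseline_mae) <= max_regression


# Hypothetical values, not taken from this validation run.
small_regression = guardrail_passes(4.80, 4.85, max_regression=0.10)
large_regression = guardrail_passes(4.80, 5.00, max_regression=0.10)
missing_delta = guardrail_passes(None, None, max_regression=0.10)

print(small_regression, large_regression, missing_delta)  # True False True
```

A challenger that improves the primary metric would still be held back under this sketch if any protected metric regressed beyond its tolerance.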

Final Decision Governance

Human governance decision layer on top of advisor recommendation and immutable validation evidence.

Final Review

Advisor Recommendation

NeedsReview

Final Decision

—

Overridden

Decision By

—

Decision Time

—

Final Decision Reason

—

Governance Action

Record the final decision with explicit reasoning. This creates the last decision layer above the policy engine’s recommendation.
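
One way to model this decision layer is an immutable record that refuses to be created without a reason. The sketch below is hypothetical: the field names, the "Rejected" decision value, the reviewer address, and the rule that a decision counts as overridden when it differs from the advisor recommendation are all assumptions, not the application's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: the record cannot be mutated after creation
class FinalDecision:
    advisor_recommendation: str
    final_decision: str
    decided_by: str
    reason: str
    decided_at: datetime

    @property
    def overridden(self) -> bool:
        """Assumed rule: overridden when the human decision differs from the advisor."""
        return self.final_decision != self.advisor_recommendation


def record_final_decision(advisor_recommendation: str, final_decision: str,
                          decided_by: str, reason: str) -> FinalDecision:
    """Create a decision record, rejecting blank reasoning."""
    if not reason.strip():
        raise ValueError("A final decision requires explicit reasoning.")
    return FinalDecision(advisor_recommendation, final_decision, decided_by,
                         reason.strip(), datetime.now(timezone.utc))


decision = record_final_decision(
    advisor_recommendation="NeedsReview",
    final_decision="Rejected",            # hypothetical decision value
    decided_by="reviewer@example.com",    # hypothetical reviewer
    reason="Only one comparable evaluation window; sample gate failed.",
)
print(decision.overridden)  # True
```

Keeping the record frozen mirrors the page's emphasis on immutable evidence: a changed decision would be a new record rather than an edit to an old one.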

Window Results

Comparable and holdout window outcomes across the full walk-forward validation timeline.

RollingWithHoldout 5 window(s)

Window Reading Guide

Use comparable windows to judge repeated challenger strength, and the holdout window to test whether the improvement survives outside the main evaluation set.

#	Role	Status	Range	Baseline	Challenger	Delta	Winner	Games	Player Rows
1	Evaluation	Insufficient Data	2025-11-10 → 2025-11-30	—	—	—	—	B: 0 C: 0	B: 0 C: 0
2	Evaluation	Insufficient Data	2025-12-01 → 2025-12-21	—	—	—	—	B: 0 C: 0	B: 0 C: 0
3	Evaluation	Insufficient Data	2025-12-22 → 2026-01-11	—	—	—	—	B: 0 C: 0	B: 0 C: 0
4	Evaluation	Insufficient Data	2026-01-12 → 2026-02-01	10.53	10.61	0.08	—	B: 144 C: 144	B: 2875 C: 2875
5	Holdout	Holdout / Insufficient	2026-02-05 → 2026-02-25	9.53	9.32	-0.21	—	B: 115 C: 115	B: 2295 C: 2295
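
The window counts in the decision reason (1 comparable, 3 insufficient) can be reproduced from the Games column with a simple per-window sample rule. The rule below, that a window is comparable when both models have at least MinimumGamesPerWindow games, is an assumption. The table's Status column labels window 4 "Insufficient Data" despite its 144 games, so the application's own status logic may apply further criteria not shown on this page.

```python
MIN_GAMES_PER_WINDOW = 10  # MinimumGamesPerWindow from the decision reason

# (window number, baseline games, challenger games) for the evaluation rows.
evaluation_windows = [
    (1, 0, 0),
    (2, 0, 0),
    (3, 0, 0),
    (4, 144, 144),
]


def is_comparable(baseline_games: int, challenger_games: int,
                  min_games: int = MIN_GAMES_PER_WINDOW) -> bool:
    """Assumed rule: both models need at least min_games games in the window."""
    return min(baseline_games, challenger_games) >= min_games


comparable = [n for n, b, c in evaluation_windows if is_comparable(b, c)]
insufficient = [n for n, b, c in evaluation_windows if not is_comparable(b, c)]

print(comparable)    # [4]
print(insufficient)  # [1, 2, 3]
```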