Category: Machine Learning
Summary: Mapping how both the start time and the duration of a weight-decay window affect compositional generalization.
Weight decay is usually treated as either always on or always off, but that ignores a natural timing question: how long does it actually need to be applied, and when should that window begin? This experiment asks whether short, well-timed windows of weight decay can recover most of the benefit of continuous regularization.
The script builds a two-dimensional map over weight-decay start epoch and window duration, compares each window against no-weight-decay and always-on baselines, and repeats the analysis at several network widths. It specifically tests whether a short, late window can perform nearly as well as full training-time regularization.
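The core mechanic is a weight-decay term that is active only inside a window of epochs. A minimal NumPy sketch of that idea, using a hypothetical linear model and illustrative names (`train`, `wd_start`, `wd_duration` are assumptions, not the script's actual interface):

```python
import numpy as np

def train(X, y, wd_start, wd_duration, epochs=100, lr=0.1, wd=1e-1, seed=0):
    """Train a toy linear model with SGD; weight decay is applied only
    for epochs in [wd_start, wd_start + wd_duration)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for epoch in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)            # MSE gradient
        if wd_start <= epoch < wd_start + wd_duration:
            grad = grad + wd * w                     # decay active only inside the window
        w -= lr * grad
    return w

# Toy data: the target depends only on the first feature, so the
# remaining features invite overfitting that decay can suppress.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))
y = X[:, 0].copy()

w_windowed = train(X, y, wd_start=50, wd_duration=25)   # short late window
w_always   = train(X, y, wd_start=0,  wd_duration=100)  # always-on baseline
w_never    = train(X, y, wd_start=0,  wd_duration=0)    # no-decay baseline
```

The two baseline calls correspond to the comparison schedules the script uses; the windowed call is one cell of the start-by-duration grid.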
That is useful because it turns a coarse regularization choice into a schedule-design problem. The resulting landscape can reveal whether timing matters more than total exposure, and whether wider networks need different window lengths than narrower ones.
Method: Systematic NumPy training sweeps over weight-decay start times and durations, compared against baseline schedules on a compositional generalization task.
What is measured: Generalization gaps across the start-duration grid, best-performing window by width, short-late-window effectiveness, optimal duration at late starts, diminishing-returns patterns, and prediction-support verdicts.
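The grid analysis itself can be sketched in a few lines: evaluate a generalization gap at each (start, duration) cell and locate the best window. Everything here is illustrative; `gen_gap`, the toy model, and the grid resolution are assumptions standing in for the script's actual task and sweep:

```python
import numpy as np

def gen_gap(wd_start, wd_duration, epochs=80, lr=0.1, wd=1e-2):
    """Toy stand-in for one sweep cell: train with a windowed weight-decay
    schedule, return test MSE minus train MSE (the generalization gap)."""
    rng = np.random.default_rng(0)
    Xtr = rng.normal(size=(48, 8)); ytr = Xtr[:, 0] + 0.3 * rng.normal(size=48)
    Xte = rng.normal(size=(200, 8)); yte = Xte[:, 0]
    w = np.zeros(8)
    for ep in range(epochs):
        g = Xtr.T @ (Xtr @ w - ytr) / len(ytr)
        if wd_start <= ep < wd_start + wd_duration:
            g = g + wd * w
        w -= lr * g
    mse = lambda X, y: float(np.mean((X @ w - y) ** 2))
    return mse(Xte, yte) - mse(Xtr, ytr)

# The two-dimensional start-by-duration map, plus the best-window lookup.
starts    = np.arange(0, 80, 20)       # when the decay window opens
durations = np.arange(0, 81, 20)       # how long it stays open (0 = no decay)
gaps = np.array([[gen_gap(s, d) for d in durations] for s in starts])
i, j = np.unravel_index(np.argmin(gaps), gaps.shape)
best_window = (int(starts[i]), int(durations[j]))
```

Fixing `i` (a late start row) and scanning `j` gives the optimal-duration-at-late-starts and diminishing-returns readouts described above; repeating the whole map at each width gives the per-width best window.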
