Category: Machine Learning
Summary: Mapping how both the start time and the duration of a weight-decay window affect compositional generalization.
Weight decay is usually treated as either always on or always off, but that ignores a natural timing question: how long does it actually need to be applied, and when should that window begin? This experiment asks whether short, well-timed windows of weight decay can recover most of the benefit of continuous regularization.
The script builds a two-dimensional map over weight-decay start epoch and window duration, compares each window against no-weight-decay and always-on baselines, and repeats the analysis at several network widths. It specifically tests whether a short, late window can perform nearly as well as full training-time regularization.
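The core mechanic is a weight-decay term that is active only inside a window of epochs. A minimal NumPy sketch of that idea, using a hypothetical linear model and illustrative names (`train`, `wd_start`, `wd_duration` are assumptions, not the script's actual interface):

```python
import numpy as np

def train(X, y, wd_start, wd_duration, epochs=100, lr=0.1, wd=1e-1, seed=0):
    """Train a toy linear model with SGD; weight decay is applied only
    for epochs in [wd_start, wd_start + wd_duration)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for epoch in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)            # MSE gradient
        if wd_start <= epoch < wd_start + wd_duration:
            grad = grad + wd * w                     # decay active only inside the window
        w -= lr * grad
    return w

# Toy data: the target depends only on the first feature, so the
# remaining features invite overfitting that decay can suppress.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))
y = X[:, 0].copy()

w_windowed = train(X, y, wd_start=50, wd_duration=25)   # short late window
w_always   = train(X, y, wd_start=0,  wd_duration=100)  # always-on baseline
w_never    = train(X, y, wd_start=0,  wd_duration=0)    # no-decay baseline
```

The two baseline calls correspond to the comparison schedules the script uses; the windowed call is one cell of the start-by-duration grid.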
That is useful because it turns a coarse regularization choice into a schedule-design problem. The resulting landscape can reveal whether timing matters more than total exposure, and whether wider networks need different window lengths than narrower ones.
Method: Systematic NumPy training sweeps over weight-decay start times and durations, compared against baseline schedules on a compositional generalization task.
What is measured: Generalization gaps across the start-duration grid, best-performing window by width, short-late-window effectiveness, optimal duration at late starts, diminishing-returns patterns, and prediction-support verdicts.
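The grid analysis itself can be sketched in a few lines: evaluate a generalization gap at each (start, duration) cell and locate the best window. Everything here is illustrative; `gen_gap`, the toy model, and the grid resolution are assumptions standing in for the script's actual task and sweep:

```python
import numpy as np

def gen_gap(wd_start, wd_duration, epochs=80, lr=0.1, wd=1e-2):
    """Toy stand-in for one sweep cell: train with a windowed weight-decay
    schedule, return test MSE minus train MSE (the generalization gap)."""
    rng = np.random.default_rng(0)
    Xtr = rng.normal(size=(48, 8)); ytr = Xtr[:, 0] + 0.3 * rng.normal(size=48)
    Xte = rng.normal(size=(200, 8)); yte = Xte[:, 0]
    w = np.zeros(8)
    for ep in range(epochs):
        g = Xtr.T @ (Xtr @ w - ytr) / len(ytr)
        if wd_start <= ep < wd_start + wd_duration:
            g = g + wd * w
        w -= lr * g
    mse = lambda X, y: float(np.mean((X @ w - y) ** 2))
    return mse(Xte, yte) - mse(Xtr, ytr)

# The two-dimensional start-by-duration map, plus the best-window lookup.
starts    = np.arange(0, 80, 20)       # when the decay window opens
durations = np.arange(0, 81, 20)       # how long it stays open (0 = no decay)
gaps = np.array([[gen_gap(s, d) for d in durations] for s in starts])
i, j = np.unravel_index(np.argmin(gaps), gaps.shape)
best_window = (int(starts[i]), int(durations[j]))
```

Fixing `i` (a late start row) and scanning `j` gives the optimal-duration-at-late-starts and diminishing-returns readouts described above; repeating the whole map at each width gives the per-width best window.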
