Experiment: Regularization Timing Universality


Category: Machine Learning

Summary: Testing whether the advantage of late regularization is a general timing principle across multiple regularizer types, not just weight decay.


Recent Axiom results suggested that applying weight decay late in training can outperform applying it early. This experiment asks whether that timing effect is specific to weight decay or reflects a broader principle that training may be more sensitive to how regularization ends than to how it begins.

The study compares no-regularization, always-on, early-only, and late-only schedules for weight decay, dropout, L1 penalties, and Gaussian noise injection. By applying the same schedule logic to each regularizer and repeating it across network widths, it directly tests whether the inverse critical-period pattern (late regularization outperforming early) recurs across qualitatively different interventions.
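The four schedules differ only in when the regularizer is active. A minimal sketch of that gating logic, assuming a midpoint switch and illustrative function and parameter names (the experiment's actual switch point and strengths are not specified here):

```python
# Hypothetical sketch of the four regularization schedules being compared.
# The halfway switch point and base_strength value are assumptions for
# illustration, not taken from the experiment's configuration.
def reg_strength(schedule, step, total_steps, base_strength=1e-4):
    """Return the regularizer strength at a given training step."""
    first_half = step < total_steps // 2
    if schedule == "none":
        return 0.0
    if schedule == "always":
        return base_strength
    if schedule == "early":   # active during the first half only
        return base_strength if first_half else 0.0
    if schedule == "late":    # active during the second half only
        return 0.0 if first_half else base_strength
    raise ValueError(f"unknown schedule: {schedule}")
```

The same gate can wrap any of the four regularizers, which is what makes the cross-regularizer comparison controlled: only the timing varies, not the penalty itself.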

That makes the experiment more than a hyperparameter sweep. If several regularizers show the same timing asymmetry, it would point to a general property of optimization dynamics rather than an isolated quirk of one penalty.

Method: Factorial neural-network training sweeps crossing four regularizer types with no, always, early, and late schedules across widths 32 to 128.
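The factorial design can be enumerated directly as the cross product of the three factors. A sketch under the assumption that "widths 32 to 128" means a small grid of powers of two (the actual width grid is not stated):

```python
from itertools import product

# Illustrative enumeration of the factorial sweep described above.
# The width grid [32, 64, 128] is an assumption consistent with
# "widths 32 to 128"; the exact levels may differ.
REGULARIZERS = ["weight_decay", "dropout", "l1", "gaussian_noise"]
SCHEDULES = ["none", "always", "early", "late"]
WIDTHS = [32, 64, 128]

runs = [
    {"regularizer": r, "schedule": s, "width": w}
    for r, s, w in product(REGULARIZERS, SCHEDULES, WIDTHS)
]
# 4 regularizers x 4 schedules x 3 widths = 48 configurations
```

Enumerating the grid up front keeps every cell of the design explicit, so no regularizer-schedule-width combination is silently skipped.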

What is measured: Generalization gap, relative performance of early versus late regularization, width dependence, and evidence for cross-regularizer timing universality.
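The two headline quantities reduce to simple differences. A minimal sketch, with illustrative function names (the experiment's exact metric definitions are not given here):

```python
# Hypothetical helpers for the measured quantities. Assumes metrics are
# accuracies, so a larger train-test difference means a larger gap.
def generalization_gap(train_acc, test_acc):
    """Train-test accuracy difference; smaller means better generalization."""
    return train_acc - test_acc

def late_advantage(gap_early, gap_late):
    """Positive when the late-only schedule generalizes better than early-only."""
    return gap_early - gap_late
```

Computing `late_advantage` separately for each regularizer and width is one natural way to quantify the timing-universality claim: universality predicts the sign is consistently positive across all of them.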


Powered by BOINC
© 2026 Axiom Project