Experiment: Weight Decay Timing x Gradient Clipping Interaction



Category: Machine Learning

Summary: Testing whether gradient clipping reduces the out-of-distribution benefit of turning on weight decay late in training.


Weight decay and gradient clipping are both used to stabilize neural-network training, but they may overlap in what they accomplish. This experiment asks whether clipping already suppresses the unstable, high-norm updates that late-onset weight decay would otherwise clean up, thereby shrinking the extra generalization benefit of applying weight decay later rather than from the start.
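The two interventions can be written as one update rule. Below is a minimal NumPy sketch of a single SGD step with optional global-norm gradient clipping and decoupled weight decay; the function name and parameters are illustrative, not taken from the experiment's script.

```python
import numpy as np

def sgd_step(params, grads, lr=0.1, clip_norm=None, weight_decay=0.0):
    """One SGD update with optional global-norm clipping and decoupled
    weight decay. Hypothetical helper for illustration only."""
    clipped = False
    if clip_norm is not None:
        # Clip by the norm of the full gradient vector across all tensors.
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > clip_norm:
            grads = [g * (clip_norm / total_norm) for g in grads]
            clipped = True
    # Decoupled weight decay: shrink parameters directly, independent
    # of the (possibly clipped) gradient.
    new_params = [p - lr * g - lr * weight_decay * p
                  for p, g in zip(params, grads)]
    return new_params, clipped
```

The hypothesis is visible here: clipping caps the `lr * g` term before weight decay ever acts, so the high-norm updates that late weight decay would otherwise counteract may already be suppressed.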

The script trains the same model under four conditions: no regularization, late weight decay, clipping only, and clipping plus late weight decay. By comparing in-distribution and out-of-distribution accuracy across repeated trials, it measures whether the late-weight-decay gain becomes smaller once clipping is already active.
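The four conditions form a 2x2 factorial grid, with weight decay switching on partway through training. A sketch of how such a grid and the late-onset schedule could be expressed (the condition names, the 0.5 onset fraction, and the specific hyperparameter values are assumptions, not values from the experiment):

```python
from itertools import product

# Hypothetical 2x2 factorial grid over the two interventions.
CONDITIONS = [
    {"name": f"clip={c}_latewd={w}",
     "clip_norm": 1.0 if c else None,
     "late_wd": 1e-3 if w else 0.0}
    for c, w in product([False, True], repeat=2)
]

def wd_at_step(step, total_steps, late_wd, onset_frac=0.5):
    """Late-onset schedule: weight decay is zero until onset_frac of
    training has elapsed, then late_wd. onset_frac is an assumed value."""
    return late_wd if step >= onset_frac * total_steps else 0.0
```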

That matters because timing effects in regularization are often studied one technique at a time. This design turns the question into an interaction test, asking whether two common training interventions are partially redundant or genuinely complementary.

Method: Repeated NumPy MLP training runs comparing four conditions, with late-onset weight decay and gradient clipping crossed in a factorial design.

What is measured: In-distribution accuracy, out-of-distribution accuracy, generalization gap, confidence, clipping-event counts, and the interaction effect between clipping and late weight decay.
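The interaction effect is naturally a difference-in-differences: the OOD-accuracy gain from late weight decay when clipping is active, minus the same gain when it is not. A sketch of that computation, with illustrative condition keys:

```python
def interaction_effect(ood_acc):
    """Difference-in-differences estimate of the clipping x late-WD
    interaction. ood_acc maps condition name to mean out-of-distribution
    accuracy; the key names are illustrative."""
    gain_without_clip = ood_acc["late_wd"] - ood_acc["none"]
    gain_with_clip = ood_acc["clip_late_wd"] - ood_acc["clip"]
    # Negative values mean clipping shrinks the late-weight-decay benefit.
    return gain_with_clip - gain_without_clip
```

Under the redundancy hypothesis, repeated trials would yield an estimate reliably below zero; a value near zero would indicate the two interventions are complementary rather than overlapping.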


Powered by BOINC
© 2026 Axiom Project