Category: Machine Learning
Summary: Testing whether the previously observed inverse critical-period effect for weight decay depends on the optimizer rather than appearing uniformly across SGD and Adam variants.
Earlier Axiom work suggested that turning weight decay on late in training can outperform early application on a compositional learning task. This experiment asks whether that effect is a generic property of regularization timing or whether it depends on the optimizer used to train the network.
The script trains the same three-layer ReLU classifier under four optimizers: plain SGD, SGD with momentum, Adam with coupled L2 regularization, and AdamW with decoupled weight decay. For each optimizer it compares four timing schedules (weight decay never, early only, late only, or always on), then tracks test accuracy, the compositionality gap relative to a linear baseline, and the effective rank of the learned weight matrices over training.
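The four timing schedules can be sketched as a simple epoch-dependent coefficient. This is a minimal illustration, not the experiment's actual code: the switch point, total epoch count, and decay strength are assumed values, and the condition names are hypothetical.

```python
def weight_decay_at(epoch: int, condition: str,
                    total_epochs: int = 100,
                    switch_frac: float = 0.5,
                    wd: float = 1e-2) -> float:
    """Return the weight-decay coefficient in force at a given epoch.

    The switch point (here, halfway through training) and the decay
    strength are illustrative assumptions.
    """
    switch = int(total_epochs * switch_frac)
    if condition == "never":
        return 0.0
    if condition == "always":
        return wd
    if condition == "early":   # on for the first part of training, then off
        return wd if epoch < switch else 0.0
    if condition == "late":    # off at first, switched on late in training
        return wd if epoch >= switch else 0.0
    raise ValueError(f"unknown condition: {condition}")
```

Each (optimizer, condition) cell of the sweep would then call a schedule like this once per epoch and pass the result to the optimizer's weight-decay parameter.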
This design makes the experiment mechanistic rather than purely practical. If the timing effect survives under only some optimizers, the result would tie the phenomenon to update geometry and to the distinction between coupled and decoupled regularization, rather than to weight decay per se.
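The coupled/decoupled distinction is easiest to see in a single-parameter Adam step. Below is a minimal sketch, not the experiment's implementation: with coupled L2 the decay term enters the gradient and is rescaled by Adam's adaptive denominator, while decoupled (AdamW-style) decay shrinks the weights directly. When the decay coefficient is zero, the two updates coincide.

```python
import math

def adam_l2_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
                 eps=1e-8, wd=0.01):
    """One Adam step with coupled L2: decay is folded into the gradient
    and therefore rescaled by the adaptive denominator."""
    g = grad + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)          # bias-corrected first moment
    vhat = v / (1 - b2 ** t)          # bias-corrected second moment
    w = w - lr * mhat / (math.sqrt(vhat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step with decoupled decay: the moments see only the raw
    gradient, and decay acts on the weights outside the adaptive path."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    w = w - lr * mhat / (math.sqrt(vhat) + eps) - lr * wd * w
    return w, m, v
```

For plain SGD the two forms are algebraically identical, which is why the coupled/decoupled distinction only matters once adaptive preconditioning enters the update.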
Method: Controlled MLP training sweeps over optimizer choice and weight-decay timing, with rank tracking and compositionality-gap analysis.
What is measured: Final test accuracy, compositionality gap, linear-baseline accuracy, effective rank, rank history, and test-accuracy trajectories.
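The entry does not state which definition of effective rank is used; a common choice is the entropy-based measure of Roy and Vetterli (2007), the exponential of the entropy of the normalized singular values. A sketch under that assumption:

```python
import numpy as np

def effective_rank(W: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp(H(p)), where p_i is the i-th
    singular value normalized to sum to one. Equals the matrix dimension
    for an identity matrix and approaches 1 for a rank-1 matrix."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    h = -np.sum(p * np.log(p + eps))
    return float(np.exp(h))
```

Tracking this quantity per layer over training gives the "rank history" listed above, since it varies smoothly as regularization reshapes the singular value spectrum, unlike the integer numerical rank.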
