Experiment: Weight Decay and Learning Rate Interaction

Category: Machine Learning

Summary: Testing whether the best time to turn on weight decay depends on the learning-rate schedule rather than being a fixed property of the task.

If weight decay helps only after a representation has begun to crystallize, then changing the pace of learning should shift the best regularization onset time. This experiment asks exactly that: whether warmup schedules delay the helpful window for weight decay, and whether cosine decay makes the useful onset arrive earlier.

The script crosses several learning-rate schedules with several weight-decay onset epochs on the same compositional task, then measures generalization gaps and timing-dependent gains. It is looking for a genuine interaction effect, not just independent main effects of schedule and regularization.
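The crossed design described above can be sketched as follows. This is a minimal illustration, not the experiment's actual script: the schedule names, the onset values, the base learning rate, the warmup length, and the weight-decay coefficient are all assumed placeholders, and the function names `lr_at`, `wd_at`, and `build_grid` are hypothetical.

```python
import math
from itertools import product

# Hypothetical factor levels; the source does not list the real ones.
SCHEDULES = ["constant", "warmup", "cosine"]
ONSETS = [0, 20, 40, 60]


def lr_at(schedule, epoch, total_epochs, base_lr=1e-3, warmup_epochs=10):
    """Learning rate at `epoch` for a named schedule (assumed shapes)."""
    if schedule == "constant":
        return base_lr
    if schedule == "warmup":
        # Linear ramp up to base_lr over warmup_epochs, then constant.
        return base_lr * min(1.0, (epoch + 1) / warmup_epochs)
    if schedule == "cosine":
        # Half-cosine decay from base_lr toward zero over the run.
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
    raise ValueError(f"unknown schedule: {schedule}")


def wd_at(epoch, onset_epoch, weight_decay=1e-2):
    """Weight decay is zero before onset_epoch and constant from it onward."""
    return weight_decay if epoch >= onset_epoch else 0.0


def build_grid(schedules=SCHEDULES, onsets=ONSETS):
    """Every (schedule, onset) pair: one training run per cell."""
    return list(product(schedules, onsets))
```

In a training loop, each cell of `build_grid()` would set the optimizer's learning rate to `lr_at(...)` and its weight decay to `wd_at(...)` at the start of every epoch, so that the two factors vary independently across runs.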

The result is a two-dimensional timing map, with schedule shape on one axis and weight-decay onset epoch on the other. The broader significance is that it tests whether regularization timing should be understood relative to the network’s internal learning clock rather than the raw epoch number alone.

Method: Factorial MLP sweeps crossing learning-rate schedule shape with weight-decay onset epoch on a compositional classification task.

What is measured: Generalization gap by schedule and onset, best onset per learning-rate schedule, interaction magnitude, warmup versus cosine shift in optimal onset, and supporting transition diagnostics.
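A few of these quantities can be computed from a table of generalization gaps as sketched below. The source does not define "interaction magnitude", so this sketch assumes one common choice: the root-mean-square of the two-way residuals left after subtracting schedule and onset main effects. The input layout (`gaps[schedule][onset]`) and all function names are hypothetical.

```python
def best_onset_per_schedule(gaps):
    """gaps: {schedule: {onset_epoch: gap}}; a smaller gap counts as better."""
    return {s: min(by_onset, key=by_onset.get) for s, by_onset in gaps.items()}


def interaction_residuals(gaps):
    """Residual gap[s][o] - row_mean[s] - col_mean[o] + grand_mean per cell."""
    schedules = list(gaps)
    onsets = list(next(iter(gaps.values())))
    n_cells = len(schedules) * len(onsets)
    grand = sum(gaps[s][o] for s in schedules for o in onsets) / n_cells
    row = {s: sum(gaps[s][o] for o in onsets) / len(onsets) for s in schedules}
    col = {o: sum(gaps[s][o] for s in schedules) / len(schedules) for o in onsets}
    return {
        (s, o): gaps[s][o] - row[s] - col[o] + grand
        for s in schedules
        for o in onsets
    }


def interaction_magnitude(gaps):
    """RMS of the residuals; zero when the two factors are purely additive."""
    res = interaction_residuals(gaps)
    return (sum(r * r for r in res.values()) / len(res)) ** 0.5


def onset_shift(gaps, schedule_a="warmup", schedule_b="cosine"):
    """Difference in best onset epoch between two schedules (a minus b)."""
    best = best_onset_per_schedule(gaps)
    return best[schedule_a] - best[schedule_b]
```

Under this definition, a positive `onset_shift` for warmup versus cosine would be consistent with the hypothesis that warmup delays the helpful window, while an interaction magnitude near zero would indicate that schedule and onset act independently.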