Category: Machine Learning
Summary: Testing whether late-applied weight decay helps generalization more under noisier small-batch training than under lower-noise large-batch training.
Weight-decay timing and minibatch noise both affect optimization, but they may not do so independently. This experiment asks whether the gain from applying weight decay late is larger when training noise is high, as in small-batch settings, than when optimization is already smoother under large batches.
The core design is an interaction test rather than a search for a single best setting. By crossing weight-decay timing with different effective noise levels set by batch size, the run checks whether late regularization matters most in the regimes where stochastic training noise leaves more room for a later corrective effect.
That makes the project mechanistic as well as practical. A strong interaction would suggest that timing-based regularization depends on the optimization regime created by minibatch noise, not only on model architecture.
Method: Factorial training sweeps crossing weight-decay timing schedules with batch-size-controlled noise conditions.
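A factorial sweep of this kind can be sketched as a grid crossing a piecewise weight-decay schedule with batch sizes. The onset fractions, batch sizes, and decay coefficient below are hypothetical placeholders, not values from the source; the source does not specify its schedules.

```python
from itertools import product

# Hypothetical sweep settings (illustrative only): weight-decay onset as a
# fraction of total training steps, crossed with batch sizes that set the
# gradient-noise level (smaller batches give noisier updates).
WD_ONSET_FRACTIONS = [0.0, 0.5, 0.75]  # 0.0 = decay applied from the start
BATCH_SIZES = [32, 256, 2048]

def weight_decay_at(step, total_steps, onset_fraction, wd=1e-4):
    """Piecewise-constant timing schedule: zero decay before the onset
    step, a constant coefficient afterwards."""
    return wd if step >= onset_fraction * total_steps else 0.0

# Full factorial grid: every timing schedule under every noise condition.
grid = list(product(WD_ONSET_FRACTIONS, BATCH_SIZES))
assert len(grid) == len(WD_ONSET_FRACTIONS) * len(BATCH_SIZES)
```

Each grid cell would then be trained identically apart from these two factors, so that any performance difference can be attributed to timing, noise level, or their interaction.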
What is measured: Generalization performance across timing and batch-noise conditions, interaction strength between the two factors, and relative late-weight-decay benefit by noise level.
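One way to quantify the interaction strength is a difference-in-differences: the late-decay benefit under small-batch noise minus the same benefit under large batches. The accuracies below are illustrative placeholder numbers, not experimental results, and the condition labels are assumptions for the sketch.

```python
# Illustrative placeholder accuracies keyed by (timing, batch condition);
# these are not measured results from the experiment.
acc = {
    ("early_wd", "small_batch"): 0.910,
    ("late_wd",  "small_batch"): 0.931,
    ("early_wd", "large_batch"): 0.922,
    ("late_wd",  "large_batch"): 0.926,
}

def late_wd_benefit(acc, batch):
    """Gain from applying weight decay late, at a fixed noise condition."""
    return acc[("late_wd", batch)] - acc[("early_wd", batch)]

def interaction(acc):
    """Difference-in-differences across noise conditions: positive means
    late weight decay helps more in the noisy small-batch regime."""
    return late_wd_benefit(acc, "small_batch") - late_wd_benefit(acc, "large_batch")
```

With these placeholder values the interaction is positive (about 0.017), the pattern a strong timing-by-noise interaction would produce.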
