Category: Machine Learning
Summary: Testing whether the previously observed inverse critical-period effect for weight decay still holds in much wider neural networks.
Earlier Axiom results suggested a counterintuitive pattern: applying weight decay later in training can help compositional generalization more than applying it early. This experiment asks whether that inverse critical-period effect survives when model width is pushed well beyond the smaller networks where it was first established.
The script keeps the same compositional task and training setup as the earlier studies, then compares no weight decay, always-on weight decay, early-only weight decay, and late-onset weight decay at widths from 96 to 512. It records how the compositional generalization gap, final weight norms, and representation-rank proxies change with width.
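The four timing conditions can be viewed as schedules mapping a training step to a weight-decay coefficient. A minimal sketch, assuming a single switch point and coefficient value (the function name, `switch_frac`, and default `wd` are illustrative, not taken from the actual script):

```python
def wd_schedule(condition, step, total_steps, wd=1e-3, switch_frac=0.5):
    """Return the weight-decay coefficient applied at this training step.

    condition: 'none', 'always', 'early', or 'late'
    switch_frac: assumed fraction of training at which the early/late
                 regimes switch over (a hypothetical default).
    """
    switch = int(switch_frac * total_steps)
    if condition == "none":
        return 0.0
    if condition == "always":
        return wd
    if condition == "early":
        # Decay on for the first part of training, then off.
        return wd if step < switch else 0.0
    if condition == "late":
        # Decay off at first, switched on partway through training.
        return wd if step >= switch else 0.0
    raise ValueError(f"unknown condition: {condition}")
```

Under this framing, the early-only and late-onset conditions apply the same total decay pressure but at different phases, which is what isolates the timing effect.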
That makes the experiment less about finding a single best hyperparameter and more about stress-testing a proposed mechanism. If the timing effect persists at large width, it supports the claim that the phenomenon is structural rather than a narrow small-model artifact.
Method: Controlled NumPy MLP sweeps over widths 96 to 512, comparing always-on, early, late, and absent weight decay on the same compositional task.
What is measured: Generalization gaps for each timing condition, the relative effectiveness of late versus early weight decay, whether the inverse critical-period pattern holds at each width, final weight norms, and representation-rank summaries.
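Two of these measurements can be sketched directly in NumPy. The helper names are illustrative; the entropy-based effective rank shown here is one common representation-rank proxy and is an assumption about, not a confirmation of, the proxy the script actually uses:

```python
import numpy as np

def generalization_gap(train_acc, test_acc):
    # Gap between in-distribution training accuracy and compositional
    # test accuracy; a larger gap means worse compositional generalization.
    return train_acc - test_acc

def effective_rank(H, eps=1e-12):
    # Entropy-based effective rank of a hidden-activation matrix H
    # (samples x width): exp of the Shannon entropy of the normalized
    # singular-value distribution. Ranges from 1 (rank-one activations)
    # up to min(H.shape) for perfectly spread spectra.
    s = np.linalg.svd(H, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))
```

A collapsed representation (all rows proportional) yields an effective rank near 1, while an isotropic one approaches the full width, so tracking this quantity across widths gives a scale-comparable summary.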
