Experiment: Regularized Compositionality: Dropout & Weight Decay vs Width



Category: Machine Learning

Summary: Testing whether training-time regularization can rescue compositional generalization as neural networks get wider.


Earlier Axiom work suggested that wider networks can show worse compositional generalization, possibly because they develop redundant low-rank internal structure. This experiment asks whether training-time regularizers such as dropout and weight decay can counter that width-driven failure more effectively than post-training structural interventions could.

The script trains models across several widths and regularization settings, then compares in-distribution accuracy, out-of-distribution accuracy, compositional gap, and effective-rank diagnostics. The central question is whether dropout or weight decay helps especially strongly at large width, revealing a width-by-regularization interaction rather than a uniform improvement.
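The sweep described above can be sketched as a grid over width and regularization regime. This is a minimal skeleton, not the project's actual script: the specific widths, regularizer strengths, and the `run_cell` stand-in (which returns placeholder numbers instead of training a model) are all illustrative assumptions.

```python
import itertools
import numpy as np

# Hypothetical sweep grid; the actual widths and settings are not given in the text.
WIDTHS = [64, 256, 1024]
REGIMES = {
    "none":     {"p_drop": 0.0, "weight_decay": 0.0},
    "dropout":  {"p_drop": 0.1, "weight_decay": 0.0},
    "decay":    {"p_drop": 0.0, "weight_decay": 1e-4},
    "combined": {"p_drop": 0.1, "weight_decay": 1e-4},
}

def run_cell(width, regime, cfg, seed=0):
    """Stand-in for one training run. A real run would train a model of
    this width under cfg and evaluate it; here we only return placeholder
    metrics so the sweep structure is runnable."""
    rng = np.random.default_rng(seed + width)
    acc_id = rng.uniform(0.9, 1.0)    # placeholder in-distribution accuracy
    acc_ood = rng.uniform(0.4, 0.9)   # placeholder out-of-distribution accuracy
    return {"width": width, "regime": regime, **cfg,
            "acc_id": acc_id, "acc_ood": acc_ood,
            "comp_gap": acc_id - acc_ood}

# One result record per (width, regime) cell of the sweep.
results = [run_cell(w, name, cfg)
           for w, (name, cfg) in itertools.product(WIDTHS, REGIMES.items())]
```

A width-by-regularization interaction would then show up as the best regime changing as `width` grows, rather than one regime winning uniformly.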

That matters because it turns a negative scaling result into a mechanistic test. If regularization rescues wide models specifically where redundancy is strongest, it supports the idea that representational excess, rather than width per se, is what harms compositional generalization.

Method: Controlled NumPy training sweeps over network width and regularization regime, comparing dropout, weight decay, and combined settings on a compositional task.
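To make the training regimes concrete, here is one way dropout and weight decay can be applied in a plain NumPy SGD step for a two-layer ReLU network. This is a hedged sketch of the general technique, not the experiment's actual code; the learning rate, decay form (decoupled), and MSE loss are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, w1, w2, p_drop=0.0, train=True):
    """Two-layer ReLU MLP with inverted dropout on the hidden layer."""
    h = np.maximum(x @ w1, 0.0)
    if train and p_drop > 0.0:
        # Inverted dropout: scale kept units by 1/(1-p) so eval needs no rescaling.
        mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
        h = h * mask
    else:
        mask = np.ones_like(h)
    return h @ w2, h, mask

def sgd_step(x, y, w1, w2, lr=1e-2, p_drop=0.1, wd=1e-4):
    """One SGD step on MSE loss with dropout and decoupled weight decay."""
    out, h, mask = forward(x, w1, w2, p_drop, train=True)
    grad_out = 2.0 * (out - y) / len(x)      # d(MSE)/d(out)
    gw2 = h.T @ grad_out
    gh = (grad_out @ w2.T) * mask * (h > 0)  # backprop through dropout + ReLU
    gw1 = x.T @ gh
    w1 = (1.0 - lr * wd) * w1 - lr * gw1     # decoupled weight decay
    w2 = (1.0 - lr * wd) * w2 - lr * gw2
    return w1, w2
```

The "combined" regime in the sweep would simply use nonzero values for both `p_drop` and `wd`.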

What is measured: In-distribution accuracy, out-of-distribution accuracy, compositional gap, effective-rank measures, rank ratio, best regularizer by width, and dropout-by-width interaction summaries.
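The rank-based diagnostics listed above can be computed from a weight matrix's singular-value spectrum. One standard definition of effective rank is the exponential of the entropy of the normalized singular values; whether the experiment uses exactly this definition is an assumption, as is the `rank_ratio` normalization.

```python
import numpy as np

def effective_rank(w: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank as exp(entropy) of the normalized singular-value
    distribution. Equals the matrix rank when all nonzero singular
    values are equal, and falls toward 1 as the spectrum concentrates."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

def rank_ratio(w: np.ndarray) -> float:
    """Effective rank normalized by the maximum possible rank, so wider
    layers can be compared on a common 0..1 scale."""
    return effective_rank(w) / min(w.shape)

def compositional_gap(acc_id: float, acc_ood: float) -> float:
    """Gap between in-distribution and out-of-distribution accuracy."""
    return acc_id - acc_ood
```

Under the redundancy hypothesis, wide unregularized models would show a low rank ratio alongside a large compositional gap, and effective regularizers would raise the former while shrinking the latter.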


Powered by BOINC
© 2026 Axiom Project