Category: Machine Learning
Summary: Testing whether a network needs an early free-exploration phase to form compositional features before weight decay can help compress them usefully.
Earlier Axiom results suggested that applying weight decay late can outperform applying it early, implying that regularization timing may matter as much as strength. This experiment asks whether that effect comes from a representation-crystallization event: the network may need an initial period of relatively unconstrained learning to discover a diverse compositional basis before later regularization can sharpen it instead of crushing it.
The model trains multilayer perceptrons on a four-class compositional task under no weight decay, always-on weight decay, early-only weight decay, and late-onset weight decay. During training it tracks effective rank, activation diversity, dead-neuron fraction, and compositional gap to locate when internal representations become organized enough that compression helps rather than harms.
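The four timing conditions can be summarized as a single epoch-indexed coefficient. The sketch below is illustrative only: the onset epoch and decay strength are placeholder values, not the experiment's actual settings.

```python
def weight_decay_at(epoch: int, schedule: str,
                    onset: int = 50, wd: float = 1e-3) -> float:
    """Weight-decay coefficient at a given epoch under the four timing
    conditions. `onset` and `wd` are hypothetical defaults for
    illustration, not the experiment's tuned values."""
    if schedule == "none":
        return 0.0
    if schedule == "always":
        return wd
    if schedule == "early_only":
        # decay active only before the switch epoch
        return wd if epoch < onset else 0.0
    if schedule == "late_onset":
        # decay switched on only after the switch epoch
        return wd if epoch >= onset else 0.0
    raise ValueError(f"unknown schedule: {schedule}")
```

In practice this coefficient would be written into the optimizer's weight-decay parameter at the start of each epoch; the comparison across the four schedules holds everything else fixed.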
That framing turns a schedule-tuning problem into a mechanism test: the experiment asks whether the timing effect reflects a real structural transition in learned representations rather than a coincidental hyperparameter preference.
Method: Controlled MLP training sweeps over weight-decay timing on a compositional classification task, with epoch-by-epoch tracking of rank, diversity, and compositional gap.
What is measured: Effective-rank trajectories, crystallization epoch, compositional gap, activation diversity, dead-neuron fraction, and correlations between crystallization metrics and final performance.
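Of the tracked metrics, effective rank is the least standard; one common definition (Roy and Vetterli, 2007) is the exponential of the Shannon entropy of the normalized singular-value distribution of an activation matrix. A minimal sketch, assuming this definition is the one used here:

```python
import numpy as np

def effective_rank(activations: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (samples x units) activation matrix:
    exp of the entropy of the normalized singular values. Ranges
    from 1 (rank-one activations) up to min(n_samples, n_units)."""
    s = np.linalg.svd(activations, compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values to a distribution
    p = p[p > eps]            # drop zero mass before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))
```

Tracked per epoch, a sharp rise and plateau in this quantity would mark the crystallization event the experiment is looking for.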
