Category: Machine Learning
Summary: Testing whether the reported inverse critical-period effect for weight decay is specific to one compositional task or appears across several distinct learning problems.
A previous Axiom result suggested that narrow networks can benefit more from late weight decay than from early regularization, an inverse critical-period pattern. This experiment asks whether that effect is a general property of training dynamics or just a feature of one benchmark task.
The script repeats the same weight-decay schedules and hyperparameters across four structurally different tasks: compositional classification, teacher-student learning, hidden-manifold data, and noisy spirals. Holding the intervention fixed while varying the task isolates whether the timing effect travels with the optimizer setup or depends on the geometry of the data.
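The sweep design can be sketched as a schedule function plus a task-by-schedule loop. This is a minimal illustration, not the experiment's actual code: the task and schedule names, decay coefficient, and switch point are assumptions chosen for clarity.

```python
# Hypothetical sketch of the matched-hyperparameter sweep.
# Coefficient (1e-3) and switch point (half of training) are illustrative.

TASKS = ["compositional", "teacher_student", "hidden_manifold", "noisy_spirals"]
SCHEDULES = ["late", "early", "always", "none"]

def weight_decay_at(epoch, schedule, total_epochs=100, wd=1e-3, switch=0.5):
    """Weight-decay coefficient applied at a given epoch.

    'late'  : decay only after the switch point
    'early' : decay only before the switch point
    'always': decay for the whole run
    'none'  : no decay
    """
    cut = int(total_epochs * switch)
    if schedule == "always":
        return wd
    if schedule == "none":
        return 0.0
    if schedule == "late":
        return wd if epoch >= cut else 0.0
    if schedule == "early":
        return wd if epoch < cut else 0.0
    raise ValueError(f"unknown schedule: {schedule}")

for task in TASKS:
    for sched in SCHEDULES:
        # A training run for (task, sched) would go here; weight_decay_at
        # supplies the per-epoch coefficient under otherwise identical
        # hyperparameters and the same fixed narrow width.
        pass
```

Keeping everything except the (task, schedule) pair fixed is what lets any observed difference be attributed to task geometry rather than the optimizer setup.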
That distinction matters for publication claims and for theory. A task-general effect would suggest a broader property of narrow-network optimization, whereas a task-specific result would sharply constrain the scope of the original finding.
Method: Matched-hyperparameter training sweeps across four tasks, comparing late, early, always-on, and absent weight decay at fixed narrow width.
What is measured: Generalization gaps by task and schedule, replication of the inverse critical-period effect, task-to-task variation in timing sensitivity, and support for cross-task generality.
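The per-task readout described above can be sketched as a small analysis step. The gap values below are made-up placeholders, not results; the sketch only shows how replication and cross-task generality would be scored from measured generalization gaps.

```python
# Hypothetical analysis sketch: gap numbers are illustrative placeholders.
gaps = {
    ("compositional",   "late"): 0.08, ("compositional",   "early"): 0.14,
    ("teacher_student", "late"): 0.05, ("teacher_student", "early"): 0.06,
    ("hidden_manifold", "late"): 0.11, ("hidden_manifold", "early"): 0.09,
    ("noisy_spirals",   "late"): 0.04, ("noisy_spirals",   "early"): 0.10,
}

def inverse_effect(task, gaps):
    """True when late decay yields a smaller generalization gap than early."""
    return gaps[(task, "late")] < gaps[(task, "early")]

tasks = ["compositional", "teacher_student", "hidden_manifold", "noisy_spirals"]
replicated = [t for t in tasks if inverse_effect(t, gaps)]
# Cross-task generality requires the effect on every task, not just some.
general = len(replicated) == len(tasks)
```

Task-to-task variation in timing sensitivity would come from the spread of the late-minus-early gap differences across tasks, not just the boolean verdict.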
