Experiment: Grokking Weight-Decay Phase Diagram

Category: Machine Learning

Summary: Maps the critical weight-decay range in which grokking appears on a modular-arithmetic task by sweeping weight decay and comparing test accuracy across settings.

Grokking is the delayed transition in which a model first memorizes its training data and only much later generalizes to held-out data. This experiment asks where that transition sits in weight-decay space for a modular-arithmetic task, after earlier Axiom runs suggested that very large weight decay fails while smaller values may induce grokking.
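One way to make "memorizes first, generalizes much later" concrete is to measure the gap between the step at which training accuracy first crosses a threshold and the step at which test accuracy does. The sketch below is illustrative only: the function names and the 0.95 threshold are assumptions, not taken from the experiment script.

```python
def first_crossing(curve, threshold):
    """Return the first index at which curve reaches threshold, or None."""
    for step, value in enumerate(curve):
        if value >= threshold:
            return step
    return None


def grokking_delay(train_acc, test_acc, threshold=0.95):
    """Steps between train and test accuracy first reaching threshold.

    Returns None if either curve never reaches the threshold
    (hypothetical helper; the threshold is an assumed default).
    """
    t_train = first_crossing(train_acc, threshold)
    t_test = first_crossing(test_acc, threshold)
    if t_train is None or t_test is None:
        return None
    return t_test - t_train
```

A large positive delay would be consistent with grokking; a value of None for a run means no generalization was observed within that run's budget.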

The script sweeps weight decay across a targeted range from 0.001 to 0.1 while keeping architecture and optimizer settings fixed. Each value receives an equal share of the overall time budget, turning the run into a one-dimensional phase diagram for how test accuracy changes with regularization strength.
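The sweep described above could be laid out roughly as follows. This is a minimal sketch under stated assumptions: the source gives only the range (0.001 to 0.1) and the equal time split, so the logarithmic spacing, the number of grid points, and the name `make_sweep` are all hypothetical.

```python
import math


def make_sweep(total_budget_s, n_points, wd_min=1e-3, wd_max=1e-1):
    """Return (weight_decay, seconds) pairs with an equal time share per value.

    Log spacing is an assumption; the source states only the endpoints.
    """
    if n_points < 2:
        raise ValueError("need at least two grid points")
    log_lo, log_hi = math.log10(wd_min), math.log10(wd_max)
    step = (log_hi - log_lo) / (n_points - 1)
    share = total_budget_s / n_points  # equal share of the overall budget
    return [(10 ** (log_lo + i * step), share) for i in range(n_points)]
```

Running one training job per pair, with architecture and optimizer settings held fixed, and recording its final test accuracy gives the points of the one-dimensional phase diagram.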

A numerical phase diagram is useful because grokking is often discussed only qualitatively. Here the aim is to locate the boundary numerically, identifying the weight-decay values at which the training dynamics cross from non-grokking behavior into a regime that supports delayed generalization.
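Locating the boundary numerically can be sketched as finding adjacent grid points whose final test accuracy falls on opposite sides of a threshold. The code below is an assumption-laden illustration: `find_boundaries`, the 0.9 threshold, and the geometric-midpoint estimate are hypothetical choices, not the experiment's stated procedure.

```python
import math


def find_boundaries(weight_decays, test_accs, threshold=0.9):
    """Return (wd_below, wd_above, estimate) for each threshold crossing.

    The estimate is the geometric midpoint of the bracketing pair,
    which suits a log-spaced grid (an assumed convention).
    """
    pairs = sorted(zip(weight_decays, test_accs))
    boundaries = []
    for (wd_lo, acc_lo), (wd_hi, acc_hi) in zip(pairs, pairs[1:]):
        # A crossing occurs when exactly one neighbor reaches the threshold.
        if (acc_lo >= threshold) != (acc_hi >= threshold):
            boundaries.append((wd_lo, wd_hi, math.sqrt(wd_lo * wd_hi)))
    return boundaries
```

Because generalization may fail at both very small and very large weight decay, the function returns every crossing rather than a single value. The resolution of each estimate is limited by the grid spacing.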

Method: Matched modular-arithmetic training runs sweeping weight decay over a fixed grid to estimate the grokking phase boundary.

What is measured: Test accuracy versus weight decay, location of the grokking phase boundary, and comparative behavior across regularization strengths.