Experiment: Grokking Dynamics v2



Category: Machine Learning

Summary: Measuring the delayed separation between memorization and generalization in modular arithmetic under AdamW, whose decoupled weight decay is thought to be important for making grokking visible.


Grokking is most striking when a model reaches near-perfect training accuracy long before it learns the true underlying rule. This experiment studies that delayed transition on modular arithmetic, with particular attention to AdamW-style decoupled weight decay, which earlier work suggests is important for making the phenomenon visible.
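The distinction the page draws on is how AdamW applies weight decay: decoupled from the gradient rather than folded into it as an L2 penalty. A minimal sketch of the two update rules, with illustrative names and hyperparameters that are not taken from the experiment's script:

```python
# Contrast between decoupled weight decay (AdamW-style) and L2
# regularization folded into the gradient. grad_step stands for the
# optimizer's (possibly preconditioned) gradient direction.

def decoupled_decay_step(w, grad_step, lr=1e-3, wd=1.0):
    """AdamW-style: shrink the weight directly, outside the gradient."""
    return w - lr * grad_step - lr * wd * w

def l2_decay_step(w, grad, lr=1e-3, wd=1.0):
    """Coupled: add wd * w to the gradient before the update."""
    return w - lr * (grad + wd * w)
```

With plain SGD the two coincide; under Adam's adaptive rescaling the coupled penalty is divided by the per-parameter second-moment estimate while the decoupled term is not, which is why the decay strength behaves differently in practice.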

The script trains a modular-addition classifier for many epochs and logs training and test behavior at regular intervals. It marks when memorization happens, when test accuracy finally surges, and how weight norms evolve across the long plateau between those two events.
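A modular-addition task of this kind is typically built by enumerating every input pair and splitting the pairs into train and test sets. The sketch below is a hypothetical reconstruction of that setup; the modulus, split fraction, and function name are illustrative, not taken from the experiment's code.

```python
import random

def make_modular_addition_split(p=97, train_frac=0.5, seed=0):
    """Enumerate all pairs (a, b) with label (a + b) mod p and split them.

    Returns (train, test), each a list of ((a, b), label) examples.
    Defaults are illustrative, not the experiment's actual settings.
    """
    pairs = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)  # fixed seed for a reproducible split
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]
```

Because the full input space has only p² examples, the model can memorize the training half exactly, which is what makes the later surge in test accuracy a clean signal of generalization.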

That makes the project a direct timing study of delayed generalization. Rather than asking only whether modular arithmetic can be solved, it asks how and when a network reorganizes itself from a memorizing system into one that has discovered a compact rule.

Method: Long-horizon AdamW training on modular-addition data with explicit tracking of memorization epoch, grokking epoch, and weight-norm dynamics.

What is measured: Memorization epoch, grokking epoch, grokking gap, train and test loss, train and test accuracy, weight norms, and final grokking detection status.
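Given logged per-epoch accuracy histories, the timing metrics above reduce to finding threshold crossings. A minimal sketch of how they could be computed; the 0.99 threshold and function names are assumptions, not values from the script:

```python
def first_epoch_at(history, threshold=0.99):
    """Index of the first logged epoch whose accuracy reaches threshold."""
    for epoch, acc in enumerate(history):
        if acc >= threshold:
            return epoch
    return None  # threshold never reached

def grokking_gap(train_acc, test_acc, threshold=0.99):
    """(memorization epoch, grokking epoch, gap); None marks non-detection."""
    mem = first_epoch_at(train_acc, threshold)
    grok = first_epoch_at(test_acc, threshold)
    if mem is None or grok is None:
        return mem, grok, None
    return mem, grok, grok - mem
```

The final grokking-detection status then falls out of the same logic: if the test-accuracy history never crosses the threshold within the training horizon, the gap is undefined and the run is reported as not having grokked.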


Powered by BOINC
© 2026 Axiom Project