Category: Machine Learning
Summary: Tracking how a small GELU network trained with AdamW on a modular arithmetic task transitions from memorization toward broader algorithmic generalization over very long training runs.
This experiment studies grokking: the delayed transition in which a neural network first memorizes its training examples and only later discovers a more general rule. The code trains a GELU multilayer perceptron on modular arithmetic with only a fraction of the example pairs shown during training, then follows accuracy and loss for hundreds of thousands of epochs to watch for that qualitative change.
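The setup above can be sketched as follows. This is a minimal illustration of the modular-arithmetic task and the fractional train/test split, not the experiment's actual code; the modulus `p=97` and `train_frac=0.3` are placeholder choices.

```python
import numpy as np

def modular_addition_dataset(p=97, train_frac=0.3, seed=0):
    """Enumerate all (a, b, (a + b) mod p) triples and reveal only a
    fraction during training; the held-out remainder measures whether
    the network found the general rule. p and train_frac are
    illustrative, not the experiment's actual hyperparameters."""
    pairs = np.array([(a, b) for a in range(p) for b in range(p)])
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return (pairs[train_idx], labels[train_idx]), (pairs[test_idx], labels[test_idx])

(x_tr, y_tr), (x_te, y_te) = modular_addition_dataset()
```

Because the full table of pairs is finite, the split fraction directly controls how much of the task must be inferred rather than memorized.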
Rather than treating grokking as a final-score benchmark, the experiment records the full training trajectory. That lets Axiom compare early memorization, later generalization, and the role of strong weight decay in pushing the system toward a cleaner algorithmic solution.
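The role of strong weight decay can be made concrete with a single AdamW parameter update. This is a hedged sketch of the standard AdamW rule in NumPy, with placeholder hyperparameter values; the point is the decoupled decay term, which shrinks the weights directly rather than through the gradient.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1.0):
    """One AdamW update on parameters w given gradient g.
    The detail relevant to grokking studies is that weight decay is
    decoupled: lr * weight_decay * w is subtracted from the parameters
    directly, so a large weight_decay steadily pulls the network
    toward low-norm (often more algorithmic) solutions regardless of
    the loss gradient. Hyperparameter values here are placeholders."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

With a zero gradient the update still shrinks the weights by a factor of roughly `1 - lr * weight_decay` per step, which is why weight norms are worth tracking alongside accuracy over long runs.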
The scientific interest is in the time structure of learning itself. Grokking remains difficult to predict from standard optimization intuition, so dense long-run measurements can help clarify when and why networks abruptly reorganize their internal computation.
Method: Long-horizon AdamW training of a GELU multilayer perceptron on modular arithmetic with sparse logging and trajectory analysis over up to 500,000 epochs.
What is measured: Training and test accuracy, cross-entropy loss, weight norms, and long-time learning-curve transition behavior.
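Sparse logging over a 500,000-epoch run can be sketched with a geometric schedule, dense early and sparse late, plus a simple threshold marker for the transition. Both the spacing and the 0.95 accuracy threshold are illustrative assumptions, not values taken from the experiment.

```python
import numpy as np

def sparse_log_epochs(total_epochs=500_000, points_per_decade=20):
    """Geometrically spaced logging epochs: many points per decade of
    training time, so the fast memorization phase and the late
    generalization phase are both resolved without logging every
    epoch. The spacing choice is illustrative."""
    decades = np.log10(total_epochs)
    raw = np.logspace(0, decades, int(decades * points_per_decade))
    return np.unique(raw.astype(int))

def grokking_epoch(epochs, test_acc, threshold=0.95):
    """First logged epoch at which held-out accuracy crosses the
    threshold -- a crude marker for the memorization-to-generalization
    transition in the recorded trajectory."""
    above = np.nonzero(np.asarray(test_acc) >= threshold)[0]
    return int(epochs[above[0]]) if len(above) else None
```

A schedule like this keeps the full-trajectory record cheap while still letting the transition epoch be located after the fact.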
