Experiment: Gradient Noise Scale



Category: Machine Learning

Summary: Measuring the gradient noise scale during training and testing whether it predicts the critical batch size where further batching yields diminishing speed gains.


Stochastic gradient descent is noisy because minibatches only approximate the full gradient, and theory suggests that this noise level should define a critical batch size. This experiment asks whether the measured gradient noise scale actually predicts where larger batches stop buying much additional training speed.
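One common way to make this prediction concrete (assumed here, not stated on this page) is a simple tradeoff model in which the steps S and examples E needed to reach a target loss scale as S/S_min = 1 + B_noise/B and E/E_min = 1 + B/B_noise, so diminishing returns set in once the batch size B approaches the noise scale B_noise:

```python
def relative_steps(batch: float, noise_scale: float) -> float:
    """Steps to target loss, relative to the infinite-batch minimum."""
    return 1 + noise_scale / batch

def relative_examples(batch: float, noise_scale: float) -> float:
    """Examples consumed, relative to the one-sample-at-a-time minimum."""
    return 1 + batch / noise_scale

B_noise = 64  # hypothetical measured noise scale, for illustration only
for b in (8, 64, 512):
    print(f"B={b:>3}: steps x{relative_steps(b, B_noise):.2f}, "
          f"examples x{relative_examples(b, B_noise):.2f}")
```

At B = B_noise both costs are exactly twice their minimum, which is why the noise scale marks the natural "critical" batch size in this model.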

The script computes per-sample gradients to estimate the noise scale as the trace of the gradient covariance divided by the squared norm of the mean gradient, then trains matched models across multiple batch sizes from the same initialization. By comparing loss trajectories, train and test accuracy, and normalized speed gains, it links the measured noise scale to the observed batch-size efficiency.
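The estimator described above can be sketched as follows. The function name and the toy least-squares setup are illustrative assumptions, not the project's actual script:

```python
import numpy as np

def gradient_noise_scale(per_sample_grads: np.ndarray) -> float:
    """Estimate tr(Sigma) / ||g||^2 from per-sample gradients.

    per_sample_grads: shape (n_samples, n_params), one gradient row
    per training example.
    """
    g_mean = per_sample_grads.mean(axis=0)            # mean gradient g
    centered = per_sample_grads - g_mean
    # tr(Sigma): expected squared deviation of a per-sample gradient
    cov_trace = (centered ** 2).sum(axis=1).mean()
    return float(cov_trace / (g_mean @ g_mean))

# Toy example: per-sample gradients of a least-squares loss 0.5*(x.w - y)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=256)
w = np.zeros(10)
per_sample = (X @ w - y)[:, None] * X                 # residual_i * x_i
print(f"estimated noise scale: {gradient_noise_scale(per_sample):.2f}")
```

When every per-sample gradient equals the mean gradient the estimate is zero, so the quantity directly measures how much minibatches disagree with the full gradient.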

That gives the project both theoretical and practical value. If the prediction works, the noise scale becomes a measurable quantity that explains why critical batch sizes arise instead of being a purely empirical tuning rule.

Method: Per-sample gradient computations to estimate noise scale, combined with matched batch-size training sweeps from a common initialization.

What is measured: Gradient noise scale, covariance trace, mean-gradient norm, train and test accuracy, loss trajectories, speed gains, and estimated critical batch-size behavior.
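A minimal sketch of the matched batch-size sweep, using a toy least-squares problem and a steps-to-target-loss criterion rather than the project's actual models and metrics:

```python
import numpy as np

def steps_to_target(batch_size, X, y, w0, lr=0.05, target=0.5,
                    max_steps=5000, seed=0):
    """Run minibatch SGD from a fixed init; count steps to reach target loss."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(y)
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, n, size=batch_size)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w -= lr * grad
        if 0.5 * np.mean((X @ w - y) ** 2) <= target:
            return step
    return max_steps

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 10))
y = X @ rng.normal(size=10)    # noiseless targets, so SGD can interpolate
w0 = np.zeros(10)

# Matched sweep: same data and same initialization, only batch size varies
for b in (1, 4, 16, 64):
    print(f"batch {b:>3}: {steps_to_target(b, X, y, w0)} steps to target loss")
```

Plotting steps against batch size from such a sweep, and normalizing by the small-batch baseline, is what reveals where extra batching stops paying off and lets that point be compared against the measured noise scale.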


Powered by BOINC
© 2026 Axiom Project