Experiment: Optimizer Comparison


Category: Machine Learning

Summary: How SGD, momentum, Nesterov, Adam, and RMSProp differ in convergence speed, gradient behavior, and final generalization on the same classification task.


Optimizers are often compared through a final accuracy table, but a single endpoint number says little about how they shape the path of learning. This experiment puts several standard methods on the same six-class Gaussian-cluster problem to ask which ones reach strong performance fastest, which ones stabilize gradients most effectively, and whether those advantages come with different final weight norms or generalization outcomes.
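To make the "update geometry" concrete, here is a minimal sketch (not the experiment's script) of the five update rules applied to a one-dimensional quadratic loss f(w) = w²/2, whose gradient is simply w. The hyperparameter values are conventional defaults, assumed for illustration only.

```python
import math

def run(optimizer, steps=100, lr=0.1, w0=5.0, mu=0.9,
        beta1=0.9, beta2=0.999, rho=0.9, eps=1e-8):
    """Run `steps` updates of the named optimizer on f(w) = 0.5 * w**2."""
    w, v, m, s = w0, 0.0, 0.0, 0.0  # parameter, velocity, 1st & 2nd moments
    for t in range(1, steps + 1):
        g = w  # gradient of 0.5 * w**2 at the current point
        if optimizer == "sgd":
            w -= lr * g
        elif optimizer == "momentum":
            v = mu * v - lr * g          # heavy-ball velocity update
            w += v
        elif optimizer == "nesterov":
            g = w + mu * v               # gradient at the look-ahead point
            v = mu * v - lr * g
            w += v
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            s = beta2 * s + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)  # bias correction
            s_hat = s / (1 - beta2 ** t)
            w -= lr * m_hat / (math.sqrt(s_hat) + eps)
        elif optimizer == "rmsprop":
            s = rho * s + (1 - rho) * g * g  # running average of squared grads
            w -= lr * g / (math.sqrt(s) + eps)
    return w
```

Even on this toy loss, the trajectories differ in character: plain SGD decays geometrically, momentum and Nesterov spiral in with damped oscillation, and the adaptive methods take near-constant-size steps until they reach the basin.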

The script trains the same multilayer perceptron repeatedly under SGD, momentum SGD, Nesterov, Adam, and RMSProp, holding the architecture and data fixed while recording train and test loss, time to a target accuracy, gradient norms, and weight growth. Multiple runs per optimizer provide basic error bars rather than a single lucky trajectory.

That makes the result a dynamics comparison rather than a leaderboard. The point is to map how update geometry influences the speed and style of learning, not only which method wins by a narrow margin at the endpoint.

Method: Matched repeated MLP training runs on a six-class Gaussian-cluster task across five optimizers, with trajectory logging of loss, gradient norms, and convergence time.
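A six-class Gaussian-cluster task of this shape can be generated as below. The cluster spread, dimensionality, and per-class sample count are assumptions for illustration; the experiment's actual parameters are not stated.

```python
import random

def make_clusters(n_classes=6, n_per_class=100, dim=2, spread=0.5, seed=0):
    """Sample isotropic Gaussian clusters, one per class, at random centers."""
    rng = random.Random(seed)
    centers = [[rng.uniform(-3.0, 3.0) for _ in range(dim)]
               for _ in range(n_classes)]
    X, y = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_class):
            X.append([c + rng.gauss(0.0, spread) for c in center])
            y.append(label)
    return X, y
```

Fixing the seed keeps the dataset identical across all five optimizers, which is what "holding the data fixed" requires for a fair dynamics comparison.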

What is measured: Final train and test loss, epochs to target test accuracy, convergence rate, weight norms, recent gradient norms, and optimizer ranking by mean performance.
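The final ranking step can be as simple as sorting per-optimizer means of a metric across repeated runs. The sketch below shows one plausible shape for this aggregation; the optimizer names and metric values are placeholders, not results from the experiment.

```python
import statistics

def rank_by_mean(results):
    """results maps optimizer name -> list of per-run metrics (lower is better).
    Returns (name, mean, stdev) tuples sorted best-first; the stdev serves
    as a basic error bar on each mean."""
    summary = [(name, statistics.mean(runs), statistics.stdev(runs))
               for name, runs in results.items()]
    return sorted(summary, key=lambda item: item[1])

# Placeholder per-run final test losses for two hypothetical optimizers.
ranking = rank_by_mean({"opt_a": [0.42, 0.40, 0.41],
                        "opt_b": [0.31, 0.30, 0.33]})
```

Reporting the standard deviation alongside the mean is what distinguishes a robust ranking from a single lucky trajectory, as the repeated-runs design intends.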


© 2026 Axiom Project