AXIOM EXPERIMENT REVIEW — Session 2026-03-01 12:00 UTC

======================================================
KEY SCIENTIFIC FINDINGS
======================================================

1. CATAPULT PHASE — STRONGLY CONFIRMED (Lewkowycz et al. 2020)

77 independent runs across multiple hosts, with diverse random seeds.
- Catapult events observed in 82% of runs (63/77)
- 73 catapult events across 2156 total configs (3.4% of configs)
- Catapult regime reaches 98.8% test accuracy vs 77.2% for monotone training, a 21.6-point improvement
- 100% of runs with catapults generalized BETTER than monotone training (63/63)
- Only 1 of 2156 configs diverged — catapults are recoverable, not destructive
- Catapults occur at high learning rates (threshold ~0.5-1.0) in wider networks
- Interpretation: a large learning rate produces loss spikes that escape sharp minima and land in flatter basins with better generalization. This is strong evidence that the "edge of stability" / "catapult" phenomenon is beneficial, not just a curiosity.

2. SAM VS SGD — CONFIRMED NEGATIVE RESULT (3216 config results, v2 with fixed seeding)

Sharpness-Aware Minimization finds flatter minima but does NOT improve generalization in small MLPs on synthetic tasks.
- SAM produces flatter minima in 80.6% of configurations
- But SGD wins on accuracy in 74.9% of configurations
- Mean test accuracy: SGD 0.926 vs SAM 0.918 (SGD 0.8 points higher)
- Mean sharpness: SGD 8.51 vs SAM 6.10 (SAM 28% flatter)
- Sharpness ratio (SAM/SGD): 0.87 across all configs
- By dataset: Spiral (harder) — SGD 90.6% vs SAM 89.0%; Circles (easier) — tied at 94.7%
- Interpretation: the flat-minima hypothesis (flatter = better generalization) does NOT hold universally. In the small-model regime, SGD's implicit regularization already finds sufficiently good minima; SAM's perturbation-based optimization adds noise without benefit, slightly hurting accuracy while reliably reducing sharpness.
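For reference, the SAM update compared above is, assuming the experiments use the standard formulation (Foret et al.), a two-step rule: ascend to a worst-case point within an L2 ball of radius rho, then descend using the gradient computed there. A minimal numpy sketch on a toy quadratic; the lr and rho values here are illustrative, not the experiments' settings:

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.05, rho=0.05):
    """One Sharpness-Aware Minimization step: ascend to the worst-case
    point within an L2 ball of radius rho, then apply the gradient
    computed there to the original weights."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
    return w - lr * loss_grad(w + eps)           # descend with perturbed gradient

# Toy ill-conditioned quadratic: L(w) = 0.5 * w @ H @ w
H = np.diag([1.0, 25.0])
grad = lambda w: H @ w

w_sgd = np.array([1.0, 1.0])
w_sam = np.array([1.0, 1.0])
for _ in range(200):
    w_sgd = w_sgd - 0.05 * grad(w_sgd)  # plain SGD step
    w_sam = sam_step(w_sam, grad)       # SAM step

# Both head toward the minimum at the origin on this convex toy;
# SAM hovers in a small neighborhood rather than converging exactly.
```

On a convex toy like this the two optimizers behave almost identically; the sharpness effects in finding 2 only arise on non-convex losses, so this sketch illustrates the update rule, not the result.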
This suggests the sharpness-generalization correlation is a consequence, not a cause.

3. RANK DYNAMICS — STRONGLY CONFIRMED (468 config results across multiple seeds)

Rank compression is universal during neural network training.
- 100% of configurations (468/468) showed rank compression
- Mean rank compression ratio: 0.61 (39% reduction in effective rank)
- Strong negative correlation (-0.62) between compression ratio and accuracy gain: a lower ratio (more compression) pairs with a larger accuracy gain
- More rank compression correlates with MORE learning (better accuracy)
- All architectures (shallow to deep, narrow to wide) exhibit compression
- Interpretation: training drives weight matrices toward lower effective rank, concentrating information in fewer singular-value directions. This is consistent with the "information compression" hypothesis for neural networks and relates to implicit regularization — networks learn low-rank representations that generalize.

======================================================
CREDIT AWARDED
======================================================

358 results credited, 1,490 total credit (1,331 applied to host/user tables).
Credit tiers: 15 (heavy compute, >3h), 8 (long, >17min), 4 (medium), 2 (short), 1 (instant/error)

Per-user credit:
  Steve Dodd (uid 56): +647 credit (hosts 87/123/85, 146 results, 102.8 compute hours)
  ChelseaOilman (uid 40): +298 credit (hosts 325/331/337/319, 89 results, 5.9h)
  Coleslaw (uid 122): +162 credit (host 323, 34 results, 11.8h)
  [VENETO] boboviz (uid 79): +44 credit (host 137, 25 results, 5.1h)
  Manuel Stenschke (uid 52): +44 credit (host 86, 13 results, 1.2h)
  [DPC] hansR (uid 5): +35 credit (host 9, 6 results, 6.3h)
  Buckey (uid 66): +34 credit (host 235, 36 results, SSL cert errors)
  dthonon (uid 67): +20 credit (host 249, 3 results, 1.5h)
  zombie67 [MM] (uid 6): +20 credit (host 15, 5 results, 0.7h)
  3C-714 (uid 63): +8 credit (host 95, 2 results, 0.3h)

======================================================
EXPERIMENTS DEPLOYED
======================================================

1,929 workunits deployed across 66 hosts, filling idle cores.
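An aside on finding 3's metric: this report does not state which effective-rank definition rank_dynamics.py uses. A common choice, assumed in this sketch, is the entropy-based effective rank of Roy & Vetterli (2007), with the compression ratio taken as final rank over initial rank; the matrices below are synthetic stand-ins, not experiment data:

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), where p is the
    singular-value spectrum normalized to sum to 1."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

rng = np.random.default_rng(0)
W_init = rng.standard_normal((64, 64))  # dense random init: high effective rank
# Hypothetical "trained" matrix, constructed to have exactly rank 4:
W_trained = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))

ratio = effective_rank(W_trained) / effective_rank(W_init)
# ratio << 1 signals rank compression (cf. the mean ratio of 0.61 above)
```

The toy ratio comes out far below the reported mean of 0.61 only because the "trained" matrix is built to be exactly rank 4; real training compresses more gently.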
Experiment mix (per host, round-robin):
- rank_dynamics.py — rank compression & phase transitions
- catapult_phase.py — high-LR loss spike recovery (Lewkowycz et al.)
- sam_vs_sgd_v2.py — SAM vs SGD sharpness/accuracy comparison
- progressive_sharpening.py — loss landscape sharpening during training
- feature_learning_phase.py — feature-learning phase transitions

Major deployments:
  Host 296 (epyc7v12, 240 cores): 240 WUs
  Host 287 (DESKTOP-N5RAJSE, 192 cores): 192 WUs
  Host 194 (7950x, 128 cores): 128 WUs
  Host 141 (SPEKTRUM, 72 cores): 72 WUs
  Host 269 (JM7, 64 cores): 64 WUs
  + 61 additional hosts with 4-56 WUs each

======================================================
KNOWN ISSUES
======================================================

- Host 235 (alix, Arch Linux): SSL CERTIFICATE_VERIFY_FAILED — all experiments error out
- Host 202 (archlinux): same SSL cert issue
- Some hosts overscheduled from previous sessions (host 212: 138 WUs on 16 cores; host 258: 137 WUs on 28 cores) — in-progress work will drain naturally

======================================================
NEXT INVESTIGATION PRIORITIES
======================================================

1. Cross-validate catapult_phase on fresh hosts (now deployed to 66 hosts)
2. Confirm rank_dynamics universality with more seeds (1,929 new WUs deployed)
3. SAM vs SGD is now conclusive — consider retiring the experiment after this batch
4. Design new experiment: Neural Scaling Laws (power-law test loss vs parameter count) — test Chinchilla-like compute-optimal allocation in small numpy MLPs
5. Monitor incoming progressive_sharpening and feature_learning_phase results
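For priority 4, the core analysis step is a power-law fit, loss ≈ a · N^(−b) in parameter count N, which is linear in log-log space. A sketch of that fitting step on synthetic data; the exponent 0.35 and all constants are made up for the sanity check, not predictions:

```python
import numpy as np

def fit_power_law(params, losses):
    """Least-squares fit of loss = a * params**(-b) in log-log space."""
    slope, intercept = np.polyfit(np.log(params), np.log(losses), 1)
    return np.exp(intercept), -slope  # returns (a, b)

# Synthetic sanity check: recover a known exponent from noisy data
rng = np.random.default_rng(1)
N = np.logspace(2, 5, 12)                                    # parameter counts
L = 3.0 * N**-0.35 * np.exp(rng.normal(0.0, 0.02, N.size))   # noisy power law

a, b = fit_power_law(N, L)
# The fit should recover b close to 0.35 and a close to 3.0
```

Fitting in log space weights all scales equally, which is usually what a scaling-law analysis wants; a direct nonlinear fit in raw loss would be dominated by the smallest models.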