============================================================
AXIOM EXPERIMENT RESULTS — March 1, 2026 3:45 AM
============================================================

PREVIOUSLY RECORDED RESULT IDs (do not re-record these):
  1509027, 1509028, 1509029, 1509030, 1509031, 1509034, 1509035, 1509036,
  1509037, 1509039, 1509040, 1509041, 1509042, 1509044, 1509045, 1509046,
  1509048, 1509049, 1509050, 1509051, 1509052, 1509053, 1509054, 1509055,
  1509056, 1509067, 1509068, 1509081

CREDITED RESULT IDs (do not re-credit these):
  1509034 (10cr ChelseaOilman), 1509035 (15cr philip-in-hongkong),
  1509036 (15cr philip-in-hongkong), 1509037 (75cr Coleslaw),
  1509039 (50cr makracz), 1509040 (15cr makracz), 1509041 (30cr makracz),
  1509042 (10cr makracz), 1509044 (25cr zioriga), 1509045 (75cr ChelseaOilman),
  1509046 (5cr Vato), 1509048 (5cr Drago75), 1509049 (5cr Coleslaw),
  1509050 (5cr Steve Dodd), 1509051 (15cr Drago75), 1509052 (25cr Steve Dodd),
  1509053 (30cr Steve Dodd), 1509054 (15cr Vato), 1509055 (15cr makracz),
  1509056 (50cr makracz), 1509067 (50cr Steve Dodd), 1509068 (25cr Steve Dodd),
  1509081 (30cr Steve Dodd)

SUMMARY
-------
New results this session: 6
Total completed (all time): 27 successful, 1 GPU failure
Total in-progress: 35
Total workunits deployed: ~1,800 (mass deployment to fill all cores)
Credit awarded this session: 200

NEW RESULTS (ranked by scientific interest)
-------------------------------------------
1. [1509056] POWER LAW FORGETTING v2
   Host: SPEKTRUM (72 CPUs, 191GB, Windows) | User: makracz
   Runtime: 13.8s | Credit: 50 (Excellent — v2 redesign worked perfectly)
   Findings:
   - MASSIVE IMPROVEMENT over v1 (which showed zero forgetting)
   - Task similarity: -0.071 (nearly orthogonal — good experimental design)
   - Naive SGD catastrophic forgetting: 64-66% of task A accuracy lost
       bottleneck=20:  72.0% -> 7.7% (forgot 64.3%)
       bottleneck=50:  73.3% -> 7.0% (forgot 66.3%)
       bottleneck=100: 73.3% -> 8.0% (forgot 65.3%)
   - EWC (Elastic Weight Consolidation) WORKS:
       bottleneck=20:  forgot only 32.7% (vs 64.3% naive) — task A preserved at 39.3%
       bottleneck=50:  forgot 47.0% (vs 66.3%)
       bottleneck=100: forgot 59.0% (vs 65.3%)
   - Classic EWC tradeoff: preserving task A costs task B performance
       (EWC task B: 19.3% vs 78.7% naive at bottleneck=20)
   - Wider bottleneck = EWC less effective (more parameters to constrain)
   - A textbook demonstration of catastrophic forgetting and EWC
   Quality: Excellent — exactly what the v2 redesign aimed for

2. [1509067] EDGE OF CHAOS v2 (cross-validation)
   Host: DadOld-PC (80 CPUs, 128GB, Windows) | User: Steve Dodd
   Runtime: 114.7s | Credit: 50 (Excellent — perfect replication on different hardware)
   Findings:
   - PERFECT CROSS-VALIDATION of the h320 result:
       Zero crossing at radius 1.269 (h320 also found 1.269 — IDENTICAL)
       Peak memory capacity 34.76 at radius 1.0 (h320 also found 34.76 at 1.0 — IDENTICAL)
       Lyapunov range: -2.303 to +0.287 (matches h320)
   - Clean monotonic transition confirmed on an 80-CPU machine vs the 20-CPU original
   - 30 radii tested, 5 trials each
   - EDGE OF CHAOS IS NOW CONFIRMED across two independent hosts
   Quality: Excellent — textbook replication
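The EWC protection measured above can be illustrated with a minimal sketch (toy numbers and a hypothetical `ewc_penalty` helper, not the workunit's actual code): while training on task B, the loss gains a quadratic term that anchors each parameter to its post-task-A value, weighted by a diagonal Fisher estimate of how much task A depends on that parameter.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1000.0):
    """EWC regularizer: lam/2 * sum_i F_i * (theta_i - theta*_i)^2.

    fisher is a diagonal Fisher-information estimate from task A, so
    parameters important to task A are pulled back toward theta_star
    while unimportant ones stay free to learn task B.
    """
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

# Toy numbers: parameter 0 matters to task A (large Fisher), parameter 1 does not.
theta_star = np.array([1.0, -0.5])   # weights after training on task A
fisher     = np.array([5.0,  0.01])  # diagonal Fisher estimate on task A data
theta      = np.array([1.2,  2.0])   # weights drifting while learning task B

print(ewc_penalty(theta, theta_star, fisher, lam=10.0))  # -> 1.3125
```

Note that the large drift in the unimportant parameter (-0.5 to 2.0) is penalized less than the small drift in the important one; that asymmetry is the task A / task B tradeoff seen at bottleneck=20.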
3. [1509081] GRADIENT NOISE SCALE
   Host: DadOld-PC (80 CPUs, 128GB, Windows) | User: Steve Dodd
   Runtime: 52.0s | Credit: 30 (Good — novel measurement confirms theory)
   Findings:
   - B_noise at init: 7.79 → predicts critical batch size ~8
   - B_noise increases 14x during training: 6.97 → 98.79
   - Gradients become noisier near convergence (expected — approaching a minimum)
   - Efficiency curve confirms the prediction:
       batch=1: efficiency=1.00 | batch=4: 0.25 | batch=16: 0.06 | batch=64: 0.015
   - Steep dropoff after batch=1 — for this small network, even batch=4 is past the elbow
   - Interpretation: B_noise is very small for small networks (few parameters = low gradient variance)
   - As training progresses, gradients decorrelate → B_noise grows → larger batches become viable
   Quality: Good — first empirical B_noise measurement on our platform; confirms McCandlish et al.

4. [1509053] CRITICAL LEARNING PERIODS
   Host: Dad-Workstation (80 CPUs, 128GB, Windows) | User: Steve Dodd
   Runtime: 798.1s (13 minutes — longest experiment yet)
   Credit: 30 (Fair — 111 sub-experiments, but the base model doesn't generalize)
   Findings:
   - Architecture [25, 128, 64, 32, 8], 120 epochs, 5 deficit types
   - PROBLEM: the control model reaches 100% train accuracy but only 12.3% test (random = 12.5% for 8 classes)
   - COUNTERINTUITIVE result: longer deficits IMPROVE test accuracy
       5-epoch deficit: 12.5% test → 60-epoch deficit: 31.3% test
   - This is a regularization effect — disrupting training prevents overfitting
   - Deficit types vary: shuffled_labels is worst in early epochs (epoch 11), random_labels worst at epoch 21
   - gaussian_noise is least harmful (acts as data augmentation)
   - The critical-period signal exists but is INVERTED — deficits help because the model is already broken
   - We need a base model that actually generalizes before true critical periods can show up
   Quality: Fair — massive compute (111 sub-experiments) and an interesting regularization finding, but the experiment answers a different question than intended
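For reference, the simple gradient-noise-scale estimator from McCandlish et al. ("An Empirical Model of Large-Batch Training") can be sketched as follows. This is a toy numpy illustration under an i.i.d. per-example-gradient assumption; the workunit's estimator may differ in detail.

```python
import numpy as np

def noise_scale_simple(per_example_grads):
    """B_simple = tr(Sigma) / |g|^2: total per-example gradient variance
    divided by the squared norm of the mean gradient. Batches much larger
    than B_simple mostly average away noise that is already negligible."""
    g = per_example_grads.mean(axis=0)            # mean gradient
    var = per_example_grads.var(axis=0, ddof=1)   # diagonal of the covariance
    return float(var.sum() / (g @ g))

rng = np.random.default_rng(0)
# Hypothetical per-example gradients: a shared signal plus small i.i.d. noise,
# mimicking early training where per-example gradients largely agree.
grads = np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal((256, 3))
b_noise = noise_scale_simple(grads)
print(b_noise)  # well below 1: even tiny batches capture the shared signal
```

This matches the report's interpretation: when the shared signal dominates per-example variance, B_noise is small and returns diminish quickly with batch size; as gradients decorrelate late in training, the variance term grows and B_noise with it.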
5. [1509068] EIGENSPECTRUM DYNAMICS
   Host: DadOld-PC (80 CPUs, 128GB, Windows) | User: Steve Dodd
   Runtime: 43.6s | Credit: 25 (Good — novel random-matrix-theory analysis)
   Findings:
   - Model trains from 11.8% to 100% test accuracy — FULL CONVERGENCE
   - Weight-matrix spectral analysis across 4 layers, 21 snapshots:
       W0 (20x128, input): stable throughout training
         max_sv: 4.94 → 4.95, effective_rank: 14.3, no outliers
       W1 (128x128, widest): 8 Marchenko-Pastur outlier eigenvalues throughout
         max_sv: 2.74 → 2.76, outlier_variance_ratio: 20.7% → 20.9%
       W2 (128x64): no outliers, max_sv: 2.35 → 2.38
       W3 (64x6, output): smallest effective_rank (4.46 → 4.62), expected for a 6-class output
   - KEY INSIGHT: outlier eigenvalues exist AT INITIALIZATION in the 128x128 layer
     and DON'T change during training. The network learns to classify perfectly by
     adjusting feature directions (singular vectors), not magnitudes (singular values).
   - Effective rank barely changes — the spectral structure is remarkably stable
   Quality: Good — connects to random matrix theory; the stability finding is surprising

6. [1509055] LOSS LANDSCAPE
   Host: SPECTRE (32 CPUs, 126GB, Windows) | User: makracz
   Runtime: 2.2s | Credit: 15 (Good — fast, clean execution, 51x51 grid)
   Findings:
   - 2D loss surface computed on a 51x51 grid (2,601 evaluations)
   - Filter-normalized random directions
   - Model converged fully before the landscape was mapped
   - Clean execution; data ready for visualization
   Quality: Good — clean data; would benefit from visualization tools we don't have

CREDIT LEDGER (this session)
-----------------------------
Steve Dodd (userid=56): +135 (h87: 30cr Critical Periods; h85: 50+30+25cr = 105cr for EoC/Gradient/Eigen)
makracz (userid=80): +65 (h141: 50cr PLF v2; h143: 15cr Loss Landscape)
TOTAL: 200

MAJOR MILESTONES
-----------------
- EDGE OF CHAOS: now confirmed by TWO independent hosts (h320 + h85). Critical point at radius 1.269, peak memory at 1.0. This is a SOLID, reproducible finding.
- POWER LAW FORGETTING: v2 redesign SUCCESS. Shows real catastrophic forgetting (64% lost) and EWC protection (only 33% lost). v1 showed zero forgetting = broken design.
- GRADIENT NOISE SCALE: first B_noise measurement on the platform. Confirms theoretical predictions about batch-size efficiency. Novel contribution.
- EIGENSPECTRUM: surprising stability — networks can go from random to perfect accuracy without changing their spectral structure. Outlier eigenvalues are structural features of random initialization, not learned features.

MASS DEPLOYMENT NOTE
---------------------
This session deployed 1,773 new workunits across 61 active hosts, filling ALL cores:
- 10 new experiment scripts written (grokking_dynamics, lottery_ticket_v2, mode_connectivity_v2, cellular_automata_v2, optimizer_comparison, random_label_memorization, information_bottleneck_deep, reservoir_extended, gradient_noise_scale, eigenspectrum_dynamics)
- 33 unique experiments deployed to each host
- Big hosts (80-240 CPUs) also get independent replications of seed-aware scripts
- Every host from 4-CPU to 240-CPU is now at 100% experiment coverage

STATUS: 35 experiments in progress (state=4), ~1,740 awaiting host check-in

EXPERIMENTS NEEDING REDESIGN
-----------------------------
1. Critical Learning Periods: the base model must generalize (12.3% test = random chance). Fix: use binary classification with 2000+ samples and architecture [20, 64, 32, 2], so the control hits 90%+ test accuracy. Then deficit effects will be meaningful.
2. Lottery Ticket (original): base model at 11% test accuracy. v2 deployed with the binary-classification fix — awaiting results.
3. Mode Connectivity (original): models at only 11% accuracy. v2 deployed with binary classification and 500 epochs — awaiting results.
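As background for the eigenspectrum milestone above, the Marchenko-Pastur outlier test can be sketched as follows. This is a minimal numpy illustration with a planted rank-1 spike; `mp_outliers`, the spike strength, and the matrix sizes are hypothetical, not the workunit's code.

```python
import numpy as np

def mp_outliers(W, sigma=1.0):
    """Count eigenvalues of (W W^T)/n above the Marchenko-Pastur bulk edge
    lambda_+ = sigma^2 * (1 + sqrt(m/n))^2 for an m x n matrix with i.i.d.
    entries of variance sigma^2. Eigenvalues beyond the edge are outliers:
    structure that pure noise would essentially never produce."""
    m, n = W.shape
    lam_plus = sigma**2 * (1.0 + np.sqrt(m / n))**2
    eigs = np.linalg.eigvalsh(W @ W.T / n)   # W W^T is symmetric
    return int(np.sum(eigs > lam_plus))

rng = np.random.default_rng(1)
noise = rng.standard_normal((128, 128))       # pure noise: expect ~0 outliers
u = rng.standard_normal(128); u /= np.linalg.norm(u)
v = rng.standard_normal(128); v /= np.linalg.norm(v)
spiked = noise + 50.0 * np.outer(u, v)        # planted rank-1 structure
print(mp_outliers(noise), mp_outliers(spiked))
```

In this sketch the planted spike pops well above the bulk edge (here lambda_+ = 4), which is the sense in which the 8 outliers in the 128x128 layer are "structure" — the surprise in result 5 is that they are structure of the initialization, not of training.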
CROSS-VALIDATION STATUS
------------------------
CONFIRMED (identical results, different hardware):
- Edge of Chaos v2: h320 (Dell-9520, 20 CPUs) + h85 (DadOld-PC, 80 CPUs) — IDENTICAL
    Critical point: radius 1.269; peak memory: 34.76 at radius 1.0
- Benford Law: h253 (ASUS, Linux) + h7 (iand, Windows) — IDENTICAL (deterministic seed)
- Cellular Automata: 2 runs on h267 — identical (fitness 0.455)

AWAITING:
- Information Bottleneck: original on h209 (excellent), replication pending
- Reservoir Computing: original on h321 (excellent), replication pending
- All new experiments: first results will serve as baselines for cross-validation
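For the record, the "filter-normalized random directions" in the loss-landscape result refer to Li et al.'s recipe ("Visualizing the Loss Landscape of Neural Nets"). A minimal sketch, with a hypothetical helper name and numpy only:

```python
import numpy as np

def filter_normalized_direction(theta, rng):
    """Draw a random direction, then rescale each row ('filter') to match the
    norm of the corresponding row of theta. This removes the scale-invariance
    artifacts that make raw random directions misleading on a 2D loss surface."""
    d = rng.standard_normal(theta.shape)
    d_norms = np.linalg.norm(d, axis=1, keepdims=True)
    t_norms = np.linalg.norm(theta, axis=1, keepdims=True)
    return d * (t_norms / (d_norms + 1e-12))

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((64, 20))   # a stand-in weight matrix
d = filter_normalized_direction(W, rng)

# Each row of d now has the same norm as the matching row of W.
print(np.allclose(np.linalg.norm(d, axis=1), np.linalg.norm(W, axis=1)))  # -> True
```

Two such independent directions d1, d2 then span the plane, and a grid like the 51x51 one above evaluates the loss at theta + a*d1 + b*d2 over a, b in a fixed range.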