Axiom BOINC Session Log — s0303e (~07:12 UTC, March 3, 2026)
=============================================================

DEPLOYMENT SUMMARY
==================
CPU Deployment: 186 workunits to 9 hosts
  Host 123 (Dads-PC, 80 CPUs, 128GB): 49 WUs
  Host 320 (Dell-9520, 20 CPUs, 32GB): 40 WUs
  Host 87 (Dad-Workstation, 80 CPUs, 128GB): 24 WUs
  Host 137 (Note11Ste, 12 CPUs, 15GB): 19 WUs
  Host 29 (DESKTOP-P57624Q, 8 CPUs, 16GB): 16 WUs
  Host 1 (Pyhelix, 16 CPUs, 32GB): 14 WUs
  Host 345 (Andre-WEBK, 8 CPUs, 8GB): 8 WUs
  Host 334 (Golf-1, 32 CPUs, 30GB): 8 WUs
  Host 16 (dahyun, 32 CPUs, 32GB): 8 WUs
  Skipped: Host 63 (4GB), Host 118 (3GB) — insufficient RAM

CPU experiment distribution (weighted):
  svd_rank_intervention (35%), percolation_scaling (20%),
  wd_lr_interaction (20%), wd_window_duration (10%),
  representation_crystallization (5%), regularization_mechanisms (5%),
  wd_rebound_dynamics (3%), bottleneck_mechanism (2%)

GPU Deployment: 208 workunits to 78 hosts
  26 hosts with 2 GPUs: 4 WUs each (104 total)
  52 hosts with 1 GPU: 2 WUs each (104 total)
  GPU scripts deployed: wd_timing_scale_gpu.py, grad_subspace_wd_gpu.py (NEW)
  Skipped: Host 116 (7GB), Host 118 (3GB) — insufficient RAM

GPU CHECKPOINT:
  - 78 GPU hosts identified with idle GPU capacity
  - 208 GPU workunits deployed (2 experiments x 104 GPUs)
  - Scripts used: wd_timing_scale_gpu.py (finding #54), grad_subspace_wd_gpu.py (NEW, finding #55)

TOTAL: 394 workunits (186 CPU + 208 GPU)

NEW EXPERIMENT: Gradient Subspace Dimension and WD Timing (#55)
===============================================================
Script: grad_subspace_wd_gpu.py (GPU-accelerated)

Novelty check (MANDATORY 4A):

4A-1. Literature search queries:
  - "gradient subspace dimensionality during neural network training"
  - "gradient subspace dimension weight decay timing"
  - "Hessian eigenspectrum weight decay timing"
  - "loss landscape curvature regularization timing"

4A-2. What was found:
  - Gur-Ari et al. (2018), "Gradient Descent Happens in a Tiny Subspace": gradients collapse into the top-k Hessian eigenvectors. Well-established finding.
  - Ghorbani et al. (2019): large isolated Hessian eigenvalues appear rapidly; batch norm suppresses them.
  - Xie & Sato (2021), "Understanding and Scheduling Weight Decay": WD accelerates convergence in the near-zero Hessian eigenvalue subspace, but does NOT connect this to gradient subspace dimension.
  - Galanti et al. (2022), "SGD and WD Secretly Minimize Rank": SGD+WD induces a low-rank bias on weight matrices. Different from gradient subspace rank.
  - Cohen et al. (2021), "Edge of Stability": progressive sharpening, then oscillation. NOT connected to WD timing.
  - AdaDecay (2019): per-parameter adaptive WD based on gradient norms, but not on subspace dimension.

4A-3. Novel angle: NO prior work connects gradient subspace dimensionality to weight decay timing. The pieces exist in isolation: (a) gradient subspace collapse is well measured, and (b) WD timing matters (our findings #44-45), but (c) nobody has linked them. This experiment provides the causal bridge, testing whether gradient subspace collapse PREDATES the optimal WD onset, which would explain WHY late WD works.

Hypothesis: WD becomes effective precisely when the gradient subspace has collapsed into a low-rank manifold. Early WD disrupts high-dimensional exploration; late WD regularizes the (large) complement of the (small) learning subspace for "free."

Method: width-512 MLP, 4 WD conditions; measure effective gradient rank every 10 epochs via the participation ratio of the output-layer gradient covariance eigenvalues.

KEY SCIENTIFIC FINDINGS
=======================
1. WD TIMING SCALE DEPENDENCE (Finding #54): 2 successful GPU results (RTX 4070 Ti). PRELIMINARY: inverse CP does NOT consistently hold at all network widths. Late WD effectiveness: 67% at width 64, 5% at 256, 25% at 512, 75% at 1024, 0% at 2048. The second seed shows a different pattern: 600% at 64, 0% at 256, -27% at 512, 63% at 1024, 85% at 2048.
Highly noisy with only 2 seeds. MANY GPU hosts are failing with CUDA_ERROR_NO_BINARY_FOR_GPU (architecture mismatch); only host 330 (RTX 4070 Ti) succeeded. Need more diverse seeds from hosts with compatible CUDA architectures.

2. GRADIENT SUBSPACE WD TIMING (Finding #55): NEW experiment deployed this session. 208 GPU workunits sent to 78 hosts; first results expected within ~30 minutes. This is the first experiment to directly test the gradient subspace collapse → WD timing connection identified as a literature gap.

3. CPU experiments continue collecting diverse-seed data for:
  - SVD rank intervention (#51): causal mechanism test, needs diverse seeds
  - Percolation phase transitions (#52): cross-disciplinary statistical physics
  - WD x LR interaction (#49): confirmed interaction, needs diverse seeds
  - WD window duration (#50): seed bug fixed, collecting diverse results

KNOWN ISSUES
============
- GPU CUDA arch mismatch: most GPU hosts fail with CUDA_ERROR_NO_BINARY_FOR_GPU. Only ~22 hosts have successfully completed GPU experiments. The bundled CuPy NVRTC cannot JIT-compile kernels for newer GPU architectures (Blackwell, sm_120+). This requires a binary rebuild with a newer CUDA toolkit — outside the scope of this deployment script.
- Large backlog of 8000+ unsent WUs from old/retired experiments. Low priority — the BOINC scheduler deprioritizes them vs. newer targeted work.

NEXT SESSION PRIORITIES
=======================
1. Analyze the first grad_subspace_wd_gpu results — check whether gradient subspace collapse timing correlates with WD effectiveness timing.
2. Continue collecting wd_timing_scale results from compatible GPU hosts.
3. If the gradient subspace experiment shows signal, design a follow-up with multiple hidden widths to test scale dependence of the collapse → WD timing relationship.
4. Consider retiring old unsent WUs to clean up the work queue.
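APPENDIX: For reference, the effective-rank measure that experiment #55 relies on (participation ratio of gradient covariance eigenvalues) can be sketched as below. This is a minimal NumPy illustration, not the deployed grad_subspace_wd_gpu.py code; the function name and array shapes are illustrative. The GPU script would use the CuPy equivalents of the same NumPy calls.

```python
import numpy as np

def participation_ratio(grads: np.ndarray) -> float:
    """Effective rank of a batch of gradient samples.

    PR = (sum_i lambda_i)**2 / sum_i lambda_i**2, where lambda_i are the
    eigenvalues of the gradient covariance matrix. PR is 1 when all
    variance lies in one direction and equals dim for isotropic variance.

    grads: (n_samples, dim) array, one flattened gradient per row.
    """
    centered = grads - grads.mean(axis=0)
    cov = centered.T @ centered / grads.shape[0]
    eigvals = np.linalg.eigvalsh(cov)        # real eigenvalues, ascending
    eigvals = np.clip(eigvals, 0.0, None)    # clamp tiny negative fp noise
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())

# Sanity checks on synthetic "gradients":
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(500, 1)) @ rng.normal(size=(1, 32))  # rank-1
isotropic = rng.normal(size=(500, 32))                           # full-rank
print(participation_ratio(low_rank))   # ~1.0
print(participation_ratio(isotropic))  # near dim=32
```

A collapsing PR curve over training epochs, compared against the WD-onset sweep, is the signal the hypothesis above predicts: PR should drop to a small value before the epoch at which late WD starts outperforming early WD.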