Training Replay
Animated replay of a 10k-step XSAKE training run on OpenWebText (5% subset) with a GPT-style transformer (sequence length 1024, batch size 32). The loss converges to below 0.3. HADS recalibration events are marked in red.
Step: 0 / 10000
Loss: 6.1976
Perplexity: 491.6
HADS recal: 1
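The perplexity readout is consistent with the exponential of the loss, assuming the loss is the mean per-token cross-entropy measured in nats (the page does not state the unit, so this is an assumption). A minimal sketch, with `perplexity` as a hypothetical helper name:

```python
import math

def perplexity(loss_nats):
    """Perplexity as exp(mean per-token cross-entropy).

    Assumes the loss is reported in nats; this is not confirmed by the page.
    """
    return math.exp(loss_nats)

# Step-0 loss taken from the readout above.
print(f"{perplexity(6.1976):.1f}")
```

With the step-0 loss of 6.1976 this evaluates to roughly 491.6, which agrees with the displayed perplexity.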
Chart: Training Loss (raw + EMA smoothed)
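The chart overlays an exponential moving average (EMA) on the raw loss. The page does not give the smoothing factor or say whether bias correction is applied, so the sketch below uses a hypothetical `ema_smooth` helper with an assumed `decay` of 0.98 and seeds the average with the first value:

```python
def ema_smooth(values, decay=0.98):
    """Return the exponential moving average of a sequence of losses.

    decay=0.98 is an assumed value, not one taken from the page.
    The first value seeds the average, so no bias correction is needed.
    """
    smoothed = []
    avg = None
    for v in values:
        # Blend the running average with the new raw value.
        avg = v if avg is None else decay * avg + (1.0 - decay) * v
        smoothed.append(avg)
    return smoothed
```

A higher `decay` gives a smoother curve that lags further behind the raw loss; a lower one tracks the raw values more closely.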
HADS Recalibration Events
step=0: Head 2, sparsity 0% → 18% (✓ fired)
step=500: Head 1, sparsity 75% → 82%
step=1000: Head 8, sparsity 0% → 25%
step=1500: Head 5, sparsity 40% → 44%
step=2000: Head 3, sparsity 60% → 71%
step=2500: Head 11, sparsity 80% → 76%
step=3000: Head 0, sparsity 55% → 62%
HADS recalibrates every 500 steps. Sparsity ratios adapt as heads specialize during training.
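The fixed 500-step cadence can be sketched as a simple modulo check. This illustrates only the schedule, not how HADS computes the new sparsity ratios; `is_recalibration_step` and `recalibration_steps` are hypothetical names, and counting steps from 0 up to but not including the total is an assumption:

```python
def is_recalibration_step(step, interval=500):
    """True when a recalibration would fire at this step (step 0 included)."""
    return step % interval == 0

def recalibration_steps(total_steps, interval=500):
    """List the steps in [0, total_steps) at which recalibration would fire."""
    return [s for s in range(total_steps) if is_recalibration_step(s, interval)]

# The first seven entries line up with the event log: 0, 500, ..., 3000.
print(recalibration_steps(3001))
```

Under these assumptions a 10000-step run would contain 20 recalibration points.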