Training Replay

Animated replay of a 10,000-step XSAKE training run on a 5% subset of OpenWebText, using a GPT-style transformer with sequence length 1024 and batch size 32. The loss converges to below 0.3. HADS recalibration events are marked in red.

Step: 0 / 10000
Loss: 6.1976
Perplexity: 491.6
HADS recal: 1
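
The loss and perplexity shown in the readout are consistent with the usual relation perplexity = exp(loss). A minimal sketch, assuming the displayed loss is a mean per-token cross-entropy in nats (the page does not state this, and the function name `perplexity` is only illustrative):

```python
import math

def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss, assumed to be in nats."""
    return math.exp(loss_nats)

# Step-0 loss taken from the readout above.
print(round(perplexity(6.1976), 1))  # 491.6
```

Under the same assumption, the final loss of just under 0.3 would correspond to a perplexity of roughly 1.35.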

Training Loss (raw + EMA smoothed)
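
For readers unfamiliar with the smoothed curve, an exponential moving average can be sketched as follows. The smoothing factor `beta` and the function name `ema_smooth` are assumptions for illustration; the page does not say which factor the replay uses or whether it applies bias correction.

```python
def ema_smooth(values, beta=0.9):
    """Exponential moving average of a sequence, seeded with the first value.

    beta is a hypothetical smoothing factor: higher means a smoother curve.
    """
    smoothed = []
    avg = None
    for v in values:
        # The first point initializes the average; later points are blended in.
        avg = v if avg is None else beta * avg + (1.0 - beta) * v
        smoothed.append(avg)
    return smoothed

raw_losses = [6.2, 5.8, 6.0, 5.1, 4.9]  # made-up values, not from the run
print(ema_smooth(raw_losses))
```

Seeding with the first value avoids the downward bias that a zero-initialized average shows at the start of a run.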

HADS Recalibration Events

step=0: Head 2, sparsity 0% → 18% ✓ fired
step=500: Head 1, sparsity 75% → 82%
step=1000: Head 8, sparsity 0% → 25%
step=1500: Head 5, sparsity 40% → 44%
step=2000: Head 3, sparsity 60% → 71%
step=2500: Head 11, sparsity 80% → 76%
step=3000: Head 0, sparsity 55% → 62%

HADS recalibrates every 500 steps. Sparsity ratios adapt as heads specialize during training.
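
The 500-step cadence can be illustrated with a simple step check. This is only a sketch of the schedule, not of the HADS recalibration logic itself; `is_recal_step` and `RECAL_INTERVAL` are hypothetical names, and the assumption that step 0 counts as a recalibration step comes from the first event in the log.

```python
RECAL_INTERVAL = 500  # from the text: recalibration every 500 steps

def is_recal_step(step: int, interval: int = RECAL_INTERVAL) -> bool:
    """Return True on steps where a recalibration would be scheduled."""
    return step % interval == 0

# Scheduled steps within the range covered by the event log.
recal_steps = [s for s in range(0, 3001) if is_recal_step(s)]
print(recal_steps)  # [0, 500, 1000, 1500, 2000, 2500, 3000]
```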