
AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 2Peng Cheng Laboratory, 3University of Chinese Academy of Sciences, 4University of Texas at Austin, 5Shenzhen Campus of Sun Yat-sen University, 6Shenzhen University of Advanced Technology, 7University of Oxford, 8University of Surrey
NeurIPS 2025
*Corresponding author

Overview

AlphaDecay is a plug-and-play method that improves training efficacy for diverse optimizers by dynamically tuning module-wise weight-decay coefficients according to the heavy-tailed spectral differences observed across modules in LLMs, yielding lower perplexity and better downstream generalization.

Zero Extra Tuning Cost: Once the global weight-decay value is fixed, AlphaDecay can be applied immediately; no additional hyperparameter search is required.

Optimizer Agnostic: One-click integration with Adam, AdamW, and more, with no per-optimizer code changes (see the integration sketch below).

Versatile Task Support: Demonstrated gains on LLM pre-training and fine-tuning, as well as vision transformers for large-scale image classification.
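
To make the plug-and-play claim concrete, here is a minimal PyTorch sketch (not the official implementation) of how module-wise decay coefficients can be attached through standard per-parameter groups, so any decoupled-weight-decay optimizer such as AdamW consumes them without modification. The helper name module_wise_decay and the grouping granularity are assumptions for illustration.

import torch
from torch import nn

def build_param_groups(model: nn.Module, module_wise_decay):
    # Build optimizer parameter groups with a per-module weight-decay value.
    # `module_wise_decay(param_name) -> float` is a hypothetical callback that
    # returns the decay assigned to each 2D weight; biases and norm scales
    # keep zero decay, as is common practice.
    groups = []
    for name, param in model.named_parameters():
        if param.ndim < 2:  # biases, LayerNorm/RMSNorm scales
            groups.append({"params": [param], "weight_decay": 0.0})
        else:
            groups.append({"params": [param], "weight_decay": module_wise_decay(name)})
    return groups

# Usage: any optimizer that accepts parameter groups works unchanged, e.g.
# optimizer = torch.optim.AdamW(build_param_groups(model, my_decay_fn), lr=3e-4)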

Method Overview

Distinct Spectra: Attention layers exhibit heavier-tailed spectra, whereas MLP layers show lighter-tailed characteristics.

[Figure: Module Spectral]
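
The heavy-tail diagnosis behind this observation can be summarized by the Hill estimate of the power-law exponent of each weight matrix's empirical spectral density (ESD). Below is a minimal sketch of one common Hill-estimator variant, assuming the ESD is formed from the eigenvalues of W^T W; the tail-size rule (k_frac) and the exact estimator form are illustrative assumptions and may differ from the paper's formulation.

import torch

def pl_alpha_hill(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    # Hill estimate of the power-law tail exponent of a layer's ESD.
    # The ESD is taken as the eigenvalues of W^T W (squared singular values);
    # `k_frac` sets how many of the largest eigenvalues form the tail and is
    # an illustrative choice, not necessarily the rule used in the paper.
    W = weight.detach().float()
    eigs = torch.linalg.svdvals(W) ** 2
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(k_frac * eigs.numel()))
    tail = eigs[:k]
    lambda_k = tail[-1]  # smallest eigenvalue kept in the tail
    alpha = 1.0 + k / torch.sum(torch.log(tail / lambda_k))
    return alpha.item()

# Lower alpha = heavier tail (typical for attention projections);
# higher alpha = lighter tail (typical for MLP projections).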

PL_Alpha_Hill Guided: AlphaDecay assigns larger weight decay to modules with higher PL_Alpha_Hill values, and smaller decay to those with lower values.

[Figure: More Structure]
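
A minimal sketch of this assignment rule follows, under the assumption of a simple proportional mapping: modules with higher PL_Alpha_Hill receive proportionally larger decay, and the coefficients are rescaled so their mean equals the fixed global weight decay (matching the zero-extra-tuning claim above). The proportional form and the module names are illustrative; the paper's exact mapping may differ.

def module_wise_decay_from_alpha(alphas: dict, global_wd: float) -> dict:
    # Map per-module PL_Alpha_Hill values to weight-decay coefficients:
    # higher alpha (lighter-tailed ESD) -> larger decay, lower alpha -> smaller,
    # rescaled so the mean coefficient equals the fixed global weight decay.
    mean_alpha = sum(alphas.values()) / len(alphas)
    return {name: global_wd * (a / mean_alpha) for name, a in alphas.items()}

# Example: heavier-tailed attention projections end up with weaker decay.
alphas = {"attn.q_proj": 2.5, "attn.k_proj": 2.7, "mlp.up_proj": 4.1, "mlp.down_proj": 4.3}
decays = module_wise_decay_from_alpha(alphas, global_wd=0.1)
# decays ≈ {"attn.q_proj": 0.074, "attn.k_proj": 0.079, "mlp.up_proj": 0.121, "mlp.down_proj": 0.126}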

Spectrum Balanced: By equalizing module-wise spectra, AlphaDecay consistently improves performance when paired with existing optimizers.

[Figure: Different Weight Decay]

Experimental Results

Main Results: LLaMA Pre-training on C4 Dataset

Comparison with various weight-decay scheduling strategies using the Adam optimizer when pre-training LLaMA models of various sizes (60M, 135M, 350M, 1B) on the C4 dataset. Validation perplexity (↓) is reported. All baselines are carefully tuned. AlphaDecay consistently outperforms uniform decay, AWD, and AdaDecay across all model sizes and weight-decay values.

[Figure: Main Results]

Zero-shot Evaluation on Downstream Tasks

Zero-shot performance comparison on downstream tasks including ARC-c, ARC-e, PIQA, HellaSwag, OBQA, WinoGrande, and BoolQ. AlphaDecay demonstrates superior generalization, outperforming the Uniform, AdaDecay, and AWD methods.

[Figure: Zero-shot Results]

More Architectures and Datasets

Comparison of AlphaDecay with baseline methods on GPT-nano/C4 and ViT-tiny/ImageNet-1K. AlphaDecay achieves the best performance across different architectures and datasets.

[Figure: GPT-nano/C4 and ViT-tiny/ImageNet-1K Results]

BibTeX

@article{he2025alphadecay,
  title={AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs},
  author={Di He and Songjun Tu and Ajay Jaiswal and Li Shen and Ganzhao Yuan and Shiwei Liu and Lu Yin},
  journal={arXiv preprint arXiv:2506.14562},
  year={2025}
}

Acknowledgements

This repository is built upon the GaLore and ConvNeXt repositories. Thanks for their great work!