
AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 2Peng Cheng Laboratory, 3University of Chinese Academy of Sciences, 4University of Texas at Austin, 5Shenzhen Campus of Sun Yat-sen University, 6Shenzhen University of Advanced Technology, 7University of Oxford, 8University of Surrey
NeurIPS 2025
*Corresponding author

Overview

AlphaDecay is a plug-and-play method that improves training efficacy for diverse optimizers by dynamically tuning module-wise weight-decay coefficients according to the heavy-tailed spectral differences observed across modules in LLMs, yielding lower perplexity and better downstream generalization.

Zero Extra Tuning Cost: Once the global weight-decay value is fixed, AlphaDecay can be applied immediately; no additional hyperparameter search is required.

Optimizer Agnostic: One-click integration with Adam, AdamW, and more, with no per-optimizer code changes (see the integration sketch below).

Versatile Task Support: Demonstrated gains on LLM pre-training and fine-tuning, as well as vision transformers for large-scale image classification.
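
To make the plug-and-play claim concrete, here is a minimal PyTorch sketch (not the official implementation) of how module-wise decay coefficients can be attached through standard per-parameter groups, so any decoupled-weight-decay optimizer such as AdamW consumes them without modification. The helper name module_wise_decay and the grouping granularity are assumptions for illustration.

import torch
from torch import nn

def build_param_groups(model: nn.Module, module_wise_decay):
    # Build optimizer parameter groups with a per-module weight-decay value.
    # `module_wise_decay(param_name) -> float` is a hypothetical callback that
    # returns the decay assigned to each 2D weight; biases and norm scales
    # keep zero decay, as is common practice.
    groups = []
    for name, param in model.named_parameters():
        if param.ndim < 2:  # biases, LayerNorm/RMSNorm scales
            groups.append({"params": [param], "weight_decay": 0.0})
        else:
            groups.append({"params": [param], "weight_decay": module_wise_decay(name)})
    return groups

# Usage: any optimizer that accepts parameter groups works unchanged, e.g.
# optimizer = torch.optim.AdamW(build_param_groups(model, my_decay_fn), lr=3e-4)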

Method Overview

Distinct Spectra: Attention layers exhibit heavier-tailed spectra, whereas MLP layers show lighter-tailed characteristics.

[Figure: Module Spectral]
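
The heavy-tail diagnosis behind this observation can be summarized by the Hill estimate of the power-law exponent of each weight matrix's empirical spectral density (ESD). Below is a minimal sketch of one common Hill-estimator variant, assuming the ESD is formed from the eigenvalues of W^T W; the tail-size rule (k_frac) and the exact estimator form are illustrative assumptions and may differ from the paper's formulation.

import torch

def pl_alpha_hill(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    # Hill estimate of the power-law tail exponent of a layer's ESD.
    # The ESD is taken as the eigenvalues of W^T W (squared singular values);
    # `k_frac` sets how many of the largest eigenvalues form the tail and is
    # an illustrative choice, not necessarily the rule used in the paper.
    W = weight.detach().float()
    eigs = torch.linalg.svdvals(W) ** 2
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(k_frac * eigs.numel()))
    tail = eigs[:k]
    lambda_k = tail[-1]  # smallest eigenvalue kept in the tail
    alpha = 1.0 + k / torch.sum(torch.log(tail / lambda_k))
    return alpha.item()

# Lower alpha = heavier tail (typical for attention projections);
# higher alpha = lighter tail (typical for MLP projections).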

PL_Alpha_Hill Guided: AlphaDecay assigns larger weight decay to modules with higher PL_Alpha_Hill values, and smaller decay to those with lower values.

[Figure: More Structure]
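
A minimal sketch of this assignment rule follows, under the assumption of a simple proportional mapping: modules with higher PL_Alpha_Hill receive proportionally larger decay, and the coefficients are rescaled so their mean equals the fixed global weight decay (matching the zero-extra-tuning claim above). The proportional form and the module names are illustrative; the paper's exact mapping may differ.

def module_wise_decay_from_alpha(alphas: dict, global_wd: float) -> dict:
    # Map per-module PL_Alpha_Hill values to weight-decay coefficients:
    # higher alpha (lighter-tailed ESD) -> larger decay, lower alpha -> smaller,
    # rescaled so the mean coefficient equals the fixed global weight decay.
    mean_alpha = sum(alphas.values()) / len(alphas)
    return {name: global_wd * (a / mean_alpha) for name, a in alphas.items()}

# Example: heavier-tailed attention projections end up with weaker decay.
alphas = {"attn.q_proj": 2.5, "attn.k_proj": 2.7, "mlp.up_proj": 4.1, "mlp.down_proj": 4.3}
decays = module_wise_decay_from_alpha(alphas, global_wd=0.1)
# decays ≈ {"attn.q_proj": 0.074, "attn.k_proj": 0.079, "mlp.up_proj": 0.121, "mlp.down_proj": 0.126}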

Spectrum Balanced: By equalizing module-wise spectra, AlphaDecay consistently improves performance when paired with existing optimizers.

[Figure: Different Weight Decay]

Experimental Results

Main Results: LLaMA Pre-training on C4 Dataset

Comparison with various weight-decay scheduling strategies using the Adam optimizer when pre-training LLaMA models of various sizes (60M, 135M, 350M, 1B) on the C4 dataset. Validation perplexity (↓) is reported. All baselines are carefully tuned. AlphaDecay consistently outperforms uniform decay, AWD, and AdaDecay across all model sizes and weight-decay values.

[Figure: Main Results]

Zero-shot Evaluation on Downstream Tasks

Zero-shot performance comparison on downstream tasks including ARC-c, ARC-e, PIQA, HellaSwag, OBQA, WinoGrande, and BoolQ. AlphaDecay demonstrates superior generalization, outperforming the Uniform, AdaDecay, and AWD methods.

[Figure: Zero-shot Results]

More Architectures and Datasets

Comparison of AlphaDecay with baseline methods on GPT-nano/C4 and ViT-tiny/ImageNet-1K. AlphaDecay achieves the best performance across different architectures and datasets.

[Figure: GPT-nano/C4 and ViT-tiny/ImageNet-1K Results]

BibTeX

@article{he2025alphadecay,
  title={AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs},
  author={Di He and Songjun Tu and Ajay Jaiswal and Li Shen and Ganzhao Yuan and Shiwei Liu and Lu Yin},
  journal={arXiv preprint arXiv:2506.14562},
  year={2025}
}

Acknowledgements

This repository is built upon the GaLore and ConvNeXt repositories. Thanks for their great work!