—— Simple, effective, intuitive, akin to learning rate scheduling
Authors: Xingjin Wang, Howe Tissue, Lu Wang
PS: This is the third paper in our scaling law series, following our pre-training (PT) scaling law [1] and continual pre-training (CPT) scaling law [2]. Now we present laws for reinforcement learning (RL) and post-training. This blog is written by Howe Tissue.
Background and Preliminaries
Reinforcement learning (RL) has emerged as a prominent approach for training strong reasoning models. In RL, entropy is an important measure of the uncertainty in the actor model's actions (i.e., the probability distribution over each decoded token for LLMs) and, by extension, of the model's balance between exploration and exploitation.
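For concreteness, here is a minimal sketch (not from the original post) of how the token-level entropy of the actor's decoding distribution is typically computed from logits; the function name, tensor shapes, and masking convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq_len, vocab]; mask: [batch, seq_len], 1.0 for generated tokens."""
    log_probs = F.log_softmax(logits, dim=-1)          # log p(v | context) for every vocab entry
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)   # H_t = -sum_v p_v * log p_v, per token
    return (token_entropy * mask).sum() / mask.sum()   # average over generated (response) tokens only
```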
Based on hundreds of recent papers, we now know many facts about entropy in RL for LLMs, summarized as follows:
- Entropy collapse, meaning that entropy decreases at the beginning of training and then converges, limits the exploration of LLMs and should be avoided [3].
- The clip-higher trick and the entropy bonus (subtracted from the loss) can improve performance by keeping entropy stable and preventing collapse [3, 4] (see the sketch after this list).
- Entropy minimization, which seems to be just another name for the undesirable phenomenon of entropy collapse, has nevertheless been found to be remarkably effective at boosting LLM performance within only a few training steps [5, 6, 7]. Many so-called unsupervised, or even flipped-reward, RL training methods may actually be based on entropy minimization [8, 9].
- Without intervention, entropy and performance appear to be tied together by a fixed function, which means that performance improvement is simply traded for entropy [10].
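To make the two interventions mentioned above concrete, below is a minimal sketch, assuming a PPO/GRPO-style clipped surrogate objective, of how the clip-higher trick (an asymmetric clip range with eps_high > eps_low) and an entropy bonus subtracted from the loss are commonly implemented. The function name, tensor shapes, and coefficient values are illustrative assumptions, not the authors' implementation.

```python
import torch

def clipped_policy_loss(log_probs, old_log_probs, advantages, token_entropy, mask,
                        eps_low=0.2, eps_high=0.28, ent_coef=1e-3):
    """All inputs are [batch, seq_len] tensors; mask is 1.0 for response tokens."""
    ratio = torch.exp(log_probs - old_log_probs)                    # pi_theta / pi_old, per token
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)     # clip-higher: looser upper bound
    pg_loss = -torch.min(ratio * advantages, clipped * advantages)  # standard clipped surrogate
    loss = pg_loss - ent_coef * token_entropy                       # entropy bonus via subtraction
    return (loss * mask).sum() / mask.sum()                         # mask out prompt/padding tokens
```

Note that flipping the sign of `ent_coef` in this sketch turns the entropy bonus into an entropy penalty, i.e., a simple form of the entropy minimization discussed above.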
This leads to some research questions:
- In practice, is entropy reduction good or bad for RL training?
- If entropy reduction is beneficial under certain conditions, when exactly does this occur? This raises a compelling possibility: can we strategically schedule entropy dynamics throughout training to optimize final model performance?
Parallelogram Law of Entropy