On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

IIIS, Tsinghua University,   Shanghai Qi Zhi Institute,   University of California, Los Angeles
*Equal contribution,   Corresponding author

Abstract

Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design space (choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, k1/k2/k3) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting does each KL variant require so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely used k3 penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO’s KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to 6 absolute percentage points over DAPO. Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme.
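For orientation, the kind of objective the abstract refers to can be written as follows; the notation is a generic sketch rather than the paper's own (a reverse, normalized KL is shown, and the RPG view also covers forward and unnormalized variants):

\max_{\theta}\;\mathcal{J}(\theta)
  \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),

where r is the task reward, pi_ref is the reference policy, beta is the regularization strength, and k1/k2/k3 denote different single-sample estimators of the KL penalty.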

Regularized Policy Gradient

  • We derive policy gradients and corresponding surrogate losses for Forward/Reverse KL, in normalized (KL) and unnormalized (UKL) forms, under off-policy sampling with importance weights.
  • We give both fully differentiable surrogates and REINFORCE-style losses (with stop-gradient) and prove that both yield the exact gradient of the intended regularized objective (Proposition 4.1, Appendix J).
  • We introduce RPG-Style Clip, a clipped-importance-weighted REINFORCE estimator that substantially improves stability and variance control while preserving the RPG gradients.
  • We reveal the equality between the k3 estimator and the unnormalized KL (Appendix B; a short numerical check follows this list), and show that GRPO’s KL penalty omits an essential importance weight under off-policy sampling. We provide a corrected estimator and loss consistent with the intended objective.
  • We present an iterative training framework that periodically updates the reference model to satisfy KL constraints while allowing the policy to depart meaningfully from the initial checkpoint (a minimal loop skeleton is sketched after this list).
  • On math reasoning, RPG-REINFORCE (with RPG-Style Clip) yields stable and scalable training and outperforms DAPO by up to +6 absolute points on AIME24/25.
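As referenced in the bullet on the k3 estimator, the identity is easy to check numerically. The sketch below is illustrative (toy categorical distributions in NumPy), not taken from the paper's code. With r = pi_ref / pi_theta and samples drawn from pi_theta, k1 is unbiased for the normalized KL and k3 is unbiased for the unnormalized (generalized) KL; the final lines show why an off-policy estimate needs an importance weight on the KL term.

# Illustrative check, not the paper's code: k1/k2/k3 estimators and the
# unnormalized-KL identity, on toy categorical distributions.
import numpy as np

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(6))          # stand-in for pi_theta over 6 tokens
p = rng.dirichlet(np.ones(6))          # stand-in for pi_ref

r = p / q                              # density ratio pi_ref / pi_theta
k1 = -np.log(r)
k2 = 0.5 * np.log(r) ** 2
k3 = r - 1.0 - np.log(r)

kl  = np.sum(q * np.log(q / p))             # normalized KL(q || p)
ukl = np.sum(q * np.log(q / p) - q + p)     # unnormalized (generalized) KL

print(np.sum(q * k1), kl)    # k1 is unbiased for the normalized KL
print(np.sum(q * k3), ukl)   # k3 is unbiased for the unnormalized KL

# Off-policy note (general principle, not the paper's exact estimator): if the
# samples come from an older policy mu instead of pi_theta, an unbiased estimate
# of E_{pi_theta}[k3] needs the importance weight w = pi_theta / mu.
mu = rng.dirichlet(np.ones(6))
w = q / mu
print(np.sum(mu * w * k3), ukl)   # reweighted off-policy estimate matches

Because both toy distributions sum to one, the normalized and unnormalized KL coincide in this example; the k3 identity itself holds for general unnormalized measures.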
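The iterative training framework mentioned above can be summarized as a loop that trains against a frozen reference model and refreshes that reference every so often. The skeleton below is a minimal sketch under assumed names (compute_loss, ref_update_every are hypothetical); the paper's actual schedule and losses are given in the algorithm sections further down.

# Minimal sketch of iterative reference-policy updates; names and the update
# interval are illustrative assumptions, not the paper's exact recipe.
import copy

def train(policy, dataloader, compute_loss, optimizer, ref_update_every=100):
    ref_policy = copy.deepcopy(policy)        # start from the initial checkpoint
    ref_policy.requires_grad_(False)          # the reference stays frozen
    for step, batch in enumerate(dataloader, start=1):
        loss = compute_loss(policy, ref_policy, batch)   # e.g., an RPG-style loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % ref_update_every == 0:
            # Refresh the reference so the KL constraint is measured against a
            # recent snapshot rather than the original checkpoint.
            ref_policy.load_state_dict(policy.state_dict())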

Experimental Results

4K context length results

Combined performance metrics on the AIME24 and AIME25 mathematical reasoning benchmarks, showing "Last" and "Best" scores at 4K context length. The "Last" score is measured at the 400th training step, provided training remained stable up to that point. The highest score in each column is bolded, and the second highest is underlined. RPG and RPG-REINFORCE methods are highlighted with light cyan and light green backgrounds, respectively.

2K context length results

Combined performance metrics on the AIME24 and AIME25 mathematical reasoning benchmarks, showing "Last" and "Best" scores at 2K context length. The "Last" score is measured at the 400th training step, provided training remained stable up to that point. The highest score in each column is bolded, and the second highest is underlined. RPG and RPG-REINFORCE methods are highlighted with light cyan and light green backgrounds, respectively.

Training dynamics (4K context length)

Training dynamics and benchmark performance for RPG and REINFORCE-style RPG compared to baselines (GRPO, DAPO) with 4K context length.

Training dynamics (2K context length)

Training dynamics and benchmark performance for RPG and REINFORCE-style RPG compared to baselines (GRPO, DAPO) with 2K context length.

Regularized Policy Gradients with fully differentiable surrogate loss functions

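A minimal sketch of what a fully differentiable surrogate of this kind can look like, written against per-sequence log-probabilities in PyTorch. The variable names, the reverse/unnormalized KL choice, and the k3-style penalty are illustrative assumptions; refer to the paper for the exact RPG losses.

# Hedged sketch, not the paper's implementation: a fully differentiable,
# importance-weighted surrogate for a KL-regularized objective.
import torch

def differentiable_rpg_surrogate(logp_theta, logp_old, logp_ref, rewards, beta=0.01):
    # logp_theta : log pi_theta(y|x) for sampled sequences, differentiable w.r.t. theta
    # logp_old   : log pi_old(y|x) of the behavior policy that produced the samples (detached)
    # logp_ref   : log pi_ref(y|x) of the frozen reference policy (detached)
    # rewards    : scalar reward per sampled sequence (detached)
    w = torch.exp(logp_theta - logp_old)       # importance weight pi_theta / pi_old
    reward_term = w * rewards                  # off-policy surrogate for E_{pi_theta}[reward]
    log_ratio = logp_ref - logp_theta          # log(pi_ref / pi_theta)
    ratio = torch.exp(log_ratio)
    k3 = ratio - 1.0 - log_ratio               # k3-style (unnormalized-KL) penalty
    # The same importance weight multiplies the KL penalty, in line with the
    # correction discussed above: without it, the penalty would be an expectation
    # under the old policy rather than the current one.
    return -(reward_term - beta * w * k3).mean()   # minimize the negative objective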

REINFORCE-Style Regularized Policy Gradients

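A corresponding REINFORCE-style sketch with clipped importance weights, in the spirit of RPG-Style Clip. Again the names, the clipping range, and the exact KL coefficient are assumptions for illustration rather than the paper's precise algorithm.

# Hedged sketch, not the paper's implementation: a REINFORCE-style loss with a
# detached, clipped importance-sampling coefficient.
import torch

def reinforce_style_rpg_loss(logp_theta, logp_old, logp_ref, rewards,
                             beta=0.01, clip_eps=0.2):
    # The whole coefficient is detached: gradients reach the policy only through
    # the score function logp_theta, as in classic REINFORCE.
    with torch.no_grad():
        w = torch.exp(logp_theta - logp_old)                  # pi_theta / pi_old
        w = torch.clamp(w, 1.0 - clip_eps, 1.0 + clip_eps)    # clipped importance weight
        kl_coeff = logp_theta - logp_ref                      # REINFORCE coefficient of the
                                                              # reverse (unnormalized) KL term
        coeff = w * (rewards - beta * kl_coeff)
    # Without clipping, this loss has the same gradient as the differentiable
    # surrogate sketched above; clipping trades a small bias for lower variance
    # and more stable off-policy updates.
    return -(coeff * logp_theta).mean()

Detaching the coefficient is what makes this a REINFORCE-style estimator rather than a differentiable surrogate, while keeping the intended gradient when no clipping is applied.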

Citation

Please cite the paper and star this repo if you use RPG and find it interesting/useful, thanks!

@article{zhang2025design,
    title={On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning},
    author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Yuan, Yang and Gu, Quanquan and Yao, Andrew C},
    journal={arXiv preprint arXiv:2505.17508},
    year={2025},
}