On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

IIIS, Tsinghua University,   Shanghai Qi Zhi Institute,   University of California, Los Angeles
*Equal contribution,   Corresponding author

Abstract

Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design space (choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, k1/k2/k3) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting does each KL variant require so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely used k3 penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO’s KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to 6 absolute percentage points over DAPO. Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme.
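For orientation, the kind of objective the abstract refers to can be written as follows; the notation is a generic sketch rather than the paper's own (a reverse, normalized KL is shown, and the RPG view also covers forward and unnormalized variants):

\max_{\theta}\;\mathcal{J}(\theta)
  \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),

where r is the task reward, pi_ref is the reference policy, beta is the regularization strength, and k1/k2/k3 denote different single-sample estimators of the KL penalty.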

Regularized Policy Gradient

  • We derive policy gradients and corresponding surrogate losses for Forward/Reverse KL, in normalized (KL) and unnormalized (UKL) forms, under off-policy sampling with importance weights.
  • We give both fully differentiable surrogates and REINFORCE-style losses (with stop-gradient) and prove that both yield the exact gradient of the intended regularized objective (Proposition 4.1, Appendix J).
  • We introduce RPG-Style Clip, a clipped-importance-weighted REINFORCE estimator that substantially improves stability and variance control while preserving the RPG gradients.
  • We reveal the equality between the k3 estimator and the unnormalized KL (Appendix B; a short numerical check follows this list), and show that GRPO’s KL penalty omits an essential importance weight under off-policy sampling. We provide a corrected estimator and loss consistent with the intended objective.
  • We present an iterative training framework that periodically updates the reference model to satisfy KL constraints while allowing the policy to depart meaningfully from the initial checkpoint (a minimal loop skeleton is sketched after this list).
  • On math reasoning, RPG-REINFORCE (with RPG-Style Clip) yields stable and scalable training and outperforms DAPO by up to +6 absolute points on AIME24/25.
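As referenced in the bullet on the k3 estimator, the identity is easy to check numerically. The sketch below is illustrative (toy categorical distributions in NumPy), not taken from the paper's code. With r = pi_ref / pi_theta and samples drawn from pi_theta, k1 is unbiased for the normalized KL and k3 is unbiased for the unnormalized (generalized) KL; the final lines show why an off-policy estimate needs an importance weight on the KL term.

# Illustrative check, not the paper's code: k1/k2/k3 estimators and the
# unnormalized-KL identity, on toy categorical distributions.
import numpy as np

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(6))          # stand-in for pi_theta over 6 tokens
p = rng.dirichlet(np.ones(6))          # stand-in for pi_ref

r = p / q                              # density ratio pi_ref / pi_theta
k1 = -np.log(r)
k2 = 0.5 * np.log(r) ** 2
k3 = r - 1.0 - np.log(r)

kl  = np.sum(q * np.log(q / p))             # normalized KL(q || p)
ukl = np.sum(q * np.log(q / p) - q + p)     # unnormalized (generalized) KL

print(np.sum(q * k1), kl)    # k1 is unbiased for the normalized KL
print(np.sum(q * k3), ukl)   # k3 is unbiased for the unnormalized KL

# Off-policy note (general principle, not the paper's exact estimator): if the
# samples come from an older policy mu instead of pi_theta, an unbiased estimate
# of E_{pi_theta}[k3] needs the importance weight w = pi_theta / mu.
mu = rng.dirichlet(np.ones(6))
w = q / mu
print(np.sum(mu * w * k3), ukl)   # reweighted off-policy estimate matches

Because both toy distributions sum to one, the normalized and unnormalized KL coincide in this example; the k3 identity itself holds for general unnormalized measures.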
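The iterative training framework mentioned above can be summarized as a loop that trains against a frozen reference model and refreshes that reference every so often. The skeleton below is a minimal sketch under assumed names (compute_loss, ref_update_every are hypothetical); the paper's actual schedule and losses are given in the algorithm sections further down.

# Minimal sketch of iterative reference-policy updates; names and the update
# interval are illustrative assumptions, not the paper's exact recipe.
import copy

def train(policy, dataloader, compute_loss, optimizer, ref_update_every=100):
    ref_policy = copy.deepcopy(policy)        # start from the initial checkpoint
    ref_policy.requires_grad_(False)          # the reference stays frozen
    for step, batch in enumerate(dataloader, start=1):
        loss = compute_loss(policy, ref_policy, batch)   # e.g., an RPG-style loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % ref_update_every == 0:
            # Refresh the reference so the KL constraint is measured against a
            # recent snapshot rather than the original checkpoint.
            ref_policy.load_state_dict(policy.state_dict())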

Experimental Results

4K context length results

Combined performance metrics on the AIME24 and AIME25 mathematical reasoning benchmarks, showing "Last" and "Best" scores at 4K context length. The "Last" score is measured at the 400th training step, provided training remained stable up to that point. The highest score in each column is bolded, and the second highest is underlined. RPG and RPG-REINFORCE methods are highlighted with light cyan and light green backgrounds, respectively.

2K context length results

Combined performance metrics on the AIME24 and AIME25 mathematical reasoning benchmarks, showing "Last" and "Best" scores at 2K context length. The "Last" score is measured at the 400th training step, provided training remained stable up to that point. The highest score in each column is bolded, and the second highest is underlined. RPG and RPG-REINFORCE methods are highlighted with light cyan and light green backgrounds, respectively.

Training dynamics (4K context length)

Training dynamics and benchmark performance for RPG and REINFORCE-style RPG compared to baselines (GRPO, DAPO) with 4K context length.

Training dynamics (2K context length)

Training dynamics and benchmark performance for RPG and REINFORCE-style RPG compared to baselines (GRPO, DAPO) with 2K context length.

Regularized Policy Gradients with fully differentiable surrogate loss functions

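A minimal sketch of what a fully differentiable surrogate of this kind can look like, written against per-sequence log-probabilities in PyTorch. The variable names, the reverse/unnormalized KL choice, and the k3-style penalty are illustrative assumptions; refer to the paper for the exact RPG losses.

# Hedged sketch, not the paper's implementation: a fully differentiable,
# importance-weighted surrogate for a KL-regularized objective.
import torch

def differentiable_rpg_surrogate(logp_theta, logp_old, logp_ref, rewards, beta=0.01):
    # logp_theta : log pi_theta(y|x) for sampled sequences, differentiable w.r.t. theta
    # logp_old   : log pi_old(y|x) of the behavior policy that produced the samples (detached)
    # logp_ref   : log pi_ref(y|x) of the frozen reference policy (detached)
    # rewards    : scalar reward per sampled sequence (detached)
    w = torch.exp(logp_theta - logp_old)       # importance weight pi_theta / pi_old
    reward_term = w * rewards                  # off-policy surrogate for E_{pi_theta}[reward]
    log_ratio = logp_ref - logp_theta          # log(pi_ref / pi_theta)
    ratio = torch.exp(log_ratio)
    k3 = ratio - 1.0 - log_ratio               # k3-style (unnormalized-KL) penalty
    # The same importance weight multiplies the KL penalty, in line with the
    # correction discussed above: without it, the penalty would be an expectation
    # under the old policy rather than the current one.
    return -(reward_term - beta * w * k3).mean()   # minimize the negative objective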

REINFORCE-Style Regularized Policy Gradients

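A corresponding REINFORCE-style sketch with clipped importance weights, in the spirit of RPG-Style Clip. Again the names, the clipping range, and the exact KL coefficient are assumptions for illustration rather than the paper's precise algorithm.

# Hedged sketch, not the paper's implementation: a REINFORCE-style loss with a
# detached, clipped importance-sampling coefficient.
import torch

def reinforce_style_rpg_loss(logp_theta, logp_old, logp_ref, rewards,
                             beta=0.01, clip_eps=0.2):
    # The whole coefficient is detached: gradients reach the policy only through
    # the score function logp_theta, as in classic REINFORCE.
    with torch.no_grad():
        w = torch.exp(logp_theta - logp_old)                  # pi_theta / pi_old
        w = torch.clamp(w, 1.0 - clip_eps, 1.0 + clip_eps)    # clipped importance weight
        kl_coeff = logp_theta - logp_ref                      # REINFORCE coefficient of the
                                                              # reverse (unnormalized) KL term
        coeff = w * (rewards - beta * kl_coeff)
    # Without clipping, this loss has the same gradient as the differentiable
    # surrogate sketched above; clipping trades a small bias for lower variance
    # and more stable off-policy updates.
    return -(coeff * logp_theta).mean()

Detaching the coefficient is what makes this a REINFORCE-style estimator rather than a differentiable surrogate, while keeping the intended gradient when no clipping is applied.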

Citation

Please cite the paper and star this repo if you use RPG and find it interesting/useful, thanks!

@article{zhang2025design,
    title={On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning},
    author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Yuan, Yang and Gu, Quanquan and Yao, Andrew C},
    journal={arXiv preprint arXiv:2505.17508},
    year={2025},
}