LLMs · RL · GDPO: Translation and Commentary on "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization"

Overview: This article identifies and quantifies a structural problem in multi-reward reinforcement learning: the widely used GRPO compresses the reward signal and loses information, hurting training resolution and stability. To address it, the paper proposes GDPO, which first normalizes each reward component within its rollout group, then aggregates the normalized advantages and applies batch-wise normalization, thereby preserving cross-reward distinctions, keeping the numerical scale under control, and markedly improving training stability and downstream performance. Experiments show GDPO consistently outperforming GRPO on tool-calling, math, and coding reasoning tasks. The paper also offers practical advice on reward design and weight management, along with future research directions, making it directly applicable to RL fine-tuning pipelines for multi-preference alignment.

Background and pain points

● The multi-reward integration challenge: multi-reward reinforcement learning is widely used in dialogue, reasoning, and tool-calling scenarios to simultaneously optimize accuracy, format, length, constraints, and other human preferences, but integrating these heterogeneous rewards into policy optimization efficiently and stably remains difficult.

● GRPO's built-in compression problem: the commonly used Group Relative Policy Optimization (GRPO) sums the multiple rewards first and then normalizes within the group. This compresses rollout rewards with different compositional meanings into identical or nearly identical advantages, discarding important distinctions across reward dimensions and degrading training precision and stability.

● Instability and degradation risk: the paper observes that in multi-reward settings, using GRPO directly can yield inaccurate advantage estimates and a lower-resolution learning signal, leading to poor convergence or early training failure; in some setups the correctness reward even begins to decline.

The proposed solution

● Overall method (GDPO): the paper proposes Group reward-Decoupled Normalization Policy Optimization (GDPO), which performs decoupled group-wise normalization separately for each reward, then sums the normalized reward advantages, and finally applies batch-wise advantage normalization, balancing the preservation of cross-reward distinctions against numerical stability.

● Preserving reward discriminability: GDPO's core idea is to standardize each reward separately first, avoiding the sum-then-normalize failure mode in which different reward combinations are compressed into the same advantage, thereby providing a more discriminative training signal.

● Batch-level normalization for stability: after summing the per-reward normalized advantages, GDPO applies batch-wise advantage normalization so that the numerical scale does not inflate as the number of rewards grows; the paper notes that removing this step occasionally causes convergence failure.

● Reward priorities and weights: the paper also gives a systematic account of how to adjust reward functions and weights to reflect different preference priorities, enabling interpretable preference weighting in practice.

Core steps (an operational flow)

● Step 1 — construct the multi-reward objective: design several complementary but potentially conflicting reward components for the target task (e.g., correctness, format, length, bug_ratio) and define a metric function for each.

● Step 2 — per-reward group-wise normalization: within each prompt's group of rollouts, compute a group-wise normalized advantage for each reward dimension separately, instead of summing the rewards first.

● Step 3 — aggregate and batch-normalize: sum the normalized advantages across reward dimensions for each sample to obtain a total advantage, then normalize that total advantage batch-wise so it stays numerically stable and does not grow with the number of rewards.

● Step 4 — policy update and monitoring: update the policy with the GDPO advantages (following a GRPO-style update rule), while monitoring each reward's convergence and the training stability, adjusting reward weights or normalization settings if necessary.

Experimental design and evaluation (highlights from the paper)

● Three task families: GDPO is compared with GRPO on tool calling, math reasoning, and coding reasoning, with metrics covering accuracy, format adherence, length constraints, code pass rate, and bug ratio.

● Across-the-board results: GDPO beats GRPO in convergence, downstream accuracy, and constraint adherence. For example, on the AIME math task, GDPO delivers accuracy gains of up to 6.3% (DeepSeek-R1-1.5B) and 2.3% (Qwen3-4B-Instruct) while better maintaining the response-length constraint. Training curves and steady-state performance reported throughout the paper support the improvement.

● Ablations: the paper also tests a GRPO variant with standard-deviation (std) normalization removed, finding that merely dropping the std does not fundamentally recover the lost information; GDPO's decoupled normalization better preserves the number of distinct advantage groups.

Strengths of the method and contributions

● Better information preservation: by normalizing each reward component independently, GDPO retains more of the fine-grained differences among reward combinations, raising the resolution of the learning signal.

● Higher training stability: with batch-wise advantage normalization, GDPO markedly reduces the risk of training collapse or early degradation in multi-reward settings, producing smoother convergence curves.

● Generality and practicality: comparisons across three task types and multiple models show GDPO is broadly applicable to multi-reward RL optimization and is an engineering-ready drop-in replacement for GRPO.

● Actionable guidance: beyond the algorithm, the paper provides systematic guidelines on weight tuning and reward design for practical deployment.

Takeaways, advice, and directions

● Practical advice: in multi-reward fine-tuning and RL pipelines, prefer normalizing each reward component separately to prevent information compression, and keep batch-level normalization to maintain numerical stability.

● Choose reward weights deliberately: the paper stresses that systematic adjustment of reward weights and priorities is necessary; preferences of different priority must be reflected in the normalization and weighting steps to avoid suboptimal bias.

● On existing GRPO variants: removing std normalization (GRPO w/o std) brings small gains but is no substitute for decoupling the rewards, suggesting that future algorithm design should focus on preserving cross-reward information.

● Research directions: extending GDPO to settings with many rewards, sparse rewards, or strongly conflicting rewards, e.g., via adaptive weight scheduling, hierarchical priority encoding, and tighter theoretical analysis of how reward decoupling affects convergence.

Contents

Translation and Commentary on "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization"
Abstract
Figure 1
1. Introduction
6. Conclusion

Translation and Commentary on "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization"

Paper: https://arxiv.org/abs/2601.05242
Date: January 8, 2026
Authors: NVIDIA

Abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios.
To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length).
Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.

Figure 1: (a): An overview of GDPO, which performs group-wise normalization per reward and then applies batch-wise advantage normalization to preserve a stable numerical range independent of reward count and improve update stability. (b): Median and IQR reward curves over five runs of Qwen2.5-Instruct-1.5B tool-calling RL, demonstrating that GDPO consistently converges to higher correctness and format reward score than GRPO.

1. Introduction

As language models continue to advance in capability, expectations for their behavior have grown accordingly. Demand for models to not only provide accurate responses but also exhibit behaviors aligned with a wide range of human preferences across diverse scenarios has continued to increase. These preferences span efficiency [1, 2, 3], safety [4], response coherence and logic [5, 6], gender biases [7], and many other objectives. Meeting such heterogeneous requirements within a single model is a challenging task.

Reinforcement learning (RL) has emerged as the de facto training pipeline for aligning large language models to fulfill such diverse human preferences.
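The reward collapse described in the abstract is easy to reproduce in a few lines. Below is a minimal sketch with hypothetical reward values (the two reward components and the group size are our own illustrative choices, not taken from the paper):

```python
import numpy as np

# Hypothetical rewards (correctness, format) for one prompt's group
# of four rollouts -- illustrative values only.
rewards = np.array([
    [1.0, 0.0],  # correct answer, wrong format
    [0.0, 1.0],  # wrong answer, correct format
    [1.0, 1.0],  # correct answer, correct format
    [1.0, 0.0],  # correct answer, wrong format
])

# Multi-reward GRPO baseline: sum the rewards first, then normalize
# within the group.
total = rewards.sum(axis=-1)                 # -> [1., 1., 2., 1.]
adv = (total - total.mean()) / total.std()

# Rollouts 0 and 1 satisfy different preferences, yet their summed
# rewards tie, so they receive identical advantages: the collapse.
print(adv[0] == adv[1])  # True
```

Because the group-wise statistics see only the summed reward, every rollout with the same total is indistinguishable to the optimizer, regardless of which preference it actually satisfied.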
In particular, recent RL-based approaches have begun to incorporate multiple rewards into training, with each reward designed to capture different human preferences and collectively guide models toward human-favored behaviors. Despite this growing interest in multi-reward RL, recent work [1, 3, 5] has largely focused on reward design itself and has often relied on applying Group Relative Policy Optimization (GRPO) directly for multi-reward RL optimization, without examining whether GRPO is well-suited for optimizing combinations of heterogeneous rewards.

In this paper, we revisit the applicability of GRPO in multi-reward settings and show that directly applying GRPO to normalize different combinations of rollout rewards can cause them to collapse into identical advantage values, which effectively limits the precision of the training signal, as illustrated in Fig. 2. This collapse removes important distinctions across reward dimensions and leads to inaccurate policy updates, suboptimal reward convergence, and, in many cases, early training failure.

To overcome these challenges, we propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the group-wise normalization of each individual reward, as illustrated in Fig. 1(a), to ensure that distinctions across different reward combinations are better preserved and more accurately reflect the relative differences in model responses. This leads to more precise multi-reward optimization and substantially improved training convergence.
After this decoupled group-wise normalization, we apply batch-wise advantage normalization to ensure that the magnitude of the advantage does not increase as the number of individual rewards increases.

We compare GDPO and GRPO across three tasks: tool calling, math reasoning, and code reasoning. These tasks cover a wide range of objectives, including tool-calling accuracy and format correctness, mathematical reasoning accuracy and adherence to reasoning-length constraints, and code pass rate and bug ratio. Across all tasks, GDPO converges better. For example, in Fig. 1(b), training Qwen2.5-1.5B-Instruct with GDPO attains both higher correctness and format compliance than GRPO on the tool-calling task. On challenging math tasks, GDPO consistently outperforms GRPO. For instance, training DeepSeek-R1-1.5B and Qwen3-4B-Instruct with GDPO yields up to 6.3% and 2.3% higher accuracy on AIME compared to GRPO, while simultaneously keeping more responses short.

Taken together, these results demonstrate the effectiveness and generalizability of GDPO, showing it to be a better alternative to GRPO for multi-reward RL optimization.

Our contributions are as follows:

• Analysis of GRPO reward collapse.
We demonstrate that applying GRPO naively for multi-reward RL optimization can collapse distinct rollout reward combinations into identical advantage values, thereby diminishing the resolution of the learning signal.

• Remediation of GRPO reward collapse. We propose GDPO, which performs group-wise decoupled normalization of each reward separately to better preserve cross-reward distinctions and enable more accurate multi-reward optimization.

• In addition to GDPO, we provide a systematic overview of how to modify reward functions and adjust reward weights to more faithfully align with preferences of varying priority.

• We carry out extensive experiments on three tasks: tool calling, math reasoning, and code reasoning, and compare the effectiveness of GDPO on optimizing a wide range of rewards corresponding to accuracy, format correctness, length constraints, and code quality. In all settings, GDPO consistently outperforms GRPO, showing improved training convergence and stronger downstream performance that aligns more closely with a diverse set of preferences.

6. Conclusion

In contrast to prior work that focuses on designing new reward functions for multi-reward reinforcement learning while assuming GRPO is the default optimization method, this study revisits a
fundamental but often overlooked question: whether GRPO is actually suitable for multi-reward optimization. Our analysis shows that applying GRPO directly to the summed reward can cause different reward combinations to collapse into the same advantage values. This collapse eliminates important distinctions across reward dimensions, produces inaccurate policy updates and weaker optimization performance, and can in many cases lead to early training failure.

To address this limitation, we introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a simple and effective modification to GRPO tailored for multi-reward reinforcement learning. GDPO performs normalization separately for each reward to preserve cross-reward differences, and it incorporates batch-wise advantage normalization to maintain a stable numerical range as additional rewards are included. These changes result in better convergence behavior and models that more faithfully reflect the intended preference structure.

We further present a systematic study for incorporating human preference priorities into the training process and explain how reward functions can be adjusted when the difficulty disparity between objectives is large. Through extensive experiments on tool calling, math reasoning, and coding reasoning, we show that GDPO consistently outperforms GRPO.
Its advantages hold across different numbers of rewards, across different models, and across different reward functions.

Overall, our findings establish GDPO as a more stable, accurate, and preference-aligned optimization method than GRPO for multi-reward reinforcement learning, making it a strong foundation for aligning language models with diverse human preferences in real-world settings.
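The decoupled procedure summarized in the conclusion — per-reward group-wise normalization, summation, then batch-wise normalization — can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (function names, tensor layout, and the `eps` constant are our choices), not the authors' implementation:

```python
import numpy as np

def gdpo_advantages(rewards, eps=1e-8):
    """GDPO sketch. rewards: shape (batch, group, num_rewards),
    where each 'group' holds the rollouts sampled for one prompt."""
    # Step 1: decoupled group-wise normalization -- normalize each
    # reward component separately within its rollout group.
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    per_reward_adv = (rewards - mean) / (std + eps)

    # Step 2: sum the per-reward advantages for each rollout.
    summed = per_reward_adv.sum(axis=-1)

    # Step 3: batch-wise advantage normalization, so the magnitude
    # stays stable as more rewards are added.
    return (summed - summed.mean()) / (summed.std() + eps)

def grpo_advantages(rewards, eps=1e-8):
    """GRPO baseline: sum rewards first, then normalize per group."""
    total = rewards.sum(axis=-1)
    mean = total.mean(axis=1, keepdims=True)
    std = total.std(axis=1, keepdims=True)
    return (total - mean) / (std + eps)
```

On a group where two rollouts tie in summed reward but differ per component — for example (1, 0) versus (0, 1) in an asymmetric group — `grpo_advantages` assigns them identical advantages, while `gdpo_advantages` keeps them distinct, which is exactly the cross-reward information the paper argues should be preserved.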