Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

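Why it was selected

Want to improve how reasoning models are trained? This paper uses rubrics for fine-grained self-distillation, outperforms both GRPO and OPSD, and backs the claim with solid experiments.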

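AI summary

The paper proposes the Rubric-Conditioned Self-Distillation framework, which replaces scalar rewards with rubrics and so provides token-level guidance. The method has two stages: first, learn to generate task-level rubrics; second, train a rubric-guided reasoner. Averaged over several scientific reasoning benchmarks, it outperforms GRPO by 1.0 points and OPSD by 0.9 points. It avoids both the noise of a single reference reasoning chain and the ambiguity of a scalar reward.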
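
To make "token-level guidance" concrete, here is a minimal sketch of a per-token distillation signal. It is an illustration under stated assumptions, not the paper's implementation: the summary does not specify the loss, so the use of KL divergence, the names `token_kl` and `distillation_loss`, and the framing of a rubric-conditioned "teacher" pass versus a rubric-free "student" pass of the same model are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_seq, student_seq):
    """Return (mean KL, per-token KL list) for two aligned logit sequences.

    ASSUMPTION: teacher_seq holds logits from a pass that sees the rubric,
    student_seq holds logits from a pass that does not. Each token position
    gets its own signal, unlike a single scalar reward for the whole response.
    """
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token


# Toy example: 2 token positions, vocabulary of size 3.
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss, per_token = distillation_loss(teacher, student)
print(loss, per_token)
```

In this toy input the second position has identical teacher and student logits, so it contributes zero loss, while the first position contributes a positive value. This shows how a token-level signal can localize where the two passes disagree.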

arXiv cs.AI

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to