重新思考奖励监督：Rubric-Conditioned Self-Distillation

精选理由

想提升推理模型训练效果？这篇用评分标准做细粒度自蒸馏，比GRPO和OPSD都强，实验扎实。

AI 摘要

提出Rubric-Conditioned Self-Distillation框架，用评分标准替代标量奖励，提供token级指导。方法分两步：先学习生成任务级评分标准，再训练评分标准引导的推理器。在多个科学推理基准上平均超越GRPO 1.0分、OPSD 0.9分。避免了单一参考推理链的噪声和标量奖励的模糊性。

AI 翻译 · 中文

arXiv cs.AIPost-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to…

阅读原文