The RiVER Framework: Reinforcement Learning Without Ground-Truth Answers Improves LLM Coding Ability

Why It Was Selected

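The paper introduces RiVER, which uses reinforcement learning to train models on score-optimization problems that have no reference answer. It also improves results on conventional coding benchmarks as a side effect. A practical idea worth a look.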

AI Summary

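The paper proposes the Ranking-induced VERifiable framework (RiVER), which trains LLMs from score-based execution feedback without needing ground-truth answers. After training on 12 AtCoder Heuristic Contest tasks, Qwen3-8B improves its rating rank on the Algorithm Engineering Benchmark (ALE-Bench) by 8.9%, and GLM-Z1-9B-0414 improves by 9.4%. RiVER also yields absolute average gains of 2.4% and 3.5%, respectively, on exact-solution benchmarks such as LiveCodeBench and USACO. A comparison baseline shows that training on raw execution scores alone can raise the ALE rating but does not generalize to exact-solution tasks.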
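The contrast between raw execution scores and ranking-induced feedback can be illustrated with a small sketch. This is not the paper's reward definition, which the summary does not spell out. It is one plausible reading, assuming several sampled programs are scored on the same task and each reward is that program's normalized rank within the group. The function name `rank_rewards` and the tie-handling rule are hypothetical choices made for illustration.

```python
from typing import List


def rank_rewards(scores: List[float], higher_is_better: bool = True) -> List[float]:
    """Map raw execution scores to rewards in [0, 1] by within-group rank.

    Hypothetical illustration only: the worst score in the group gets 0,
    the best gets 1, and tied scores share the average of their ranks.
    """
    n = len(scores)
    if n == 0:
        return []
    if n == 1:
        return [0.5]  # a single sample carries no ranking information

    # Flip the sign for minimization tasks so "larger is better" always holds.
    signed = [s if higher_is_better else -s for s in scores]

    rewards = []
    for s in signed:
        below = sum(1 for t in signed if t < s)       # strictly worse samples
        equal = sum(1 for t in signed if t == s) - 1  # ties, excluding itself
        rewards.append((below + 0.5 * equal) / (n - 1))
    return rewards


# Two tasks whose raw scores live on very different scales
# produce the same rewards once only the ordering is kept.
task_a = rank_rewards([10.0, 1000.0, 500.0, 500.0])
task_b = rank_rewards([1.0, 3.0, 2.0, 2.0])
print(task_a)
print(task_b)
```

Under this assumption the reward depends only on the ordering of the samples, so tasks with very different score scales contribute comparable training signal. Whether this scale invariance is what drives the generalization gap reported above is a question for the paper itself.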

Original Abstract (truncated)

arXiv cs.LG

Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We intro…
