10:35
arXiv cs.LG@Ping Liu, Qianqi Shen, Jianqiang Shen, Wenqiong Liu, Rajat Arora, Yunxiang Ren, Chunnan Yao, Dan Xu, Baofen Zheng, Wanjun Jiang, Andrii Soviak, Kevin Kao, Jingwei Wu, Wenjing Zhang
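Why recommended: The paper shows that reward-signal design matters more than the choice of optimizer. GRPO is prone to gaming the reward, and adding a rule-based defense raises quality by 14.7 percentage points.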
