When the Verifier Is Wrong: Self-Improving Vision-Language Models Regress on New Tasks

Why it was selected

Verifiers can hold the model back on new tasks.

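AI summary

The paper finds that verifier-driven self-DPO, a method for self-improving vision-language models, is task-specific. The authors test a ladder of open-source verifiers on MathVista, MMMU, and BLINK. The same verifier that improves a Qwen-3-VL-2B student model on MathVista reaches only 8%-23% verifier accuracy on MMMU, and the student's performance there drops by 3.4 to 10.9 percentage points. The effect replicates on Qwen-2.5-VL-3B. The paper gives a mechanistic explanation based on a variance theorem and argues that what matters is the verifier's quality on the target task, not its parameter count.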


arXiv cs.AI

Verifier-driven self-DPO is a common recipe for self-improving production visual-language models. In this setup, a frozen verifier scores candidate generations, the top- and bottom-scoring candidates form a preference ex…
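The pair-construction step described in the excerpt can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `build_preference_pair`, the tie-handling rule, and the two toy verifiers are assumptions made for illustration only. It shows how one selection rule yields a useful pair under an accurate verifier and an inverted pair under an inaccurate one.

```python
from typing import Callable, List, Optional, Tuple


def build_preference_pair(
    candidates: List[str],
    verifier_score: Callable[[str], float],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) picked by verifier score, or None.

    Hypothetical helper: the frozen verifier is modelled as a plain
    scoring function, and ties are treated as "no usable pair".
    """
    if len(candidates) < 2:
        return None
    scored = [(verifier_score(c), c) for c in candidates]
    best = max(scored, key=lambda item: item[0])   # top-scoring candidate
    worst = min(scored, key=lambda item: item[0])  # bottom-scoring candidate
    if best[0] == worst[0]:
        return None  # the verifier cannot separate the candidates
    return best[1], worst[1]


# Toy setting (assumed): the correct answer is "4".
answers = ["4", "5", "22"]


def accurate_verifier(answer: str) -> float:
    # Scores the correct answer highest.
    return 1.0 if answer == "4" else 0.0


def inaccurate_verifier(answer: str) -> float:
    # Rewards longer strings, which is unrelated to correctness.
    return float(len(answer))


good_pair = build_preference_pair(answers, accurate_verifier)
bad_pair = build_preference_pair(answers, inaccurate_verifier)
```

With the accurate verifier the correct answer is the chosen side of the pair. With the inaccurate one the correct answer ends up as the rejected side, so preference training on that pair would push the student away from it. This is consistent with the reported regression on tasks where verifier accuracy is low.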
