VISTA: View-Consistent Self-Verified Training Improves GUI Grounding

VISTA: View-Consistent Self-Verified Training for GUI Grounding

Why it was selected

Multi-view training makes GUI grounding more accurate.

AI Summary

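When GRPO is used to train GUI grounding, sampling rollouts from a single view leaves too little effective training signal. The VISTA framework builds comparison groups from multiple cropped views that keep the target element visible, and adds self-verified cross-view anchors. It delivers consistent gains on five GUI grounding benchmarks; on ScreenSpot-Pro, Qwen3-VL 4B/8B/30B-A3B improve from 55.5/52.7/53.7 to 63.4/65.8/67.0, respectively. Robustness analysis shows higher worst-view accuracy and a lower prediction flip rate.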
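To make the signal problem concrete, here is a minimal sketch of group-relative advantages in a common GRPO form (each reward standardized within its group). The binary hit/miss reward, the function name `group_relative_advantages`, and the example reward values are illustrative assumptions, not details taken from the paper, and the sketch does not reproduce how VISTA builds its groups.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (a common GRPO form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed binary reward: 1.0 if the predicted point hits the target, else 0.0.
# Single-view group on a hard instance: every rollout misses.
single_view_group = [0.0, 0.0, 0.0, 0.0]

# Hypothetical group drawn from several crops of the same target,
# where some views succeed and others fail.
multi_view_group = [0.0, 1.0, 0.0, 1.0]

print(group_relative_advantages(single_view_group))  # all zeros: no gradient signal
print(group_relative_advantages(multi_view_group))   # mixed signs: usable signal
```

A group whose rewards are all identical, whether all failures or all successes, yields zero advantage for every rollout, so it contributes nothing to the policy update. A group with mixed outcomes produces non-zero advantages of both signs.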

arXiv cs.AI

When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones,