Gaze Heads：视觉语言模型如何注视它们描述的对象

精选理由

操控VLM输出，像翻漫画一样准

AI 摘要

论文发现视觉语言模型的LM骨干中存在一组称为gaze heads的注意力头，其注意力会追踪模型当前描述的图像区域。通过仅对top-100个gaze heads（少于全部9%）进行注意力掩码干预，能以83.1%的准确率引导模型描述指定的漫画面板，而随机干预无效。该干预同样适用于自然COCO图像，且机制在2B到32B参数规模及多种VLM架构中复现。该工作展示了通过机制分析实现无需重训的推理时多模态行为操控。

AI 翻译 · 中文

arXiv cs.LGHow a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backb…

阅读原文