Surrogate Fidelity: When Can Open Models Explain Closed Models?

Why It Was Selected

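Thinking of using an open-source model to explain GPT? This paper shows that agreement in predictions does not imply agreement in attributions, so watch out for this pitfall.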

AI Summary

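This paper studies how reliable it is to use open language models (such as Llama and Qwen) to explain closed API models (such as GPT and Gemini). Experiments on 11 models show that prediction-level fidelity (log-odds agreement) is far higher than attribution-level fidelity (leave-one-out importance). There is an access-validity inversion: white-box signals (such as attention patterns) are stable but fail to predict causal attributions, whereas black-box input ablation captures attribution directly. The paper warns that prediction agreement alone is not enough to show that mechanistic interpretability transfers from open models to closed ones.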
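To make the gap between the two kinds of fidelity concrete, here is a minimal Python sketch. It is not the paper's setup: the two scorers (`toy_target`, `toy_surrogate`), the four toy inputs, and the choice of Pearson correlation as the agreement measure are all assumptions made for illustration. The scorers return a log-odds-like number, and leave-one-out importance is taken as the drop in that number when one token is deleted.

```python
from math import sqrt
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[str]], float]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def leave_one_out(score: Scorer, tokens: Sequence[str]) -> List[float]:
    """Importance of token i = score(full input) - score(input without token i)."""
    full = score(tokens)
    return [
        full - score(list(tokens[:i]) + list(tokens[i + 1:]))
        for i in range(len(tokens))
    ]


def prediction_fidelity(target: Scorer, surrogate: Scorer, inputs) -> float:
    """Agreement of the two models' scores across inputs."""
    return pearson([target(x) for x in inputs], [surrogate(x) for x in inputs])


def attribution_fidelity(target: Scorer, surrogate: Scorer, inputs) -> float:
    """Mean per-input agreement of leave-one-out importance vectors."""
    per_input = [
        pearson(leave_one_out(target, x), leave_one_out(surrogate, x))
        for x in inputs
    ]
    return sum(per_input) / len(per_input)


# Hypothetical stand-in for the closed model: it keys on sentiment words.
def toy_target(tokens: Sequence[str]) -> float:
    return 2.0 * tokens.count("excellent") - 2.0 * tokens.count("terrible")


# Hypothetical stand-in for the open surrogate: it keys on punctuation instead.
def toy_surrogate(tokens: Sequence[str]) -> float:
    return 2.0 * tokens.count("!") - 2.0 * tokens.count(".")


# Toy inputs in which sentiment words and punctuation always co-occur.
TOY_INPUTS = [
    ["excellent", "movie", "!"],
    ["terrible", "movie", "."],
    ["excellent", "plot", "!"],
    ["terrible", "plot", "."],
]

if __name__ == "__main__":
    print("prediction fidelity:", prediction_fidelity(toy_target, toy_surrogate, TOY_INPUTS))
    print("attribution fidelity:", attribution_fidelity(toy_target, toy_surrogate, TOY_INPUTS))
```

On these toy inputs the two scorers agree perfectly on every prediction (correlation 1.0), yet their leave-one-out vectors point at different tokens (mean correlation of about -0.5). That is the pitfall the summary describes, shown in a deliberately extreme form. Note also that the computation only calls the scorers on ablated inputs, so it needs no access to model internals.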

AI Translation · Chinese

arXiv cs.LG

Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when…
