Are Sparse Autoencoder Benchmarks Reliable? Two Core SAEBench Metrics Reported to Fail

Why It Was Selected

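Teams doing interpretability research may find that the SAE evaluation metrics they rely on are unreliable: the paper reports that TPP and SCR fail its reliability tests. Consider switching to sae-probes and watching for progress on new benchmarks.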

AI Summary

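An arXiv paper audits the quality metrics in SAEBench, the standard evaluation suite for sparse autoencoders (SAEs). It finds that Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR) fail several reliability tests under their standard settings and should no longer be used to evaluate SAEs. The remaining metrics also suffer from high noise and low discriminative power. The sae-probes variant was the most reliable metric tested, but it still struggles to distinguish between variants of the same architecture. The authors conclude that the SAE field needs better benchmarking methods.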


arXiv cs.LG: Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quali…
