Defending Against Harmful Supervision Hidden in Benign Samples

Why It Was Selected

This paper reveals a new form of harmful fine-tuning and proposes Embedded Attack and DR-SFT, making it highly relevant to AI safety researchers.

AI Summary

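The paper proposes Embedded Attack, which embeds harmful question-answer pairs inside benign training samples; tests show that representative guardrail mechanisms struggle to detect them at the sample level. To counter this threat, the authors propose Dual-Reference SFT (DR-SFT), which adapts a DPO-style contrastive objective to SFT through token-level regularization, mitigating harmful fine-tuning beyond coarse-grained data filtering. Experiments show that the attack bypasses existing defenses, while DR-SFT effectively reduces harmful behavior.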

arXiv cs.AI

Existing defenses are effective when harmful content is explicitly mixed into downstream fine-tuning data, but crafted samples can instead hide harmful supervision inside benign tasks. We propose Embedded Attack, where h…
