AI模型精选

OpenAI测试模型对齐持久性:对抗提示下更难被导向有害行为

We also tested whether alignment persisted under pressure. The model was harder to steer toward ha...

精选理由

OpenAI发现他们的模型在对抗压力下挺得住,不容易被带坏,安全对齐效果不错。

AI 摘要

OpenAI发布测试结果,评估模型对齐在压力下的表现。在对抗性提示下,模型更难被引导至有害行为,同时依然能响应有益指令。初步证据表明,模型对有害微调也表现出更强的抵抗力。这项测试关注模型的安全鲁棒性,未提及具体模型版本或基准分数。

AI 翻译 · 中文

OpenAI发布测试结果,评估模型对齐在压力下的表现。在对抗性提示下,模型更难被引导至有害行为,同时依然能响应有益指令。初步证据表明,模型对有害微调也表现出更强的抵抗力。这项测试关注模型的安全鲁棒性,未提及具体模型版本或基准分数。

OpenAIWe also tested whether alignment persisted under pressure. The model was harder to steer toward harmful behavior with adversarial prompts, while remaining responsive to helpful instructions. We saw preliminary evidence o