Artificial Analysis Updates Its Coding Agent Rankings: DeepSWE Replaces SWE-Bench Pro, Claude Fable 5 Takes First Place

Why It's Featured

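This benchmark update directly changes the rankings of mainstream coding agents, so developers choosing AI coding tools or evaluating model capabilities should take note. Claude Fable 5 is the new leader, and Codex also gained substantially. The original post has the detailed scores and comparisons.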

AI Summary

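Artificial Analysis has updated its Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark. DeepSWE's tasks are written from scratch, so models cannot memorize answers from public GitHub issues or pull requests. This addresses the original benchmark's vulnerability to gaming. After the update, Codex with GPT-5.5 (xhigh) rose from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. The newly released Claude Fable 5 (max), running in Claude Code, tops the index with 77. The shift reveals that the original benchmark was biased with respect to certain model and agent combinations.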


Artificial Analysis: We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly rele…

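lmarena.ai · 06-11 19:35 · Original
Decoder · 06-12 17:10 · Original
Aravind Srinivas · 06-10 18:24 · Original
shao__meng · 06-11 00:40 · Original
AI Will · 06-11 07:28 · Original
rohanpaul_ai · 06-11 13:00 · Original
Nous Research · 06-12 03:56 · Original
Perplexity · 06-10 18:07 · Original
IT之家 (IT Home) · 06-10 23:21 · Original
小互 · 06-11 03:10 · Original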
