TxBench-PP基准测试：AI Agent在临床前药理学任务中表现不佳

精选理由

想看看AI在药物发现中到底行不行？这个基准测试用4800条轨迹告诉你，Claude Opus 4.8和GPT-5.5都还差得远，最高才59.3%的通过率。

AI 摘要

TxBench-PP是一个用于评估AI agent在小分子临床前药理学中决策能力的基准，包含100个涉及作用机制、药效学等任务的评估。在16个模型配置（涉及11个模型和4800条轨迹）中，最佳配置Claude Opus 4.8 / Pi仅通过59.3%（178/300）的端点尝试，GPT-5.5 / Pi通过55.3%。结果表明，当前AI系统无法可靠复现临床前药理学决策。

AI 翻译 · 中文

arXiv cs.LGArtificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We in…

@hebbia06-16 05:50原文

阅读原文