LLMs Write Fast Single-GPU Kernels but Fall Apart on Multi-GPU

Why It Was Selected

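The new benchmark ParallelKernelBench finds that LLMs write single-GPU code reasonably well but lose their footing as soon as multiple GPUs have to work together. Want to see where AI coding actually gets stuck?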

AI Summary

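ParallelKernelBench evaluates how well LLMs write multi-GPU kernels. It comprises 87 problems drawn from real codebases, including Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, and NeMo-RL. The results show that LLMs perform well on single-GPU kernels but fail completely in multi-GPU settings. The study, by Willy Chan and colleagues, exposes a core weakness of current LLMs in multi-GPU programming.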

AI Translation · Chinese

Together AI: LLMs write fast single-GPU kernels. Ask for a multi-GPU one and they fall apart. ParallelKernelBench measures how they fail by benchmarking against 87 problems pulled from real codebases including Megatron-LM, DeepSpeed,…
