vLLM integrates DFlash speculative decoding, raising Gemma-4 31B throughput by up to 5.8x

Why it was selected

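vLLM, in work highlighted by NVIDIA, adds DFlash speculative decoding. Gemma-4 31B throughput rises by up to 5.8x, and enabling it takes a one-line change to the checkpoint path in the configuration.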

AI summary

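The vLLM project announced support for DFlash speculative decoding. Users enable it by replacing an EAGLE-3 checkpoint with a DFlash checkpoint in the configuration, with no code changes. The feature runs through the open-source Speculators library, which connects the DFlash draft model to the target model's hidden states. On a single Blackwell Ultra GPU running Gemma-4 31B, throughput improved 5.8x on Math500, 5.3x on GSM8K, 5.6x on HumanEval, and 4.4x on MBPP.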
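The "config-only change" can be pictured with a minimal sketch. It assumes the engine arguments are held in a plain dictionary shaped like vLLM's `speculative_config` option (with `model` and `num_speculative_tokens` keys). All checkpoint paths and the token count below are hypothetical placeholders, since the source does not name them, and the source also does not say whether any other field must change.

```python
from copy import deepcopy

# Hypothetical placeholder paths; the source does not name the real checkpoints.
EAGLE3_CHECKPOINT = "path/to/eagle3-checkpoint"
DFLASH_CHECKPOINT = "path/to/dflash-checkpoint"


def swap_draft_checkpoint(engine_args: dict, new_checkpoint: str) -> dict:
    """Return a copy of engine_args with only the draft checkpoint path changed."""
    updated = deepcopy(engine_args)  # leave the caller's config untouched
    updated["speculative_config"]["model"] = new_checkpoint
    return updated


# Assumed shape of an existing EAGLE-3 setup (values are illustrative).
eagle3_args = {
    "model": "path/to/gemma-4-31b",  # placeholder target model path
    "speculative_config": {
        "model": EAGLE3_CHECKPOINT,
        "num_speculative_tokens": 4,  # illustrative value, not from the source
    },
}

dflash_args = swap_draft_checkpoint(eagle3_args, DFLASH_CHECKPOINT)

# List the speculative-config keys whose values differ between the two setups.
old_spec = eagle3_args["speculative_config"]
new_spec = dflash_args["speculative_config"]
changed = [key for key in old_spec if old_spec[key] != new_spec[key]]
print(changed)  # ['model']
```

Under these assumptions, the resulting dictionary could be passed to the engine the same way as before (for example as keyword arguments to `vllm.LLM`), which is what makes the swap a configuration edit and not a code edit.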

Original post

vLLM🙏 Thanks to the @NVIDIAAI team for highlighting DFlash support on vLLM! With DFlash speculative decoding, swapping EAGLE-3 for a DFlash checkpoint is a config-only change — no code edits needed. It runs through the open…

