Recomputation Trick Doubles Speculative Decoding Speed for SSMs

Why it was picked

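Skip storing the states: compute with them and discard them, and SSM inference runs about twice as fast. Qwen 3.5 and Nemotron Ultra users may want to try this trick.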

AI summary

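When hybrid models such as Qwen 3.5 and Nemotron Ultra run agents with very large contexts, the Gated-DeltaNet / Mamba states become a bottleneck. A simple insight yields a 2x speedup: load the states and compute with them, but do not store them. This recomputation trick finally unlocks speculative decoding (spec decoding) for state space models (SSMs).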
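One possible reading of "load, compute, don't store" in a speculative-decoding verify step is sketched below. It uses a toy diagonal recurrence in place of Gated-DeltaNet or Mamba. Every name here (`step`, `verify_no_store`, `verify_store_all`, `recompute_state`) is hypothetical, and rebuilding the state by re-running only the accepted tokens is an assumption, since this card does not show the actual kernel.

```python
from typing import List, Tuple

# Toy stand-in for an SSM layer (assumption, not Gated-DeltaNet/Mamba):
#   h_t = A * h_{t-1} + B * x_t,   y_t = sum(h_t)
A, B = 0.9, 0.5


def step(h: List[float], x: float) -> List[float]:
    """Advance the recurrent state by one token."""
    return [A * hi + B * x for hi in h]


def verify_store_all(h0: List[float], xs: List[float]) -> Tuple[List[float], List[List[float]]]:
    """Baseline: score every draft token and write back every intermediate state."""
    h, ys, states = list(h0), [], []
    for x in xs:
        h = step(h, x)
        ys.append(sum(h))
        states.append(list(h))  # one state write per draft token
    return ys, states


def verify_no_store(h0: List[float], xs: List[float]) -> List[float]:
    """Load h0, score every draft token, and keep no intermediate state."""
    h, ys = list(h0), []
    for x in xs:
        h = step(h, x)
        ys.append(sum(h))
    return ys  # the running state is discarded


def recompute_state(h0: List[float], accepted: List[float]) -> List[float]:
    """Rebuild the state for the accepted prefix by re-running the recurrence."""
    h = list(h0)
    for x in accepted:
        h = step(h, x)
    return h


h0 = [1.0, 2.0]
draft = [0.1, 0.2, 0.3, 0.4]
n_accepted = 2  # pretend the verifier accepted the first two draft tokens

ys = verify_no_store(h0, draft)
h_new = recompute_state(h0, draft[:n_accepted])
```

In this sketch the verify pass writes nothing back, and only the accepted prefix is replayed to obtain the next state. The trade is extra compute for fewer state writes, which only pays off if state traffic is the bottleneck, as the summary claims.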

AI translation · Chinese

Tri Dao (FlashAttention): As hybrid models (Qwen 3.5 / Nemotron Ultra) run agents with massive context, Gated-DeltaNet / Mamba states become a bottleneck. A simple insight to make this 2x faster: load the states, compute, but don't store them. Th…

arXiv: DeepSeek · 06-15 18:06
