MTP Speeds Up Qwen by 2.5x in Atomic Chat

Why It Was Selected

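MTP roughly doubles local LLM inference speed and is especially relevant to developers running dense models on consumer GPUs: two RTX 5090s are enough to take a 27B model to 117 tps, so the open-source code is worth trying directly.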

AI Summary

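The Atomic Chat team used Multi-Token Prediction (MTP) to speed up Qwen inference by up to 2.5x on two RTX 5090 GPUs. The dense Qwen3.6 27B model went from 51 tps to 117 tps (reported as +137%), while the MoE model Qwen3.6 35B-A3B went from 218 tps to 267 tps (reported as +25%). MTP drafts several tokens ahead and verifies them in a single forward pass, which eases the memory-bandwidth bottleneck of token-by-token decoding; dense models benefit more. The technique is described as having zero accuracy loss and needing only about 1 GB of additional VRAM, and the code is open source.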
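The draft-and-verify step described in the summary above can be illustrated with a toy sketch. This is not Atomic Chat's implementation: `target_next`, `draft_next`, and `generate` are hypothetical names, the "models" are deterministic functions over small integers, and the Python verification loop stands in for what a real engine would run as one batched forward pass over all drafted positions.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical main model: greedy next token as a fixed function of the last token."""
    return (ctx[-1] * 3 + 1) % 7


def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap draft head: matches the main model except after token 5."""
    return 0 if ctx[-1] == 5 else target_next(ctx)


def generate(prompt: List[int], n_tokens: int, k: int,
             target: NextToken, draft: NextToken) -> Tuple[List[int], int]:
    """Greedy decoding with k drafted tokens per verification pass.

    Returns the generated sequence and the number of main-model passes.
    With k = 0 this reduces to ordinary one-token-per-pass decoding.
    """
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft k tokens ahead with the cheap predictor.
        proposal: List[int] = []
        ctx = list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2) Verify. Counted as ONE main-model pass (assumption: a real
        #    engine scores every drafted position in a single batched pass).
        passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if tok != expected:
                ctx.append(expected)  # keep the main model's token, drop the rest
                break
            ctx.append(tok)           # drafted token accepted
        else:
            ctx.append(target(ctx))   # all accepted: the pass yields one extra token
        out = ctx
    return out[:len(prompt) + n_tokens], passes


baseline, baseline_passes = generate([1], 12, 0, target_next, draft_next)
drafted, drafted_passes = generate([1], 12, 3, target_next, draft_next)
print(baseline == drafted, baseline_passes, drafted_passes)
```

Because every emitted token is either confirmed or replaced by the main model, the drafted run produces the same sequence as plain greedy decoding while using fewer main-model passes. In this toy that reduction in passes is the whole speedup; in a real engine the gain depends on how often drafts are accepted and on how much of decoding time is spent reading weights from memory.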

Original Tweet (truncated)

@atomic_chat_hq
MTP speedup Qwen by 2.5x in Atomic Chat
Dense vs MoE models on 2x RTX 5090
Qwen3.6 27B: 51 → 117 tps +137%
Qwen3.6 35B-A3B: 218 → 267 tps +25%
MTP drafts several tokens ahead and verifies them in one pass. The speedup de…
