NVIDIA Introduces Nemotron-Labs-TwoTower: Splitting a 30B Model in Two to Generate Tokens in Parallel

We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Intro...

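Why it was selected

NVIDIA split a 30B model into two halves that write tokens in parallel, more than doubling generation speed while largely preserving quality. Rather than training a new model, it reuses pretrained weights, which is a clever approach.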

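AI Summary

NVIDIA Research split the 30B-parameter Nemotron-3-Nano-30B-A3B model into two halves: one maintains the context and the other generates tokens. The resulting diffusion language model reuses the pretrained weights rather than being trained from scratch, and it retains 98.7% of the original model's quality while generating 2.42x faster. The approach replaces conventional autoregressive, token-by-token generation with parallel writing, which significantly improves inference efficiency.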
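To put the two reported figures in perspective, here is a minimal back-of-envelope sketch in Python. Only the 2.42x speedup and the 98.7% quality retention come from the summary above; the baseline latency, the baseline benchmark score, and the helper names are hypothetical values chosen purely for illustration.

```python
# Figures reported in the summary.
SPEEDUP = 2.42            # generation speedup over the original model
QUALITY_RETAINED = 0.987  # fraction of the original model's quality kept


def accelerated_seconds(baseline_seconds: float, speedup: float = SPEEDUP) -> float:
    """Time for the same generation job after applying a speedup factor."""
    return baseline_seconds / speedup


def retained_score(baseline_score: float, retained: float = QUALITY_RETAINED) -> float:
    """Score implied by keeping a given fraction of the baseline score."""
    return baseline_score * retained


if __name__ == "__main__":
    # Hypothetical baseline: a response that takes 10 s to generate and
    # a benchmark score of 80.0 for the original autoregressive model.
    print(round(accelerated_seconds(10.0), 2))
    print(round(retained_score(80.0), 2))
```

Under these assumed baselines, a 10-second generation would drop to a little over 4 seconds, while an 80-point score would fall by about one point.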

NVIDIA AI: We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here