Alibaba Qwen3-Omni real-time speech inference optimization: first-audio latency cut to 0.6 seconds

Why it was picked

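Teams from Alibaba and Ant Group optimized Qwen3-Omni serving, cutting first-audio latency in real-time conversation from about 6 seconds to 0.6 seconds and raising throughput more than fivefold. The technical blog post is worth reading.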

AI summary

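Qwen3-Omni is served as a pipeline: a multimodal Thinker, followed by Talker and then Code2Wav for speech. Under high concurrency, only the speech stage is replicated and the Thinker's output is reused, which cuts first-audio latency from about 6 seconds to 0.6 seconds. Throughput on the same GPU rises 5.4x, and speech is generated faster than real time. The optimization was implemented jointly by Alibaba, Ant Group's SCT team, and the vLLM-Omni team.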
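
To illustrate why replicating only one part of a pipeline can raise overall throughput, here is a minimal capacity sketch. It is a toy model, not the vLLM-Omni implementation: the `Stage` class, the per-request service times, and the replica counts are all hypothetical, and which stage is slowest is an assumption made only for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    seconds_per_request: float  # hypothetical service time, not a measurement
    replicas: int = 1

    def capacity(self) -> float:
        """Requests per second this stage sustains across all its replicas."""
        return self.replicas / self.seconds_per_request


def pipeline_throughput(stages: list[Stage]) -> float:
    """A serial pipeline can go no faster than its lowest-capacity stage."""
    return min(stage.capacity() for stage in stages)


def bottleneck(stages: list[Stage]) -> Stage:
    """Return the stage with the lowest capacity."""
    return min(stages, key=Stage.capacity)


# Illustrative numbers only (assumed): the speech stages are slower than the Thinker.
baseline = [
    Stage("thinker", 0.1),
    Stage("talker", 0.5),
    Stage("code2wav", 0.25),
]

# Replicate only the speech stages; the single Thinker instance is left as-is.
scaled = [
    Stage("thinker", 0.1),
    Stage("talker", 0.5, replicas=4),
    Stage("code2wav", 0.25, replicas=2),
]

if __name__ == "__main__":
    print("baseline bottleneck:", bottleneck(baseline).name)
    print("baseline throughput:", pipeline_throughput(baseline), "req/s")
    print("scaled throughput:", pipeline_throughput(scaled), "req/s")
```

In this toy model the Thinker never needs a second replica, because its capacity already exceeds that of the replicated speech stages. Real serving adds batching, queuing, and GPU memory limits that the sketch ignores.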

vLLM: 🎙️ @Alibaba_Qwen's Qwen3-Omni listens, reasons, and talks back. Serving that in real time is a pipeline problem, not a single model: a multimodal Thinker, then Talker → Code2Wav for the speech. Each stage bottlenecks di…
