Move the Query, Not the Cache: A New Strategy for Cross-Instance MLA Attention

Why It Was Selected

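For teams deploying LLM inference at scale, this paper offers a new approach to cross-GPU attention optimization: route the query instead of moving the cache, which the authors measure to cut latency substantially. Its cost model and decision predicate are worth a close look and could be applied directly when tuning your own inference system.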

AI Summary

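This paper studies how to optimize attention across GPU instances. When a query needs a KV-cache block that lives on another GPU, the conventional approach moves the block to the query's GPU. Multi-head Latent Attention (MLA), however, compresses each token's keys and values into a narrow vector, which makes routing the query (about 1 KB) cheaper than moving the cache block. The authors measure cross-instance MLA attention on a real multi-node H100 cluster and propose a topology-aware cost model together with a route/fetch/local decision predicate. They find that during decoding, routing the query reduces the roughly 3 ms cost of moving the cache to a few tens of microseconds. The model is not limited to MLA and extends to architectures such as DeepSeek-V3.2, V4, and GLM-5.1.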
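To make the route/fetch/local choice concrete, here is a minimal sketch of what such a decision predicate could look like. It is not the paper's cost model: the `Link` class, the latency-plus-bandwidth transfer formula, the `expected_reuses` amortization, and every number in the usage note are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical interconnect between two GPU instances."""
    latency_s: float       # one-way message latency, in seconds
    bandwidth_Bps: float   # sustained bandwidth, in bytes per second

    def transfer_time(self, nbytes: int) -> float:
        # Simple alpha-beta model: fixed latency plus size over bandwidth.
        return self.latency_s + nbytes / self.bandwidth_Bps


def route_cost(link: Link, query_bytes: int, result_bytes: int) -> float:
    # Routing sends the query out and brings the partial attention result back.
    return link.transfer_time(query_bytes) + link.transfer_time(result_bytes)


def fetch_cost(link: Link, block_bytes: int, expected_reuses: int = 1) -> float:
    # Fetching moves the whole cache block once; the cost is spread over
    # however many later queries are expected to reuse it locally.
    return link.transfer_time(block_bytes) / max(expected_reuses, 1)


def decide(is_local: bool, link: Link, query_bytes: int, result_bytes: int,
           block_bytes: int, expected_reuses: int = 1) -> str:
    """Return 'local', 'route', or 'fetch' for one query/block pair."""
    if is_local:
        return "local"  # block already resident: no transfer needed
    r = route_cost(link, query_bytes, result_bytes)
    f = fetch_cost(link, block_bytes, expected_reuses)
    return "route" if r <= f else "fetch"
```

With an assumed link of 10 µs latency and 25 GB/s bandwidth, a 1 KB query and result, and a 64 MiB block, `decide` returns `"route"` when the block is used once and `"fetch"` when it is expected to be reused 1000 times. The sketch ignores topology (for example intra-node versus inter-node links), which a topology-aware model would capture by choosing a different `Link` per GPU pair.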

Abstract (excerpt)

arXiv: DeepSeek

Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention's unit is now a small, reusable chunk. Agentic workloads hammer it: many sub…
