Transformer 可省去 Key 和 Value 投影？新论文砍掉 50% KV 缓存

精选理由

做 LLM 推理优化的团队可以直接参考这个设计——砍掉一半 KV 缓存但几乎不损质量，值得在自家模型上试试。

AI 摘要

一篇新论文发现 Transformer 的 Key 和 Value 投影可以共享同一映射，从而将 KV 缓存减少 50%，而困惑度仅上升 3.1%。最佳变体 Q-K=V 保留了 Query 的独立性，使注意力仍具有方向性。结合 GQA 和 MQA 时，缓存削减可达 87.5% 和 96.9%。弱变体 Q=K-V 因对称性不适合因果语言模型，且无缓存节省。该发现挑战了传统 QKV 三投影的必要性，对推理内存优化有重要意义。

AI 翻译 · 中文

rohanpaul_aiInteresting, this paper shows that Transformers may not need separate key and value projections to work well. This paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inf…

查看原推