MiniMax releases MSA: two-branch sparse attention that cuts attention compute by about 28x at 1M context

Why it was selected

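MiniMax's new sparse attention mechanism, MSA, cuts per-token attention compute by about 28x at a 1M-token context while matching GQA on downstream benchmarks, which makes it a good fit for long-context workloads.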

AI summary

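MiniMax has released MSA, a sparse attention mechanism built on the Grouped Query Attention (GQA) architecture. MSA adds a lightweight index branch that selects the Top-k key-value blocks for each query and GQA group; the main branch then attends only to those blocks. At a 1M-token context length, this reduces attention compute per token by 28.4x. The mechanism was trained in a 109B-parameter MoE model on a 3T-token budget, and it performs on par with GQA on downstream benchmarks.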
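The two-branch flow described in the summary can be sketched in a few lines of plain Python. This is a minimal illustration, not MiniMax's implementation: the function names `select_blocks` and `sparse_attention` are hypothetical, and the summary does not say how the index branch scores blocks, so the mean-key block summary and the summing of scores across the heads of a GQA group are both assumptions made here for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(q_heads, keys, block_size, top_k):
    """Index branch (sketch): pick top_k key blocks once per GQA group."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Assumption: a block is summarized by the mean of its keys.
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        # Assumption: heads in the group share one selection by summing scores.
        scores.append(sum(dot(q, mean_key) for q in q_heads))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(q_heads, keys, values, block_size, top_k):
    """Main branch (sketch): softmax attention over the selected blocks only."""
    blocks = select_blocks(q_heads, keys, block_size, top_k)
    # Token positions covered by the selected blocks.
    idx = [i for b in blocks
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in q_heads:
        logits = [dot(q, keys[i]) * scale for i in idx]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append([
            sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(len(values[0]))
        ])
    return outputs, blocks


# One query token, two query heads sharing one key/value head (one GQA group).
keys = [[0, 1], [0, 1], [5, 0], [5, 0], [-1, 0], [-1, 0], [3, 0], [3, 0]]
values = [[float(i)] for i in range(len(keys))]
q_heads = [[1, 0], [1, 0]]
outputs, blocks = sparse_attention(q_heads, keys, values, block_size=2, top_k=2)
```

In this toy setup each query head computes attention over 4 of the 8 keys, so the main branch's cost scales with `top_k * block_size` rather than with the full context length. The index branch still touches every block, which is why it needs to be lightweight.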

AI translation · Chinese

marktechpost · MiniMax released MSA, a sparse attention mechanism built on Grouped Query Attention. A lightweight Index Branch selects Top-k key-value blocks per query and GQA group; the Main Branch attends only to those blocks. It matches GQA o…
