Attention-State Memory: A New Training-Free Method for Long-Context Generation

Why it was selected

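The attention bottleneck in long-context inference finally has a lightweight solution: training-free, updatable, and memory-efficient. Teams working on LLM inference optimization or long-document applications should take note.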

AI summary

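Modern large language models rely on long prefixes to control behavior at inference time, but the prefix's influence decays as generation proceeds, and the cost of attention grows linearly with prefix length. Existing methods either compress the prefix but still require attention computation, or internalize the prefix through gradient-based training, which makes it hard to update. This paper proposes attention-state memory, a training-free method that externalizes the precomputed attention states for the prefix and query tokens into a lightweight lookup table. On ManyICLBench, LLaMA-3.1-8B exceeds the accuracy of in-context learning at memory budgets of 1K to 8K while reducing attention latency by a factor of 1.36; on the NBA benchmark, it outperforms full-attention RAG using only 20% of the memory.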
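The summary does not say how the lookup table is built or queried, so the following is only a rough sketch of the general idea and not the paper's algorithm. It relies on a standard identity: softmax attention over two disjoint key sets can be merged exactly from each part's output and log-sum-exp. The function names, the toy data, and above all the assumption that a token's query vector is fixed by its token id (which makes a per-token table possible here) are illustrative assumptions.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over (keys, values).
    Returns (output vector, log-sum-exp of the scaled scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                       # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)

def merge(state_a, state_b):
    """Combine two partial attention states into the state over both key sets."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    total = wa + wb
    out = [(wa * a + wb * b) / total for a, b in zip(out_a, out_b)]
    return out, m + math.log(total)

def build_table(token_queries, prefix_keys, prefix_values):
    """Hypothetical lookup table: token id -> precomputed prefix attention state.
    Assumes (for illustration only) that each token id has a fixed query vector."""
    return {tok: attend(q, prefix_keys, prefix_values)
            for tok, q in token_queries.items()}

# Toy data, invented for this example.
prefix_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prefix_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
local_keys = [[0.5, -0.5], [2.0, 0.0]]
local_values = [[-1.0, 0.0], [0.0, 1.0]]
token_queries = {7: [0.3, -0.2], 42: [1.0, 1.0]}

# Offline: precompute the prefix's contribution once per token id.
table = build_table(token_queries, prefix_keys, prefix_values)

# Online: attend only over the short local context, then merge the cached state.
q = token_queries[42]
merged_out, _ = merge(table[42], attend(q, local_keys, local_values))

# Reference: full attention over prefix + local context.
full_out, _ = attend(q, prefix_keys + local_keys, prefix_values + local_values)
```

In this sketch the online step never touches the prefix keys, so its cost depends only on the local context length; the merged output matches full attention up to floating-point error. A real model would have to handle query vectors that depend on context and position, which this toy version ignores.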

arXiv cs.AI

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitati…
