Complementary Attention Head Pruning for Efficient Transformers

Why it was selected

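Want to compress a Transformer model? CAHP automatically prunes redundant attention heads without manual tuning, outperforms gradient-based methods on SST-5 and MNLI, and preserves the key structure in the middle layers.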

AI Summary

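CAHP reframes attention head selection as a global graph-theoretic problem, using graph clustering and an information-theoretic distance to identify complementary subsets of heads. The method requires no predefined sparsity level: it automatically determines how many heads to keep in each layer by examining the marginal performance degradation curve. On the SST-5 and MNLI benchmarks, CAHP outperforms gradient-based methods across Transformers of different sizes, especially at high compression rates. Structural analysis shows that CAHP avoids the "proximity bias" of gradient-based methods and retains the functionally critical heads in the model's middle layers.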
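To make the idea concrete, here is a minimal sketch of the two ingredients the summary mentions: a distance between heads and a rule for how many heads to keep. It is not the paper's algorithm. The summary does not say which information-theoretic distance, which clustering method, or which stopping rule CAHP uses, so the Jensen-Shannon distance, the greedy farthest-point selection (a stand-in for graph clustering), the tolerance `tol`, and all function names below are assumptions made for illustration.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions.
    Assumption: the source does not name the distance CAHP actually uses."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def select_complementary(heads, k):
    """Greedily pick k mutually distant heads (hypothetical stand-in for
    the graph clustering step). Each head is an attention distribution."""
    n = len(heads)
    dist = [[js_distance(heads[i], heads[j]) for j in range(n)] for i in range(n)]
    # Start from the head with the largest total distance to all others.
    chosen = [max(range(n), key=lambda i: sum(dist[i]))]
    while len(chosen) < k:
        rest = [i for i in range(n) if i not in chosen]
        # Add the head whose nearest already-chosen head is farthest away.
        chosen.append(max(rest, key=lambda i: min(dist[i][j] for j in chosen)))
    return sorted(chosen)

def heads_to_keep(scores, tol=0.01):
    """scores[i] is the validation score when keeping i + 1 heads.
    Drop heads from the top while the marginal gain of the last head
    stays below tol (an assumed stopping rule, not the paper's)."""
    k = len(scores)
    while k > 1 and scores[k - 1] - scores[k - 2] < tol:
        k -= 1
    return k

# Toy data: heads 0 and 1 are near-duplicates, head 2 attends elsewhere,
# head 3 is uniform. The scores are made up for illustration.
toy_heads = [
    [0.70, 0.10, 0.10, 0.10],
    [0.68, 0.12, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.25, 0.25, 0.25, 0.25],
]
toy_scores = [0.40, 0.48, 0.52, 0.525, 0.527]
kept = select_complementary(toy_heads, 2)
budget = heads_to_keep(toy_scores)
```

On the toy data, the selection keeps only one of the two near-duplicate heads, which is the sense of "complementary" illustrated here; a faithful implementation would follow the distance and clustering definitions given in the paper itself.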

AI Translation · Chinese

arXiv cs.LG

The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments.…
