Bridge-Garden Theory: Mixing Hard and Soft Labels Improves LLM Distillation

Why It Was Selected

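Teams working on LLM distillation finally have theoretical guidance: the Bridge-Garden theory explains why mixed labels work and leads directly to a practical method, with training cost reduced by a factor of 9.7. Developers working on model compression should take a look.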

AI Summary

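The paper finds that in LLM knowledge distillation, mixing the teacher model's hard labels (sampled tokens) with its soft labels (full distributions) works better than using either one alone. The authors propose the Bridge-Garden decomposition, which splits generation steps into two kinds: "bridges," where an exact token is required, and "gardens," where the choice is flexible. Hard labels handle bridges well, soft labels handle gardens well, and the mixed strategy reduces the exposure bias between training and inference. A mixed-supervision method built on this theory outperforms existing baselines on 7 teacher-student pairs (including Qwen, Llama, Gemma, and DeepSeek) while cutting training cost by a factor of 9.7. The code is open source.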
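To make the hard/soft distinction concrete, here is a minimal sketch of a per-step loss that blends the two signals. The summary does not state how the paper actually combines them, so the fixed mixing weight `alpha` and every function name below are illustrative assumptions, not the authors' method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, sampled_token):
    """Cross-entropy against a single token sampled from the teacher."""
    probs = softmax(student_logits)
    return -math.log(probs[sampled_token])

def soft_label_loss(student_logits, teacher_probs):
    """Forward KL(teacher || student) over the full vocabulary."""
    probs = softmax(student_logits)
    return sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher_probs, probs)
        if t > 0  # terms with zero teacher mass contribute nothing
    )

def mixed_loss(student_logits, teacher_probs, sampled_token, alpha=0.5):
    """Blend of both losses; alpha is a hypothetical fixed weight."""
    hard = hard_label_loss(student_logits, sampled_token)
    soft = soft_label_loss(student_logits, teacher_probs)
    return alpha * hard + (1.0 - alpha) * soft

# One generation step over a toy 3-token vocabulary.
teacher_probs = [0.7, 0.2, 0.1]
student_logits = [0.0, 0.0, 0.0]  # an untrained, uniform student
print(mixed_loss(student_logits, teacher_probs, sampled_token=0))
```

Setting `alpha=1.0` recovers pure hard-label training and `alpha=0.0` pure soft-label training, which are the two baselines the summary says the mixed approach beats.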

AI Translation · Chinese

arXiv: DeepSeek

Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or the teacher's full …
