RLM-Cascade: 响应级投机解码代理层系统降低LLM API成本45.8%

精选理由

这个系统把DeepSeek和Opus组合起来，用投机解码省了近一半API成本，还快了一倍，质量也有提升，而且开源可部署。

AI 摘要

RLM-Cascade是一个代理层投机解码系统，在响应级别优化LLM API调用。它使用DeepSeek作为草稿模型、Opus作为验证模型，并通过轻量复杂度路由器选择路径。在Claude Code生产环境中，系统达到88.8%的草稿使用率，API成本相比直接使用Opus降低45.8%。P50延迟从3698毫秒降至2026毫秒，实现1.83倍加速。在20个Code/Math/Instruct任务基准上，RLM-Cascade通过率达100%，高于Opus的95%。

AI 翻译 · 中文

arXiv: DeepSeekWe present RLM-Cascade, a proxy-layer system that applies speculative decoding at the response level to reduce LLM API costs without requiring model architecture access or a shared vocabulary. A fast, inexpensive draft m…

阅读原文