PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

Why it was selected

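This paper proposes PearlVLA, which moves action planning into latent space. The aim is lower latency than conventional textual chain reasoning, and the authors report state-of-the-art results on LIBERO. Worth a look if you work on embodied AI.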

AI summary

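PearlVLA is a framework that moves action planning into the latent space of a VLM. It splits the meta-query representations into a visual grounding branch and an iterative latent plan branch, queries a frozen latent world model for future observation latents, and after K rounds of refinement decodes the plan in parallel into an action chunk. On the LIBERO benchmark, PearlVLA achieves the best performance among existing methods, which the summary takes as evidence that latent-space reasoning can improve planning quality while reducing latency.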

Original · arXiv cs.AI

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.
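The refinement loop described in the abstract can be sketched in a few lines. This is only an illustrative toy under loudly stated assumptions, not the authors' implementation: the random linear maps standing in for the world model, RefineNet, and the decoder, the additive form of the world query, the `1 / (k + 1)` step schedule, and every name and dimension below are hypothetical. It shows only the control flow: a fixed grounding vector, a plan vector that receives K scheduled residual updates guided by an imagined future latent, and a single decoding pass that emits the whole action chunk at once.

```python
import math
import random

DIM = 8          # latent width (illustrative assumption)
CHUNK_LEN = 4    # actions per chunk (illustrative assumption)
ACTION_DIM = 7   # values per action (illustrative assumption)


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def build_modules(seed=0):
    """Build toy stand-ins for the three learned components."""
    rng = random.Random(seed)
    w_world = make_matrix(DIM, DIM, rng)
    w_refine = make_matrix(DIM, 2 * DIM, rng)
    w_decode = make_matrix(CHUNK_LEN * ACTION_DIM, 2 * DIM, rng)

    def world_model(query):
        # "Frozen": these weights are never updated anywhere in this sketch.
        return [math.tanh(v) for v in matvec(w_world, query)]

    def refine_net(plan, future):
        # Proposes a residual edit from the current plan and the imagined future.
        return [math.tanh(v) for v in matvec(w_refine, plan + future)]

    def decode(grounding, plan):
        # One pass produces every action in the chunk (no per-step loop).
        flat = matvec(w_decode, grounding + plan)
        return [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(CHUNK_LEN)]

    return world_model, refine_net, decode


def refine_plan(grounding, draft_plan, world_model, refine_net, num_rounds):
    """Apply num_rounds scheduled residual updates to a coarse draft plan."""
    plan = list(draft_plan)
    for k in range(num_rounds):
        # Plan-conditioned world query (assumed here to be a simple sum).
        query = [p + g for p, g in zip(plan, grounding)]
        future = world_model(query)
        step = 1.0 / (k + 1)  # hypothetical schedule: later edits are smaller
        delta = refine_net(plan, future)
        plan = [p + step * d for p, d in zip(plan, delta)]
    return plan


rng = random.Random(1)
grounding = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # fixed visual grounding latent
draft = [rng.gauss(0.0, 1.0) for _ in range(DIM)]      # coarse semantic draft
world_model, refine_net, decode = build_modules()
plan = refine_plan(grounding, draft, world_model, refine_net, num_rounds=3)
chunk = decode(grounding, plan)
print(len(chunk), len(chunk[0]))
```

Note that the grounding vector is read in every round but never written, which mirrors the fixed-branch versus iterative-branch split in the abstract, and that with `num_rounds=0` the draft is decoded unchanged.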
