LLMs/ISC: "Enhancing LLM Planning Capabilities through Intrinsic Self-Critique": Translation and Commentary

Overview: This paper proposes and validates a structured, iterative "intrinsic self-critique" method in which the model generates a plan step by step, checks the preconditions of each action, and feeds past failures and critiques back as context for revision. Without any external oracle, the method significantly improves planning accuracy on several classic planning benchmarks, offering a practical path and concrete engineering guidance for deploying LLM planning in unsupervised or low-supervision settings.

Background and pain points
● Gap between reasoning and planning: LLMs have made progress on natural-language planning tasks, but on classic planning domains that require strict state and precondition checking (e.g., Blocksworld, Logistics, Mini-grid) they still trail traditional planners, and earlier studies were skeptical that self-critique helps at all.
● Dependence on external verification: many iterative self-improvement methods rely on an external oracle/verifier or on human feedback to judge correctness, which is often unavailable or costly in real applications.
● Weak self-evaluation robustness: prior work reported that LLMs used as their own verifiers produce high false-positive rates and miss nearly all true negatives, so self-correction either fails or is actively misleading.

Proposed solution
● Intrinsic Self-Critique mechanism: an iterative self-improvement loop that relies only on the model itself as evaluator. Each round runs "generate plan → self-critique (check each action's preconditions and give justifications) → revise/regenerate using the critiques and past failures as context", until the plan is judged correct or an iteration cap is reached.
● Prompt design, few-shot extended to many-shot: start from carefully designed zero/few-shot prompts and strengthen self-improvement by adding failed attempts and their critiques to the context of the next round.
● Verification strategy based on precondition checks: the self-critique step prompts the model to verify each action's preconditions and produce the successor state, turning "does this satisfy the rules?" into the decision criterion and reducing the false positives that come from purely subjective judgment.

Core steps
● Hypothesis: a sufficiently strong or well-matched LLM can iterate toward better plans using only internal self-evaluation plus the history of past failures, with no external verifier, and substantially raise planning accuracy.
● Data and tasks: multiple planning benchmarks (Blocksworld with 3-5 and 3-7 blocks, Mystery Blocksworld, Logistics, Mini-grid, AutoPlanBench), posed in PDDL (symbolic) or natural language, where the model must return a plan that reaches the goal.
● Iterative loop: each iteration (1) generates a candidate plan, (2) self-critiques it by checking every action's preconditions and successor state, and (3) feeds the critique and earlier failures back as context for a revised plan, repeating until the plan passes evaluation or the budget is exhausted.
● Experimental validation: ablations (removing the failure-history context, changing the self-critique prompt format, varying the iteration count, comparing zero/few/many-shot) identify which components contribute most, tested across several models and benchmarks for reproducibility and scalability.

Advantages
● No external oracle: the LLM serves as both generator and judge, avoiding dependence on unavailable or expensive external verifiers and improving practicality.
● Large performance gains: significant accuracy improvements across planning benchmarks; for example, Gemini 1.5 Pro on Blocksworld (3-5 blocks) rises from about 49.8% to 89.3%, with clear gains on other datasets, and Claude 3.5 and GPT-4o also improve strongly on some benchmarks.
● Generality and portability: the intrinsic self-critique strategy combines with different LLM checkpoints (the study uses several October-2024 models) and benefits multiple domains (Blocksworld, Logistics, Mini-grid, AutoPlanBench), suggesting broad applicability.
● Controllable and inspectable design: requiring the model to check each action's preconditions and return states gives engineers structured diagnostic information, making failures easier to localize and prompts easier to improve.

Conclusions and practical recommendations
● Self-critique works but is model- and task-sensitive: intrinsic self-critique significantly improves planning accuracy in most experiments, especially for stronger models, but gains are not uniform; some models improve little in some domains.
● Prompting and context: supplying "failed attempt + its self-critique" as context for later rounds markedly helps the model avoid earlier error patterns; explicitly requiring stepwise precondition checks and intermediate states lowers the rate of false passes.
● Iterations vs. cost: more self-critique iterations generally help but with diminishing returns, added latency, and added cost; balance accuracy against inference cost with an iteration cap, early-stopping criteria, or multi-round critique only on hard instances.
● Model choice and scale: larger/stronger models (e.g., Gemini 1.5 Pro) self-improve more robustly; for mid-sized models, richer few-shot context or partial external verification can compensate for weaker self-evaluation.
● Evaluate cautiously: even though self-critique raises the share of correct answers, the quality of the critiques themselves should be spot-checked (automated metrics plus human sampling) to avoid over-trusting the model's self-assessment.

Contents
"Enhancing LLM Planning Capabilities through Intrinsic Self-Critique": Translation and Commentary
Abstract
1. Introduction
Figure 1
6. Conclusion

"Enhancing LLM Planning Capabilities through Intrinsic Self-Critique": Translation and Commentary
Paper: https://arxiv.org/abs/2512.24103
Date: December 30, 2025
Authors: Google DeepMind

Abstract
We demonstrate an approach for LLMs to critique their own answers with the goal of enhancing their performance that leads to significant improvements over established planning benchmarks. Despite the findings of earlier research that has cast doubt on the effectiveness of LLMs leveraging self-critique methods, we show significant performance gains on planning datasets in the Blocksworld domain through intrinsic self-critique, without an external source such as a verifier. We also demonstrate similar improvements on Logistics and Mini-grid datasets, exceeding strong baseline accuracies. We employ a few-shot learning technique and progressively extend it to a many-shot approach as our base method and demonstrate that it is possible to gain substantial improvement on top of this already competitive approach by employing an iterative process for correction and refinement. We illustrate how self-critique can significantly boost planning performance. Our empirical results present a new state-of-the-art on the class of models considered, namely LLM model checkpoints from October 2024.
Our primary focus lies on the method itself, demonstrating intrinsic self-improvement capabilities that are applicable regardless of the specific model version, and we believe that applying our method to more complex search techniques and more capable models will lead to even better performance.

1. Introduction
Recent advancements of Large Language Models (LLMs) have extended their applications to include planning, a domain traditionally dominated by algorithmic methods. LLM planning is important for a wide range of tasks in which the task has implicit or explicit constraints and the LLM must generate a plan satisfying these constraints. Early investigations into LLM planning capabilities were unfavorable (Valmeekam et al., 2023b), reinforcing the notion that language models cannot plan. However, subsequent studies have introduced techniques, such as many-shot learning on planning tasks, to enhance these capabilities (Agarwal et al., 2024; Bohnet et al., 2024). Despite these enhancements, LLMs still lag behind classic planners that tackle algorithmically complex problems. Our approach introduces further improvements that are very promising. For instance, where LLMs are typically tested on simplified problems such as Blocksworld with 3-5 (Valmeekam et al., 2023c) or 3-7 blocks (Bohnet et al., 2024), classic planners can be applied to considerably more complex problems.
Nevertheless, there is a variety of natural-language tasks such as planning holiday trips or scheduling meetings (Gemini-Team, 2024; Hao et al., 2024) which are less computationally demanding, less amenable to classic planners, and less structured (often being posed in natural language). These tasks are more difficult for classical planners than for LLMs, underscoring the practical relevance of improving LLMs' planning capabilities. This range of tasks, from natural-language to classic planning tasks, highlights the importance of continuing to enhance LLMs' planning abilities, even though a gap remains with classic planners for more specialized, complex planning tasks.

The concept of self-improvement in LLMs (refining their responses based on self-generated feedback, often referred to as self-criticism) has gained significant attention, often relying on an oracle for feedback (Shinn et al., 2023; Yao et al., 2023a,b). This technique is particularly appealing in the context of planning tasks, where the ability to identify and rectify errors to improve the plans could lead to substantial performance gains. Earlier attempts to enhance LLM planning through self-critique yielded discouraging results, principally due to the lack of sufficient self-evaluation capabilities (Huang et al., 2024; Valmeekam et al., 2023a).
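One way to make "insufficient self-evaluation capabilities" concrete is to score a critic's verdicts against a ground-truth plan validator. The sketch below is a hypothetical illustration (the function name and inputs are ours, not the paper's); it computes the two quantities the literature focuses on, the false-positive rate and the true-negative rate of the critic on invalid plans.

```python
# Hypothetical sketch: quantifying how well an LLM critic judges plans.
# `critic_says_valid` would come from the LLM's self-critique verdicts;
# `actually_valid` would come from a ground-truth plan validator.
def critic_confusion(critic_says_valid, actually_valid):
    """Return (false-positive rate, true-negative rate) over invalid plans."""
    fp = sum(1 for c, a in zip(critic_says_valid, actually_valid) if c and not a)
    tn = sum(1 for c, a in zip(critic_says_valid, actually_valid) if not c and not a)
    negatives = sum(1 for a in actually_valid if not a)
    fp_rate = fp / negatives if negatives else 0.0
    tn_rate = tn / negatives if negatives else 0.0
    return fp_rate, tn_rate

# A critic that passes almost every invalid plan (high FP rate, low TN rate)
# cannot drive self-improvement: its "looks correct" signal is uninformative.
```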
The application of self-critique in prior work resulted in models that had high false-positive (FP) rates and that missed nearly all of the true negatives (TN), suggesting that current LLMs cannot effectively critique themselves (Valmeekam et al., 2023a). Thus, earlier results on iterative plan improvements motivated the use of external feedback, such as verification with an oracle (that has access to the correct plan) or relying on human input to guide the process. While these results give an indication of how much headroom there is for improving the plans in the presence of a perfect evaluator, the assumption that oracles are available at test time is unrealistic for most planning tasks of interest.

Exploring the limits of intrinsic LLM self-improvement (improving generations without the aid of external signals or further training) remains an active area of research (Huang et al., 2024; Singh et al., 2024). In this work, we propose an effective self-improvement method using zero-shot or few-shot prompts without the need for an external verifier. Figure 1 illustrates the self-improvement method via intrinsic self-critique introduced in this paper, utilizing the LLM as the sole source of critique. The figure shows the iterative process of plan generation followed by a self-critique step, while adding previous failures as context (see Section 4). The figure also highlights the crucial components: a clear definition of the domain and instructions (e.g.
pre-conditions), proper prompt design, and the use of previous failures and critiques in subsequent plan refinement steps.

We investigate variations of our proposed method via ablation studies on its different elements. Moreover, we explore how the method scales across zero-shot and few-shot prompting and assess the impact of varying the number of self-critique iterations. We describe a novel approach to self-critique, where the LLM is prompted to evaluate the preconditions of each action and to provide the resulting state in a plan. Lastly, we examine the limitations of our method and evaluate the quality of the self-criticisms, aiming to provide a comprehensive understanding of the scalability and effectiveness of the proposed methods.

Our focus lies on demonstrating intrinsic self-improvement capabilities of LLMs, independent of the specific model version. To this end, we use model checkpoints from October 2024 in our empirical studies. We conduct exploratory experiments with Gemini 1.5 Pro (Gemini-Team et al., 2024) and confirm our results with other foundation models. Our work aims to demonstrate the applicability of our method to multiple foundation models, rather than comparing them to each other.

In the Blocksworld domain, Gemini 1.5 Pro achieves significant performance gains on planning datasets. With the dataset from Valmeekam et al. (2023c) involving 3-5 blocks, we enhance performance from 49.8% to 89.3% through intrinsic self-critique, without an external source such as a verifier.
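As a concrete illustration of the precondition-based self-critique just described, here is a small programmatic analogue for one Blocksworld action. The rules follow standard Blocksworld conventions; the function names and the state representation are illustrative, not the paper's prompt format.

```python
# Illustrative only: a programmatic analogue of the per-action precondition
# check the LLM is prompted to perform during self-critique.
def check_pickup(state, block):
    """Preconditions of pickup(block): hand empty, block clear and on the table."""
    errors = []
    if not state["hand_empty"]:
        errors.append("hand is not empty")
    if block not in state["clear"]:
        errors.append(f"{block} is not clear")
    if block not in state["on_table"]:
        errors.append(f"{block} is not on the table")
    return errors  # empty list means the preconditions are satisfied

def apply_pickup(state, block):
    """Successor state after a valid pickup(block)."""
    return {
        "hand_empty": False,
        "holding": block,
        "clear": state["clear"] - {block},
        "on_table": state["on_table"] - {block},
    }
```

In the paper's setting both the check and the successor state are produced by the LLM itself in text; making them explicit like this is what turns the critique into a rule-based judgment rather than a subjective one.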
Similarly, for the dataset by Bohnet et al. (2024) with 3-7 blocks, we boost Gemini 1.5 Pro accuracy from 57.2% to 79.5%. We also demonstrate similar improvements on Logistics and Mini-grid datasets, exceeding strong baseline accuracies. In addition to the results on Gemini 1.5 Pro, we have positive results on Blocksworld 3-5 with Claude 3.5 Sonnet (Anthropic, 2023), where accuracy improves from 68% to 89.5%, and with GPT-4o (OpenAI et al., 2024). With Gemma-2 27B (Team et al., 2024) we see only a modest improvement for Logistics and none for Blocksworld, suggesting that larger and more capable models are better able to self-improve.

The paper is organized as follows: Section 2 reviews related work. Section 4 details the methods employed and Section 3 provides an overview of the datasets we use in our experiments. Section 5 presents the main findings across several datasets from related work, exploratory experiments on a substantial number of validation examples to select hyper-parameters, and an in-depth investigation of which contributions of the self-critique method were important to improve the state-of-the-art.
Finally, Section 6 concludes the paper.

Figure 1 | Illustration of the iterative self-improvement process for Large Language Models (LLMs) using in-context learning to incorporate self-critique feedback. The LLM, represented by the explorer character, functions as the agent's brain, accepting prompts as inputs and generating outputs, represented by green and red semi-circles, respectively. Each iteration of the self-improvement mechanism comprises two key steps: i) plan generation and ii) self-critiquing, aimed at iteratively refining LLM outputs. In step i), the LLM generates a plan (symbolized by a map) based on a prompt incorporating domain-specific knowledge and instructions (symbolized by the treasure chest). Step ii) involves a self-critique mechanism where the LLM evaluates its own performance, providing correctness assessments and justifications, again leveraging domain knowledge. The process continues until a plan deemed correct is identified.
Previous plans and their associated self-critique feedback are aggregated into a collection (symbolized by a bag), serving as contextual material for subsequent plan generation cycles.

6. Conclusion
This work shows that self-improvement using intrinsic self-critique can significantly enhance performance on standard planning benchmarks when properly implemented. We obtain substantial performance gains on all of the studied benchmarks. Problems that had previously been challenging are now solvable with high accuracy, presenting state-of-the-art results for the model class considered, such as in Blocksworld 3-5, where we achieve an 89.3% success rate employing self-critique with self-consistency. This work is also the first to demonstrate that LLMs can solve Mystery Blocksworld problems with 22% accuracy and achieve substantial accuracy gains, reaching 37.8%, when self-improvement using self-critique and self-consistency is employed.

Furthermore, there is potential to swap out in-context learning for more sophisticated planning methods, such as Chain-of-Thought or Self-Consistency, or even to integrate these with advanced search-based algorithms like Monte-Carlo Tree Search to further improve accuracy or tackle even more complex tasks (Coulom, 2007; Madaan et al., 2023; Wei et al., 2022).

Lastly, this work not only demonstrates the viability of self-improvement using self-critique in enhancing planning accuracy but also lays the groundwork for a new paradigm in AI planning.
By bridging the gap between symbolic planning and language models, we open up possibilities for tackling more complex, real-world planning scenarios and pushing the boundaries of AI problem-solving capabilities by leveraging LLMs.
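The generate-critique-refine loop of Figure 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` and `critique` stand in for prompts to the same LLM (in the paper both roles are played by one model, with no external verifier), and all names are hypothetical.

```python
# Minimal sketch of the intrinsic self-critique loop (cf. Figure 1).
# `generate(task, history)` returns a candidate plan given the task and the
# accumulated (failed plan, critique) pairs; `critique(task, plan)` returns
# (is_correct, feedback). Both would be LLM calls in the actual method.
def self_improve(task, generate, critique, max_iters=5):
    history = []  # previous failures and critiques, fed back as context
    plan = None
    for _ in range(max_iters):
        plan = generate(task, history)            # step i): plan generation
        verdict, feedback = critique(task, plan)  # step ii): self-critique
        if verdict:                               # critic deems the plan correct
            break
        history.append((plan, feedback))          # failure becomes context
    return plan, history
```

The iteration cap, and optionally an early-stopping criterion, is where the accuracy-versus-cost trade-off discussed above is controlled.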