LLM之LRMs:《Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction》翻译与解读
导读:这篇论文研究的是大型推理模型 (LRMs) 在提示优化方面的有效性,特别是将其应用于事件抽取任务。该论文证明了即使对于强大的LRMs,提示优化仍然是有效的,并且LRMs本身可以作为高效且稳定的提示优化器。 这为未来利用LRMs进行提示优化提供了重要的理论和实践指导。
>> 背景痛点:
● 大型语言模型 (LLMs) 在复杂推理任务上的能力有限: 尽管LLMs在各种自然语言处理任务中表现出色,但在需要复杂推理的任务(例如事件抽取)上表现仍然不足。
● 提示优化对LLMs至关重要,但对LRMs的影响尚不明确: 传统的提示优化方法在提升LLMs性能方面非常有效;但由于LRMs具备强大的推理能力,人们对其是否仍需提示优化存有疑问,且目前缺乏针对LRMs提示优化的系统性研究。
● 现有提示优化研究多集中在零样本基线表现良好的任务上: 许多研究忽略了像事件抽取这样对推理能力要求极高的任务,而这些任务即使对于强大的模型如GPT-4也存在挑战。
>> 具体的解决方案:论文提出了一种基于蒙特卡洛树搜索 (MCTS) 的提示优化框架,系统地研究了LRMs在事件抽取任务中的提示优化效果,并将其与LLMs进行了比较。该框架包含以下步骤:
>> 核心思路步骤:
● 问题设定: 将提示优化定义为一个离散搜索问题,目标是找到一个最佳提示,最大化事件抽取任务的F1分数。
● 提示表示: 使用Python代码表示模型的输入和输出,初始提示包含任务指令和事件模式(用Python类定义,并附带人工编写的指导说明)。
● MCTS框架: 使用MCTS算法探索提示空间(整体迭代流程可参见本节步骤列表之后的简化代码示意),在每个迭代中:
● 答案生成: 使用任务模型 (Mtask) 对当前提示和输入文本生成答案。
● 错误提取: 使用Python解释器识别答案中的错误(例如解析错误、未定义的事件类、幻觉的跨度等)。
● 反馈生成: 使用优化器模型 (Mopt) 分析错误并生成反馈,建议修改任务指令和事件指导说明。
● 提示优化: 使用优化器模型根据反馈生成更新后的提示。
● 奖励评估: 在开发集上评估更新后的提示,并使用平均F1分数作为奖励函数。
● 反馈生成和提示优化: 使用元提示指导优化器模型生成结构化的反馈,并根据反馈更新提示。
● 模型评估: 使用四个F1分数指标评估模型性能:触发词识别 (TI)、触发词分类 (TC)、论元识别 (AI) 和论元分类 (AC)。
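为便于理解上述迭代流程,下面给出一个高度简化的 Python 伪实现示意。其中 call_task_model、call_optimizer、extract_errors、evaluate_f1 等函数均为笔者为说明而假设的占位接口,并非论文的实际实现;论文使用完整的 MCTS 搜索(含节点选择与扩展),此处仅按单条搜索路径展开"生成答案 → 提取错误 → 生成反馈并更新提示 → 评估奖励"的循环。

```python
# 简化示意(非论文实现):单条搜索路径上的提示优化循环。
# call_task_model / call_optimizer 为假设的模型调用接口;
# extract_errors 借助 Python 解释器检查输出(解析错误、未定义事件类、幻觉跨度等);
# evaluate_f1 在开发集上计算平均 F1 作为节点奖励。
from typing import List, Tuple

def call_task_model(prompt: str, text: str) -> str:
    """M_task:根据当前提示对输入文本生成 Python 代码形式的抽取结果(占位)。"""
    raise NotImplementedError

def call_optimizer(prompt: str, errors: List[str]) -> str:
    """M_opt:分析错误、生成反馈并返回更新后的提示(占位;论文中分为反馈生成与提示改写两步)。"""
    raise NotImplementedError

def extract_errors(output: str, gold) -> List[str]:
    """执行/校验模型输出,收集错误描述(占位)。"""
    raise NotImplementedError

def evaluate_f1(prompt: str, dev_set) -> float:
    """在开发集上运行任务模型,返回 TI/TC/AI/AC 四项 F1 的平均值(占位)。"""
    raise NotImplementedError

def optimize_prompt(init_prompt: str, train_batches, dev_set, depth: int = 3) -> Tuple[str, float]:
    prompt, best_prompt, best_reward = init_prompt, init_prompt, float("-inf")
    for _ in range(depth):                      # 对应搜索树的若干层扩展
        for batch in train_batches:             # 在训练样本批次 D_train 上收集错误
            errors: List[str] = []
            for text, gold in batch:
                output = call_task_model(prompt, text)
                errors.extend(extract_errors(output, gold))
            if errors:                          # 有错误才触发反馈与提示更新
                prompt = call_optimizer(prompt, errors)
        reward = evaluate_f1(prompt, dev_set)   # 节点奖励 r_t
        if reward > best_reward:
            best_prompt, best_reward = prompt, reward
    return best_prompt, best_reward
```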
>> 优势:
● 系统性研究: 首次系统性地研究了LRMs的提示优化,并与LLMs进行了比较。
● 挑战性任务: 在具有挑战性的事件抽取任务上进行了实验,该任务需要复杂的推理能力。
● 统一框架: 使用统一的MCTS框架评估LRMs作为任务模型和优化器的性能。
● 多模型比较: 实验了两种LRMs (DeepSeek-R1 和 o1) 和两种LLMs (GPT-4.5 和 GPT-4o)。
● 资源条件考虑: 考虑了低资源和中等资源场景。
>> 结论和观点:
● LRMs受益于提示优化: LRMs从提示优化中获得的收益比LLMs更大,即使在训练集非常小的情况下也是如此。
● LRMs是更好的提示优化器: LRMs作为优化器可以生成更高质量的提示,这些提示通常更简洁、更精确,包含了任务特定的启发式规则和异常处理规则。
● LRMs作为优化器更高效稳定: LRMs引导任务模型更快、更稳定地达到最佳性能。
● 提示长度与性能的关系: 较短的提示并不一定意味着较低的性能,不同的任务模型可能对不同长度的提示有不同的偏好。DeepSeek-R1在最短的提示下取得了最佳性能。
● 错误分析: LRMs优化的提示可以减少多种错误,例如论元过度预测、幻觉和解析错误。
目录
《Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction》翻译与解读
地址 | 论文地址:[2504.07357] Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction
时间 | 2025年4月10日
作者 | Saurabh Srivastava & Ziyu Yao(George Mason University)
Abstract
Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines. | 诸如 DeepSeek-R1 和 OpenAI o1 这样的大型推理模型(LRMs)在各种推理任务中展现出了卓越的能力。它们生成和推理中间思想的强大能力也引发了争论,认为它们可能不再需要大量的提示工程或优化来理解人类指令并生成准确的输出。在本研究中,我们旨在系统地研究这一开放性问题,以事件抽取这一结构化任务作为案例研究。我们对两种大型推理模型(DeepSeek-R1 和 o1)以及两种通用大型语言模型(LLMs)(GPT-4o 和 GPT-4.5)进行了实验,考察它们作为任务模型或提示优化器时的表现。我们的结果表明,在像事件抽取这样复杂的任务中,作为任务模型的大型推理模型仍然能从提示优化中获益,并且使用大型推理模型作为提示优化器能生成更有效的提示。最后,我们对大型推理模型常见的错误进行了分析,并强调了它们在细化任务指令和事件指南方面的稳定性和一致性。 |
1、Introduction
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. However, their proficiency in complex reasoning tasks has often been limited (Zhou et al., 2022). To address this, a new class of models, known as Large Reasoning Models (LRMs), has emerged, focusing on enhancing reasoning abilities through advanced training methodologies. One prominent example is DeepSeek-R1 (Guo et al., 2025), an open-source LRM that has achieved state-of-the-art performance on several reasoning benchmarks, including MATH-500 (Lin et al., 2025) and SWE-bench Verified (Jimenez et al., 2023). Similarly, OpenAI’s o1 (Zhong et al., 2024) has set new standards in reasoning tasks, showcasing superior performance in complex problem-solving scenarios. The advent of these advanced reasoning models has sparked discussions (Wang et al., 2024a; OpenAI, 2025; Mantaras, 2025; Together AI, 2025; Menendez et al., 2025) about the necessity of prompt optimization—the process of refining input prompts to guide model outputs effectively (Zhou et al., 2022; Yang et al., 2024; Srivastava et al., 2024; Agarwal et al., 2024; Guo et al., 2024; Fernando et al., 2024; Li et al., 2025). Traditionally, prompt optimization has been crucial for enhancing LLM performance, with frameworks like PromptAgent (Wang et al., 2024b) and OPRO (Yang et al., 2024) automating the creation and refinement of prompts through iterative feedback and strategic planning. However, the inherent reasoning capabilities of LRMs like DeepSeek-R1 and o1 raise questions about whether such prompt optimization techniques are equally beneficial for these models. While previous studies have demonstrated the effectiveness of prompt optimization in improving LLM performance, there is a notable gap in research focusing on its impact on LRMs. Moreover, many existing prompt optimization studies focus on tasks where zero-shot baselines already perform well, whereas recent work, such as Gao et al. (2024), demonstrates that even powerful models like GPT-4 struggle with Information Extraction tasks, underscoring the need for more targeted and optimized prompting strategies. We present a discussion on related works in Appendix A. | 近年来,大型语言模型(LLMs)在各种自然语言处理任务中展现出了卓越的能力。然而,它们在复杂推理任务上的表现往往受限(Zhou 等人,2022)。为解决这一问题,一类新的模型——大型推理模型(LRMs)应运而生,通过先进的训练方法来提升推理能力。其中,DeepSeek-R1(Guo 等人,2025)是一个开源的 LRM,在包括 MATH-500(Lin 等人,2025)和 SWE-bench Verified(Jimenez 等人,2023)在内的多个推理基准测试中取得了最先进的成绩。同样,OpenAI 的 o1(Zhong 等人,2024)在推理任务中树立了新的标杆,在复杂问题解决场景中表现出了卓越的性能。 这些先进推理模型的出现引发了关于提示优化必要性的讨论(Wang 等人,2024a;OpenAI,2025;Mantaras,2025;Together AI,2025;Menendez 等人,2025),提示优化即通过改进输入提示来有效引导模型输出的过程(Zhou 等人,2022;Yang 等人,2024;Srivastava 等人,2024;Agarwal 等人,2024;Guo 等人,2024;Fernando 等人,2024;Li 等人,2025)。传统上,提示优化对于提升大型语言模型(LLM)的性能至关重要,诸如 PromptAgent(Wang 等人,2024b)和 OPRO(Yang 等人,2024)之类的框架通过迭代反馈和策略规划来自动创建和优化提示。然而,像 DeepSeek-R1 和 o1 这样的大型推理模型(LRM)所具备的内在推理能力引发了这样的疑问:这些提示优化技术对这些模型是否同样有益。尽管先前的研究已经证明了提示优化在提升 LLM 性能方面的有效性,但针对其对 LRM 影响的研究却存在显著空白。此外,许多现有的提示优化研究都集中在零样本基线表现良好的任务上,而最近的工作,如 Gao 等人(2024)所表明的,即使是像 GPT-4 这样强大的模型在信息抽取任务上也存在困难,这凸显了对更有针对性、更优化的提示策略的需求。我们在附录 A 中对相关工作进行了讨论。 | To fill this gap, we conduct the first systematic study of prompt optimization with LRMs and compare their performance with LLMs. In particular, we experimented with these models on a challenging task, i.e., end-to-end event extraction (EE), a structured prediction task of information extraction that requires identifying and classifying event triggers and arguments within text. 
EE poses unique challenges: models must follow schema constraints, handle coreference, and balance precision with recall, all of which demand nuanced reasoning. We evaluated four models, two LRMs (DeepSeek-R1, o1) and two LLMs (GPT-4.5, GPT-4o) as both task models and prompt optimizers within a Monte Carlo Tree Search (MCTS) framework (Wang et al., 2024b). This setup allows us to examine both task performance and prompt optimization quality under a consistent setting. Our findings are organized around the following research questions: 1. Do LRMs benefit from prompt optimization? We find that LRMs such as DeepSeek-R1 and o1 show substantial gains from prompt optimization, outperforming their non-optimized versions as well as LLMs, even when the training set is extremely small, showing that even strong reasoning models still benefit significantly from prompt optimization. 2. How do LRMs behave under the full-scale MCTS prompt optimization? Using our MCTS-based framework, we analyze how model performance evolves across optimization depth. LRMs scale more consistently than LLMs, converging faster and with less variance. For instance, DeepSeek-R1 achieves peak performance by depth 2, while LLMs require deeper exploration and still underperform. 3. Do LRMs make better prompt optimizers? LRMs generate high-quality prompts when used as optimizers, often (especially for DeepSeek-R1) producing shorter, more precise prompts than LLMs. These prompts contain extraction rules and exception cases that mirror human annotation guidelines, leading to better downstream task performance. 4. Can LRMs act as efficient and stable optimizers in prompt optimization? When used as optimizers, LRMs guide models to peak performance more efficiently than LLMs. They help task models achieve convergence at shallower MCTS depth with lower variance across nodes, indicating both faster and greater stability. | 为了填补这一空白,我们对大型推理模型(LRMs)的提示优化进行了首次系统研究,并将其表现与大型语言模型(LLMs)进行了比较。具体而言,我们在一项具有挑战性的任务上对这些模型进行了实验,即端到端事件抽取(EE),这是一个信息抽取的结构化预测任务,需要在文本中识别和分类事件触发词和论元。EE 带来了独特的挑战:模型必须遵循模式约束、处理共指关系,并在准确率和召回率之间取得平衡,所有这些都需要细致入微的推理。我们在一个蒙特卡罗树搜索(MCTS)框架(Wang 等人,2024b)中评估了四种模型,包括两个大型推理模型(DeepSeek-R1、o1)和两个大型语言模型(GPT-4.5、GPT-4o),它们既作为任务模型又作为提示优化器。这种设置使我们能够在一致的环境中考察任务性能和提示优化质量。我们的研究结果围绕以下研究问题展开: 1. 大型推理模型(LRMs)能从提示优化中获益吗?我们发现诸如 DeepSeek-R1 和 o1 这样的 LRMs 在经过提示优化后表现出了显著的提升,不仅超越了未优化的版本,还超过了大型语言模型(LLMs),即便训练集极其有限也是如此。这表明即便是强大的推理模型也能从提示优化中获得显著收益。 2. 在全规模的蒙特卡罗树搜索(MCTS)提示优化下,LRMs 的表现如何?利用我们基于 MCTS 的框架,我们分析了模型性能随优化深度的演变情况。LRMs 比大型语言模型(LLMs)的扩展更为稳定,收敛速度更快且方差更小。例如,DeepSeek-R1 在深度为 2 时就达到了峰值性能,而大型语言模型则需要更深入的探索,且表现仍不如前者。 3. LRMs 是否能成为更出色的提示优化器?当用作优化器时,LRMs 能生成高质量的提示,通常(尤其是对于 DeepSeek-R1)生成的提示比大型语言模型(LLMs)更短、更精准。这些提示包含提取规则和例外情况,与人类标注指南相呼应,从而能提升下游任务的表现。4. LRMs 能否在提示优化中充当高效且稳定的优化器?当用作优化器时,LRMs 比大型语言模型(LLMs)更有效地引导模型达到最佳性能。它们帮助任务模型在更浅的蒙特卡罗树搜索(MCTS)深度实现收敛,且节点间的方差更低,这表明收敛速度更快且稳定性更强。 | Finally, our analyses show that LRMs generally produce more effective prompts. These optimized prompts often include task-specific heuristics and exception handling rules, which help reduce common trigger-related mistakes such as identifying multiple or implicit events, and slightly mitigate argument-level errors like coreferences and span overprediction. Among all the models in our experiments, DeepSeek-R1 produced the shortest (yet most effective) prompts. Interestingly, we observe that a longer prompt is not necessarily a more effective one, and various task models may have different preferences over various lengths of prompts. 
These findings align with the guidance on prompting LRMs (Mantaras, 2025; Together AI, 2025; OpenAI, 2025), which recommends using concise, focused instructions that avoid extraneous or overly complex phrasing, but in the meantime supplying the models with necessary task specifications. Our work demonstrates that, even with LRMs, prompt optimization is still valuable by automatically optimizing the prompt to be task-targeted yet concise. | 最后,我们的分析表明,LRMs 通常能生成更有效的提示。这些优化后的提示往往包含特定任务的启发式方法和异常处理规则,有助于减少常见的触发词相关错误,例如识别出多个事件或隐含事件,还能略微减轻诸如共指和跨度过度预测等论元级错误。在我们实验中的所有模型中,DeepSeek-R1 生成的提示最短(但最有效)。有趣的是,我们发现较长的提示并不一定更有效,不同的任务模型可能对不同长度的提示有不同的偏好。这些发现与关于如何提示 LRMs 的指导(Mantaras,2025;Together AI,2025;OpenAI,2025)相符,这些指导建议使用简洁、聚焦的指令,避免冗余或过于复杂的措辞,但同时要为模型提供必要的任务说明。我们的工作表明,即使对于 LRMs,提示优化仍然很有价值:它能自动将提示优化得既针对任务又简洁明了。 |
Figure 1: Summary of our main results, where LRMs and LLMs are used as either the task model (Mtask) or the optimizer (Mopt) in prompt optimization, and we observed a strong advantage of LRMs over LLMs.图 1:我们主要结果的总结,其中在提示优化中,LRM 和 LLM 被用作任务模型(Mtask)或优化器(Mopt),并且我们观察到 LRM 相对于 LLM 具有显著优势。

Figure 2: Overview of our prompt optimization framework using language models. At each iteration, a zero-shot task LLM generates outputs, while a separate optimizer LLM analyzes the errors and updates the prompt, including task instructions and event guidelines, accordingly. This process continues over batches of training samples Dtrain, and the final optimized prompt is evaluated on the development set to determine the node reward rt.图 2:使用语言模型的提示优化框架概述。在每次迭代中,零样本任务语言模型生成输出,而另一个优化器语言模型分析错误并相应地更新提示,包括任务说明和事件指南。此过程在训练样本批次 Dtrain 上持续进行,最终优化的提示在开发集上进行评估以确定节点奖励 rt。
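下面是一个关于节点奖励 rt 如何由开发集 F1 构成的简化示意:假设触发词表示为(跨度, 事件类型)、论元表示为(跨度, 角色, 事件类型),取 TI、TC、AI、AC 四项 F1 的平均值作为奖励。具体的匹配与计分细节以论文及数据集的官方评测脚本为准,此处仅供理解。

```python
# 简化示意:由 TI、TC、AI、AC 四项 F1 的平均值构成节点奖励(非官方评测脚本)。
from typing import Set, Tuple

def f1(pred: Set, gold: Set) -> float:
    """集合匹配下的 F1;预测与标注均为空时记为 1.0(约定之一,仅作示意)。"""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

Trigger = Tuple[str, str]        # (触发词跨度, 事件类型)
Argument = Tuple[str, str, str]  # (论元跨度, 角色, 事件类型)

def node_reward(pred_triggers: Set[Trigger], gold_triggers: Set[Trigger],
                pred_args: Set[Argument], gold_args: Set[Argument]) -> float:
    ti = f1({s for s, _ in pred_triggers}, {s for s, _ in gold_triggers})          # 触发词识别 TI
    tc = f1(pred_triggers, gold_triggers)                                           # 触发词分类 TC
    ai = f1({(s, t) for s, _, t in pred_args}, {(s, t) for s, _, t in gold_args})   # 论元识别 AI
    ac = f1(pred_args, gold_args)                                                   # 论元分类 AC
    return (ti + tc + ai + ac) / 4.0                                                # 平均 F1 作为奖励 r_t
```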

Figure 3: A code prompt consists of a task instruction and an event schema. The event schema contains information about the labels that are represented as Python classes and event guidelines defining both the event classes and the arguments. In prompt optimization, we refine both the task instruction and event guidelines (shown for two events; others omitted due to space limits) to generate more effective prompts for the task model.图 3:代码提示由任务指令和事件模式组成。事件模式包含以 Python 类表示的标签信息,以及定义事件类及其论元的事件指南。在提示优化中,我们对任务指令和事件指南(此处仅展示两个事件,其余因篇幅所限省略)进行细化,从而为任务模型生成更有效的提示。
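下面给出一个最小化的示意,展示"以 Python 类表示事件模式、docstring 充当事件指南"的代码提示可能的形式。其中的事件类名(Attack、Transport)、论元字段及指南文字均为笔者虚构的示例,并非论文或数据集的原始模式。

```python
# 仅为示意:事件模式以 Python 类表示,docstring 充当人工编写的事件指南,
# 类属性对应需要抽取的论元(argument)角色。类名与字段均为假设示例。
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    trigger: str = ""  # 触发词在原文中的字符串跨度

@dataclass
class Attack(Event):
    """Attack 事件:一方对另一方实施暴力或破坏行为。
    指南(示例):仅当文本中存在明确发生的攻击动作时才标注;
    仅有意图或威胁(未发生的攻击)不应标注为 Attack。"""
    attacker: List[str] = field(default_factory=list)    # 施动者论元
    target: List[str] = field(default_factory=list)      # 目标/受害者论元
    instrument: List[str] = field(default_factory=list)  # 使用的工具或武器

@dataclass
class Transport(Event):
    """Transport 事件:人员或物品从一地移动到另一地。
    指南(示例):论元跨度取最短的名词短语,避免包含修饰性从句。"""
    agent: List[str] = field(default_factory=list)
    artifact: List[str] = field(default_factory=list)
    origin: List[str] = field(default_factory=list)
    destination: List[str] = field(default_factory=list)

# 任务模型被要求以如下形式的 Python 代码输出抽取结果(示意):
# [Attack(trigger="fired", attacker=["the gunman"], target=["two officers"])]
```

提示优化所细化的,正是这里的任务指令与各事件类 docstring 中的指南文字。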

Conclusion
We present the first systematic study of prompt optimization for LRMs, evaluating their roles as both task models and optimizers in a unified MCTS framework. On the structured task of event extraction, we find that LRMs benefit more from prompt optimization than LLMs and serve as stronger optimizers. They produce higher-quality prompts, converge faster, and generalize more reliably across models—highlighting their effectiveness in both prompt consumption and generation. Our error analysis further reveals that prompts optimized by LRMs reduce overprediction, hallucination, and parsing errors, contributing to more faithful and structured outputs. | 我们首次对大型推理模型(LRMs)的提示优化进行了系统研究,在统一的蒙特卡罗树搜索(MCTS)框架中评估了它们作为任务模型和优化器的作用。在事件抽取这一结构化任务上,我们发现与大型语言模型(LLMs)相比,LRMs 更能从提示优化中获益,并且作为优化器表现更出色。它们生成的提示质量更高,收敛速度更快,并且在不同模型之间泛化更可靠——这凸显了它们在使用提示和生成提示两方面的有效性。我们的错误分析进一步表明,由 LRMs 优化的提示减少了过度预测、幻觉和解析错误,从而有助于生成更忠实和结构化的输出。 |