
LLMs / GraphRAG: Translation and Interpretation of "From Local to Global: A Graph RAG Approach to Query-Focused Summarization"

处女座的程序猿 | 2024-06-19 | Published in Shanghai


Overview: The paper proposes a graph-structured, knowledge-graph-enhanced retrieval-augmented generation method (Graph RAG) for answering users' global questions over an entire text collection, supporting comprehensive sensemaking over large volumes of data.

Background and pain points: Traditional retrieval-augmented generation (RAG) methods are designed for local question-answering tasks and cannot adequately handle global questions directed at an entire text corpus. Traditional query-focused summarization (QFS) methods, in turn, do not scale to the volumes of text indexed by typical RAG systems.
Core principle: GraphRAG answers global questions as follows:
>> Build a two-stage, knowledge-graph-based index. First, use an LLM to extract entities and relationships from the source documents and construct a knowledge graph; second, use a community detection algorithm to partition the graph into communities of closely related entities.
>> Use an LLM to generate a report-like summary for each community, yielding a modular graph index that covers the source documents and the knowledge graph built from them.
>> At query time, each community summary is first used independently and in parallel to generate a partial answer with the LLM; all relevant partial answers are then summarized again by the LLM into a global answer returned to the user.
Pipeline: source documents → text chunks → entity and relationship instances → entity and relationship descriptions → knowledge graph → graph communities → community summaries → community answers → global answer.
Overall, GraphRAG builds a hierarchical knowledge-graph index and exploits its inherent modularity to enable parallel processing; it then answers global queries in map-reduce style, improving efficiency while preserving answer comprehensiveness. This is the core idea behind its support for global question answering.
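To make that dataflow concrete, here is a minimal Python sketch of the two phases. Everything in it is illustrative: `llm(prompt)` stands in for any chat-completion call, and the helpers named below (`chunk_texts`, `extract_elements`, `build_graph`, `detect_communities`) are hypothetical placeholders for the steps detailed in section 2.

```python
# Hypothetical end-to-end sketch of the Graph RAG dataflow summarized above.

def graph_rag_index(documents, llm, chunk_size=600):
    chunks = chunk_texts(documents, chunk_size)            # source docs -> text chunks
    elements = [extract_elements(c, llm) for c in chunks]  # chunks -> entities/relations
    graph = build_graph(elements)                          # instances -> knowledge graph
    communities = detect_communities(graph)                # graph -> graph communities
    return {cid: llm(f"Write a report-like summary of this community:\n{members}")
            for cid, members in communities.items()}       # communities -> summaries

def graph_rag_query(question, community_summaries, llm):
    # Map: each community summary answers the query independently (parallelizable).
    partial = [llm(f"Answer '{question}' using only this summary:\n{s}")
               for s in community_summaries.values()]
    # Reduce: summarize all partial answers into one global answer.
    return llm(f"Combine these partial answers to '{question}' into one answer:\n\n"
               + "\n\n".join(partial))
```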

Core features

>> Exploits the inherent modularity of knowledge graphs to enable parallel processing.

>> Entities and relationships within each community are described in depth, favoring more comprehensive and diverse answers.

>> Compared with using the source documents directly, the graph index requires far fewer context tokens and answers queries more efficiently.

Advantages

>> Experiments show that, compared with a traditional RAG baseline and with direct global summarization of the source texts, Graph RAG substantially improves both the comprehensiveness and the diversity of answers while using far fewer context tokens; root-level community summaries in particular deliver strong query performance. The method makes complex question-answering tasks scalable.

In short, the paper's Graph RAG method combines knowledge-graph-based RAG with query-focused summarization, enabling answers to global questions over large text collections and supporting deep understanding and a macro-level grasp of the data.


Translation and Interpretation of "From Local to Global: A Graph RAG Approach to Query-Focused Summarization"

Address

Paper: https://arxiv.org/abs/2404.16130

Date

April 24, 2024

Authors

Microsoft team

Abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.

Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes (e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, extracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., Leiden, Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, covariates) that the LLM can summarize in parallel at both indexing time and query time. The "global answer" to a given query is produced using a final round of query-focused summarization over all community summaries reporting relevance to that query.

1 Introduction

Human endeavors across a range of domains rely on our ability to read and reason about large collections of documents, often reaching conclusions that go beyond anything stated in the source texts themselves. With the emergence of large language models (LLMs), we are already witnessing attempts to automate human-like sensemaking in complex domains like scientific discovery (Microsoft, 2023) and intelligence analysis (Ranade and Joshi, 2023), where sensemaking is defined as "a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively" (Klein et al., 2006a). Supporting human-led sensemaking over entire text corpora, however, needs a way for people to both apply and refine their mental model of the data (Klein et al., 2006b) by asking questions of a global nature.

Retrieval-augmented generation (RAG, Lewis et al., 2020) is an established approach to answering user questions over entire datasets, but it is designed for situations where these answers are contained locally within regions of text whose retrieval provides sufficient grounding for the generation task. Instead, a more appropriate task framing is query-focused summarization (QFS, Dang, 2006), and in particular, query-focused abstractive summarization that generates natural language summaries and not just concatenated excerpts (Baumel et al., 2018; Laskar et al., 2020; Yao et al., 2017). In recent years, however, such distinctions between summarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document versus multi-document, have become less relevant. While early applications of the transformer architecture showed substantial improvements on the state-of-the-art for all such summarization tasks (Goodwin et al., 2020; Laskar et al., 2022; Liu and Lapata, 2019), these tasks are now trivialized by modern LLMs, including the GPT (Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series, all of which can use in-context learning to summarize any content provided in their context window.

The challenge remains, however, for query-focused abstractive summarization over an entire corpus. Such volumes of text can greatly exceed the limits of LLM context windows, and the expansion of such windows may not be enough given that information can be "lost in the middle" of longer contexts (Kuratov et al., 2024; Liu et al., 2023). In addition, although the direct retrieval of text chunks in naïve RAG is likely inadequate for QFS tasks, it is possible that an alternative form of pre-indexing could support a new RAG approach specifically targeting global summarization.

In this paper, we present a Graph RAG approach based on global summarization of an LLM-derived knowledge graph (Figure 1). In contrast with related work that exploits the structured retrieval and traversal affordances of graph indexes (subsection 4.2), we focus on a previously unexplored quality of graphs in this context: their inherent modularity (Newman, 2006) and the ability of community detection algorithms to partition graphs into modular communities of closely-related nodes (e.g., Louvain, Blondel et al., 2008; Leiden, Traag et al., 2019). LLM-generated summaries of these community descriptions provide complete coverage of the underlying graph index and the input documents it represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: first using each community summary to answer the query independently and in parallel, then summarizing all relevant partial answers into a final global answer.

To evaluate this approach, we used an LLM to generate a diverse set of activity-centered sensemaking questions from short descriptions of two representative real-world datasets, containing podcast transcripts and news articles respectively. For the target qualities of comprehensiveness, diversity, and empowerment (defined in subsection 3.4) that develop understanding of broad issues and themes, we both explore the impact of varying the hierarchical level of community summaries used to answer queries, as well as compare to naïve RAG and global map-reduce summarization of source texts. We show that all global approaches outperform naïve RAG on comprehensiveness and diversity, and that Graph RAG with intermediate- and low-level community summaries shows favorable performance over source text summarization on these same metrics, at lower token costs.

2 Graph RAG Approach & Pipeline

We now unpack the high-level data flow of the Graph RAG approach (Figure 1) and pipeline, describing key design parameters, techniques, and implementation details for each step.

2.1 Source Documents → Text Chunks

A fundamental design decision is the granularity with which input texts extracted from source documents should be split into text chunks for processing. In the following step, each of these chunks will be passed to a set of LLM prompts designed to extract the various elements of a graph index. Longer text chunks require fewer LLM calls for such extraction, but suffer from the recall degradation of longer LLM context windows (Kuratov et al., 2024; Liu et al., 2023). This behavior can be observed in Figure 2 in the case of a single extraction round (i.e., with zero gleanings): on a sample dataset (HotPotQA, Yang et al., 2018), using a chunk size of 600 tokens extracted almost twice as many entity references as using a chunk size of 2400. While more references are generally better, any extraction process needs to balance recall and precision for the target activity.
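A minimal sketch of overlapping token-window chunking, assuming the tiktoken tokenizer; the 600-token size and 100-token overlap default to the values used for the datasets in subsection 3.1.

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows of `chunk_size` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Shrinking `chunk_size` trades more LLM extraction calls for better recall, which is exactly the balance the paragraph above describes.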

2.2 Text Chunks → Element Instances

The baseline requirement for this step is to identify and extract instances of graph nodes and edges from each chunk of source text. We do this using a multipart LLM prompt that first identifies all entities in the text, including their name, type, and description, before identifying all relationships between clearly-related entities, including the source and target entities and a description of their relationship. Both kinds of element instance are output in a single list of delimited tuples.

The primary opportunity to tailor this prompt to the domain of the document corpus lies in the choice of few-shot examples provided to the LLM for in-context learning (Brown et al., 2020). For example, while our default prompt extracting the broad class of "named entities" like people, places, and organizations is generally applicable, domains with specialized knowledge (e.g., science, medicine, law) will benefit from few-shot examples specialized to those domains. We also support a secondary extraction prompt for any additional covariates we would like to associate with the extracted node instances. Our default covariate prompt aims to extract claims linked to detected entities, including the subject, object, type, description, source text span, and start and end dates.
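To ground the delimited-tuple output format, here is an illustrative sketch. The prompt wording and the `("entity"|...)` record layout are assumptions of this write-up, not the paper's exact prompts, and `llm` is any chat-completion callable.

```python
# Illustrative extraction step: one prompt emits entities and relationships
# as delimited tuples, one record per line, which we then parse.

EXTRACT_PROMPT = """Identify all entities in the text below, then all relationships
between clearly-related entities. Output one record per line:
("entity"|<name>|<type>|<description>)
("relationship"|<source>|<target>|<description>)

Text:
{chunk}"""

def extract_elements(chunk: str, llm) -> tuple[list, list]:
    entities, relationships = [], []
    for line in llm(EXTRACT_PROMPT.format(chunk=chunk)).splitlines():
        parts = line.strip().strip("()").split("|")
        if parts[0].strip('"') == "entity" and len(parts) == 4:
            entities.append(tuple(p.strip() for p in parts[1:]))
        elif parts[0].strip('"') == "relationship" and len(parts) == 4:
            relationships.append(tuple(p.strip() for p in parts[1:]))
    return entities, relationships
```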

To balance the needs of efficiency and quality, we use multiple rounds of "gleanings", up to a specified maximum, to encourage the LLM to detect any additional entities it may have missed on prior extraction rounds. This is a multi-stage process in which we first ask the LLM to assess whether all entities were extracted, using a logit bias of 100 to force a yes/no decision. If the LLM responds that entities were missed, then a continuation indicating that "MANY entities were missed in the last extraction" encourages the LLM to glean these missing entities. This approach allows us to use larger chunk sizes without a drop in quality (Figure 2) or the forced introduction of noise.
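The gleaning loop is easy to misread, so here is a sketch assuming an OpenAI-style client. The yes/no token IDs passed to `logit_bias` are illustrative only (they depend on the model's tokenizer), and `EXTRACT_PROMPT` is reused from the previous sketch.

```python
# Sketch of multi-round gleaning: extract, ask whether anything was missed
# (forced YES/NO via logit bias), and re-extract if so.

def extract_with_gleanings(chunk, client, max_gleanings=1):
    history = [{"role": "user", "content": EXTRACT_PROMPT.format(chunk=chunk)}]
    results = []
    for _ in range(max_gleanings + 1):
        reply = client.chat.completions.create(model="gpt-4", messages=history)
        results.append(reply.choices[0].message.content)
        history.append({"role": "assistant", "content": results[-1]})
        history.append({"role": "user",
                        "content": "Were any entities missed? Answer YES or NO."})
        verdict = client.chat.completions.create(
            model="gpt-4", messages=history, max_tokens=1,
            logit_bias={9642: 100, 2822: 100},  # YES/NO token ids (assumed values)
        ).choices[0].message.content
        history.append({"role": "assistant", "content": verdict})
        if verdict.strip().upper() != "YES":
            break
        history.append({"role": "user",
                        "content": "MANY entities were missed in the last extraction. "
                                   "Add them using the same format."})
    return results
```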

2.3 Element Instances → Element Summaries

The use of an LLM to "extract" descriptions of entities, relationships, and claims represented in source texts is already a form of abstractive summarization, relying on the LLM to create independently meaningful summaries of concepts that may be implied but not stated by the text itself (e.g., the presence of implied relationships). To convert all such instance-level summaries into single blocks of descriptive text for each graph element (i.e., entity node, relationship edge, and claim covariate) requires a further round of LLM summarization over matching groups of instances.

A potential concern at this stage is that the LLM may not consistently extract references to the same entity in the same text format, resulting in duplicate entity elements and thus duplicate nodes in the entity graph. However, since all closely-related "communities" of entities will be detected and summarized in the following step, and given that LLMs can understand the common entity behind multiple name variations, our overall approach is resilient to such variations provided there is sufficient connectivity from all variations to a shared set of closely-related entities.

Overall, our use of rich descriptive text for homogeneous nodes in a potentially noisy graph structure is aligned with both the capabilities of LLMs and the needs of global, query-focused summarization. These qualities also differentiate our graph index from typical knowledge graphs, which rely on concise and consistent knowledge triples (subject, predicate, object) for downstream reasoning tasks.
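A sketch of the instance-to-element summarization round, grouping by exact (name, type) key. The grouping rule is a simplification of this write-up; as noted above, the paper relies on community detection to absorb remaining name variations.

```python
# Collapse all instance-level descriptions of one graph element into a
# single descriptive text block via one more LLM summarization pass.

from collections import defaultdict

def summarize_elements(entities, llm):
    grouped = defaultdict(list)
    for name, etype, description in entities:
        grouped[(name, etype)].append(description)
    summaries = {}
    for (name, etype), descs in grouped.items():
        if len(descs) == 1:
            summaries[(name, etype)] = descs[0]
        else:
            summaries[(name, etype)] = llm(
                f"Write one coherent description of the {etype} '{name}' "
                "from these extracted snippets:\n- " + "\n- ".join(descs))
    return summaries
```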

2.4 Element Summaries → Graph Communities

The index created in the previous step can be modelled as a homogeneous undirected weighted graph in which entity nodes are connected by relationship edges, with edge weights representing the normalized counts of detected relationship instances. Given such a graph, a variety of community detection algorithms may be used to partition the graph into communities of nodes with stronger connections to one another than to the other nodes in the graph (e.g., see the surveys by Fortunato, 2010 and Jin et al., 2021). In our pipeline, we use Leiden (Traag et al., 2019) on account of its ability to recover hierarchical community structure of large-scale graphs efficiently (Figure 3). Each level of this hierarchy provides a community partition that covers the nodes of the graph in a mutually-exclusive, collectively-exhaustive way, enabling divide-and-conquer global summarization.
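A hedged sketch of the community-detection step. The paper specifies only the Leiden algorithm; using the graspologic library's `hierarchical_leiden` here is an assumption of this interpretation, and `max_cluster_size` is an illustrative knob.

```python
# Recover a hierarchical Leiden partition of the entity graph, then index
# the node-to-community assignments by hierarchy level.

import networkx as nx
from graspologic.partition import hierarchical_leiden

def detect_communities(graph: nx.Graph, max_cluster_size: int = 10):
    """Return {level: {node: community_id}} for each level of the hierarchy."""
    hierarchy = hierarchical_leiden(graph, max_cluster_size=max_cluster_size)
    levels: dict[int, dict] = {}
    for assignment in hierarchy:
        levels.setdefault(assignment.level, {})[assignment.node] = assignment.cluster
    return levels
```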

2.5 Graph Communities → Community Summaries

The next step is to create report-like summaries of each community in the Leiden hierarchy, using a method designed to scale to very large datasets. These summaries are independently useful in their own right as a way to understand the global structure and semantics of the dataset, and may themselves be used to make sense of a corpus in the absence of a question. For example, a user may scan through community summaries at one level looking for general themes of interest, then follow links to the reports at the lower level that provide more details for each of the subtopics. Here, however, we focus on their utility as part of a graph-based index used for answering global queries. Community summaries are generated in the following way:

>> Leaf-level communities. The element summaries of a leaf-level community (nodes, edges, covariates) are prioritized and then iteratively added to the LLM context window until the token limit is reached.

>> Higher-level communities. If all element summaries fit within the context window, proceed as for leaf-level communities and summarize all element summaries within the community. Otherwise, rank sub-communities in decreasing order of element summary token count and iteratively substitute sub-community summaries (shorter) for their associated element summaries (longer) until the summaries fit within the context window.
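A simplified sketch of the token-budgeted packing described in the bullets above; it approximates, rather than reproduces, the substitution rule for higher-level communities. `count_tokens` could be the tiktoken-based counter from the chunking sketch earlier.

```python
# Pack element summaries (preferred) and sub-community summaries (fallback)
# into a token budget, then ask the LLM for a report-like community summary.

def summarize_community(element_summaries, sub_summaries, llm,
                        count_tokens, budget: int = 8000):
    parts, used = [], 0
    for text in list(element_summaries) + list(sub_summaries):
        cost = count_tokens(text)
        if used + cost > budget:
            break
        parts.append(text)
        used += cost
    return llm("Write a report-like summary of this community:\n" + "\n".join(parts))
```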

2.6 Community Summaries → Community Answers → Global Answer

Given a user query, the community summaries generated in the previous step can be used to generate a final answer in a multi-stage process. The hierarchical nature of the community structure also means that questions can be answered using the community summaries from different levels, raising the question of whether a particular level in the hierarchical community structure offers the best balance of summary detail and scope for general sensemaking questions (evaluated in section 3).

For a given community level, the global answer to any user query is generated as follows:

>> Prepare community summaries. Community summaries are randomly shuffled and divided into chunks of pre-specified token size. This ensures relevant information is distributed across chunks, rather than concentrated (and potentially lost) in a single context window.

>> Map community answers. Generate intermediate answers in parallel, one for each chunk. The LLM is also asked to generate a score between 0-100 indicating how helpful the generated answer is in answering the target question. Answers with score 0 are filtered out.

>> Reduce to global answer. Intermediate community answers are sorted in descending order of helpfulness score and iteratively added into a new context window until the token limit is reached. This final context is used to generate the global answer returned to the user.
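A minimal sketch of this map-reduce stage. The prompt wording and JSON score format are assumptions of this write-up, and `count_tokens` is again the tiktoken counter from the chunking sketch.

```python
# Map-reduce query answering over community summaries: prepare -> map -> reduce.

import json
import random

def global_answer(question, community_summaries, llm, count_tokens,
                  chunk_budget=8000, context_budget=8000):
    # Prepare: shuffle summaries and pack them into token-limited chunks.
    summaries = list(community_summaries)
    random.shuffle(summaries)
    chunks, current, used = [], [], 0
    for s in summaries:
        if current and used + count_tokens(s) > chunk_budget:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(s)
        used += count_tokens(s)
    if current:
        chunks.append("\n".join(current))

    # Map: one scored intermediate answer per chunk; score-0 answers dropped.
    mapped = []
    for chunk in chunks:
        reply = json.loads(llm(
            f"Answer the question '{question}' using only the context below. "
            'Reply as JSON: {"answer": "...", "score": <0-100>}.\n\n' + chunk))
        if reply["score"] > 0:
            mapped.append(reply)

    # Reduce: add answers in descending score order until the budget is hit.
    mapped.sort(key=lambda r: r["score"], reverse=True)
    context, used = [], 0
    for r in mapped:
        if used + count_tokens(r["answer"]) > context_budget:
            break
        context.append(r["answer"])
        used += count_tokens(r["answer"])
    return llm(f"Combine these partial answers to '{question}' "
               "into a final global answer:\n\n" + "\n\n".join(context))
```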

3 Evaluation

3.1 Datasets

We selected two datasets in the one million token range, each equivalent to about 10 novels of text and representative of the kind of corpora that users may encounter in their real-world activities:

>> Podcast transcripts. Compiled transcripts of podcast conversations between Kevin Scott, Microsoft CTO, and other technology leaders (Behind the Tech, Scott, 2024). Size: 1669 × 600-token text chunks, with 100-token overlaps between chunks (∼1 million tokens).

>> News articles. Benchmark dataset comprising news articles published from September 2013 to December 2023 in a range of categories, including entertainment, business, sports, technology, health, and science (MultiHop-RAG; Tang and Yang, 2024). Size: 3197 × 600-token text chunks, with 100-token overlaps between chunks (∼1.7 million tokens).

3.2 Queries

Many benchmark datasets for open-domain question answering exist, including HotPotQA (Yang et al., 2018), MultiHop-RAG (Tang and Yang, 2024), and MT-Bench (Zheng et al., 2024). However, the associated question sets target explicit fact retrieval rather than summarization for the purpose of data sensemaking, i.e., the process through which people inspect, engage with, and contextualize data within the broader scope of real-world activities (Koesten et al., 2021). Similarly, methods for extracting latent summarization queries from source texts also exist (Xu and Lapata, 2021), but such extracted questions can target details that betray prior knowledge of the texts.

To evaluate the effectiveness of RAG systems for more global sensemaking tasks, we need questions that convey only a high-level understanding of dataset contents, and not the details of specific texts. We used an activity-centered approach to automate the generation of such questions: given a short description of a dataset, we asked the LLM to identify N potential users and N tasks per user, then for each (user, task) combination, we asked the LLM to generate N questions that require understanding of the entire corpus. For our evaluation, a value of N = 5 resulted in 125 test questions per dataset. Table 1 shows example questions for each of the two evaluation datasets.
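A sketch of that nested generation loop (N = 5 gives 5 users × 5 tasks × 5 questions = 125 per dataset). The `llm_list` helper and its list-parsing rule are hypothetical conveniences, not part of the paper.

```python
# Activity-centered question generation: users -> tasks -> global questions.

def generate_questions(dataset_description: str, llm, n: int = 5) -> list[str]:
    users = llm_list(llm, f"Given this dataset: {dataset_description}\n"
                          f"List {n} potential users.", n)
    questions = []
    for user in users:
        tasks = llm_list(llm, f"List {n} tasks that {user} would use "
                              "the dataset for.", n)
        for task in tasks:
            questions += llm_list(
                llm, f"As {user} doing '{task}', write {n} questions that "
                     "require understanding of the entire corpus.", n)
    return questions

def llm_list(llm, prompt: str, n: int) -> list[str]:
    """Hypothetical helper: request a numbered list, keep the first n items."""
    lines = [l.strip(" -0123456789.") for l in llm(prompt).splitlines() if l.strip()]
    return lines[:n]
```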

3.3 Conditions

We compare six different conditions in our analysis, including Graph RAG using four levels of graph communities (C0, C1, C2, C3), a text summarization method applying our map-reduce approach directly to source texts (TS), and a naïve "semantic search" RAG approach (SS).

3.4 Metrics

LLMs have been shown to be good evaluators of natural language generation, achieving state-of-the-art or competitive results compared against human judgements (Wang et al., 2023a; Zheng et al., 2024). While this approach can generate reference-based metrics when gold standard answers are known, it is also capable of measuring the qualities of generated texts (e.g., fluency) in a reference-free style (Wang et al., 2023a) as well as in head-to-head comparison of competing outputs (LLM-as-a-judge, Zheng et al., 2024). LLMs have also shown promise at evaluating the performance of conventional RAG systems, automatically evaluating qualities like context relevance, faithfulness, and answer relevance (RAGAS, Es et al., 2023).
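A sketch of one head-to-head LLM-as-a-judge comparison on a single metric; the prompt is illustrative rather than the paper's evaluation prompt. Win rates then come from judging every pair of conditions over all generated questions.

```python
# Pairwise LLM-as-a-judge comparison for one question and one target metric
# (e.g., comprehensiveness, diversity, empowerment, or directness).

def judge_pair(question, answer_a, answer_b, metric, llm) -> str:
    verdict = llm(
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        f"Which answer is better in terms of {metric}? "
        "Reply with exactly 'A', 'B', or 'TIE'.")
    return verdict.strip().upper()
```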

3.6 Results

The indexing process resulted in a graph consisting of 8564 nodes and 20691 edges for the Podcast dataset, and a larger graph of 15754 nodes and 19520 edges for the News dataset. Table 3 shows the number of community summaries at different levels of each graph community hierarchy.

Global approaches vs. naïve RAG. As shown in Figure 4, global approaches consistently outperformed the naïve RAG (SS) approach in both comprehensiveness and diversity metrics across datasets. Specifically, global approaches achieved comprehensiveness win rates between 72-83% for Podcast transcripts and 72-80% for News articles, while diversity win rates ranged from 75-82% and 62-71% respectively. Our use of directness as a validity test also achieved the expected results, i.e., that naïve RAG produces the most direct responses across all comparisons.

Community summaries vs. source texts. When comparing community summaries to source texts using Graph RAG, community summaries generally provided a small but consistent improvement in answer comprehensiveness and diversity, except for root-level summaries. Intermediate-level summaries in the Podcast dataset and low-level community summaries in the News dataset achieved comprehensiveness win rates of 57% and 64%, respectively. Diversity win rates were 57% for Podcast intermediate-level summaries and 60% for News low-level community summaries. Table 3 also illustrates the scalability advantages of Graph RAG compared to source text summarization: for low-level community summaries (C3), Graph RAG required 26-33% fewer context tokens, while for root-level community summaries (C0), it required over 97% fewer tokens. For a modest drop in performance compared with other global methods, root-level Graph RAG offers a highly efficient method for the iterative question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness (72% win rate) and diversity (62% win rate) over naïve RAG.

Empowerment. Empowerment comparisons showed mixed results for both global approaches versus naïve RAG (SS) and Graph RAG approaches versus source text summarization (TS). Ad-hoc LLM use to analyze LLM reasoning for this measure indicated that the ability to provide specific examples, quotes, and citations was judged to be key to helping users reach an informed understanding. Tuning element extraction prompts may help to retain more of these details in the Graph RAG index.

4 Related Work

4.1 RAG Approaches and Systems

When using LLMs, RAG involves first retrieving relevant information from external data sources, then adding this information to the context window of the LLM along with the original query (Ram et al., 2023). Naïve RAG approaches (Gao et al., 2023) do this by converting documents to text, splitting text into chunks, and embedding these chunks into a vector space in which similar positions represent similar semantics. Queries are then embedded into the same vector space, with the text chunks of the nearest k vectors used as context. More advanced variations exist, but all solve the problem of what to do when an external dataset of interest exceeds the LLM's context window.
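For contrast with Graph RAG, here is a sketch of the naïve vector-search baseline just described. It is numpy-only; `embed(texts)` stands in for any sentence-embedding model returning one vector per input text.

```python
# Naive RAG baseline: embed chunks, embed the query, take the nearest k
# chunks by cosine similarity as the generation context.

import numpy as np

def naive_rag(question, chunks, embed, llm, k=10):
    chunk_vecs = embed(chunks)          # shape (n_chunks, dim)
    q_vec = embed([question])[0]        # shape (dim,)
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
    top = np.argsort(sims)[::-1][:k]
    context = "\n\n".join(chunks[i] for i in top)
    return llm(f"Answer using only this context:\n{context}\n\nQ: {question}")
```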

Advanced RAG systems include pre-retrieval, retrieval, and post-retrieval strategies designed to overcome the drawbacks of naïve RAG, while Modular RAG systems include patterns for iterative and dynamic cycles of interleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple concepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem, Cheng et al., 2024) for generation-augmented retrieval (GAR, Mao et al., 2020) that facilitates future generation cycles, while our parallel generation of community answers from these summaries is a kind of iterative (Iter-RetGen, Shao et al., 2023) or federated (FeB4RAG, Wang et al., 2024) retrieval-generation strategy. Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID, Su et al., 2020) and multi-hop question answering (ITRG, Feng et al., 2023; IR-CoT, Trivedi et al., 2022; DSP, Khattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further approaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings (RAPTOR, Sarthi et al., 2024) or generating a "tree of clarifications" to answer multiple interpretations of ambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the kind of self-generated graph index that enables Graph RAG.

4.2 Graphs and LLMs

Use of graphs in connection with LLMs and RAG is a developing research area, with multiple directions already established. These include using LLMs for knowledge graph creation (Trajanoska et al., 2023) and completion (Yao et al., 2023), as well as for the extraction of causal graphs (Ban et al., 2023; Zhang et al., 2024) from source texts. They also include forms of advanced RAG (Gao et al., 2023) where the index is a knowledge graph (KAPING, Baek et al., 2023), where subsets of the graph structure (G-Retriever, He et al., 2024) or derived graph metrics (Graph-ToolFormer, Zhang, 2023) are the objects of enquiry, where narrative outputs are strongly grounded in the facts of retrieved subgraphs (SURGE, Kang et al., 2023), where retrieved event-plot subgraphs are serialized using narrative templates (FABULA, Ranade and Joshi, 2023), and where the system supports both creation and traversal of text-relationship graphs for multi-hop question answering (Wang et al., 2023b). In terms of open-source software, a variety of graph databases are supported by both the LangChain (LangChain, 2024) and LlamaIndex (LlamaIndex, 2024) libraries, while a more general class of graph-based RAG applications is also emerging, including systems that can create and reason over knowledge graphs in both Neo4J (NaLLM, Neo4J, 2024) and NebulaGraph (GraphRAG, NebulaGraph, 2024) formats. Unlike our Graph RAG approach, however, none of these systems use the natural modularity of graphs to partition data for global summarization.

5 Discussion

Limitations of evaluation approach. Our evaluation to date has only examined a certain class of sensemaking questions for two corpora in the region of 1 million tokens. More work is needed to understand how performance varies across different ranges of question types, data types, and dataset sizes, as well as to validate our sensemaking questions and target metrics with end users. Comparison of fabrication rates, e.g., using approaches like SelfCheckGPT (Manakul et al., 2023), would also improve on the current analysis.

Trade-offs of building a graph index. We consistently observed Graph RAG achieve the best head-to-head results against other methods, but in many cases the graph-free approach to global summarization of source texts performed competitively. The real-world decision about whether to invest in building a graph index depends on multiple factors, including the compute budget, expected number of lifetime queries per dataset, and value obtained from other aspects of the graph index (including the generic community summaries and the use of other graph-related RAG approaches).

Future work. The graph index, rich text annotations, and hierarchical community structure supporting the current Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches that operate in a more local manner, via embedding-based matching of user queries and graph annotations, as well as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports before employing our map-reduce summarization mechanisms. This "roll-up" operation could also be extended across more levels of the community hierarchy, as well as implemented as a more exploratory "drill down" mechanism that follows the information scent contained in higher-level community summaries.

6 Conclusion

We have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented generation (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. Initial evaluations show substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce source text summarization. For situations requiring many global queries over the same dataset, summaries of root-level communities in the entity-based graph index provide a data index that is both superior to naïve RAG and achieves competitive performance to other global methods at a fraction of the token cost.

An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.
