Building a Graph RAG System: A Step-by-Step Approach

By Kanwal Mehreen on December 2, 2024 in Language Models

Image by Author | Ideogram.ai

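Graph RAG, Graph RAG, Graph RAG! The term has become the talk of the town, and you have probably come across it too. But what exactly is Graph RAG, and why has it become so popular? In this article, we'll explore the concept behind Graph RAG, why it's needed, and, as a bonus, how to implement it using LlamaIndex. Let's get started!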

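First, let's look at the shift from large language models (LLMs) to Retrieval-Augmented Generation (RAG) systems. LLMs rely on static knowledge, meaning they can only use the data they were trained on. This limitation often makes them prone to hallucinations, that is, generating incorrect or fabricated information. RAG systems were developed to address this. Unlike a standalone LLM, a RAG system retrieves data from an external knowledge base at query time and uses that fresh context to generate more accurate and relevant responses. Traditional RAG systems retrieve specific information using text embeddings. While powerful, they come with limitations. If you've worked on RAG projects, you'll probably relate to this: the quality of the system's response depends heavily on the clarity and specificity of the query. But an even bigger challenge emerged: the inability to reason effectively across multiple documents.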
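
To make embedding-based retrieval concrete, here is a minimal sketch. It assumes a toy bag-of-words vector as a stand-in for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical helpers invented for this illustration, not part of any library.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


docs = [
    "Watson and Crick proposed the double helix structure in 1953",
    "Franklin produced X-ray diffraction images of DNA",
    "The Leiden algorithm detects communities in graphs",
]
top = retrieve("who proposed the double helix structure", docs, k=1)
print(top)
```

Each document is scored on its own against the query, which is why this style of retrieval has no built-in way to connect facts spread across several documents.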

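So what does that mean in practice? Let's take an example. Suppose you ask the system:

"Who were the key contributors to the discovery of DNA's double-helix structure, and what role did Rosalind Franklin play?"

In a traditional RAG setup, the system might retrieve the following pieces of information:

Document 1: "James Watson and Francis Crick proposed the double-helix structure in 1953."
Document 2: "Rosalind Franklin's X-ray diffraction images were critical in identifying DNA's helical structure."
Document 3: "Maurice Wilkins shared Franklin's images with Watson and Crick, which contributed to their discovery."

The problem? Traditional RAG systems treat these documents as independent units. They fail to connect the dots effectively, leading to fragmented responses like:

"Watson and Crick proposed the structure, and Franklin's work was important."

This response lacks depth and misses the key relationships between the contributors. Enter Graph RAG! By organizing the retrieved data as a graph, Graph RAG represents each document or fact as a node and the relationships between them as edges.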

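Here is how Graph RAG would handle the same query:

Nodes: Represent facts (e.g., "Watson and Crick proposed the structure," "Franklin contributed critical X-ray images").
Edges: Represent relationships (e.g., "Franklin's images → shared by Wilkins → influenced Watson and Crick").

By reasoning across these interconnected nodes, Graph RAG can produce a complete and insightful response like:

"The discovery of DNA's double-helix structure in 1953 was led primarily by James Watson and Francis Crick. However, this breakthrough relied heavily on Rosalind Franklin's X-ray diffraction images, which Maurice Wilkins shared with them."

This ability to combine information from multiple sources and answer broader, more complex questions is what makes Graph RAG so popular.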
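
The DNA example can be sketched as a tiny graph in plain Python. The triples below are a hypothetical hand-written distillation of the three example documents, and `connected_entities` is an illustrative helper; a real system would extract the triples with an LLM.

```python
from collections import defaultdict, deque

# Hypothetical (source, relation, target) triples based on the example documents.
triples = [
    ("Watson and Crick", "proposed", "double helix structure"),
    ("Rosalind Franklin", "produced", "X-ray diffraction images"),
    ("X-ray diffraction images", "revealed", "double helix structure"),
    ("Maurice Wilkins", "shared", "X-ray diffraction images"),
    ("X-ray diffraction images", "informed", "Watson and Crick"),
]

# Undirected adjacency list that keeps the relation label on each edge.
graph = defaultdict(list)
for source, relation, target in triples:
    graph[source].append((relation, target))
    graph[target].append((relation, source))


def connected_entities(start: str, max_hops: int = 2) -> set[str]:
    """Breadth-first search: entities reachable from `start` within `max_hops` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for _, neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - {start}


related = connected_entities("Rosalind Franklin")
print(sorted(related))
```

Starting from Franklin, two hops are enough to reach Wilkins, Watson and Crick, and the double helix, which is the kind of cross-document link that isolated chunks do not expose.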

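The Graph RAG Pipeline

We'll now explore the Graph RAG pipeline as presented in the Microsoft Research paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization."

Graph RAG Approach: Microsoft Research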

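Step 1: Source Documents → Text Chunks

LLMs can handle only a limited amount of text at a time. To maintain accuracy and make sure nothing important is missed, we first break large documents into smaller, more manageable "chunks" of text for processing.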
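
As a rough illustration of chunking, the sketch below splits text on whitespace into fixed-size word windows with a small overlap. `chunk_words` and its parameters are hypothetical; production splitters usually count tokens and respect sentence boundaries.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into chunks of `chunk_size` words, repeating `overlap` words between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(sample)
for chunk in chunks:
    print(chunk)
```

The repeated words at each boundary mean a sentence cut in half by one chunk is still seen whole, or nearly whole, by its neighbor.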

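Step 2: Text Chunks → Element Instances

For each chunk of source text, we prompt the LLM to identify graph nodes and edges. For example, from a news article, the LLM might detect that "NASA launched a spacecraft" and link "NASA" (entity: node) to "spacecraft" (entity: node) through "launched" (relationship: edge).

Step 3: Element Instances → Element Summaries

After identifying the elements, the next step is to use the LLM to summarize them into concise, meaningful descriptions. This makes the data easier to understand. For example, for the node "NASA," the summary could be: "NASA is a space agency responsible for space exploration missions." For the edge connecting "NASA" and "spacecraft," the summary might be: "NASA launched the spacecraft in 2023." These summaries keep the graph both rich in detail and easy to interpret.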
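
One way to picture the output of Steps 2 and 3 is as a pair of small record types. The `Entity`, `Relationship`, and `KnowledgeGraph` classes below are hypothetical containers for this sketch, not the LlamaIndex types used later in the article.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    type: str
    summary: str = ""  # filled in by the summarization step


@dataclass
class Relationship:
    source: str
    target: str
    label: str
    summary: str = ""


@dataclass
class KnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_relationship(self, rel: Relationship) -> None:
        # Both endpoints must already exist as entities (nodes).
        if rel.source not in self.entities or rel.target not in self.entities:
            raise KeyError("add both entities before linking them")
        self.relationships.append(rel)


kg = KnowledgeGraph()
kg.add_entity(Entity("NASA", "organization",
                     "NASA is a space agency responsible for space exploration missions."))
kg.add_entity(Entity("Spacecraft", "vehicle"))
kg.add_relationship(Relationship("NASA", "Spacecraft", "launched",
                                 "NASA launched the spacecraft in 2023."))
print(kg.relationships[0])
```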

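Step 4: Element Summaries → Graph Communities

The graph created in the previous steps is usually too large to analyze directly. To simplify it, we partition the graph into communities using specialized algorithms such as Leiden. These communities help identify clusters of closely related information. For example, one community might focus on "space exploration," grouping nodes such as "NASA," "spacecraft," and "Mars rover." Another might focus on "environmental science," grouping nodes such as "climate change," "carbon emissions," and "sea level." This step makes it easier to identify themes and connections within the dataset.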
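
To show the idea of grouping nodes, the sketch below uses connected components as a deliberately crude stand-in for Leiden. It is not the Leiden algorithm: Leiden optimizes modularity and can split a single connected graph into several communities, whereas this version only separates parts of the graph that share no edges at all.

```python
from collections import defaultdict

# Hypothetical edges echoing the two themes described above.
edges = [
    ("NASA", "Spacecraft"),
    ("NASA", "Mars Rover"),
    ("Climate Change", "Carbon Emissions"),
    ("Climate Change", "Sea Level"),
]


def connected_components(edges):
    """Group nodes that are reachable from one another (depth-first search)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


communities = connected_components(edges)
for i, members in enumerate(communities):
    print(i, sorted(members))
```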

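Step 5: Graph Communities → Community Summaries

The LLM prioritizes important details and condenses them into a manageable size. Each community is therefore summarized to give an overview of the information it contains. For example, a community about "space exploration" might summarize key missions, discoveries, and organizations such as NASA or SpaceX. These summaries are useful for answering general questions or exploring broad themes within the dataset.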

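Step 6: Community Summaries → Community Answers → Global Answer

Finally, the community summaries are used to answer user queries. Here is how:

Query the data: A user asks, "What are the main impacts of climate change?"
Community analysis: The AI reviews the summaries of the relevant communities.
Generate partial answers: Each community provides a partial answer, such as:
- "Rising sea levels threaten coastal cities."
- "Unpredictable weather disrupts agriculture."
Combine into a global answer: These partial answers are merged into one comprehensive response:

"The impacts of climate change include rising sea levels, disrupted agriculture, and more frequent natural disasters."

This process ensures the final answer is detailed, accurate, and easy to understand.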
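
The query flow above is essentially a map step followed by a reduce step. The sketch below shows that shape with simple keyword matching and string joining standing in for the two LLM calls; `answer_globally`, `keyword_answer`, and `join_answers` are hypothetical names used only for this illustration.

```python
from typing import Callable


def answer_globally(
    query: str,
    community_summaries: dict[int, str],
    answer_fn: Callable[[str, str], str],
    combine_fn: Callable[[list[str]], str],
) -> str:
    """Map: one partial answer per community. Reduce: merge them into a global answer."""
    partial = [answer_fn(summary, query) for summary in community_summaries.values()]
    partial = [p for p in partial if p]  # drop communities with nothing relevant
    return combine_fn(partial)


def keyword_answer(summary: str, query: str) -> str:
    """Stand-in for an LLM call: keep the summary if it shares a long keyword with the query."""
    keywords = {w.strip("?.,").lower() for w in query.split() if len(w) > 4}
    return summary if any(k in summary.lower() for k in keywords) else ""


def join_answers(partials: list[str]) -> str:
    """Stand-in for the LLM aggregation step."""
    return " ".join(partials)


summaries = {
    0: "Rising sea levels linked to climate change threaten coastal cities.",
    1: "Unpredictable weather tied to climate change disrupts agriculture.",
    2: "NASA launched a spacecraft in 2023.",
}
final = answer_globally(
    "What are the main impacts of climate change?", summaries, keyword_answer, join_answers
)
print(final)
```

Because every community is consulted for every query, the number of LLM calls grows with the number of communities, which is worth keeping in mind when estimating cost.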

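Step-by-Step Implementation of GraphRAG with LlamaIndex

You can build your own custom implementation in Python or use a framework such as LangChain or LlamaIndex. In this article, we'll use the baseline code provided on the LlamaIndex website, and I'll explain it in a beginner-friendly way. I also ran into a parsing problem with the original code; I'll explain that issue and how I solved it later on.

Step 1: Install Dependencies

Install the libraries required for the pipeline: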

pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

graspologic: Used for graph algorithms such as hierarchical Leiden for community detection.

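Step 2: Load and Preprocess Data

Load sample news data, which will later be chunked for easier processing. For demonstration purposes, we limit it to 50 samples. Each row (title and text) is converted into a Document object.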

import pandas as pd
from llama_index.core import Document

# Load sample dataset
news = pd.read_csv("https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv")[:50]

# Convert data into LlamaIndex Document objects
documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for _, row in news.iterrows()
]

Step 3: Split Text into Nodes

Use SentenceSplitter to break the documents into manageable chunks.

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

chunk_overlap=20: Makes neighboring chunks overlap slightly so that information at the boundaries is not lost.

Step 4: Configure the LLM, Prompt, and GraphRAG Extractor

Set up the LLM (e.g., GPT-4). This LLM will later analyze the chunks to extract entities and relationships.

import os

from llama_index.llms.openai import OpenAI

# Replace the placeholder with your own key
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
llm = OpenAI(model="gpt-4")

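GraphRAGExtractor takes the LLM configured above, a prompt template that guides the extraction, and a parsing function (parse_fn) that turns the LLM output into structured data. Text chunks (called nodes) are fed into the extractor. For each chunk, the extractor sends the text to the LLM along with the prompt, which instructs the LLM to identify entities, their types, and the relationships between them. The response is parsed by parse_fn, which extracts the entities and relationships. These are then converted into EntityNode objects (for entities) and Relation objects (for relationships), with their descriptions stored in the metadata. The extracted entities and relationships are saved in the text chunk's metadata, ready for building the knowledge graph or running queries.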

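Note: The issue in the original implementation was that parse_fn failed to extract entities and relationships from the LLM-generated response, so the parsed entities and relationships came back empty. This happened because of overly complex and rigid regular expressions that did not match the actual structure of the LLM response, especially the inconsistent formatting and line breaks in the output. To fix this, I simplified parse_fn by replacing the original regex patterns with simpler ones designed to match the key-value structure of the LLM response more reliably. The updated part looks like this: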

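import re
from typing import Any

# The final group is greedy (.+) so the description runs to the end of its line;
# a trailing lazy (.+?)\s* would capture only a single character.
entity_pattern = r'entity_name:\s*(.+?)\s*entity_type:\s*(.+?)\s*entity_description:\s*(.+)'
relationship_pattern = r'source_entity:\s*(.+?)\s*target_entity:\s*(.+?)\s*relation:\s*(.+?)\s*relationship_description:\s*(.+)'

def parse_fn(response_str: str) -> Any:
    # Entities: (name, type, description); relationships: (source, target, relation, description)
    entities = re.findall(entity_pattern, response_str)
    relationships = re.findall(relationship_pattern, response_str)
    return entities, relationships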
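
To see what this kind of key-value parsing returns, here is a small self-contained sketch. The `sample_response` text is a hypothetical LLM output written for this example, and the patterns are a variant in the same style whose last group is a greedy `(.+)`, an adjustment made for this sketch so the description is captured to the end of its line.

```python
import re

ENTITY_PATTERN = (
    r"entity_name:\s*(.+?)\s*entity_type:\s*(.+?)\s*entity_description:\s*(.+)"
)
RELATIONSHIP_PATTERN = (
    r"source_entity:\s*(.+?)\s*target_entity:\s*(.+?)\s*"
    r"relation:\s*(.+?)\s*relationship_description:\s*(.+)"
)

# Hypothetical LLM response in key-value form.
sample_response = """\
entity_name: NASA
entity_type: Organization
entity_description: A space agency responsible for space exploration missions.

source_entity: NASA
target_entity: Spacecraft
relation: launched
relationship_description: NASA launched the spacecraft in 2023.
"""

entities = re.findall(ENTITY_PATTERN, sample_response)
relationships = re.findall(RELATIONSHIP_PATTERN, sample_response)

# A lazy group at the very end of a pattern stops after one character.
lazy_tail = re.findall(r"entity_description:\s*(.+?)\s*", sample_response)

print(entities)
print(relationships)
print(lazy_tail)
```

Each entity comes back as a (name, type, description) tuple and each relationship as a (source, target, relation, description) tuple, so any code that unpacks them needs to use the same field order.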

The prompt template and the GraphRAGExtractor class are kept as is, as shown below:

import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            metadata[
                "entity_description"
            ] = description  # Not used in the current implementation. But will be useful in future work.
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=metadata
            )
            existing_nodes.append(entity_node)

        metadata = node.metadata.copy()
        for triple in entities_relationship:
            # parse_fn yields (source_entity, target_entity, relation, description)
            subj, obj, rel, description = triple
            subj_node = EntityNode(name=subj, properties=metadata)
            obj_node = EntityNode(name=obj, properties=metadata)
            metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj_node.id,
                target_id=obj_node.id,
                properties=metadata,
            )

            existing_nodes.extend([subj_node, obj_node])
            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity")

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

Format each relationship as ("relationship")

3. When finished, output.

-Real Data-
######################
text: {text}
######################
output:"""

kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)

Step 5: Build the Graph Index

PropertyGraphIndex uses kg_extractor to extract entities and relationships from the text and stores them as nodes and edges in the GraphRAGStore.

import re
from llama_index.core.graph_stores import SimplePropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden

from llama_index.core.llms import ChatMessage


class GraphRAGStore(SimplePropertyGraphStore):
    community_summary = {}
    max_cluster_size = 5

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = OpenAI().chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        for node in self.graph.nodes.values():
            nx_graph.add_node(str(node))
        for relation in self.graph.relations.values():
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """Collect detailed information for each node based on their community."""
        community_mapping = {item.node: item.cluster for item in clusters}
        community_info = {}
        for item in clusters:
            cluster_id = item.cluster
            node = item.node
            if cluster_id not in community_info:
                community_info[cluster_id] = []

            for neighbor in nx_graph.neighbors(node):
                if community_mapping[neighbor] == cluster_id:
                    edge_data = nx_graph.get_edge_data(node, neighbor)
                    if edge_data:
                        detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                        community_info[cluster_id].append(detail)
        return community_info

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary

from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=GraphRAGStore(),
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Output:

Extracting paths from text: 100%|██████████| 50/50 [02:51<00:00,  3.43s/it]
Generating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.53s/it]
Generating embeddings: 100%|██████████| 4/4 [00:01<00:00,  2.27it/s]

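Step 6: Detect Communities and Summarize

Use graspologic's hierarchical Leiden algorithm to detect communities and generate summaries. A community is a group of nodes (entities) that are densely connected to each other but only sparsely connected to the rest of the graph. The algorithm optimizes a metric called modularity, which measures the quality of a division of the graph into communities.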
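
For intuition about modularity, the sketch below computes the standard score for an undirected, unweighted graph, Q = Σ_c [L_c / m − (d_c / 2m)²], where m is the number of edges, L_c the number of edges inside community c, and d_c the sum of degrees in c. The `modularity` function and the toy graph are written for this illustration and are not taken from graspologic.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph (resolution 1)."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal_edges = defaultdict(int)  # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)      # d_c: total degree of nodes in community c
    for a, b in edges:
        degree_sum[community_of[a]] += 1
        degree_sum[community_of[b]] += 1
        if community_of[a] == community_of[b]:
            internal_edges[community_of[a]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}  # one community per triangle
bad = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}   # arbitrary split

print(modularity(edges, good))
print(modularity(edges, bad))
```

The partition that follows the two triangles scores higher than the arbitrary one, which is the signal a modularity-based algorithm uses when deciding where to draw community boundaries.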

index.property_graph_store.build_communities()

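Warning: Isolated nodes (nodes with no relationships) are ignored by the Leiden algorithm. This is expected when some nodes do not form meaningful connections, and it produces a warning. So don't panic if you see one.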
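
Before running community detection, it can help to check how many nodes are isolated and whether any relationships were extracted at all. The sketch below does this on plain Python lists; `split_isolated` and the sample data are hypothetical and independent of the GraphRAGStore class.

```python
def split_isolated(nodes, edges):
    """Return (connected, isolated) node sets given all node names and an edge list."""
    connected = {n for edge in edges for n in edge}
    isolated = set(nodes) - connected
    return connected, isolated


# Hypothetical extraction result: four entities but only one relationship.
nodes = ["NASA", "Spacecraft", "Tata Chemicals", "Coinbase"]
edges = [("NASA", "Spacecraft")]

connected, isolated = split_isolated(nodes, edges)
if not edges:
    print("No relationships were extracted; community detection has nothing to cluster.")
else:
    print(f"{len(isolated)} isolated node(s) will not be assigned to a community: {sorted(isolated)}")
```

An empty edge list is worth catching early, since it usually means the extraction or parsing step returned nothing rather than that the data has no relationships.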

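Step 7: Query the Graph

Initialize GraphRAGQueryEngine to query the processed data. When a query is submitted, the engine retrieves the community summaries from the GraphRAGStore. For each summary, it uses the LLM to generate a query-specific answer through the generate_answer_from_summary method. These partial answers are then synthesized into a coherent final response by the aggregate_answers method, in which the LLM merges the different perspectives into one concise output.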

from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for _, community_summary in community_summaries.items()
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response

query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store, llm=llm
)
response = query_engine.query("What are news related to financial sector?")
display(Markdown(f"{response.response}"))

Output:

The majority of the provided summaries and information do not contain any news related to the financial sector. However, there are a few exceptions. Matt Pincus, through his company MUSIC, has made investments in Soundtrack Your Brand, indicating a financial commitment to support the company's growth. Nirmal Bang has given a Buy Rating to Tata Chemicals Ltd. (TTCH), suggesting a positive investment recommendation. Coinbase Global Inc. is involved in a legal conflict with the U.S. Securities and Exchange Commission (SEC) and is also engaged in a financial transaction involving the issuance of 0.50% Convertible Senior Notes. Deutsche Bank has recommended buying shares of Allegiant Travel and SkyWest, indicating promising opportunities in the aviation sector. Lastly, Coinbase Global, Inc. has repurchased 0.50% Convertible Senior Notes due 2026, indicating strategic financial management.

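Wrapping Up

That's all! I hope you enjoyed reading this article. Graph RAG enables you to answer both specific, factual questions and complex, abstract ones by understanding the relationships and structure within your data. However, it is still in its early stages and has limitations, particularly in token usage, which is significantly higher than in traditional RAG. Nevertheless, it's an important development, and I'm personally excited to see what comes next. If you have any questions or suggestions, feel free to share them in the comments section below.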

5 Responses to Building a Graph RAG System: A Step-by-Step Approach

CH.Tseng December 4, 2024 at 7:52 pm

Thank you very much, but I tried your program and got an error on this line:
community_hierarchical_clusters = hierarchical_leiden(

The error message is as follows:

in build_communities
community_hierarchical_clusters = hierarchical_leiden(
File "", line 293, in hierarchical_leiden
File "/home/chtseng/envs/GraphRAG/lib/python3.10/site-packages/graspologic/partition/leiden.py", line 588, in hierarchical_leiden
hierarchical_clusters_native = gn.hierarchical_leiden(
leiden.EmptyNetworkError: EmptyNetworkError

What is the reason for this?
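- James Carmichael December 5, 2024 at 8:27 am
  
  Hi CH.Tseng… The leiden.EmptyNetworkError indicates that the Leiden algorithm was applied to an empty or invalid network. This typically happens when the graph passed to hierarchical_leiden contains no nodes or edges, or when preprocessing has left it in an invalid state. Here is how to troubleshoot and resolve the issue:
  
  ### Troubleshooting Steps
  
  1. **Verify the input graph:**
  - Make sure the input graph is not empty. Check the number of nodes and edges before passing it to hierarchical_leiden.
  
      print("Nodes:", graph.number_of_nodes())
      print("Edges:", graph.number_of_edges())
  
  2. **Check node and edge weights:**
  - If your graph is weighted, verify that all edges have valid weights and that none are missing or None. If weights are required but absent, the function may treat the graph as invalid.
  
  3. **Preprocessing pipeline:**
  - If there is a preprocessing step (e.g., filtering or subsetting), make sure it does not inadvertently remove all edges or nodes.
  
      # Example check of the node and edge lists
      print("Nodes:", list(graph.nodes()))
      print("Edges:", list(graph.edges(data=True)))
  
  4. **Self-loops and isolated nodes:**
  - Some implementations may fail if the graph contains only isolated nodes or self-loops. Filter these out if necessary.
  
      graph.remove_edges_from(list(nx.selfloop_edges(graph)))
      graph.remove_nodes_from(list(nx.isolates(graph)))  # drop isolated nodes
  
  5. **Hierarchical Leiden parameters:**
  - If you are using specific parameters, make sure they are set correctly. For example, confirm that the resolution parameter is appropriate for your data.
  
  6. **Check the graph representation:**
  - Make sure the graph is represented in a compatible format. If you are using networkx, make sure the conversion to the expected format (e.g., a native Graspologic graph object) is correct.
  
  ### Input Validation Example
  
      import networkx as nx
      from graspologic.partition import hierarchical_leiden
  
      # Example graph
      graph = nx.karate_club_graph()
  
      # Check graph properties
      if graph.number_of_nodes() == 0 or graph.number_of_edges() == 0:
          raise ValueError("The input graph is empty. Please provide a valid graph.")
  
      # Run hierarchical Leiden
      try:
          community_hierarchical_clusters = hierarchical_leiden(graph)
          print("Communities:", community_hierarchical_clusters)
      except Exception as e:  # e.g. the EmptyNetworkError from the traceback above
          print("Community detection failed:", e)
  
  ### Common Causes and Fixes
  1. **Empty graph:**
  - Make sure the graph is populated with valid edges and nodes.
  
  2. **Data cleaning:**
  - Fix issues such as duplicate edges, missing edge weights, or isolated nodes.
  
  3. **Graph conversion:**
  - Make sure the graph is converted to the correct format before passing it to the function.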
Martin January 7, 2025 at 5:51 pm

Hi, how can the KG and the summaries be persisted? Currently the code does not store anything.

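nongnongzi March 20, 2025 at 3:39 pm

Hi, thank you for sharing this work!
I have a question about the property_graph_store that is passed to the index.
Consider the following code:

index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=GraphRAGStore(),
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Nothing is passed to GraphRAGStore(), which means no nodes or relations are stored in GraphRAGStore. And in PropertyGraphIndex, the nodes and relations are not inserted into GraphRAGStore either.
So how can the summaries be built and the queries run in the next step?

Thanks.
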
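- James Carmichael March 21, 2025 at 5:41 pm
  
  Great question. You're touching on a subtle but important point about how PropertyGraphIndex and GraphRAGStore work together.
  
  ### Let's break it down step by step
  
  #### 1. **GraphRAGStore()**
  When you initialize GraphRAGStore() without passing any data, it creates an *empty* property graph store. That is true: at that point there are no nodes or relations yet.
  
      graph_store = GraphRAGStore()  # empty at this point
  
  #### 2. **PropertyGraphIndex**
  Now, when you run
  
      index = PropertyGraphIndex(
          nodes=nodes,
          property_graph_store=graph_store,
          kg_extractors=[kg_extractor],
          show_progress=True,
      )
  
  you are supplying nodes, which are typically chunks of unstructured text, and kg_extractor, a knowledge graph extractor that processes those chunks to produce triples (subject, predicate, object).
  
  👉 **What happens inside PropertyGraphIndex?**
  
  The key point is that even though GraphRAGStore starts out empty, **PropertyGraphIndex takes the nodes, applies kg_extractor, and populates the graph store**.
  
  - kg_extractor extracts structured data (triples) from the nodes.
  - These triples are then **inserted into GraphRAGStore inside PropertyGraphIndex**.
  
  So yes, GraphRAGStore() starts out empty, but it is populated while the index is being built.
  
  #### 3. **Querying / Summarization**
  After this step, PropertyGraphIndex has built a graph using the extractor and the graph store. From there, you can:
  
  - Query the index
  - Retrieve relevant triples from GraphRAGStore
  - Build summaries based on traversals of the property graph
  
  #### TL;DR Answer
  ✅ GraphRAGStore() is initially empty
  ✅ PropertyGraphIndex uses nodes and kg_extractors to extract triples
  ✅ These triples are **added** to GraphRAGStore
  ✅ So subsequent queries and summaries work as expected
  
  If you're curious, you can inspect graph_store after building the index to see the extracted nodes and relations, using the same attributes that GraphRAGStore reads in the article:
  
      print(list(graph_store.graph.nodes.values()))
      print(list(graph_store.graph.relations.values()))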
