Building a RAG Pipeline with llama.cpp in Python

By Iván Palomares Carrascosa on April 19, 2025 in Language Models

Image by Editor | Midjourney

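llama.cpp makes it efficient and convenient to run inference with large language models (LLMs) on local devices, especially on CPUs. This article takes that capability a step further, to a complete retrieval-augmented generation (RAG) workflow, with a practical, example-based guide to building a RAG pipeline using this framework in Python.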

Step-by-Step Process

First, we install the necessary packages:

pip install llama-cpp-python
pip install langchain langchain-community sentence-transformers chromadb
pip install pypdf requests pydantic tqdm

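Note that if none of these components were previously installed in your environment, the initial setup may take a few minutes to complete.

With llama.cpp, LangChain, and the other components installed (for example, pypdf for handling PDF files in the document corpus), it is time to import everything we need.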

import os
# Third-party integrations are imported from langchain_community (installed above)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp
import requests
from tqdm import tqdm
import time

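Now for the hands-on part. We first need to download an LLM locally. Although in a real-world scenario you may want a larger model, to keep our example relatively lightweight we will load a comparatively small LLM (a small large language model, which I know sounds like a contradiction!): a quantized Llama 2 7B model, available on Hugging Face.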

model_path = "llama-2-7b-chat.Q4_K_M.gguf"

if not os.path.exists(model_path):
    print(f"Downloading {model_path}...")
    # You may want to replace the model URL by another of your choice
    model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"
    response = requests.get(model_url, stream=True)
    response.raise_for_status()  # stop early instead of saving an error page as the model file
    total_size = int(response.headers.get('content-length', 0))
    
    with open(model_path, 'wb') as f:
        for data in tqdm(response.iter_content(chunk_size=1024), total=total_size//1024):
            f.write(data)
    print("Download complete!")

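Next, we need to set up the other main component of a RAG system: the document base. In this example, we create a mechanism for reading documents in multiple formats, namely .pdf and .txt, and for simplicity we provide a default sample text file that is generated on the fly and saved to our newly created document directory, docs. For extra fun, try loading your own documents.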

os.makedirs("docs", exist_ok=True)

# Sample text for demonstration purposes
with open("docs/sample.txt", "w") as f:
    f.write("""
    Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing tasks. It involves retrieving relevant information from a knowledge base and then 
    using that information to generate more accurate and informed responses.
    
    RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context
    for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.
    
    The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models
    on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible
    with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services.
    """)

documents = []
for file in os.listdir("docs"):
    if file.endswith(".pdf"):
        loader = PyPDFLoader(os.path.join("docs", file))
        documents.extend(loader.load())
    elif file.endswith(".txt"):
        loader = TextLoader(os.path.join("docs", file))
        documents.extend(loader.load())

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = text_splitter.split_documents(documents)

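Notice that after loading the documents, we split them into chunks. This is common practice in RAG systems: it improves retrieval accuracy and ensures the LLM receives manageable inputs that fit within its context window.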
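
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification rather than LangChain's implementation: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line boundaries, whereas the hypothetical sliding_chunks helper below simply cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores the
    separator-aware logic that RecursiveCharacterTextSplitter applies.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Made-up 2,500-character input standing in for a loaded document.
demo_text = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_chunks(demo_text)
print([len(c) for c in demo_chunks])
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.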

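Both LLMs and RAG systems work with numerical representations of text rather than raw text, so we next build a vector store containing embeddings of our text documents. Chroma is a lightweight, open-source vector database for efficiently storing and querying embeddings.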

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
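
At query time, a vector store embeds the question and returns the chunks whose vectors lie closest to it. The sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. The vectors, the top_k helper, and the choice of cosine similarity are illustrative assumptions, not Chroma's internals; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and Chroma's index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=3):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings": made-up numbers, not real model output.
toy_index = [
    ("RAG combines retrieval and generation.", [0.9, 0.1, 0.0]),
    ("llama.cpp runs LLaMA models on CPUs.", [0.1, 0.9, 0.1]),
    ("LocalAI exposes OpenAI-compatible APIs.", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of "What is RAG?"
print(top_k(toy_query, toy_index, k=2))
```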

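Now llama.cpp comes into play to initialize the LLM we downloaded earlier. To do this, we instantiate a LlamaCpp object, passing the model path and other settings such as the temperature and the maximum context length.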

llm = LlamaCpp(
    model_path=model_path,
    temperature=0.7,
    max_tokens=2000,
    n_ctx=4096,
    verbose=False
)
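
The temperature setting controls how sharply peaked the model's next-token distribution is: logits are divided by the temperature before the softmax, so values below 1 favor the most likely token and values above 1 flatten the distribution. The sketch below shows this scaling on made-up logits. It illustrates the general technique rather than llama.cpp's sampling code, which combines temperature with other sampling steps.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a low temperature almost all probability mass lands on the first token, while a higher temperature spreads it across the alternatives, which is why 0.7 is a common middle ground for question answering.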

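We are getting close to the inference demo, with only a few pieces left to put in place. One of them is the RAG prompt template, a neat way to define how the retrieved context and the user query are combined into a single, well-structured input for the LLM at inference time.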

template = """
Answer the question based on the following context:

{context}

Question: {question}
Answer:
"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
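
Conceptually, the template is a string with two named slots that are filled in at query time. The sketch below reproduces that substitution with Python's built-in str.format so that it runs without LangChain; the demo_template name and the context and question values are made up for illustration.

```python
demo_template = """
Answer the question based on the following context:

{context}

Question: {question}
Answer:
"""

# Made-up values standing in for retrieved text and a user query.
filled = demo_template.format(
    context="llama.cpp is a C/C++ implementation of Meta's LLaMA model.",
    question="What is llama.cpp?",
)
print(filled)
```

Ending the prompt with "Answer:" nudges the model to continue with the answer itself rather than restating the question.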

Finally, we put all the components together to create the llama.cpp-based RAG pipeline.

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

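Let's review the building blocks of the RAG pipeline we just created to understand them better.

llm: the LLM downloaded and initialized with llama.cpp.
chain_type: specifies how retrieved documents are combined and sent to the LLM; "stuff" means all retrieved context is injected into the prompt.
retriever: initialized from the vector store and configured to fetch the three most relevant document chunks.
return_source_documents=True: returns information about which document chunks were used to answer the user's question.
chain_type_kwargs={"prompt": prompt}: enables the custom template we defined earlier, which formats the retrieval-augmented input for the LLM.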
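
The "stuff" strategy can be pictured as plain string concatenation: the retrieved chunks are joined into one context block and placed in the prompt ahead of the question. The sketch below is an approximation under stated assumptions; the stuff_prompt helper and the blank-line separator are illustrative choices, not LangChain's code.

```python
def stuff_prompt(chunks, question, separator="\n\n"):
    """Join every retrieved chunk into one context block, then build the prompt.

    Hypothetical helper; the separator is an assumption for illustration.
    """
    context = separator.join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question based on the following context:\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the retriever's output.
retrieved = [
    "RAG retrieves relevant documents for a query.",
    "The documents are passed to the model as extra context.",
]
stuffed = stuff_prompt(retrieved, "What is RAG and how does it work?")
print(stuffed)
```

Because every retrieved chunk is included verbatim, the combined prompt and the generated answer together have to fit within the model's context window (n_ctx=4096 in the configuration above).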

To wrap up and see everything in action, we define and use a pipeline driver function, ask_question(), which runs the RAG pipeline to answer a user's question.

def ask_question(question):
    start_time = time.time()
    result = rag_pipeline.invoke({"query": question})  # invoke() replaces the deprecated direct call
    end_time = time.time()
    
    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    print("\nSource documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Document {i+1}:")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"Content: {doc.page_content[:150]}...\n")

Now let's test our pipeline with some specific questions.

ask_question("What is RAG and how does it work?")
ask_question("What is llama.cpp?")
ask_question("How does LocalAI relate to cloud AI services?")

Results

Question: What is RAG and how does it work?
Answer: RAG is a combination of retrieval-based and generation-based approaches for natural language processing tasks. It involves retrieving relevant information from a knowledge base and using that information to generate more accurate and informed responses. RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.
Time taken: 195.05 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

Document 2:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: What is llama.cpp?
Answer: llama.cpp is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models on consumer hardware without requiring high-end GPUs.
Time taken: 35.61 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

Document 2:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: How does LocalAI relate to cloud AI services?
Answer: LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services. This means that LocalAI allows developers to use their own AI models, trained on their own data, without having to rely on cloud-based services.
Time taken: 182.07 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud ...

Document 2:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

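Wrapping Up

This article demonstrated how to set up and use a local RAG pipeline efficiently with llama.cpp, a popular framework for running inference on existing LLMs locally in a lightweight and portable fashion. You should now be able to apply these newly learned skills in your own projects.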

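6 Responses to Building a RAG Pipeline with llama.cpp in Python

Todd April 19, 2025 at 10:25 pm

This is an interesting article in that it helps you set up a basic RAG pipeline and run inference. The real trick with RAG is how you architect and tune it to deliver relevant insights. In my experience writing RAG pipelines, the pipeline is just the foundation. How you chunk, how much overlap you use, whether you apply techniques such as multi-query or HyDE: all of these are critical to getting good responses. Depending on your dataset and how you use it, this can be very tricky.

- James Carmichael April 20, 2025 at 4:35 am

  Thank you for your feedback, Todd! We greatly appreciate it!

Gold April 20, 2025 at 1:19 pm

Great article. I will copy these steps and try them out soon.

- James Carmichael April 21, 2025 at 7:44 am

  Thank you for your feedback, Gold! Keep us posted on your progress!

Johnnyblaze April 22, 2025 at 4:12 am

Great concept 😎

- James Carmichael April 22, 2025 at 5:39 am

  Thank you for your feedback, Johnnyblaze!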
