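Enhancing LLM Serving with Torch Compiled RAG on AWS Graviton

An earlier post demonstrated how to deploy Llama with TorchServe. Deploying the LLM alone has limitations, such as a lack of up-to-date information and limited domain-specific knowledge. Retrieval Augmented Generation (RAG) is a technique that improves the accuracy and reliability of an LLM by supplying up-to-date, relevant information as context. This blog post shows how to implement RAG alongside an LLM in a microservices-based architecture, which improves scalability and speeds up development. In addition, by running RAG on CPU with AWS Graviton, customers can use compute resources efficiently and ultimately save cost.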

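Problem

Consider the simple design shown in Figure 1, in which a user queries a TorchServe endpoint serving Llama 3 (Llama3-8b-instruct). Instructions for deploying this endpoint can be found at this link. The model is deployed without quantization on NVIDIA GPUs (A10G x4), available on AWS as a g5.12xlarge instance.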

Figure 1: LLM Deployment

Suppose we want to find out what's new with Llama 3.1, so we send the following query to the TorchServe Llama endpoint.

Question: What's new with Llama 3.1?

The response returned by the model describes a data visualization tool called Llama 3.1, which is not what we expected.

Answer:  (Updated)
We've been busy bees in the Llama office, and we're excited to share the latest updates with you!
Llama 3.1 brings a bunch of new features and improvements to make your workflow even smoother and more efficient. Here are some of the highlights:
**New Features:**
1. **Customizable Columns**: You can now customize the columns in your Llama tables to fit your specific needs. This includes adding, removing, and rearranging columns, as well as setting default values for certain columns.
2. **Advanced Filtering**: Llama 3.1 introduces advanced filtering capabilities, allowing you to filter your data using a variety of conditions, such as date ranges, text matches, and more.
3. **Conditional Formatting**: You can now apply conditional formatting to your data, making it easier to visualize and analyze your results.
4. **Improved Data Import**: We've streamlined the data import process, making it easier to import data from various sources, including CSV

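Retrieval Augmented Generation

Large Language Models (LLMs) such as Llama are good at many complex text generation tasks. However, when LLMs are used for a specific domain, they have limitations such as:

Outdated information: There may be advances in the domain that the model is unaware of, because it was trained at an earlier date (the knowledge cutoff date).
Lack of domain-specific knowledge: When used for domain-specific tasks, an LLM may give inaccurate answers because the domain-specific knowledge was not readily available to it.

Retrieval Augmented Generation (RAG) is a technique used to address these limitations. RAG improves the accuracy of an LLM by augmenting it with up-to-date, relevant information supplied along with the query. It does this by splitting the data sources into chunks of a specified size, indexing those chunks, and retrieving the relevant chunks based on the query. The retrieved information is used as context to augment the query sent to the LLM.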
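The flow described above (split, index, retrieve, augment) can be sketched in a few lines of plain Python. This is a toy illustration only: it scores chunks by shared words instead of using an embedding model and a vector database, and the function names, the sample text, and the prompt layout are all assumptions made for the example rather than anything from the deployment in this post.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def build_index(chunks):
    """Pair every chunk with a bag-of-words count (a stand-in for an embedding)."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def retrieve(index, query, k):
    """Return the k chunks that share the most words with the query."""
    query_words = Counter(tokenize(query))
    ranked = sorted(
        index,
        key=lambda entry: sum((entry[1] & query_words).values()),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


def augment(query, context_chunks):
    """Prepend the retrieved chunks to the query as context."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Toy data source, invented for illustration only
source = (
    "The cafeteria opens at eight. "
    "The library closes at nine on weekdays. "
    "Parking is free on Sundays."
)
query = "When does the library close on weekdays?"

index = build_index(split_into_chunks(source, chunk_size=6))
top = retrieve(index, query, k=1)
prompt = augment(query, top)
print(prompt)
```

A real deployment replaces the word-count scoring with dense embeddings, but the four stages stay the same.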

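LangChain is a popular framework for building LLM applications with RAG.

While LLM inference requires expensive ML accelerators, a RAG endpoint can be deployed on cost-effective CPUs and still meet the latency requirements of the use case. Offloading the RAG endpoint to CPU also enables a microservices architecture that decouples the LLM from the business infrastructure and lets each scale independently. In the following sections, we demonstrate how to deploy RAG on linux-aarch64 based AWS Graviton. We also show how to get higher throughput from the RAG endpoint using torch.compile.

A basic RAG workflow consists of two steps: indexing and retrieval.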

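Indexing

The context provided in this example is a web URL. We load the content at the URL and recursively include its child pages. The documents are split into smaller chunks for efficient processing. These chunks are encoded with an embedding model and stored in a vector database, which enables efficient search and retrieval. We use torch.compile on the embedding model to speed up inference. You can read more about using torch.compile with AWS Graviton here.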

from typing import Any, List

import torch
from bs4 import BeautifulSoup as Soup
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from transformers import AutoModel, AutoTokenizer

# Enable AWS Graviton specific torch.compile optimizations
import torch._inductor.config as config
config.cpp.weight_prepack = True
config.freezing = True

class CustomEmbedding(HuggingFaceEmbeddings):
    tokenizer: Any

    class Config:
        arbitrary_types_allowed = True

    def __init__(self, tokenizer: Any, **kwargs: Any):
        """Initialize the sentence_transformer."""
        super().__init__(**kwargs)

        # Load model from HuggingFace Hub
        self.tokenizer = tokenizer
        self.client = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')
        # Compile the model so repeated embedding calls run faster
        self.client = torch.compile(self.client)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Compute doc embeddings using a HuggingFace transformer model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        # Newlines are replaced with spaces before tokenization
        texts = list(map(lambda x: x.replace("\n", " "), texts))

        # Tokenize sentences
        encoded_input = self.tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

        embeddings = self.client(
           **encoded_input
        )
        embeddings = embeddings.pooler_output.detach().numpy()

        return embeddings.tolist()


# 1. Load the url and its child pages
url="https://huggingface.co/blog/llama3"
loader = RecursiveUrlLoader(
    url=url, max_depth=3, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()

# 2. Split the document into chunks with a specified chunk size
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
all_splits = text_splitter.split_documents(docs)

# 3. Store the document into a vector store with a specific embedding model
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = CustomEmbedding(tokenizer)

vectorstore = FAISS.from_documents(all_splits, model)
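
To make the chunk_size and chunk_overlap parameters used above more concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a simplified stand-in, not the algorithm used by RecursiveCharacterTextSplitter, which additionally tries to split on separators such as paragraph and line breaks. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

A larger overlap keeps more shared context between neighboring chunks at the cost of a larger index.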

Retrieval

For every query sent by the user, we run a similarity search for the query in the vector database and fetch the N (here N=5) closest document chunks.

docs = vectorstore.similarity_search(query, k=5)
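
Conceptually, the similarity search embeds the query and returns the stored vectors closest to it. The brute-force sketch below illustrates that idea with plain Python lists. It assumes Euclidean (L2) distance as the metric, which depends on how the FAISS index is configured, and the helper names are hypothetical. FAISS itself uses optimized index structures rather than a Python loop.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_k(query_vec, doc_vecs, k):
    """Return the indices of the k stored vectors closest to query_vec."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: l2_squared(query_vec, doc_vecs[i]),
    )
    return order[:k]


# Toy 2-dimensional "embeddings", invented for illustration
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
closest = nearest_k([1.0, 0.9], doc_vecs, k=2)
print(closest)
```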

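Prompt Engineering

A typical implementation of RAG with an LLM uses LangChain to chain the RAG and LLM pipelines and calls the invoke method on the chain with the query.

The published TorchServe Llama endpoint example expects a text prompt as input and uses HuggingFace APIs to process the query. To make the RAG design compatible, we need to return a text prompt from the RAG endpoint.

This section describes how we engineer the prompt that the Llama endpoint expects, including the relevant context. Under the hood, LangChain has a PromptTemplate for Llama. By running the code above with the following debug statements, we can determine the prompt that is sent to Llama.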

import langchain
langchain.debug = True

We extract the text from the docs returned in the retrieval section and engineer the final prompt for Llama as follows:

from langchain.prompts import PromptTemplate
from langchain_core.prompts import format_document
question="What's new with Llama 3?"

doc_prompt = PromptTemplate.from_template("{page_content}")
context = ""
for doc in docs:
    context += f"\n{format_document(doc, doc_prompt)}\n"

prompt = f"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."\
         f"\n\n{context}\n\nQuestion: {question}"\
         f"\nHelpful Answer:"

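AWS Graviton Specific Optimizations

To take advantage of the performance optimizations for RAG on AWS Graviton, we can set the following options. Details on these optimizations are given in this blog, and there is also a tutorial that walks through them.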

export TORCH_MKLDNN_MATMUL_MIN_DIM=1024
export LRU_CACHE_CAPACITY=1024
export THP_MEM_ALLOC_ENABLE=1
export DNNL_DEFAULT_FPMATH_MODE=BF16

To accurately measure the performance gain of torch.compile over PyTorch eager, we also set:

export OMP_NUM_THREADS=1
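
If the service is started from a Python launcher rather than a shell, the same four Graviton settings can be applied from code. The sketch below assumes the variables must be in place before torch is imported, since such settings are typically read when the libraries initialize. The helper name apply_env is hypothetical.

```python
import os

# Same values as the shell exports above
GRAVITON_ENV = {
    "TORCH_MKLDNN_MATMUL_MIN_DIM": "1024",
    "LRU_CACHE_CAPACITY": "1024",
    "THP_MEM_ALLOC_ENABLE": "1",
    "DNNL_DEFAULT_FPMATH_MODE": "BF16",
}


def apply_env(settings, environ=os.environ):
    """Set each variable unless it is already set; return the values in effect."""
    for name, value in settings.items():
        environ.setdefault(name, value)
    return {name: environ[name] for name in settings}


# Call this before `import torch` (assumption: the settings are read at initialization)
effective = apply_env(GRAVITON_ENV)
print(effective)
```

Using setdefault means a value exported in the shell still takes precedence over the defaults in the script.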

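Deploying RAG

Although TorchServe supports multi-model endpoints on the same compute instance, we deploy the RAG endpoint on AWS Graviton. Because RAG is not very compute intensive, a CPU instance can be used for the deployment, which makes for a cost-effective solution.

To deploy RAG with TorchServe, we need the following: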

requirements.txt

langchain
langchain_community
sentence-transformers
faiss-cpu
bs4

This can be used together with install_py_dep_per_model=true in config.properties to install the required libraries dynamically.

rag-config.yaml

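We pass the parameters used for indexing and retrieval in rag-config.yaml, which is used to create the MAR file. Because these parameters are configurable, we can set up multiple RAG endpoints for different tasks by using different yaml files.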

# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
responseTimeout: 120
handler:
    url_to_scrape: "https://huggingface.co/blog/llama3"
    chunk_size: 1000
    chunk_overlap: 0
    model_path: "model/models--sentence-transformers--all-mpnet-base-v2/snapshots/84f2bcc00d77236f9e89c8a360a00fb1139bf47d"
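
Since each endpoint takes its parameters from the handler section of this file, it can help to validate them once at startup. The sketch below operates on a plain dictionary shaped like the YAML above. The function name read_handler_params and the specific checks are assumptions added for illustration, not part of the TorchServe example.

```python
REQUIRED_KEYS = ("url_to_scrape", "chunk_size", "chunk_overlap", "model_path")


def read_handler_params(model_yaml_config):
    """Extract and sanity-check the handler parameters from the parsed YAML."""
    handler = model_yaml_config.get("handler", {})
    missing = [key for key in REQUIRED_KEYS if key not in handler]
    if missing:
        raise KeyError(f"missing handler keys: {missing}")

    chunk_size = int(handler["chunk_size"])
    chunk_overlap = int(handler["chunk_overlap"])
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    return {
        "url_to_scrape": str(handler["url_to_scrape"]),
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "model_path": str(handler["model_path"]),
    }


# Dictionary mirroring the YAML above; the model path is a placeholder
example_config = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "responseTimeout": 120,
    "handler": {
        "url_to_scrape": "https://huggingface.co/blog/llama3",
        "chunk_size": 1000,
        "chunk_overlap": 0,
        "model_path": "model/placeholder",
    },
}
params = read_handler_params(example_config)
print(params["chunk_size"], params["chunk_overlap"])
```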

rag_handler.py

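We define a handler file containing a class that derives from BaseHandler. This class needs to define four methods: initialize, preprocess, inference and postprocess. The indexing portion is defined in the initialize method, the retrieval portion is in the inference method, and the prompt engineering portion is in the postprocess method. We use the timed function to determine how long each of these methods takes to run.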

import torch
import transformers
from bs4 import BeautifulSoup as Soup
from hf_custom_embeddings import CustomEmbedding
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import format_document

from ts.torch_handler.base_handler import BaseHandler


class RAGHandler(BaseHandler):
    """
    RAG handler class retrieving documents from a url, encoding & storing in a vector database.
    For a given query, it returns the closest matching documents.
    """

    def __init__(self):
        super(RAGHandler, self).__init__()
        self.vectorstore = None
        self.initialized = False
        self.N = 5

    @torch.inference_mode
    def initialize(self, ctx):
        url = ctx.model_yaml_config["handler"]["url_to_scrape"]
        chunk_size = ctx.model_yaml_config["handler"]["chunk_size"]
        chunk_overlap = ctx.model_yaml_config["handler"]["chunk_overlap"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]

        loader = RecursiveUrlLoader(
            url=url, max_depth=3, extractor=lambda x: Soup(x, "html.parser").text
        )
        docs = loader.load()

        # Split the document into chunks with a specified chunk size
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        all_splits = text_splitter.split_documents(docs)

        # Store the document into a vector store with a specific embedding model
        self.vectorstore = FAISS.from_documents(
            all_splits, CustomEmbedding(model_path=model_path)
        )

    def preprocess(self, requests):
        assert len(requests) == 1, "Expecting batch_size = 1"
        inputs = []
        for request in requests:
            input_text = request.get("data") or request.get("body")
            if isinstance(input_text, (bytes, bytearray)):
                input_text = input_text.decode("utf-8")
            inputs.append(input_text)
        return inputs[0]

    @torch.inference_mode
    def inference(self, data, *args, **kwargs):
        searchDocs = self.vectorstore.similarity_search(data, k=self.N)
        return (searchDocs, data)

    def postprocess(self, data):
        docs, question = data[0], data[1]
        doc_prompt = PromptTemplate.from_template("{page_content}")
        context = ""
        for doc in docs:
            context += f"\n{format_document(doc, doc_prompt)}\n"

        prompt = (
            f"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."
            f"\n\n{context}\n\nQuestion: {question}"
            f"\nHelpful Answer:"
        )
        return [prompt]

Benchmarking Performance

We use the ab (ApacheBench) tool to measure the performance of the RAG endpoint:

python benchmarks/auto_benchmark.py --input /home/ubuntu/serve/examples/usecases/RAG_based_LLM_serving/benchmark_profile.yaml --skip true

We repeat the runs with combinations of OMP_NUM_THREADS and PyTorch eager or torch.compile.
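
For a quick local comparison outside the TorchServe benchmark, two callables (for example an eager and a compiled embedding function) can be timed with a small harness like the one below. This is a minimal sketch and not the measurement method used in this post. The function names are hypothetical, and the workload is a stand-in. The warm-up calls matter because torch.compile does its compilation on the first invocations.

```python
import time


def measure_throughput(fn, n_calls, warmup=3):
    """Return calls per second for fn, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up: triggers any one-time setup or compilation
    start = time.perf_counter()
    for _ in range(n_calls):
        fn()
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against zero
    return n_calls / elapsed


def relative_gain(baseline, candidate):
    """Percentage improvement of candidate over baseline throughput."""
    return (candidate - baseline) * 100.0 / baseline


# Stand-in workload; replace with the real embedding call when measuring
throughput = measure_throughput(lambda: sum(range(1000)), n_calls=100)
print(f"{throughput:.0f} calls/s")
```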

Results

We observed the following throughput on an AWS EC2 m7g.4xlarge instance:

RAG Throughput

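We observe that torch.compile improves the RAG endpoint throughput by 35%. The scale of the throughput (eager or compiled) shows that deploying RAG on a CPU device is practical alongside an LLM deployed on a GPU instance. The RAG endpoint will not become the bottleneck in an LLM deployment.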

RAG + LLM Deployment

Figure 2 shows the system architecture of the end-to-end solution for RAG-based LLM serving.

Figure 2: RAG + LLM Deployment

The steps for the full deployment are given in the deployment guide.

The following code snippet chains the RAG endpoint with the Llama endpoint:

import requests

prompt="What's new with Llama 3.1?"

RAG_EP = "http://<RAG Endpoint IP Address>:8080/predictions/rag"
LLAMA_EP = "http://<LLAMA Endpoint IP Address>:8080/predictions/llama3-8b-instruct"
# Get response from RAG
response = requests.post(url=RAG_EP, data=prompt)
# Get response from Llama
response = requests.post(url=LLAMA_EP, data=response.text.encode('utf-8'))
print(f"Question: {prompt}")
print(f"Answer: {response.text}")
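
The same two-step chain can be wrapped in a function with a timeout and a basic check on the intermediate prompt. The sketch below uses only the standard library and accepts the HTTP call as a parameter so it can be exercised without live endpoints. The names http_post and answer_with_rag, and the 120-second default timeout, are assumptions made for the example.

```python
import urllib.request


def http_post(url, body, timeout=120.0):
    """POST raw bytes and return the response text; raises on HTTP errors."""
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def answer_with_rag(question, rag_url, llm_url, post=http_post):
    """Send the question to the RAG endpoint, then its prompt to the LLM endpoint."""
    prompt = post(rag_url, question.encode("utf-8"))
    if not prompt.strip():
        raise RuntimeError("RAG endpoint returned an empty prompt")
    return post(llm_url, prompt.encode("utf-8"))


# Offline demonstration with a stand-in for the network call
def fake_post(url, body):
    return "PROMPT WITH CONTEXT" if url.endswith("/rag") else "ANSWER"


print(answer_with_rag("What's new?", "http://host/predictions/rag",
                      "http://host/predictions/llm", post=fake_post))
```

Keeping the HTTP call injectable makes it easy to test the chaining logic separately from the two deployed services.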

Example Outputs

Question: What's new with Llama 3.1?

Answer:  Llama 3.1 has a large context length of 128K tokens, multilingual capabilities, tool usage capabilities, a very large dense model of 405 billion parameters, and a more permissive license. It also introduces six new open LLM models based on the Llama 3 architecture, and continues to use Grouped-Query Attention (GQA) for efficient representation. The new tokenizer expands the vocabulary size to 128,256, and the 8B version of the model now uses GQA. The license allows using model outputs to improve other LLMs.

Question: What's new with Llama 2?

Answer:  There is no mention of Llama 2 in the provided context. The text only discusses Llama 3.1 and its features. Therefore, it is not possible to determine what is new with Llama 2. I don't know.

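Conclusion

In this blog, we showed how to deploy a RAG endpoint with TorchServe, increase its throughput with torch.compile, and improve the responses generated by the Llama endpoint. With the architecture described in Figure 2, we can reduce LLM hallucinations.
We also showed how the RAG endpoint can be deployed on CPU with AWS Graviton while the Llama endpoint remains deployed on a GPU. This microservices-based RAG solution uses compute resources efficiently and offers potential cost savings for customers.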