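Introduction to LLMs in ExecuTorch¶

Welcome to the LLM manual! This manual provides a practical example of using ExecuTorch to onboard your own large language models (LLMs). Our primary goal is to offer a clear and concise guide to integrating our system with your own LLMs.

Please note that this project is intended as a demonstration, not as a fully functional example with optimal performance. As a result, certain components such as the sampler and the tokenizer are provided in minimal versions purely for demonstration. Consequently, the results generated by the model may vary and may not always be optimal.

We encourage users to treat this project as a starting point and adapt it to their specific needs, which includes creating your own versions of the tokenizer, sampler, acceleration backends, and other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch.

To deploy Llama with optimal performance, see the Llama guide.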

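Table of Contents¶

Prerequisites
Hello World Example
Quantization
Using Mobile Acceleration
Debugging and Profiling
How to Use Custom Kernels
How to Build Mobile Apps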

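Prerequisites¶

To follow this guide, you need to clone the ExecuTorch repository and install its dependencies. ExecuTorch recommends Python 3.10 and Conda for managing your environment. Conda is not required, but be aware that you may need to replace python/pip with python3/pip3 depending on your environment.

conda

See the miniconda installation instructions.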

# Create a directory for this example.
mkdir et-nanogpt
cd et-nanogpt

# Clone the ExecuTorch repository and submodules.
mkdir third-party
git clone -b release/0.4 https://github.com/pytorch/executorch.git third-party/executorch
cd third-party/executorch
git submodule update --init

# Create a conda environment and install requirements.
conda create -yn executorch python=3.10.0
conda activate executorch
./install_requirements.sh

cd ../..

pyenv-virtualenv

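See the pyenv-virtualenv installation instructions.

Importantly, if pyenv is installed through brew, it is not automatically enabled in the terminal, which leads to errors. Run the following commands to enable it. See the pyenv-virtualenv installation guide above for how to add them to your .bashrc or .zshrc so that you do not need to run these commands manually.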

eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

# Create a directory for this example.
mkdir et-nanogpt
cd et-nanogpt

pyenv install -s 3.10
pyenv virtualenv 3.10 executorch
pyenv activate executorch

# Clone the ExecuTorch repository and submodules.
mkdir third-party
git clone -b release/0.4 https://github.com/pytorch/executorch.git third-party/executorch
cd third-party/executorch
git submodule update --init

# Install requirements.
PYTHON_EXECUTABLE=python ./install_requirements.sh

cd ../..

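For more information, see Setting Up ExecuTorch.

Running a Large Language Model Locally¶

This example uses Karpathy's nanoGPT, which is a minimal implementation of GPT-2 124M. The guide applies to other language models as well, because ExecuTorch is model-agnostic.

There are two steps to running a model with ExecuTorch:

1. Export the model. This step preprocesses it into a format suitable for runtime execution.
2. At runtime, load the model file and run it with the ExecuTorch runtime.

The export step happens ahead of time, typically as part of the application build or when the model changes. The resulting .pte file is distributed with the application. At runtime, the application loads the .pte file and passes it to the ExecuTorch runtime.

Step 1. Exporting to ExecuTorch¶

Exporting takes a PyTorch model and converts it into a format that can run efficiently on consumer devices.

For this example, you will need the nanoGPT model and the corresponding tokenizer vocabulary.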

curl

curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O
curl https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O

wget

wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py
wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json

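Converting the model into a format optimized for standalone execution takes two steps. First, use PyTorch's export function to convert the PyTorch model into a platform-independent intermediate representation. Then use ExecuTorch's to_edge and to_executorch methods to prepare the model for on-device execution. This creates a .pte file that a desktop or mobile application can load at runtime.

Create a file named export_nanogpt.py with the following contents: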

# export_nanogpt.py

import torch

from executorch.exir import EdgeCompileConfig, to_edge
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export, export_for_training

from model import GPT

# Load the model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), )

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size].
# See https://pytorch.dev.org.tw/executorch/main/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
edge_manager = to_edge(traced_model, compile_config=edge_config)
et_program = edge_manager.to_executorch()

# Save the ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

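To export, run the script with python export_nanogpt.py (or python3, as appropriate for your environment). It generates a nanogpt.pte file in the current directory.

For more information, see Exporting to ExecuTorch and torch.export.

Step 2. Invoking the Runtime¶

ExecuTorch provides a set of runtime APIs and types for loading and running models.

Create a file named main.cpp with the following contents: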

// main.cpp

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

#include "basic_sampler.h"
#include "basic_tokenizer.h"

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>
#include <executorch/runtime/core/evalue.h>
#include <executorch/runtime/core/exec_aten/exec_aten.h>
#include <executorch/runtime/core/result.h>

using executorch::aten::ScalarType;
using executorch::aten::Tensor;
using executorch::extension::from_blob;
using executorch::extension::Module;
using executorch::runtime::EValue;
using executorch::runtime::Result;

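Model inputs and outputs take the form of tensors. A tensor can be thought of as a multi-dimensional array. The ExecuTorch EValue class provides a wrapper around tensors and other ExecuTorch data types.

Since the LLM generates one token at a time, the driver code needs to invoke the model repeatedly, building the output token by token. Each generated token is passed as input to the next run.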

// main.cpp

// The value of the gpt2 `<|endoftext|>` token.
#define ENDOFTEXT_TOKEN 50256

std::string generate(
    Module& llm_model,
    std::string& prompt,
    BasicTokenizer& tokenizer,
    BasicSampler& sampler,
    size_t max_input_length,
    size_t max_output_length) {
  // Convert the input text into a list of integers (tokens) that represents it,
  // using the string-to-token mapping that the model was trained on. Each token
  // is an integer that represents a word or part of a word.
  std::vector<int64_t> input_tokens = tokenizer.encode(prompt);
  std::vector<int64_t> output_tokens;

  for (auto i = 0u; i < max_output_length; i++) {
    // Convert the input_tokens from a vector of int64_t to EValue. EValue is a
    // unified data type in the ExecuTorch runtime.
    auto inputs = from_blob(
        input_tokens.data(),
        {1, static_cast<int>(input_tokens.size())},
        ScalarType::Long);

    // Run the model. It will return a tensor of logits (unnormalized log-probabilities).
    auto logits_evalue = llm_model.forward(inputs);

    // Convert the output logits from EValue to std::vector, which is what the
    // sampler expects.
    Tensor logits_tensor = logits_evalue.get()[0].toTensor();
    std::vector<float> logits(
        logits_tensor.data_ptr<float>(),
        logits_tensor.data_ptr<float>() + logits_tensor.numel());

    // Sample the next token from the logits.
    int64_t next_token = sampler.sample(logits);

    // Break if we reached the end of the text.
    if (next_token == ENDOFTEXT_TOKEN) {
      break;
    }

    // Add the next token to the output.
    output_tokens.push_back(next_token);

    std::cout << tokenizer.decode({next_token});
    std::cout.flush();

    // Update next input.
    input_tokens.push_back(next_token);
    if (input_tokens.size() > max_input_length) {
      input_tokens.erase(input_tokens.begin());
    }
  }

  std::cout << std::endl;

  // Convert the output tokens into a human-readable string.
  std::string output_string = tokenizer.decode(output_tokens);
  return output_string;
}
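
The control flow of generate() can also be read as a short Python sketch. This is an illustration only: toy_model and argmax are hypothetical stand-ins for the exported model and the sampler, and the five-token vocabulary is made up so that the loop can run without ExecuTorch.

```python
def generate(model, prompt_tokens, sample, max_input_length, max_output_length, end_token):
    """Autoregressive loop: feed the tokens, pick one more, append, repeat."""
    tokens = list(prompt_tokens)
    output = []
    for _ in range(max_output_length):
        logits = model(tokens)          # one score per vocabulary entry
        next_token = sample(logits)
        if next_token == end_token:     # stop at the end-of-text token
            break
        output.append(next_token)
        tokens.append(next_token)
        if len(tokens) > max_input_length:
            tokens.pop(0)               # drop the oldest token to stay within the window
    return output


VOCAB_SIZE = 5
END_TOKEN = 4


def toy_model(tokens):
    # Hypothetical model: always favors "last token + 1".
    logits = [0.0] * VOCAB_SIZE
    logits[(tokens[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits


def argmax(logits):
    return max(range(len(logits)), key=logits.__getitem__)


result = generate(toy_model, [0], argmax, max_input_length=3,
                  max_output_length=10, end_token=END_TOKEN)
print(result)  # [1, 2, 3]
```

As in the C++ code, the input window slides once it exceeds max_input_length, and generation stops early when the end-of-text token is sampled.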

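The Module class handles loading the .pte file and preparing it for execution.

The tokenizer is responsible for converting the human-readable string form of the prompt into the numerical form the model expects. To do so, the tokenizer associates short substrings with given token IDs. Tokens can be thought of as representing words or parts of words, though in practice they may be arbitrary sequences of characters.

The tokenizer loads the vocabulary from a file that contains the mapping between each token ID and the text it represents. Call tokenizer.encode() and tokenizer.decode() to convert between string and token representations.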
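
To make the encode/decode round trip concrete, here is a minimal Python sketch of a vocabulary-based tokenizer. It uses greedy longest-match lookup over a made-up eight-entry vocabulary; this is a simplification chosen for illustration, not the byte-pair encoding that GPT-2 uses, and ToyTokenizer is a hypothetical class rather than the one in basic_tokenizer.h.

```python
class ToyTokenizer:
    """Maps substrings to token IDs and back using a fixed vocabulary."""

    def __init__(self, vocab):
        self.token_to_id = dict(vocab)
        self.id_to_token = {i: t for t, i in vocab.items()}
        self.max_len = max(len(t) for t in vocab)

    def encode(self, text):
        ids = []
        pos = 0
        while pos < len(text):
            # Try the longest candidate first, then shrink.
            for length in range(min(self.max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + length]
                if piece in self.token_to_id:
                    ids.append(self.token_to_id[piece])
                    pos += length
                    break
            else:
                raise ValueError(f"no token covers {text[pos]!r}")
        return ids

    def decode(self, ids):
        return "".join(self.id_to_token[i] for i in ids)


# Hypothetical vocabulary; a real vocab.json has tens of thousands of entries.
vocab = {"Hello": 0, " world": 1, "!": 2, "H": 3, "e": 4, "l": 5, "o": 6, " ": 7}
tokenizer = ToyTokenizer(vocab)

ids = tokenizer.encode("Hello world!")
print(ids)                    # [0, 1, 2]
print(tokenizer.decode(ids))  # Hello world!
```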

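The sampler is responsible for selecting the next token based on the logits (unnormalized log-probabilities) output by the model. The LLM returns a logit value for each possible next token, and the sampler chooses which token to use according to some strategy. The simplest approach, used here, is to take the token with the highest logit value.

Samplers may offer configurable options, such as an adjustable amount of randomness in the output selection, penalties for repeated tokens, and biases that prioritize or de-prioritize specific tokens.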
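
The two strategies mentioned above, greedy selection and selection with adjustable randomness, can be sketched in Python as follows. The function names and the example logits are assumptions made for illustration; basic_sampler.h may be implemented differently.

```python
import math
import random


def greedy_sample(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_sample(logits, temperature=1.0, rng=random):
    """Draw a token index at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [1.0, 3.5, 0.2, 2.9]
print(greedy_sample(logits))  # 1
print(temperature_sample(logits, temperature=0.8, rng=random.Random(0)))
```

As the temperature approaches zero, temperature sampling behaves like greedy selection; higher temperatures spread the choice across more tokens.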

// main.cpp

int main() {
  // Set up the prompt. This provides the seed text for the model to elaborate.
  std::cout << "Enter model prompt: ";
  std::string prompt;
  std::getline(std::cin, prompt);

  // The tokenizer is used to convert between tokens (used by the model) and
  // human-readable strings.
  BasicTokenizer tokenizer("vocab.json");

  // The sampler is used to sample the next token from the logits.
  BasicSampler sampler = BasicSampler();

  // Load the exported nanoGPT program, which was generated via the previous
  // steps.
  Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);

  const auto max_input_tokens = 1024;
  const auto max_output_tokens = 30;
  std::cout << prompt;
  generate(
      model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
}

Finally, download the following files into the same directory as main.cpp:

curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_sampler.h
curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_tokenizer.h

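To learn more, see the Runtime APIs Tutorial.

Building and Running¶

ExecuTorch uses the CMake build system. To compile and link against the ExecuTorch runtime, include the ExecuTorch project via add_subdirectory and link against executorch and the additional dependencies.

Create a file named CMakeLists.txt with the following contents: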

# CMakeLists.txt

cmake_minimum_required(VERSION 3.19)
project(nanogpt_runner)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# Set options for executorch build.
option(EXECUTORCH_ENABLE_LOGGING "" ON)
option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)

# Include the executorch subdirectory.
add_subdirectory(
  ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch
  ${CMAKE_BINARY_DIR}/executorch
)

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
  nanogpt_runner
  PRIVATE executorch
          extension_module_static # Provides the Module class
          extension_tensor # Provides the TensorPtr class
          optimized_native_cpu_ops_lib # Provides baseline cross-platform
                                       # kernels
)

At this point, the working directory should contain the following files:

CMakeLists.txt
main.cpp
basic_tokenizer.h
basic_sampler.h
export_nanogpt.py
model.py
vocab.json
nanogpt.pte

If all of these are present, you can now build and run:

./install_requirements.sh --clean
(mkdir cmake-out && cd cmake-out && cmake ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

You should see the message:

Enter model prompt:

Type some seed text for the model and press Enter. Here we use "Hello world!" as an example prompt:

Enter model prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in

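At this point, it is likely to run very slowly. This is because ExecuTorch has not been told to optimize for specific hardware (delegation), and because it performs all calculations in 32-bit floating point (no quantization).

Delegation¶

While ExecuTorch provides a portable, cross-platform implementation of all operators, it also provides specialized backends for a number of different targets. These include, but are not limited to, x86 and ARM CPU acceleration via the XNNPACK backend, Apple acceleration via the Core ML backend and the Metal Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend.

Because optimizations are specific to a given backend, each pte file is specific to the backend(s) targeted at export. To support multiple devices, such as XNNPACK acceleration for Android and Core ML for iOS, export a separate PTE file for each backend.

To delegate to a backend at export time, ExecuTorch provides the to_backend() function on the EdgeProgramManager object, which takes a backend-specific partitioner object. The partitioner is responsible for finding the parts of the computation graph that the target backend can accelerate, and to_backend() delegates the matched parts to that backend for acceleration and optimization. Any portions of the computation graph that are not delegated are executed by the ExecuTorch operator implementations.

To delegate the exported model to a specific backend, first import its partitioner and its edge compile config from the ExecuTorch codebase, then call to_backend with an instance of the partitioner on the EdgeProgramManager object created by the to_edge function.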
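
The idea of partitioning can be illustrated with a toy Python sketch. It is a conceptual simplification only: real partitioners operate on a computation graph rather than a flat list, and the operator names and the supported set below are hypothetical, not XNNPACK's actual coverage.

```python
def partition(ops, supported):
    """Group consecutive ops into 'delegated' or 'portable' segments."""
    segments = []
    for op in ops:
        kind = "delegated" if op in supported else "portable"
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(op)     # extend the current segment
        else:
            segments.append((kind, [op]))  # start a new segment
    return segments


# Hypothetical operator sequence and backend capability set.
ops = ["view_copy", "addmm", "where", "softmax", "add"]
supported = {"view_copy", "addmm", "softmax", "add"}

segments = partition(ops, supported)
for kind, names in segments:
    print(kind, names)
```

Each delegated segment would be handed to the backend as a unit, while portable segments fall back to the default operator implementations. An unsupported operator in the middle of a model therefore splits one large delegated region into several smaller ones.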

Here is an example of how to delegate nanoGPT to XNNPACK (for instance, if you are deploying to an Android phone):

# export_nanogpt.py

# Load partitioner for Xnnpack backend
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Model to be delegated to specific backend should use specific edge compile config
from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
from executorch.exir import EdgeCompileConfig, to_edge

import torch
from torch.export import export
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export_for_training

from model import GPT

# Load the nanoGPT model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
    torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
)

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we cap the 0th model input's 1st dimension at
# model.config.block_size - 1.
# See https://pytorch.dev.org.tw/executorch/main/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
# To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config
edge_config = get_xnnpack_edge_compile_config()
edge_manager = to_edge(traced_model, compile_config=edge_config)

# Delegate exported model to Xnnpack backend by invoking `to_backend` function with Xnnpack partitioner.
edge_manager = edge_manager.to_backend(XnnpackPartitioner())
et_program = edge_manager.to_executorch()

# Save the Xnnpack-delegated ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

Additionally, update CMakeLists.txt to build the XNNPACK backend and link it to the ExecuTorch runner.

cmake_minimum_required(VERSION 3.19)
project(nanogpt_runner)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# Set options for executorch build.
option(EXECUTORCH_ENABLE_LOGGING "" ON)
option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)
option(EXECUTORCH_BUILD_XNNPACK "" ON) # Build with Xnnpack backend

# Include the executorch subdirectory.
add_subdirectory(
  ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch
  ${CMAKE_BINARY_DIR}/executorch
)

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
  nanogpt_runner
  PRIVATE executorch
          extension_module_static # Provides the Module class
          extension_tensor # Provides the TensorPtr class
          optimized_native_cpu_ops_lib # Provides baseline cross-platform
                                       # kernels
          xnnpack_backend # Provides the XNNPACK CPU acceleration backend
)

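Keep the rest of the code the same. For more details, see Exporting to ExecuTorch and Invoking the Runtime.

At this point, the working directory should contain the following files: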

CMakeLists.txt
main.cpp
basic_tokenizer.h
basic_sampler.h
export_nanogpt.py
model.py
vocab.json

If all of these are present, you can now export the Xnnpack-delegated pte model:

python export_nanogpt.py

It generates nanogpt.pte in the same working directory.

Then build and run the model with:

(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

You should see the message:

Enter model prompt:

Type some seed text for the model and press Enter. Here we use "Hello world!" as an example prompt:

Enter model prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in

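The delegated model should be noticeably faster than the non-delegated model.

For more information on backend delegation, see the ExecuTorch guides for the XNNPACK backend, the Core ML backend, and the Qualcomm AI Engine Direct backend.

Quantization¶

Quantization refers to a set of techniques for running calculations and storing tensors using lower-precision types. Compared with 32-bit floating point, using 8-bit integers can provide both a significant speedup and a reduction in memory usage. There are many approaches to quantizing a model, which differ in the amount of preprocessing required, the data types used, and the impact on model accuracy and performance.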
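
As a rough illustration of what storing a tensor at lower precision means, the following Python sketch applies symmetric 8-bit quantization to a short list of floats. It is a generic textbook scheme with made-up weights, not the exact algorithm used by any particular ExecuTorch quantizer.

```python
def quantize_symmetric_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical weights
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 0, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per element.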

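Because compute and memory are highly constrained on mobile devices, some form of quantization is necessary to ship large models on consumer electronics. In particular, large language models such as Llama2 may require quantizing the model weights to 4 bits or fewer.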
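
A back-of-the-envelope calculation shows why bit width matters. The Python sketch below estimates weight storage for the 124M-parameter model used in this guide; it assumes a nominal parameter count and ignores quantization metadata (scales and zero points) as well as activation memory, so the figures are approximate.

```python
def weight_megabytes(num_params, bits_per_weight):
    """Storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 124_000_000  # nominal size of GPT-2 124M

for bits in (32, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_megabytes(NUM_PARAMS, bits):.0f} MB")
```

Under these assumptions the weights shrink from roughly 496 MB at 32 bits to 124 MB at 8 bits and 62 MB at 4 bits.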

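Leveraging quantization requires transforming the model before export. PyTorch provides the pt2e (PyTorch 2 Export) API for this purpose. This example targets CPU acceleration using the XNNPACK delegate, so it needs the XNNPACK-specific quantizer. Targeting a different backend requires the corresponding quantizer.

To use 8-bit integer dynamic quantization with the XNNPACK delegate, call prepare_pt2e, calibrate the model by running it with a representative input, and then call convert_pt2e. This updates the computational graph to use quantized operators where available.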

# export_nanogpt.py

from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    get_symmetric_quantization_config,
    XNNPACKQuantizer,
)
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Use dynamic, per-channel quantization.
xnnpack_quant_config = get_symmetric_quantization_config(
    is_per_channel=True, is_dynamic=True
)
xnnpack_quantizer = XNNPACKQuantizer()
xnnpack_quantizer.set_global(xnnpack_quant_config)

m = export_for_training(model, example_inputs).module()

# Annotate the model for quantization. This prepares the model for calibration.
m = prepare_pt2e(m, xnnpack_quantizer)

# Calibrate the model using representative inputs. This allows the quantization
# logic to determine the expected range of values in each tensor.
m(*example_inputs)

# Perform the actual quantization.
m = convert_pt2e(m, fold_quantize=False)
DuplicateDynamicQuantChainPass()(m)

traced_model = export(m, example_inputs)

Additionally, add or update the to_backend() call to use XnnpackPartitioner. This instructs ExecuTorch to optimize the model for CPU execution via the XNNPACK backend.

from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

edge_manager = to_edge(traced_model, compile_config=edge_config)
edge_manager = edge_manager.to_backend(XnnpackPartitioner()) # Lower to XNNPACK.
et_program = edge_manager.to_executorch()

Finally, make sure the runner links against the xnnpack_backend target in CMakeLists.txt.

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
    nanogpt_runner
    PRIVATE
    executorch
    extension_module_static # Provides the Module class
    extension_tensor # Provides the TensorPtr class
    optimized_native_cpu_ops_lib # Provides baseline cross-platform kernels
    xnnpack_backend) # Provides the XNNPACK CPU acceleration backend

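For more information, see Quantization in ExecuTorch.

Profiling and Debugging¶

After lowering a model by calling to_backend(), you may want to see what was delegated and what was not. ExecuTorch provides utility methods that give insight into the delegation. You can use this information to gain visibility into the underlying computation and to diagnose potential performance issues. Model authors can use it to structure the model in a way that is compatible with the target backend.

Visualizing the Delegation¶

The get_delegation_info() method provides a summary of what happened to the model after the to_backend() call: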

from executorch.devtools.backend_debug import get_delegation_info
from tabulate import tabulate

# ... After call to to_backend(), but before to_executorch()
graph_module = edge_manager.exported_program().graph_module
delegation_info = get_delegation_info(graph_module)
print(delegation_info.get_summary())
df = delegation_info.get_operator_delegation_dataframe()
print(tabulate(df, headers="keys", tablefmt="fancy_grid"))

For nanoGPT targeting the XNNPACK backend, you might see the following (note that the numbers below are for illustration only and actual values may vary):

Total delegated subgraphs: 145
Number of delegated nodes: 350
Number of non-delegated nodes: 760

    op_type                  # in_delegated_graphs   # in_non_delegated_graphs
0   aten__softmax_default    12                      0
1   aten_add_tensor          37                      0
2   aten_addmm_default       48                      0
3   aten_any_dim             0                       12
    …
25  aten_view_copy_default   96                      122
    …
30  Total                    350                     760

From the table, the operator aten_view_copy_default appears 96 times in delegated graphs and 122 times in non-delegated graphs. For a more detailed view, use the format_delegated_graph() method to get a formatted str printout of the whole graph, or use print_delegated_graph() to print it directly:

from executorch.exir.backend.utils import format_delegated_graph
graph_module = edge_manager.exported_program().graph_module
print(format_delegated_graph(graph_module))

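This may produce a large amount of output for large models. Consider using "Control+F" or "Command+F" to locate the operator you are interested in (for example, "aten_view_copy_default"), and observe which instances are not under lowered graphs.

In the fragment of the nanoGPT output below, observe that a transformer module has been delegated to XNNPACK while the where operator has not.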

%aten_where_self_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.where.self](args = (%aten_logical_not_default_33, %scalar_tensor_23, %scalar_tensor_22), kwargs = {})
%lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144]
backend_id: XnnpackBackend
lowered graph():
    %p_transformer_h_0_attn_c_attn_weight : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_weight]
    %p_transformer_h_0_attn_c_attn_bias : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_bias]
    %getitem : [num_users=1] = placeholder[target=getitem]
    %sym_size : [num_users=2] = placeholder[target=sym_size]
    %aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%getitem, [%sym_size, 768]), kwargs = {})
    %aten_permute_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.permute_copy.default](args = (%p_transformer_h_0_attn_c_attn_weight, [1, 0]), kwargs = {})
    %aten_addmm_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.addmm.default](args = (%p_transformer_h_0_attn_c_attn_bias, %aten_view_copy_default, %aten_permute_copy_default), kwargs = {})
    %aten_view_copy_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%aten_addmm_default, [1, %sym_size, 2304]), kwargs = {})
    return [aten_view_copy_default_1]

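Performance Analysis¶

With the ExecuTorch Developer Tools, users can profile model execution and obtain timing information for each operator in the model.

Prerequisites¶

ETRecord Generation (Optional)¶

An ETRecord is an artifact generated at export time that contains model graphs and source-level metadata linking the ExecuTorch program to the original PyTorch model. You can view all profiling events without an ETRecord, but with an ETRecord you can also link each event to the type of operator being executed, the module hierarchy, and stack traces of the original PyTorch source code. For more information, see the ETRecord docs.

In your export script, after calling to_edge() and to_executorch(), call generate_etrecord() with the EdgeProgramManager from to_edge() and the ExecuTorchProgramManager from to_executorch(). Make sure to copy the EdgeProgramManager, because the call to to_backend() mutates the graph in place.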

# export_nanogpt.py

import copy
from executorch.devtools import generate_etrecord

# Make the deep copy immediately after to_edge()
edge_manager_copy = copy.deepcopy(edge_manager)

# ...
# Generate ETRecord right after to_executorch()
etrecord_path = "etrecord.bin"
generate_etrecord(etrecord_path, edge_manager_copy, et_program)

Run the export script, and the ETRecord will be generated as etrecord.bin.
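
The reason for taking a deep copy before lowering can be shown without ExecuTorch. In the Python sketch below, ToyManager is a hypothetical stand-in for an object whose graph is rewritten in place; it only mirrors the in-place behavior this guide attributes to to_backend() and is not the EdgeProgramManager API.

```python
import copy


class ToyManager:
    """Hypothetical object whose node list is rewritten in place."""

    def __init__(self, nodes):
        self.nodes = nodes

    def to_backend(self):
        # Rewrites self.nodes in place; "where" stays undelegated.
        for i, node in enumerate(self.nodes):
            if node != "where":
                self.nodes[i] = "delegate(" + node + ")"
        return self


manager = ToyManager(["addmm", "where", "softmax"])
alias = manager                    # plain assignment shares the same object
snapshot = copy.deepcopy(manager)  # independent copy taken before lowering

manager.to_backend()

print(alias.nodes)     # ['delegate(addmm)', 'where', 'delegate(softmax)']
print(snapshot.nodes)  # ['addmm', 'where', 'softmax']
```

Only the deep copy still holds the pre-lowering graph, which is why the export script copies the manager right after to_edge() and before to_backend().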

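ETDump Generation¶

An ETDump is an artifact generated at runtime that contains a trace of the model execution. For more information, see the ETDump docs.

Include the ETDump header in your code.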

// main.cpp

#include <executorch/devtools/etdump/etdump_flatcc.h>

Create an instance of the ETDumpGen class and pass it to the Module constructor.

std::unique_ptr<ETDumpGen> etdump_gen_ = std::make_unique<ETDumpGen>();
Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors, std::move(etdump_gen_));

After calling generate(), save the ETDump to a file. You can capture multiple model runs in a single trace if desired.

ETDumpGen* etdump_gen = static_cast<ETDumpGen*>(model.event_tracer());

ET_LOG(Info, "ETDump size: %zu blocks", etdump_gen->get_num_blocks());
etdump_result result = etdump_gen->get_etdump_data();
if (result.buf != nullptr && result.size > 0) {
    // On a device with a file system, users can just write it to a file.
    FILE* f = fopen("etdump.etdp", "w+");
    fwrite((uint8_t*)result.buf, 1, result.size, f);
    fclose(f);
    free(result.buf);
}

Additionally, update CMakeLists.txt to build with the Developer Tools and to enable events to be traced and logged into the ETDump.

option(EXECUTORCH_ENABLE_EVENT_TRACER "" ON)
option(EXECUTORCH_BUILD_DEVTOOLS "" ON)

# ...

target_link_libraries(
    # ... omit existing ones
    etdump) # Provides event tracing and logging

target_compile_options(executorch PUBLIC -DET_EVENT_TRACER_ENABLED)
target_compile_options(portable_ops_lib PUBLIC -DET_EVENT_TRACER_ENABLED)

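Build and run the runner, and you will see that a file named "etdump.etdp" is generated. (Note that this time we build in release mode to get around a flatccrt build limitation.)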

(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake -DCMAKE_BUILD_TYPE=Release ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

Analyzing with the Inspector API¶

Once you have collected the debug artifacts, the ETDump and optionally an ETRecord, you can use the Inspector API to view performance information.

from executorch.devtools import Inspector

inspector = Inspector(etdump_path="etdump.etdp")
# If you also generated an ETRecord, then pass that in as well: `inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")`

with open("inspector_out.txt", "w") as file:
    inspector.print_data_tabular(file)

This prints the performance data in tabular format to "inspector_out.txt", with each row representing one profiling event.

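To learn more about the Inspector and the rich functionality it provides, see the Inspector API Reference.

Custom Kernels¶

With the ExecuTorch custom operator APIs, authors of custom operators and kernels can easily bring their kernels into PyTorch/ExecuTorch.

There are three steps to using custom kernels in ExecuTorch:

1. Write the custom kernel using ExecuTorch types.
2. Compile the custom kernel and link it to both the AOT Python environment and the runtime binary.
3. Apply a source-to-source transformation to swap an operator for the custom operator.

Writing a Custom Kernel¶

Define your custom operator schema for both the functional variant (used in AOT compilation) and the out variant (used in the ExecuTorch runtime). The schema needs to follow the PyTorch ATen convention (see native_functions.yaml).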

custom_linear(Tensor weight, Tensor input, Tensor? bias) -> Tensor

custom_linear.out(Tensor weight, Tensor input, Tensor? bias, *, Tensor(a!) out) -> Tensor(a!)

Write your custom kernel according to the schema defined above. Use the EXECUTORCH_LIBRARY macro to make the kernel available to the ExecuTorch runtime.

// custom_linear.h / custom_linear.cpp
#include <executorch/runtime/kernel/kernel_includes.h>

Tensor& custom_linear_out(const Tensor& weight, const Tensor& input, optional<Tensor> bias, Tensor& out) {
    // calculation
    return out;
}

// Register as myop::custom_linear.out
EXECUTORCH_LIBRARY(myop, "custom_linear.out", custom_linear_out);

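To make this operator available in PyTorch, you can define a wrapper around the ExecuTorch custom kernel. Note that the ExecuTorch implementation uses ExecuTorch tensor types, while the PyTorch wrapper uses ATen tensors.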

// custom_linear_pytorch.cpp

#include "custom_linear.h"
#include <torch/library.h>

at::Tensor custom_linear(const at::Tensor& weight, const at::Tensor& input, std::optional<at::Tensor> bias) {

    // initialize out
    at::Tensor out = at::empty({weight.size(1), input.size(1)});

    // wrap kernel in custom_linear.cpp into ATen kernel
    WRAP_TO_ATEN(custom_linear_out, 3)(weight, input, bias, out);

    return out;
}

// Register the operator with PyTorch.
TORCH_LIBRARY(myop, m) {
    m.def("custom_linear(Tensor weight, Tensor input, Tensor? bias) -> Tensor", custom_linear);
    m.def("custom_linear.out(Tensor weight, Tensor input, Tensor? bias, *, Tensor(a!) out) -> Tensor(a!)", WRAP_TO_ATEN(custom_linear_out, 3));
}

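Compiling and Linking the Custom Kernel¶

To make it available to the ExecuTorch runtime, compile custom_linear.h/cpp into the binary target. You can also build the kernel as a dynamically loaded library (.so or .dylib) and link against it.

To make it available to PyTorch, package custom_linear.h, custom_linear.cpp, and custom_linear_pytorch.cpp into a dynamically loaded library (.so or .dylib) and load it into the Python environment. This is needed so that PyTorch is aware of the custom operator at export time.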

import torch
torch.ops.load_library("libcustom_linear.so")

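Once loaded, you can use the custom operator in PyTorch code.

For more information, see PyTorch Custom Operators and ExecuTorch Kernel Registration.

Using a Custom Operator in a Model¶

The custom operator can be used explicitly in the PyTorch model, or you can write a transformation that replaces instances of a core operator with the custom variant. For this example, you could find all instances of torch.nn.Linear and replace them with CustomLinear.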

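from torch import nn

# CustomLinear is assumed to be defined elsewhere as an nn.Module that wraps
# the custom_linear operator; it is not shown in this guide.
def replace_linear_with_custom_linear(module):
    # Recursively swap every nn.Linear child for a CustomLinear.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(
                module,
                name,
                CustomLinear(child.in_features, child.out_features, child.bias),
            )
        else:
            replace_linear_with_custom_linear(child)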

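The remaining steps are the same as in the normal flow. You can now run this module in eager mode as well as export it to ExecuTorch.

How to Build Mobile Apps¶

See the instructions for building and running LLMs with ExecuTorch on iOS and Android.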
