Torch-TensorRT (FX Frontend) User Guide¶

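Torch-TensorRT (FX Frontend) is a tool that converts a PyTorch model, through torch.fx, into a TensorRT engine optimized for running on NVIDIA GPUs. TensorRT is the inference engine developed by NVIDIA; it combines several kinds of optimization, including kernel fusion, graph optimization, and low-precision execution. The tool is developed in Python, which makes the workflow very accessible to researchers and engineers. Using it involves a few stages, which this guide introduces in turn.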

> Torch-TensorRT (FX Frontend) is in Beta, and using it with PyTorch nightly is currently recommended.

# Test an example by
$ python py/torch_tensorrt/fx/example/lower_example.py

Converting a PyTorch Model to a TensorRT Engine¶

In general, users can call compile() to convert a model to a TensorRT engine. It is a wrapper API that bundles the major steps needed for the conversion. See lower_example.py under examples/fx for example usage.

def compile(
    module: nn.Module,
    input,
    max_batch_size=2048,
    max_workspace_size=33554432,
    explicit_batch_dimension=False,
    lower_precision=LowerPrecision.FP16,
    verbose_log=False,
    timing_cache_prefix="",
    save_timing_cache=False,
    cuda_graph_batch_size=-1,
    dynamic_batch=True,
) -> nn.Module:

    """
    Takes in original module, input and lowering setting, run lowering workflow to turn module
    into lowered module, or so called TRTModule.

    Args:
        module: Original module for lowering.
        input: Input for module.
        max_batch_size: Maximum batch size (must be >= 1 to be set, 0 means not set)
        max_workspace_size: Maximum size of workspace given to TensorRT.
        explicit_batch_dimension: Use explicit batch dimension in TensorRT if set True, otherwise use implicit batch dimension.
        lower_precision: lower_precision config given to TRTModule.
        verbose_log: Enable verbose log for TensorRT if set True.
        timing_cache_prefix: Timing cache file name for timing cache used by fx2trt.
        save_timing_cache: Update timing cache with current timing cache data if set to True.
        cuda_graph_batch_size: Cuda graph batch size, default to be -1.
        dynamic_batch: batch dimension (dim=0) is dynamic.
    Returns:
        A torch.nn.Module lowered by TensorRT.
    """

In this section, we go through an example to illustrate the major steps that the fx path uses. Users can refer to fx2trt_example.py in examples/fx.

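Step 1: Trace the model with acc_tracer

Acc_tracer is a tracer inherited from the FX tracer. It comes with an args normalizer that converts all args to kwargs and passes them to the TRT converters.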

import torch_tensorrt.fx.tracer.acc_tracer.acc_tracer as acc_tracer

# Build the model which needs to be a PyTorch nn.Module.
my_pytorch_model = build_model()

# Prepare inputs to the model. Inputs have to be a List of Tensors
inputs = [Tensor, Tensor, ...]

# Trace the model with acc_tracer.
acc_mod = acc_tracer.trace(my_pytorch_model, inputs)

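Common Errors

symbolically traced variables cannot be used as inputs to control flow - This means the model contains dynamic control flow. See the “Dynamic Control Flow” section of the FX guide.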
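To see why this error arises, recall that symbolic tracing runs the model once with placeholder objects instead of real tensors, so a Python if statement that depends on a tensor value has nothing concrete to branch on. The toy sketch below imitates that mechanism in plain Python. ToyProxy and ToyTraceError are hypothetical names made up for this illustration; they are not the torch.fx classes.

```python
class ToyTraceError(Exception):
    """Raised when a symbolic value is used where a concrete bool is needed."""


class ToyProxy:
    """Stands in for a tensor during tracing; records ops instead of computing."""

    def __init__(self, name: str):
        self.name = name

    def __gt__(self, other):
        # Comparisons are recorded symbolically and yield another proxy.
        return ToyProxy(f"gt({self.name}, {other})")

    def __mul__(self, other):
        return ToyProxy(f"mul({self.name}, {other})")

    def __bool__(self):
        # A Python `if` calls bool() on its condition; a proxy has no value yet.
        raise ToyTraceError(
            "symbolically traced variables cannot be used as inputs to control flow"
        )


def static_model(x):
    return x * 2  # no data-dependent branching: traceable


def dynamic_model(x):
    if x > 0:  # branch depends on the traced value: not traceable
        return x * 2
    return x * 3


traced = static_model(ToyProxy("x"))
print(traced.name)  # mul(x, 2)

try:
    dynamic_model(ToyProxy("x"))
except ToyTraceError as err:
    print("trace failed:", err)
```

The usual remedy is to move the data-dependent branch out of the traced region or to rewrite it with tensor operations, as the FX guide describes.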

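Step 2: Build the TensorRT engine

TensorRT handles the batch dimension in one of two modes: explicit batch dimension and implicit batch dimension. Implicit batch mode was used by early versions of TensorRT; it is now deprecated but still supported for backward compatibility. In explicit batch mode, all dimensions are explicit and can be dynamic, that is, their length can change at execution time. Many new features, such as dynamic shapes and loops, are available only in this mode. Users can still choose implicit batch mode by setting explicit_batch_dimension=False in compile(). We do not recommend it, because future TensorRT versions will lack support for it.

Explicit batch is the default mode, and it must be set for dynamic shapes. For most vision tasks, users can enable dynamic_batch in compile() if they want an effect similar to implicit mode, where only the batch dimension changes. It has some requirements:
1. The shapes of inputs, outputs, and activations are fixed except for the batch dimension.
2. Inputs, outputs, and activations have the batch dimension as the leading (major) dimension.
3. No operator in the model modifies the batch dimension (permute, transpose, split, etc.) or computes over the batch dimension (sum, softmax, etc.).

As an example of the last requirement, given a 3D tensor t shaped (batch, sequence, dimension), an operation such as torch.transpose(t, 0, 2) moves the batch dimension and therefore violates it. If any of these three requirements is not satisfied, we need to specify InputTensorSpec as inputs with dynamic ranges.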
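The third requirement can be checked on shapes alone. The sketch below is a plain-Python illustration: it applies a transpose to a shape tuple in which the batch dimension is marked as -1, then reports whether the batch dimension is still leading. transpose_shape and batch_is_leading are hypothetical helpers written for this guide, not part of Torch-TensorRT.

```python
from typing import Tuple

BATCH = -1  # marker for the dynamic batch dimension


def transpose_shape(shape: Tuple[int, ...], dim0: int, dim1: int) -> Tuple[int, ...]:
    """Return the shape that swapping dim0 and dim1 would produce."""
    out = list(shape)
    out[dim0], out[dim1] = out[dim1], out[dim0]
    return tuple(out)


def batch_is_leading(shape: Tuple[int, ...]) -> bool:
    """True if the batch marker sits at dim 0 and nowhere else."""
    return shape[0] == BATCH and BATCH not in shape[1:]


t_shape = (BATCH, 128, 64)  # (batch, sequence, dimension)

safe = transpose_shape(t_shape, 1, 2)    # swaps sequence and dimension
unsafe = transpose_shape(t_shape, 0, 2)  # moves batch to the last position

print(safe, batch_is_leading(safe))      # (-1, 64, 128) True
print(unsafe, batch_is_leading(unsafe))  # (64, 128, -1) False
```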

import torch
# Importing torch_tensorrt.fx also loads its built-in converters.
from torch_tensorrt.fx import InputTensorSpec, TRTInterpreter

# InputTensorSpec is a dataclass we use to store input information.
# There're two ways we can build input_specs.
# Option 1, build it manually.
input_specs = [
  InputTensorSpec(shape=(1, 2, 3), dtype=torch.float32),
  InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]
# Option 2, build it from sample inputs provided by the user.
inputs = [
    torch.rand((1, 2, 3), dtype=torch.float32),
    torch.rand((1, 4, 5), dtype=torch.float32),
]
input_specs = InputTensorSpec.from_tensors(inputs)

# IMPORTANT: If dynamic shape is needed, we need to build it slightly differently.
input_specs = [
    InputTensorSpec(
        shape=(-1, 2, 3),
        dtype=torch.float32,
        # Currently we only support one set of dynamic range. User may set other dimensions but it is not promised to work for any models
        # (min_shape, optimize_target_shape, max_shape)
        # For more information refer to fx/input_tensor_spec.py
        shape_ranges = [
            ((1, 2, 3), (4, 2, 3), (100, 2, 3)),
        ],
    ),
    InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]

# Build a TRT interpreter. Set explicit_batch_dimension to match the input specs:
# True is required for the dynamic shapes above; False selects implicit batch mode.
interpreter = TRTInterpreter(
    acc_mod, input_specs, explicit_batch_dimension=True
)

# The output of TRTInterpreter run() is wrapped as TRTInterpreterResult.
# The TRTInterpreterResult contains required parameter to build TRTModule,
# and other informational output from TRTInterpreter run.
class TRTInterpreterResult(NamedTuple):
    engine: Any
    input_names: Sequence[str]
    output_names: Sequence[str]
    serialized_cache: bytearray

# max_batch_size: set accordingly for the maximum batch size you will use.
# max_workspace_size: set to the maximum size we can afford for temporary buffers.
# lower_precision: the precision model layers run in (TensorRT will choose the best-performing precision).
# sparse_weights: allow the builder to examine weights and use optimized functions when weights have suitable sparsity.
# force_fp32_output: force the output to be fp32.
# strict_type_constraints: usually set to False, unless we want to control the precision of certain layers for numeric reasons.
# algorithm_selector: set up algorithm selection for certain layers.
# timing_cache: enable the timing cache for TensorRT.
# profiling_verbosity: TensorRT logging level.
trt_interpreter_result = interpreter.run(
    max_batch_size=64,
    max_workspace_size=1 << 25,
    sparse_weights=False,
    force_fp32_output=False,
    strict_type_constraints=False,
    algorithm_selector=None,
    timing_cache=None,
    profiling_verbosity=None,
)
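Each entry in shape_ranges is a (min_shape, optimize_target_shape, max_shape) triple. A small check like the one below can catch inconsistent ranges before an engine build is attempted. This is a sketch under assumptions: check_shape_range is a hypothetical helper written for this guide, and the rules it enforces (static dimensions repeat the same size in all three shapes; dynamic dimensions satisfy min <= opt <= max) are a natural reading of the comments above rather than a documented Torch-TensorRT API.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, ...]


def check_shape_range(shape: Shape, shape_range: Sequence[Shape]) -> None:
    """Validate one (min, opt, max) triple against a shape that uses -1 for dynamic dims."""
    if len(shape_range) != 3:
        raise ValueError("expected (min_shape, opt_shape, max_shape)")
    min_s, opt_s, max_s = shape_range
    for s in (min_s, opt_s, max_s):
        if len(s) != len(shape):
            raise ValueError(f"rank mismatch: {s} vs {shape}")
    for i, dim in enumerate(shape):
        lo, opt, hi = min_s[i], opt_s[i], max_s[i]
        if dim == -1:
            # Dynamic dimension: the three values must be ordered.
            if not (lo <= opt <= hi):
                raise ValueError(f"dim {i}: need min <= opt <= max, got {lo}, {opt}, {hi}")
        elif not (lo == opt == hi == dim):
            # Static dimension: all three shapes must repeat the fixed size.
            raise ValueError(f"dim {i}: static size {dim} must match in all shapes")


# The range used in the example above passes the check.
check_shape_range((-1, 2, 3), ((1, 2, 3), (4, 2, 3), (100, 2, 3)))

# An out-of-order range is rejected.
try:
    check_shape_range((-1, 2, 3), ((8, 2, 3), (4, 2, 3), (100, 2, 3)))
except ValueError as err:
    print("rejected:", err)
```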

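Common Errors

RuntimeError: Conversion of function xxx not currently supported! - This means we do not support this xxx operator. For further instructions, see the “How to Add a Missing Op” section below.

Step 3: Run the model

One way is to use TRTModule, which is basically a PyTorch nn.Module.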

from torch_tensorrt.fx import TRTModule
mod = TRTModule(
    trt_interpreter_result.engine,
    trt_interpreter_result.input_names,
    trt_interpreter_result.output_names)
# Just like all other PyTorch modules
outputs = mod(*inputs)
torch.save(mod, "trt.pt")
reload_trt_mod = torch.load("trt.pt")
reload_model_output = reload_trt_mod(*inputs)

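So far, we have walked through the major steps of converting a PyTorch model to a TensorRT engine. Users are welcome to refer to the source code for explanations of some parameters. Two components matter most in the conversion scheme. One is the acc tracer, which converts a PyTorch model into an acc graph. The other is the FX path converter, which converts the acc graph's operations into the corresponding TensorRT operations and builds the TensorRT engine from them.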

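Acc Tracer¶

The acc tracer is a custom FX symbolic tracer. It does a few more things than the vanilla FX symbolic tracer. We mainly rely on it to convert PyTorch ops and builtin ops to acc ops. fx2trt uses acc ops for two main reasons:

1. Many PyTorch ops and builtin ops do similar things, such as torch.add, builtin.add, and torch.Tensor.add. With the acc tracer, these three ops are normalized to a single acc_ops.add, which reduces the number of converters we need to write.
2. Acc ops take only kwargs, which makes writing converters easier because we need no extra logic to look for arguments in both args and kwargs.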
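Both points can be illustrated with a toy normalizer in plain Python. The sketch below maps two different "add" callables to one canonical op and turns positional arguments into kwargs. All names here (toy_add, function_style_add, TOY_MAPPING, normalize_call) are hypothetical; the real acc tracer works on FX graph nodes rather than on direct calls.

```python
import inspect
import operator


def toy_add(*, input, other):
    """Canonical kwargs-only op, playing the role of acc_ops.add."""
    return input + other


def function_style_add(input, other):
    """Stands in for a function-style add such as torch.add."""
    return input + other


# Each source callable maps to (canonical op, parameter names in positional order).
TOY_MAPPING = {
    operator.add: (toy_add, ("input", "other")),
    function_style_add: (toy_add, ("input", "other")),
}


def normalize_call(target, args, kwargs):
    """Rewrite a call to `target` as (canonical_op, kwargs_only)."""
    canonical, param_names = TOY_MAPPING[target]
    new_kwargs = dict(zip(param_names, args))  # positional args become kwargs
    new_kwargs.update(kwargs)
    # Check that the result really fits the kwargs-only signature.
    inspect.signature(canonical).bind(**new_kwargs)
    return canonical, new_kwargs


op1, kw1 = normalize_call(operator.add, (1, 2), {})
op2, kw2 = normalize_call(function_style_add, (1,), {"other": 2})

print(op1 is op2 is toy_add)  # True
print(kw1, kw2)               # {'input': 1, 'other': 2} {'input': 1, 'other': 2}
print(op1(**kw1))             # 3
```

Because every normalized call ends up as the same op with the same kwargs layout, a single converter written against that layout covers all the source variants.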

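FX2TRT¶

After symbolic tracing, we have a graph representation of the PyTorch model. fx2trt builds on fx.Interpreter, which walks the graph node by node and calls the function each node represents. fx2trt overrides that behavior: instead of calling the function, it invokes the corresponding converter for each node. Each converter function adds the corresponding TensorRT layer(s).

Below is an example of a converter function. The decorator registers the converter function for the corresponding node. In this example, we register the converter for fx nodes whose target is acc_ops.sigmoid.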

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)
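The dispatch idea can be sketched without TensorRT. In the toy below, a tiny interpreter walks a list of nodes and, instead of executing each node's target, looks up a registered converter that appends a "layer" description to a network list. ToyNode, TOY_CONVERTERS, and toy_converter are hypothetical stand-ins for the FX node, the converter registry, and the tensorrt_converter decorator; the real classes carry far more state.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

TOY_CONVERTERS: Dict[str, Callable] = {}


def toy_converter(target: str):
    """Register the decorated function as the converter for `target`."""
    def register(fn):
        TOY_CONVERTERS[target] = fn
        return fn
    return register


@dataclass
class ToyNode:
    name: str
    target: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


@toy_converter("sigmoid")
def convert_sigmoid(network: List[str], target, kwargs, name):
    network.append(f"{name}: activation(SIGMOID, input={kwargs['input']})")
    return name  # the "output tensor" handed to downstream nodes


@toy_converter("relu")
def convert_relu(network: List[str], target, kwargs, name):
    network.append(f"{name}: activation(RELU, input={kwargs['input']})")
    return name


def run_interpreter(nodes: List[ToyNode]) -> List[str]:
    """Visit nodes in order, calling the registered converter for each one."""
    network: List[str] = []
    env: Dict[str, str] = {"x": "x"}  # maps node names to converter outputs
    for node in nodes:
        if node.target not in TOY_CONVERTERS:
            raise RuntimeError(
                f"Conversion of function {node.target} not currently supported!"
            )
        # Replace references to earlier nodes with the outputs they produced.
        kwargs = {k: env.get(v, v) for k, v in node.kwargs.items()}
        env[node.name] = TOY_CONVERTERS[node.target](
            network, node.target, kwargs, node.name
        )
    return network


graph = [
    ToyNode("sig_1", "sigmoid", {"input": "x"}),
    ToyNode("relu_1", "relu", {"input": "sig_1"}),
]
layers = run_interpreter(graph)
print(layers)
```

A node whose target has no registered converter triggers the RuntimeError path, which mirrors the "not currently supported" error listed under Step 2.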

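How to Add a Missing Op¶

You can add it wherever you want; just remember to import the file so that all acc ops and mappers are registered before tracing with acc_tracer.

Step 1. Add a new acc op

TODO: Explain more of the logic behind acc ops, such as when to break an op down and when to reuse other ops.

In the acc tracer, a node in the graph is converted to an acc op if a mapping to an acc op is registered for that node.

Two things must be in place for the conversion to an acc op to happen. One is that an acc op function is defined; the other is that a mapping is registered.

Defining an acc op is simple: write a function and register it as an acc op with the register_acc_op decorator (see acc_normalizer.py). For example, the following code adds an acc op named foo(), which adds its two given inputs, with the second scaled by alpha.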

# NOTE: all acc ops should only take kwargs as inputs, therefore we need the "*"
# at the beginning.
@register_acc_op
def foo(*, input, other, alpha):
    return input + alpha * other

There are two ways to register a mapping. One is register_acc_op_mapping(). Let's register a mapping from torch.add to the foo() we just created by adding the register_acc_op_mapping decorator to it.

this_arg_is_optional = True

@register_acc_op_mapping(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

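op_and_target determines which nodes trigger this mapping. op and target are attributes of an FX node. During acc normalization, a node whose op and target match those set in op_and_target triggers the mapping. Since we want to map from torch.add, op is call_function and target is torch.add.

arg_replacement_tuples determines how the kwargs of the new acc op node are constructed from the args and kwargs of the original node. Each tuple in arg_replacement_tuples is one argument mapping rule and contains two or three elements. The first element is the argument's name in the original node; the second is the name it takes in the acc op node. The optional third element is a boolean that says whether this kwarg is optional in the *original node*; it needs to be specified only when it is True. The order of the tuples matters, because a tuple's position determines the argument's position in the original node's args. We use this information to map args from the original node to kwargs in the acc op node.

We do not have to specify arg_replacement_tuples if neither of the following is true:

1. The kwargs of the original node and the acc op node have different names.
2. There are optional arguments.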
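The mapping rules can be sketched in plain Python. The function below takes an original node's args and kwargs plus a list of arg_replacement_tuples and builds the kwargs for the new acc op node, following the rules just described. build_acc_kwargs is a hypothetical helper for illustration; the actual logic lives in the acc tracer's normalization code and may differ in detail.

```python
from typing import Any, Dict, Sequence, Tuple


def build_acc_kwargs(
    args: Sequence[Any],
    kwargs: Dict[str, Any],
    arg_replacement_tuples: Sequence[Tuple],
) -> Dict[str, Any]:
    """Map an original node's args/kwargs to kwargs for the acc op node."""
    new_kwargs: Dict[str, Any] = {}
    for position, rule in enumerate(arg_replacement_tuples):
        orig_name, new_name = rule[0], rule[1]
        is_optional = len(rule) == 3 and rule[2]
        if position < len(args):
            # The tuple's position gives the argument's position in args.
            new_kwargs[new_name] = args[position]
        elif orig_name in kwargs:
            new_kwargs[new_name] = kwargs[orig_name]
        elif not is_optional:
            raise TypeError(f"missing required argument '{orig_name}'")
        # Optional and absent: leave it out so the acc op's default applies.
    return new_kwargs


rules = [
    ("input", "input"),
    ("other", "other"),
    ("alpha", "alpha", True),  # optional in the original node
]

# torch.add(a, b) style call: positional only, alpha omitted.
positional = build_acc_kwargs(("a", "b"), {}, rules)
print(positional)  # {'input': 'a', 'other': 'b'}

# torch.add(a, other=b, alpha=2) style call: mixed positional and keyword.
mixed = build_acc_kwargs(("a",), {"other": "b", "alpha": 2}, rules)
print(mixed)  # {'input': 'a', 'other': 'b', 'alpha': 2}
```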

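The other way to register a mapping is register_custom_acc_mapper_fn(). It is designed to reduce redundant op registration: it lets you use a function to map to one or more existing acc ops through some combination of them. Inside the function you can do essentially anything you want. Let's use an example to show how it works.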

@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

@register_custom_acc_mapper_fn(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
def custom_mapper(node: torch.fx.Node, _: nn.Module) -> torch.fx.Node:
    """
    `node` is original node, which is a call_function node with target
    being torch.add.
    """
    alpha = 1
    if "alpha" in node.kwargs:
        alpha = node.kwargs["alpha"]
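    # A torch.fx.Node is not subscriptable; read the normalized arguments from node.kwargs.
    foo_kwargs = {"input": node.kwargs["input"], "other": node.kwargs["other"], "alpha": alpha}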
    with node.graph.inserting_before(node):
        foo_node = node.graph.call_function(foo, kwargs=foo_kwargs)
        foo_node.meta = node.meta.copy()
        return foo_node

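In the custom mapper function, we construct an acc op node and return it. The node we return here takes over all the children nodes of the original node (see acc_normalizer.py).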
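“Taking over the children” means that every node that consumed the original node's output is rewired to consume the new node's output instead. The toy graph below shows that rewiring in plain Python. ToyGraphNode and replace_uses are hypothetical names for this sketch; in real FX graphs the analogous operation is Node.replace_all_uses_with.

```python
from typing import List, Sequence


class ToyGraphNode:
    def __init__(self, name: str, inputs: Sequence["ToyGraphNode"] = ()):
        self.name = name
        self.inputs = list(inputs)

    def __repr__(self):
        return f"{self.name}({', '.join(n.name for n in self.inputs)})"


def replace_uses(nodes: List[ToyGraphNode], old: ToyGraphNode, new: ToyGraphNode) -> None:
    """Point every consumer of `old` at `new` instead."""
    for node in nodes:
        if node is new:
            continue  # the replacement keeps its own inputs
        node.inputs = [new if inp is old else inp for inp in node.inputs]


x = ToyGraphNode("x")
y = ToyGraphNode("y")
add = ToyGraphNode("torch_add", [x, y])  # original node
relu = ToyGraphNode("relu", [add])       # child of the original node

foo = ToyGraphNode("foo", [x, y])        # node returned by the custom mapper
replace_uses([x, y, add, relu, foo], old=add, new=foo)

print(relu)  # relu(foo)
```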

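The last step is to *add unit tests* for the new acc op or mapper function. The place to add them is test_acc_tracer.py.

Step 2. Add a new converter

All converters developed for acc ops are in acc_op_converter.py. It gives good examples of how converters are added.

Essentially, a converter is the mechanism that maps an acc op to TensorRT layer(s). If we can find all the TensorRT layers we need, we can start writing a converter for the node with the TensorRT APIs.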

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)

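We use the tensorrt_converter decorator to register the converter. The decorator's argument is the target of the fx node we need to convert. In the converter, the fx node's inputs are found in kwargs. In the example, the original node is acc_ops.sigmoid, which has a single argument, “input”, as defined in acc_ops.py. We get the input and check that it is a TensorRT tensor. We then add a sigmoid layer to the TensorRT network and return the layer's output. The output we return is passed on by fx.Interpreter to the child nodes of acc_ops.sigmoid.

What if we cannot find a TensorRT layer that does the same thing as the node?

In this case, we need to do a bit more work. TensorRT provides plugins, which serve as custom layers. *We have not implemented this feature yet. We will update this section once it is enabled.*

The last step is to add unit tests for the new converter. Users can add the corresponding unit tests in this folder.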
