Torch-TensorRT (FX Frontend) User Guide

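Torch-TensorRT (FX Frontend) is a tool that converts a PyTorch model, through torch.fx, into a TensorRT engine optimized for NVIDIA GPUs. TensorRT is the inference engine developed by NVIDIA; it combines several kinds of optimization, including kernel fusion, graph optimization, and low-precision execution. The tool is developed in Python, which makes the workflow accessible to researchers and engineers. Using it involves a few stages, which this guide introduces in turn.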

> Torch-TensorRT (FX Frontend) is in Beta, and it is currently recommended to use it with PyTorch nightly.

# Test an example by
$ python py/torch_tensorrt/fx/example/lower_example.py

Converting a PyTorch Model to a TensorRT Engine

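In general, users are welcome to use compile() to convert a model to a TensorRT engine. It is a wrapper API that bundles the major steps needed for this conversion. For an example of its usage, see the lower_example.py file under examples/fx.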

def compile(
    module: nn.Module,
    input,
    max_batch_size=2048,
    max_workspace_size=33554432,
    explicit_batch_dimension=False,
    lower_precision=LowerPrecision.FP16,
    verbose_log=False,
    timing_cache_prefix="",
    save_timing_cache=False,
    cuda_graph_batch_size=-1,
    dynamic_batch=True,
) -> nn.Module:

    """
    Takes in original module, input and lowering setting, run lowering workflow to turn module
    into lowered module, or so called TRTModule.

    Args:
        module: Original module for lowering.
        input: Input for module.
        max_batch_size: Maximum batch size (must be >= 1 to be set, 0 means not set)
        max_workspace_size: Maximum size of workspace given to TensorRT.
        explicit_batch_dimension: Use explicit batch dimension in TensorRT if set True, otherwise use implicit batch dimension.
        lower_precision: lower_precision config given to TRTModule.
        verbose_log: Enable verbose log for TensorRT if set True.
        timing_cache_prefix: Timing cache file name for timing cache used by fx2trt.
        save_timing_cache: Update timing cache with current timing cache data if set to True.
        cuda_graph_batch_size: Cuda graph batch size, default to be -1.
        dynamic_batch: batch dimension (dim=0) is dynamic.
    Returns:
        A torch.nn.Module lowered by TensorRT.
    """

In this section, we walk through an example to illustrate the major steps of the fx path. Users can refer to the fx2trt_example.py file in examples/fx.

Step 1: Trace the model with acc_tracer

Acc_tracer is a tracer that inherits from the FX tracer. It comes with an argument normalizer that converts all args to kwargs before they are passed to the TRT converters.

import torch_tensorrt.fx.tracer.acc_tracer.acc_tracer as acc_tracer

# Build the model which needs to be a PyTorch nn.Module.
my_pytorch_model = build_model()

# Prepare inputs to the model. Inputs have to be a List of Tensors
inputs = [Tensor, Tensor, ...]

# Trace the model with acc_tracer.
acc_mod = acc_tracer.trace(my_pytorch_model, inputs)

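Common Errors

symbolically traced variables cannot be used as inputs to control flow - This means the model contains dynamic control flow. Please refer to the "Dynamic Control Flow" section of the FX guide.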
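
To make the cause of this error concrete, here is a toy sketch in plain Python. `ToyProxy` and `toy_trace` are hypothetical stand-ins, not torch.fx classes. They only mimic the idea that a tracer feeds placeholder objects through the model, so a Python `if` on a placeholder has no concrete truth value to branch on.

```python
class ToyProxy:
    """Hypothetical stand-in for a symbolically traced value (not torch.fx.Proxy)."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Comparisons are recorded symbolically and yield another placeholder.
        return ToyProxy(f"gt({self.name}, {other})")

    def __mul__(self, other):
        # Arithmetic is recorded the same way.
        return ToyProxy(f"mul({self.name}, {other})")

    def __bool__(self):
        # A placeholder has no concrete value, so it cannot drive an `if`.
        raise TypeError(
            "symbolically traced variables cannot be used as inputs to control flow"
        )


def model_with_dynamic_control_flow(x):
    if x > 0:  # the branch depends on the runtime value of x
        return x * 2
    return x * 3


def model_without_control_flow(x):
    return x * 2


def toy_trace(fn):
    """Run fn on a placeholder; return the recorded expression or the error text."""
    try:
        return fn(ToyProxy("x")).name
    except TypeError as err:
        return f"error: {err}"


traced_ok = toy_trace(model_without_control_flow)
traced_bad = toy_trace(model_with_dynamic_control_flow)
print(traced_ok)
print(traced_bad)
```

The model without branching traces cleanly, while the branching model fails as soon as Python asks the placeholder for a truth value. Typical remedies are to remove the data-dependent branch or to keep that part of the model out of the traced region.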

Step 2: Build the TensorRT engine

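TensorRT handles the batch dimension in one of two modes: explicit batch dimension and implicit batch dimension. Implicit batch mode was used by early versions of TensorRT; it is now deprecated but remains supported for backward compatibility. In explicit batch mode, all dimensions are explicit and can be dynamic, that is, their length can change at execution time. Many new features, such as dynamic shapes and loops, are available only in this mode. Users can still choose implicit batch mode by setting explicit_batch_dimension=False in compile(). We do not recommend it, because it will lack support in future TensorRT versions.

Explicit batch is the default mode, and it must be set for dynamic shapes. For most vision tasks, users who want an effect similar to implicit mode, where only the batch dimension changes, can enable dynamic_batch in compile(). It has some requirements:

1. The shapes of inputs, outputs, and activations are fixed except for the batch dimension.
2. Inputs, outputs, and activations have the batch dimension as the major (leading) dimension.
3. No operator in the model modifies the batch dimension (permute, transpose, split, etc.) or computes over it (sum, softmax, etc.).

As an example of the last requirement, given a 3D tensor t shaped as (batch, sequence, dimension), an operation such as torch.transpose(t, 0, 2) moves the batch dimension and therefore violates it. If any of these three requirements is not satisfied, we need to specify InputTensorSpec as inputs with a dynamic range.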
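
The third requirement can be illustrated without any tensors by tracking dimension labels only. The sketch below is an assumption-laden toy: `transposed_dims` and `batch_stays_leading` are hypothetical helpers, not Torch-TensorRT functions, and they only show where the batch dimension ends up after a transpose.

```python
def transposed_dims(dims, dim0, dim1):
    """Return the dimension labels after swapping two positions."""
    swapped = list(dims)
    swapped[dim0], swapped[dim1] = swapped[dim1], swapped[dim0]
    return tuple(swapped)


def batch_stays_leading(dims):
    """Hypothetical check: the batch label must remain at position 0."""
    return dims[0] == "batch"


t_dims = ("batch", "sequence", "dimension")

# Analogous to torch.transpose(t, 0, 2): the batch dimension moves to the end.
moved = transposed_dims(t_dims, 0, 2)
# Analogous to torch.transpose(t, 1, 2): the batch dimension is untouched.
kept = transposed_dims(t_dims, 1, 2)

print(moved, batch_stays_leading(moved))
print(kept, batch_stays_leading(kept))
```

Only the second transpose leaves the batch dimension in place, so only it is compatible with dynamic_batch under the requirements listed above.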

import torch
import torch_tensorrt.fx.converters  # registers the acc op converters
from torch_tensorrt.fx import InputTensorSpec, TRTInterpreter

# InputTensorSpec is a dataclass we use to store input information.
# There're two ways we can build input_specs.
# Option 1, build it manually.
input_specs = [
  InputTensorSpec(shape=(1, 2, 3), dtype=torch.float32),
  InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]
# Option 2, build it from sample inputs provided by the user.
inputs = [
    torch.rand((1, 2, 3), dtype=torch.float32),
    torch.rand((1, 4, 5), dtype=torch.float32),
]
input_specs = InputTensorSpec.from_tensors(inputs)

# IMPORTANT: If dynamic shape is needed, we need to build it slightly differently.
input_specs = [
    InputTensorSpec(
        shape=(-1, 2, 3),
        dtype=torch.float32,
        # Currently we only support one set of dynamic range. User may set other dimensions but it is not promised to work for any models
        # (min_shape, optimize_target_shape, max_shape)
        # For more information refer to fx/input_tensor_spec.py
        shape_ranges = [
            ((1, 2, 3), (4, 2, 3), (100, 2, 3)),
        ],
    ),
    InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]

# Build a TRT interpreter. Set explicit_batch_dimension accordingly.
interpreter = TRTInterpreter(
    acc_mod, input_specs, explicit_batch_dimension=True  # use False only for the deprecated implicit batch mode
)

# The output of TRTInterpreter run() is wrapped as TRTInterpreterResult.
# The TRTInterpreterResult contains required parameter to build TRTModule,
# and other informational output from TRTInterpreter run.
class TRTInterpreterResult(NamedTuple):
    engine: Any
    input_names: Sequence[str]
    output_names: Sequence[str]
    serialized_cache: bytearray

# max_batch_size: set according to the maximum batch size you will use.
# max_workspace_size: set to the maximum size we can afford for temporary buffers.
# lower_precision: the precision model layers run in (TensorRT will choose the best-performing precision).
# sparse_weights: allow the builder to examine weights and use optimized functions when weights have suitable sparsity.
# force_fp32_output: force the output to be fp32.
# strict_type_constraints: usually set to False, unless we want to control the precision of certain layers for numeric reasons.
# algorithm_selector: set up algorithm selection for certain layers.
# timing_cache: enable the timing cache for TensorRT.
# profiling_verbosity: TensorRT logging level.
trt_interpreter_result = interpreter.run(
    max_batch_size=64,
    max_workspace_size=1 << 25,
    sparse_weights=False,
    force_fp32_output=False,
    strict_type_constraints=False,
    algorithm_selector=None,
    timing_cache=None,
    profiling_verbosity=None,
)
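
The `(min_shape, optimize_target_shape, max_shape)` convention used by `shape_ranges` above is easy to get wrong, so a quick sanity check can help before building an engine. The helper below is a hypothetical sketch, not part of torch_tensorrt, and the rules it enforces (equal rank, static dimensions repeated in all three shapes, min <= opt <= max for dynamic ones) are assumptions based on the example in this guide.

```python
def check_shape_range(shape, shape_range):
    """Check one (min_shape, opt_shape, max_shape) triple against `shape`.

    `shape` marks dynamic dimensions with -1. Hypothetical helper, not part
    of torch_tensorrt.
    """
    min_shape, opt_shape, max_shape = shape_range
    for candidate in shape_range:
        if len(candidate) != len(shape):
            raise ValueError(f"rank mismatch: {candidate} vs {shape}")
    for index, dim in enumerate(shape):
        low, opt, high = min_shape[index], opt_shape[index], max_shape[index]
        if dim == -1:
            # Dynamic dimension: the optimization target must lie in [min, max].
            if not low <= opt <= high:
                raise ValueError(f"dim {index}: need min <= opt <= max")
        elif not low == opt == high == dim:
            # Static dimension: all three shapes must repeat the fixed size.
            raise ValueError(f"dim {index}: static size {dim} must not vary")
    return True


# Same values as the dynamic InputTensorSpec in the example above.
shape = (-1, 2, 3)
shape_range = ((1, 2, 3), (4, 2, 3), (100, 2, 3))
range_is_valid = check_shape_range(shape, shape_range)
print(range_is_valid)
```

Running such a check on every input spec catches swapped min and opt shapes early, before an engine build is attempted.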

Common Errors

RuntimeError: Conversion of function xxx not currently supported! - This means the xxx operator is not supported yet. Please refer to the "How to Add a Missing Op" section below for further instructions.

Step 3: Run the model

One way is to use TRTModule, which is essentially a PyTorch nn.Module.

from torch_tensorrt.fx import TRTModule
mod = TRTModule(
    trt_interpreter_result.engine,
    trt_interpreter_result.input_names,
    trt_interpreter_result.output_names)
# Just like all other PyTorch modules
outputs = mod(*inputs)
torch.save(mod, "trt.pt")
reload_trt_mod = torch.load("trt.pt")
reload_model_output = reload_trt_mod(*inputs)

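So far, we have explained in detail the major steps for converting a PyTorch model into a TensorRT engine. Users are welcome to consult the source code for explanations of individual parameters. The conversion scheme has two important components. One is the acc tracer, which converts a PyTorch model into an acc graph. The other is the FX path converter, which converts the operations of the acc graph into the corresponding TensorRT operations and builds the TensorRT engine from them.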

Acc Tracer

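Acc tracer is a custom FX symbolic tracer. It does a few more things than the default FX symbolic tracer. We mainly rely on it to convert PyTorch ops or builtin ops to acc ops. fx2trt uses acc ops for two main purposes:

Many PyTorch ops and builtin ops do similar things, such as torch.add, builtin.add and torch.Tensor.add. With acc tracer, we normalize these three ops to a single acc_ops.add. This reduces the number of converters we need to write.
acc ops only have kwargs, which makes writing converters easier because we do not need extra logic to look for arguments in both args and kwargs.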
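
The two purposes above can be sketched with a small lookup table in plain Python. Everything here is hypothetical: `canonical_add` stands in for something like acc_ops.add, `method_add` stands in for a method-style add, and `NORMALIZATION_TABLE` is an illustrative structure, not how acc tracer stores its mappings.

```python
import operator


def canonical_add(*, input, other):
    """Hypothetical canonical op that accepts keyword arguments only."""
    return input + other


def method_add(self, other):
    """Stand-in for a method-style add such as torch.Tensor.add."""
    return self + other


# Every source target maps to the same canonical op and kwarg names.
NORMALIZATION_TABLE = {
    operator.add: (canonical_add, ("input", "other")),
    method_add: (canonical_add, ("input", "other")),
}


def normalize_call(target, args):
    """Rewrite (target, positional args) into (canonical op, kwargs)."""
    canonical, names = NORMALIZATION_TABLE[target]
    return canonical, dict(zip(names, args))


op_a, kwargs_a = normalize_call(operator.add, (1, 2))
op_b, kwargs_b = normalize_call(method_add, (1, 2))
print(op_a is op_b, kwargs_a, op_a(**kwargs_a))
```

Because both calls end up as the same op with the same kwargs, a single converter written against `canonical_add` would cover both source forms.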

FX2TRT

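After symbolic tracing, we have a graph representation of the PyTorch model. fx2trt leverages the power of fx.Interpreter. fx.Interpreter walks the whole graph node by node and calls the function that each node represents. fx2trt overrides this default behavior and instead invokes the corresponding converter for each node. Each converter function adds the corresponding TensorRT layer(s).

Below is an example of a converter function. The decorator registers the converter function with the corresponding node. In this example, we register the converter to an fx node whose target is acc_ops.sigmoid.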

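import tensorrt as trt
from torch_tensorrt.fx.converter_registry import tensorrt_converter
from torch_tensorrt.fx.tracer.acc_tracer import acc_ops

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The remaining arguments are attributes of the fx node.
    """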
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)
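
The dispatch described in this section, where an interpreter visits each node and calls the converter registered for its target, can be sketched in plain Python. This is a toy under stated assumptions: `CONVERTERS`, `toy_converter`, and `toy_interpret` are hypothetical names, the graph is a simple list of tuples, and the "network" is a list of records rather than a TensorRT network.

```python
CONVERTERS = {}  # maps a node target to its converter function


def toy_converter(target):
    """Decorator that registers a converter for `target` (hypothetical registry)."""
    def register(fn):
        CONVERTERS[target] = fn
        return fn
    return register


def sigmoid_op(*, input):
    """Placeholder target; only its identity matters for dispatch."""


def neg_op(*, input):
    """Placeholder target; only its identity matters for dispatch."""


@toy_converter(sigmoid_op)
def convert_sigmoid(network, target, args, kwargs, name):
    # A real converter would add a TensorRT layer; here we append a record.
    network.append(("SIGMOID", name, kwargs["input"]))
    return name  # the "output tensor" handed to downstream nodes


@toy_converter(neg_op)
def convert_neg(network, target, args, kwargs, name):
    network.append(("NEG", name, kwargs["input"]))
    return name


def toy_interpret(nodes):
    """Visit nodes in order and call the registered converter for each."""
    network, env = [], {"x": "x"}  # "x" is the graph input
    for name, target, kwargs in nodes:
        if target not in CONVERTERS:
            raise RuntimeError(
                f"Conversion of function {target.__name__} not currently supported!"
            )
        # Replace references to earlier nodes with their converter outputs.
        resolved = {key: env[value] for key, value in kwargs.items()}
        env[name] = CONVERTERS[target](network, target, (), resolved, name)
    return network


graph = [
    ("neg_1", neg_op, {"input": "x"}),
    ("sigmoid_1", sigmoid_op, {"input": "neg_1"}),
]
layers = toy_interpret(graph)
print(layers)
```

A target without a registered converter raises a RuntimeError in this toy, which mirrors the "not currently supported" error discussed under Step 2.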

How to Add a Missing Op

You can add it wherever you want; just remember to import the file so that all acc ops and mappers are registered before tracing with acc_tracer.

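Step 1. Add a new acc op

TODO: Explain the logic of acc ops in more detail, such as when to break down an op and when to reuse other ops.

In acc tracer, we convert a node in the graph to an acc op if there is a mapping registered from that node to an acc op.

For the conversion to acc ops to happen, two things are required. One is that an acc op function is defined; the other is that a mapping is registered.

Defining an acc op is simple: we only need a function, which we register as an acc op via the register_acc_op decorator (see acc_normalizer.py). For example, the following code adds an acc op named foo(), which adds two given inputs, scaling the second by alpha.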

# NOTE: all acc ops should only take kwargs as inputs, therefore we need the "*"
# at the beginning.
@register_acc_op
def foo(*, input, other, alpha):
    return input + alpha * other

There are two ways to register a mapping. One is register_acc_op_mapping(). Let's register a mapping from torch.add to the foo() we just created. To do so, we add the register_acc_op_mapping decorator to it.

this_arg_is_optional = True

@register_acc_op_mapping(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

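op_and_target determines which node triggers this mapping. op and target are attributes of an FX node. During acc normalization, when we see a node whose op and target match the ones set in op_and_target, we trigger the mapping. Since we want to map from torch.add, op is call_function and target is torch.add.

arg_replacement_tuples determines how we construct the kwargs of the new acc op node from the args and kwargs of the original node. Each tuple in arg_replacement_tuples represents one argument mapping rule and contains two or three elements. The first element is the argument name in the original node; its value is passed to the acc op node under the name given by the second element. The third element is a boolean that indicates whether this kwarg is optional in the original node; it only needs to be specified when it is True. The order of the tuples matters, because the position of a tuple determines the position of the argument in the original node's args. We use this information to map args from the original node to kwargs in the acc op node. We do not have to specify arg_replacement_tuples if none of the following is true:

The kwargs of the original node and the acc op node have different names.
There are optional arguments.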
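
The mapping rules above can be sketched as a small function in plain Python. This is a hypothetical re-implementation for illustration, not the code in acc_normalizer.py: `normalize_to_kwargs` is an invented name, and raising TypeError for a missing required argument is an assumption.

```python
def normalize_to_kwargs(args, kwargs, arg_replacement_tuples):
    """Build acc-op style kwargs from a node's args and kwargs.

    Hypothetical re-implementation for illustration, not the library's code.
    """
    new_kwargs = {}
    for position, entry in enumerate(arg_replacement_tuples):
        orig_name, new_name = entry[0], entry[1]
        is_optional = len(entry) == 3 and entry[2]
        if position < len(args):
            # The tuple's position selects the positional argument.
            new_kwargs[new_name] = args[position]
        elif orig_name in kwargs:
            # Otherwise look the argument up by its original name.
            new_kwargs[new_name] = kwargs[orig_name]
        elif not is_optional:
            raise TypeError(f"missing required argument '{orig_name}'")
    return new_kwargs


add_tuples = [
    ("input", "input"),
    ("other", "other"),
    ("alpha", "alpha", True),  # third element marks alpha as optional
]

# A torch.add(x, y, alpha=2) style call: two positional args, one keyword arg.
with_alpha = normalize_to_kwargs(("x", "y"), {"alpha": 2}, add_tuples)
# A torch.add(x, y) style call: alpha is optional, so it is simply left out.
without_alpha = normalize_to_kwargs(("x", "y"), {}, add_tuples)
print(with_alpha)
print(without_alpha)
```

The sketch shows why tuple order matters: the first two tuples pick up the positional arguments by index, while the optional third one is filled only when the caller supplies it.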

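The other way to register a mapping is through register_custom_acc_mapper_fn(). It is designed to reduce redundant op registration, because it lets you use a function to map to one or more existing acc ops through some combination. Inside the function, you can do basically whatever you want. Let's use an example to explain how it works.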

@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

@register_custom_acc_mapper_fn(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
def custom_mapper(node: torch.fx.Node, _: nn.Module) -> torch.fx.Node:
    """
    `node` is original node, which is a call_function node with target
    being torch.add.
    """
    alpha = 1
    if "alpha" in node.kwargs:
        alpha = node.kwargs["alpha"]
    # Arguments have already been normalized into node.kwargs by arg_replacement_tuples.
    foo_kwargs = {"input": node.kwargs["input"], "other": node.kwargs["other"], "alpha": alpha}
    with node.graph.inserting_before(node):
        foo_node = node.graph.call_function(foo, kwargs=foo_kwargs)
        foo_node.meta = node.meta.copy()
        return foo_node

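In the custom mapper function, we construct an acc op node and return it. The node we return takes over all the children of the original node (see acc_normalizer.py).

The last step is to add unit tests for the new acc op or mapper function. The place to add them is test_acc_tracer.py.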

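Step 2. Add a new converter

All converters developed for acc ops are in acc_op_converter.py, which provides good examples of how a converter is added.

Essentially, a converter is the mapping mechanism that maps an acc op to TensorRT layers. Once we have found all the TensorRT layers we need, we can start adding a converter for the node using the TensorRT APIs.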

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The remaining arguments are attributes of the fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)

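We use the tensorrt_converter decorator to register the converter. The argument to the decorator is the target of the fx node that we need to convert. In the converter, we can find the inputs of the fx node in kwargs. In this example, the original node is acc_ops.sigmoid, which has only one argument, "input", as defined in acc_ops.py. We get the input and check that it is a TensorRT tensor. After that, we add a sigmoid layer to the TensorRT network and return the output of that layer. The output we return is passed to the children of acc_ops.sigmoid by fx.Interpreter.

What if we cannot find a layer in TensorRT that does the same thing as the node?

In that case, we need to do a bit more work. TensorRT provides plugins, which serve as custom layers. We have not implemented this feature yet and will update this guide once it is enabled.

The last step is to add unit tests for the new converter. Users can add the corresponding unit tests next to the existing converter tests.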
