Engine Caching

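As model sizes increase, so does the cost of compilation. With AOT (ahead-of-time) methods such as torch_tensorrt.dynamo.compile, this cost is paid upfront. However, if the weights change, the session ends, or you are using a JIT (just-in-time) method such as torch.compile, graphs are recompiled whenever they are invalidated, and the cost is paid again. Engine caching mitigates this cost by saving built engines to disk and reusing them where possible. This tutorial demonstrates how to use engine caching with TensorRT in PyTorch. Reusing previously built TensorRT engines can significantly speed up subsequent model compilations.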

We will explore two approaches:

Using torch_tensorrt.dynamo.compile

Using torch.compile with the TensorRT backend

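The example uses a pretrained ResNet18 model and shows how compilation time differs with no caching, with caching enabled, and when reusing cached engines.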

import os
from typing import Dict, Optional

import numpy as np
import torch
import torch_tensorrt as torch_trt
import torchvision.models as models
from torch_tensorrt.dynamo._defaults import TIMING_CACHE_PATH
from torch_tensorrt.dynamo._engine_cache import BaseEngineCache

np.random.seed(0)
torch.manual_seed(0)

model = models.resnet18(pretrained=True).eval().to("cuda")
enabled_precisions = {torch.float}
debug = False
min_block_size = 1
use_python_runtime = False


def remove_timing_cache(path=TIMING_CACHE_PATH):
    if os.path.exists(path):
        os.remove(path)

Engine Caching for JIT Compilation

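The primary goal of engine caching is to speed up JIT workflows. torch.compile offers a great deal of flexibility in model construction, making it a good first tool to try when accelerating a workflow. Historically, however, the cost of compilation, and in particular recompilation, has been a barrier to entry for many users. Before engine caching was added, if a subgraph was invalidated for any reason, the graph was rebuilt from scratch. Now, when engines are built with cache_built_engines=True, they are saved to disk and tied to a hash of their corresponding PyTorch subgraph. If a subsequent compilation, whether in the same session or a new one, produces a hash that matches a cached engine and reuse_cached_engines=True is set, the cache retrieves the built engine and refits the weights, which can reduce compilation time by orders of magnitude. Consequently, for a new engine to be inserted into the cache (i.e. cache_built_engines=True), the engine must be refittable (immutable_weights=False). For more details, see "Refitting Torch-TensorRT Programs with New Weights".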

def torch_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # Remove the timing cache and reset dynamo so that only engine caching is measured
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "debug": debug,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
            },
        )
        compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile()
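
To make the lookup-and-refit flow described in this section more concrete, here is a conceptual, pure-Python sketch. It is not Torch-TensorRT's implementation: structure_hash, build_engine, and compile_with_cache are hypothetical names, and the "engine" is a plain dict. It only illustrates the idea that the cache key depends on the graph structure and not on the weights, so a cache hit skips the expensive build and only binds the new weights.

```python
import hashlib
import json

build_count = 0  # counts how many times the (simulated) expensive build runs


def structure_hash(layers):
    """Hash only the graph structure (op names and shapes), never the weights."""
    structure = [(layer["op"], layer["shape"]) for layer in layers]
    return hashlib.sha256(json.dumps(structure).encode()).hexdigest()


def build_engine(layers):
    """Hypothetical stand-in for an expensive engine build."""
    global build_count
    build_count += 1
    return {"ops": [layer["op"] for layer in layers], "weights": None}


def compile_with_cache(layers, cache):
    key = structure_hash(layers)
    engine = cache.get(key)
    if engine is None:
        # Cache miss: pay the build cost once and store the result.
        engine = build_engine(layers)
        cache[key] = engine
    # Hit or miss, bind the current weights to a copy (the "refit" step).
    refit = dict(engine)
    refit["weights"] = [layer["weights"] for layer in layers]
    return refit


cache = {}
model_v1 = [{"op": "conv", "shape": [64, 3, 7, 7], "weights": [0.1, 0.2]}]
model_v2 = [{"op": "conv", "shape": [64, 3, 7, 7], "weights": [0.3, 0.4]}]

engine_v1 = compile_with_cache(model_v1, cache)
engine_v2 = compile_with_cache(model_v2, cache)  # same structure, new weights
print("builds:", build_count)
```

In this sketch the second call finds the entry stored by the first, so only one build happens even though the weights differ.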

Engine Caching for AOT Compilation

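As with JIT workflows, AOT workflows can also benefit from engine caching. When the same architecture or a common subgraph is recompiled, the cache retrieves the previously built engine and refits the weights.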

def dynamo_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    example_inputs = (torch.randn((100, 3, 224, 224)).to("cuda"),)
    # Mark the dim0 of inputs as dynamic
    batch = torch.export.Dim("batch", min=1, max=200)
    exp_program = torch.export.export(
        model, args=example_inputs, dynamic_shapes={"x": {0: batch}}
    )

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100 + i, 3, 224, 224)).to("cuda")]
        remove_timing_cache()  # remove the timing cache so that only engine caching is measured
        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        trt_gm = torch_trt.dynamo.compile(
            exp_program,
            tuple(inputs),
            use_python_runtime=use_python_runtime,
            enabled_precisions=enabled_precisions,
            debug=debug,
            min_block_size=min_block_size,
            immutable_weights=False,
            cache_built_engines=cache_built_engines,
            reuse_cached_engines=reuse_cached_engines,
            engine_cache_size=1 << 30,  # 1GB
        )
        # output = trt_gm(*inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------dynamo_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


dynamo_compile()

Customizing Engine Caching

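By default, the engine cache is stored in the system's temporary directory. Both the cache directory and the size limit can be customized by passing engine_cache_dir and engine_cache_size. Users can also define their own engine cache implementation by extending the BaseEngineCache class, which makes remote or shared caching possible if needed.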
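
As a hedged illustration, the cache location and size limit could be collected into a set of keyword arguments like the one below. The helper make_cache_options and the directory name are hypothetical; engine_cache_dir and engine_cache_size are the option names mentioned above, and the remaining keys are the ones already used in this tutorial.

```python
import os
import tempfile


def make_cache_options(cache_dir, size_bytes):
    """Hypothetical helper: build compile options that point engine caching at a chosen directory."""
    os.makedirs(cache_dir, exist_ok=True)
    return {
        "immutable_weights": False,  # engines must be refittable to be cached
        "cache_built_engines": True,
        "reuse_cached_engines": True,
        "engine_cache_dir": cache_dir,
        "engine_cache_size": size_bytes,
    }


# Hypothetical location; any writable directory should work.
cache_dir = os.path.join(tempfile.gettempdir(), "my_trt_engine_cache")
options = make_cache_options(cache_dir, 1 << 30)  # 1 GiB limit

# Assumed usage, mirroring the torch_trt.dynamo.compile call in this tutorial:
# trt_gm = torch_trt.dynamo.compile(exp_program, tuple(inputs), **options)
```

Pointing the cache at a stable directory rather than the temporary default is one way to keep engines across reboots or share them between jobs on the same machine.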

A custom engine cache should implement the following methods:

save: Save the engine blob to the cache.
load: Load the engine blob from the cache.

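The hash provided by the cache system is a weight-agnostic hash derived from the PyTorch subgraph (post lowering). The blob contains the serialized engine, calling-spec data, and weight map information in pickle format.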
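
From the point of view of a custom cache, the blob is just opaque bytes to store and return unchanged. The sketch below shows how a pickled payload of this kind round-trips. The dictionary keys and values are hypothetical stand-ins for the three kinds of data listed above, not the actual layout used by Torch-TensorRT.

```python
import pickle

# Hypothetical stand-ins for the three kinds of data described above.
payload = {
    "serialized_engine": b"\x00engine-bytes",
    "calling_spec": {"input_names": ["x"], "output_names": ["output0"]},
    "weight_name_map": {"conv1.weight": "layer0_kernel"},
}

blob = pickle.dumps(payload)   # the kind of bytes object a cache's save() receives
restored = pickle.loads(blob)  # what gets unpacked again after load()

print(type(blob).__name__, restored["calling_spec"]["input_names"])
```

Because unpickling can execute arbitrary code, a shared or remote cache should only serve blobs that come from trusted sources.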

Below is an example of a custom engine cache implementation, RAMEngineCache, which keeps engine blobs in memory.

class RAMEngineCache(BaseEngineCache):
    def __init__(
        self,
    ) -> None:
        """
        Constructs a user held engine cache in memory.
        """
        self.engine_cache: Dict[str, bytes] = {}

    def save(
        self,
        hash: str,
        blob: bytes,
    ):
        """
        Insert the engine blob to the cache.

        Args:
            hash (str): The hash key to associate with the engine blob.
            blob (bytes): The engine blob to be saved.

        Returns:
            None
        """
        self.engine_cache[hash] = blob

    def load(self, hash: str) -> Optional[bytes]:
        """
        Load the engine blob from the cache.

        Args:
            hash (str): The hash key of the engine to load.

        Returns:
            Optional[bytes]: The engine blob if found, None otherwise.
        """
        if hash in self.engine_cache:
            return self.engine_cache[hash]
        else:
            return None


def torch_compile_my_cache(iterations=3):
    times = []
    engine_cache = RAMEngineCache()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # Remove the timing cache and reset dynamo so that only engine caching is measured
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "debug": debug,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
                "custom_engine_cache": engine_cache,
            },
        )
        compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile_my_cache----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile_my_cache()
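
The same save/load interface can also back a cache that lives in a directory shared between processes or machines. The following is a minimal sketch under stated assumptions: DirectoryEngineCache is a hypothetical name, and it is written as a plain class so that it runs without Torch-TensorRT installed. In real use it would presumably subclass BaseEngineCache, as RAMEngineCache does, and be passed through the custom_engine_cache option.

```python
import os
import tempfile
from typing import Optional


class DirectoryEngineCache:
    """Stores each engine blob as <hash>.bin in a (possibly shared) directory."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, hash: str) -> str:
        return os.path.join(self.cache_dir, f"{hash}.bin")

    def save(self, hash: str, blob: bytes) -> None:
        # Write to a temporary file first, then rename it into place so that
        # concurrent readers never observe a partially written blob.
        fd, tmp_path = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
        os.replace(tmp_path, self._path(hash))

    def load(self, hash: str) -> Optional[bytes]:
        try:
            with open(self._path(hash), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None


# Small demonstration with placeholder bytes instead of a real engine.
demo_cache = DirectoryEngineCache(tempfile.mkdtemp())
demo_cache.save("abc123", b"engine-bytes")
hit = demo_cache.load("abc123")
miss = demo_cache.load("missing")
```

Writing to a temporary file and renaming it keeps each entry all-or-nothing, which matters once several compilations may populate the same directory at the same time.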
