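Torch Export with Cudagraphs
This interactive script gives an overview of how to use the Torch-TensorRT Cudagraphs integration in the ir="dynamo" path. The functionality works similarly in the torch.compile path.
Imports and Model Definition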
import torch
import torch_tensorrt
import torchvision.models as models
Compilation with torch_tensorrt.compile Using Default Settings
# We begin by defining and initializing a model
model = models.resnet18(pretrained=True).eval().to("cuda")
# Define sample inputs
inputs = torch.randn((16, 3, 224, 224)).cuda()
# Next, we compile the model using torch_tensorrt.compile
# We use the `ir="dynamo"` flag here, and `ir="torch_compile"` should
# work with cudagraphs as well.
opt = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=torch_tensorrt.Input(
        min_shape=(1, 3, 224, 224),
        opt_shape=(8, 3, 224, 224),
        max_shape=(16, 3, 224, 224),
        dtype=torch.float,
        name="x",
    ),
)
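The `min_shape` and `max_shape` bounds above define the range of input shapes the compiled module is built for. As a rough illustration, the pure-Python check below tests whether a concrete shape falls inside such a range before it is passed to the module. The helper name `shape_in_range` and the constants `MIN_SHAPE` and `MAX_SHAPE` are hypothetical and not part of the Torch-TensorRT API; the sketch only mirrors a per-dimension comparison.

```python
from typing import Sequence


def shape_in_range(
    shape: Sequence[int], min_shape: Sequence[int], max_shape: Sequence[int]
) -> bool:
    """Return True if every dimension of `shape` lies within [min, max]."""
    # Shapes of different rank can never match the declared range.
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


# Hypothetical constants copied from the Input specification above.
MIN_SHAPE = (1, 3, 224, 224)
MAX_SHAPE = (16, 3, 224, 224)

print(shape_in_range((16, 3, 224, 224), MIN_SHAPE, MAX_SHAPE))  # True
print(shape_in_range((32, 3, 224, 224), MIN_SHAPE, MAX_SHAPE))  # False
```

A batch of 32 exceeds the declared maximum of 16, so it would fall outside the range this module was compiled for.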
Inference using the Cudagraphs Integration
# We can enable the cudagraphs API with a context manager
with torch_tensorrt.runtime.enable_cudagraphs(opt) as cudagraphs_module:
    out_trt = cudagraphs_module(inputs)
# Alternatively, we can set the cudagraphs mode for the session
torch_tensorrt.runtime.set_cudagraphs_mode(True)
out_trt = opt(inputs)
# We can also turn off cudagraphs mode and perform inference as normal
torch_tensorrt.runtime.set_cudagraphs_mode(False)
out_trt = opt(inputs)
# If we provide new input shapes, cudagraphs will re-record the graph
inputs_2 = torch.randn((8, 3, 224, 224)).cuda()
inputs_3 = torch.randn((4, 3, 224, 224)).cuda()
with torch_tensorrt.runtime.enable_cudagraphs(opt) as cudagraphs_module:
    out_trt_2 = cudagraphs_module(inputs_2)
    out_trt_3 = cudagraphs_module(inputs_3)
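The re-recording behavior described in the comments above can be pictured with a small pure-Python model. This is only a conceptual sketch, not Torch-TensorRT's implementation: the class `ReRecordingRunner` and its counters are hypothetical, and the "shape" is simply the length of a Python list. It assumes a single recording is kept and is replaced whenever the input shape differs from the one last recorded.

```python
from typing import Callable, List, Optional, Tuple


class ReRecordingRunner:
    """Toy stand-in for a graph-replaying module (not Torch-TensorRT code)."""

    def __init__(self, fn: Callable[[List[float]], List[float]]) -> None:
        self.fn = fn
        self.recorded_shape: Optional[Tuple[int, ...]] = None
        self.record_count = 0
        self.replay_count = 0

    def __call__(self, batch: List[float]) -> List[float]:
        shape = (len(batch),)
        if shape != self.recorded_shape:
            # New input shape: the previous recording no longer applies,
            # so a fresh recording replaces it.
            self.recorded_shape = shape
            self.record_count += 1
        else:
            # Same shape as the last recording: replay it.
            self.replay_count += 1
        return self.fn(batch)


runner = ReRecordingRunner(lambda xs: [x * 2 for x in xs])
runner([1.0] * 16)  # first call: record
runner([1.0] * 16)  # same shape: replay
runner([1.0] * 8)   # new shape: re-record
runner([1.0] * 4)   # new shape: re-record
print(runner.record_count, runner.replay_count)  # 3 1
```

Under this model, alternating between batch sizes on every call would trigger a recording each time, so grouping requests of the same shape keeps replays frequent.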
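Cuda graphs with a module that contains graph breaks
When CUDA Graphs are applied to a TensorRT model that contains graph breaks, each break introduces additional overhead, because graph breaks prevent the whole model from executing as a single, continuous optimized unit. As a result, some of the performance benefits that CUDA Graphs typically provide, such as reduced kernel launch overhead and improved execution efficiency, may be diminished. Using a wrapped runtime module with CUDA Graphs lets you encapsulate sequences of operations into graphs that execute efficiently even in the presence of graph breaks. If the TensorRT module has graph breaks, the CUDA Graph context manager returns a wrapped_module. This module captures the entire execution graph, which reduces kernel launch overhead and enables efficient replay during subsequent inferences. Note that initializing the wrapper module involves a warm-up phase in which the module is executed several times. This warm-up ensures that memory allocations and initializations are not recorded in the CUDA Graph, which helps maintain consistent execution paths and optimize performance.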
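The ordering of the warm-up phase relative to recording can be sketched in plain Python. This is a conceptual model under stated assumptions, not the library's wrapper: the class `WarmupThenRecord`, its `events` log, and the default of three warm-up iterations are all hypothetical. The real wrapper works with CUDA streams and graphs; here only the sequence of events is modeled.

```python
from typing import Callable, List


class WarmupThenRecord:
    """Conceptual sketch of warm-up before recording (not library code)."""

    def __init__(self, fn: Callable[[float], float], warmup_iters: int = 3) -> None:
        self.fn = fn
        self.warmup_iters = warmup_iters  # assumed value, for illustration only
        self.recorded = False
        self.events: List[str] = []

    def __call__(self, x: float) -> float:
        if not self.recorded:
            # Warm-up runs happen first, so one-time setup work (such as
            # allocations) finishes before anything is recorded.
            for _ in range(self.warmup_iters):
                self.fn(x)
                self.events.append("warmup")
            self.events.append("record")
            self.recorded = True
        else:
            self.events.append("replay")
        return self.fn(x)


# Scalar analogue of relu((x + 2) * 0.5).
wrapped = WarmupThenRecord(lambda x: max((x + 2) * 0.5, 0.0))
wrapped(1.0)
wrapped(2.0)
print(wrapped.events)  # ['warmup', 'warmup', 'warmup', 'record', 'replay']
```

The first call pays for the warm-up runs and the recording; later calls with the same input shape only replay.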
class SampleModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu((x + 2) * 0.5)
model = SampleModel().eval().cuda()
input = torch.randn((1, 3, 224, 224)).to("cuda")
# The 'torch_executed_ops' compiler option is used in this example to intentionally introduce graph breaks within the module.
# Note: The Dynamo backend is required for the CUDA Graph context manager to handle modules in an Ahead-Of-Time (AOT) manner.
opt_with_graph_break = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[input],
    min_block_size=1,
    pass_through_build_failures=True,
    torch_executed_ops={"torch.ops.aten.mul.Tensor"},
)
If the module has graph breaks, the whole submodule is recorded and replayed by CUDA graphs.
with torch_tensorrt.runtime.enable_cudagraphs(
    opt_with_graph_break
) as cudagraphs_module:
    cudagraphs_module(input)
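Whether cudagraphs actually help for a given module, especially one with graph breaks, is easiest to judge by measuring. The helper below is a minimal timing sketch; `mean_latency_ms` and its parameters are hypothetical names introduced here, not part of Torch-TensorRT. It accepts any zero-argument callable plus an optional `sync` hook, because GPU work is asynchronous and must be synchronized before reading the clock.

```python
import time
from typing import Callable


def mean_latency_ms(
    run: Callable[[], object],
    sync: Callable[[], object] = lambda: None,
    warmup: int = 3,
    iters: int = 20,
) -> float:
    """Average wall-clock time of `run` in milliseconds over `iters` calls."""
    # Untimed warm-up calls so one-time setup does not skew the result.
    for _ in range(warmup):
        run()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        run()
    # Wait for outstanding asynchronous work before stopping the clock.
    sync()
    return (time.perf_counter() - start) * 1000.0 / iters


# CPU-only demonstration with a trivial workload.
latency = mean_latency_ms(lambda: sum(range(1000)), warmup=2, iters=5)
print(f"{latency:.4f} ms per call")
```

With the modules from this script, one could pass `lambda: cudagraphs_module(input)` as `run` and `torch.cuda.synchronize` as `sync`, then compare the result against the same measurement taken with cudagraphs mode turned off.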