Deploying Torch-TensorRT Programs

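Once a Torch-TensorRT program has been compiled and saved, it no longer strictly depends on the full Torch-TensorRT library. All that is needed to run a compiled program is the runtime. As a result, there are several ways to deploy your programs other than shipping the full Torch-TensorRT compiler with your application.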

Torch-TensorRT package / libtorchtrt.so

Once a program is compiled, you run it using the standard PyTorch APIs. The only requirement is that the package is imported in Python or linked in C++.

Runtime Library

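The C++ distribution includes libtorchtrt_runtime.so. This library contains only the components needed to run a Torch-TensorRT program. Instead of linking libtorchtrt.so or importing torch_tensorrt, you can link libtorchtrt_runtime.so into your deployment program, or load it at run time with dlopen or LD_PRELOAD. In Python, you can load the runtime with torch.ops.load_library("libtorchtrt_runtime.so"). You can then continue to use programs through the PyTorch API just as you otherwise would.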
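The Python path described above can be sketched as follows. This is a minimal illustration, not the project's own example: the helper names `load_torchtrt_runtime` and `load_compiled_program` are hypothetical, and the second helper assumes the compiled program was saved as a TorchScript file, so that `torch.jit.load` is the appropriate loader.

```python
import os


def load_torchtrt_runtime(lib_path):
    """Load the Torch-TensorRT runtime library into the current process.

    Hypothetical helper; lib_path is wherever libtorchtrt_runtime.so lives
    on the deployment machine.
    """
    # Fail early with a clear message if the path is wrong.
    if not os.path.isfile(lib_path):
        raise FileNotFoundError(f"runtime library not found: {lib_path}")
    # Imported lazily so the path check above works without PyTorch installed.
    import torch

    torch.ops.load_library(lib_path)


def load_compiled_program(lib_path, program_path):
    """Load the runtime, then load a saved program.

    Assumption: the program was saved as TorchScript, so torch.jit.load applies.
    """
    load_torchtrt_runtime(lib_path)
    import torch

    return torch.jit.load(program_path)
```

The runtime has to be loaded before the saved program is deserialized, which is why the second helper calls the first one before `torch.jit.load`.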

Note

If you are using the standard distribution of PyTorch in Python on x86, you will likely need the pre-cxx11-abi variant of libtorchtrt_runtime.so. See the installation documentation for details.

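Note

If you are linking libtorchtrt_runtime.so, the following flags will likely help: -Wl,--no-as-needed -ltorchtrt_runtime -Wl,--as-needed. Most Torch-TensorRT runtime applications have no direct symbol dependency on anything in the Torch-TensorRT runtime, so without --no-as-needed the linker may drop the library from the final binary.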

For an example of how to use libtorchtrt_runtime.so, see: https://github.com/pytorch/TensorRT/tree/master/examples/torchtrt_runtime_example

Plugin Library

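If you use Torch-TensorRT as a converter to a TensorRT engine and your engine uses plugins provided by Torch-TensorRT, Torch-TensorRT ships the library libtorchtrt_plugins.so, which contains the implementations of the TensorRT plugins that Torch-TensorRT uses during compilation. Like other TensorRT plugin libraries, it can be loaded with dlopen or LD_PRELOAD.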
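From Python, a dlopen-style load can be sketched with the standard `ctypes` module, since `ctypes.CDLL` calls `dlopen` under the hood on Linux. The helper name `load_plugin_library` is hypothetical, and the sketch assumes that the library registers its plugins when it is loaded, as TensorRT plugin libraries commonly do.

```python
import ctypes
import os


def load_plugin_library(lib_path):
    """dlopen a TensorRT plugin library such as libtorchtrt_plugins.so.

    Hypothetical helper. Assumption: loading the library is enough for its
    plugins to register themselves with TensorRT.
    """
    if not os.path.isfile(lib_path):
        raise FileNotFoundError(f"plugin library not found: {lib_path}")
    # RTLD_GLOBAL makes the library's symbols visible to libraries that are
    # loaded afterwards in the same process.
    return ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
```

Keep the returned handle alive for as long as the engine is in use, and load the plugin library before deserializing any engine that depends on it.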

Multi Device Safe Mode

Multi-device safe mode is a Torch-TensorRT setting that lets the user decide whether the runtime checks for device consistency before every inference call.

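When enabled, multi-device safe mode adds a non-negligible, fixed cost to every inference call, which is why it is now disabled by default. It can be controlled with the following convenience function, which also works as a context manager.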

# Enables Multi Device Safe Mode
torch_tensorrt.runtime.set_multi_device_safe_mode(True)

# Disables Multi Device Safe Mode [Default Behavior]
torch_tensorrt.runtime.set_multi_device_safe_mode(False)

# Enables Multi Device Safe Mode, then resets the safe mode to its prior setting
with torch_tensorrt.runtime.set_multi_device_safe_mode(True):
    ...
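To make the "set it, or use it in a with block" behaviour concrete, here is a toy model of a setter that doubles as a context manager. It is an illustration of the pattern only, not Torch-TensorRT's implementation: `set_safe_mode`, `safe_mode_enabled`, and `ModeGuard` are hypothetical names, and the flag is a plain Python global standing in for the runtime's setting.

```python
_state = {"safe_mode": False}  # stand-in for the runtime's global flag


def safe_mode_enabled():
    """Return the current value of the toy flag."""
    return _state["safe_mode"]


class ModeGuard:
    """Restores the previous flag value when a with block exits."""

    def __init__(self, previous):
        self._previous = previous

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        _state["safe_mode"] = self._previous
        return False  # never swallow exceptions from the with block


def set_safe_mode(enabled):
    """Set the flag immediately and return a guard that can undo the change."""
    previous = _state["safe_mode"]
    _state["safe_mode"] = bool(enabled)
    return ModeGuard(previous)
```

Called as a plain function, the new value simply stays in effect because the returned guard is discarded. Used in a `with` statement, the guard puts the previous value back on exit, which matches the "resets the safe mode to its prior setting" comment above.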

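TensorRT requires each engine to be associated with the CUDA context of the active thread from which it is invoked. If the device changes in the active thread, which can happen when engines on multiple GPUs are invoked from the same Python process, safe mode causes Torch-TensorRT to display an alert and switch GPUs accordingly. If safe mode is not enabled, the engine device and the CUDA context device can be mismatched, which may crash the program.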

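One technique for managing multiple TRT engines on different GPUs without paying the performance cost of multi-device safe mode is to use Python threads. Each thread is responsible for all of the TRT engines on a single GPU, and the default CUDA device on each thread corresponds to the GPU it is responsible for (this can be set with torch.cuda.set_device(...)). In this way, multiple threads can be used in the same Python script without switching CUDA contexts and incurring the associated overhead.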
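The thread-per-GPU technique above might look like the following sketch. The helper `run_per_device` and its `set_device` parameter are hypothetical; by default the helper uses `torch.cuda.set_device`, and the parameter exists only so the threading logic can be exercised without a GPU. Each job is assumed to be a zero-argument callable that invokes an engine living on that thread's GPU.

```python
import threading


def run_per_device(jobs_by_device, set_device=None):
    """Run each GPU's jobs on a dedicated thread.

    jobs_by_device: dict mapping GPU index -> list of zero-argument callables.
    set_device: callable taking a GPU index (defaults to torch.cuda.set_device).
    Returns a dict mapping GPU index -> list of job results.
    """
    if set_device is None:
        import torch  # imported lazily; only needed for the default

        set_device = torch.cuda.set_device

    results = {}
    errors = {}

    def worker(device_id, jobs):
        try:
            # Bind this thread to its GPU once; every job below reuses it.
            set_device(device_id)
            results[device_id] = [job() for job in jobs]
        except Exception as exc:  # surface worker failures to the caller
            errors[device_id] = exc

    threads = [
        threading.Thread(target=worker, args=(device_id, jobs))
        for device_id, jobs in jobs_by_device.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    if errors:
        raise RuntimeError(f"worker threads failed: {errors}")
    return results
```

Because each thread sets its device exactly once and never touches another GPU, no per-call device check or context switch is needed.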

Cudagraphs Mode

Cudagraphs mode is a Torch-TensorRT setting that lets the user decide whether the runtime uses cudagraphs to accelerate inference in certain cases.

Cudagraphs can accelerate certain models by reducing kernel launch overhead, as documented [here](https://pytorch.dev.org.tw/blog/accelerating-pytorch-with-cuda-graphs/).

# Enables Cudagraphs Mode
torch_tensorrt.runtime.set_cudagraphs_mode(True)

# Disables Cudagraphs Mode [Default Behavior]
torch_tensorrt.runtime.set_cudagraphs_mode(False)

# Enables Cudagraphs Mode, then resets the mode to its prior setting
with torch_tensorrt.runtime.enable_cudagraphs(trt_module):
    ...

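In the current implementation, using a new input shape (for example, in dynamic shape cases) causes the cudagraph to be re-recorded. Cudagraph recording is generally not latency intensive, and planned improvements include caching cudagraphs for multiple input shapes.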
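The difference between re-recording on every shape change and caching per shape can be shown with a toy model. Nothing here is Torch-TensorRT code: `GraphCache` and `record_fn` are hypothetical names, a "graph" is whatever `record_fn` returns, and a capacity of 1 is used only as a stand-in for the current single-graph behaviour.

```python
from collections import OrderedDict


class GraphCache:
    """Toy shape-keyed cache of recorded graphs with least-recently-used eviction."""

    def __init__(self, record_fn, capacity=1):
        self._record_fn = record_fn  # called to record a graph for a shape
        self._capacity = capacity
        self._graphs = OrderedDict()
        self.record_count = 0  # how many times recording was needed

    def get(self, shape):
        key = tuple(shape)
        if key in self._graphs:
            # Cache hit: reuse the recorded graph and mark it as recently used.
            self._graphs.move_to_end(key)
            return self._graphs[key]
        # Cache miss: record a graph for this shape.
        graph = self._record_fn(key)
        self.record_count += 1
        self._graphs[key] = graph
        if len(self._graphs) > self._capacity:
            self._graphs.popitem(last=False)  # evict the least recently used
        return graph
```

With `capacity=1`, alternating between two input shapes records on every call; with a larger capacity, each shape is recorded only once.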