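Profiling your PyTorch Module

Created On: Dec 30, 2020 | Last Updated: Jan 19, 2024 | Last Verified: Nov 05, 2024

Author: Suraj Subramanian

PyTorch includes a profiler API that is useful for identifying the time and memory costs of the various PyTorch operations in your code. The profiler can be easily integrated into your code, and the results can be printed as a table or returned in a JSON trace file.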
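As a quick illustration of the two output forms mentioned above, here is a minimal CPU-only sketch. The tensor sizes and the trace file name are arbitrary choices made for this example, not part of the tutorial.

```python
import os
import tempfile

import torch
import torch.autograd.profiler as profiler

# Small CPU tensors keep the example fast; the sizes are arbitrary.
a = torch.rand(256, 256)
b = torch.rand(256, 256)

with profiler.profile() as prof:
    torch.matmul(a, b)

# Form 1: a text table of aggregated per-operator statistics.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(table)

# Form 2: a JSON trace file. "trace.json" is a hypothetical file name.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
print("trace written to", trace_path)
```

The trace file can be opened in a viewer that understands the Chrome trace format.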

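Note

Profiler supports multithreaded models. Profiler runs in the same thread as the operation, but it will also profile child operators that might run in another thread. Concurrently running profilers are scoped to their own thread to prevent mixing of results.

Note

PyTorch 1.8 introduced a new API, torch.profiler, that will replace the older profiler API in future releases. See the torch.profiler documentation page for details.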
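For comparison, a minimal sketch of the same kind of measurement with the newer torch.profiler API might look as follows. It runs on the CPU only; the layer sizes and the "forward_pass" label are hypothetical choices for this example. The rest of this tutorial uses the older torch.autograd.profiler API.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

x = torch.rand(64, 128)
layer = torch.nn.Linear(128, 32)

# Only CPU activity is recorded here; a GPU run would add ProfilerActivity.CUDA.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("forward_pass"):  # hypothetical label
        layer(x)

new_api_table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(new_api_table)
```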

Head over to this recipe for a quicker walkthrough of Profiler API usage.

import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler

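Performance debugging using Profiler

Profiler can be useful for identifying performance bottlenecks in your models. In this example, we build a custom module that performs two sub-tasks:

a linear transformation on the input, and
using the transformation result to get indices on a mask tensor.

We wrap the code for each sub-task in a separately labelled context manager using profiler.record_function("label"). In the profiler output, the aggregate performance metrics of all operations in a sub-task show up under its corresponding label.

Note that using Profiler incurs some overhead, so it is best used only for investigating code. Remember to remove it if you are benchmarking runtimes.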
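Before the full module, here is a small CPU-only sketch of how labelled regions appear in the results. The tensors and shapes are made up for illustration; only the record_function labels mirror the ones used in this tutorial.

```python
import torch
import torch.autograd.profiler as profiler

x = torch.rand(32, 100)
weight = torch.rand(100, 10)

with profiler.profile() as prof:
    with profiler.record_function("LINEAR PASS"):
        out = x @ weight
    with profiler.record_function("MASK INDICES"):
        idx = (out > out.mean()).nonzero()

# Each label becomes its own row, next to the aten:: operators it wraps.
labels = {event.key for event in prof.key_averages()}
print(sorted(label for label in labels if not label.startswith("aten::")))
```

Operators that run inside a labelled block still get their own rows; the label row adds an aggregate for the whole block.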

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)
            hi_idx = torch.from_numpy(hi_idx).cuda()

        return out, hi_idx

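Profile the forward pass

We initialize random input and mask tensors, and the model.

Before we run the profiler, we warm up CUDA to ensure accurate performance benchmarking. We wrap the forward pass of our module in the profiler.profile context manager. The with_stack=True parameter appends the file and line number of each operation to the trace, and profile_memory=True additionally tracks the memory allocated and released by tensors.

Warning

with_stack=True incurs additional overhead and is better suited for investigating code. Remember to remove it if you are benchmarking performance.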

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)
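The listing above assumes a CUDA device. As a rough sketch, the same warm-up and profiling steps can be tried on whichever device is available. The fallback logic and the much smaller nn.Linear model and tensor sizes are assumptions made for this example.

```python
import torch
import torch.autograd.profiler as profiler
from torch import nn

# Assumption: fall back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(50, 10).to(device)
inputs = torch.rand(16, 50, device=device)

# Warm-up run, so one-time initialization is not attributed to the profiled call.
model(inputs)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out = model(inputs)

stack_table = prof.key_averages(group_by_stack_n=5).table(
    sort_by="self_cpu_time_total", row_limit=5
)
print(stack_table)
```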

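Print the profiler results

Finally, we print the profiler results. prof.key_averages() aggregates the results by operator name, and optionally by input shapes and/or stack trace events. Grouping by input shapes is useful for identifying which tensor shapes the model uses.

Here, we use group_by_stack_n=5, which aggregates runtimes by the operation and its traceback (truncated to the most recent 5 events). By default, events are displayed in the order they were registered; the call below passes sort_by='self_cpu_time_total' to sort the table by self CPU time instead (refer to the docs for valid sorting keys).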
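To illustrate grouping by input shape, here is a small CPU-only sketch with two matrix products of different sizes. The sizes are arbitrary, and record_shapes=True is passed so that shapes are available for grouping.

```python
import torch
import torch.autograd.profiler as profiler

small = torch.rand(8, 8)
large = torch.rand(64, 64)

# record_shapes=True makes input shapes available to key_averages().
with profiler.profile(record_shapes=True) as prof:
    torch.mm(small, small)
    torch.mm(large, large)

by_name = prof.key_averages()
by_shape = prof.key_averages(group_by_input_shape=True)

# Grouped by name only, both calls share one aten::mm row;
# grouped by input shape, each distinct shape gets its own row.
mm_by_name = [e for e in by_name if e.key == "aten::mm"]
mm_by_shape = [e for e in by_shape if e.key == "aten::mm"]
print(len(mm_by_name), len(mm_by_shape))

print(by_shape.table(sort_by="cpu_time_total", row_limit=5))
```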

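Note

When running the profiler in a notebook, you might see entries like <ipython-input-18-193a910735e8>(13): forward in the stack trace instead of file names. These correspond to <notebook-cell>(line number): calling-function.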

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-------------  ------------  ------------  ------------  ---------------------------------
         Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-------------  ------------  ------------  ------------  ---------------------------------
 MASK INDICES        87.88%        5.212s    -953.67 Mb  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(10): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::copy_        12.07%     715.848ms           0 b  <ipython-input-...>(12): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

  LINEAR PASS         0.01%     350.151us         -20 b  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(7): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::addmm         0.00%     293.342us           0 b  /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(8): forward
                                                         /mnt/xarfuse/.../torch/nn

   aten::mean         0.00%     235.095us           0 b  <ipython-input-...>(11): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

-------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 5.931s

"""

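Improve memory performance

Note that the most expensive operations, in terms of both memory and time, are at forward(10), which represents the operations within MASK INDICES. Let's tackle the memory consumption first. We can see that the .cpu() call at line 12 accounts for 953.67 Mb. This operation copies mask to the CPU. mask is initialized with the torch.double datatype. Can we reduce the memory footprint by casting it to torch.float?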
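The reported figure can be checked with simple arithmetic. The sketch below assumes the profiler's "Mb" unit corresponds to 2**20 bytes, which is consistent with the number in the table; tensor_mib is a hypothetical helper written for this example.

```python
import torch


def tensor_mib(shape, dtype):
    """Return the size in MiB of a dense tensor with this shape and dtype."""
    numel = 1
    for dim in shape:
        numel *= dim
    bytes_per_element = torch.empty((), dtype=dtype).element_size()
    return numel * bytes_per_element / 2**20


shape = (500, 500, 500)
double_mib = tensor_mib(shape, torch.double)  # 8 bytes per element
float_mib = tensor_mib(shape, torch.float)    # 4 bytes per element

print(f"torch.double: {double_mib:.2f} MiB")  # prints 953.67
print(f"torch.float:  {float_mib:.2f} MiB")   # prints 476.84
```

Halving the element size from 8 to 4 bytes should therefore halve the size of the CPU copy.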

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-----------------  ------------  ------------  ------------  --------------------------------
             Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-----------------  ------------  ------------  ------------  --------------------------------
     MASK INDICES        93.61%        5.006s    -476.84 Mb  /mnt/xarfuse/.../torch/au
                                                             <ipython-input-...>(10): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/

      aten::copy_         6.34%     338.759ms           0 b  <ipython-input-...>(12): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

 aten::as_strided         0.01%     281.808us           0 b  <ipython-input-...>(11): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

      aten::addmm         0.01%     275.721us           0 b  /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(8): forward
                                                             /mnt/xarfuse/.../torch/nn

      aten::_local        0.01%     268.650us           0 b  <ipython-input-...>(11): forward
      _scalar_dense                                          /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

-----------------  ------------  ------------  ------------  --------------------------------
Self CPU time total: 5.347s

"""

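The CPU memory footprint for this operation has halved.

Improve time performance

While the time consumed has also decreased a bit, it is still too high. It turns out that copying a matrix from CUDA to the CPU is pretty expensive! The aten::copy_ operator at forward(12) copies mask to the CPU so that it can be passed to the NumPy argwhere function. The aten::copy_ at forward(13) copies the array back to CUDA as a tensor. We could eliminate both of these copies by using the torch function nonzero() here instead.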
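To see that nonzero() can stand in for argwhere, here is a small CPU-only comparison. The tensor shape, seed and threshold are arbitrary values chosen for this sketch.

```python
import numpy as np
import torch

torch.manual_seed(0)
mask = torch.rand(4, 5, 6)
threshold = 0.5

# Original approach: round-trip through NumPy (on a GPU this implies device copies).
idx_numpy = np.argwhere(mask.numpy() > threshold)

# Replacement: stay in torch. The default form returns an (N, ndim) index tensor.
idx_torch = (mask > threshold).nonzero()

# as_tuple=True returns one 1-D index tensor per dimension instead.
idx_tuple = (mask > threshold).nonzero(as_tuple=True)

print(idx_numpy.shape, tuple(idx_torch.shape), len(idx_tuple))
```

Note that with as_tuple=True the result is a tuple of index tensors rather than a single (N, ndim) tensor, so code that consumes hi_idx may need adjusting. The tuple form can be used directly for indexing, for example mask[idx_tuple].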

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean()
            hi_idx = (mask > threshold).nonzero(as_tuple=True)

        return out, hi_idx


model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-------------------  ------------  ------------  ------------  ---------------------------------
               Name    Self CPU %      Self CPU  Self CPU Mem  Source Location
-------------------  ------------  ------------  ------------  ---------------------------------
           aten::gt        57.17%     129.089ms           0 b  <ipython-input-...>(12): forward
                                                               /mnt/xarfuse/.../torch/nn
                                                               <ipython-input-...>(25): <module>
                                                               /mnt/xarfuse/.../IPython/
                                                               /mnt/xarfuse/.../IPython/

      aten::nonzero        37.38%      84.402ms           0 b  <ipython-input-...>(12): forward
                                                               /mnt/xarfuse/.../torch/nn
                                                               <ipython-input-...>(25): <module>
                                                               /mnt/xarfuse/.../IPython/
                                                               /mnt/xarfuse/.../IPython/

       MASK INDICES         3.32%       7.491ms    -119.21 Mb  /mnt/xarfuse/.../torch/au
                                                               <ipython-input-...>(10): forward
                                                               /mnt/xarfuse/.../torch/nn
                                                               <ipython-input-...>(25): <module>
                                                               /mnt/xarfuse/.../IPython/

   aten::as_strided         0.20%     441.587us           0 b  <ipython-input-...>(12): forward
                                                               /mnt/xarfuse/.../torch/nn
                                                               <ipython-input-...>(25): <module>
                                                               /mnt/xarfuse/.../IPython/
                                                               /mnt/xarfuse/.../IPython/

aten::nonzero_numpy         0.18%     395.602us           0 b  <ipython-input-...>(12): forward
                                                               /mnt/xarfuse/.../torch/nn
                                                               <ipython-input-...>(25): <module>
                                                               /mnt/xarfuse/.../IPython/
                                                               /mnt/xarfuse/.../IPython/

-------------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 225.801ms

"""

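Further Reading

We have seen how Profiler can be used to investigate time and memory bottlenecks in PyTorch models. To learn more about Profiler, see the PyTorch Profiler documentation.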