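torch.profiler

Overview

PyTorch Profiler is a tool that collects performance metrics during training and inference. The profiler's context manager API can be used to find out which model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity, and visualize the execution trace.

Note

The earlier version of the API in the torch.autograd module is considered legacy and will be deprecated.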

API Reference

class torch.profiler._KinetoProfile(*, activities=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False, custom_trace_id_callback=None)

Low-level profiler that wraps the autograd profiler.

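Parameters

activities (iterable) – list of activity groups (CPU, CUDA) to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
record_shapes (bool) – save information about the operators' input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation (see export_memory_timeline for details).
with_stack (bool) – record source information (file and line number) for the ops.
with_flops (bool) – use formulas to estimate the FLOPS of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the callstack of the op. For example, if module A's forward calls module B's forward, which contains an aten::add op, then the module hierarchy of aten::add is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used by profiler libraries such as Kineto. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() are called for the same time window as the PyTorch profiler.
acc_events (bool) – enable the accumulation of FunctionEvents across multiple profiling cycles.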

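Note

This API is experimental and may change in the future.

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler temporarily holds references to the tensors; this may further prevent certain optimizations that depend on the reference count and may introduce extra tensor copies.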

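add_metadata(key, value)

Adds user-defined metadata with a string key and a string value to the trace file.

add_metadata_json(key, value)

Adds user-defined metadata with a string key and a valid JSON value to the trace file.

events()

Returns the list of unaggregated profiler events, to be used in the trace callback or after profiling has finished.

export_chrome_trace(path)

Exports the collected trace in Chrome JSON format. If Kineto is enabled, only the last cycle in the schedule is exported.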
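The metadata and export methods above can be combined in a short CPU-only run. This is a minimal sketch, not a recommended workflow: the key names `run_name` and `config`, the temporary file name, and the expectation that the metadata shows up as top-level keys of the exported JSON are assumptions made for illustration.

```python
import json
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU]) as prof:
    # Metadata is attached here, while the profiler is running.
    # "run_name" and "config" are illustrative keys, not reserved names.
    prof.add_metadata("run_name", "baseline")
    prof.add_metadata_json("config", json.dumps({"batch_size": 8}))
    torch.mm(torch.randn(8, 8), torch.randn(8, 8))

with tempfile.TemporaryDirectory() as tmp_dir:
    trace_path = os.path.join(tmp_dir, "trace.json")
    # Write the collected trace in Chrome JSON format, then read it back.
    prof.export_chrome_trace(trace_path)
    with open(trace_path) as f:
        trace = json.load(f)

event_count = len(trace["traceEvents"])
run_name = trace.get("run_name")
config = trace.get("config")
print(event_count, run_name, config)
```

The exported file can also be opened in a Chrome-trace-compatible viewer instead of being parsed by hand.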

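export_memory_timeline(path, device=None)

Exports memory event information for the given device from the tree collected by the profiler, and exports a timeline plot. export_memory_timeline can produce 3 kinds of files, selected by the suffix of path.

For an HTML-compatible plot, use the suffix .html; the memory timeline plot is embedded in the HTML file as a PNG.
For plot points, use the suffix .json or .json.gz. The points are stored as [times, [sizes by category]], where times are timestamps and sizes are the memory usage for each category; the file is saved as JSON (.json) or gzipped JSON (.json.gz) depending on the suffix.
For raw memory points, use the suffix .raw.json.gz. Each raw memory event consists of (timestamp, action, numbytes, category), where action is one of [PREEXISTING, CREATE, INCREMENT_VERSION, DESTROY] and category is one of the enums in torch.profiler._memory_profiler.Category.

Output: the memory timeline written as gzipped JSON, JSON, or HTML.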
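The suffix rules above can be summarized in plain Python. The sketch below is only a model of the documented behavior, not the library's code: `timeline_format_for` is a hypothetical helper and the file names are made up. It mainly shows that `.raw.json.gz` has to be tested before `.json.gz`, because the shorter suffix also matches the longer one.

```python
def timeline_format_for(path: str) -> str:
    """Return the output kind implied by a path suffix (hypothetical helper)."""
    if path.endswith(".html"):
        return "html plot"
    # Check the longer suffix first: ".raw.json.gz" also ends with ".json.gz".
    if path.endswith(".raw.json.gz"):
        return "raw events (gzipped JSON)"
    if path.endswith(".json.gz"):
        return "plot points (gzipped JSON)"
    if path.endswith(".json"):
        return "plot points (JSON)"
    raise ValueError(f"unrecognized suffix for memory timeline: {path}")


# Illustrative file names only.
formats = {
    name: timeline_format_for(name)
    for name in ["mem.html", "mem.json", "mem.json.gz", "mem.raw.json.gz"]
}
for name, kind in formats.items():
    print(f"{name}: {kind}")
```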

export_stacks(path, metric='self_cpu_time_total')

Saves stack traces to a file.

Parameters

path (str) – location where the stacks file is saved.
metric (str) – metric to use: "self_cpu_time_total" or "self_cuda_time_total".

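key_averages(group_by_input_shape=False, group_by_stack_n=0)

Averages events, grouping them by operator name and (optionally) by input shapes and stack.

Note

To use the shape/stack functionality, make sure to set record_shapes/with_stack when creating the profiler context manager.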
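The effect of group_by_input_shape can be seen by profiling the same operator with two different input shapes. This is a small CPU-only sketch; the tensor sizes are arbitrary, and the assumption is that `torch.mm` on 2D tensors is recorded under the operator name `aten::mm`.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# record_shapes=True is required for shape-based grouping to have data.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.mm(torch.randn(64, 32), torch.randn(32, 16))
    torch.mm(torch.randn(8, 8), torch.randn(8, 8))

# Default grouping: one averaged entry per operator name.
by_name = prof.key_averages()
# Shape grouping: calls with different input shapes get separate entries.
by_shape = prof.key_averages(group_by_input_shape=True)

mm_by_name = [evt for evt in by_name if evt.key == "aten::mm"]
mm_by_shape = [evt for evt in by_shape if evt.key == "aten::mm"]

print(len(mm_by_name), len(mm_by_shape))
for evt in mm_by_shape:
    print(evt.key, evt.input_shapes, evt.count)
```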

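preset_metadata_json(key, value)

Presets user-defined metadata while the profiler is not yet started; the metadata is added to the trace file later. The metadata consists of a string key and a valid JSON value.

toggle_collection_dynamic(enable, activities)

Toggles the collection of activities on or off at any point during collection. Currently supports toggling Torch Ops (CPU) and the CUDA activity supported in Kineto.

Parameters

activities (iterable) – list of activity groups to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA

Examples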

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile_0()
    # turn off collection of all CUDA activity
    p.toggle_collection_dynamic(False, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_1()
    # turn on collection of all CUDA activity
    p.toggle_collection_dynamic(True, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_2()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))

class torch.profiler.profile(*, activities=None, schedule=None, on_trace_ready=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False, use_cuda=None, custom_trace_id_callback=None)

Profiler context manager.

Parameters

activities (iterable) – list of activity groups (CPU, CUDA) to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
schedule (Callable) – callable that takes step (int) as its single parameter and returns the ProfilerAction value that specifies the profiler action to perform at each step.
on_trace_ready (Callable) – callable that is called during profiling each time schedule returns ProfilerAction.RECORD_AND_SAVE.
record_shapes (bool) – save information about the operators' input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation.
with_stack (bool) – record source information (file and line number) for the ops.
with_flops (bool) – use formulas to estimate the FLOPs (floating point operations) of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the callstack of the op. For example, if module A's forward calls module B's forward, which contains an aten::add op, then the module hierarchy of aten::add is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used for Kineto library features. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() are called for the same time window as the PyTorch profiler. See the examples section below for a code sample.
acc_events (bool) – enable the accumulation of FunctionEvents across multiple profiling cycles.
use_cuda (bool) –

Deprecated since version 1.8.1: use activities instead.

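Note

Use schedule() to generate the callable schedule. Non-default schedules are useful when profiling long training jobs and allow the user to obtain multiple traces at different iterations of the training process. The default schedule simply records all events continuously for the duration of the context manager.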
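Since the schedule argument is just a callable from a step number to a ProfilerAction, it can also be written by hand. The sketch below compares a hand-written callable with one produced by schedule(); `manual_schedule` is a hypothetical function written for this comparison, and the wait/warmup/active values are arbitrary.

```python
from torch.profiler import ProfilerAction, schedule


def manual_schedule(step: int) -> ProfilerAction:
    """Hypothetical hand-written schedule: idle for 2 steps, warm up for 1, record for 2."""
    if step < 2:
        return ProfilerAction.NONE
    if step == 2:
        return ProfilerAction.WARMUP
    if step == 3:
        return ProfilerAction.RECORD
    if step == 4:
        # The last active step also triggers on_trace_ready.
        return ProfilerAction.RECORD_AND_SAVE
    return ProfilerAction.NONE


# The same single cycle expressed through the schedule() helper.
generated_schedule = schedule(wait=2, warmup=1, active=2, repeat=1)

manual_actions = [manual_schedule(step) for step in range(7)]
generated_actions = [generated_schedule(step) for step in range(7)]

for step, action in enumerate(generated_actions):
    print(step, action.name)
```

Either callable could be passed as schedule= to the profiler; the helper is usually less error-prone for repeating cycles.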

Note

Use tensorboard_trace_handler() to generate result files for TensorBoard:

on_trace_ready=torch.profiler.tensorboard_trace_handler(dir_name)

After profiling, the result files can be found in the specified directory. Use the command

tensorboard --logdir dir_name

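to see the results in TensorBoard. For more information, see PyTorch Profiler TensorBoard Plugin.

Note

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler temporarily holds references to the tensors; this may further prevent certain optimizations that depend on the reference count and may introduce extra tensor copies.

Examples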

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))
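For a machine without a GPU, a CPU-only variant of the example above can be run as is. This is a minimal sketch: the `torch.nn.Linear` model and the tensor sizes are placeholders chosen for illustration, and the table is sorted by "self_cpu_time_total" because no device activity is collected.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Placeholder workload: a single small linear layer.
model = torch.nn.Linear(16, 4)
inputs = torch.randn(32, 16)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(inputs)

# Aggregate events by operator name and render the top rows as text.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)

op_names = {evt.key for evt in prof.key_averages()}
```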

Using the profiler's schedule, on_trace_ready and step functions:

# Non-default profiler schedule allows user to turn profiler on and off
# on different iterations of the training loop;
# trace_handler is called every time a new trace becomes available
def trace_handler(prof):
    print(prof.key_averages().table(
        sort_by="self_cuda_time_total", row_limit=-1))
    # prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],

    # In this example with wait=1, warmup=1, active=2, repeat=1,
    # profiler will skip the first step/iteration,
    # start warming up on the second, record
    # the third and the fourth iterations,
    # after which the trace will become available
    # and on_trace_ready (when set) is called;
    # the cycle repeats starting with the next step

    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1),
    on_trace_ready=trace_handler
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    # used when outputting for tensorboard
    ) as p:
        for iter in range(N):
            code_iteration_to_profile(iter)
            # send a signal to the profiler that the next iteration has started
            p.step()

The following sample shows how to set up an Execution Trace Observer (execution_trace_observer):

with torch.profiler.profile(
    ...
    execution_trace_observer=(
        ExecutionTraceObserver().register_callback("./execution_trace.json")
    ),
) as p:
    for iter in range(N):
        code_iteration_to_profile(iter)
        p.step()

You can also refer to test_execution_trace_with_kineto() in tests/profiler/test_profiler.py. Note: any object satisfying the _ITraceObserver interface can also be passed.

get_trace_id()

Returns the current trace ID.

set_custom_trace_id_callback(callback)

Sets a callback to be called when a new trace ID is generated.

step()

Signals the profiler that the next profiling step has started.

class torch.profiler.ProfilerAction(value)

Profiler actions that can be taken at the specified intervals.

class torch.profiler.ProfilerActivity

Members:

CPU

XPU

MTIA

CUDA

PrivateUse1

property name
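The activity list is often built from what the current machine offers. The following sketch starts from CPU and adds CUDA only when `torch.cuda.is_available()` reports a device; other backends are left out here for brevity, which is a simplification made for this example.

```python
import torch
from torch.profiler import ProfilerActivity

# CPU activity is always collected in this sketch.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    # Only request CUDA activity when a CUDA device is present.
    activities.append(ProfilerActivity.CUDA)

# The name property gives a readable label for each member.
activity_names = [activity.name for activity in activities]
print(activity_names)
```

The resulting list can be passed as activities= when constructing the profiler.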

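torch.profiler.schedule(*, wait, warmup, active, repeat=0, skip_first=0, skip_first_wait=0)

Returns a callable that can be used as the profiler's schedule argument. The profiler skips the first skip_first steps, then waits for wait steps, then warms up for the next warmup steps, then actively records for the next active steps, and then repeats the cycle starting with the wait steps. The number of cycles can be specified with the repeat parameter; a value of zero means the cycles continue until profiling finishes.

The skip_first_wait parameter controls whether the first wait stage is skipped. This is useful when the user wants to wait longer than skip_first between cycles, but not before the first profile. For example, with skip_first set to 10 and wait set to 20, the first cycle waits 10 + 20 = 30 steps before warmup if skip_first_wait is zero, but only 10 steps if skip_first_wait is non-zero. All subsequent cycles then wait 20 steps between the last active step and warmup.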
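The arithmetic in the skip_first_wait example can be checked with a small plain-Python model of the phases. This is a sketch of the semantics as described in the text, not the library's implementation: `phase` and `warmup_steps` are hypothetical helpers, `repeat` is not modeled (cycles continue indefinitely), and the warmup=1 and active=2 values are arbitrary.

```python
def phase(step, *, wait, warmup, active, skip_first=0, skip_first_wait=0):
    """Name the phase of a step under the described schedule (hypothetical model)."""
    if step < skip_first:
        return "skip"
    offset = step - skip_first
    if skip_first_wait:
        # Jump past the first wait phase only; later cycles keep their wait.
        offset += wait
    position = offset % (wait + warmup + active)
    if position < wait:
        return "wait"
    if position < wait + warmup:
        return "warmup"
    return "active"


def warmup_steps(limit, **schedule_args):
    """List the steps below `limit` at which a warmup phase runs."""
    return [s for s in range(limit) if phase(s, **schedule_args) == "warmup"]


settings = dict(wait=20, warmup=1, active=2, skip_first=10)
default_warmups = warmup_steps(60, **settings)
early_warmups = warmup_steps(60, skip_first_wait=1, **settings)

print(default_warmups)  # first warmup after 10 + 20 = 30 steps
print(early_warmups)    # first warmup after only 10 steps
```

In both cases consecutive warmups are 23 steps apart (20 wait + 1 warmup + 2 active), so only the position of the first cycle changes.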

Return type: Callable

torch.profiler.tensorboard_trace_handler(dir_name, worker_name=None, use_gzip=False)

Outputs trace files to the directory dir_name; that directory can then be passed directly to tensorboard as logdir. worker_name should be unique for each worker in a distributed scenario; by default it is set to '[hostname]_[pid]'.
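The handler can be tried without TensorBoard installed, since it only writes files. This is a minimal CPU-only sketch: the temporary directory and the worker name "worker0" are illustrative, and the exact file naming pattern is whatever the installed version produces.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

with tempfile.TemporaryDirectory() as log_dir:
    # "worker0" is an illustrative name; it should be unique per worker.
    handler = tensorboard_trace_handler(log_dir, worker_name="worker0")

    # With the default schedule, the handler runs once when the context exits.
    with profile(activities=[ProfilerActivity.CPU], on_trace_ready=handler):
        torch.mm(torch.randn(16, 16), torch.randn(16, 16))

    # Collect the names before the temporary directory is removed.
    trace_files = sorted(os.listdir(log_dir))

print(trace_files)
```

In a real run, a persistent directory would be used so that it can later be passed to tensorboard as logdir.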

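Intel Instrumentation and Tracing Technology APIs

torch.profiler.itt.is_available()

Checks whether the ITT feature is available.

torch.profiler.itt.mark(msg)

Describes an instantaneous event that occurred at some point in time.

Parameters

msg (str) – ASCII message to associate with the event.

torch.profiler.itt.range_push(msg)

Pushes a range onto a stack of nested range spans. Returns the zero-based depth of the range that is started.

Parameters

msg (str) – ASCII message to associate with the range.

torch.profiler.itt.range_pop()

Pops a range off a stack of nested range spans. Returns the zero-based depth of the range that is ended.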
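Because the ITT functions are only usable on builds that include ITT support, calls are typically guarded with is_available(). The sketch below wraps range_push and range_pop in a small context manager; `itt_range` is a hypothetical helper written for this example, and the message "data_loading" is arbitrary.

```python
from contextlib import contextmanager

from torch.profiler import itt


@contextmanager
def itt_range(msg: str):
    """Hypothetical helper: open an ITT range only when ITT is available."""
    enabled = itt.is_available()
    if enabled:
        itt.range_push(msg)
    try:
        # Yield whether the region was actually annotated.
        yield enabled
    finally:
        if enabled:
            # Pop in a finally block so the range closes even on errors.
            itt.range_pop()


with itt_range("data_loading") as annotated:
    total = sum(range(1000))

print(annotated, total)
```

On builds without ITT the helper is a no-op, so the same code runs everywhere.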
