CUDA Stream Sanitizer

Note

This is a prototype feature: it is at an early stage, intended for feedback and testing, and its components are subject to change.

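Overview

This module introduces CUDA Sanitizer (CSAN), a tool for detecting synchronization errors between kernels running on different streams.

It stores information about accesses to tensors to determine whether those accesses are synchronized. When it is enabled in a Python program and a possible data race is detected, a detailed warning is printed and the program exits.

It can be enabled either by importing this module and calling enable_cuda_sanitizer(), or by exporting the TORCH_CUDA_SANITIZER environment variable.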
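As a minimal sketch of the in-code route, the snippet below calls enable_cuda_sanitizer() before any CUDA work is issued. The helper name enable_csan_if_possible and the torch.cuda.is_available() guard are illustrative additions, not part of the module.

```python
import torch
import torch.cuda._sanitizer as csan


def enable_csan_if_possible() -> bool:
    """Enable CSAN when a CUDA device is present (hypothetical helper).

    Returns True if the sanitizer was enabled, False otherwise.
    """
    if not torch.cuda.is_available():
        return False
    csan.enable_cuda_sanitizer()
    return True


# Call this once, before the first tensor is created on the GPU.
enabled = enable_csan_if_possible()
print("CSAN enabled:", enabled)
```

Setting TORCH_CUDA_SANITIZER in the environment is the alternative when the script itself should stay unmodified.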

Usage

Here is an example of a simple synchronization error in PyTorch:

import torch

a = torch.rand(10000, device="cuda")

with torch.cuda.stream(torch.cuda.Stream()):
    torch.mul(a, 5, out=a)

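The tensor a is initialized on the default stream and then modified on a new stream without any synchronization. The two kernels can run concurrently on the same tensor, so the second kernel may read uninitialized data before the first has written it, or the first kernel may overwrite part of the second kernel's result. When this script is run from the command line with: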

TORCH_CUDA_SANITIZER=1 python example_error.py

CSAN prints the following output:

============================
CSAN detected a possible data race on tensor with data pointer 139719969079296
Access by stream 94646435460352 during kernel:
aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
writing to argument(s) self, out, and to the output
With stack trace:
  File "example_error.py", line 6, in <module>
    torch.mul(a, 5, out=a)
  ...
  File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
    stack_trace = traceback.StackSummary.extract(

Previous access by stream 0 during kernel:
aten::rand(int[] size, *, int? dtype=None, Device? device=None) -> Tensor
writing to the output
With stack trace:
  File "example_error.py", line 3, in <module>
    a = torch.rand(10000, device="cuda")
  ...
  File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
    stack_trace = traceback.StackSummary.extract(

Tensor was allocated with stack trace:
  File "example_error.py", line 3, in <module>
    a = torch.rand(10000, device="cuda")
  ...
  File "pytorch/torch/cuda/_sanitizer.py", line 420, in _handle_memory_allocation
    traceback.StackSummary.extract(

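This gives extensive insight into the origin of the error:

A tensor was incorrectly accessed from streams with IDs 0 (the default stream) and 94646435460352 (the new stream).
The tensor was allocated by invoking a = torch.rand(10000, device="cuda").
The faulty accesses were caused by the following operators:
- a = torch.rand(10000, device="cuda") on stream 0
- torch.mul(a, 5, out=a) on stream 94646435460352
The error message also displays the schemas of the invoked operators, along with a note showing which arguments of each operator correspond to the affected tensor.
- In this example, tensor a corresponds to the arguments self and out, and to the output value, of the invoked operator torch.mul.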
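If a report needs to be processed by a script (for instance, to summarize races in a test log), the fields described above can be pulled out with regular expressions. This is only a sketch that assumes the exact wording shown in the sample output; since the feature is a prototype, the format may change. RaceReport and parse_report are hypothetical names, not part of torch.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaceReport:
    data_ptr: int         # data pointer of the affected tensor
    current_stream: int   # stream ID of the access that triggered the report
    previous_stream: int  # stream ID of the earlier conflicting access


_PTR = re.compile(r"possible data race on tensor with data pointer (\d+)")
_CUR = re.compile(r"^Access by stream (\d+) during kernel:", re.MULTILINE)
_PREV = re.compile(r"^Previous access by stream (\d+) during kernel:", re.MULTILINE)


def parse_report(text: str) -> Optional[RaceReport]:
    """Return the pointer and stream IDs from a CSAN report, or None if absent."""
    ptr, cur, prev = _PTR.search(text), _CUR.search(text), _PREV.search(text)
    if not (ptr and cur and prev):
        return None
    return RaceReport(int(ptr.group(1)), int(cur.group(1)), int(prev.group(1)))


# Abridged copy of the sample output above (stack traces omitted).
sample = """\
============================
CSAN detected a possible data race on tensor with data pointer 139719969079296
Access by stream 94646435460352 during kernel:
aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
writing to argument(s) self, out, and to the output

Previous access by stream 0 during kernel:
aten::rand(int[] size, *, int? dtype=None, Device? device=None) -> Tensor
writing to the output
"""

report = parse_report(sample)
print(report)
```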

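See also

The list of supported torch operators and their schemas is available in the PyTorch documentation.

The bug can be fixed by forcing the new stream to wait for the default stream: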

with torch.cuda.stream(torch.cuda.Stream()):
    torch.cuda.current_stream().wait_stream(torch.cuda.default_stream())
    torch.mul(a, 5, out=a)

When the script is run again, no errors are reported.
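wait_stream orders the new stream after everything already queued on the default stream. A narrower alternative, sketched here, records a torch.cuda.Event right after the tensor is produced and makes the new stream wait on that event only. The function name scale_on_side_stream is hypothetical, and the sketch assumes the sanitizer treats event record and wait calls as synchronization, which this page does not state explicitly.

```python
import torch


def scale_on_side_stream(a: torch.Tensor, factor: float) -> torch.Tensor:
    """Multiply `a` in place on a new stream, ordered after its producer."""
    producer = torch.cuda.current_stream()

    # Mark the point on the producer stream after which `a` is initialized.
    ready = torch.cuda.Event()
    ready.record(producer)

    side = torch.cuda.Stream()
    side.wait_event(ready)  # the side stream waits for that point only
    with torch.cuda.stream(side):
        torch.mul(a, factor, out=a)

    # Order later work on the producer stream after the in-place update.
    producer.wait_stream(side)
    return a


if torch.cuda.is_available():
    a = torch.rand(10000, device="cuda")
    scale_on_side_stream(a, 5)
    torch.cuda.synchronize()
```

The final wait_stream call matters as much as the first wait: without it, a later read of a on the default stream would race with the multiplication.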

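API Reference

torch.cuda._sanitizer.enable_cuda_sanitizer()

Enable CUDA Sanitizer.

The sanitizer will begin to analyze the low-level CUDA calls invoked by torch functions for synchronization errors. All data races found are printed to the standard error output, along with stack traces of the suspected causes. For best results, the sanitizer should be enabled at the very beginning of the program.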
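Because reports go to standard error, a test harness can run a script in a child process with TORCH_CUDA_SANITIZER set and scan the captured stderr. The sketch below assumes the report header text from the sample output in the Usage section; run_with_csan and RACE_MARKER are hypothetical names.

```python
import os
import subprocess
import sys

# Header line of a CSAN report, as shown in the sample output.
RACE_MARKER = "CSAN detected a possible data race"


def run_with_csan(argv):
    """Run `python <argv...>` with TORCH_CUDA_SANITIZER=1.

    Returns (race_found, stderr_text).
    """
    env = dict(os.environ, TORCH_CUDA_SANITIZER="1")
    proc = subprocess.run(
        [sys.executable, *argv],
        env=env,
        capture_output=True,
        text=True,
    )
    return RACE_MARKER in proc.stderr, proc.stderr
```

For instance, run_with_csan(["example_error.py"]) would be expected to report a race for the unsynchronized script from the Usage section, and none once the wait_stream call is added.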
