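CUDA semantics¶

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate will by default be created on that device. The selected device can be changed with a torch.cuda.device context manager.

However, once a tensor is allocated, you can do operations on it irrespective of the selected device, and the results will always be placed on the same device as the tensor.

Cross-GPU operations are not allowed by default, with the exception of copy_() and other methods with copy-like functionality such as to() and cuda(). Unless you enable peer-to-peer memory access, any attempt to launch ops on tensors spread across different devices will raise an error.

Below you can find a small example showcasing this: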

cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)

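TensorFloat-32 (TF32) on Ampere (and later) devices¶

Starting in PyTorch 1.7, there is a new flag called allow_tf32. This flag defaults to True in PyTorch 1.7 to PyTorch 1.11, and to False in PyTorch 1.12 and later. This flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores, available on NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies and batched matrix multiplies) and convolutions.

TF32 tensor cores are designed to achieve better performance on matmul and convolutions on torch.float32 tensors by rounding input data to have 10 bits of mantissa, and accumulating results with FP32 precision, maintaining FP32 dynamic range.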
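To make the "10 bits of mantissa" statement concrete, the following pure-Python sketch emulates the input rounding on a single value. It is an illustration only: the helper name round_to_tf32 is hypothetical, and the rounding mode used here (round to nearest, ties away from zero) is an assumption, not a statement about what the hardware does.

```python
import math
import struct


def round_to_tf32(x: float) -> float:
    """Hypothetical helper: keep only 10 explicit mantissa bits of a float32 value."""
    if not math.isfinite(x):
        return x
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # float32 stores 23 mantissa bits; clearing the low 13 leaves 10.
    # Adding half of the dropped range first rounds to nearest
    # (assumed rounding mode, ties away from zero).
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 2**-12 is below half of the 10-bit spacing near 1.0, so it is rounded away.
print(round_to_tf32(1.0 + 2**-12))  # 1.0
# 2**-10 is exactly one step of the 10-bit mantissa, so it survives.
print(round_to_tf32(1.0 + 2**-10) == 1.0 + 2**-10)  # True
```

The exponent field is left untouched, which is why the dynamic range matches FP32 while the per-input relative error grows to roughly 2**-11.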

matmuls and convolutions are controlled separately, and their corresponding flags can be accessed at:

# The flag below controls whether to allow TF32 on matmul. This flag defaults to False
# in PyTorch 1.12 and later.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True

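The precision of matmuls can also be set more broadly (not limited to CUDA) via set_float32_matmul_precision(). Note that besides matmuls and convolutions themselves, functions and nn modules that internally use matmuls or convolutions are also affected. These include nn.Linear, nn.Conv*, cdist, tensordot, affine grid and grid sample, adaptive log softmax, GRU and LSTM.

To get an idea of the precision and speed, see the example code and benchmark data (on A100) below: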

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  # 80.7277

a = a_full.float()
b = b_full.float()

# Do matmul at TF32 mode.
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b  # takes 0.016s on GA100
error = (ab_tf32 - ab_full).abs().max()  # 0.1747
relative_error = error / mean  # 0.0022

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b  # takes 0.11s on GA100
error = (ab_fp32 - ab_full).abs().max()  # 0.0031
relative_error = error / mean  # 0.000039

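From the above example, we can see that with TF32 enabled, the matmul is about 7x faster on A100, and that the relative error compared to double precision is approximately 2 orders of magnitude larger. Note that the exact ratio of TF32 to single precision speed depends on the hardware generation, as properties such as the ratio of memory bandwidth to compute, as well as the ratio of TF32 to FP32 matmul throughput, may vary from generation to generation or model to model. If full FP32 precision is needed, users can disable TF32 by: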

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

To toggle the TF32 flags off in C++, you can do:

at::globalContext().setAllowTF32CuBLAS(false);
at::globalContext().setAllowTF32CuDNN(false);

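For more information about TF32, see NVIDIA's documentation on TensorFloat-32, CUDA 11, and the Ampere architecture.

Reduced Precision Reduction in FP16 GEMMs¶

fp16 GEMMs are potentially done with some intermediate reduced precision reductions (e.g., in fp16 rather than fp32). These selective reductions in precision can allow for higher performance on certain workloads (particularly those with a large k dimension) and GPU architectures, at the cost of numerical precision and potential for overflow.

Some example benchmark data on V100: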

[--------------------------- bench_gemm_transformer --------------------------]
      [  m ,  k  ,  n  ]    |  allow_fp16_reduc=True  |  allow_fp16_reduc=False
1 threads: --------------------------------------------------------------------
      [4096, 4048, 4096]    |           1634.6        |           1639.8
      [4096, 4056, 4096]    |           1670.8        |           1661.9
      [4096, 4080, 4096]    |           1664.2        |           1658.3
      [4096, 4096, 4096]    |           1639.4        |           1651.0
      [4096, 4104, 4096]    |           1677.4        |           1674.9
      [4096, 4128, 4096]    |           1655.7        |           1646.0
      [4096, 4144, 4096]    |           1796.8        |           2519.6
      [4096, 5096, 4096]    |           2094.6        |           3190.0
      [4096, 5104, 4096]    |           2144.0        |           2663.5
      [4096, 5112, 4096]    |           2149.1        |           2766.9
      [4096, 5120, 4096]    |           2142.8        |           2631.0
      [4096, 9728, 4096]    |           3875.1        |           5779.8
      [4096, 16384, 4096]   |           6182.9        |           9656.5
(times in microseconds).

If full precision reductions are needed, users can disable reduced precision reductions in fp16 GEMMs with:

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

To toggle the reduced precision reduction flag in C++, one can do:

at::globalContext().setAllowFP16ReductionCuBLAS(false);

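Reduced Precision Reduction in BF16 GEMMs¶

A similar flag (as above) exists for BFloat16 GEMMs. Note that this switch is set to True by default for BF16. If you observe numerical instability in your workload, you may wish to set it to False.

If reduced precision reductions are not desired, users can disable reduced precision reductions in bf16 GEMMs with: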

torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

To toggle the reduced precision reduction flag in C++, one can do:

at::globalContext().setAllowBF16ReductionCuBLAS(true);

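Asynchronous execution¶

By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.

In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously.

You can force synchronous computation by setting the environment variable CUDA_LAUNCH_BLOCKING=1. This can be handy when an error occurs on the GPU. (With asynchronous execution, such an error isn't reported until after the operation is actually executed, so the stack trace does not show where it was requested.)

A consequence of asynchronous computation is that time measurements without synchronization are not accurate. To get precise measurements, either call torch.cuda.synchronize() before measuring, or use torch.cuda.Event to record times as follows: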

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# Run some things here

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)

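As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Another exception is CUDA streams, explained below.

CUDA streams¶

A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own "default" stream.

Operations inside each stream are serialized in the order they are created, but operations from different streams can execute concurrently in any relative order, unless explicit synchronization functions (such as synchronize() or wait_stream()) are used. For example, the following code is incorrect: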

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)

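When the "current stream" is the default stream, PyTorch automatically performs necessary synchronization when data is moved around, as explained above. However, when using non-default streams, it is the user's responsibility to ensure proper synchronization. The fixed version of this example is: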

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
s.wait_stream(torch.cuda.default_stream(cuda))  # NEW!
with torch.cuda.stream(s):
    B = torch.sum(A)
A.record_stream(s)  # NEW!

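There are two new additions. The torch.cuda.Stream.wait_stream() call ensures that the normal_() execution has finished before we start running sum(A) on the side stream. The torch.Tensor.record_stream() call (see its documentation for more details) ensures that we do not deallocate A before sum(A) has completed. You can also manually wait on the stream at some later point in time with torch.cuda.default_stream(cuda).wait_stream(s). (Note that it is pointless to wait immediately, since that would prevent the stream's work from running in parallel with other work on the default stream.) See the documentation for torch.Tensor.record_stream() for more details on when to use one or the other.

Note that this synchronization is necessary even when there is no read dependency, as in the following example: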

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda)
s.wait_stream(torch.cuda.default_stream(cuda))  # STILL REQUIRED!
with torch.cuda.stream(s):
    A.normal_(0.0, 1.0)
    A.record_stream(s)

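Although the computation on s does not read the contents of A and A has no other uses, synchronization is still necessary, because A may correspond to memory reallocated by the CUDA caching allocator, with pending operations from the old (deallocated) memory.

Stream semantics of backward passes¶

Each backward CUDA op runs on the same stream that was used for its corresponding forward op. If your forward pass runs independent ops in parallel on different streams, this helps the backward pass exploit that same parallelism.

The stream semantics of a backward call with respect to surrounding ops are the same as for any other call. The backward pass inserts internal syncs to ensure this, even when backward ops run on multiple streams as described in the previous paragraph. More concretely, when calling autograd.backward, autograd.grad, or tensor.backward, and optionally supplying CUDA tensor(s) as the initial gradient(s) (e.g., autograd.backward(..., grad_tensors=initial_grads), autograd.grad(..., grad_outputs=initial_grads), or tensor.backward(..., gradient=initial_grad)), the acts of

optionally populating initial gradient(s),
invoking the backward pass, and
using the gradients

have the same stream-semantics relationship as any group of ops: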

s = torch.cuda.Stream()

# Safe, grads are used in the same stream context as backward()
with torch.cuda.stream(s):
    loss.backward()
    use grads

# Unsafe
with torch.cuda.stream(s):
    loss.backward()
use grads

# Safe, with synchronization
with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

# Safe, populating initial grad and invoking backward are in the same stream context
with torch.cuda.stream(s):
    loss.backward(gradient=torch.ones_like(loss))

# Unsafe, populating initial_grad and invoking backward are in different stream contexts,
# without synchronization
initial_grad = torch.ones_like(loss)
with torch.cuda.stream(s):
    loss.backward(gradient=initial_grad)

# Safe, with synchronization
initial_grad = torch.ones_like(loss)
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    initial_grad.record_stream(s)
    loss.backward(gradient=initial_grad)

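BC note: Using grads on the default stream¶

In prior versions of PyTorch (1.9 and earlier), the autograd engine always synced the default stream with all backward ops, so the following pattern: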

with torch.cuda.stream(s):
    loss.backward()
use grads

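was safe as long as use grads happened on the default stream. In present PyTorch, that pattern is no longer safe. If backward() and use grads are in different stream contexts, you must sync the streams: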

with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

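even if use grads is on the default stream.

Memory management¶

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that it can be used by other GPU applications. However, GPU memory occupied by tensors will not be freed, so empty_cache() cannot increase the amount of GPU memory available to PyTorch.

To better understand how CUDA memory is being used over time, Understanding CUDA Memory Usage describes tools for capturing and visualizing traces of memory use.

For more advanced users, we offer more comprehensive memory benchmarking via memory_stats(). We also offer the capability to capture a complete snapshot of the memory allocator state via memory_snapshot(), which can help you understand the underlying allocation patterns produced by your code.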
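As a rough illustration of the monitoring calls named above, the sketch below gathers the four counters into one dictionary. The helper name cuda_memory_report is hypothetical, and returning zeros on machines without CUDA is an assumption made here so the snippet also runs on CPU-only hosts.

```python
import torch


def cuda_memory_report(device=None):
    """Hypothetical helper: collect allocator counters (in bytes) for one device."""
    if not torch.cuda.is_available():
        # Assumption for this sketch: report zeros when no GPU is present.
        return {"allocated": 0, "max_allocated": 0, "reserved": 0, "max_reserved": 0}
    return {
        # Memory currently occupied by tensors.
        "allocated": torch.cuda.memory_allocated(device),
        "max_allocated": torch.cuda.max_memory_allocated(device),
        # Memory held by the caching allocator, including unused cached blocks.
        "reserved": torch.cuda.memory_reserved(device),
        "max_reserved": torch.cuda.max_memory_reserved(device),
    }


report = cuda_memory_report()
print(report)
```

The gap between "reserved" and "allocated" is the cached but unused memory that nvidia-smi still shows as used and that empty_cache() can hand back.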

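Optimizing memory usage with `PYTORCH_CUDA_ALLOC_CONF`¶

Use of a caching allocator can interfere with memory checking tools such as cuda-memcheck. To debug memory errors using cuda-memcheck, set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching.

The behavior of the caching allocator can be controlled via the environment variable PYTORCH_CUDA_ALLOC_CONF. The format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... Available options:

backend allows selecting the underlying allocator implementation. Currently, valid options are native, which uses PyTorch's native implementation, and cudaMallocAsync, which uses CUDA's built-in asynchronous allocator. cudaMallocAsync requires CUDA 11.4 or newer. The default is native. backend applies to all devices used by the process, and can't be specified on a per-device basis.
max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. The performance cost can range from "zero" to "substantial" depending on allocation patterns. The default value is unlimited, i.e. all blocks can be split. The memory_stats() and memory_summary() methods are useful for tuning. This option should be used as a last resort for a workload that is aborting due to "out of memory" and showing a large amount of inactive split blocks. max_split_size_mb is only meaningful with backend:native. With backend:cudaMallocAsync, max_split_size_mb is ignored.
roundup_power2_divisions helps with rounding the requested allocation size to the nearest power-2 division and making better use of the blocks. In the native CUDACachingAllocator, sizes are rounded up to a multiple of the block size of 512, so this works fine for smaller sizes. However, this can be inefficient for large near-by allocations, as each will go to a different size of block and re-use of those blocks is minimized. This might create lots of unused blocks and waste GPU memory capacity. This option enables rounding of the allocation size to the nearest power-2 division. For example, if we need to round up a size of 1200 and the number of divisions is 4, the size 1200 lies between 1024 and 2048, and 4 divisions between them give the values 1024, 1280, 1536, and 1792. So, an allocation size of 1200 will be rounded to 1280 as the nearest ceiling of a power-2 division. Specify a single value to apply to all allocation sizes, or specify an array of key-value pairs to set the power-2 division individually for each power-of-two interval. For example, to set 1 division for all allocations under 256MB, 2 divisions for allocations between 256MB and 512MB, 4 divisions for allocations between 512MB and 1GB, and 8 divisions for any larger allocations, set the knob value to: [256:1,512:2,1024:4,>:8]. roundup_power2_divisions is only meaningful with backend:native. With backend:cudaMallocAsync, roundup_power2_divisions is ignored.
max_non_split_rounding_mb allows non-split blocks to be reused more readily, e.g., a 1024MB cached block can be re-used for a 512MB allocation request. In the default case, we only allow up to 20MB of rounding of non-split blocks, so a 512MB request can only be served by a block between 512 and 532 MB in size. If we set the value of this option to 1024, blocks between 512 and 1536 MB in size can be used for a 512MB request, which increases reuse of larger blocks. This also helps reduce stalls by avoiding expensive cudaMalloc calls.
garbage_collection_threshold helps actively reclaim unused GPU memory to avoid triggering the expensive sync-and-reclaim-all operation (release_cached_blocks), which can be unfavorable to latency-critical GPU applications (e.g., servers). Upon setting this threshold (e.g., 0.8), the allocator will start reclaiming GPU memory blocks if the GPU memory capacity usage exceeds the threshold (i.e., 80% of the total memory allocated to the GPU application). The algorithm prefers to free old and unused blocks first, to avoid freeing blocks that are actively being reused. The threshold value should be greater than 0.0 and less than 1.0. garbage_collection_threshold is only meaningful with backend:native. With backend:cudaMallocAsync, garbage_collection_threshold is ignored.
expandable_segments (experimental, default: False) If set to True, this setting instructs the allocator to create CUDA allocations that can later be expanded, to better handle cases where a job changes allocation sizes frequently, such as having a changing batch size.

Normally, for large (>2MB) allocations, the allocator calls cudaMalloc to get allocations that are the same size as what the user requests. In the future, parts of these allocations can be reused for other requests if they are free. This works well when the program makes many requests of exactly the same size, or of sizes that are even multiples of that size. Many deep learning models follow this behavior. However, one common exception is when the batch size changes slightly from one iteration to the next, e.g. in batched inference. When the program runs initially with batch size N, it will make allocations appropriate for that size. If in the future it runs at size N - 1, the existing allocations will still be big enough. However, if it runs at size N + 1, then it will have to make new allocations that are slightly larger.

Not all the tensors are the same size. Some might be (N + 1)*A and others (N + 1)*A*B, where A and B are some non-batch dimensions in the model. Because the allocator reuses existing allocations when they are big enough, some number of (N + 1)*A allocations will actually fit in the already existing N*B*A segments, though not perfectly. As the model runs, it will partially fill up all of these segments, leaving unusable free slices of memory at the end of these segments. The allocator at some point will need to cudaMalloc a new (N + 1)*A*B segment. If there is not enough memory, there is now no way to recover the slices of memory that are free at the end of existing segments. With models 50+ layers deep, this pattern might repeat 50+ times, creating many slivers.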

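expandable_segments allows the allocator to create a segment initially and then expand its size later when more memory is needed. Instead of making one segment per allocation, it tries to make one segment (per stream) that grows as necessary. Now when the N + 1 case runs, the allocations will tile nicely into the one large segment until it fills up. Then more memory is requested and appended to the end of the segment. This process does not create as many slivers of unusable memory, so it is more likely to succeed at finding this memory.

The pinned_use_cuda_host_register option is a boolean flag that determines whether to use the CUDA API's cudaHostRegister function for allocating pinned memory instead of the default cudaHostAlloc. When set to True, the memory is allocated using regular malloc, and then pages are mapped to the memory before calling cudaHostRegister. This pre-mapping of pages helps reduce the lock time during the execution of cudaHostRegister.

The pinned_num_register_threads option is only valid when pinned_use_cuda_host_register is set to True. By default, one thread is used to map the pages. This option allows using more threads to parallelize the page mapping operations, to reduce the overall allocation time of pinned memory. Based on benchmarking results, a good value for this option is 8.

The pinned_use_background_threads option is a boolean flag that enables a background thread for processing events. This avoids any slow path associated with querying/processing of events in the fast allocation path. This feature is disabled by default.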
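The rounding arithmetic described for roundup_power2_divisions can be sketched in a few lines of plain Python. This is an illustration of the worked example in the text (1200 with 4 divisions rounds to 1280), not the allocator's actual source; the function name round_up_power2_division is hypothetical.

```python
def round_up_power2_division(size: int, divisions: int) -> int:
    """Hypothetical helper: round size up to the next power-2 division boundary."""
    if size <= 0 or divisions <= 0:
        raise ValueError("size and divisions must be positive")
    # Largest power of two that is <= size (1024 for size 1200).
    lower = 1 << (size.bit_length() - 1)
    if size == lower:
        return size
    # Width of one division between lower and 2 * lower (256 for 1024..2048 / 4).
    step = lower // divisions
    if step == 0:
        return lower * 2
    # Ceiling division to reach the next boundary at or above size.
    steps_needed = -(-(size - lower) // step)
    return lower + steps_needed * step


print(round_up_power2_division(1200, 4))  # 1280
print(round_up_power2_division(1300, 4))  # 1536
```

With more divisions the boundaries are closer together, so less memory is wasted per allocation but fewer requests land on the same block size.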

Note

Some stats reported by the CUDA memory management API are specific to backend:native, and are not meaningful with backend:cudaMallocAsync. See each function's docstring for details.
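The <option>:<value> format above is easy to get wrong by hand, so a small helper can assemble it. The sketch below is an assumption-laden illustration: build_alloc_conf is a hypothetical name, the option values are arbitrary examples, and setting the variable before the process first touches CUDA (here, before importing torch) is the conservative choice assumed in this sketch.

```python
import os


def build_alloc_conf(options: dict) -> str:
    """Hypothetical helper: join options as <option>:<value>,<option2>:<value2>."""
    return ",".join(f"{name}:{value}" for name, value in options.items())


conf = build_alloc_conf({
    "max_split_size_mb": 128,             # example value, not a recommendation
    "garbage_collection_threshold": 0.8,  # example value from the text above
})
print(conf)  # max_split_size_mb:128,garbage_collection_threshold:0.8

# Only set the variable if the user has not already configured it.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)
```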

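Using custom memory allocators for CUDA¶

It is possible to define allocators as simple functions in C/C++ and compile them as a shared library. The code below shows a basic allocator that just traces all the memory operations.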

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC
extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   cudaMalloc(&ptr, size);
   std::cout<<"alloc "<<ptr<<" "<<size<<std::endl;
   return ptr;
}

void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
   std::cout<<"free "<<ptr<< " "<<stream<<std::endl;
   cudaFree(ptr);
}
}

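This can be used in Python through torch.cuda.memory.CUDAPluggableAllocator. The user is responsible for supplying the path to the .so file and the names of the alloc/free functions that match the signatures specified above.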

import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# Swap the current allocator
torch.cuda.memory.change_current_allocator(new_alloc)
# This will allocate memory in the device using the new allocator
b = torch.zeros(10, device='cuda')

import torch

# Do an initial memory allocation (this instantiates the default allocator)
b = torch.zeros(10, device='cuda')
# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(new_alloc)

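cuBLAS workspaces¶

For each combination of cuBLAS handle and CUDA stream, a cuBLAS workspace will be allocated if that handle and stream combination executes a cuBLAS kernel that requires a workspace. In order to avoid repeatedly allocating workspaces, these workspaces are not deallocated unless torch._C._cuda_clearCublasWorkspaces() is called. The workspace size per allocation can be specified via the environment variable CUBLAS_WORKSPACE_CONFIG with the format :[SIZE]:[COUNT]. As an example, the default workspace size per allocation is CUBLAS_WORKSPACE_CONFIG=:4096:2:16:8, which specifies a total size of 2 * 4096 + 8 * 16 KiB. To force cuBLAS to avoid using workspaces, set CUBLAS_WORKSPACE_CONFIG=:0:0.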
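The size arithmetic for CUBLAS_WORKSPACE_CONFIG can be checked with a short pure-Python sketch. The parser below is hypothetical (it is not part of PyTorch or cuBLAS) and assumes the value is a sequence of :[SIZE]:[COUNT] pairs with SIZE in KiB, as in the example above.

```python
def cublas_workspace_total_kib(config: str) -> int:
    """Hypothetical helper: total KiB described by a :[SIZE]:[COUNT]... string."""
    fields = [field for field in config.split(":") if field]
    if len(fields) % 2 != 0:
        raise ValueError("expected :[SIZE]:[COUNT] pairs")
    sizes = fields[0::2]
    counts = fields[1::2]
    # Each pair contributes COUNT buffers of SIZE KiB.
    return sum(int(size) * int(count) for size, count in zip(sizes, counts))


print(cublas_workspace_total_kib(":4096:2:16:8"))  # 8320, i.e. 2 * 4096 + 8 * 16
print(cublas_workspace_total_kib(":0:0"))          # 0
```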

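cuFFT plan cache¶

For each CUDA device, an LRU cache of cuFFT plans is used to speed up repeatedly running FFT methods (e.g., torch.fft.fft()) on CUDA tensors of the same geometry with the same configuration. Because some cuFFT plans may allocate GPU memory, these caches have a maximum capacity.

You may control and query the properties of the cache of the current device with the following APIs:

torch.backends.cuda.cufft_plan_cache.max_size gives the capacity of the cache (default is 4096 on CUDA 10 and newer, and 1023 on older CUDA versions). Setting this value directly modifies the capacity.
torch.backends.cuda.cufft_plan_cache.size gives the number of plans currently residing in the cache.
torch.backends.cuda.cufft_plan_cache.clear() clears the cache.

To control and query plan caches of a non-default device, you can index the torch.backends.cuda.cufft_plan_cache object with either a torch.device object or a device index, and access one of the above attributes. E.g., to set the capacity of the cache for device 1, one can write torch.backends.cuda.cufft_plan_cache[1].max_size = 10.

Just-in-Time Compilation¶

PyTorch just-in-time compiles some operations, like torch.special.zeta, when performed on CUDA tensors. This compilation can be time consuming (up to a few seconds depending on your hardware and software) and may occur multiple times for a single operator, since many PyTorch operators actually select from a variety of kernels, each of which must be compiled once, depending on their input. This compilation occurs once per process, or just once if a kernel cache is used.

By default, PyTorch creates a kernel cache in $XDG_CACHE_HOME/torch/kernels if XDG_CACHE_HOME is defined and in $HOME/.cache/torch/kernels if it's not (except on Windows, where the kernel cache is not yet supported). The caching behavior can be directly controlled with two environment variables. If USE_PYTORCH_KERNEL_CACHE is set to 0, then no cache will be used, and if PYTORCH_KERNEL_CACHE_PATH is set, then that path will be used as the kernel cache instead of the default location.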
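The lookup rules in the previous paragraph can be restated as a small function. This is only a sketch of the documented behavior on non-Windows systems, not PyTorch's implementation; the function name kernel_cache_dir and the order in which the two override variables are checked are assumptions.

```python
import posixpath
from typing import Mapping, Optional


def kernel_cache_dir(env: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: where the kernel cache would live, or None if disabled."""
    # USE_PYTORCH_KERNEL_CACHE=0 turns caching off entirely.
    if env.get("USE_PYTORCH_KERNEL_CACHE") == "0":
        return None
    # An explicit path replaces the default location.
    if env.get("PYTORCH_KERNEL_CACHE_PATH"):
        return env["PYTORCH_KERNEL_CACHE_PATH"]
    # Otherwise prefer $XDG_CACHE_HOME, then fall back to $HOME/.cache.
    if env.get("XDG_CACHE_HOME"):
        return posixpath.join(env["XDG_CACHE_HOME"], "torch", "kernels")
    return posixpath.join(env.get("HOME", ""), ".cache", "torch", "kernels")


print(kernel_cache_dir({"HOME": "/home/user"}))  # /home/user/.cache/torch/kernels
print(kernel_cache_dir({"HOME": "/home/user", "USE_PYTORCH_KERNEL_CACHE": "0"}))  # None
```

Passing the environment as a mapping (for example os.environ) keeps the function easy to test with made-up values.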

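Best practices¶

Device-agnostic code¶

Due to the structure of PyTorch, you may need to explicitly write device-agnostic (CPU or GPU) code; an example may be creating a new tensor as the initial hidden state of a recurrent neural network.

The first step is to determine whether the GPU should be used or not. A common pattern is to use Python's argparse module to read in user arguments, and have a flag that can be used to disable CUDA, in combination with is_available(). In the following, args.device results in a torch.device object that can be used to move tensors to CPU or CUDA.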

import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
                    help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

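Note

When assessing the availability of CUDA in a given environment (is_available()), PyTorch's default behavior is to call the CUDA Runtime API method cudaGetDeviceCount. Because this call in turn initializes the CUDA Driver API (via cuInit) if it is not already initialized, subsequent forks of a process that has run is_available() will fail with a CUDA initialization error.

One can set PYTORCH_NVML_BASED_CUDA_CHECK=1 in the environment before importing PyTorch modules that execute is_available() (or before executing it directly) in order to direct is_available() to attempt an NVML-based assessment (nvmlDeviceGetCount_v2). If the NVML-based assessment is successful (i.e. NVML discovery/initialization does not fail), is_available() calls will not poison subsequent forks.

If NVML discovery/initialization fails, is_available() will fall back to the standard CUDA Runtime API assessment and the aforementioned fork constraint will apply.

Note that the above NVML-based CUDA availability assessment provides a weaker guarantee than the default CUDA Runtime API approach (which requires CUDA initialization to succeed). In some circumstances, the NVML-based check may succeed while later CUDA initialization fails.

Now that we have args.device, we can use it to create a Tensor on the desired device.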

x = torch.empty((8, 42), device=args.device)
net = Network().to(device=args.device)

This can be done in a number of cases to produce device-agnostic code. Below is an example when using a dataloader:

cuda0 = torch.device('cuda:0')  # CUDA GPU 0
for i, x in enumerate(train_loader):
    x = x.to(cuda0)

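When working with multiple GPUs on a system, you can use the CUDA_VISIBLE_DEVICES environment variable to manage which GPUs are available to PyTorch. As mentioned above, to manually control which GPU a tensor is created on, the best practice is to use a torch.cuda.device context manager.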

print("Outside device is 0")  # On device 0 (default in most scenarios)
with torch.cuda.device(1):
    print("Inside device is 1")  # On device 1
print("Outside device is still 0")  # On device 0

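If you have a tensor and would like to create a new tensor of the same type on the same device, then you can use a torch.Tensor.new_* method (see torch.Tensor). Whilst the previously mentioned torch.* factory functions (Creation Ops) depend on the current GPU context and the attribute arguments you pass in, torch.Tensor.new_* methods preserve the device and other attributes of the tensor.

This is the recommended practice when creating modules in which new tensors need to be created internally during the forward pass.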

cuda = torch.device('cuda')
x_cpu = torch.empty(2)
x_gpu = torch.empty(2, device=cuda)
x_cpu_long = torch.empty(2, dtype=torch.int64)

y_cpu = x_cpu.new_full([3, 2], fill_value=0.3)
print(y_cpu)

    tensor([[ 0.3000,  0.3000],
            [ 0.3000,  0.3000],
            [ 0.3000,  0.3000]])

y_gpu = x_gpu.new_full([3, 2], fill_value=-5)
print(y_gpu)

    tensor([[-5.0000, -5.0000],
            [-5.0000, -5.0000],
            [-5.0000, -5.0000]], device='cuda:0')

y_cpu_long = x_cpu_long.new_tensor([[1, 2, 3]])
print(y_cpu_long)

    tensor([[ 1,  2,  3]])

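If you want to create a tensor of the same type and size as another tensor, and fill it with either ones or zeros, ones_like() or zeros_like() are provided as convenient helper functions (which also preserve the torch.device and torch.dtype of a Tensor).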

x_cpu = torch.empty(2, 3)
x_gpu = torch.empty(2, 3, device='cuda')  # allocate on the GPU so zeros_like() below also lands there

y_cpu = torch.ones_like(x_cpu)
y_gpu = torch.zeros_like(x_gpu)

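Use pinned memory buffers¶

Warning

This is an advanced tip. If you overuse pinned memory, it can cause serious problems when running low on RAM, and you should be aware that pinning is often an expensive operation.

Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a pin_memory() method that returns a copy of the object, with data put in a pinned region.

Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.

You can make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.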
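A minimal sketch of how these pieces can fit together is shown below. The dataset, shapes, and batch size are made-up placeholders, and falling back to ordinary blocking copies on machines without CUDA is an assumption added here so the snippet also runs on CPU-only hosts.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Placeholder data: 8 samples with 3 features each, plus integer labels.
dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))
# Ask the loader for pinned batches only when a GPU is present.
loader = DataLoader(dataset, batch_size=4, pin_memory=use_cuda)

moved = []
for features, labels in loader:
    # With pinned source memory, non_blocking=True lets the copy overlap with other work.
    features = features.to(device, non_blocking=use_cuda)
    labels = labels.to(device, non_blocking=use_cuda)
    moved.append((features, labels))
```

Work queued on the same stream after the copy sees the transferred data in order, so no extra synchronization is needed for the usual training loop.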

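Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel¶

Most use cases involving batched inputs and multiple GPUs should default to using DistributedDataParallel to utilize more than one GPU.

There are significant caveats to using CUDA models with multiprocessing; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.

It is recommended to use DistributedDataParallel instead of DataParallel for multi-GPU training, even if there is only a single node.

The difference between DistributedDataParallel and DataParallel is: DistributedDataParallel uses multiprocessing, where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, which avoids the performance overhead caused by the GIL of the Python interpreter.

If you use DistributedDataParallel, you can use the torch.distributed.launch utility to launch your program; see Third-party backends.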

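CUDA Graphs¶

A CUDA graph is a record of the work (mostly kernels and their arguments) that a CUDA stream and its dependent streams perform. For general principles and details on the underlying CUDA API, see Getting Started with CUDA Graphs and the Graphs section of the CUDA C Programming Guide.

PyTorch supports the construction of CUDA graphs using stream capture, which puts a CUDA stream in capture mode. CUDA work issued to a capturing stream doesn't actually run on the GPU. Instead, the work is recorded in a graph.

After capture, the graph can be launched to run the GPU work as many times as needed. Each replay runs the same kernels with the same arguments. For pointer arguments, this means the same memory addresses are used. By filling input memory with new data (e.g., from a new batch) before each replay, you can rerun the same work on new data.

Why CUDA Graphs?¶

Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for greatly reduced CPU overhead. A graph's arguments and kernels are fixed, so a graph replay skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver overheads. Under the hood, a replay submits the entire graph's work to the GPU with a single call to cudaGraphLaunch. Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit.

You should try CUDA graphs if all or part of your network is graph-safe (usually this means static shapes and static control flow, but see the other constraints) and you suspect its runtime is at least somewhat CPU-limited.

PyTorch API¶

Warning

This API is in beta and may change in future releases.

PyTorch exposes graphs via a raw torch.cuda.CUDAGraph class and two convenience wrappers, torch.cuda.graph and torch.cuda.make_graphed_callables.

torch.cuda.graph is a simple, versatile context manager that captures CUDA work in its context. Before capture, warm up the workload to be captured by running a few eager iterations. Warmup must occur on a side stream. Because the graph reads from and writes to the same memory addresses in every replay, you must maintain long-lived references to the tensors that hold input and output data during capture. To run the graph on new input data, copy new data to the capture's input tensor(s), replay the graph, then read the new output from the capture's output tensor(s). Example: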

g = torch.cuda.CUDAGraph()

# Placeholder input used for capture
static_input = torch.empty((5,), device="cuda")

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = static_input * 2
torch.cuda.current_stream().wait_stream(s)

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
with torch.cuda.graph(g):
    static_output = static_input * 2

# Fills the graph's input memory with new data to compute on
static_input.copy_(torch.full((5,), 3, device="cuda"))
g.replay()
# static_output holds the results
print(static_output)  # full of 3 * 2 = 6

# Fills the graph's input memory with more data to compute on
static_input.copy_(torch.full((5,), 4, device="cuda"))
g.replay()
print(static_output)  # full of 4 * 2 = 8

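See Whole-network capture, Usage with torch.cuda.amp, and Usage with multiple streams for realistic and advanced patterns.

make_graphed_callables is more sophisticated. make_graphed_callables accepts Python functions and torch.nn.Modules. For each passed function or Module, it creates separate graphs of the forward-pass and backward-pass work. See Partial-network capture.

Constraints¶

A set of ops is capturable if it doesn't violate any of the following constraints.

Constraints apply to all work in a torch.cuda.graph context and all work in the forward and backward passes of any callable you pass to torch.cuda.make_graphed_callables().

Violating any of these will likely cause a runtime error:

Capture must occur on a non-default stream. (This is only a concern if you use the raw CUDAGraph.capture_begin and CUDAGraph.capture_end calls. graph and make_graphed_callables() set a side stream for you.)
Ops that synchronize the CPU with the GPU (e.g., .item() calls) are prohibited.
CUDA RNG operations are permitted, and when using multiple torch.Generator instances within a graph, they must be registered using CUDAGraph.register_generator_state before graph capture. Avoid using Generator.get_state and Generator.set_state during capture; instead, use Generator.graphsafe_set_state and Generator.graphsafe_get_state to manage generator states safely within the graph context. This ensures proper RNG operation and generator management within CUDA graphs.

Violating any of these will likely cause silent numerical errors or undefined behavior:

Within a process, only one capture may be underway at a time.
No non-captured CUDA work may run in this process (on any thread) while capture is underway.
CPU work is not captured. If the captured ops include CPU work, that work will be elided during replay.
Every replay reads from and writes to the same (virtual) memory addresses.
Dynamic control flow (based on CPU or GPU data) is prohibited.
Dynamic shapes are prohibited. The graph assumes every tensor in the captured op sequence has the same size and layout in every replay.
Using multiple streams in a capture is allowed, but there are restrictions.

Non-constraints¶

Once captured, the graph may be replayed on any stream.

Whole-network capture¶

If your entire network is capturable, you can capture and replay an entire iteration: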

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.

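Partial-network capture¶

If some of your network is unsafe to capture (e.g., due to dynamic control flow, dynamic shapes, CPU syncs, or essential CPU-side logic), you can run the unsafe part(s) eagerly and use torch.cuda.make_graphed_callables() to graph only the capture-safe part(s).

By default, callables returned by make_graphed_callables() are autograd-aware, and can be used in the training loop as direct replacements for the functions or nn.Modules you passed.

make_graphed_callables() internally creates CUDAGraph objects, runs warmup iterations, and maintains static inputs and outputs as needed. Therefore (unlike with torch.cuda.graph) you don't need to handle those manually.

In the following example, data-dependent dynamic control flow means the network isn't capturable end-to-end, but make_graphed_callables() lets us capture and run the graph-safe sections as graphs regardless: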

from itertools import chain  # combines the parameter iterators passed to the optimizer

N, D_in, H, D_out = 640, 4096, 2048, 1024

module1 = torch.nn.Linear(D_in, H).cuda()
module2 = torch.nn.Linear(H, D_out).cuda()
module3 = torch.nn.Linear(H, D_out).cuda()

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(chain(module1.parameters(),
                                  module2.parameters(),
                                  module3.parameters()),
                            lr=0.1)

# Sample inputs used for capture
# requires_grad state of sample inputs must match
# requires_grad state of real inputs each callable will see.
x = torch.randn(N, D_in, device='cuda')
h = torch.randn(N, H, device='cuda', requires_grad=True)

module1 = torch.cuda.make_graphed_callables(module1, (x,))
module2 = torch.cuda.make_graphed_callables(module2, (h,))
module3 = torch.cuda.make_graphed_callables(module3, (h,))

real_inputs = [torch.rand_like(x) for _ in range(10)]
real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    optimizer.zero_grad(set_to_none=True)

    tmp = module1(data)  # forward ops run as a graph

    if tmp.sum().item() > 0:
        tmp = module2(tmp)  # forward ops run as a graph
    else:
        tmp = module3(tmp)  # forward ops run as a graph

    loss = loss_fn(tmp, target)
    # module2's or module3's (whichever was chosen) backward ops,
    # as well as module1's backward ops, run as graphs
    loss.backward()
    optimizer.step()

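Usage with torch.cuda.amp¶

For typical optimizers, GradScaler.step syncs the CPU with the GPU, which is prohibited during capture. To avoid errors, either use partial-network capture, or (if forward, loss, and backward are capture-safe) capture forward, loss, and backward but not the optimizer step: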

# warmup
# In a real setting, use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            y_pred = model(static_input)
            loss = loss_fn(y_pred, static_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    with torch.cuda.amp.autocast():
        static_y_pred = model(static_input)
        static_loss = loss_fn(static_y_pred, static_target)
    scaler.scale(static_loss).backward()
    # don't capture scaler.step(optimizer) or scaler.update()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    # Runs scaler.step and scaler.update eagerly
    scaler.step(optimizer)
    scaler.update()

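Usage with multiple streams¶

Capture mode automatically propagates to any streams that sync with a capturing stream. Within capture, you may expose parallelism by issuing calls to different streams, but the overall stream dependency DAG must branch out from the initial capturing stream after capture begins and rejoin the initial stream before capture ends: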

with torch.cuda.graph(g):
    # at context manager entrance, torch.cuda.current_stream()
    # is the initial capturing stream

    # INCORRECT (does not branch out from or rejoin initial stream)
    with torch.cuda.stream(s):
        cuda_work()

    # CORRECT:
    # branches out from initial stream
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        cuda_work()
    # rejoins initial stream before capture ends
    torch.cuda.current_stream().wait_stream(s)

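Note

To avoid confusion for power users looking at replays in nsight systems or nvprof: unlike eager execution, the graph interprets a nontrivial stream DAG in capture as a hint, not a command. During replay, the graph may reorganize independent ops onto different streams or enqueue them in a different order (while respecting your original DAG's overall dependencies).

Usage with DistributedDataParallel¶

NCCL < 2.9.6¶

NCCL versions earlier than 2.9.6 don't allow collectives to be captured. You must use partial-network capture, which defers allreduces to happen outside the graphed sections of backward.

Call make_graphed_callables() on graphable network sections before wrapping the network with DDP.

NCCL >= 2.9.6¶

NCCL versions 2.9.6 or later allow collectives in the graph. Approaches that capture an entire backward pass are a viable option, but need three setup steps.

Disable DDP's internal async error handling: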

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
torch.distributed.init_process_group(...)

Before full-backward capture, DDP must be constructed in a side-stream context:

with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

Your warmup must run at least 11 DDP-enabled eager iterations before capture.

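Graph memory management¶

A captured graph acts on the same virtual addresses every time it replays. If PyTorch frees the memory, a later replay can hit an illegal memory access. If PyTorch reassigns the memory to new tensors, the replay can corrupt the values seen by those tensors. Therefore, the virtual addresses used by the graph must be reserved for the graph across replays. The PyTorch caching allocator achieves this by detecting when capture is underway and satisfying the capture's allocations from a graph-private memory pool. The private pool stays alive until its CUDAGraph object and all tensors created during capture go out of scope.

Private pools are maintained automatically. By default, the allocator creates a separate private pool for each capture. If you capture multiple graphs, this conservative approach ensures graph replays never corrupt each other's values, but sometimes needlessly wastes memory.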
