Distributed and Parallel Training Tutorials

Created On: Oct 04, 2022 | Last Updated: Oct 31, 2024 | Last Verified: Nov 05, 2024

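Distributed training is a model training paradigm that spreads the training workload across multiple worker nodes, which can significantly improve training speed and model accuracy. While distributed training can be used for any type of ML model training, it is most beneficial for large models and compute-intensive tasks such as deep learning.

There are several ways to perform distributed training in PyTorch, and each method has its advantages in certain use cases: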

DistributedDataParallel (DDP)
Fully Sharded Data Parallel (FSDP)
Tensor Parallel (TP)
Device Mesh
Remote Procedure Call (RPC) distributed training
Custom Extensions

Read more about these options in the Distributed Overview.

Learn DDP

DDP Intro Video Tutorials

A step-by-step video series on how to get started with DistributedDataParallel and advance to more complex topics.

Code Video

https://pytorch.dev.org.tw/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro

Getting Started with Distributed Data Parallel

This tutorial provides a short and gentle introduction to PyTorch DistributedDataParallel.

Code

https://pytorch.dev.org.tw/tutorials/intermediate/ddp_tutorial.html?utm_source=distr_landing&utm_medium=intermediate_ddp_tutorial
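
As a rough illustration of the idea behind data parallelism, the sketch below simulates it in plain Python, without PyTorch. Each simulated worker computes a gradient on its own data shard, the gradients are averaged (the role an all-reduce plays), and every replica applies the same update. The helper names `local_gradient`, `all_reduce_mean` and `train`, the toy model and the data are all hypothetical and exist only for this example.

```python
# Plain-Python simulation of data-parallel training; it does not use PyTorch.
# Toy model: y = w * x with squared-error loss, one scalar weight per replica.

def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(shards, w0=0.0, lr=0.05, steps=200):
    """Run synchronous data-parallel SGD with one replica per shard."""
    weights = [w0] * len(shards)
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        synced = all_reduce_mean(grads)
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights


# Data generated from y = 3x, split across two simulated workers.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]
final_weights = train(shards)
```

Because every replica starts from the same weight and applies the same averaged gradient, the replicas stay identical throughout training, and in this toy problem they approach the true slope of 3.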

Distributed Training with Uneven Inputs Using the Join Context Manager

This tutorial describes the Join context manager and demonstrates its use with DistributedDataParallel.

Code

https://pytorch.dev.org.tw/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
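
To see why uneven inputs are a problem, the following plain-Python sketch (no PyTorch) simulates ranks that hold different numbers of batches. Without special handling, a collective can only complete while every rank still participates. With a join-style scheme, ranks that run out of data keep participating with zero contributions so the others can finish. The function names are hypothetical, and dividing by the full number of ranks is an assumption of this sketch, not a statement about the library's default.

```python
# Plain-Python simulation of training with uneven inputs; it does not use PyTorch.

def steps_without_join(batches_per_rank):
    """Collectives need every rank, so only min(...) steps can complete."""
    return min(batches_per_rank)


def averaged_grads_with_join(grads_per_rank):
    """Ranks that ran out of data contribute zeros so the others can continue.

    Assumption of this sketch: the sum is divided by the full number of ranks.
    """
    world_size = len(grads_per_rank)
    total_steps = max(len(g) for g in grads_per_rank)
    averaged = []
    for step in range(total_steps):
        contributions = [
            g[step] if step < len(g) else 0.0 for g in grads_per_rank
        ]
        averaged.append(sum(contributions) / world_size)
    return averaged


# Rank 0 has three batches, rank 1 has only one.
grads_per_rank = [[1.0, 1.0, 1.0], [3.0]]
safe_steps = steps_without_join([len(g) for g in grads_per_rank])
result = averaged_grads_with_join(grads_per_rank)
```

Here `safe_steps` is 1, while the join-style loop completes all three steps of the longer rank.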

Learn FSDP

Getting Started with FSDP

This tutorial demonstrates how to perform distributed training with FSDP on the MNIST dataset.

Code

https://pytorch.dev.org.tw/tutorials/intermediate/FSDP_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_getting_started
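
The core idea of sharding parameters can be sketched in plain Python, without PyTorch or FSDP itself. In this simplified picture, a flat list of parameters is split into equal shards, one per rank, with zero padding at the tail, and an all-gather reconstructs the full list when it is needed. The helper names `shard` and `all_gather` are hypothetical, and the padding scheme is an assumption of this example.

```python
# Plain-Python simulation of parameter sharding; it does not use PyTorch or FSDP.

def shard(flat_params, world_size):
    """Split a flat parameter list into equal shards, zero-padding the tail."""
    shard_len = -(-len(flat_params) // world_size)  # ceiling division
    padding = shard_len * world_size - len(flat_params)
    padded = list(flat_params) + [0.0] * padding
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards, original_len):
    """Stand-in for an all-gather: concatenate shards and drop the padding."""
    full = [value for s in shards for value in s]
    return full[:original_len]


params = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = shard(params, world_size=2)
restored = all_gather(shards, len(params))
```

Each simulated rank stores only its own shard between uses, which is where the memory saving comes from in this simplified model.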

FSDP Advanced

In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5 model with FSDP for text summarization.

Code

https://pytorch.dev.org.tw/tutorials/intermediate/FSDP_advanced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced

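Learn Tensor Parallel (TP)

Large-Scale Transformer Model Training with Tensor Parallel (TP)

This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel and Fully Sharded Data Parallel.

Code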

https://pytorch.dev.org.tw/tutorials/intermediate/TP_tutorial.html
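
One common form of tensor parallelism splits a weight matrix by columns, so that each device computes a slice of the output. The plain-Python sketch below (no PyTorch) checks that concatenating the per-shard results gives the same answer as the full matrix product. The function names are hypothetical, and the sketch assumes the number of columns is divisible by the number of shards.

```python
# Plain-Python simulation of column-wise tensor parallelism; no PyTorch involved.

def matmul(a, b):
    """Dense matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column blocks."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("number of columns must be divisible by parts")
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each simulated device multiplies by its block; outputs are concatenated."""
    partials = [matmul(x, block) for block in split_columns(w, parts)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, parts=2)
```

In this layout each simulated device holds only half of the weight columns, yet the combined output matches the unsharded computation.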

Learn DeviceMesh

Getting Started with DeviceMesh

In this tutorial, you will learn about DeviceMesh and how it can help with distributed training.

Code

https://pytorch.dev.org.tw/tutorials/recipes/distributed_device_mesh.html?highlight=devicemesh
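
A device mesh can be pictured as a grid of ranks in which each rank belongs to one group per mesh dimension. The plain-Python sketch below (no PyTorch) builds a 2D grid and looks up the row and column groups of a given rank. The function names and the row-major rank layout are assumptions made for this illustration.

```python
# Plain-Python picture of a 2D device mesh; it does not use PyTorch.

def build_mesh(shape):
    """Lay out ranks 0..rows*cols-1 in row-major order (an assumption here)."""
    rows, cols = shape
    return [[r * cols + c for c in range(cols)] for r in range(rows)]


def groups_for_rank(mesh, rank):
    """Return the ranks sharing a row and the ranks sharing a column with `rank`."""
    for row in mesh:
        if rank in row:
            col_index = row.index(rank)
            return {
                "row_group": row,
                "col_group": [other_row[col_index] for other_row in mesh],
            }
    raise ValueError(f"rank {rank} is not in the mesh")


mesh = build_mesh((2, 4))  # e.g. 2 hosts with 4 devices each
groups = groups_for_rank(mesh, 5)
```

With a layout like this, one dimension can be used for one kind of parallelism (for example sharding within a host) and the other dimension for another (for example replication across hosts), without managing each group of ranks by hand.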

Learn RPC

Getting Started with the Distributed RPC Framework

This tutorial demonstrates how to get started with RPC-based distributed training.

Code

https://pytorch.dev.org.tw/tutorials/intermediate/rpc_tutorial.html?utm_source=distr_landing&utm_medium=rpc_getting_started

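Implementing a Parameter Server Using the Distributed RPC Framework

This tutorial walks you through a simple example of implementing a parameter server using PyTorch's Distributed RPC framework.

Code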

https://pytorch.dev.org.tw/tutorials/intermediate/rpc_param_server_tutorial.html?utm_source=distr_landing&utm_medium=rpc_param_server_tutorial
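
The parameter-server pattern itself can be sketched in plain Python, with ordinary method calls standing in for remote calls. In this simplified picture, a central object owns the parameter, and workers repeatedly fetch it, compute a gradient on their own data, and push the gradient back. The class `ParameterServer`, the function `worker_step`, the toy model and the data are all hypothetical and are not part of the RPC framework.

```python
# Plain-Python simulation of a parameter server; method calls stand in for RPCs.

class ParameterServer:
    """Owns a single scalar weight and applies gradients pushed by workers."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr
        self.updates = 0

    def get_param(self):
        return self.w

    def push_gradient(self, grad):
        self.w -= self.lr * grad
        self.updates += 1


def worker_step(server, shard):
    """Fetch the weight, compute a local gradient for y = w * x, push it back."""
    w = server.get_param()
    grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    server.push_gradient(grad)


# Two simulated workers, each holding data generated from y = 2x.
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]
server = ParameterServer(w=0.0, lr=0.1)
for _ in range(50):
    for shard in shards:
        worker_step(server, shard)
```

Unlike synchronous gradient averaging, each worker here updates the shared weight as soon as its gradient is ready, so workers may see a weight that another worker has already changed.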

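Implementing Batch RPC Processing Using Asynchronous Executions

In this tutorial, you will build batch-processing RPC applications with the @rpc.functions.async_execution decorator.

Code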

https://pytorch.dev.org.tw/tutorials/intermediate/rpc_async_execution.html?utm_source=distr_landing&utm_medium=rpc_async_execution
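
The batching pattern can be sketched with the standard library alone: each request immediately receives a future, and all futures of a batch are completed together once enough requests have arrived. The sketch uses `concurrent.futures.Future` as a stand-in for the futures used by the RPC framework, and the class `BatchAdder` is hypothetical.

```python
# Stdlib-only simulation of batched request handling; it does not use torch RPC.
import threading
from concurrent.futures import Future


class BatchAdder:
    """Collects requests and answers all of them once a full batch arrives."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []  # list of (value, future) pairs
        self.lock = threading.Lock()

    def submit(self, value):
        """Register a request and return a future for the batch result."""
        fut = Future()
        with self.lock:
            self.pending.append((value, fut))
            if len(self.pending) < self.batch_size:
                return fut  # batch not full yet; the caller waits on the future
            batch, self.pending = self.pending, []
        # The batch is full: do the work once and complete every future.
        total = sum(v for v, _ in batch)
        for _, f in batch:
            f.set_result(total)
        return fut


server = BatchAdder(batch_size=3)
futures = [server.submit(v) for v in (1, 2, 3)]
results = [f.result(timeout=1) for f in futures]
```

The first two callers get a future that is still pending when `submit` returns, and the third request triggers a single computation whose result is delivered to all three.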

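Combining Distributed DataParallel with the Distributed RPC Framework

In this tutorial, you will learn how to combine distributed data parallelism with distributed model parallelism.

Code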

https://pytorch.dev.org.tw/tutorials/advanced/rpc_ddp_tutorial.html?utm_source=distr_landing&utm_medium=rpc_plus_ddp

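Custom Extensions

Customize Process Group Backends Using Cpp Extensions

In this tutorial, you will learn to implement a custom ProcessGroup backend and plug it into the PyTorch distributed package using cpp extensions.

Code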

https://pytorch.dev.org.tw/tutorials/intermediate/process_group_cpp_extension_tutorial.html?utm_source=distr_landing&utm_medium=custom_extensions_cpp