Customize Process Group Backends Using Cpp Extensions¶

Created On: Feb 1, 2022 | Last Updated: Nov 14, 2024 | Last Verified: Nov 5, 2024

Authors: Howard Huang, Feng Tian, Shen Li, Min Si

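This tutorial demonstrates how to implement a custom Backend and plug it into the PyTorch distributed package using cpp extensions. This is helpful when you need a specialized software stack for your hardware, or when you would like to experiment with new collective communication algorithms.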

Basics¶

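PyTorch collective communications power several widely adopted distributed training features, including DistributedDataParallel and ZeroRedundancyOptimizer. To make the same collective communication API work with different communication backends, the distributed package abstracts collective communication operations into a Backend class. Different backends can then be implemented as subclasses of Backend using preferred third-party libraries. The PyTorch distributed package ships with three default backends: ProcessGroupNCCL, ProcessGroupGloo, and ProcessGroupMPI. Beyond these three, however, there are other communication libraries (e.g., UCC, OneCCL), different types of hardware (e.g., TPU, Trainium), and emerging communication algorithms (e.g., Herring, Reduction Server). The distributed package therefore exposes extension APIs that allow customizing collective communication backends.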

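The 4 steps below show how to implement a dummy Backend subclass and use it in Python application code. Note that this tutorial focuses on demonstrating the extension APIs rather than on developing a functioning communication backend. The dummy backend therefore covers only a subset of the APIs (all_reduce and all_gather) and simply sets the values of tensors to 0.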
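As a language-neutral illustration of this design, the sketch below models the abstraction in plain Python. It does not use the PyTorch API: `BackendSketch` and `DummyBackendSketch` are hypothetical names, and nested Python lists stand in for tensors. It only shows the pattern, in which a subclass overrides the collectives it supports and every other collective raises an error.

```python
class BackendSketch:
    """Hypothetical stand-in for the Backend base class (not the PyTorch API)."""

    def __init__(self, rank, size):
        self.rank = rank
        self.size = size

    # Collectives raise by default, so an unsupported call fails loudly.
    def allreduce(self, tensors):
        raise RuntimeError("allreduce is not implemented by this backend")

    def allgather(self, output_tensors, input_tensors):
        raise RuntimeError("allgather is not implemented by this backend")

    def broadcast(self, tensors, root=0):
        raise RuntimeError("broadcast is not implemented by this backend")


class DummyBackendSketch(BackendSketch):
    """Overrides only allreduce and allgather, and zeroes every value."""

    def allreduce(self, tensors):
        for tensor in tensors:
            tensor[:] = [0] * len(tensor)  # in-place, like tensor.zero_()
        return tensors

    def allgather(self, output_tensors, input_tensors):
        for per_rank in output_tensors:
            for tensor in per_rank:
                tensor[:] = [0] * len(tensor)
        return output_tensors


backend = DummyBackendSketch(rank=0, size=1)
data = [[1.0, 2.0, 3.0]]
backend.allreduce(data)  # data now holds only zeros
```

Calling `backend.broadcast(data)` on this sketch raises `RuntimeError`, because the subclass never overrides it.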

Step 1: Implement a Subclass of `Backend`¶

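The first step is to implement a Backend subclass that overrides the target collective communication APIs and runs the custom communication algorithm. The extension also needs to implement a Work subclass, which serves as a future of the communication results and allows asynchronous execution in application code. If the extension uses third-party libraries, it can include their headers and call into the library APIs from the BackendDummy subclass. The two code snippets below present the implementation of dummy.hpp and dummy.cpp. See the dummy collectives repository for the full implementation.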

// file name: dummy.hpp
#include <torch/python.h>

#include <torch/csrc/distributed/c10d/Backend.hpp>
#include <torch/csrc/distributed/c10d/Work.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
#include <torch/csrc/distributed/c10d/Types.hpp>
#include <torch/csrc/distributed/c10d/Utils.hpp>

#include <pybind11/chrono.h>

namespace c10d {

class BackendDummy : public Backend {
  public:
    BackendDummy(int rank, int size);

    c10::intrusive_ptr<Work> allgather(
        std::vector<std::vector<at::Tensor>>& outputTensors,
        std::vector<at::Tensor>& inputTensors,
        const AllgatherOptions& opts = AllgatherOptions()) override;

    c10::intrusive_ptr<Work> allreduce(
        std::vector<at::Tensor>& tensors,
        const AllreduceOptions& opts = AllreduceOptions()) override;

    // The collective communication APIs without a custom implementation
    // will error out if invoked by application code.
};

class WorkDummy : public Work {
  public:
    WorkDummy(
      OpType opType,
      c10::intrusive_ptr<c10::ivalue::Future> future) // future of the output
      : Work(
          -1, // rank, only used by recvAnySource, irrelevant in this demo
          opType),
      future_(std::move(future)) {}
    bool isCompleted() override;
    bool isSuccess() const override;
    bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override;
    c10::intrusive_ptr<c10::ivalue::Future> getFuture() override;

  private:
    c10::intrusive_ptr<c10::ivalue::Future> future_;
};
} // namespace c10d

// file name: dummy.cpp
#include "dummy.hpp"

namespace c10d {

// This is a dummy allgather that sets all output tensors to zero
// Modify the implementation to conduct real communication asynchronously
c10::intrusive_ptr<Work> BackendDummy::allgather(
        std::vector<std::vector<at::Tensor>>& outputTensors,
        std::vector<at::Tensor>& inputTensors,
        const AllgatherOptions& /* unused */) {
    for (auto& outputTensorVec : outputTensors) {
        for (auto& outputTensor : outputTensorVec) {
            outputTensor.zero_();
        }
    }

    auto future = c10::make_intrusive<c10::ivalue::Future>(
        c10::ListType::create(c10::ListType::create(c10::TensorType::get())));
    future->markCompleted(c10::IValue(outputTensors));
    return c10::make_intrusive<WorkDummy>(OpType::ALLGATHER, std::move(future));
}

// This is a dummy allreduce that sets all output tensors to zero
// Modify the implementation to conduct real communication asynchronously
c10::intrusive_ptr<Work> BackendDummy::allreduce(
        std::vector<at::Tensor>& tensors,
        const AllreduceOptions& opts) {
    for (auto& tensor : tensors) {
        tensor.zero_();
    }

    auto future = c10::make_intrusive<c10::ivalue::Future>(
        c10::ListType::create(c10::TensorType::get()));
    future->markCompleted(c10::IValue(tensors));
    return c10::make_intrusive<WorkDummy>(OpType::ALLREDUCE, std::move(future));
}
} // namespace c10d
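The role of the Work object as a future can be illustrated in plain Python. The sketch below is not the PyTorch API: `WorkSketch`, `zero_fill`, and `async_allreduce` are hypothetical names, and the standard library's `concurrent.futures` stands in for `c10::ivalue::Future`. Unlike the dummy C++ code, which marks its future completed immediately, this sketch runs the operation on a worker thread to show what asynchronous execution looks like from the caller's side.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class WorkSketch:
    """Hypothetical handle returned by an asynchronous collective."""

    def __init__(self, future: Future):
        self._future = future

    def is_completed(self):
        return self._future.done()

    def wait(self, timeout=None):
        # Blocks until the result is ready. Raises if the operation
        # failed or if the timeout (in seconds) expires first.
        self._future.result(timeout=timeout)
        return True

    def get_future(self):
        return self._future


def zero_fill(values):
    """Stand-in for the communication itself: zero the list in place."""
    values[:] = [0] * len(values)
    return values


def async_allreduce(values, executor):
    """Schedule the operation and return immediately with a handle."""
    return WorkSketch(executor.submit(zero_fill, values))


with ThreadPoolExecutor(max_workers=1) as executor:
    data = [1, 2, 3]
    work = async_allreduce(data, executor)
    # The caller could overlap other computation here before waiting.
    work.wait(timeout=5.0)
    result = work.get_future().result()
```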

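Step 2: Expose The Extension Python APIs¶

The backend constructors are called from the Python side, so the extension also needs to expose the constructor APIs to Python. This can be done by adding the following methods. In this example, store and timeout are ignored by the BackendDummy instantiation method, because the dummy implementation does not use them. Real-world extensions, however, should consider using the store to perform rendezvous and should support the timeout argument.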

// file name: dummy.hpp
class BackendDummy : public Backend {
    ...
    <Step 1 code>
    ...

    static c10::intrusive_ptr<Backend> createBackendDummy(
        const c10::intrusive_ptr<::c10d::Store>& store,
        int rank,
        int size,
        const std::chrono::duration<float>& timeout);

    static void BackendDummyConstructor() __attribute__((constructor)) {
        py::object module = py::module::import("torch.distributed");
        py::object register_backend =
            module.attr("Backend").attr("register_backend");
        // torch.distributed.Backend.register_backend will add `dummy` as a
        // new valid backend.
        register_backend("dummy", py::cpp_function(createBackendDummy));
    }
};

// file name: dummy.cpp
c10::intrusive_ptr<Backend> BackendDummy::createBackendDummy(
        const c10::intrusive_ptr<::c10d::Store>& /* unused */,
        int rank,
        int size,
        const std::chrono::duration<float>& /* unused */) {
    return c10::make_intrusive<BackendDummy>(rank, size);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("createBackendDummy", &BackendDummy::createBackendDummy);
}
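The registration pattern used here, a name mapped to a creator function that is later called with the store, rank, size, and timeout, can be modeled in a few lines of plain Python. The sketch below is a simplified model only, not the implementation of `torch.distributed.Backend.register_backend`; the registry dictionary and the functions `register_backend`, `create_backend`, and `create_dummy` are hypothetical.

```python
from datetime import timedelta

# Hypothetical registry: backend name -> creator function.
_BACKEND_REGISTRY = {}


def register_backend(name, creator):
    """Record a creator under a case-insensitive backend name."""
    key = name.lower()
    if key in _BACKEND_REGISTRY:
        raise ValueError(f"backend {name!r} is already registered")
    _BACKEND_REGISTRY[key] = creator


def create_backend(name, store, rank, size, timeout):
    """Look up the creator and call it with the standard four arguments."""
    try:
        creator = _BACKEND_REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"unknown backend {name!r}") from None
    return creator(store, rank, size, timeout)


def create_dummy(store, rank, size, timeout):
    # Like createBackendDummy, this ignores store and timeout.
    return {"backend": "dummy", "rank": rank, "size": size}


register_backend("dummy", create_dummy)
pg = create_backend(
    "dummy", store=None, rank=0, size=1, timeout=timedelta(seconds=30)
)
```

Keeping the creator signature fixed is what lets the caller construct any registered backend without knowing its concrete type.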

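Step 3: Build The Custom Extension¶

The extension source files are now ready, and we can use cpp extensions to build them. To do so, create a setup.py file that prepares the paths and commands, then run python setup.py develop to install the extension.

If the extension depends on third-party libraries, you can also pass library_dirs and libraries to the cpp extension APIs. See the torch ucc project for a real-world example.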

# file name: setup.py
import os
import torch
from setuptools import setup
from torch.utils import cpp_extension

sources = ["src/dummy.cpp"]
include_dirs = [f"{os.path.dirname(os.path.abspath(__file__))}/include/"]

if torch.cuda.is_available():
    module = cpp_extension.CUDAExtension(
        name = "dummy_collectives",
        sources = sources,
        include_dirs = include_dirs,
    )
else:
    module = cpp_extension.CppExtension(
        name = "dummy_collectives",
        sources = sources,
        include_dirs = include_dirs,
    )

setup(
    name = "Dummy-Collectives",
    version = "0.0.1",
    ext_modules = [module],
    cmdclass={'build_ext': cpp_extension.BuildExtension}
)
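For an extension that links against a third-party library, the extra arguments might be assembled as in the sketch below. This is an illustration under stated assumptions: the library name `mycomm`, its install prefix `/opt/mycomm`, the project path, and the helper `extension_kwargs` are all hypothetical. The sketch deliberately avoids importing torch so that it runs anywhere; the resulting dictionary is meant to be unpacked into `cpp_extension.CppExtension` or `cpp_extension.CUDAExtension` inside setup.py.

```python
import os


def extension_kwargs(project_root, third_party_root):
    """Build keyword arguments for CppExtension / CUDAExtension."""
    return {
        "name": "dummy_collectives",
        "sources": ["src/dummy.cpp"],
        "include_dirs": [
            os.path.join(project_root, "include"),
            # Headers of the hypothetical third-party library.
            os.path.join(third_party_root, "include"),
        ],
        # Directories searched for libraries at link time.
        "library_dirs": [os.path.join(third_party_root, "lib")],
        # Library names without the "lib" prefix or file extension.
        "libraries": ["mycomm"],
    }


kwargs = extension_kwargs("/path/to/project", "/opt/mycomm")
# In setup.py this would become:
#   module = cpp_extension.CppExtension(**kwargs)
```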

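Step 4: Use The Extension in Application¶

After installation, you can use the dummy backend when calling init_process_group as if it were a builtin backend.

Dispatch can be chosen per device type through the backend argument of init_process_group. Specifying cpu:gloo,cuda:dummy as the backend argument dispatches collectives on CPU tensors to the gloo backend and collectives on CUDA tensors to the dummy backend.

To send all tensors to the dummy backend, simply specify dummy as the backend argument.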

import os

import torch
# importing dummy_collectives makes torch.distributed recognize `dummy`
# as a valid backend.
import dummy_collectives

import torch.distributed as dist

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

# Alternatively:
# dist.init_process_group("dummy", rank=0, world_size=1)
dist.init_process_group("cpu:gloo,cuda:dummy", rank=0, world_size=1)

# this goes through gloo
x = torch.ones(6)
dist.all_reduce(x)
print(f"cpu allreduce: {x}")

# this goes through dummy
if torch.cuda.is_available():
    y = x.cuda()
    dist.all_reduce(y)
    print(f"cuda allreduce: {y}")

    try:
        dist.broadcast(y, 0)
    except RuntimeError:
        print("got RuntimeError when calling broadcast")
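To make the meaning of the backend argument used above concrete, the sketch below parses such a string into a device-to-backend mapping. It is a simplified model and not PyTorch's own parsing code: `parse_backend_spec` is a hypothetical helper, and the assumption that a bare backend name applies to both `cpu` and `cuda` is a simplification of what the tutorial describes as sending all tensors to one backend.

```python
def parse_backend_spec(spec, default_devices=("cpu", "cuda")):
    """Turn 'cpu:gloo,cuda:dummy' or 'dummy' into {device: backend}."""
    spec = spec.strip().lower()
    if ":" not in spec:
        # A bare name: assume it handles every default device type.
        return {device: spec for device in default_devices}

    mapping = {}
    for pair in spec.split(","):
        device, sep, backend = pair.strip().partition(":")
        if not sep or not device or not backend:
            raise ValueError(
                f"malformed entry {pair!r}; expected 'device:backend'"
            )
        mapping[device] = backend
    return mapping


per_device = parse_backend_spec("cpu:gloo,cuda:dummy")
all_dummy = parse_backend_spec("dummy")
```

With such a mapping, a collective call only needs to look at the device type of its tensor to pick the backend that should run it.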
