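Multi-Agent Reinforcement Learning (PPO) with TorchRL Tutorial

Author: Matteo Bettini

See also

The BenchMARL library provides state-of-the-art implementations of MARL algorithms using TorchRL.

This tutorial demonstrates how to use PyTorch and torchrl to solve a Multi-Agent Reinforcement Learning (MARL) problem.

For ease of use, this tutorial follows the general structure of the existing Reinforcement Learning (PPO) with TorchRL Tutorial. Familiarity with that tutorial is suggested but not required before starting this one.

In this tutorial, we will use the Navigation environment from VMAS, a multi-robot simulator, also based on PyTorch, that runs parallel batched simulation on device.

In the Navigation environment, we need to train multiple robots (spawned at random positions) to navigate to their goals (also at random positions), while using LIDAR sensors to avoid collisions with each other.

Key learnings

How to create a multi-agent environment in TorchRL, how its specs work, and how it integrates with the library;
How to use GPU vectorised environments in TorchRL;
How to create different multi-agent network architectures in TorchRL (for example, using parameter sharing or a centralised critic);
How to use tensordict.TensorDict to carry multi-agent data;
How to tie all the library components (collectors, modules, replay buffers, and losses) together in a multi-agent MAPPO/IPPO training loop.

If you are running this in Google Colab, make sure you install the following dependencies: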

!pip3 install torchrl
!pip3 install vmas
!pip3 install tqdm

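Proximal Policy Optimization (PPO) is a policy-gradient algorithm in which a batch of data is collected and directly consumed to train the policy to maximise the expected return under some proximality constraints. You can think of it as a sophisticated version of REINFORCE, the foundational policy-optimisation algorithm. For more information, see the Proximal Policy Optimization Algorithms paper.

This type of algorithm is usually trained on-policy. This means that, at every learning iteration, we have a sampling phase and a training phase. In the sampling phase of iteration \(t\), rollouts are collected from the agents' interactions with the environment using the current policies \(\mathbf{\pi}_t\). In the training phase, all the collected rollouts are immediately fed to the training process to perform backpropagation. This produces updated policies, which are then used again for sampling. Executing this process in a loop constitutes on-policy learning.

In the training phase of the PPO algorithm, a critic is used to estimate how good the actions taken by the policy are. The critic learns to approximate the value (mean discounted return) of a specific state. The PPO loss then compares the actual return obtained by the policy with the one estimated by the critic to determine the advantage of the action taken and to guide the policy optimisation.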
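To make the role of the advantage concrete, here is a minimal plain-Python sketch of the clipped surrogate term that PPO maximises for a single sample. The function name `clipped_surrogate` and its arguments are illustrative assumptions and are not part of TorchRL; in this tutorial the loss is provided by ClipPPOLoss.

```python
def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """Per-sample PPO clipped objective (hypothetical helper, not TorchRL API).

    ratio: pi_new(a|s) / pi_old(a|s) for the action that was taken
    advantage: estimated advantage of that action
    """
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    # The minimum of the two terms gives a pessimistic bound on the improvement
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + eps) * advantage
print(clipped_surrogate(1.5, 1.0))
# A small ratio with a negative advantage is capped at (1 - eps) * advantage
print(clipped_surrogate(0.5, -1.0))
```

The clipping removes the incentive to move the new policy far away from the one that collected the data, which is the proximality constraint mentioned above.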

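In multi-agent settings, things are a bit different. We now have multiple policies \(\mathbf{\pi}\), one for each agent. Policies are typically local and decentralised. This means that the policy of a single agent outputs an action for that agent based only on its own observation. In the MARL literature, this is referred to as decentralised execution. For the critic, on the other hand, different formulations exist, mainly:

In MAPPO the critic is centralised and takes the global state of the system as input. This can be a global observation or simply the concatenation of the agents' observations. MAPPO can be used in contexts where centralised training is performed, as it needs access to global information.
In IPPO the critic takes as input only the observation of the respective agent, exactly like the policy. This allows decentralised training, as both the critic and the policy need only local information to compute their outputs.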
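The difference between the two formulations shows up directly in the size of the critic input. The sketch below assumes that the global state is built by concatenating the agents' observations (one of the two options mentioned above) and uses 3 agents with 18 observation features each as example values; `critic_input_size` is a hypothetical helper.

```python
def critic_input_size(n_agents, obs_dim, centralised):
    """Number of input features seen by one critic (hypothetical helper)."""
    if centralised:
        # MAPPO-style: concatenation of all agents' observations
        return n_agents * obs_dim
    # IPPO-style: only the agent's own observation
    return obs_dim


n_agents, obs_dim = 3, 18  # example values
print("MAPPO critic input:", critic_input_size(n_agents, obs_dim, centralised=True))
print("IPPO critic input:", critic_input_size(n_agents, obs_dim, centralised=False))
```

The centralised input grows linearly with the number of agents, which is the large input space that can make centralised critics harder to train.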

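Centralised critics help overcome the non-stationarity of multiple agents learning concurrently but, on the other hand, they may be affected by their large input space. In this tutorial, we will be able to train both formulations, and we will also discuss how parameter sharing (the practice of sharing the network parameters across the agents) impacts each of them.

This tutorial is structured as follows:

First, we will define a set of hyperparameters to use.
Next, we will create a vectorised multi-agent environment using TorchRL's wrapper for the VMAS simulator.
Next, we will design the policy and critic networks, discussing the impact of the various choices on parameter sharing and critic centralisation.
Next, we will create the sampling collector and the replay buffer.
Finally, we will run our training loop and analyse the results.

If you are running this in Colab or on a machine with a GUI, you will also have the option to render and visualise your own trained policy before and after training.

Let us import our dependencies: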

# Torch
import torch

# Tensordict modules
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import multiprocessing

# Data collection
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage

# Env
from torchrl.envs import RewardSum, TransformedEnv
from torchrl.envs.libs.vmas import VmasEnv
from torchrl.envs.utils import check_env_specs

# Multi-agent network
from torchrl.modules import MultiAgentMLP, ProbabilisticActor, TanhNormal

# Loss
from torchrl.objectives import ClipPPOLoss, ValueEstimators

# Utils
torch.manual_seed(0)
from matplotlib import pyplot as plt
from tqdm import tqdm

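Define Hyperparameters

We set the hyperparameters for our tutorial. Depending on the resources available, you may choose to execute the policy and the simulator on GPU or on another device. You can tune some of these values to adjust the computational requirements.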

# Devices
is_fork = multiprocessing.get_start_method() == "fork"
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)
vmas_device = device  # The device where the simulator is run (VMAS can run on GPU)

# Sampling
frames_per_batch = 6_000  # Number of team frames collected per training iteration
n_iters = 10  # Number of sampling and training iterations
total_frames = frames_per_batch * n_iters

# Training
num_epochs = 30  # Number of optimization epochs (passes over the collected batch) per training iteration
minibatch_size = 400  # Size of the mini-batches in each optimization step
lr = 3e-4  # Learning rate
max_grad_norm = 1.0  # Maximum norm for the gradients

# PPO
clip_epsilon = 0.2  # clip value for PPO loss
gamma = 0.99  # discount factor
lmbda = 0.9  # lambda for generalised advantage estimation
entropy_eps = 1e-4  # coefficient of the entropy term in the PPO loss
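As a quick sanity check on the computational budget, the arithmetic implied by these values can be worked out in a few lines. This sketch assumes that every epoch draws frames_per_batch // minibatch_size minibatches, as the training loop later in this tutorial does; the derived variable names are illustrative.

```python
frames_per_batch = 6_000
n_iters = 10
num_epochs = 30
minibatch_size = 400

total_frames = frames_per_batch * n_iters
# Minibatches drawn in one pass over the collected batch
minibatches_per_epoch = frames_per_batch // minibatch_size
# Gradient updates performed for each collected batch
updates_per_iter = num_epochs * minibatches_per_epoch
total_updates = updates_per_iter * n_iters

print("total frames:", total_frames)
print("minibatches per epoch:", minibatches_per_epoch)
print("gradient updates per iteration:", updates_per_iter)
print("gradient updates overall:", total_updates)
```

Lowering num_epochs or n_iters is the most direct way to reduce the run time; changing minibatch_size alters how many updates are made per epoch.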

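Environment

Multi-agent environments simulate multiple agents interacting with the world. The TorchRL API allows integrating various types of multi-agent environment. Some examples include environments with shared or individual agent rewards, done flags, and observations. For more information on how the multi-agent environment API works in TorchRL, you can check out the dedicated documentation section.

The VMAS simulator, in particular, models agents with individual rewards, info, observations, and actions, but with a collective done flag. Furthermore, it uses vectorisation to perform simulation in a batch. This means that all its state and physics are PyTorch tensors whose first dimension represents the number of parallel environments in the batch. This makes it possible to leverage the Single Instruction Multiple Data (SIMD) paradigm of GPUs and to significantly speed up parallel computation by exploiting parallelisation in GPU warps. It also means that, when using it in TorchRL, both simulation and training can be run on device, without ever passing data to the CPU.

The multi-agent task we will solve today is Navigation (see the animated figure above). In Navigation, randomly spawned agents (circles with surrounding dots) need to navigate to randomly spawned goals (smaller circles). Agents need to use LIDARs (the dots around them) to avoid colliding with each other. Agents act in a 2D continuous world with drag and elastic collisions. Their actions are 2D continuous forces that determine their acceleration. The reward is composed of three terms: a collision penalty, a reward based on the distance to the goal, and a final shared reward given when all agents reach their goal. The distance-based term is computed as the difference in the relative distance between an agent and its goal over two consecutive timesteps. Each agent observes its position, velocity, LIDAR readings, and relative position to its goal.

We will now instantiate the environment. For this tutorial, we will limit the episodes to max_steps, after which the done flag is set. This functionality is already provided in the VMAS simulator, but the TorchRL StepCount transform could alternatively be used. We will also use num_vmas_envs vectorised environments to leverage batched simulation.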

max_steps = 100  # Episode steps before done
num_vmas_envs = (
    frames_per_batch // max_steps
)  # Number of vectorized envs. frames_per_batch should be divisible by this number
scenario_name = "navigation"
n_agents = 3

env = VmasEnv(
    scenario=scenario_name,
    num_envs=num_vmas_envs,
    continuous_actions=True,  # VMAS supports both continuous and discrete actions
    max_steps=max_steps,
    device=vmas_device,
    # Scenario kwargs
    n_agents=n_agents,  # These are custom kwargs that change for each VMAS scenario, see the VMAS repo to know more.
)

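The environment is defined not only by its simulator and transforms, but also by a series of metadata that describe what can be expected during its execution. For efficiency purposes, TorchRL is quite stringent when it comes to environment specs, but you can easily check that your environment specs are adequate. In our example, VmasEnv takes care of setting the proper specs for your environment, so you should not have to worry about this.

There are four specs to look at:

action_spec defines the action space;
reward_spec defines the reward domain;
done_spec defines the done domain;
observation_spec defines the domain of all other outputs from environment steps;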

print("action_spec:", env.full_action_spec)
print("reward_spec:", env.full_reward_spec)
print("done_spec:", env.full_done_spec)
print("observation_spec:", env.observation_spec)

action_spec: Composite(
    agents: Composite(
        action: BoundedContinuous(
            shape=torch.Size([60, 3, 2]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([60, 3, 2]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([60, 3, 2]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([60, 3])),
    device=cpu,
    shape=torch.Size([60]))
reward_spec: Composite(
    agents: Composite(
        reward: UnboundedContinuous(
            shape=torch.Size([60, 3, 1]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([60, 3])),
    device=cpu,
    shape=torch.Size([60]))
done_spec: Composite(
    done: Categorical(
        shape=torch.Size([60, 1]),
        space=CategoricalBox(n=2),
        device=cpu,
        dtype=torch.bool,
        domain=discrete),
    terminated: Categorical(
        shape=torch.Size([60, 1]),
        space=CategoricalBox(n=2),
        device=cpu,
        dtype=torch.bool,
        domain=discrete),
    device=cpu,
    shape=torch.Size([60]))
observation_spec: Composite(
    agents: Composite(
        observation: UnboundedContinuous(
            shape=torch.Size([60, 3, 18]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([60, 3, 18]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([60, 3, 18]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        info: Composite(
            pos_rew: UnboundedContinuous(
                shape=torch.Size([60, 3, 1]),
                space=ContinuousBox(
                    low=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True),
                    high=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True)),
                device=cpu,
                dtype=torch.float32,
                domain=continuous),
            final_rew: UnboundedContinuous(
                shape=torch.Size([60, 3, 1]),
                space=ContinuousBox(
                    low=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True),
                    high=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True)),
                device=cpu,
                dtype=torch.float32,
                domain=continuous),
            agent_collisions: UnboundedContinuous(
                shape=torch.Size([60, 3, 1]),
                space=ContinuousBox(
                    low=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True),
                    high=Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, contiguous=True)),
                device=cpu,
                dtype=torch.float32,
                domain=continuous),
            device=cpu,
            shape=torch.Size([60, 3])),
        device=cpu,
        shape=torch.Size([60, 3])),
    device=cpu,
    shape=torch.Size([60]))

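Using the commands just shown, we can access the domain of each value. Doing so, we can see that all specs apart from done have a leading shape (num_vmas_envs, n_agents). This means that those values are present for each agent in each individual environment. The done spec, on the other hand, has leading shape num_vmas_envs, which means that done is shared among the agents.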
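The leading shapes can be reproduced with plain tuples. The following sketch rebuilds the shapes printed above from the tutorial's values (6,000 frames per batch, 100 steps per episode, 3 agents); `spec_shapes` is a hypothetical helper and not something exposed by TorchRL or VMAS.

```python
def spec_shapes(num_envs, n_agents, action_dim, obs_dim):
    """Expected tensor shapes for each entry (hypothetical helper)."""
    return {
        ("agents", "action"): (num_envs, n_agents, action_dim),
        ("agents", "observation"): (num_envs, n_agents, obs_dim),
        ("agents", "reward"): (num_envs, n_agents, 1),
        "done": (num_envs, 1),  # no agent dimension: shared by all agents
    }


shapes = spec_shapes(num_envs=6_000 // 100, n_agents=3, action_dim=2, obs_dim=18)
for key, shape in shapes.items():
    per_agent = isinstance(key, tuple) and key[0] == "agents"
    print(key, shape, "per-agent" if per_agent else "shared")
```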

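TorchRL has a way to keep track of which MARL specs are shared and which are not. In fact, specs that have the additional agent dimension (that is, they vary for each agent) are contained in an inner "agents" key.

As you can see, the reward and action specs present the "agents" key, meaning that entries in tensordicts belonging to those specs are nested in an "agents" tensordict that groups all per-agent values.

To quickly access the keys for each of these values in tensordicts, we can simply ask the environment for the respective keys, and we immediately understand which are per-agent and which are shared. This information is useful to tell all other TorchRL components where to find each value.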

print("action_keys:", env.action_keys)
print("reward_keys:", env.reward_keys)
print("done_keys:", env.done_keys)

action_keys: [('agents', 'action')]
reward_keys: [('agents', 'reward')]
done_keys: ['done', 'terminated']
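Tuple keys such as ('agents', 'action') address entries inside a nested container, while plain string keys address entries at the top level. The sketch below mimics that lookup with ordinary dictionaries standing in for a TensorDict; `get_nested` and the sample values are assumptions made for illustration only.

```python
def get_nested(data, key):
    """Look up a string key or a tuple of nested keys in a dict of dicts.

    Hypothetical helper: plain dicts stand in for a TensorDict here.
    """
    if isinstance(key, str):
        key = (key,)
    for part in key:
        data = data[part]
    return data


# One environment with three agents (made-up values)
step_output = {
    "agents": {
        "action": [[0.1, -0.2], [0.0, 0.3], [0.5, 0.5]],
        "reward": [[0.0], [0.1], [-1.0]],
    },
    "done": [False],
}

print(get_nested(step_output, ("agents", "reward")))  # per-agent entry
print(get_nested(step_output, "done"))  # shared entry
```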

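Transforms

We can append any TorchRL transform we need to our environment. These transforms modify its inputs and outputs in some desired way. We stress that, in multi-agent contexts, it is paramount to provide explicitly the keys to modify.

For example, in this case, we will instantiate a RewardSum transform, which sums the rewards over the episode. We tell this transform where to find the reward key and where to write the summed episode reward. The transformed environment inherits the device and metadata of the wrapped environment and transforms them according to the sequence of transforms it contains.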

env = TransformedEnv(
    env,
    RewardSum(in_keys=[env.reward_key], out_keys=[("agents", "episode_reward")]),
)
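Conceptually, the summed episode reward is a running total that restarts whenever an episode ends. The sketch below shows that idea for a single agent in a single environment using plain lists; it is a simplification written for illustration (the function name is hypothetical) and not the RewardSum implementation, which works on batched tensors.

```python
def running_episode_reward(rewards, dones):
    """Accumulate rewards step by step, restarting after each done flag.

    Hypothetical single-agent, single-environment illustration.
    """
    totals, current = [], 0.0
    for reward, done in zip(rewards, dones):
        current += reward
        totals.append(current)
        if done:
            # The next step belongs to a new episode
            current = 0.0
    return totals


rewards = [1.0, 0.5, -0.25, 2.0, 1.0]
dones = [False, False, True, False, False]
print(running_episode_reward(rewards, dones))
```

The value stored at the step where done is set is the total reward of that episode, which is why the training loop later reads the episode reward at done steps for logging.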

The check_env_specs() function runs a small rollout and compares its output against the environment specs. If no error is raised, we can be confident that the specs are properly defined.

check_env_specs(env)

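Rollout

For fun, let us see what a simple random rollout looks like. You can call env.rollout(n_steps) and get an overview of what the environment inputs and outputs look like. Actions are automatically drawn at random from the action spec domain.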

n_rollout_steps = 5
rollout = env.rollout(n_rollout_steps)
print(f"rollout of {n_rollout_steps} steps:", rollout)
print("Shape of the rollout TensorDict:", rollout.batch_size)

rollout of 5 steps: TensorDict(
    fields={
        agents: TensorDict(
            fields={
                action: Tensor(shape=torch.Size([60, 5, 3, 2]), device=cpu, dtype=torch.float32, is_shared=False),
                episode_reward: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                info: TensorDict(
                    fields={
                        agent_collisions: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                        final_rew: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                        pos_rew: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([60, 5, 3]),
                    device=cpu,
                    is_shared=False),
                observation: Tensor(shape=torch.Size([60, 5, 3, 18]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([60, 5, 3]),
            device=cpu,
            is_shared=False),
        done: Tensor(shape=torch.Size([60, 5, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                agents: TensorDict(
                    fields={
                        episode_reward: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                        info: TensorDict(
                            fields={
                                agent_collisions: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                                final_rew: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                                pos_rew: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
                            batch_size=torch.Size([60, 5, 3]),
                            device=cpu,
                            is_shared=False),
                        observation: Tensor(shape=torch.Size([60, 5, 3, 18]), device=cpu, dtype=torch.float32, is_shared=False),
                        reward: Tensor(shape=torch.Size([60, 5, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([60, 5, 3]),
                    device=cpu,
                    is_shared=False),
                done: Tensor(shape=torch.Size([60, 5, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                terminated: Tensor(shape=torch.Size([60, 5, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([60, 5]),
            device=cpu,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([60, 5, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([60, 5]),
    device=cpu,
    is_shared=False)
Shape of the rollout TensorDict: torch.Size([60, 5])

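We can see that our rollout has batch_size (num_vmas_envs, n_rollout_steps). This means that all the tensors in it have these leading dimensions.

Looking more closely, we can see that the output tensordict can be divided in the following way:

In the root (accessible by running rollout.exclude("next")) we find all the keys that are available after a reset is called at the first timestep. We can see their evolution through the rollout steps by indexing the n_rollout_steps dimension. Among these keys, we find the ones that differ for each agent in the rollout["agents"] tensordict, which has batch size (num_vmas_envs, n_rollout_steps, n_agents), signalling that it stores the additional agent dimension. The keys outside this agent tensordict are the shared ones (in this case, only done).
In next (accessible by running rollout.get("next")) we find the same structure as the root, but for the keys that are available only after a step.

In TorchRL the convention is that done and observations are present in both root and next (as these are available both at reset time and after a step). Action is only available in root (as there is no action resulting from a step) and reward is only available in next (as there is no reward at reset time). This structure follows the one in Reinforcement Learning: An Introduction (Sutton and Barto), where root represents data at time \(t\) and next represents data at time \(t+1\) of a world step.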
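The root/next convention can be illustrated with a single transition stored as a plain dictionary. In the sketch below, `advance` is a hypothetical helper that builds the root data for time \(t+1\) from the "next" entry of time \(t\); TorchRL ships its own utility for this purpose (step_mdp in torchrl.envs.utils), and the code here is only a simplified stand-in.

```python
def advance(transition):
    """Return the root data for time t+1 from a transition at time t.

    Hypothetical helper: everything under "next" except the reward becomes
    the new root. The action is absent because none has been chosen yet.
    """
    return {k: v for k, v in transition["next"].items() if k != "reward"}


transition = {
    # Root: data available at time t
    "observation": [0.0, 0.0],
    "done": False,
    "action": [1.0, 0.0],
    # Next: data available only after the step, at time t+1
    "next": {"observation": [0.1, 0.0], "done": False, "reward": 0.1},
}

print(advance(transition))
```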

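Render a random rollout

If you are on Google Colab, or on a machine with OpenGL and a GUI, you can actually render a random rollout. This will give you an idea of what a random policy achieves in this task, so that you can compare it with the policy you will train yourself!

To render a rollout, follow the instructions in the Render section at the end of this tutorial and just remove the line policy=policy from env.rollout().

Policy

PPO uses a stochastic policy to handle exploration. This means that our neural network has to output the parameters of a distribution, rather than a single value corresponding to the action taken.

As the data is continuous, we use a Tanh-Normal distribution to respect the boundaries of the action space. TorchRL provides this distribution, and the only thing we need to care about is building a neural network that outputs the right number of parameters.

In this case, each agent's action is represented by a 2-dimensional independent normal distribution. For this, our neural network has to output a mean and a standard deviation for each action. Each agent therefore has 2 * n_actions_per_agents outputs.

Another important decision we need to make is whether we want our agents to share the policy parameters. On the one hand, sharing parameters means that they all share the same policy, which allows them to benefit from each other's experiences. This also results in faster training. On the other hand, it makes them behaviourally homogeneous, as they in fact share the same model. For this example, we will enable sharing, as we do not mind the homogeneity and can benefit from the computational speed, but it is important to always think about this decision in your own problems!

We design the policy in three steps.

First: define a neural network n_obs_per_agent -> 2 * n_actions_per_agents

For this we use the MultiAgentMLP, a TorchRL module made exactly for multiple agents, with many customisation options available.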

share_parameters_policy = True

policy_net = torch.nn.Sequential(
    MultiAgentMLP(
        n_agent_inputs=env.observation_spec["agents", "observation"].shape[
            -1
        ],  # n_obs_per_agent
        n_agent_outputs=2 * env.action_spec.shape[-1],  # 2 * n_actions_per_agents
        n_agents=env.n_agents,
        centralised=False,  # the policies are decentralised (ie each agent will act from its observation)
        share_params=share_parameters_policy,
        device=device,
        depth=2,
        num_cells=256,
        activation_class=torch.nn.Tanh,
    ),
    NormalParamExtractor(),  # this will just separate the last dimension into two outputs: a loc and a non-negative scale
)
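The last step of this network splits the output into a location and a positive scale. The plain-Python sketch below illustrates that split for one agent with two action dimensions. It assumes a softplus mapping with a small lower bound to keep the scale positive; the mapping actually used by NormalParamExtractor is configurable and may differ, and `split_loc_scale` is a hypothetical helper.

```python
import math


def softplus(x):
    """Smooth, strictly positive mapping: log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def split_loc_scale(outputs, min_scale=1e-4):
    """Split raw network outputs into (loc, scale) halves.

    Hypothetical helper; the softplus mapping is an assumption.
    """
    half = len(outputs) // 2
    loc = outputs[:half]
    scale = [max(softplus(x), min_scale) for x in outputs[half:]]
    return loc, scale


# 2 actions per agent, so the network emits 2 * 2 = 4 numbers per agent
loc, scale = split_loc_scale([0.3, -1.2, 0.0, -5.0])
print("loc:", loc)
print("scale:", scale)
```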

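Second: wrap the neural network in a TensorDictModule

This is simply a module that reads the in_keys from a tensordict, feeds them to the neural network, and writes the outputs in place at the out_keys.

Note that we use ("agents", ...) keys, as these keys denote data with the additional n_agents dimension.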

policy_module = TensorDictModule(
    policy_net,
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "loc"), ("agents", "scale")],
)

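Third: wrap the TensorDictModule in a ProbabilisticActor

We now need to build a distribution out of the location and scale of our normal distribution. To do so, we instruct the ProbabilisticActor class to build a TanhNormal out of the location and scale parameters. We also provide the minimum and maximum values of this distribution, which we gather from the environment specs.

The names of the in_keys (and hence the names of the out_keys from the TensorDictModule above) have to end with the keyword arguments of the TanhNormal distribution constructor (loc and scale).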

policy = ProbabilisticActor(
    module=policy_module,
    spec=env.unbatched_action_spec,
    in_keys=[("agents", "loc"), ("agents", "scale")],
    out_keys=[env.action_key],
    distribution_class=TanhNormal,
    distribution_kwargs={
        "low": env.unbatched_action_spec[env.action_key].space.low,
        "high": env.unbatched_action_spec[env.action_key].space.high,
    },
    return_log_prob=True,
    log_prob_key=("agents", "sample_log_prob"),
)  # we'll need the log-prob for the PPO loss
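To give an intuition for how a Tanh-Normal distribution keeps actions within bounds, the sketch below samples from a normal distribution, squashes the sample with tanh, and rescales it to the interval [low, high]. It is a simplified illustration written with the standard library: `sample_bounded_action` is a hypothetical helper, and the log-probability correction that the real distribution applies for the squashing is omitted.

```python
import math
import random


def sample_bounded_action(loc, scale, low, high, rng):
    """Sample one bounded action vector (hypothetical helper)."""
    action = []
    for mu, sigma in zip(loc, scale):
        raw = rng.gauss(mu, sigma)  # unbounded Gaussian sample
        squashed = math.tanh(raw)  # squashed into [-1, 1]
        # Affine rescaling from [-1, 1] to [low, high]
        action.append(low + 0.5 * (squashed + 1.0) * (high - low))
    return action


rng = random.Random(0)
action = sample_bounded_action([0.0, 2.0], [1.0, 0.5], low=-1.0, high=1.0, rng=rng)
print(action)
```

Whatever the location and scale produced by the network, the sampled action stays within the limits reported by the action spec.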

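Critic network

The critic network is a crucial component of the PPO algorithm, even though it is not used at sampling time. This module reads the observations and returns the corresponding value estimates.

As before, one should think carefully about the decision of sharing the critic parameters. In general, parameter sharing leads to faster training convergence, but there are a few important considerations to be made:

Sharing is not recommended when agents have different reward functions, as the critics will need to learn to assign different values to the same state (for example, in mixed cooperative-competitive settings).
In decentralised training settings, sharing cannot be performed without additional infrastructure to synchronise parameters.

In all other cases, where the reward function (to be distinguished from the reward value) is the same for all agents (as in the current scenario), sharing can provide improved performance. This can come at the cost of homogeneity in the agents' strategies. In general, the best way to know which choice is preferable is to quickly experiment with both options.

Here is also where we have to choose between MAPPO and IPPO:

With MAPPO, we obtain a central critic with full observability (that is, it takes all the concatenated agent observations as input). We can do this because we are in a simulator and training is centralised.
With IPPO, we have a local decentralised critic, just like the policy.

In any case, the critic output has shape (..., n_agents, 1). If the critic is centralised and shared, all the values along the n_agents dimension are identical.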

share_parameters_critic = True
mappo = True  # IPPO if False

critic_net = MultiAgentMLP(
    n_agent_inputs=env.observation_spec["agents", "observation"].shape[-1],
    n_agent_outputs=1,  # 1 value per agent
    n_agents=env.n_agents,
    centralised=mappo,
    share_params=share_parameters_critic,
    device=device,
    depth=2,
    num_cells=256,
    activation_class=torch.nn.Tanh,
)

critic = TensorDictModule(
    module=critic_net,
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "state_value")],
)
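The effect of centralisation on the critic output can be checked with a toy example. In the sketch below a fixed linear function stands in for a shared critic network (an assumption made purely for illustration; `shared_value_fn` and `critic_values` are hypothetical helpers). With a centralised input every agent receives the same value, while with local inputs the values differ.

```python
def shared_value_fn(features):
    """Stand-in for a critic network whose parameters are shared by all agents."""
    return 0.1 * sum(features)


def critic_values(observations, centralised):
    """Return one value per agent, with shape (n_agents, 1) as nested lists."""
    if centralised:
        # MAPPO-style: every agent's value is computed from the joint observation
        joint = [x for obs in observations for x in obs]
        return [[shared_value_fn(joint)] for _ in observations]
    # IPPO-style: each value is computed from the agent's own observation
    return [[shared_value_fn(obs)] for obs in observations]


observations = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # 3 agents, 2 features each
mappo_values = critic_values(observations, centralised=True)
ippo_values = critic_values(observations, centralised=False)
print("centralised:", mappo_values)
print("decentralised:", ippo_values)
```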

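Let us try our policy and critic modules. As mentioned previously, using TensorDictModule makes it possible to run these modules directly on the output of the environment, as they know which information to read and where to write it.

From this point on, the multi-agent-specific components have been instantiated, and we will simply use the same components as in single-agent learning. Isn't this great?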

print("Running policy:", policy(env.reset()))
print("Running value:", critic(env.reset()))

Running policy: TensorDict(
    fields={
        agents: TensorDict(
            fields={
                action: Tensor(shape=torch.Size([60, 3, 2]), device=cpu, dtype=torch.float32, is_shared=False),
                episode_reward: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                info: TensorDict(
                    fields={
                        agent_collisions: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                        final_rew: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                        pos_rew: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([60, 3]),
                    device=cpu,
                    is_shared=False),
                loc: Tensor(shape=torch.Size([60, 3, 2]), device=cpu, dtype=torch.float32, is_shared=False),
                observation: Tensor(shape=torch.Size([60, 3, 18]), device=cpu, dtype=torch.float32, is_shared=False),
                sample_log_prob: Tensor(shape=torch.Size([60, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                scale: Tensor(shape=torch.Size([60, 3, 2]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([60, 3]),
            device=cpu,
            is_shared=False),
        done: Tensor(shape=torch.Size([60, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        terminated: Tensor(shape=torch.Size([60, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([60]),
    device=cpu,
    is_shared=False)
Running value: TensorDict(
    fields={
        agents: TensorDict(
            fields={
                episode_reward: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                info: TensorDict(
                    fields={
                        agent_collisions: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                        final_rew: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                        pos_rew: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([60, 3]),
                    device=cpu,
                    is_shared=False),
                observation: Tensor(shape=torch.Size([60, 3, 18]), device=cpu, dtype=torch.float32, is_shared=False),
                state_value: Tensor(shape=torch.Size([60, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([60, 3]),
            device=cpu,
            is_shared=False),
        done: Tensor(shape=torch.Size([60, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        terminated: Tensor(shape=torch.Size([60, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([60]),
    device=cpu,
    is_shared=False)

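Data collector

TorchRL provides a set of data collector classes. Briefly, these classes execute three operations: reset an environment, compute an action using the policy and the latest observation, and execute a step in the environment. They repeat the last two operations until the environment signals a stop (or reaches a done state).

We will use the simplest data collector, which has the same output as an environment rollout, with the only difference that it automatically resets environments that are done until the desired number of frames is collected.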

collector = SyncDataCollector(
    env,
    policy,
    device=vmas_device,
    storing_device=device,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
)
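The behaviour of such a collector can be sketched with a toy, non-vectorised environment. The code below is an illustration under simplifying assumptions (one environment, scalar observations, hypothetical names `CountingEnv` and `collect`) and is not the SyncDataCollector implementation: it yields fixed-size batches and resets the environment whenever an episode ends.

```python
class CountingEnv:
    """Toy environment: the observation is the step index; done after max_steps."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, self.t >= self.max_steps


def collect(env, policy, frames_per_batch, total_frames):
    """Yield batches of (obs, action, next_obs, done), resetting on done."""
    obs = env.reset()
    collected = 0
    while collected < total_frames:
        batch = []
        for _ in range(frames_per_batch):
            action = policy(obs)
            next_obs, done = env.step(action)
            batch.append((obs, action, next_obs, done))
            # Automatic reset keeps the collection going across episodes
            obs = env.reset() if done else next_obs
        collected += len(batch)
        yield batch


batches = list(
    collect(CountingEnv(max_steps=3), policy=lambda obs: 0, frames_per_batch=6, total_frames=12)
)
print(len(batches), "batches of", len(batches[0]), "frames")
```

Iterating over the collector in the training loop works in the same spirit: each iteration hands over one batch of frames_per_batch frames until total_frames have been produced.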

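Replay buffer

Replay buffers are a common building block of off-policy RL algorithms. In on-policy contexts, a replay buffer is refilled every time a batch of data is collected, and its data is consumed repeatedly for a certain number of epochs.

Using a replay buffer for PPO is not mandatory, and we could simply use the collected data online, but these classes make it easy to build the inner training loop in a reproducible way.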

replay_buffer = ReplayBuffer(
    storage=LazyTensorStorage(
        frames_per_batch, device=device
    ),  # We store the frames_per_batch collected at each iteration
    sampler=SamplerWithoutReplacement(),
    batch_size=minibatch_size,  # We will sample minibatches of this size
)
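Sampling without replacement means that, within one pass over the buffer, every stored frame is used exactly once. A minimal standard-library sketch of that behaviour follows; `minibatches_without_replacement` is a hypothetical helper and does not reproduce the TorchRL sampler's internals.

```python
import random


def minibatches_without_replacement(n_items, minibatch_size, rng):
    """Yield index minibatches that together cover every item exactly once."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    for start in range(0, n_items, minibatch_size):
        yield indices[start:start + minibatch_size]


rng = random.Random(0)
batches = list(minibatches_without_replacement(6_000, 400, rng))
print(len(batches), "minibatches of", len(batches[0]), "frames")
```

With 6,000 stored frames and minibatches of 400, one epoch consists of 15 minibatches, which matches the number of samples drawn per epoch in the training loop.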

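Loss function

The PPO loss can be imported directly from TorchRL for convenience using the ClipPPOLoss class. This is the easiest way of using PPO: it hides away the mathematical operations of PPO and the control flow that goes with it.

PPO requires some "advantage estimation" to be computed. In short, an advantage is a value that reflects an expectancy over the return value while dealing with the bias/variance trade-off. To compute the advantage, one just needs to (1) build the advantage module, which uses our value operator, and (2) pass each collected batch of data through it before the optimisation epochs. The GAE module updates the input TensorDict with new "advantage" and "value_target" entries. The "value_target" is a gradient-free tensor that represents the empirical value that the value network should output for the input observation. Both of these are used by ClipPPOLoss to return the policy and value losses.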

loss_module = ClipPPOLoss(
    actor_network=policy,
    critic_network=critic,
    clip_epsilon=clip_epsilon,
    entropy_coef=entropy_eps,
    normalize_advantage=False,  # Important to avoid normalizing across the agent dimension
)
loss_module.set_keys(  # We have to tell the loss where to find the keys
    reward=env.reward_key,
    action=env.action_key,
    sample_log_prob=("agents", "sample_log_prob"),
    value=("agents", "state_value"),
    # These last 2 keys will be expanded to match the reward shape
    done=("agents", "done"),
    terminated=("agents", "terminated"),
)


loss_module.make_value_estimator(
    ValueEstimators.GAE, gamma=gamma, lmbda=lmbda
)  # We build GAE
GAE = loss_module.value_estimator

optim = torch.optim.Adam(loss_module.parameters(), lr)
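For readers who want to see what generalised advantage estimation computes, here is a plain-Python sketch for a single trajectory. It is a simplified illustration and not the TorchRL implementation: the function `gae` is hypothetical, and it assumes that a done flag always means termination, so no value is bootstrapped past it.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.9):
    """Return (advantages, value_targets) for one trajectory.

    Hypothetical helper; done is treated as termination (an assumption).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    # The value target is the advantage added back to the value estimate
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


advantages, value_targets = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.4, 0.3],
    next_values=[0.4, 0.3, 0.0],
    dones=[False, False, True],
)
print("advantages:", advantages)
print("value targets:", value_targets)
```

A lmbda close to 0 relies mostly on the critic (lower variance, more bias), while a lmbda close to 1 relies mostly on the observed rewards (less bias, more variance).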

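Training loop

We now have all the pieces needed to code our training loop. The steps are:

Collect data
    Compute advantage
        Loop over epochs
            Loop over minibatches to compute loss values
                Back-propagate
                Optimise
            Repeat
        Repeat
    Repeat
Repeat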

pbar = tqdm(total=n_iters, desc="episode_reward_mean = 0")

episode_reward_mean_list = []
for tensordict_data in collector:
    tensordict_data.set(
        ("next", "agents", "done"),
        tensordict_data.get(("next", "done"))
        .unsqueeze(-1)
        .expand(tensordict_data.get_item_shape(("next", env.reward_key))),
    )
    tensordict_data.set(
        ("next", "agents", "terminated"),
        tensordict_data.get(("next", "terminated"))
        .unsqueeze(-1)
        .expand(tensordict_data.get_item_shape(("next", env.reward_key))),
    )
    # We need to expand the done and terminated to match the reward shape (this is expected by the value estimator)

    with torch.no_grad():
        GAE(
            tensordict_data,
            params=loss_module.critic_network_params,
            target_params=loss_module.target_critic_network_params,
        )  # Compute GAE and add it to the data

    data_view = tensordict_data.reshape(-1)  # Flatten the batch size to shuffle data
    replay_buffer.extend(data_view)

    for _ in range(num_epochs):
        for _ in range(frames_per_batch // minibatch_size):
            subdata = replay_buffer.sample()
            loss_vals = loss_module(subdata)

            loss_value = (
                loss_vals["loss_objective"]
                + loss_vals["loss_critic"]
                + loss_vals["loss_entropy"]
            )

            loss_value.backward()

            torch.nn.utils.clip_grad_norm_(
                loss_module.parameters(), max_grad_norm
            )  # Optional

            optim.step()
            optim.zero_grad()

    collector.update_policy_weights_()

    # Logging
    done = tensordict_data.get(("next", "agents", "done"))
    episode_reward_mean = (
        tensordict_data.get(("next", "agents", "episode_reward"))[done].mean().item()
    )
    episode_reward_mean_list.append(episode_reward_mean)
    pbar.set_description(f"episode_reward_mean = {episode_reward_mean}", refresh=False)
    pbar.update()

episode_reward_mean = 0:   0%|          | 0/10 [00:00<?, ?it/s]
episode_reward_mean = -0.4579917788505554:  10%|█         | 1/10 [00:05<00:51,  5.70s/it]
episode_reward_mean = 0.23260341584682465:  20%|██        | 2/10 [00:11<00:45,  5.64s/it]
episode_reward_mean = 1.1713813543319702:  30%|███       | 3/10 [00:16<00:39,  5.62s/it]
episode_reward_mean = 1.386345624923706:  40%|████      | 4/10 [00:22<00:33,  5.61s/it]
episode_reward_mean = 1.8939578533172607:  50%|█████     | 5/10 [00:28<00:27,  5.60s/it]
episode_reward_mean = 2.2214083671569824:  60%|██████    | 6/10 [00:33<00:22,  5.59s/it]
episode_reward_mean = 2.1770293712615967:  70%|███████   | 7/10 [00:39<00:16,  5.60s/it]
episode_reward_mean = 2.6274709701538086:  80%|████████  | 8/10 [00:44<00:11,  5.62s/it]
episode_reward_mean = 2.73148250579834:  90%|█████████ | 9/10 [00:50<00:05,  5.65s/it]
episode_reward_mean = 2.737316608428955: 100%|██████████| 10/10 [00:56<00:00,  5.68s/it]

Results

Let us plot the mean reward obtained per episode.

To make training last longer, increase the n_iters hyperparameter.

plt.plot(episode_reward_mean_list)
plt.xlabel("Training iterations")
plt.ylabel("Reward")
plt.title("Episode reward mean")
plt.show()

Render

If you are running this on a machine with a GUI, you can render the trained policy by running:

with torch.no_grad():
    env.rollout(
        max_steps=max_steps,
        policy=policy,
        callback=lambda env, _: env.render(),
        auto_cast_to_device=True,
        break_when_any_done=False,
    )

If you are running this in Google Colab, you can render the trained policy by running:

!apt-get update
!apt-get install -y x11-utils
!apt-get install -y xvfb
!pip install pyvirtualdisplay

import pyvirtualdisplay
from PIL import Image

# Start a virtual display so that the environment can render without a screen
display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
display.start()


def rendering_callback(env, td):
    # Store every rendered frame as a PIL image
    env.frames.append(Image.fromarray(env.render(mode="rgb_array")))


env.frames = []
with torch.no_grad():
    env.rollout(
        max_steps=max_steps,
        policy=policy,
        callback=rendering_callback,
        auto_cast_to_device=True,
        break_when_any_done=False,
    )
# Save the collected frames as an animated GIF
env.frames[0].save(
    f"{scenario_name}.gif",
    save_all=True,
    append_images=env.frames[1:],
    duration=3,
    loop=0,
)

from IPython.display import Image
Image(open(f"{scenario_name}.gif", "rb").read())

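Conclusion and next steps

In this tutorial, we have seen:

How to create a multi-agent environment in TorchRL, how its specs work, and how it integrates with the library;
How to use GPU vectorised environments in TorchRL;
How to create different multi-agent network architectures in TorchRL (for example, using parameter sharing or a centralised critic);
How to use tensordict.TensorDict to carry multi-agent data;
How to tie all the library components (collectors, modules, replay buffers, and losses) together in a multi-agent MAPPO/IPPO training loop.

Now that you are proficient with multi-agent PPO, you can check out all the TorchRL multi-agent implementations in the GitHub repository. These are code-only scripts of many popular MARL algorithms, such as the ones seen in this tutorial, QMIX, MADDPG, IQL, and many more!

You can also check out our other multi-agent tutorial on how to train competitive MADDPG/IDDPG in PettingZoo/VMAS with multiple agent groups: the Competitive Multi-Agent Reinforcement Learning (DDPG) with TorchRL Tutorial.

If you are interested in creating or wrapping your own multi-agent environments in TorchRL, you can check out the dedicated documentation section.

Finally, you can modify the parameters of this tutorial to try many other configurations and scenarios and become a MARL master. Here are a few videos of some possible scenarios you can try in VMAS.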
