Note
Go to the end to download the full example code.
Using pretrained models
This tutorial explains how to use pretrained models in TorchRL.
By the end of this tutorial, you will be able to use pretrained models for efficient image representation, and fine-tune them.
TorchRL provides pretrained models that can be used either as transforms or as components of a policy. As the semantics are identical, they can be used interchangeably in one or the other context. In this tutorial, we will be using R3M (https://arxiv.org/abs/2203.12601), but other models (e.g. VIP) will work equally well.
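For instance, VIP could be swapped in with the same interface (a minimal sketch; TorchRL's VIPTransform is assumed here to accept the same arguments as the R3MTransform call shown further down):
from torchrl.envs import VIPTransform

# Hypothetical swap: same in_keys/download semantics as R3MTransform below
vip = VIPTransform("resnet50", in_keys=["pixels"], download=True)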
import multiprocessing

import torch.cuda
from tensordict.nn import TensorDictSequential
from torch import nn
from torchrl.envs import R3MTransform, TransformedEnv
from torchrl.envs.libs.gym import GymEnv
from torchrl.modules import Actor

is_fork = multiprocessing.get_start_method() == "fork"
device = (
torch.device(0)
if torch.cuda.is_available() and not is_fork
else torch.device("cpu")
)
Let us first create an environment. For the sake of simplicity, we will be using a common gym environment; in practice, this will also work in more challenging, embodied AI contexts (e.g. have a look at our Habitat wrappers). The base environment is built below.
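The environment construction itself is missing from the text above, so here is a minimal sketch. The environment name is an assumption: "Ant-v4" is consistent with the 8-dimensional actions and 480x480 pixel observations in the rollouts below.
# Pixel-based gym environment (environment name assumed)
base_env = GymEnv("Ant-v4", from_pixels=True, device=device)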
Let us now fetch our pretrained model. We ask for the pretrained version of the model through the download=True flag; by default, this is turned off. Next, we append our transform to the environment. In practice, what will happen is that each batch of data collected will go through the transform and be mapped onto an "r3m_vec" entry in the output tensordict. Our policy, consisting of a single-layer MLP, will then read this vector and compute the corresponding action.
r3m = R3MTransform(
"resnet50",
in_keys=["pixels"],
download=True,
)
env_transformed = TransformedEnv(base_env, r3m)
net = nn.Sequential(
nn.LazyLinear(128, device=device),
nn.Tanh(),
nn.Linear(128, base_env.action_spec.shape[-1], device=device),
)
policy = Actor(net, in_keys=["r3m_vec"])
Downloading: "https://pytorch.s3.amazonaws.com/models/rl/r3m/r3m_50.pt" to /root/.cache/torch/hub/checkpoints/r3m_50.pt
100%|██████████| 374M/374M [00:06<00:00, 57.8MB/s]
Let us check the number of parameters of the policy. Note that policy.parameters() yields parameter tensors, so the count below is the number of tensors (a weight and a bias for each of the two Linear layers), not the number of scalars.
print("number of params:", len(list(policy.parameters())))
number of params: 4
We collect a rollout of 32 steps and print its output:
rollout = env_transformed.rollout(32, policy)
print("rollout with transform:", rollout)
rollout with transform: TensorDict(
fields={
action: Tensor(shape=torch.Size([32, 8]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
reward: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([32]),
device=cpu,
is_shared=False),
r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([32]),
device=cpu,
is_shared=False)
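As an aside, now that the rollout has materialized the LazyLinear layer, the total number of scalar parameters can also be counted (a small sketch using only standard torch APIs):
# Total scalar parameter count (valid once lazy layers are initialized)
n_scalars = sum(p.numel() for p in policy.parameters())
print("number of scalar params:", n_scalars)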
For fine-tuning, we integrate the transform in the policy after making its parameters trainable. In practice, it may be wiser to restrict this to a subset of the parameters (say, the last layer of the MLP); one way of doing so is sketched after the next code block.
r3m.train()
policy = TensorDictSequential(r3m, policy)
print("number of params after r3m is integrated:", len(list(policy.parameters())))
number of params after r3m is integrated: 163
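One way to restrict training to a subset of the parameters is to freeze the pretrained backbone and hand only the remaining trainable parameters to the optimizer (a minimal sketch; the optimizer choice and learning rate are illustrative):
# Freeze the R3M backbone; only the MLP head stays trainable
for p in r3m.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(
    (p for p in policy.parameters() if p.requires_grad), lr=1e-3
)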
Again, we collect a rollout with R3M. The structure of the output has changed slightly, as the environment now returns pixels (and not an embedding): the embedding "r3m_vec" is an intermediate result of our policy.
rollout = base_env.rollout(32, policy)
print("rollout, fine tuning:", rollout)
rollout, fine tuning: TensorDict(
fields={
action: Tensor(shape=torch.Size([32, 8]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
pixels: Tensor(shape=torch.Size([32, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
reward: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([32]),
device=cpu,
is_shared=False),
r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([32]),
device=cpu,
is_shared=False)
The reason we could seamlessly swap the transform from the env to the policy is that both behave like TensorDictModule instances: they have a set of "in_keys" and "out_keys" that make it easy to read and write output in different contexts.
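These keys can be inspected directly; for instance, for the composed policy (a quick sketch; the printed lists will reflect the modules assembled above):
# The sequential policy reads pixels and writes the action,
# producing "r3m_vec" along the way
print("policy in_keys:", policy.in_keys)
print("policy out_keys:", policy.out_keys)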
To conclude this tutorial, let us look at how R3M could be used to read images stored in a replay buffer (e.g. in an offline RL setting). First, let us build our dataset:
from torchrl.data import LazyMemmapStorage, ReplayBuffer

storage = LazyMemmapStorage(1000)
# The transform is applied at sampling time, not when data is written
rb = ReplayBuffer(storage=storage, transform=r3m)
We can now collect the data (random rollouts, for our purpose) and fill the replay buffer with it:
total = 0
while total < 1000:
tensordict = base_env.rollout(1000)
rb.extend(tensordict)
total += tensordict.numel()
Let us check what our replay buffer storage looks like. It should not contain any "r3m_vec" entry, since we have not used it yet:
print("stored data:", storage._storage)
stored data: TensorDict(
fields={
action: MemoryMappedTensor(shape=torch.Size([1000, 8]), device=cpu, dtype=torch.float32, is_shared=False),
done: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
pixels: MemoryMappedTensor(shape=torch.Size([1000, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
reward: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([1000]),
device=cpu,
is_shared=False),
pixels: MemoryMappedTensor(shape=torch.Size([1000, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
terminated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([1000]),
device=cpu,
is_shared=False)
At sampling time, the data will go through the R3M transform, giving us the processed data we wanted. In this way, we can train an algorithm offline on a dataset made of images; a sketch of one such update follows the sampled batch below.
batch = rb.sample(32)
print("data after sampling:", batch)
data after sampling: TensorDict(
fields={
action: Tensor(shape=torch.Size([32, 8]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
pixels: Tensor(shape=torch.Size([32, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
reward: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([32]),
device=cpu,
is_shared=False),
r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([32]),
device=cpu,
is_shared=False)
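For illustration, a single behavior-cloning-style update on the sampled embeddings could look as follows (a minimal sketch: the regression head, loss, and learning rate are assumptions; 2048 is the resnet50 R3M feature size visible in the outputs above):
# Regress stored actions from R3M embeddings (offline, supervised)
bc_head = nn.Sequential(
    nn.Linear(2048, 128),  # 2048 = size of the "r3m_vec" entry
    nn.Tanh(),
    nn.Linear(128, base_env.action_spec.shape[-1]),
)
optim = torch.optim.Adam(bc_head.parameters(), lr=1e-3)
batch = rb.sample(32)
loss = nn.functional.mse_loss(bc_head(batch["r3m_vec"]), batch["action"])
optim.zero_grad()
loss.backward()
optim.step()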
Total running time of the script: (0 minutes 55.393 seconds)
Estimated memory usage: 2354 MB