submitit_delayed_launcher
- class torchrl.collectors.distributed.submitit_delayed_launcher(num_jobs, framework='distributed', backend='gloo', tcpport='10003', submitit_main_conf: dict = {'slurm_cpus_per_task': 32, 'slurm_gpus_per_node': 1, 'slurm_partition': 'train', 'timeout_min': 10}, submitit_collection_conf: dict = {'slurm_cpus_per_task': 32, 'slurm_gpus_per_node': 0, 'slurm_partition': 'train', 'timeout_min': 10})[source]
Delayed launcher for submitit.

In some cases, a launched job cannot spawn other jobs on its own; this can only be done at the jump-host level. In those cases, submitit_delayed_launcher() can be used to pre-launch collector nodes that will wait for the main worker to provide the launching instructions.

Parameters:
- num_jobs (int) – the number of collection jobs to be launched.
- framework (str, optional) – the framework to use. Can be either "distributed" or "rpc". "distributed" requires a DistributedDataCollector collector, whereas "rpc" requires an RPCDataCollector. Defaults to "distributed".
- backend (str, optional) – the torch.distributed backend to use when framework points to "distributed". This value must match the one passed to the collector, otherwise the main and satellite nodes will fail to reach the rendezvous and hang forever (i.e. no exception will be raised!). Defaults to 'gloo'.
- tcpport (int or str, optional) – the TCP port to use. Defaults to torchrl.collectors.distributed.default_configs.TCP_PORT.
- submitit_main_conf (dict, optional) – the main node configuration to be passed to submitit. Defaults to torchrl.collectors.distributed.default_configs.DEFAULT_SLURM_CONF_MAIN.
- submitit_collection_conf (dict, optional) – the configuration to be passed to submitit. Defaults to torchrl.collectors.distributed.default_configs.DEFAULT_SLURM_CONF.
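The two submitit configuration dicts mirror the defaults shown in the signature above. A minimal sketch of how one might override them, assuming submitit's AutoExecutor convention of "slurm_"-prefixed Slurm parameters (partition names and resource values here are illustrative placeholders):

```python
# Sketch: custom submitit configurations for the main and collection nodes.
# Keys follow submitit's AutoExecutor convention ("slurm_" prefix + Slurm
# parameter name); the values below are example choices, not recommendations.
main_conf = {
    "timeout_min": 60,          # wall-clock limit for the main job, in minutes
    "slurm_partition": "train",
    "slurm_cpus_per_task": 32,
    "slurm_gpus_per_node": 1,   # main node typically hosts the policy on GPU
}
collection_conf = {
    "timeout_min": 60,
    "slurm_partition": "train",
    "slurm_cpus_per_task": 32,
    "slurm_gpus_per_node": 0,   # collection nodes can run CPU-only envs
}

# These dicts would then be passed to the decorator, e.g.:
# @submitit_delayed_launcher(
#     num_jobs=4,
#     backend="gloo",
#     submitit_main_conf=main_conf,
#     submitit_collection_conf=collection_conf,
# )
# def main(): ...
```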
Examples

>>> num_jobs = 2
>>> @submitit_delayed_launcher(num_jobs=num_jobs)
... def main():
...     from torchrl.collectors.distributed import DistributedDataCollector
...     from torchrl.envs import EnvCreator
...     from torchrl.envs.utils import RandomPolicy
...     from torchrl.envs.libs.gym import GymEnv
...     from torchrl.data import BoundedContinuous
...     collector = DistributedDataCollector(
...         [EnvCreator(lambda: GymEnv("Pendulum-v1"))] * num_jobs,
...         policy=RandomPolicy(BoundedContinuous(-1, 1, shape=(1,))),
...         launcher="submitit_delayed",
...     )
...     for data in collector:
...         print(data)
...
>>> if __name__ == "__main__":
...     main()