otx.cli.utils.multi_gpu#
Multi GPU training utility.
Functions

- get_gpu_ids — Get proper GPU indices from the --gpus argument.
- is_multigpu_child_process — Check whether the current process is a child process for multi GPU training.
- Add arguments at the proper position in sys.argv.

Classes

- MultiGPUManager — Class to manage multi GPU training.
- class otx.cli.utils.multi_gpu.MultiGPUManager(train_func: Callable, gpu_ids: str, rdzv_endpoint: str = 'localhost:0', base_rank: int = 0, world_size: int = 0, start_time: datetime | None = None)[source]#
Bases: object
Class to manage multi GPU training.
- Parameters:
train_func (Callable) – model training function.
gpu_ids (str) – GPU indices to use, given as comma-separated indices.
rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.
base_rank (int) – Base rank of the worker.
world_size (int) – Total number of workers in a worker group.
start_time (Optional[datetime.datetime]) – Time when process starts. This value is used to decide timeout argument of distributed training.
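As an illustrative sketch (not the library's implementation), the default rdzv_endpoint value 'localhost:0' can be split into a host and port roughly like this; parse_endpoint is a hypothetical helper introduced only for this example:

```python
def parse_endpoint(rdzv_endpoint: str) -> tuple:
    """Split a rendezvous endpoint such as 'localhost:0' into (host, port).

    Hypothetical helper for illustration; MultiGPUManager's actual
    parsing logic may differ.
    """
    host, _, port = rdzv_endpoint.rpartition(":")
    return host, int(port)
```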
- static check_parent_processes_alive()[source]#
Check whether the parent process is alive and, if not, exit.
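One common way to implement such a liveness check on POSIX systems (a sketch, not necessarily what check_parent_processes_alive does) is to send signal 0 to the parent PID, which performs error checking without delivering a signal:

```python
import os


def process_alive(pid: int) -> bool:
    """Return True if a process with the given PID exists.

    Sending signal 0 only checks for existence and permissions.
    Illustrative sketch; the real method may use a different mechanism.
    """
    try:
        os.kill(pid, 0)
    except OSError:
        return False
    return True
```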
- static initialize_multigpu_train(rdzv_endpoint: str, rank: int, local_rank: int, gpu_ids: List[int], world_size: int)[source]#
Initialization for multi GPU training.
- Parameters:
rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.
rank (int) – The rank of the worker within a worker group.
local_rank (int) – The rank of the worker within a local worker group.
gpu_ids (List[int]) – List of GPU indices to use.
world_size (int) – Total number of workers in a worker group.
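Typical torch.distributed initialization driven by these parameters exports environment variables such as the following. This is a hedged, standard-library-only sketch; the actual initialization would also call torch.distributed.init_process_group, which is omitted here:

```python
import os


def export_dist_env(rdzv_endpoint: str, rank: int, local_rank: int,
                    gpu_ids: list, world_size: int) -> None:
    """Export environment variables commonly read by torch.distributed.

    Illustrative only; initialize_multigpu_train may set these differently.
    """
    master_addr, _, master_port = rdzv_endpoint.rpartition(":")
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    # Restrict visible devices to the requested GPU indices.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
```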
- is_available() bool [source]#
Check whether multi GPU training is available.
- Returns:
Whether multi GPU training is available.
- Return type:
bool
- static run_child_process(train_func: Callable, output_path: str, rdzv_endpoint: str, rank: int, local_rank: int, gpu_ids: List[int], world_size: int)[source]#
Function for multi GPU child process to execute.
- Parameters:
train_func (Callable) – model training function.
output_path (str) – output path where task outputs are saved.
rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.
rank (int) – The rank of the worker within a worker group.
local_rank (int) – The rank of the worker within a local worker group.
gpu_ids (List[int]) – List of GPU indices to use.
world_size (int) – Total number of workers in a worker group.
- setup_multi_gpu_train(output_path: str, optimized_hyper_parameters: ConfigurableParameters | None = None)[source]#
Carry out the setup needed to run multi GPU training.
- Parameters:
output_path (str) – output path where task outputs are saved.
optimized_hyper_parameters (ConfigurableParameters or None) – hyper parameters reflecting the HPO result.
- Returns:
If output_path is None, a temporary directory is created and returned.
- Return type:
- otx.cli.utils.multi_gpu.get_gpu_ids(gpus: str) List[int] [source]#
Get proper GPU indices from the --gpus argument.
Given the --gpus argument, exclude inappropriate indices and transform it into a list of ints.
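A minimal sketch of this kind of filtering, assuming a known number of available devices. parse_gpu_ids is a hypothetical re-implementation for illustration, not the actual source of get_gpu_ids, which may apply additional validation:

```python
def parse_gpu_ids(gpus: str, num_available: int) -> list:
    """Parse a comma-separated index string, dropping out-of-range entries.

    Hypothetical example; the real get_gpu_ids may also raise on an
    empty or fully invalid result.
    """
    ids = []
    for token in gpus.split(","):
        idx = int(token)
        # Keep only indices that refer to an existing device.
        if 0 <= idx < num_available:
            ids.append(idx)
    return ids
```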
- otx.cli.utils.multi_gpu.is_multigpu_child_process()[source]#
Check whether the current process is a child process for multi GPU training.
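A common heuristic for such a check (a sketch; the real function's criterion may differ, e.g. it could inspect torch.distributed initialization state instead) looks for the environment variables that distributed launchers set on worker processes:

```python
import os


def looks_like_child_process() -> bool:
    """Return True if distributed-worker environment variables are present.

    Illustrative heuristic only; is_multigpu_child_process may use a
    different mechanism.
    """
    return os.environ.get("LOCAL_RANK") is not None
```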