otx.cli.utils.multi_gpu#

Multi GPU training utility.

Functions

get_gpu_ids(gpus)

Get proper GPU indices from the --gpus argument.

is_multigpu_child_process()

Check whether the current process is a child process for multi GPU training.

set_arguments_to_argv(keys[, value, ...])

Add arguments at proper position in sys.argv.

Classes

MultiGPUManager(train_func, gpu_ids[, ...])

Class to manage multi GPU training.

class otx.cli.utils.multi_gpu.MultiGPUManager(train_func: Callable, gpu_ids: str, rdzv_endpoint: str = 'localhost:0', base_rank: int = 0, world_size: int = 0, start_time: datetime | None = None)[source]#

Bases: object

Class to manage multi GPU training.

Parameters:
  • train_func (Callable) – model training function.

  • gpu_ids (str) – GPU indices to use, given as comma-separated indices.

  • rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.

  • base_rank (int) – Base rank of the worker.

  • world_size (int) – Total number of workers in a worker group.

  • start_time (Optional[datetime.datetime]) – Time when the process started. This value is used to decide the timeout argument for distributed training.
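
The interplay of gpu_ids, base_rank, and world_size can be illustrated with a small sketch. This is a hypothetical helper, not the actual MultiGPUManager implementation: it only shows how global ranks would plausibly be assigned to local GPU workers in a multi-node job, assuming world_size = 0 means "single node, infer from local GPU count".

```python
from typing import List, Tuple

def assign_ranks(gpu_ids: List[int], base_rank: int, world_size: int) -> List[Tuple[int, int]]:
    """Return (global_rank, local_rank) pairs for each local GPU worker.

    Hypothetical sketch: if world_size is 0 (the single-node default),
    fall back to the number of local GPUs.
    """
    if world_size == 0:
        world_size = len(gpu_ids)
    return [(base_rank + local_rank, local_rank) for local_rank in range(len(gpu_ids))]

# Second node of a 2-node x 2-GPU job: its workers start at global rank 2.
print(assign_ranks([0, 1], base_rank=2, world_size=4))  # [(2, 0), (3, 1)]
```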

static check_parent_processes_alive()[source]#

Check whether the parent process is alive and, if not, exit.

finalize()[source]#

Join all child processes.

static initialize_multigpu_train(rdzv_endpoint: str, rank: int, local_rank: int, gpu_ids: List[int], world_size: int)[source]#

Initialization for multi GPU training.

Parameters:
  • rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.

  • rank (int) – The rank of worker within a worker group.

  • local_rank (int) – The rank of worker within a local worker group.

  • gpu_ids (List[int]) – list of GPU indices to use.

  • world_size (int) – Total number of workers in a worker group.
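
A distributed-training initialization of this shape typically exports the rendezvous information as environment variables before joining the process group. The sketch below is an assumption about what such a setup involves, not the actual otx code; the real initialize_multigpu_train would additionally call torch.distributed.init_process_group, which is omitted here so the sketch stays self-contained.

```python
import os

def init_env_sketch(rdzv_endpoint: str, rank: int, local_rank: int, world_size: int) -> None:
    """Export the environment variables that torch.distributed's env://
    rendezvous reads. Hypothetical sketch only: the actual process-group
    initialization is omitted."""
    master_addr, master_port = rdzv_endpoint.split(":")
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)

init_env_sketch("localhost:29500", rank=1, local_rank=1, world_size=2)
print(os.environ["MASTER_PORT"])  # 29500
```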

is_available() bool[source]#

Check whether multi GPU training is available.

Returns:

whether multi GPU training is available.

Return type:

bool

static run_child_process(train_func: Callable, output_path: str, rdzv_endpoint: str, rank: int, local_rank: int, gpu_ids: List[int], world_size: int)[source]#

Function for multi GPU child process to execute.

Parameters:
  • train_func (Callable) – model training function.

  • output_path (str) – output path where task outputs are saved.

  • rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.

  • rank (int) – The rank of worker within a worker group.

  • local_rank (int) – The rank of worker within a local worker group.

  • gpu_ids (List[int]) – list of GPU indices to use.

  • world_size (int) – Total number of workers in a worker group.

setup_multi_gpu_train(output_path: str, optimized_hyper_parameters: ConfigurableParameters | None = None)[source]#

Prepare everything required to run multi GPU training.

Parameters:
  • output_path (str) – output path where task outputs are saved.

  • optimized_hyper_parameters (ConfigurableParameters or None) – hyper-parameters reflecting the HPO result.

Returns:

If output_path is None, make a temporary directory and return it.

Return type:

str
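
The documented return behaviour can be sketched as follows. This is a hypothetical reimplementation based only on the docstring above (when output_path is None, a temporary directory is created and returned), not the actual setup logic, which also spawns the child worker processes.

```python
import os
import tempfile

def resolve_output_path(output_path):
    """Sketch of the documented return value: fall back to a fresh
    temporary directory when no output path is given (assumption
    derived from the docstring, not the real otx implementation)."""
    if output_path is None:
        return tempfile.mkdtemp()
    return output_path

print(resolve_output_path("/tmp/run1"))  # /tmp/run1
```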

otx.cli.utils.multi_gpu.get_gpu_ids(gpus: str) List[int][source]#

Get proper GPU indices from the --gpus argument.

Given the --gpus argument, exclude inappropriate indices and transform the remainder into a list of ints.

Parameters:

gpus (str) – GPU indices to use, given as comma-separated indices.

Returns:

list of usable GPU indices.

Return type:

List[int]
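
The documented parse-and-filter behaviour can be sketched like this. It is an assumption, not the real get_gpu_ids: the number of available devices is injected as a parameter so the sketch runs without CUDA, whereas the real function would query the machine (and may apply further checks, e.g. CUDA_VISIBLE_DEVICES).

```python
from typing import List

def parse_gpu_ids(gpus: str, num_available: int) -> List[int]:
    """Hypothetical sketch of get_gpu_ids: parse a comma-separated
    index string and drop indices outside [0, num_available)."""
    ids = []
    for token in gpus.split(","):
        idx = int(token)
        if 0 <= idx < num_available:
            ids.append(idx)
    return ids

# Index 7 is excluded on a 2-GPU machine.
print(parse_gpu_ids("0,1,7", num_available=2))  # [0, 1]
```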

otx.cli.utils.multi_gpu.is_multigpu_child_process()[source]#

Check whether the current process is a child process for multi GPU training.
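
One common way to make such a check is via an environment variable set by the spawning process. The sketch below is purely an assumption about the mechanism, not the actual otx implementation, which may instead inspect the torch.distributed state.

```python
import os

def is_child_sketch() -> bool:
    """Hypothetical check: treat any worker with a LOCAL_RANK other
    than "0" as a child of the multi GPU launcher (assumption, not
    the real is_multigpu_child_process logic)."""
    local_rank = os.environ.get("LOCAL_RANK")
    return local_rank is not None and local_rank != "0"

os.environ["LOCAL_RANK"] = "1"
print(is_child_sketch())  # True
```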

otx.cli.utils.multi_gpu.set_arguments_to_argv(keys: str | List[str], value: str | None = None, after_params: bool = False)[source]#

Add arguments at proper position in sys.argv.

Parameters:
  • keys (str or List[str]) – argument keys.

  • value (str or None) – argument value.

  • after_params (bool) – whether the argument should be placed after the params section of sys.argv.
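
The overwrite-or-append behaviour such a helper implies can be sketched on a copy of argv. This is a hypothetical simplification (it ignores after_params and edge cases like a key in the final position), not the actual set_arguments_to_argv code.

```python
from typing import List, Optional, Union

def set_args_sketch(argv: List[str], keys: Union[str, List[str]], value: Optional[str] = None) -> List[str]:
    """Hypothetical sketch: if any of the keys is already present,
    overwrite its value; otherwise append the first key (and the
    value, when the argument is not a bare flag)."""
    keys = [keys] if isinstance(keys, str) else keys
    argv = list(argv)  # work on a copy rather than mutating sys.argv
    for i, arg in enumerate(argv):
        if arg in keys:
            if value is not None:
                argv[i + 1] = value
            return argv
    argv.append(keys[0])
    if value is not None:
        argv.append(value)
    return argv

print(set_args_sketch(["otx", "train", "--gpus", "0"], "--gpus", "0,1"))
# ['otx', 'train', '--gpus', '0,1']
```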