otx.cli.utils.multi_gpu#

Multi GPU training utility.

Functions

get_gpu_ids(gpus)

Get proper GPU indices from the --gpus argument.

is_multigpu_child_process()

Check whether the current process is a child process for multi GPU training.

set_arguments_to_argv(keys[, value, ...])

Add arguments at proper position in sys.argv.

Classes

MultiGPUManager(train_func, gpu_ids[, ...])

Class to manage multi GPU training.

class otx.cli.utils.multi_gpu.MultiGPUManager(train_func: Callable, gpu_ids: str, rdzv_endpoint: str = 'localhost:0', base_rank: int = 0, world_size: int = 0, start_time: datetime | None = None)[source]#

Bases: object

Class to manage multi GPU training.

Parameters:
  • train_func (Callable) – model training function.

  • gpu_ids (str) – GPU indices to use, given as comma-separated indices.

  • rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.

  • base_rank (int) – Base rank of the worker.

  • world_size (int) – Total number of workers in a worker group.

  • start_time (Optional[datetime.datetime]) – Time when the process started. This value is used to decide the timeout argument for distributed training.
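
The interplay of gpu_ids, base_rank, and world_size can be illustrated with a small sketch. This is a hypothetical helper, not the actual MultiGPUManager implementation: it only shows how global ranks would plausibly be assigned to local GPU workers in a multi-node job, assuming world_size = 0 means "single node, infer from local GPU count".

```python
from typing import List, Tuple

def assign_ranks(gpu_ids: List[int], base_rank: int, world_size: int) -> List[Tuple[int, int]]:
    """Return (global_rank, local_rank) pairs for each local GPU worker.

    Hypothetical sketch: if world_size is 0 (the single-node default),
    fall back to the number of local GPUs.
    """
    if world_size == 0:
        world_size = len(gpu_ids)
    return [(base_rank + local_rank, local_rank) for local_rank in range(len(gpu_ids))]

# Second node of a 2-node x 2-GPU job: its workers start at global rank 2.
print(assign_ranks([0, 1], base_rank=2, world_size=4))  # [(2, 0), (3, 1)]
```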

static check_parent_processes_alive()[source]#

Check whether the parent process is alive and, if not, exit.

finalize()[source]#

Join all child processes.

static initialize_multigpu_train(rdzv_endpoint: str, rank: int, local_rank: int, gpu_ids: List[int], world_size: int)[source]#

Initialization for multi GPU training.

Parameters:
  • rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.

  • rank (int) – The rank of worker within a worker group.

  • local_rank (int) – The rank of worker within a local worker group.

  • gpu_ids (List[int]) – list of GPU indices to use.

  • world_size (int) – Total number of workers in a worker group.
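
A distributed-training initialization of this shape typically exports the rendezvous information as environment variables before joining the process group. The sketch below is an assumption about what such a setup involves, not the actual otx code; the real initialize_multigpu_train would additionally call torch.distributed.init_process_group, which is omitted here so the sketch stays self-contained.

```python
import os

def init_env_sketch(rdzv_endpoint: str, rank: int, local_rank: int, world_size: int) -> None:
    """Export the environment variables that torch.distributed's env://
    rendezvous reads. Hypothetical sketch only: the actual process-group
    initialization is omitted."""
    master_addr, master_port = rdzv_endpoint.split(":")
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)

init_env_sketch("localhost:29500", rank=1, local_rank=1, world_size=2)
print(os.environ["MASTER_PORT"])  # 29500
```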

is_available() bool[source]#

Check whether multi GPU training is available.

Returns:

whether multi GPU training is available.

Return type:

bool

static run_child_process(train_func: Callable, output_path: str, rdzv_endpoint: str, rank: int, local_rank: int, gpu_ids: List[int], world_size: int)[source]#

Function for multi GPU child process to execute.

Parameters:
  • train_func (Callable) – model training function.

  • output_path (str) – output path where task outputs are saved.

  • rdzv_endpoint (str) – Rendezvous endpoint for multi-node training.

  • rank (int) – The rank of worker within a worker group.

  • local_rank (int) – The rank of worker within a local worker group.

  • gpu_ids (List[int]) – list of GPU indices to use.

  • world_size (int) – Total number of workers in a worker group.

setup_multi_gpu_train(output_path: str, optimized_hyper_parameters: ConfigurableParameters | None = None)[source]#

Prepare everything required to run multi GPU training.

Parameters:
  • output_path (str) – output path where task outputs are saved.

  • optimized_hyper_parameters (ConfigurableParameters or None) – hyper-parameters reflecting the HPO result.

Returns:

If output_path is None, make a temporary directory and return it.

Return type:

str
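
The documented return behaviour can be sketched as follows. This is a hypothetical reimplementation based only on the docstring above (when output_path is None, a temporary directory is created and returned), not the actual setup logic, which also spawns the child worker processes.

```python
import os
import tempfile

def resolve_output_path(output_path):
    """Sketch of the documented return value: fall back to a fresh
    temporary directory when no output path is given (assumption
    derived from the docstring, not the real otx implementation)."""
    if output_path is None:
        return tempfile.mkdtemp()
    return output_path

print(resolve_output_path("/tmp/run1"))  # /tmp/run1
```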

otx.cli.utils.multi_gpu.get_gpu_ids(gpus: str) List[int][source]#

Get proper GPU indices from the --gpus argument.

Given the --gpus argument, exclude inappropriate indices and transform the remainder into a list of ints.

Parameters:

gpus (str) – GPU indices to use, given as comma-separated indices.

Returns:

list of usable GPU indices.

Return type:

List[int]
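
The documented parse-and-filter behaviour can be sketched like this. It is an assumption, not the real get_gpu_ids: the number of available devices is injected as a parameter so the sketch runs without CUDA, whereas the real function would query the machine (and may apply further checks, e.g. CUDA_VISIBLE_DEVICES).

```python
from typing import List

def parse_gpu_ids(gpus: str, num_available: int) -> List[int]:
    """Hypothetical sketch of get_gpu_ids: parse a comma-separated
    index string and drop indices outside [0, num_available)."""
    ids = []
    for token in gpus.split(","):
        idx = int(token)
        if 0 <= idx < num_available:
            ids.append(idx)
    return ids

# Index 7 is excluded on a 2-GPU machine.
print(parse_gpu_ids("0,1,7", num_available=2))  # [0, 1]
```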

otx.cli.utils.multi_gpu.is_multigpu_child_process()[source]#

Check whether the current process is a child process for multi GPU training.
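
One common way to make such a check is via an environment variable set by the spawning process. The sketch below is purely an assumption about the mechanism, not the actual otx implementation, which may instead inspect the torch.distributed state.

```python
import os

def is_child_sketch() -> bool:
    """Hypothetical check: treat any worker with a LOCAL_RANK other
    than "0" as a child of the multi GPU launcher (assumption, not
    the real is_multigpu_child_process logic)."""
    local_rank = os.environ.get("LOCAL_RANK")
    return local_rank is not None and local_rank != "0"

os.environ["LOCAL_RANK"] = "1"
print(is_child_sketch())  # True
```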

otx.cli.utils.multi_gpu.set_arguments_to_argv(keys: str | List[str], value: str | None = None, after_params: bool = False)[source]#

Add arguments at proper position in sys.argv.

Parameters:
  • keys (str or List[str]) – argument keys.

  • value (str or None) – argument value.

  • after_params (bool) – whether the argument should be placed after the params section of sys.argv.
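
The overwrite-or-append behaviour such a helper implies can be sketched on a copy of argv. This is a hypothetical simplification (it ignores after_params and edge cases like a key in the final position), not the actual set_arguments_to_argv code.

```python
from typing import List, Optional, Union

def set_args_sketch(argv: List[str], keys: Union[str, List[str]], value: Optional[str] = None) -> List[str]:
    """Hypothetical sketch: if any of the keys is already present,
    overwrite its value; otherwise append the first key (and the
    value, when the argument is not a bare flag)."""
    keys = [keys] if isinstance(keys, str) else keys
    argv = list(argv)  # work on a copy rather than mutating sys.argv
    for i, arg in enumerate(argv):
        if arg in keys:
            if value is not None:
                argv[i + 1] = value
            return argv
    argv.append(keys[0])
    if value is not None:
        argv.append(value)
    return argv

print(set_args_sketch(["otx", "train", "--gpus", "0"], "--gpus", "0,1"))
# ['otx', 'train', '--gpus', '0,1']
```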