P2P Averaging
Documentation of the p2p_averaging submodule's interface.
Public Interface
ACiD.synchronize! — Function

The synchronize function.

Expects that the peer with whom we communicate runs the same synchronize function. The p2p communication edits in place the values of the parameters params_com and params_com_tilde (if apply_acid).
Parameters:
- params_com (torch.tensor): 1D tensor containing the model's parameters.
- params_other_worker (torch.tensor): 1D tensor, placeholder to receive the params_com of the worker with whom we communicate.
- process_group (a torch distributed process_group): specifies the process_group to use for the p2p communications.
- other_rank (int): the rank of the worker we communicate with.
- apply_acid (bool): whether or not to apply ACiD momentum. If True, the communication is an "event" triggering a momentum update.
- params_com_tilde (torch.tensor): "momentum" variable, same size as params_com, mixed with params_com to obtain acceleration.
- ode_matrix (torch.tensor): a 2x2 matrix storing the parameters of the linear mixing between params and params_tilde.
- t_last_spike (float): time of the last local update to params_com (be it a communication or a gradient one).
- delta_t_grad (mp.Value storing a double): the variable keeping track of the time it takes to make a grad step.
- beta_tilde (float): the α̃ value to use in ACiD.
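As an illustration of the in-place edit described above, here is a minimal sketch of one synchronization step. Plain Python lists stand in for the 1D torch tensors, the actual p2p exchange is elided (the peer's parameters are passed in directly), and both the midpoint averaging and the direct use of the 2x2 ode_matrix coefficients are assumptions for illustration, not the module's actual implementation:

```python
def average_and_mix(params_com, params_other, params_com_tilde, ode_matrix, apply_acid):
    """Illustrative sketch of one p2p averaging step (not the real torch code).

    params_com / params_other / params_com_tilde: plain lists standing in for
    1D torch tensors; ode_matrix: 2x2 nested list [[a, b], [c, d]].
    Edits params_com (and params_com_tilde if apply_acid) in place.
    """
    # p2p averaging: both peers replace their parameters by the midpoint
    params_com[:] = [(x + y) / 2.0 for x, y in zip(params_com, params_other)]
    if apply_acid:
        # the communication "event" triggers a momentum update: linearly mix
        # params_com and params_com_tilde using the 2x2 ode_matrix coefficients
        (a, b), (c, d) = ode_matrix
        mixed = [a * p + b * t for p, t in zip(params_com, params_com_tilde)]
        mixed_tilde = [c * p + d * t for p, t in zip(params_com, params_com_tilde)]
        params_com[:] = mixed
        params_com_tilde[:] = mixed_tilde
    return params_com, params_com_tilde
```

Since both peers run the same routine with each other's parameters, they end up holding the same averaged values after the exchange.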
ACiD.gossip_process — Function

Gossip routine for the p2p averaging of the model's parameters, running in the background.

- Averages the parameters of all the workers at the beginning (to start from a common initialization), and at the end.
- Uses the mp.Value rank_other to communicate with the orchestrating process that pairs available workers together for p2p communications, allowing this function to know with which rank to communicate next.
- Depending on deterministic_com, implements or not a P.P.P. for the communication process: if True, a random number of p2p communications between 2 grad steps are done, following a Poisson law.
- When the orchestrating process has counted that the right number of grad steps have been performed in total, it signals it back to this process (which stops the communication routine), and this process in turn signals to the main process to stop performing grad steps.
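The pairing protocol above can be sketched as a small dispatch on the sentinel values of rank_other (the sentinel meanings are those given in the rank_other parameter description below; the helper itself is hypothetical, not part of the module):

```python
def next_action(rank_other_value):
    """Hypothetical helper mapping the rank_other sentinel protocol to an
    action for the gossip loop (illustrative only)."""
    if rank_other_value == -1:
        # base value: the orchestrating process has not paired us yet
        return ("wait", None)
    if rank_other_value == -2:
        # orchestrator signal: total gradient quota met, stop communicating
        return ("stop", None)
    # otherwise: the rank of the peer for the next p2p communication
    return ("communicate", rank_other_value)
```

After each completed communication, the routine resets rank_other to the base value -1 and waits at barrier_sync_averaging to signal its availability for the next pairing.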
Parameters:
- rank (int): our rank id in the distributed setting.
- local_rank (int): the local rank of the worker inside its compute node (to create a Cuda Stream on the right GPU).
- world_size (int): the total number of workers.
- rank_other (mp.Value): a multiprocessing Value storing the rank id of the next communication. It is updated by the orchestrating process pairing workers together, and re-initialized by this process after a communication.
  - if rank_other.value == -1 (base value): no peer has been found yet.
  - if rank_other.value == -2: signal from the orchestrating process that enough gradients have been computed in total; stops the communication process.
  - if rank_other.value not in [-1, -2]: contains the rank of the worker we are supposed to communicate with next.
- params_com (torch.tensor): 1D tensor containing the model's parameters.
- params_other (torch.tensor): 1D tensor, placeholder to receive the params_com of the worker with whom we communicate.
- barrier_sync_averaging (mp.Barrier): a barrier used to communicate with the synchronization process. When we meet this barrier, we signal to the sync process that we have finished our previous communication and are available for the next one, so that it can begin to look for another available peer for the next p2p communication.
- continue_grad_routine (mp.Value containing a bool): whether or not the grad process should continue. Initialized at 1 (True); set to 0 (False) when the orchestrating process signals to us that the total gradient quota has been met.
- barrier_end_init (mp.Barrier): a barrier to signal to the __init__ function of ADP's class that the initial averaging of the parameters has been performed, and that ADP can resume its init.
- barrier_com_grad (mp.Barrier): a barrier to make sure a certain amount of communication has been done between 2 grad steps. Also used to make sure a certain amount of grad steps have been performed between 2 communications if rate_com < 1.
- log (logger): to print messages to the logs if needed.
- com_history (list of mp.Value): list of size world_size, used to log how many times this worker communicated with each of its peers.
- count_coms_local (mp.Value): a count of the number of p2p communications this worker has done.
- rate_com (float): the rate at which p2p communications are done (in expectation) compared to local grad steps.
- apply_acid (bool): whether or not to apply ACiD momentum. If True, the communication is an "event" triggering a momentum update.
- params_com_tilde (torch.tensor): "momentum" variable, same size as params_com, mixed with params_com to obtain acceleration.
- ode_matrix (torch.tensor): a 2x2 matrix storing the parameters of the linear mixing between params and params_tilde.
- t_last_spike (float): time of the last local update to params_com (be it a communication or a gradient one).
- delta_t_grad (mp.Value storing a double): the variable keeping track of the time it takes to make a grad step.
- beta_tilde (float): the α̃ value to use in ACiD.
- deterministic_com (bool): whether or not to use Poisson Point Processes to schedule the communications. If True, a random number of p2p communications between 2 grad steps are done, following a Poisson law.
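The Poisson scheduling of communications described above can be sketched with the standard library alone. The sampler below (Knuth's multiplication method) is an assumed illustration of how the random number of p2p communications per grad step could be drawn with mean rate_com; it is not the module's actual code:

```python
import math
import random

def sample_num_coms(rate_com, rng=random.random):
    """Draw the number of p2p communications to do before the next grad step
    from a Poisson law of mean rate_com (Knuth's multiplication method).
    Illustrative sketch only; rng is injectable to make tests deterministic."""
    threshold = math.exp(-rate_com)
    k, p = 0, 1.0
    while True:
        p *= rng()
        if p <= threshold:
            return k
        k += 1
```

In the deterministic alternative, one would instead perform a fixed number of communications per grad step, e.g. round(rate_com).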