Distributed Computation#
High-level Point FMM Interface#
To perform point-FMM, first construct a
boxtree.distributed.DistributedFMMRunner
object. The constructor will
distribute the necessary information from the root rank to all worker ranks. Then,
the boxtree.distributed.DistributedFMMRunner.drive_dfmm()
can be used for
launching FMM.
- class boxtree.distributed.DistributedFMMRunner(queue, global_tree_dev, traversal_builder, wrangler_factory, calibration_params=None, comm=<mpi4py.MPI.Intracomm object>)[source]#
Helper class for setting up and running distributed point FMM.
- __init__(queue, global_tree_dev, traversal_builder, wrangler_factory, calibration_params=None, comm=<mpi4py.MPI.Intracomm object>)[source]#
Distributes the global tree from the root rank to each worker rank.
- Parameters
global_tree_dev – a
boxtree.Tree
object in device memory.traversal_builder – an object which, when called, takes a
pyopencl.CommandQueue
object and aboxtree.Tree
object, and generates aboxtree.traversal.FMMTraversalInfo
object from the tree using the command queue.wrangler_factory – an object which, when called, takes the local traversal and the global traversal objects and returns an
boxtree.fmm.ExpansionWranglerInterface
object.calibration_params – Calibration parameters for the cost model, if supplied. The cost model is used for estimating the execution time of each box, which is used for improving load balancing.
comm – MPI communicator.
Distributed Algorithm Overview#
Construct the global tree and traversal lists on the root rank and broadcast to all worker ranks.
Partition boxes into disjoint sets, where the number of sets is the number of MPI ranks. (See Partition Boxes)
Each rank constructs the local tree and traversal lists independently, according to the partition. (See Construct Local Tree and Traversal)
Distribute source weights from the root rank to all worker ranks. (See Distributed Wrangler)
Each rank independently forms multipole expansions from the leaf nodes of the local tree and propagates the partial multipole expansions upwards.
Communicate multipole expansions so that all ranks have the complete multipole expansions needed.
Each ranks indepedently forms local expansions, propagates the local expansions downwards, and evaluate potentials of target points in its partition. The calculated potentials are then assembled on the root rank.
For step 5-7, see Distributed FMM Evaluation.
Note that step 4-7 may be repeated multiple times with the same tree and traversal object built from step 1-3. For example, when iteratively solving a PDE, step 4-7 is executed for each iteration of the linear solver.
The next sections will cover the interfaces of these steps.
Partition Boxes#
- boxtree.distributed.partition.partition_work(cost_per_box, traversal, comm)[source]#
This function assigns responsible boxes for each rank.
If a rank is responsible for a box, it will calculate the multiple expansion of the box and evaluate target potentials in the box.
- Parameters
cost_per_box – The expected running time of each box. This argument is only significant on the root rank.
traversal – The global traversal object containing all particles. This argument is significant on all ranks.
comm – MPI communicator.
- Returns
A numpy array containing the responsible boxes of the current rank.
- class boxtree.distributed.partition.BoxMasks(responsible_boxes: pyopencl.array.Array, ancestor_boxes: pyopencl.array.Array, point_src_boxes: pyopencl.array.Array, multipole_src_boxes: pyopencl.array.Array)[source]#
Box masks needed for the distributed calculation. Each of these masks is a PyOpenCL array with length
tree.nboxes
, whose i-th entry is 1 if box i is set.- responsible_boxes#
Current process will evaluate target potentials and multipole expansions in these boxes. Sources and targets in these boxes are needed.
- ancestor_boxes#
Ancestors of the responsible boxes.
- point_src_boxes#
Current process needs sources but not targets in these boxes.
- multipole_src_boxes#
Current process needs multipole expressions in these boxes.
- boxtree.distributed.partition.get_box_masks(queue, traversal, responsible_boxes_list)[source]#
Given the responsible boxes for a rank, this helper function calculates the relevant masks.
- Parameters
responsible_boxes_list – A numpy array of responsible box indices.
- Returns
A
BoxMasks
object of the relevant masks.
Construct Local Tree and Traversal#
- class boxtree.distributed.local_tree.LocalTree(valuedict=None, exclude=None, **kwargs)[source]#
Inherits from
boxtree.Tree
.- box_to_user_rank_starts#
box_id_t [nboxes + 1]
- box_to_user_rank_lists#
int32 [*]
A CSR-like interaction list storage array, together with
box_to_user_rank_starts
. For each box, the list of ranks which own targets that use the multipole expansion at this box, via either List 3 or (possibly downward propagated from an ancestor) List 2.
- boxtree.distributed.local_tree.generate_local_tree(queue, global_traversal, responsible_boxes_list, comm)[source]#
Generate the local tree for the current rank.
This is an MPI-collective routine on comm.
- Parameters
queue – a
pyopencl.CommandQueue
object.global_traversal – Global
boxtree.traversal.FMMTraversalInfo
object on host memory.responsible_boxes_list – a
numpy.ndarray
object containing the responsible boxes of the current rank.
- Returns
a tuple of
(local_tree, src_idx, tgt_idx)
, wherelocal_tree
is an object with classboxtree.distributed.local_tree.LocalTree
of the generated local tree,src_idx
is the indices of the local sources in the global tree, andtgt_idx
is the indices of the local targets in the global tree.src_idx
andtgt_idx
are needed for distributing source weights from root rank and assembling calculated potentials on the root rank.
- boxtree.distributed.local_traversal.generate_local_travs(queue, local_tree, traversal_builder, merge_close_lists=False)[source]#
Generate local traversal from local tree.
- Parameters
queue – a
pyopencl.CommandQueue
object.local_tree – the local tree of class boxtree.tools.ImmutableHostDeviceArray on which the local traversal object will be constructed.
traversal_builder – a function, taken a
pyopencl.CommandQueue
and a tree, returns the traversal object based on the tree.
- Returns
generated local traversal object in host memory
Distributed Wrangler#
- class boxtree.distributed.calculation.DistributedExpansionWrangler(context, comm, global_traversal, communicate_mpoles_via_allreduce=False)[source]#
Distributed expansion wrangler base class.
This is an abstract class and should not be directly instantiated. Instead, it is expected that all distributed wranglers should be subclasses of this class.
- distribute_source_weights(src_weight_vecs, src_idx_all_ranks)[source]#
Used by the distributed implementation for transferring needed source weights from root rank to each worker rank in the communicator.
This method needs to be called collectively by all ranks in the communicator.
- Parameters
src_weight_vecs – a sequence of
numpy.ndarray
, each with lengthnsources
, representing the weights of sources on the root rank. None on worker ranks.src_idx_all_ranks – a
list
of lengthnranks
, including the root rank, where the i-th entry is anumpy.ndarray
of indices, of which src_weight_vecs to be sent from the root rank to rank i. Each entry can be generated bygenerate_local_tree()
. None on worker ranks.
- Returns
Received source weights of the current rank, including the root rank.
- gather_potential_results(potentials, tgt_idx_all_ranks)[source]#
Used by the distributed implementation for gathering calculated potentials from all worker ranks in the communicator to the root rank.
This method needs to be called collectively by all ranks in the communicator.
- Parameters
potentials – Calculated potentials on each rank. This argument is significant on all ranks, including the root rank.
tgt_idx_all_ranks – a
list
of lengthnranks
, where the i-th entry is anumpy.ndarray
of the global potential indices of potentials from rank i. This argument is only significant on the root rank.
- Returns
Gathered potentials on the root rank. None on worker ranks.
- communicate_mpoles(mpole_exps, return_stats=False)[source]#
Based on Algorithm 3: Reduce and Scatter in Lashuk et al. 1.
The main idea is to mimic an allreduce as done on a hypercube network, but to decrease the bandwidth cost by sending only information that is relevant to the rank receiving the message.
- 1
Lashuk, Ilya, Aparna Chandramowlishwaran, Harper Langston, Tuan-Anh Nguyen, Rahul Sampath, Aashay Shringarpure, Richard Vuduc, Lexing Ying, Denis Zorin, and George Biros. “A massively parallel adaptive fast multipole method on heterogeneous architectures.” Communications of the ACM 55, no. 5 (2012): 101-109.
Distributed FMM Evaluation#
The distributed version of the FMM evaluation shares the same interface as the
shared-memory version. To evaluate FMM in a distributed manner, use a subclass
of boxtree.distributed.calculation.DistributedExpansionWrangler
in
boxtree.fmm.drive_fmm()
.