Distributed Computation#

High-level Point FMM Interface#

To perform point-FMM, first construct a boxtree.distributed.DistributedFMMRunner object. The constructor will distribute the necessary information from the root rank to all worker ranks. Then, the boxtree.distributed.DistributedFMMRunner.drive_dfmm() can be used for launching FMM.

class boxtree.distributed.DistributedFMMRunner(queue, global_tree, traversal_builder, wrangler_factory, calibration_params=None, comm=<mpi4py.MPI.Intracomm object>)[source]#

Helper class for setting up and running distributed point FMM.

__init__(queue, global_tree, traversal_builder, wrangler_factory, calibration_params=None, comm=<mpi4py.MPI.Intracomm object>)[source]#

Construct a DistributedFMMRunner object.

  • global_tree – a boxtree.Tree object. This tree could live in the host or the device memory, depending on the wrangler. This argument is only significant on the root rank.

  • traversal_builder – an object which, when called, takes a pyopencl.CommandQueue object and a boxtree.Tree object, and generates a boxtree.traversal.FMMTraversalInfo object from the tree using the command queue.

  • wrangler_factory – an object which, when called, takes the local traversal and the global traversal objects and returns an boxtree.fmm.ExpansionWranglerInterface object.

  • calibration_params – Calibration parameters for the cost model, if supplied. The cost model is used for estimating the execution time of each box, which is used for improving load balancing.

  • comm – MPI communicator.

drive_dfmm(source_weights, timing_data=None)[source]#

Calculate potentials at target points.

Distributed Algorithm Overview#

  1. Construct the global tree and traversal lists on the root rank and broadcast to all worker ranks.

  2. Partition boxes into disjoint sets, where the number of sets is the number of MPI ranks. (See Partition Boxes)

  3. Each rank constructs the local tree and traversal lists independently, according to the partition. (See Construct Local Tree and Traversal)

  4. Distribute source weights from the root rank to all worker ranks. (See Distributed Wrangler)

  5. Each rank independently forms multipole expansions from the leaf nodes of the local tree and propagates the partial multipole expansions upwards.

  6. Communicate multipole expansions so that all ranks have the complete multipole expansions needed.

  7. Each ranks indepedently forms local expansions, propagates the local expansions downwards, and evaluate potentials of target points in its partition. The calculated potentials are then assembled on the root rank.

For step 5-7, see Distributed FMM Evaluation.

Note that step 4-7 may be repeated multiple times with the same tree and traversal object built from step 1-3. For example, when iteratively solving a PDE, step 4-7 is executed for each iteration of the linear solver.

The next sections will cover the interfaces of these steps.

Partition Boxes#

boxtree.distributed.partition.partition_work(cost_per_box, traversal, comm)[source]#

This function assigns responsible boxes for each rank.

If a rank is responsible for a box, it will calculate the multiple expansion of the box and evaluate target potentials in the box.

  • cost_per_box – The expected running time of each box. This argument is only significant on the root rank.

  • traversal – The global traversal object containing all particles. This argument is significant on all ranks.

  • comm – MPI communicator.


A numpy array containing the responsible boxes of the current rank.

class boxtree.distributed.partition.BoxMasks(responsible_boxes: Array, ancestor_boxes: Array, point_src_boxes: Array, multipole_src_boxes: Array)[source]#

Box masks needed for the distributed calculation. Each of these masks is a PyOpenCL array with length tree.nboxes, whose i-th entry is 1 if box i is set.


Current process will evaluate target potentials and multipole expansions in these boxes. Sources and targets in these boxes are needed.


Ancestors of the responsible boxes.


Current process needs sources but not targets in these boxes.


Current process needs multipole expressions in these boxes.

boxtree.distributed.partition.get_box_masks(queue, traversal, responsible_boxes_list)[source]#

Given the responsible boxes for a rank, this helper function calculates the relevant masks.


responsible_boxes_list – A numpy array of responsible box indices.


A BoxMasks object of the relevant masks.

Construct Local Tree and Traversal#

class boxtree.distributed.local_tree.LocalTree(valuedict=None, exclude=None, **kwargs)[source]#

Inherits from boxtree.Tree.


box_id_t [nboxes + 1]


int32 [*]

A CSR-like interaction list storage array, together with box_to_user_rank_starts. For each box, the list of ranks which own targets that use the multipole expansion at this box, via either List 3 or (possibly downward propagated from an ancestor) List 2.

boxtree.distributed.local_tree.generate_local_tree(queue, global_traversal, responsible_boxes_list, comm)[source]#

Generate the local tree for the current rank.

This is an MPI-collective routine on comm.


a tuple of (local_tree, src_idx, tgt_idx), where local_tree is an object with class boxtree.distributed.local_tree.LocalTree of the generated local tree, src_idx is the indices of the local sources in the global tree, and tgt_idx is the indices of the local targets in the global tree. src_idx and tgt_idx are needed for distributing source weights from root rank and assembling calculated potentials on the root rank.

boxtree.distributed.local_traversal.generate_local_travs(queue, local_tree, traversal_builder, merge_close_lists=False)[source]#

Generate local traversal from local tree.

  • queue – a pyopencl.CommandQueue object.

  • local_tree – the local tree of class boxtree.tools.ImmutableHostDeviceArray on which the local traversal object will be constructed.

  • traversal_builder – a function, taken a pyopencl.CommandQueue and a tree, returns the traversal object based on the tree.


generated local traversal object in device memory

Distributed Wrangler#

class boxtree.distributed.calculation.DistributedExpansionWrangler(context, comm, global_traversal, traversal_in_device_memory, communicate_mpoles_via_allreduce=False)[source]#

Distributed expansion wrangler base class.

This is an abstract class and should not be directly instantiated. Instead, it is expected that all distributed wranglers should be subclasses of this class.

__init__(context, comm, global_traversal, traversal_in_device_memory, communicate_mpoles_via_allreduce=False)[source]#
distribute_source_weights(src_weight_vecs, src_idx_all_ranks)[source]#

Used by the distributed implementation for transferring needed source weights from root rank to each worker rank in the communicator.

This method needs to be called collectively by all ranks in the communicator.

  • src_weight_vecs – a sequence of numpy.ndarray, each with length nsources, representing the weights of sources on the root rank. None on worker ranks.

  • src_idx_all_ranks – a list of length nranks, including the root rank, where the i-th entry is a numpy.ndarray of indices, of which src_weight_vecs to be sent from the root rank to rank i. Each entry can be generated by generate_local_tree(). None on worker ranks.


Received source weights of the current rank, including the root rank.

gather_potential_results(potentials, tgt_idx_all_ranks)[source]#

Used by the distributed implementation for gathering calculated potentials from all worker ranks in the communicator to the root rank.

This method needs to be called collectively by all ranks in the communicator.

  • potentials – Calculated potentials on each rank. This argument is significant on all ranks, including the root rank.

  • tgt_idx_all_ranks – a list of length nranks, where the i-th entry is a numpy.ndarray of the global potential indices of potentials from rank i. This argument is only significant on the root rank.


Gathered potentials on the root rank. None on worker ranks.

communicate_mpoles(mpole_exps, return_stats=False)[source]#

Based on Algorithm 3: Reduce and Scatter in Lashuk et al. [1].

The main idea is to mimic an allreduce as done on a hypercube network, but to decrease the bandwidth cost by sending only information that is relevant to the rank receiving the message.

Distributed FMM Evaluation#

The distributed version of the FMM evaluation shares the same interface as the shared-memory version. To evaluate FMM in a distributed manner, use a subclass of boxtree.distributed.calculation.DistributedExpansionWrangler in boxtree.fmm.drive_fmm().