Distributed Computation

High-level Point FMM Interface

To perform a distributed point FMM, first construct a boxtree.distributed.DistributedFMMRunner object. The constructor distributes the necessary information from the root rank to all worker ranks. Then, boxtree.distributed.DistributedFMMRunner.drive_dfmm() can be used to launch the FMM.

class boxtree.distributed.DistributedFMMRunner(array_context: ~arraycontext.context.ArrayContext, global_tree, traversal_builder, wrangler_factory, calibration_params=None, comm=<mpi4py.MPI.Intracomm object>)[source]

Helper class for setting up and running distributed point FMM.

__init__(array_context: ~arraycontext.context.ArrayContext, global_tree, traversal_builder, wrangler_factory, calibration_params=None, comm=<mpi4py.MPI.Intracomm object>)[source]

Construct a DistributedFMMRunner object.

Parameters:
  • global_tree – a boxtree.Tree object. This tree can live in host or device memory, depending on the wrangler. This argument is only significant on the root rank.

  • traversal_builder – an object which, when called, takes a pyopencl.CommandQueue object and a boxtree.Tree object, and generates a boxtree.traversal.FMMTraversalInfo object from the tree using the command queue.

  • wrangler_factory – an object which, when called, takes the local traversal and the global traversal objects and returns a boxtree.fmm.ExpansionWranglerInterface object.

  • calibration_params – Calibration parameters for the cost model, if supplied. The cost model is used for estimating the execution time of each box, which is used for improving load balancing.

  • comm – MPI communicator.

drive_dfmm(actx: ArrayContext, source_weights)[source]

Calculate potentials at target points.

Distributed Algorithm Overview

  1. Construct the global tree and traversal lists on the root rank and broadcast to all worker ranks.

  2. Partition boxes into disjoint sets, where the number of sets is the number of MPI ranks. (See Partition Boxes)

  3. Each rank constructs the local tree and traversal lists independently, according to the partition. (See Construct Local Tree and Traversal)

  4. Distribute source weights from the root rank to all worker ranks. (See Distributed Wrangler)

  5. Each rank independently forms multipole expansions from the leaf nodes of the local tree and propagates the partial multipole expansions upwards.

  6. Communicate multipole expansions so that all ranks have the complete multipole expansions needed.

  7. Each rank independently forms local expansions, propagates the local expansions downwards, and evaluates potentials at the target points in its partition. The calculated potentials are then assembled on the root rank.

For steps 5-7, see Distributed FMM Evaluation.

Note that steps 4-7 may be repeated multiple times with the same tree and traversal objects built in steps 1-3. For example, when iteratively solving a PDE, steps 4-7 are executed for each iteration of the linear solver.

The next sections will cover the interfaces of these steps.

Partition Boxes

boxtree.distributed.partition.partition_work(cost_per_box, traversal, comm)[source]

This function assigns a set of responsible boxes to each rank.

If a rank is responsible for a box, it will calculate the multipole expansion of the box and evaluate the target potentials in the box.

Parameters:
  • cost_per_box – The expected running time of each box. This argument is only significant on the root rank.

  • traversal – The global traversal object containing all particles. This argument is significant on all ranks.

  • comm – MPI communicator.

Returns:

A numpy array containing the responsible boxes of the current rank.
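To illustrate the load-balancing idea, here is a simplified, self-contained sketch that splits boxes into contiguous chunks of roughly equal total cost. The function name partition_by_cost and the example costs are hypothetical; the actual partition_work is an MPI routine and accounts for more than this sketch does.

```python
import numpy as np

def partition_by_cost(cost_per_box, num_ranks):
    """Split box indices into contiguous chunks of roughly equal total cost.

    Simplified stand-in for partition_work: no MPI, no tree structure,
    just cost balancing over a linear box ordering.
    """
    total = cost_per_box.sum()
    # Cumulative cost of each prefix of boxes.
    prefix = np.cumsum(cost_per_box)
    # Ideal cumulative-cost boundary for each of the num_ranks - 1 cuts.
    targets = total * np.arange(1, num_ranks) / num_ranks
    boundaries = np.searchsorted(prefix, targets, side="right")
    return np.split(np.arange(len(cost_per_box)), boundaries)

# Two ranks, seven boxes: each rank ends up with total cost 8.
costs = np.array([4.0, 1.0, 1.0, 2.0, 4.0, 2.0, 2.0])
parts = partition_by_cost(costs, 2)
```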

class boxtree.distributed.partition.BoxMasks(responsible_boxes: Array, ancestor_boxes: Array, point_src_boxes: Array, multipole_src_boxes: Array)[source]

Box masks needed for the distributed calculation. Each of these masks is an array with length tree.nboxes, whose i-th entry is 1 if box i is set.

responsible_boxes

Current process will evaluate target potentials and multipole expansions in these boxes. Sources and targets in these boxes are needed.

ancestor_boxes

Ancestors of the responsible boxes.

point_src_boxes

Current process needs sources but not targets in these boxes.

multipole_src_boxes

Current process needs multipole expansions in these boxes.

boxtree.distributed.partition.get_box_masks(actx, traversal, responsible_boxes_list)[source]

Given the responsible boxes for a rank, this helper function calculates the relevant masks.

Parameters:

responsible_boxes_list – A numpy array of responsible box indices.

Returns:

A BoxMasks object of the relevant masks.
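To illustrate how such masks can be derived, here is a simplified, self-contained sketch. The function name compute_masks and the tiny example tree are hypothetical; the actual get_box_masks additionally derives point_src_boxes and multipole_src_boxes from the traversal's interaction lists.

```python
import numpy as np

def compute_masks(box_parent_ids, responsible_boxes_list):
    """Compute responsible and ancestor masks from a parent-id array."""
    nboxes = len(box_parent_ids)

    # 1 for every box this rank is responsible for.
    responsible = np.zeros(nboxes, dtype=np.int8)
    responsible[responsible_boxes_list] = 1

    # 1 for every ancestor of a responsible box.
    ancestor = np.zeros(nboxes, dtype=np.int8)
    for box in responsible_boxes_list:
        parent = box_parent_ids[box]
        # Walk up to the root (box 0, which is its own parent).
        while parent != box:
            ancestor[parent] = 1
            box, parent = parent, box_parent_ids[parent]
    return responsible, ancestor

# Tiny tree: box 0 is the root, boxes 1-2 its children, boxes 3-4 children of 1.
parents = np.array([0, 0, 0, 1, 1])
resp, anc = compute_masks(parents, np.array([3, 4]))
```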

Construct Local Tree and Traversal

class boxtree.distributed.local_tree.LocalTree(root_extent: Array, box_centers: Array, box_parent_ids: Array, box_child_ids: Array, box_levels: Array, box_flags: Array | None, level_start_box_nrs: Array | None, box_id_dtype: dtype, box_level_dtype: dtype, coord_dtype: dtype, sources_have_extent: bool, targets_have_extent: bool, extent_norm: ExtentNorm, stick_out_factor: float, _is_pruned: bool, sources_are_targets: bool, particle_id_dtype: dtype, sources: Array, source_radii: Array, targets: Array, target_radii: Array, bounding_box: tuple[Array, Array], user_source_ids: Array, sorted_target_ids: Array, box_source_starts: Array, box_source_counts_nonchild: Array, box_source_counts_cumul: Array, box_target_starts: Array, box_target_counts_nonchild: Array, box_target_counts_cumul: Array, box_source_bounding_box_min: Array, box_source_bounding_box_max: Array, box_target_bounding_box_min: Array, box_target_bounding_box_max: Array, box_to_user_rank_starts: Array, box_to_user_rank_lists: Array, responsible_boxes_list: Array, responsible_boxes_mask: Array, ancestor_mask: Array)[source]

Inherits from boxtree.Tree.

box_to_user_rank_starts

box_id_t [nboxes + 1]

box_to_user_rank_lists

int32 [*]

A CSR-like interaction list storage array, together with box_to_user_rank_starts. For each box, the list of ranks which own targets that use the multipole expansion at this box, via either List 3 or (possibly downward propagated from an ancestor) List 2.
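A minimal sketch of how this CSR-style pair of arrays is read (the example arrays and the helper name are hypothetical):

```python
import numpy as np

def ranks_using_box(starts, lists, ibox):
    """Look up, in CSR storage, the ranks that use box ibox's multipole."""
    return lists[starts[ibox]:starts[ibox + 1]]

# Hypothetical 3-box example: box 0's multipole is used by ranks 0 and 2,
# box 1's by rank 1, and box 2's by no remote rank.
box_to_user_rank_starts = np.array([0, 2, 3, 3])
box_to_user_rank_lists = np.array([0, 2, 1], dtype=np.int32)
```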

boxtree.distributed.local_tree.generate_local_tree(actx: ArrayContext, global_traversal, responsible_boxes_list, comm)[source]

Generate the local tree for the current rank.

This is an MPI-collective routine on comm.

Returns:

a tuple (local_tree, src_idx, tgt_idx), where local_tree is a boxtree.distributed.local_tree.LocalTree object for the generated local tree, src_idx contains the indices of the local sources in the global tree, and tgt_idx contains the indices of the local targets in the global tree. src_idx and tgt_idx are needed for distributing source weights from the root rank and for assembling the calculated potentials on the root rank.
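The bookkeeping role of src_idx and tgt_idx can be sketched with plain numpy index arrays. The function names below are hypothetical, not boxtree API; they mirror what distribute_source_weights and gather_potential_results do on the root rank, minus the MPI communication.

```python
import numpy as np

def scatter_weights(source_weights, src_idx_all_ranks):
    """Slice the global source weights into per-rank pieces (root side)."""
    return [source_weights[idx] for idx in src_idx_all_ranks]

def gather_potentials(nglobal_targets, local_potentials, tgt_idx_all_ranks):
    """Assemble per-rank potentials back into the global target ordering."""
    potentials = np.empty(nglobal_targets)
    for pot, idx in zip(local_potentials, tgt_idx_all_ranks):
        potentials[idx] = pot
    return potentials

# Two ranks: rank 0 owns global sources {0, 2}, rank 1 owns {1, 3}.
weights = np.array([1.0, 2.0, 3.0, 4.0])
pieces = scatter_weights(weights, [np.array([0, 2]), np.array([1, 3])])
```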

boxtree.distributed.local_traversal.generate_local_travs(actx, local_tree, traversal_builder, merge_close_lists=False)[source]

Generate the local traversal from the local tree.

Parameters:
  • local_tree – the local tree on which the local traversal object will be constructed.

  • traversal_builder – a function that takes an arraycontext.ArrayContext and a tree, and returns the traversal object built from the tree.

Returns:

the generated local traversal object in device memory

Distributed Wrangler

class boxtree.distributed.calculation.DistributedExpansionWranglerMixin[source]

Distributed expansion wrangler base class.

This class is meant to aid in adding distributed capabilities to wranglers. All distributed wranglers should inherit from this class.

comm
global_traversal
communicate_mpoles_via_allreduce
distribute_source_weights(actx: ArrayContext, src_weight_vecs, src_idx_all_ranks)[source]
gather_potential_results(actx: ArrayContext, potentials, tgt_idx_all_ranks)[source]
communicate_mpoles(actx: ArrayContext, mpole_exps, return_stats=False)[source]

Based on Algorithm 3: Reduce and Scatter in Lashuk et al. [1].

The main idea is to mimic an allreduce as done on a hypercube network, but to decrease the bandwidth cost by sending only information that is relevant to the rank receiving the message.
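The hypercube exchange pattern can be sketched without MPI by simulating all ranks in one process. This is a plain sum-allreduce; the actual communicate_mpoles follows the same round structure but sends each partner only the multipole coefficients its targets need.

```python
import numpy as np

def hypercube_allreduce(rank_data):
    """Simulate a sum-allreduce over a hypercube of 2**d ranks.

    In round k, rank r exchanges its partial result with partner r XOR 2**k
    and both sides add what they received. After log2(p) rounds, every rank
    holds the full sum.
    """
    nranks = len(rank_data)
    data = [d.copy() for d in rank_data]
    k = 1
    while k < nranks:
        # All exchanges in a round happen simultaneously, so pair each
        # rank with its partner's data from the previous round.
        data = [data[r] + data[r ^ k] for r in range(nranks)]
        k *= 2
    return data

# Four simulated ranks, rank r starting with the value r.
parts = [np.array([float(r)]) for r in range(4)]
result = hypercube_allreduce(parts)
# Every simulated rank ends with the total 0 + 1 + 2 + 3 = 6.
```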

Distributed FMM Evaluation

The distributed version of the FMM evaluation shares the same interface as the shared-memory version. To evaluate the FMM in a distributed manner, pass a subclass of boxtree.distributed.calculation.DistributedExpansionWranglerMixin to boxtree.fmm.drive_fmm().