Models for Optimizing Multiple Communication Cost Metrics in Parallelizing Irregular Applications
Cevdet Aykanat, Bilkent University, Turkey
Slides (6 MB in PDF format)
Scaling irregular applications on distributed memory systems necessitates correct encapsulation and optimization of multiple communication cost metrics in those applications. These cost metrics can be categorized under two groups: metrics related to bandwidth costs (such as total or maximum communication volume), and metrics related to latency costs (such as total or maximum message count). Most common and traditional graph/hypergraph models used in the partitioning of irregular applications often encode only the single bandwidth cost metric of total communication volume. This talk discusses four different frameworks to encode multiple bandwidth and/or latency communication cost metrics. All these models utilize existing partitioners in distinct ways and hence do not require development of novel partitioners. Moreover, computational load imbalance is implicitly addressed in all of them, hence we do not emphasize it further.
The first framework relies on a model called communication hypergraph. It addresses two bandwidth cost metrics and one latency cost metric: total communication volume, maximum communication volume, and total message count. It is a two-phase methodology and utilized in the second phase to address the last two cost metrics of the mentioned three metrics. The communication hypergraph model was first proposed for 1D row- or column-parallel Sparse Matrix Vector Multiplication (SpMV), then it was extended to 2D row-column-parallel SpMV and 1D parallel Sparse Matrix Sparse Matrix Multiplication (SpGEMM). Recently, it has been enhanced for encoding send volume balancing in reduce type of communications.
The second framework is based on multi-stage hypergraph partitioning and used for multi-dimensional Cartesian partitioning of data to match the multi-dimensional virtual processor grid organization and enforce upper bounds on the latency cost metrics. This framework addresses one bandwidth cost and provides relatively good upper bounds on two latency cost metrics: total communication volume, maximum message count, and total message count. Note that this framework does not directly optimize the mentioned latency cost metrics. At each stage of the multi-stage partitioning process, it aims to minimize the total communication volume along the respective dimension of the virtual processor grid. A noteworthy feature of this framework is the multiple-constraint formulation used to encode the computational load balance. This framework was first proposed for 2D checkerboard partitioning to scale row-column-parallel SpMV and then enhanced for N-dimensional Cartesian partitioning to scale CPD-ALS algorithm for N-dimensional tensor factorization.
The third framework relies on recursive hypergraph partitioning to address one bandwidth cost metric and one latency cost metric: total communication volume and total message count. It is a single-phase methodology, i.e., it addresses these two metrics in a single partitioning process. In this framework, standard hypergraph models, nets of which encapsulate the total communication volume, are augmented with message nets, which encapsulate the total message count. This framework was first proposed for scaling 1D row- or column-parallel SpMV based on 1D rowwise or columnwise matrix partitioning, and later extended for 2D row-column-parallel SpMV based on 2D fine-grain matrix partitioning.
The last framework is a regularization framework for irregular sparse point-to-point (P2P) communications in latency-bound applications. Instead of minimizing multiple communication metrics as is the case for the three frameworks described so far, this framework offers a flexible medium called virtual process topology (VPT) in order to attain a trade-off between bandwidth and latency cost metrics. This is achieved by using different dimensions in the formation of the VPT, in which a low-dimensional VPT favors bandwidth costs over latency costs while a high-dimensional VPT favors latency costs over bandwidth costs. In the VPT, we organize processes into a topology inspired by the k-ary n-cube networks and regularize irregular P2P messages by imposing regular communication pattern(s) onto them. Although this framework is utilized in scaling latency-bound 1D row-parallel SpMV instances, it can be applied for attaining a trade-off between the bandwidth and latency costs of a given set of irregular P2P messages. This framework is especially tailored for the use cases where the messages are small or medium sized and there is high variance in the sent or received messages.