Distributed Orthogonal Slicing for Load Balancing of Large Spatial Datasets
Imagine you have N objects placed in a three-dimensional space and stored across a distributed network of C compute nodes. The problem is simple:
- how do we divide the universe in such a way that each compute node gets an equal number of objects (N/C)? and
- how do we ensure that the objects on each compute node are as close as possible to each other? and
- how do we ensure that the region of objects on each compute node overlaps as little as possible with the other compute nodes’ regions?
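Stated a bit more formally (my notation, not from any particular reference), with \(O\) the set of \(N\) objects and \(R_i\) the bounding region of compute node \(i\)’s objects:

\[
|O \cap R_i| = \frac{N}{C} \;\;\forall i, \qquad \text{while minimizing} \;\; \max_i \operatorname{diam}(R_i) \;\; \text{and} \;\; \sum_{i \neq j} \operatorname{vol}(R_i \cap R_j).
\]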
This is a very common problem set-up, and one of high importance in large-scale scientific computing. The main rationale is that a good spatial decomposition allows computations over nearby objects to be performed independently and in parallel by all compute nodes, with only the neighbouring elements duplicated (cloned) among nodes.
In a simpler 2D space, one can think of several slicing alternatives:
A small analysis of efficiency: if you pick slicing layout (1), you would incur a lot of duplication of objects, especially when objects are long or when there is a high number of compute nodes. (2) works better by reducing cloning, yet it does not guarantee an accurate spatial decomposition. (3) and (4) are similar, but while one is a sequence of \(\log_2 C\) recursive bisectional decompositions, the other is a 3-level recursive decomposition into \(m \times n \times h = C\) sub-sections.
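For concreteness, take \(C = 64\) compute nodes in a 3D universe:

\[
\text{layout (3): } m \times n \times h = 4 \times 4 \times 4 = 64 \;\Rightarrow\; D = 3 \text{ recursion levels;}
\]
\[
\text{layout (4): } \log_2 C = \log_2 64 = 6 \text{ recursion levels.}
\]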
The following table details the complexity of each algorithm (we assume sorting with average complexity \(n \log n\)):
Across all possible slicing layouts, (3) is the most efficient: it guarantees low data duplication, allows for accurate slicing, and requires only D (dimensions) recursive iterations, independently of the number of compute nodes (unlike (4)). However, its implementation in a distributed-memory environment is not trivial. Possible options are:
- A histogram/binning slicing algorithm that clusters elements into spatial bins and uses the sum of elements per bin as a metric (sketched below);
- A sampling method that performs serial slicing on one compute node, and approximates the slice coordinates based on a portion of the data.
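As an illustration of the first option, here is a minimal serial sketch of the histogram/binning idea. The function name and parameters are mine, not from any library; in a distributed setting, the per-bin counts would first be summed across nodes (e.g. with an MPI all-reduce) before walking the cumulative sum:

```cpp
#include <algorithm>
#include <vector>

// Returns the coordinate at which to cut so that roughly `target` of the
// values in `xs` fall below it, using `nbins` equal-width bins.
// Assumes `xs` is non-empty.
double histogram_cut(const std::vector<double>& xs, size_t target, int nbins) {
    auto [mn, mx] = std::minmax_element(xs.begin(), xs.end());
    double lo = *mn, width = (*mx - lo) / nbins;
    if (width == 0) return *mx;  // all points coincide

    std::vector<size_t> bins(nbins, 0);
    for (double x : xs) {
        int b = std::min<int>(nbins - 1, (int)((x - lo) / width));
        bins[b]++;  // count elements per spatial bin
    }

    // Walk the cumulative sum until the target count is reached.
    size_t cum = 0;
    for (int b = 0; b < nbins; ++b) {
        cum += bins[b];
        if (cum >= target) return lo + (b + 1) * width;
    }
    return *mx;
}
```

The cut is only accurate to within one bin width, which is exactly why these methods are approximate.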
Can we do better?
A technique I published in the Proceedings of the International SuperComputing 2015 is called Sort-Balance-Split: a multi-dimensional, tree-recursive spatial decomposition algorithm. The algorithm follows three steps:
- The sort step performs a distributed sorting of the elements on a given dimension (the distributed sorting algorithm was covered in depth in my previous post);
- The balance step equalizes the elements across compute nodes in such a way that the data remains sorted and each compute node holds N/C elements;
- The split step creates sub-networks from the main network, and performs the same algorithm on the next dimension, on each sub-network, as sketched below.
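A minimal MPI sketch of the idea follows. This is my illustration, not the paper’s implementation: the distributed sort here is a naive `MPI_Allgatherv` of all points (viable only for small inputs; the real algorithm uses the distributed sorting method from the previous post), the split halves the network at each level (which assumes a power-of-two C), and the `Point` struct and function names are mine.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

struct Point { double coord[2]; };  // 2D universe, as in the example below

// Sort + balance on one dimension: afterwards, every rank in `comm` holds
// an equal-sized, contiguous chunk of the points ordered by coord[dim].
static std::vector<Point> sort_balance(const std::vector<Point>& local,
                                       int dim, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // Naive stand-in for a distributed sort: gather all points everywhere.
    int ndoubles = (int)local.size() * 2;  // 2 doubles per Point
    std::vector<int> counts(size), displs(size);
    MPI_Allgather(&ndoubles, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);
    int total = 0;
    for (int i = 0; i < size; ++i) { displs[i] = total; total += counts[i]; }
    std::vector<Point> all(total / 2);
    MPI_Allgatherv(local.data(), ndoubles, MPI_DOUBLE, all.data(),
                   counts.data(), displs.data(), MPI_DOUBLE, comm);

    std::sort(all.begin(), all.end(), [dim](const Point& a, const Point& b) {
        return a.coord[dim] < b.coord[dim];
    });

    // Balance: rank r keeps the r-th chunk of N/C points
    // (N divisible by C, for simplicity).
    size_t chunk = all.size() / size;
    return std::vector<Point>(all.begin() + rank * chunk,
                              all.begin() + (rank + 1) * chunk);
}

// Recursive Sort-Balance-Split over dimensions 0..ndims-1.
static std::vector<Point> sort_balance_split(std::vector<Point> local,
                                             int dim, int ndims,
                                             MPI_Comm comm) {
    local = sort_balance(local, dim, comm);
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (dim + 1 == ndims || size == 1) return local;

    // Split: the lower half of ranks (holding the lower half of the sorted
    // data, since the key preserves rank order) forms one sub-network, the
    // upper half another; each recurses on the next dimension.
    MPI_Comm sub;
    MPI_Comm_split(comm, rank < size / 2 ? 0 : 1, rank, &sub);
    local = sort_balance_split(std::move(local), dim + 1, ndims, sub);
    MPI_Comm_free(&sub);
    return local;
}
```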
This method has several advantages over existing approximate methods while delivering efficient execution: (1) fixed complexity, since the number of communication steps is fixed for any network and input size; and (2) an accurate decomposition.
A sample application of the algorithm on a 2D universe of 16 shapes distributed across 4 compute nodes (ranks) is presented below:
From left to right:
- Initial data layout;
- The initial sorting operation will sort all elements on the first (X) dimension, without caring about the other coordinates;
- Data is load balanced on the X axis, so that each rank holds 4 shapes;
- A network split divides the initial problem of 16 shapes across 4 nodes on one network into two sub-problems of 8 shapes on two networks of 2 nodes each;
- Each network performs a distributed sort of its objects on the next axis (Y), and balances the workload;
- Final data layout;
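Continuing the sketch above, a hypothetical driver reproducing this 16-shape, 4-rank set-up (with shapes reduced to representative points) could look like this; run it with e.g. `mpirun -np 4 ./slicing_sketch`:

```cpp
#include <cstdio>
#include <cstdlib>
// (uses Point, sort_balance_split from the sketch above)

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Initial data layout: each rank starts with 4 arbitrary points.
    std::srand(42 + rank);
    std::vector<Point> local(4);
    for (Point& p : local)
        for (double& c : p.coord)
            c = std::rand() / (double)RAND_MAX;

    // Sort-balance on X, split the network, sort-balance on Y.
    local = sort_balance_split(std::move(local), 0, 2, MPI_COMM_WORLD);

    // Final data layout: each rank now owns 4 points in a compact region.
    for (const Point& p : local)
        std::printf("rank %d: (%.2f, %.2f)\n", rank, p.coord[0], p.coord[1]);

    MPI_Finalize();
    return 0;
}
```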
Here’s a more detailed example (skip it if you’ve already understood):
And here is an application of distributed slicing to the N-particles (N-body) problem, which enables one to simulate cosmic activity such as the aftermath of the Big Bang, and whose solvers have won the Gordon Bell Prize for Scientific Computing several times. And an application to a small network of neurons. Enjoy the view.
For limitations, benchmarks and a more thorough analysis of the problem, check the original publication.
For the C++ code implementation, download slicing.tar.gz.