Chapter 1.  Introduction

This thesis presents scalable dynamic algorithms to solve the problems of load balancing and mapping in distributed computer systems.  The  load balancing problem partitions a large set of concurrent processes into equally balanced subsets for execution on a small set of computers.  The  mapping problem partitions the same set of processes into connected sub-graphs for the same purpose.  Both problems have the goal of optimizing the performance of the processes on the computers: in the case of load balancing, to equalize (and therefore minimize) the time required by each computer; in the case of mapping, to minimize the amount of communication among computers.

(Footnote: the mapping and load balancing problems are similar to a number of other important problems, such as the problem of partitioning circuits for VLSI placement and simulation.  The approach that is demonstrated here may be applicable to some of these other problems.)

These algorithms are constructed using an approach based on spectral properties of graphs: graphs of physical network links among computers in the load balancing problem, and graphs of logical communication channels among processes in the mapping problem.  In both problems the approach starts by characterizing problem solutions as equilibrium distributions of appropriately chosen scalar quantities.  For the load balancing problem this characterization is easily arrived at, since every solution is an equilibrium distribution of workload among a set of computers.  For the mapping problem the characterization is somewhat less obvious, and the equilibrium sought is an equal distribution of  communication distance as measured along paths through a physical interconnection network.

After the problem has been formulated in this way an algorithm is constructed from an iterative procedure to solve the Laplace equation on the vertices of the graph.  The procedure to solve the Laplace equation is analogous to a process in which an initial dis-equilibrium condition diffuses (or relaxes) to an equilibrium.  The iterative procedure yields a matrix of coefficients known as the  Laplacian matrix of the graph.  Spectral properties of this matrix can be used to analyze the convergence and scaling properties of the resulting algorithms.  The thesis uses these properties to prove a  scalability theorem which says that the resulting algorithms will reduce a discrete  disturbance at a constant rate regardless of the scale of the system in which the disturbance occurs.  Constant convergence can be a useful property for algorithms that execute on scalable computer systems, and a necessary property for the largest systems.

The scalability theorem assumes a simplified model in which individual imbalances arise, one after another, with each imbalance confined to a single computer.  In real applications this simplified model may not be realistic.  Imbalances may arise on several computers simultaneously, as a result of several unrelated events, or as a result of a single event that involves several computers.  When the number of such events is fixed the rate of convergence of the algorithms is not significantly affected by problem scale, just as in the case of a single disturbance.  Scale-independent convergence is also the expected behavior for any number of events, when they occur at random times, on randomly selected computers, and with random magnitude.

These claims of scalability are tested in an implementation of a distributed algorithm for image synthesis.  A dynamic load balancing algorithm is implemented according to the approach presented here, and used to support a photo-realistic Monte Carlo path tracing algorithm.  The scaling of these two algorithms are measured, and the observed load balancing solutions are compared to solutions produced by competing load balancing algorithms.  The measurements show that the performance of the load balancing and Monte Carlo algorithms are consistent with the scalability theorem.  The comparison shows that the load balancing algorithm is superior, in both cost and quality of solution, to a popular load balancing algorithm for ray-tracing on parallel computer systems.

A subsequent chapter presents the dynamic mapping problem and derives an algorithm to solve it.  The mapping algorithm is applied to a simple model problem that illustrates some of the issues that arise in a distributed application.  A final chapter discusses issues in developing the software for this study and shows a partial program trace for one of the many data points that were collected.  The purpose of presenting this information is to aid in reproducibility, by showing the many ancillary issues that must be addressed in order to implement the complete application.  It also shows explicitly how the data was gathered.

Scalability is the principal feature that distinguishes these algorithms from competing approaches.  The algorithms are distributed and concurrent, have no central thread of control, and require no centralized communication.  These properties are necessary for achieving scalable performance on the largest distributed computer systems.  Examples of such systems include "massively parallel" computer systems which have hundreds and in some cases thousands of processors connected by high performance communication networks.  Another example is the Internet which has millions of computers connected by a relatively poor network in which communication is slow and synchronization is impractical.

When scalability is not important the algorithms presented in this thesis have little advantage over competing approaches.  The load balancing and mapping problems can always be solved by serial algorithms that compute a solution on a single computer and then broadcast the result to the other computers.  A serial approach may often be the simplest to implement, and may be satisfactory for small parallel computer systems.  However this approach will not give scalable performance on large computer systems, and may not even be feasible on a system such as the Internet.  In these cases algorithms such as those presented here may be the only ones that are practical.
 

Summary of original contributions

This thesis presents novel problem formulations, algorithms, literature surveys, algorithm  analysis, and empirical data.  The key contributions are listed here.