For quite some time, the Work Stealing algorithm has been the de facto standard for scheduling multithreaded computations. To ensure scalability and achieve high performance, work is scattered across processors. In turn, each processor owns a concurrent work queue that it uses to keep track of its assigned tasks. When a processor's work queue becomes empty, the processor becomes a thief and starts targeting victims uniformly at random, from which it attempts to steal tasks. This strategy has proven efficient in both theory and practice, and is currently used in state-of-the-art Work Stealing algorithms.
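The discipline described above can be illustrated with a minimal, single-threaded simulation. This sketch is purely illustrative (the names `run`, `num_procs`, and the round-robin scheduling of processor turns are assumptions, not part of any real scheduler): each processor owns a double-ended queue, works on tasks from one end, and, when idle, steals from the opposite end of a victim chosen uniformly at random.

```python
import random
from collections import deque

def run(num_procs, tasks, seed=0):
    """Toy simulation of Work Stealing; illustrative only."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_procs)]
    for i, task in enumerate(tasks):
        queues[i % num_procs].append(task)   # scatter work across processors
    done = []
    while any(queues):
        for p in range(num_procs):
            if queues[p]:
                done.append(queues[p].pop()) # busy: take own bottom-most task
            else:
                victim = rng.randrange(num_procs)   # idle: become a thief
                if victim != p and queues[victim]:
                    # steal from the opposite (top) end of the victim's queue
                    queues[p].append(queues[victim].popleft())
    return done
```

In a real implementation the queues are concurrent data structures accessed by multiple threads, which is precisely what forces the memory fences discussed next.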
Nevertheless, purely receiver-initiated load balancing schemes, such as Work Stealing's, are known to be unsuitable for scheduling computations with little or unbalanced parallelism. Moreover, due to the concurrent nature of work queues, even local operations require memory fences, which are extremely expensive on modern computer architectures. Consequently, even when a processor is busy, it may incur costly overheads caused by local accesses to its work queue. Finally, as the scheduler's load balancer relies on random steals, its performance when executing memory-bound computations is very limited. Despite all efforts, no silver bullet has been found, and, even worse, all these limitations still exist in state-of-the-art Work Stealing algorithms.
In this thesis we make three major theoretical contributions, addressing each of the aforementioned limitations. First, we prove that Work Stealing can easily be extended to make use of custom load balancers that, for various classes of workloads (e.g. memory-bound computations), can greatly boost the scheduler's performance, while, at the same time, maintaining Work Stealing's high performance in the general setting. Then, we present a provably efficient scheduler that mixes receiver- and sender-initiated policies, and theoretically show that it successfully overcomes Work Stealing's limitations for the execution of computations with little or irregular parallelism. Finally, we present a novel scheduling algorithm whose expected runtime bounds are optimal within a constant factor, and that avoids most of the costs associated with memory fences, bounding the total expected overheads incurred by memory fences to O(PT∞), where T∞ is the critical-path length of a computation and P is the number of processors. This contrasts with state-of-the-art Work Stealing algorithms, where the total overheads incurred by these synchronization mechanisms can grow proportionally with the total amount of work. From this perspective, our proposal greatly improves upon state-of-the-art Work Stealing algorithms. In fact, as we will prove, for several classes of computations, the overheads incurred by our algorithm are exponentially smaller than the overheads incurred by state-of-the-art Work Stealing algorithms.
Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa