A System’s Approach to Cache Hierarchy-Aware Decomposition of Data-Parallel Computations
The architecture of nowadays’ processors is very complex, comprising several computational cores and an intricate hierarchy of cache memories. The latter, in particular, differ considerably between the many processors currently available in the market, resulting in a wide variety of configurations. Application development is typically oblivious of this complexity and diversity, taking only into consideration the number of available execution cores. This oblivion prevents such applications from fully harnessing the computing power available in these architectures. This problem has been recognized by the community, which has proposed languages and models to express and tune applications according to the underlying machine’s hierarchy. These, however, lack the desired abstraction level, forcing the programmer to have deep knowledge of computer architecture and parallel programming, in order to ensure performance portability across a wide range of architectures. Realizing these limitations, the goal of this thesis is to delegate these hierarchy-aware optimizations to the runtime system. Accordingly, the programmer’s responsibilities are confined to the definition of procedures for decomposing an application’s domain, into an arbitrary number of partitions. With this, the programmer has only to reason about the application’s data representation and manipulation. We prototyped our proposal on top of a Java parallel programming framework, and evaluated it from a performance perspective, against cache neglectful domain decompositions. The results demonstrate that our optimizations deliver significant speedups against decomposition strategies based solely on the number of execution cores, without requiring the programmer to reason about the machine’s hardware. These facts allow us to conclude that it is possible to obtain performance gains by transferring hierarchy-aware optimizations concerns to the runtime system.