This chapter surveys the main issues involved in correctness debugging of parallel and distributed programs. Distributed debugging is an instance of the more general problem of observing a distributed computation. The chapter briefly summarizes the theoretical foundations of distributed debugging, then surveys the main methodologies used for parallel and distributed debugging: state-based and event-based debugging, deterministic re-execution, systematic state exploration, and correctness predicate evaluation. These approaches are complementary, and the chapter discusses how they can be supported using distinct techniques for observation and control.
Parallel Program Development for Cluster Computing: Methodology, Tools and Integrated Environments