RMIT University

Scalable relative debugging

journal contribution
posted on 2024-11-02, 07:15 authored by Minh Dinh, David Abramson, Chao Jin
Detecting and isolating bugs that arise only at high processor counts is a challenging task. Over a number of years, we have implemented a debugging method called 'relative debugging' that supports debugging applications as they evolve or are ported to larger machines. It allows a user to compare the state of a suspect program against that of a reference version, even as the number of processors is increased. The key idea is to compare runtime data in order to reason about the state of the suspect program. While powerful, a naïve implementation of the comparison phase does not scale to large problems running on large machines. In this paper, we propose two solutions: a hash-based scheme and a direct point-to-point scheme. We demonstrate the implementation, a case study, and the performance of our techniques on 20K cores of a Cray XE6 system.

History

Journal

IEEE Transactions on Parallel and Distributed Systems

Volume

25

Number

6487495

Issue

3

Start page

740

End page

749

Total pages

10

Publisher

IEEE

Place published

United States

Language

English

Copyright

© 2014 IEEE

Former Identifier

2006094370

Esploro creation date

2020-06-22

Fedora creation date

2019-10-23