Significance of OS Noise in HPC Applications
Operating system interference with parallel application performance is receiving significant attention in the high-performance computing (HPC) community. Custom lightweight operating systems from IBM and Cray allow applications to scale extremely well on anywhere from several thousand to more than one hundred thousand processors. In contrast, studies have shown that interference in commodity operating system environments can cause a 50% performance penalty compared to these specialized solutions. Unfortunately, there is no general consensus on which interference characteristics matter most to application performance. Candidate characteristics include the type of interference (e.g., CPU, network, cache), the frequency and duration of the noise, the scale at which noise becomes significant, and the balance of the system (e.g., relative processing and network performance).
Our goal is to build an infrastructure that synthetically generates a known quantity and type of noise, so that its performance impact on HPC applications can be measured. We are building this infrastructure into the Catamount lightweight operating system, which runs on nearly twenty-six thousand AMD Opteron cores on the Cray Red Storm system at Sandia National Laboratories. Because Catamount's own noise profile is extremely low, it is an ideal base for reproducing the noise signatures of many other HPC operating systems, such as Linux, so that the impact of each signature can be analyzed and studied in isolation. Additionally, since Red Storm is a large-scale, highly balanced system, we can also explore the impact of system balance by using hardware-based mechanisms that degrade the machine's compute and network performance.
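To make "a known quantity" of noise concrete, here is an illustrative model of the simplest signature described by the frequency and duration parameters mentioned earlier: a hypothetical CPU interruption of fixed length that fires once per fixed period. This is not the Catamount implementation. The function names and the assumption of strictly periodic, fixed-length interruptions are hypothetical. The model computes the share of CPU time the noise consumes and how long a fixed amount of work takes when it is interrupted. It also assumes a bulk-synchronous step in which every process must finish before any can continue, so the slowest process sets the step time.

```python
def noise_fraction(frequency_hz, duration_us):
    """Fraction of CPU time consumed by a periodic noise source."""
    return frequency_hz * duration_us * 1e-6


def stretched_time_us(work_us, period_us, duration_us, start_us=0):
    """Wall-clock time to finish `work_us` of computation when the CPU is
    unavailable for the first `duration_us` of every `period_us` interval
    (hypothetical strictly periodic noise signature)."""
    if period_us <= 0 or duration_us >= period_us:
        raise ValueError("need 0 <= duration_us < period_us")
    t = start_us
    remaining = work_us
    while remaining > 0:
        phase = t % period_us
        if phase < duration_us:
            # Inside a noise event: the application makes no progress.
            t += duration_us - phase
            continue
        # Compute until the next noise event or until the work is done.
        step = min(period_us - phase, remaining)
        t += step
        remaining -= step
    return t - start_us


def bsp_step_time_us(work_us, period_us, duration_us, offsets_us):
    """Step time of a bulk-synchronous phase: every process must finish,
    so the slowest one (given its noise phase offset) sets the pace."""
    return max(stretched_time_us(work_us, period_us, duration_us, off)
               for off in offsets_us)
```

For example, 25 µs of noise every 1 ms costs only 2.5% of CPU time. A 1000 µs compute phase that starts just as a noise event fires straddles two events and finishes in 1050 µs, a 5% slowdown. The same phase offset by 500 µs finishes in 1025 µs. In a synchronized step across both processes, the 1050 µs process determines the step time.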
The results of this work will provide insight into several areas: which noise characteristics of an operating system matter for HPC platforms, how sensitive applications are to noise in combination with system balance, and how to project the behavior of an application given a particular noise signature and set of balance measurements. Moreover, this infrastructure will be critical in directing the design and implementation of future lightweight operating systems tailored specifically to large-scale, multi-core parallel processing systems.
Publications
- An Infrastructure for Characterizing the Sensitivity of Parallel Applications to OS Noise.
Kurt Ferreira, Ron Brightwell, and Patrick Bridges. 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06), Work-in-Progress Session, November 2006 (statement and presentation).
Acknowledgments
Many people have provided invaluable support on this project, including, in no particular order, Kevin Pedretti, Trammell Hudson, Suzanne Kelly, Barry Oliphant, Courtenay Vaughan, Joel Stevenson, Mahesh Rajan, Mark Taylor, and Barney Maccabe.
Funding
This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy.