
HPC system management is essential to the success of research, development and computational applications. It is a complex task, given the diversity of hardware technologies, operating systems, middleware and applications that must be deployed and managed to meet the demanding performance expectations placed on these systems. The resource requirements of HPC systems are increasing in terms of processors, memory and networks alike. The architectural design also influences which resources are available. All these factors, which have lasting consequences for HPC practitioners, need to be considered in system management as a whole.
In addition to the specific components that are mandatory for the implementation and configuration of a particular machine (for example CPUs, memory, I/O ports, or a particular piece of software or middleware), there is also a variety of optional subsystems that can be used for enhancements, or that may make up the infrastructure required by groups of users who need large data sets (such as databases) and applications (such as web servers). This variety of subsystems increases the complexity of system management in HPC. Moreover, there is a class of applications that require high computing power for a short period. These applications need a short turnaround between a configuration being proposed and that configuration being available for execution. The group of system management tools and techniques that facilitates these tasks is called ‘reconfiguration management’. The ability to reconfigure quickly and with minimal application downtime is crucial to HPC system management, and it is explored in this chapter.
Computational performance is determined not only by the specification of the hardware components but also by the software environment (the operating system, middleware and applications in use). HPC system management is typically embedded in this software environment and needs to be considered as part of overall system management. As the complexity of HPC systems increases, the choice of alternative operating systems and middleware will become increasingly important for effective HPC system management.
Several kinds of solutions are available to address the difficulties that arise from multi-component computers: built-in solutions, custom solutions and community solutions. The user community is divided into hardware vendors, system administrators, operating system designers and application developers, each with different opinions about system management problems. The end result is a large number of overlapping software solutions, which can make system management difficult. The same factor explains why not all HPC systems are managed in the same way, despite the shared characteristics that set them apart from traditional computer systems.
Management of HPC systems involves the selection, acquisition, deployment and ongoing administration of those systems within an organization. Its goal is to make the best use of computing resources to accomplish a particular task or mission. Often these tasks have strict deadlines that must be met, and they may require high performance levels throughout the development and operation phases of a project. This demand puts a premium on software tools and methods that allow quick configuration changes or reconfiguration at multiple levels within an HPC system (individual computer, cluster, cluster under control). The challenge may be met through end-user tools such as information systems (for example enterprise resource planning, project management and configuration management systems) or through different types of system management software.
Unfortunately, not all HPC system administrators have the same level of experience with these tools and techniques. This chapter reviews the different types of HPC system management technologies and techniques that make HPC systems easier to manage. Many of these tools are in use today in research and development programs around the world.
System configuration management (SCM) is a branch of system administration that deals with the organization, optimization and control of software configuration files.
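As an illustration of what "control" of configuration files can mean in practice, the following minimal Python sketch records a content digest for each configuration file and later reports which files have drifted from that baseline. It is not taken from any particular SCM tool: the function names (file_digest, snapshot, detect_drift) and the choice of SHA-256 digests are assumptions made for this example only.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Map each existing file path (as a string) to its content digest.

    Paths that do not point at a regular file are skipped, so a deleted
    file simply disappears from the snapshot.
    """
    return {str(p): file_digest(p) for p in paths if Path(p).is_file()}


def detect_drift(baseline, current):
    """Compare two snapshots and report changed, added and removed files."""
    changed = sorted(
        p for p in baseline if p in current and baseline[p] != current[p]
    )
    added = sorted(p for p in current if p not in baseline)
    removed = sorted(p for p in baseline if p not in current)
    return {"changed": changed, "added": added, "removed": removed}
```

A typical use would be to call snapshot() on the set of managed configuration files after a known-good deployment, store the result, and later pass a fresh snapshot to detect_drift() to see which files no longer match. Production configuration management systems do considerably more than this (templating, ordering, rollback), so the sketch covers only the change-detection step.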