
Clustrx supports every level of a cluster’s infrastructure, from bootable operating system images on compute nodes to user and administrator interfaces that are fully abstracted from the lower levels of the architecture. Most Clustrx system components are implemented in Erlang/OTP and build on Erlang’s strengths to deliver a scalable set of robust, distributed services.
Comprehensive Monitoring & Resource Management
The backbone of a Clustrx-driven computing cluster is its monitoring, management and control system, Clustrx Watch. Clustrx Watch is an innovative cluster-wide monitoring system capable of surveying millions of checkpoints (hardware sensors, SNMP data sources, traps, kernel and software metrics) in nearly real time, while scaling linearly. It features a hierarchical architecture of service data collection, aggregation, distribution, processing and logging that has been engineered to serve multi-petascale systems.
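The hierarchical collection and aggregation described above can be illustrated with a minimal Python sketch. It is not Clustrx Watch code: the `Summary` type, the `summarize` and `merge` functions, and the rack names and temperature values are all hypothetical. The sketch only assumes that each level forwards compact summaries instead of raw readings, which is one common way for such a hierarchy to keep the upper levels lightly loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Fixed-size digest of many readings of one metric (hypothetical type)."""
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(readings):
    """Leaf level: reduce the raw readings of one group of nodes to a Summary."""
    values = list(readings)
    if not values:
        raise ValueError("need at least one reading")
    return Summary(len(values), sum(values), min(values), max(values))


def merge(summaries):
    """Upper levels: combine child summaries without seeing raw readings."""
    parts = list(summaries)
    if not parts:
        raise ValueError("need at least one summary")
    return Summary(
        count=sum(p.count for p in parts),
        total=sum(p.total for p in parts),
        minimum=min(p.minimum for p in parts),
        maximum=max(p.maximum for p in parts),
    )


# Illustrative data: temperature readings (degrees Celsius) grouped by rack.
racks = {"rack-a": [41.0, 43.5, 39.0], "rack-b": [55.0, 52.5]}
rack_summaries = [summarize(values) for values in racks.values()]
cluster = merge(rack_summaries)
```

Because `merge` works on summaries alone, the same function can be applied again at each higher level, and the data each parent receives is proportional to its number of children rather than to the number of sensors below it.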
Clustrx Watch includes an advanced power manager integrated with its resource manager. Any unused hardware can be switched off quickly. An emergency shutdown system guards hardware against critical failures when cooling or power fails. The nodes that host the monitoring system are mutually replaceable: as soon as any of them is found to be in trouble, the others take over its roles smoothly and transparently. This contributes to a resilient architecture with no single point of failure.
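The idea of mutually replaceable monitoring nodes can be sketched as follows. The text does not describe the actual role-exchange algorithm, so this Python sketch assumes a simple policy: each role held by a failed node moves to the healthy node that currently holds the fewest roles. The function name, node names and role names are hypothetical.

```python
def reassign_roles(assignments, failed_node):
    """Return a new node -> roles mapping with the failed node's roles moved.

    Assumed policy (not taken from Clustrx): each orphaned role goes to the
    healthy node with the fewest roles; ties are broken by node name so the
    result is deterministic.
    """
    healthy = {
        node: list(roles)
        for node, roles in assignments.items()
        if node != failed_node
    }
    if not healthy:
        raise RuntimeError("no healthy management node left")
    for role in assignments.get(failed_node, []):
        target = min(healthy, key=lambda node: (len(healthy[node]), node))
        healthy[target].append(role)
    return healthy


# Illustrative layout of monitoring roles across three management nodes.
before = {
    "mgmt-1": ["collector"],
    "mgmt-2": ["aggregator", "logger"],
    "mgmt-3": ["power-manager"],
}
after = reassign_roles(before, "mgmt-2")
```

In this example both roles of `mgmt-2` survive its failure: `aggregator` moves to `mgmt-1` and `logger` moves to `mgmt-3`. A production system would also need failure detection and agreement between nodes on who has failed, which this sketch leaves out.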
The resource manager for Clustrx (based on SLURM) is connected with the monitoring system, and relies largely on it. Its purpose is to launch computing jobs on compute nodes and track their execution. If Clustrx Watch finds nodes in a critical state, the two layers use sophisticated logic to redistribute the computing load between the nodes.
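The interaction between monitoring and resource management can be sketched in Python under stated assumptions. This is not the SLURM API and not the logic Clustrx uses, which the text does not specify. The `redistribute` function, the `"ok"` and `"critical"` state labels, and the per-node job capacities are hypothetical. The sketch moves jobs off nodes reported as critical onto the least-loaded healthy nodes, and leaves a job pending when no healthy node has a free slot.

```python
def redistribute(jobs, node_state, capacity):
    """Move jobs away from critical nodes.

    jobs:       job name -> node it currently runs on
    node_state: node -> "ok" or "critical" (as reported by monitoring)
    capacity:   node -> maximum number of concurrent jobs
    Returns (new placement, jobs left pending).
    """
    # Jobs on healthy nodes stay where they are.
    placement = {job: node for job, node in jobs.items()
                 if node_state[node] == "ok"}
    load = {node: 0 for node, state in node_state.items() if state == "ok"}
    for node in placement.values():
        load[node] += 1

    pending = []
    displaced = sorted(job for job, node in jobs.items()
                       if node_state[node] != "ok")
    for job in displaced:
        candidates = [node for node in load if load[node] < capacity[node]]
        if not candidates:
            pending.append(job)  # no free slot anywhere: wait in the queue
            continue
        # Least-loaded healthy node first; node name breaks ties.
        target = min(candidates, key=lambda node: (load[node], node))
        placement[job] = target
        load[target] += 1
    return placement, pending


# Illustrative state: node n2 has been flagged as critical.
jobs = {"job-1": "n1", "job-2": "n2", "job-3": "n2", "job-4": "n3"}
node_state = {"n1": "ok", "n2": "critical", "n3": "ok"}
capacity = {"n1": 2, "n2": 2, "n3": 1}
placement, pending = redistribute(jobs, node_state, capacity)
```

Here `job-2` is moved to `n1`, which still has a free slot, while `job-3` stays pending because both healthy nodes are full.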
Robust Performance
The basis for robust performance and no single point of failure is the division of a cluster into computing and management nodes and the creation of a deep hierarchical control structure. A cluster’s hardware and software resources are redistributed by the OS management infrastructure transparently for the user, to achieve stable and safe operation.
System services are distributed and virtualised, and run on management nodes in a floating, mutually replaceable fashion. The most important services include AAA (user accounts, authentication, authorisation), highly controllable booting and configuration of compute nodes, and a single configuration database named dConf (Distributed Configuration) that supports customised access.
Single-Point Administration
Clustrx OS views an HPC cluster as a single supercomputing machine, a “black box” that aggregates the computing power of a large number of nodes (which may run on entirely different hardware and system platforms) into a comprehensible and scalable service that can be deployed, controlled, and distributed from a single point. This single point includes command-line and graphical interfaces, from which any administration task can be done manually or automated with scripts and OpenAPIs.
The whole suite can be deployed within hours, requiring only modest human effort. The queue of computing jobs, the user access rights, and assigned limits of computing resources are all controlled via a single administration interface.
Easy To Use
Users work with an HPC cluster under Clustrx OS as they would with a single Linux/UNIX machine, but with far more computing power at their disposal. Users can specify requirements for a compute node boot image, check for the presence of certain tools and libraries, and request whatever resources they need to run their compute jobs. Both command-line and graphical interfaces are available.