AI and High Performance Computing

Director of the Scientific Unit: Prof. Luca Benini

 

In High Performance Computing (HPC), scientific supercomputing and cloud computing, a vast array of AI methodologies has been proposed in recent years and is currently being explored. These cover a wide range of aspects, from data centre automation to improved code and resource usage, as well as broader issues such as energy and computer architecture optimization. Conversely, HPC systems should be regarded as an enabling technology for the whole AI research area, since the creation of large-scale and efficient AI is a cornerstone for the future development of the field. In this area, a key research focus of ALMA AI will be the development and deployment of HPC infrastructure optimized for AI-based computations (e.g. training of large Deep Learning models, solution of complex optimization problems, simulations with huge numbers of agents/components, etc.).

Another aspect where AI can bring remarkable improvements is supporting HPC applications and reducing the "time to science", that is, the overall time needed to deliver the results of a computation performed on a supercomputer. In this area, Deep Learning (DL) models can be used to estimate the duration of scientific applications as a function of both the simulation parameters and the configuration of the hardware used to run the experiments. Such models can then support HPC program developers in choosing the best hardware and software configuration (parallelism, node type, hardware accelerators, etc.), and can automatically adapt the application configuration at run-time (self-tuning), with great benefits in terms of time, computational cost and energy cost.

Another strategic research direction in ALMA AI is data centre automation, where AI-inspired models are employed to learn the system behaviour, either from historical data or from simulators, and to provide real-time assistance to supercomputer administrators. The adoption of Machine Learning (ML) and Deep Learning (DL) models (sub-symbolic, black-box techniques that rely on computational power and large amounts of historical data collected from real systems) has proved extremely fruitful in the related challenges of anomaly detection (identifying critical or undesired states in real time) and fault prediction (forecasting critical situations before they occur).
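The principle behind anomaly detection can be sketched as follows: learn what "normal" looks like from historical data, then flag new readings that deviate too far from it. The example uses a simple z-score test as a stand-in for the ML/DL models mentioned above; the temperature readings, the function names and the threshold are assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical node temperatures (degrees Celsius) recorded during normal operation.
HEALTHY_HISTORY = [60, 61, 59, 60, 62, 58, 60]

def fit_baseline(history):
    """Summarize normal behaviour as (mean, sample standard deviation)."""
    return mean(history), stdev(history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a reading whose distance from the mean exceeds `threshold` deviations."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

baseline = fit_baseline(HEALTHY_HISTORY)
for reading in (61, 75):
    print(reading, is_anomalous(reading, baseline))
```

Learned models replace the fixed mean and deviation with a representation of normal behaviour across many correlated sensors, but the decision still amounts to comparing a deviation score against a threshold.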

In line with broader societal and environmental concerns, in recent years the HPC community has put great effort into finding effective ways to reduce the power consumption of HPC facilities, either by developing new hardware and software solutions or by optimizing the management of existing systems. AI models based on optimization techniques such as Constraint Programming or Mixed Integer Linear Programming have opened promising directions for reducing the energy and power consumption of supercomputers by acting on application scheduling, resource management policies, and pricing schemes. In this context, the development of accurate ML/DL models that characterize the thermal behaviour of individual supercomputer components and of the whole system has provided substantial benefits.
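To illustrate the kind of decision involved in power-aware scheduling, the sketch below chooses which queued jobs to start so that as many nodes as possible are used without exceeding a power cap. It enumerates all subsets exhaustively, which is only viable for a handful of jobs; a Constraint Programming or MILP solver would take its place at realistic scale. The job names, power figures and objective are assumptions introduced for this example.

```python
from itertools import combinations

# Hypothetical queued jobs: name -> (estimated power in kW, nodes requested).
JOBS = {"cfd": (40, 16), "md": (25, 8), "train": (60, 32), "post": (10, 4)}

def best_job_set(jobs, power_cap_kw):
    """Exhaustively pick the job subset using the most nodes within the power cap."""
    best, best_nodes = (), 0
    names = sorted(jobs)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            power = sum(jobs[name][0] for name in subset)
            nodes = sum(jobs[name][1] for name in subset)
            if power <= power_cap_kw and nodes > best_nodes:
                best, best_nodes = subset, nodes
    return set(best), best_nodes

print(best_job_set(JOBS, power_cap_kw=100))
```

The per-job power estimates are exactly what the predictive models described above would supply, which is why prediction and optimization are usually combined.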

A non-exhaustive list of research directions explored by ALMA AI in connection with high-performance and high-throughput computing is the following:

  • DL models for anomaly detection and fault prediction
  • DL models for the prediction of power & energy consumption of HPC applications and systems
  • DL models to estimate the duration of scientific applications running on supercomputers (either the whole application or its sub-parts)
  • Optimization models to improve the management of supercomputers, in terms of resource usage, code efficiency, quality of service for users, and management costs for facility owners
  • Combination of optimization approaches and ML models to reduce the power consumption of HPC facilities
  • DL models to characterize the thermal behaviour of supercomputing nodes and the whole HPC system
  • Compute architectures and systems for efficient model inference and training
  • New-generation programming paradigms and toolchains for the efficient use of architectures optimized for AI
  • New-generation and brain-inspired architectures for accelerating Machine Learning and AI