ANDREAS (Artificial intelligence traiNing scheDuler foR disaggrEgAted resource clusterS) aims to address two key market needs: efficient use of resources and reduced power consumption. Today, Artificial Intelligence (AI) and Deep Learning (DL) methods are used for a wide range of applications and are supported by several HW and SW platforms. Larger and more sophisticated use of AI models uncovers new areas for improvement in managing the power footprint of GPU-based training and retraining of ML models across various deployments, from on-premises systems and mid-size infrastructures (such as EU cloud operators and HPC centers) to large providers and edge/fog systems.
DL models are trained on GPU-based systems, which consistently achieve a 5-40x speedup over CPU-based servers. The ability to optimize the efficiency of the infrastructure under stringent power constraints is key for cloud and datacenter operators and for HPC centers that provide GPU computing power for AI/ML purposes. Local operators and centers are subject to power consumption quotas, and their energy bill depends on the efficiency of the deployed infrastructure. While advanced solutions exist for managing virtual servers or containers, the growing complexity of ML models running on GPU-based servers requires keeping power consumption under hard quotas while optimizing the allocation of the GPUs, which are high-value assets.
ANDREAS aims to develop advanced scheduling solutions that optimize the run-time performance of DL training and minimize the energy consumption of the training phase in aggregated and disaggregated GPU-based clusters. The architecture envisioned in ANDREAS is based on a SLURM queue manager, a pool of CPU-based servers, a pool of GPUs accessed through a switch, and intelligent modules that interact with the job scheduler to predict each application's power consumption and performance. Training jobs are submitted to SLURM and are characterized by a deadline and a priority (i.e., a weight). Jobs are never rejected, but they may be delayed. The final goal is to minimize the weighted job tardiness subject to the power budget established by the System Administrator or by the user. ANDREAS is a 10-month project, and the team plans to build early prototypes of the solution by fall 2020.
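To make the objective concrete: if job j has weight w_j, deadline d_j, and completion time C_j, its weighted tardiness is w_j * max(0, C_j - d_j), and the scheduler seeks to minimize the sum over all jobs while the total power drawn by running jobs stays within the budget. The Python sketch below illustrates this under strong simplifying assumptions and is not the ANDREAS scheduler itself: the `Job` fields, the constant per-job power draw, the known durations, and the earliest-deadline-first greedy rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are illustrative assumptions, not part of ANDREAS or SLURM.
    name: str
    duration: float  # predicted training time, in hours
    deadline: float  # hours after time 0
    weight: float    # priority of the job
    power: float     # assumed constant power draw while running, in watts


def weighted_tardiness(jobs, finish_times):
    """Sum of weight * max(0, completion time - deadline) over all jobs."""
    return sum(
        j.weight * max(0.0, finish_times[j.name] - j.deadline) for j in jobs
    )


def greedy_schedule(jobs, power_budget):
    """Earliest-deadline-first greedy heuristic (an assumed stand-in policy).

    Jobs are never rejected: a job that does not fit in the remaining
    power budget simply waits until a running job completes.
    Returns a dict mapping job name to completion time.
    """
    pending = sorted(jobs, key=lambda j: (j.deadline, -j.weight))
    running = []  # list of (finish_time, job)
    finish = {}
    now = 0.0
    while pending or running:
        used = sum(j.power for _, j in running)
        waiting = []
        for j in pending:
            if used + j.power <= power_budget:
                running.append((now + j.duration, j))
                used += j.power
            else:
                waiting.append(j)  # delayed, not rejected
        pending = waiting
        if not running:
            raise ValueError("a job exceeds the power budget on its own")
        # Advance the clock to the next job completion.
        running.sort(key=lambda item: item[0])
        now, done = running.pop(0)
        finish[done.name] = now
    return finish
```

In this toy model, tightening the power budget forces jobs to run one after another and increases the weighted tardiness, which is the trade-off the project description refers to. A real scheduler would also need the predicted power and performance figures that the intelligent modules are meant to supply.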
TETRAMAX is a Horizon 2020 innovation action within the European Smart Anything Everywhere (SAE) initiative in the domain of customized and low-energy computing for Cyber-Physical Systems and the Internet of Things. As a Digital Innovation Hub, TETRAMAX aims to bring added value to European industry, helping it gain competitive advantage through faster digitization. The project partially builds on experience from the TETRACOM project, which ran from 2013 to 2016. TETRAMAX was launched in Sep 2017 and runs until Dec 2021.