HPC resource management improvement using Reinforcement Learning
I studied the problem of resource allocation in HPC clusters during a four-month internship (April to July 2020) at the Bull (Atos) R&D center in Échirolles, in the Cognitive DataCenter team.
Scheduling in an HPC cluster
High Performance Computing (HPC) clusters are large data centers composed of many powerful processors. Users submit jobs that each require many resources.
However, since the number of both compute nodes and submitted jobs is very large, user jobs must be placed on the cluster efficiently, that is, following a strategy that runs the jobs as quickly as possible. Current scheduling algorithms are deterministic and follow empirical rules defined by the cluster administrator.
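To make the idea of a deterministic, rule-based scheduler concrete, here is a minimal sketch (not the actual algorithm used in production schedulers) of a first-come-first-served policy: jobs are placed on free nodes strictly in submission order, in successive "waves".

```python
# Hypothetical sketch of a deterministic first-come-first-served (FCFS)
# scheduler. Jobs are (job_id, nodes_required) pairs; a "wave" is one
# batch of jobs that fit on the cluster at the same time.
from collections import deque

def fcfs_schedule(jobs, total_nodes):
    """Return a list of (job_id, wave) placements in submission order."""
    queue = deque(jobs)
    placement, wave = [], 0
    while queue:
        free = total_nodes
        progressed = False
        # Greedily fill the current wave, respecting submission order:
        # stop as soon as the head job no longer fits (strict FCFS).
        while queue and queue[0][1] <= free:
            job_id, need = queue.popleft()
            free -= need
            placement.append((job_id, wave))
            progressed = True
        if not progressed:
            # Head job needs more nodes than the whole cluster: drop it.
            queue.popleft()
        wave += 1
    return placement

# Job "b" does not fit next to "a" on 4 nodes, so it waits a full wave.
print(fcfs_schedule([("a", 2), ("b", 3), ("c", 2)], total_nodes=4))
```

Note how the strict ordering wastes capacity: wave 0 leaves 2 nodes idle that job "c" could have used. Real schedulers mitigate this with heuristics such as backfilling, which is exactly the kind of hand-tuned rule the internship aims to replace with a learned policy.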
Reinforcement Learning
The goal of the internship was to harness the potential of Artificial Intelligence by modelling scheduling as a Reinforcement Learning problem.
Reinforcement Learning consists of an interaction between an agent and a random environment, in which the agent must learn to act efficiently by testing different possibilities. At each step, the agent chooses an action based on the current environment state; the environment then returns a reward representing the quality of the chosen action, together with a new state that depends on the action taken. The agent's objective is to maximize the sum of rewards.
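The interaction loop described above can be sketched as follows. The environment and agent here are toy examples invented for illustration, not the ones used during the internship: the environment rewards the agent when its action matches the current state, and the baseline agent simply acts at random.

```python
# Minimal sketch of the agent-environment loop (illustrative names only).
import random

class ToyEnv:
    """Random environment with 3 states; the best action equals the state."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Reward reflects the quality of the chosen action...
        reward = 1.0 if action == self.state else -1.0
        # ...and the next state is drawn at random by the environment.
        self.state = self.rng.randrange(3)
        return self.state, reward

class RandomAgent:
    """Baseline agent that ignores the state and picks actions at random."""
    def __init__(self, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions

    def act(self, state):
        return self.rng.randrange(self.n_actions)

env, agent = ToyEnv(), RandomAgent(n_actions=3)
state, total_reward = env.state, 0.0
for _ in range(100):
    action = agent.act(state)          # agent chooses an action
    state, reward = env.step(action)   # environment returns reward + new state
    total_reward += reward             # objective: maximize this sum
print(total_reward)
```

A random agent earns roughly zero total reward on this environment; the point of Reinforcement Learning is to replace it with an agent whose choices improve as rewards accumulate.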
During this internship, I proposed a Reinforcement Learning model and implemented it with a scheduling simulator. Simple experiments were then run as a proof of concept.
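As a flavour of what "learning by testing different possibilities" looks like in code, here is a hedged sketch of tabular Q-learning on the same kind of toy matching problem (this is a standard textbook algorithm, not the specific model developed during the internship):

```python
# Tabular Q-learning sketch on a toy 3-state problem where the best
# action equals the current state (illustrative, not the internship model).
import random

def q_learning(steps=2000, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    n_states = n_actions = 3
    Q = [[0.0] * n_actions for _ in range(n_states)]  # Q[state][action]
    state = 0
    for _ in range(steps):
        # Epsilon-greedy: mostly exploit the best known action,
        # sometimes explore a random one.
        if rng.random() < eps:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        reward = 1.0 if action == state else -1.0
        next_state = rng.randrange(n_states)
        # Standard Q-learning update toward reward + discounted best value.
        Q[state][action] += alpha * (
            reward + gamma * max(Q[next_state]) - Q[state][action]
        )
        state = next_state
    return Q

Q = q_learning()
# Greedy policy after training: the action the agent prefers in each state.
print([max(range(3), key=lambda a: Q[s][a]) for s in range(3)])
```

After a few thousand steps, the greedy policy matches the optimal one, which is the kind of simple proof-of-concept result the internship experiments aimed for on the scheduling simulator.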