HPC resource management improvement using Reinforcement Learning

I studied the problem of resource allocation in HPC clusters during a 4-month internship (April to July 2020) at the R&D center of Bull (Atos) in Échirolles, in the Cognitive DataCenter team.

Scheduling in an HPC cluster

High Performance Computing (HPC) clusters are large data centers composed of many powerful processors. Users can submit jobs that require large amounts of resources.

However, as the number of both compute nodes and jobs is very large, user jobs have to be placed on the cluster efficiently, that is to say following a strategy that allows the jobs to run as quickly as possible. Current scheduling algorithms are deterministic and follow empirical rules defined by the cluster administrator.
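To give an idea of such an empirical rule, here is a minimal Python sketch of a first-come-first-served policy, where jobs are started in submission order as soon as enough nodes are free. It is purely illustrative: the job and cluster representations are simplified assumptions, not the scheduler actually used on Bull's clusters.

    from collections import deque

    def fcfs_schedule(jobs, total_nodes):
        # Toy first-come-first-served scheduler (illustrative only).
        # jobs: list of (job_id, nodes_needed, duration), job_id reflecting submission order.
        # Returns a list of (job_id, start_time) pairs.
        assert all(nodes <= total_nodes for _, nodes, _ in jobs), "each job must fit in the cluster"
        queue = deque(sorted(jobs, key=lambda job: job[0]))  # process jobs in submission order
        running = []                 # (end_time, nodes_used) of currently running jobs
        free_nodes = total_nodes
        time, schedule = 0, []

        while queue:
            # Release the nodes of jobs that have finished by the current time.
            finished = [job for job in running if job[0] <= time]
            running = [job for job in running if job[0] > time]
            free_nodes += sum(nodes for _, nodes in finished)

            job_id, nodes_needed, duration = queue[0]
            if nodes_needed <= free_nodes:
                # Enough free nodes: start the head-of-queue job immediately.
                queue.popleft()
                free_nodes -= nodes_needed
                running.append((time + duration, nodes_needed))
                schedule.append((job_id, time))
            else:
                # Otherwise wait until the next running job completes.
                time = min(end for end, _ in running)

        return schedule

    # Example on an 8-node cluster:
    # fcfs_schedule([(1, 4, 10), (2, 8, 5), (3, 2, 3)], total_nodes=8)
    # -> [(1, 0), (2, 10), (3, 15)]

Note that in this example job 3 waits behind job 2 even though two nodes are idle at time 0: this is exactly the kind of inefficiency a smarter placement strategy aims to avoid.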

Reinforcement Learning

The idea of the internship was to exploit the potential of Artificial Intelligence by modelling scheduling as a Reinforcement Learning problem.

Reinforcement Learning consists of an interaction between an agent and a random environment, in which the agent has to learn how to interact efficiently by testing different possibilities. At each step, the agent chooses a possible action depending on the current environment state, and the environment returns to the agent a reward representing the quality of the chosen action, together with a new state that depends on the action taken. The agent's objective is to maximize the sum of rewards.
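This interaction can be summarised by the following loop, written as a Python sketch. The reset/step interface and the agent object are generic placeholders assumed for illustration, not the ones used during the internship.

    def run_episode(env, agent, max_steps=1000):
        # Generic agent/environment interaction loop (illustrative sketch).
        state = env.reset()                              # initial environment state
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.act(state)                    # agent picks an action from the current state
            next_state, reward, done = env.step(action)  # environment reacts with a reward and a new state
            agent.learn(state, action, reward, next_state)  # agent updates its policy
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward                              # the quantity the agent tries to maximize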

During this internship, I proposed a Reinforcement Learning model and implemented it using a scheduling simulator. Simple experiments were then carried out to provide a proof of concept.
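For intuition only, here is a hypothetical sketch of how scheduling could be cast in this framework; it is not the model actually proposed during the internship. The state describes the free nodes and the resource requests of the waiting jobs, an action selects which waiting job to start next, and the reward penalises late start times. It exposes the same reset/step interface as the loop above.

    import random

    class ToySchedulingEnv:
        # Hypothetical scheduling environment with an RL interface (illustrative only).

        def __init__(self, total_nodes=8, n_jobs=10, seed=0):
            self.total_nodes = total_nodes
            self.n_jobs = n_jobs
            self.rng = random.Random(seed)

        def reset(self):
            # Each waiting job is (nodes_needed, duration); all jobs are known at time 0 here.
            self.queue = [(self.rng.randint(1, self.total_nodes), self.rng.randint(1, 10))
                          for _ in range(self.n_jobs)]
            self.free_nodes = self.total_nodes
            self.running = []    # (end_time, nodes_used) of currently running jobs
            self.time = 0
            return self._state()

        def _state(self):
            # State: number of free nodes plus the resource request of each waiting job.
            return (self.free_nodes, tuple(nodes for nodes, _ in self.queue))

        def step(self, action):
            # action: index of the waiting job to start next.
            nodes_needed, duration = self.queue.pop(action)
            # Advance time until enough nodes are free for the chosen job.
            while self.free_nodes < nodes_needed:
                end, used = min(self.running)
                self.time = max(self.time, end)
                self.running.remove((end, used))
                self.free_nodes += used
            self.free_nodes -= nodes_needed
            self.running.append((self.time + duration, nodes_needed))
            reward = -self.time          # penalise late start times, i.e. waiting
            done = not self.queue        # episode ends when every job has been placed
            return self._state(), reward, done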

Antoine BARRIER
PostDoc in Medical Imaging

I’m interested in Medical Imaging techniques and in Optimization Algorithms in Sequential Learning.