June 2019

Machine Learning for Cloud and Distributed Computing

As the Final Year Project of my MEng Electrical and Electronic Engineering degree from Imperial College London, I developed machine-learning techniques to monitor and predict resource utilization and availability on a very large number of computers in cloud computing and distributed environments.

My supervisor for the project was Professor Kin K. Leung. My second marker for the project was Doctor Wei Dai. This project was part of the US-UK ITA project.

The project began with a survey of existing techniques and the underlying models reported in the literature. Algorithms were then implemented on real datasets to track the resource usage of many servers and to predict their resource occupancy and workload in the near future. The machine-learning techniques were validated against actual resource-usage measurements to show their effectiveness.

In my final report, I first looked at the current state of the distributed and cloud computing market. I identified the problems the industry is facing today. I went on to justify better resource occupancy predictions via machine learning as a good solution to these problems. I used a large dataset provided by Google for my technical investigations. I conducted exploratory analysis on the dataset to determine the dynamics of the system. I then identified well-suited prediction models, implemented them, and compared their performance to some baseline models.
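As a rough illustration of what comparing a prediction model against a baseline can look like, here is a minimal sketch in Python. It is not the code from the project: the usage series is synthetic, and the two forecasters (a persistence baseline and a moving average) as well as all function names are assumptions chosen only to show the walk-forward evaluation idea.

```python
from statistics import mean


def persistence_forecast(history):
    """Baseline: predict that the next value equals the most recent one."""
    return history[-1]


def moving_average_forecast(history, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    return mean(history[-window:])


def mean_absolute_error(series, forecaster, warmup=3):
    """Walk forward through the series, forecasting one step ahead each time,
    and return the mean absolute error of those forecasts."""
    errors = []
    for t in range(warmup, len(series)):
        prediction = forecaster(series[:t])  # only past values are visible
        errors.append(abs(series[t] - prediction))
    return mean(errors)


# Synthetic CPU-usage fractions, invented for illustration only
# (not taken from the Google trace).
cpu_usage = [0.40, 0.60, 0.40, 0.60, 0.40, 0.60, 0.40, 0.60]

baseline_mae = mean_absolute_error(cpu_usage, persistence_forecast)
candidate_mae = mean_absolute_error(cpu_usage, moving_average_forecast)

print(f"persistence MAE:    {baseline_mae:.4f}")
print(f"moving-average MAE: {candidate_mae:.4f}")
```

On this deliberately oscillating toy series the moving average has a lower error than the persistence baseline. On real traces the ranking depends on the workload, which is why each candidate model is measured against a baseline on the same data.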

All the data used for this project has been voluntarily published by Google in an attempt to "make visible many of the scheduling complexities that affect Google's workload, including the variety of job types, complex scheduling constraints on some jobs, mixed hardware types, and user mis-estimation of resource consumption". The usage trace is located in a public Google Cloud Platform bucket. All code related to the project is also publicly available in a GitHub repository.