Machine Learning on Fantasy Premier League
Introduction
Fantasy Premier League (FPL) is a free-to-play game in which participants pick a combination of real-life Premier League players who earn points based on their actual statistical performance and perceived contribution on the pitch. FPL is growing rapidly, reaching 7 million users this season.
FPL is unique because it blends the sport with its fans, giving them the chance to manage actual Premier League players. Each player has an assigned price, so managers must assemble their squad carefully within a limited budget. At the start of the season, an FPL manager selects 15 players, with a maximum of three from any one club. When a player on your team makes an attacking contribution on the pitch, such as a goal or an assist, your team is rewarded with points; when your player concedes goals or gets booked, your team suffers as well. In every round, a manager is given a single free transfer to swap one player for another, and making more than one transfer in a round incurs a points deduction.
There are three basic ways to score points in FPL: goals, assists and clean sheets. The exciting part is that your chosen captain scores double, so whenever your captain scores a goal, the points you receive from him are doubled. The captaincy pick is therefore crucial for your FPL team.
Figure: FPL user growth over the last 13 years. Source: https://fanarena.com/wp-content/uploads/2019/11/Fantasy-Premier-League-Users-per-Season.png
For FPL managers, selecting the players who will produce the most points in each round is not easy, as many variables come into play. Player form matters, but other factors, such as overall team form or the fixtures, cannot be neglected. For example, in the 2019/2020 season, Teemu Pukki of Norwich City was in excellent form leading up to gameweek 6: he scored 49 points across five matches, including a 12-point haul against the reigning champions, Manchester City. Yet in the six matches that followed, he gained only 12 points in total.
This project aims to build a points-prediction model using machine learning. I use data from fantasy.premierleague.com, understat.com and clubelo.com covering the 2018/2019 and 2019/2020 seasons for more than 150 players. Combining these sources gives variables such as player form, team form, home or away fixture, Elo ratings, expected goals per 90 minutes, and even the kickoff time of the match.
As FPL is expected to grow even faster in the future, this points-prediction model could be useful not only for FPL and football enthusiasts, but also for the wider public.
After some data cleansing and wrangling, which is not shown here, we can finally look into the data. The original dataset contains 94 variables, but we will only use some of them. Let's take a look at how Mohamed Salah of Liverpool did in rounds 12-19 of the 2019/2020 season:
 | player | round | club | opponent | was_home | total_points | goals | assists | minutes | club_elo | opponent_elo | position | value | influence_form | creativity_form | threat_form | ict_index_form | date_hour | kickoff_hour | range_day | team_scoring_form | team_conceded_form | opponent_scoring_form | opponent_conceded_form | team_form | opponent_form | fpl_form | goals_90 | assists_90 | npxG_90 | npg_90 | xG_90 | xGChain_90 | xGBuildup_90 | xA_90 | shots_90 | key_passes_90 | season | minutes_chance | scoring_chance | conceded_chance
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
6971 | Mohamed Salah | 12 | LIV | MCI | True | 8 | 1 | 0 | 86 | 2053.544 | 2025.337 | AM | 123 | 15.80 | 13.550 | 55.50 | 8.450 | 2019-11-10 16:30:00 | afternoon kick-off | 8.062 | 1.75 | 1.00 | 1.75 | 0.75 | 2.000 | 1.625 | 2.75 | 0.378 | 0.000 | 0.588 | 0.000 | 0.876 | 0.888 | 0.292 | 0.092 | 6.050 | 1.513 | 2019/2020 | 0.661 | 1.250 | 1.375 |
6972 | Mohamed Salah | 13 | LIV | CRY | False | 0 | 0 | 0 | 0 | 2061.503 | 1740.198 | AM | 123 | 22.55 | 4.950 | 44.00 | 7.125 | 2019-11-23 15:00:00 | afternoon kick-off | 12.938 | 2.00 | 1.00 | 0.50 | 2.00 | 2.000 | 0.375 | 4.25 | 0.769 | 0.000 | 0.359 | 0.385 | 0.652 | 0.669 | 0.327 | 0.014 | 5.385 | 0.385 | 2019/2020 | 0.650 | 2.000 | 0.750 |
6973 | Mohamed Salah | 14 | LIV | BHA | True | 3 | 0 | 0 | 68 | 2043.680 | 1632.715 | AM | 122 | 22.55 | 4.950 | 44.00 | 7.125 | 2019-11-30 15:00:00 | afternoon kick-off | 7.000 | 2.25 | 1.00 | 1.50 | 1.75 | 2.250 | 1.000 | 4.25 | 0.769 | 0.000 | 0.359 | 0.385 | 0.652 | 0.669 | 0.327 | 0.014 | 5.385 | 0.385 | 2019/2020 | 0.650 | 2.000 | 1.250 |
6974 | Mohamed Salah | 15 | LIV | EVE | True | 0 | 0 | 0 | 0 | 2044.447 | 1701.255 | AM | 122 | 11.40 | 4.300 | 28.25 | 4.375 | 2019-12-04 20:15:00 | night kick-off | 4.219 | 2.25 | 1.00 | 1.00 | 1.50 | 2.250 | 0.875 | 3.25 | 0.413 | 0.000 | 0.353 | 0.413 | 0.353 | 0.555 | 0.150 | 0.052 | 3.716 | 0.413 | 2019/2020 | 0.606 | 1.875 | 1.000 |
6975 | Mohamed Salah | 16 | LIV | BOU | False | 13 | 1 | 1 | 90 | 2046.424 | 1678.172 | AM | 122 | 11.40 | 4.000 | 19.50 | 3.500 | 2019-12-07 15:00:00 | afternoon kick-off | 2.781 | 3.00 | 1.25 | 1.00 | 2.00 | 2.125 | 0.000 | 2.75 | 0.584 | 0.000 | 0.324 | 0.584 | 0.324 | 0.443 | 0.046 | 0.073 | 2.922 | 0.584 | 2019/2020 | 0.428 | 2.500 | 1.125 |
6976 | Mohamed Salah | 17 | LIV | WAT | True | 16 | 2 | 0 | 90 | 2058.863 | 1629.378 | AM | 122 | 15.90 | 16.050 | 30.25 | 6.225 | 2019-12-14 12:30:00 | lunchtime kick-off | 6.896 | 3.00 | 1.00 | 0.25 | 1.75 | 2.250 | 0.250 | 4.00 | 0.570 | 0.570 | 0.840 | 0.570 | 0.840 | 1.097 | 0.195 | 0.434 | 3.987 | 2.848 | 2019/2020 | 0.439 | 2.375 | 0.625 |
6977 | Mohamed Salah | 19 | LIV | LEI | False | 3 | 0 | 0 | 69 | 2059.825 | 1826.247 | AM | 122 | 33.35 | 17.025 | 61.50 | 11.200 | 2019-12-26 20:00:00 | night kick-off | 12.312 | 3.00 | 0.75 | 2.00 | 1.25 | 2.125 | 1.375 | 8.00 | 1.089 | 0.363 | 0.894 | 1.089 | 0.894 | 1.058 | 0.124 | 0.276 | 5.081 | 1.815 | 2019/2020 | 0.689 | 2.125 | 1.375 |
Let me go through the variables in the table above in more detail:
total_points
: Our target variable; the points the player gained in that specific round.

club
: The club the player played for.

opponent
: The opposing club.

position
: The player's position on the pitch. Unlike the FPL position categories, I classified positions in more detail: defenders are split into centre backs (CB) and full backs (FB), and midfielders into defensive midfielders (DM) and attacking midfielders (AM).

value
: The player's price as set by FPL.

round
: The gameweek (matchweek) in which the match took place.

was_home
: Whether the club played the match at home or away.

date_hour
: The kickoff date and time.

kickoff_hour
: Whether the match was a lunchtime, afternoon or night kick-off.

range_day
: The number of days between the match and the club's previous Premier League match.

team_form
: The club's form leading up to the match.

opponent_form
: The opponent's form leading up to the match.

club_elo
: The Elo rating of the club leading up to the match.

opponent_elo
: The Elo rating of the opponent leading up to the match.

minutes_chance
: The number of minutes the player played in the last four rounds prior to the match, per 360 minutes.

fpl_form
: The player's average FPL points over the last four rounds prior to the match (a sketch of how such a rolling form feature can be built follows this list).

influence_form
: The player's average influence rating over the last four rounds prior to the match.

creativity_form
: The player's average creativity rating over the last four rounds prior to the match.

threat_form
: The player's average threat rating over the last four rounds prior to the match.

ict_index_form
: The player's average ICT index over the last four rounds prior to the match.

xG_90
: The player's average expected goals per 90 minutes over the last four rounds prior to the match.

xGChain_90
: The player's average expected goals chain per 90 minutes over the last four rounds prior to the match.

xA_90
: The player's average expected assists per 90 minutes over the last four rounds prior to the match.

shots_90
: The player's average shots per 90 minutes over the last four rounds prior to the match.

key_passes_90
: The player's average key passes per 90 minutes over the last four rounds prior to the match.
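The form features above are all rolling statistics over the previous four rounds. The feature-engineering code is not shown in this post, but a minimal sketch of how a feature like fpl_form could be built with dplyr and zoo (fpl_raw is a hypothetical per-player, per-round data frame with a total_points column) looks like this:

library(dplyr)
library(zoo)

fpl_form_example <- fpl_raw %>%
  arrange(player, round) %>%
  group_by(player) %>%
  mutate(
    # mean points over up to four previous rounds, lagged one round so that
    # only information available before the match is used
    fpl_form = lag(rollapplyr(total_points, width = 4, FUN = mean, partial = TRUE))
  ) %>%
  ungroup()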
Exploratory Data Analysis
One way to see the relationship between numeric variables is a correlation plot. We will start by correlating team attributes such as team_form and opponent_form with our target variable:
Then we correlate the player attributes, which include parameters such as xG_90, fpl_form, and shots_90:
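The plotting code is not reproduced here; a minimal sketch of how such correlation plots could be drawn with the GGally package (assuming the combined data frame is named fpl_ml, as in the modeling section below; the exact variable sets are illustrative) is:

library(dplyr)
library(GGally)

# Team attributes versus the target variable
fpl_ml %>%
  select(total_points, team_form, opponent_form, team_scoring_form,
         team_conceded_form, club_elo, opponent_elo) %>%
  ggcorr(label = TRUE, label_round = 2)

# Player attributes versus the target variable
fpl_ml %>%
  select(total_points, fpl_form, ict_index_form, xG_90, xA_90,
         shots_90, key_passes_90) %>%
  ggcorr(label = TRUE, label_round = 2)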
In both plots, our target variable, total_points, shows only weak correlations with the other parameters. This tells us that we cannot explain a player's points with just one or a few variables; to predict points we need to combine many parameters.
The following plot shows the average points scored by each position in each round. Attacking midfielders (AM) and forwards (ST) claim the highest average in nearly every round. Forwards are more volatile than attacking midfielders, but they have the highest ceiling among all positions. It is also worth noting that full backs (FB) average more points than centre backs (CB), most likely because of their attacking contributions on the pitch. Defensive midfielders (DM) are the lowest of the group, which can be explained by their lack of involvement in attack and the absence of any reward for defensive work in the game.
Do expensive players really give you more points throughout the season? Looking at the following plot, the answer is generally yes. The real task, however, is to find the best-value players, meaning players with high points per value, or points per million (a short sketch of how this can be computed follows this paragraph). In the plot, at the high end of the price range (right-hand side), attacking midfielders tend to be better value than forwards throughout the season. At the lower end of the price range, defenders from top teams, such as Trent Alexander-Arnold of Liverpool or Aymeric Laporte of Manchester City, are better value picks than players in other positions.
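The points-per-million metric is simply season points divided by price. A rough sketch, assuming the combined data frame fpl_ml and that value stores the FPL price multiplied by ten (as in the table above, where 123 corresponds to 12.3m):

library(dplyr)

points_per_million <- fpl_ml %>%
  group_by(season, player, position) %>%
  summarise(
    season_points = sum(total_points),
    avg_price     = mean(value) / 10   # e.g. value of 123 = 12.3m
  ) %>%
  ungroup() %>%
  mutate(points_per_million = season_points / avg_price) %>%
  arrange(desc(points_per_million))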
Which positions most frequently give us big returns throughout the season? The following graphs try to answer that question. The left panel counts the single-digit returns (6-9 points) produced by each position in each round, and the right panel counts the double-digit returns (more than 9 points). The graphs make it clear that only full backs and attacking midfielders deliver single-digit returns consistently, more than 150 times over the season, and only attacking midfielders produce more than 100 double-digit returns throughout the season.
In the following plot, we look at whether the club influences the points a player gains. We compare the average (yellow dot) and the maximum (blue dot) points achieved by players from each club. As expected, top clubs like Liverpool and Manchester City generate high average points and a higher ceiling than other clubs. Watford shows that a club's players can post very high scores in one or a few rounds even though their points settle back to a modest average over the season.
Machine Learning Model
Random Forest
Random forest is a supervised learning algorithm that uses an ensemble learning method for classification and regression. To prevent rare, extreme scores from biasing the model, we remove observations whose total_points value accounts for less than 0.1% of the data, so that the target variable ranges from -2 to 16. The distribution of total_points (in percent) looks like this:
## .
## -6 -4 -3 -2 -1 0
## 0.01355197 0.01355197 0.04065592 0.20327958 0.71825451 21.12752405
## 1 2 3 4 5 6
## 17.44138772 27.30722320 6.84374577 1.55847676 3.52351267 7.26385689
## 7 8 9 10 11 12
## 3.26602521 2.81881014 2.49356281 1.27388535 0.93508606 0.90798211
## 13 14 15 16 17 18
## 0.67759859 0.37945521 0.52852690 0.27103944 0.09486380 0.06775986
## 19 20 21 23 24
## 0.06775986 0.06775986 0.06775986 0.01355197 0.01355197
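The code behind this table is not shown; it can be reproduced with a simple percentage table over the combined data frame (named fpl_ml, as loaded in the neural network section below):

# Percentage distribution of the target variable
prop.table(table(fpl_ml$total_points)) * 100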
Before modeling, we hold out a test set: the data is split into training (70%) and test (30%) sets, stratified by total_points.
set.seed(100)
splitted <- rsample::initial_split(fpl_ml_rm, 0.7, "total_points")
fpl_train <- rsample::training(splitted)
fpl_test <- rsample::testing(splitted)
fpl_train <- fpl_train %>%
dplyr::select(- c(player, round, goals, assists, minutes, date_hour, season))
fpl_test <- fpl_test %>%
dplyr::select(- c(player, round, goals, assists, minutes, date_hour, season))
Then we can start the modeling:
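The random forest code itself is not included in the post. Below is a minimal sketch using the randomForest package; the tree count and the factor handling are my own assumptions, not necessarily the settings behind the results reported later:

library(dplyr)
library(randomForest)

# randomForest needs categorical predictors as factors with identical levels
# in train and test, so convert character columns on the combined data
combined <- bind_rows(train = fpl_train, test = fpl_test, .id = "split") %>%
  mutate_if(is.character, as.factor)
fpl_train_rf <- combined %>% filter(split == "train") %>% select(-split)
fpl_test_rf  <- combined %>% filter(split == "test") %>% select(-split)

set.seed(100)
# Regression forest predicting total_points from all remaining predictors
model_rf <- randomForest(total_points ~ ., data = fpl_train_rf, ntree = 500)

# Predictions used later in the evaluation section
pred_rf_train <- predict(model_rf, newdata = fpl_train_rf)
pred_rf_test  <- predict(model_rf, newdata = fpl_test_rf)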
Neural Network/Deep Learning
Neural networks are a set of algorithms arranged in "layers" of "nodes" (representing neurons), in which information flows from one layer to the next; they are designed to recognize patterns.
Before modeling, we again split the data into training and test sets with a 7:3 ratio.
fpl_ml <- read.csv("fpl_ml.csv")
fpl_ml_rm <- fpl_ml %>%
dplyr::filter(total_points < 16 & total_points > -2) %>%
dplyr::select(- X)
set.seed(100)
splitted <- rsample::initial_split(fpl_ml_rm, 0.7, "total_points")
fpl_train <- rsample::training(splitted)
fpl_test <- rsample::testing(splitted)
library(keras)

# train_x, test_x and the target vectors are not defined in the post;
# here we assume they are the split data frames and the total_points column.
train_x <- fpl_train
test_x  <- fpl_test
train_y <- fpl_train$total_points
test_y  <- fpl_test$total_points

# One-hot encode the categorical predictors with model.matrix
train_x_keras <- model.matrix(~club + opponent + was_home + club_elo + opponent_elo + position + value + influence_form + creativity_form + threat_form + ict_index_form +
  kickoff_hour + range_day + team_form + opponent_form + fpl_form +
  goals_90 + assists_90 + xG_90 + xGChain_90 + xGBuildup_90 + xA_90 + shots_90 + key_passes_90 + minutes_chance + scoring_chance + conceded_chance, train_x)
test_x_keras <- model.matrix(~club + opponent + was_home + club_elo + opponent_elo + position + value + influence_form + creativity_form + threat_form + ict_index_form +
  kickoff_hour + range_day + team_form + opponent_form + fpl_form +
  goals_90 + assists_90 + xG_90 + xGChain_90 + xGBuildup_90 + xA_90 + shots_90 + key_passes_90 + minutes_chance + scoring_chance + conceded_chance, test_x)

train_x_keras <- array_reshape(train_x_keras, dim = dim(train_x_keras))
test_x_keras  <- array_reshape(test_x_keras, dim = dim(test_x_keras))
train_y_keras <- train_y
Now we do the modeling. The network has three hidden layers (64, 32 and 16 nodes) feeding a single output node:
model <- keras_model_sequential()

model %>%
  layer_dense(units = 64, input_shape = ncol(train_x_keras), activation = "linear", name = "1") %>%
  layer_dense(units = 32, activation = "linear", name = "2") %>%
  layer_dense(units = 16, activation = "linear", name = "3") %>%
  layer_dense(units = 1, activation = "linear", name = "output")
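The compile and fit steps are not included in the post; a minimal sketch, where the loss, optimizer, number of epochs and batch size are illustrative assumptions:

# Compile with mean squared error loss, tracking MAE, then train
model %>% compile(
  loss = "mean_squared_error",
  optimizer = optimizer_adam(),
  metrics = "mean_absolute_error"
)

history <- model %>% fit(
  x = train_x_keras,
  y = train_y_keras,
  epochs = 30,
  batch_size = 32,
  validation_split = 0.1
)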
Evaluation
There are many possible evaluation metrics; here we use the Mean Absolute Error (MAE). We evaluate the training data and the test data separately to see whether our model is overfitting:
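MAE is simply the mean absolute difference between predicted and actual points. As a small sketch, using the hypothetical prediction objects from the random forest sketch above:

# Mean Absolute Error: average absolute difference between prediction and truth
mae <- function(actual, predicted) mean(abs(actual - predicted))

mae(fpl_train$total_points, pred_rf_train)  # training MAE
mae(fpl_test$total_points,  pred_rf_test)   # test MAE

# the keras model tracks MAE directly, e.g.
# model %>% evaluate(test_x_keras, test_y)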
Random Forest
## [1] 1.89482
## [1] 2.000539
On the training data we get an MAE of 1.89, while on the test data the MAE is 2.00. The small difference in MAE between the training and test data shows that our model is not overfitting. Our target ranges from -2 to 16, so an MAE of 2.00 corresponds to a percentage error of 2.00/(16-(-2)) = 11.11%.
Neural Network
## [1] 2.048037
## [1] 2.078562
In our neural network model, the MAE is 2.05 on the training data and 2.08 on the test data. With the same target range of -2 to 16, an MAE of 2.08 corresponds to a percentage error of 2.08/(16-(-2)) ≈ 11.6%.
Conclusion
In our case, random forest proved to perform marginally better than the neural network on our dataset (test MAE of 2.00 versus 2.08). Neural networks work better when there are invariances in the data, as in image recognition; when such invariances do not exist, they do not have much of an advantage.