Introduction

In a previous post on the FPL, I attempted to develop an approach for selecting a winning FPL team. Although it has only been one Game Week (GW) since, the results have been less than satisfactory. Upon reviewing the approach, I realised that I had made two mistakes. First, I placed too much emphasis on the median of bonus points (BPs) scored by each player. The mean, though easily distorted, provides a better estimate of a player’s points-scoring ability. Second, I focused too much on accumulating points at a stable rate. Choosing the coefficient of variation as a key metric led me to choose players who were excellent footballers, but not excellent choices in the FPL.

In this post, I propose a Bayesian approach (I’ll explain this later) for predicting and choosing the optimal players to win the FPL.

The Data

I use two datasets: (1) player-game data for the 2017/2018 EPL season and (2) historical player-game data for many leagues across Europe.

EPL 2017/2018

This dataset contains player-game data: records of each player in each game. Each record tells us how many FPL points, goals, assists, clean sheets, and other points-scoring criteria were clocked by a specific player in a specific game. This dataset will help us to identify the main metrics to look for.

Note: The findings were similar to those in the section Understanding the FPL: The Stats That Matter in my first post. You may wish to skip to the section on Bayes Theorem if you have read that post already.

Historical Player-Game Data

The original dataset from Kaggle contained match, team and player data. In total, there were 8 seasons’ worth of games in each of the following leagues:

  1. English Premier League
  2. German Bundesliga
  3. Italian Serie A
  4. Spanish La Liga
  5. Dutch Eredivisie [removed 2 seasons]
  6. French Ligue 1 [removed 2 seasons]
  7. Scottish Premier League [removed 4 seasons]
  8. Swiss Super League [removed 7 seasons]
  9. Belgian Jupiler League [removed all seasons]
  10. Polish Ekstraklasa [removed all seasons]
  11. Portuguese Liga ZON Sagres [removed all seasons]

However, not all matches contained team lineup and goal data. As such, several seasons and entire leagues were removed from the dataset. There were a total of 13,325 matches remaining. The match data was transformed into player-game data by extracting the number of goals scored and assists made in each match. This translated into 298,194 observations - each observation is a player record for a given game that he played in.

Player Positions

Strangely, the dataset contained very detailed data on player attributes (from the FIFA game versions), but did not state the players’ positions. As such, I scraped player positions from the players’ respective stats pages on SoFIFA, a website dedicated to providing FIFA game stats.

The Stats That Matter (Version 2)

In this section, I replicate the regression models from my first post to identify the metrics that matter for goalkeepers (GKs), defenders (DEFs), midfielders (MIDs) and strikers (STKs). This time, I use the actual FPL points scored as the dependent variable in the model, and the scoring metrics as the explanatory variables.

Goalkeepers (GK)

The graph below shows the number of actual FPL points awarded for each of the following activities. Realistically, we should only be concerned with clean sheets and saves. It is optimal to choose a keeper who (1) is good enough to keep the ball out of the net consistently because clean sheets give a whopping 4 points; and (2) has DEFs who make it difficult for attackers to shoot, but still let shots through - because each save is worth approximately 0.45 points.

Defenders (DEF)

For DEFs, the metrics we want to look out for are clean sheets and assists. Goals provide many more points, but are much rarer and harder to predict. We should choose defenders who (1) are part of a strong defensive team (other DEFs and GK) to boost their chances of earning a clean sheet, and (2) aren’t afraid to push forward to make crosses (to get assists). Typically, wingbacks would meet these criteria. Robertson, Alexander-Arnold and Alonso are worth keeping an eye on.

Midfielders (MID)

The selection criteria for MIDs are goals scored and assists. Nothing particularly novel here.

Strikers (STK)

For STKs, we want to watch their goals scored and assists. Again, this is expected.

Summary

In summary, goals scored, assists and clean sheets are generally the metrics that matter. It is difficult to predict how many goals a player will score from game to game. And as a friend rightly pointed out, it is difficult to predict how a player who just transferred into an EPL team (like Salah) would perform. Historical data should be used to estimate players’ point scoring ability. How do we achieve that? The answer is Bayesian inference.

Bayes Theorem

\[ {P(H|E) = \frac{P(E|H)P(H)}{P(E)}} \]

I’m sure many of you are familiar with Bayes’ Theorem. I will only provide an elementary, intuitive proof of it, because there are many articles out there that provide much more rigorous explanations of the rule. Moving the denominator on the right to the left, we get:

\[ \begin{aligned} P(H|E)P(E) &= P(E|H)P(H) \\ \Rightarrow P(H\cap E) &= P(E\cap H) \end{aligned} \]

The left side of the equation is the probability that H occurs given that E occurs multiplied by the probability that E has occurred. In other words, it is the probability that H and E occur together. By a similar logic, we can deduce that the right side of the equation is also the probability that H and E occur together. This means that the left side of the equation equals the right side of the equation - it is mathematically sound. We could have proven Bayes’ rule by going the opposite direction. In any case, what’s important is how we apply it to the problem at hand.

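Here’s a simple example using the recent Liverpool v. West Brom game. Let’s say we switched on our TV to find Liverpool 2-0 up against West Brom. What is the probability that Liverpool wins the game? We can model this using Bayes’ Rule (above). H is the event that Liverpool beats West Brom, and E is the event that Liverpool scores two goals. We label them H and E respectively to indicate that they are our hypothesis (something we don’t yet know) and our evidence (something we observe). On the right side of the equation, there are three quantities:

  1. \(P(E|H)\): the probability that Liverpool scores two goals, given that they win the game
  2. \(P(H)\): the probability that Liverpool wins the game, before we observe any evidence
  3. \(P(E)\): the probability that Liverpool scores two goals, regardless of the result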

Bayes’ Theorem allows us to combine these three pieces of information to arrive at a posterior probability: a probability that something happens after taking evidence into consideration. In this case, it is the probability that Liverpool wins if they have scored two goals.

So all this sounds interesting in theory, but how do we apply it? Let’s tweak the example slightly. Suppose we think that Liverpool would win the game with 75% probability. This is \(P(H)\), our prior belief, established by professional knowledge, gut feel, research, or empirical statistics - it doesn’t matter, as long as there is some quantified belief. As of the 4th minute when Danny Ings scored, the win probability changed. In the 72nd minute, Salah scored a second goal, and the win probability changed again. We are not referring to \(P(H)\) here. \(P(H)\) applies when we stubbornly refuse to incorporate any new information. By accounting for Ings’ and Salah’s goals, we are stepping into Bayesian territory - Bayes’ Theorem applies here.

Suppose that, instead of E being the event that Liverpool scores exactly two goals, E is the event that Liverpool scores X goals. We can model the probability of this event using a Poisson distribution. The blue bars in the graph below show the distribution of goals scored by STKs in the 2017/2018 season, and the red lines represent a fitted Poisson distribution for this data. The fit is not perfect, but it shows that goals scored approximately follow a Poisson distribution.

We could, by all means, choose to ignore the goals scored in the game and stick to our probability \(P(H)\) of winning. Alternatively, we could incorporate new information as it comes. By accounting for Ings’ goal in the 4th minute, the updated probability of winning would be:

\[ {P(Liverpool\ Wins|Liverpool\ scored\ 1\ goal)} \]

which can be estimated using past data and our Poisson distribution of Liverpool goals. By accounting for Salah’s goal in the 72nd minute, we would update the probability above to:

\[ {P(Liverpool\ Wins|Liverpool\ scored\ 2\ goals)} \]

This is not to say that Liverpool now has a greater chance of winning (with a 2-0 lead). We don’t know that because we have not actually performed the statistical analysis. For all we know, the probability of winning could be lower if Liverpool generally becomes more complacent whenever they have scored 2 or more goals, and end up drawing or losing the game.

The key point to note here is that Bayes’ Theorem allows us to (1) quantify an existing belief - a prior probability, as in the probability \(P(H) = 0.75\) that Liverpool would win; and (2) make updates to this belief using new data, like the incorporation of Liverpool’s two goals into a conditional probability by applying Bayes’ rule. This concept is the foundation for the Bayesian approach we will use to estimate players’ points-scoring abilities.
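To make the two steps concrete, here is a small Python sketch of a single Bayesian update. The prior of 0.75 comes from the example above; the two likelihoods are purely hypothetical numbers chosen for illustration, and the helper name `bayes_posterior` is mine. The denominator \(P(E)\) is expanded with the law of total probability.

```python
def bayes_posterior(prior, lik_if_true, lik_if_false):
    """Return P(H|E) from P(H), P(E|H) and P(E|not H)."""
    # Law of total probability gives the denominator P(E).
    evidence = lik_if_true * prior + lik_if_false * (1 - prior)
    return lik_if_true * prior / evidence

p_win = 0.75  # prior P(H) from the example in the text

# Hypothetical likelihoods (assumptions, not estimated from data): how often
# Liverpool scores two goals in games they win versus games they do not win.
p_two_goals_if_win = 0.50
p_two_goals_if_not = 0.10

posterior = bayes_posterior(p_win, p_two_goals_if_win, p_two_goals_if_not)
print(f"P(win | two goals) = {posterior:.4f}")
```

With different likelihoods the posterior could just as well fall below the prior, which is why the actual direction of the update has to come from data.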

An Empirical Bayes Approach

As discussed in the first section, the metrics of interest are: goals, assists and clean sheets. If we have an estimate of players’ probabilities of scoring, assisting, and maintaining clean sheets, we could estimate their worth for a particular season. In fact, we could, as we did above, update these probabilities from game to game. This would require us to establish a hierarchical Bayes model with two distributions:

  1. A distribution for scoring goals
  2. A prior distribution (prior) to describe the parameters for the above distribution

We can think of these as the right side of Bayes’ rule above. Such a model will require us to assume that the parameters for the first distribution (for scoring goals) change randomly from game to game, and how much these parameters vary can be modelled by a second distribution (the prior). This description is rather technical. I will provide examples once we have specified how we wish to model goal scoring.

Configurations

Model A: The Number of Goals Scored

I have chosen two possible configurations. The first (A) is to estimate the probabilities of the number of goals scored by a player throughout the season. This will require a Poisson distribution (Model A1) as the distribution for scoring goals, and a Gamma distribution as the prior (Model A2). In English, this means that the number of goals that a player like Salah would score in any given game is randomly drawn from a Poisson distribution with parameter \(\lambda_i\). Note the subscript i: this means that the parameter of the Poisson varies from game to game! How does it vary? Well, \(\lambda_i\) is itself drawn from a distribution - the Gamma distribution. Together, these imply two levels of randomness: (1) to randomly draw the parameter \(\lambda_i\) from a Gamma distribution, and (2) to then use this parameter to randomly draw a number of goals for a specific game that Salah is playing in using the Poisson distribution.
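The two levels of randomness can be sketched in a few lines of Python. The Gamma parameters below are illustrative values, not fitted ones, and the second parameter is treated as a rate (so the mean scoring rate is `alpha / beta`); Python's `random.gammavariate` expects a scale, hence the `1.0 / beta`. The standard library has no Poisson sampler, so the sketch uses Knuth's multiplication method.

```python
import math
import random

def draw_poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def draw_goals(alpha, beta, rng):
    """One game: draw a scoring rate, then draw goals given that rate."""
    # Level 1: the scoring rate for this game. gammavariate takes a shape
    # and a SCALE, so the rate parameter beta is inverted.
    lam = rng.gammavariate(alpha, 1.0 / beta)
    # Level 2: the number of goals, given that rate.
    return draw_poisson(lam, rng)

rng = random.Random(42)
alpha, beta = 2.0, 6.0  # illustrative values only
draws = [draw_goals(alpha, beta, rng) for _ in range(20000)]
mean_goals = sum(draws) / len(draws)
print(f"simulated mean goals per game: {mean_goals:.3f}")
print(f"mean of the Gamma (alpha / beta): {alpha / beta:.3f}")
```

Over many simulated games, the average number of goals should settle close to the mean of the Gamma distribution.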

Model B: The Probability of Scoring in a Game

The second configuration (B) is much more unorthodox for football: to estimate the probability that a player will score in a game. This will require a Binomial distribution (Model B1) to model the chance of scoring in a game and a Beta distribution as the prior (Model B2). In English, this means that the chance that Salah will score (yes/no) in a given game is governed by a probability \(p_i\). The chance of scoring is a parameter in the Binomial distribution, which models the number of successes (games in which Salah scores) out of a total number of binary (yes/no) trials (games). Note the subscript i, which, as in configuration A, implies that \(p_i\) is also drawn randomly from a distribution. In this case, we have the Beta distribution. Again, there are two levels of randomness: (1) randomly drawing a \(p_i\) from a Beta distribution and (2) flipping a biased coin which has a probability \(p_i\) of getting heads (score a goal).
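The same idea for configuration B, again as a hedged sketch: the Beta parameters `a` and `b` are illustrative values, and `games_with_goal` is a hypothetical helper that counts the games in a season in which the player scores at least once.

```python
import random

def games_with_goal(a, b, n_games, rng):
    """Count games with a goal, drawing a fresh scoring chance each game."""
    scored = 0
    for _ in range(n_games):
        p = rng.betavariate(a, b)  # level 1: chance of scoring in this game
        if rng.random() < p:       # level 2: biased coin flip
            scored += 1
    return scored

rng = random.Random(7)
a, b = 3.0, 9.0  # illustrative values; the Beta mean is a / (a + b) = 0.25
seasons = [games_with_goal(a, b, 38, rng) for _ in range(2000)]
avg_share = sum(seasons) / (len(seasons) * 38)
print(f"average share of games with a goal: {avg_share:.3f}")
```

Averaged over many simulated seasons, the share of games with a goal should be close to the mean of the Beta distribution.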

Analysing the Models

We will do pretty much the same thing for both configurations - what is known as empirical Bayes estimation:

  1. Estimate the parameters \(\alpha_0\) and \(\beta_0\) for the prior distribution using a whole lot of past data (hence, empirical) up to the 2014/2015 season
  2. Analyse the distribution of the parameters \(\lambda_i\) or \(p_i\) for the main distributions (Poisson and Binomial)
  3. Incorporate the data in the 2015/2016 season by updating \(\alpha_0\) and \(\beta_0\) for the prior distribution to get the posterior distribution
  4. Analyse the new distribution for the parameters \(\hat{\lambda_i}\) and \(\hat{p_i}\), and the predicted distribution of goals (or games with goals) with respect to the true distributions

For the purpose of keeping the post relatively short, I will save Model B (Beta-Binomial model) and the comparison of both models for the next post in this series, and the results of the models for assists and clean sheets for the post after.

Predicting the Number of Goals Scored in a Game

In this section, we attempt to implement Model A (Poisson-Gamma). Specifically, we will use STK data from the 2008/2009 season up to the 2013/2014 season to develop a generic prior distribution (one that applies to any STK). We will then update the prior distribution with data from the 2014/2015 EPL season to arrive at a posterior distribution for each STK in the EPL. This is our predicted probability distribution for each STK in the 2015/2016 EPL season (the latest season we have). Finally, we will perform two analyses:

  1. Compare the “predicted” goal distributions with the actual goal distributions for each STK in the 2015/2016 EPL season
  2. Repeat the model for the 2008/2009 season to the 2014/2015 season, update the priors with the 2015/2016 season, and compare the “predicted” goal distributions with the actual goal distributions for the 2017/2018 season.

Fitting the Model

First, we need to obtain a distribution of \(\lambda_i\). We achieve this in two steps:

  1. Summarise goal records for players who played at least 20 games and scored at least 1 goal. This is so that we have sufficient data per player, and we don’t have any empirical \(\lambda\)s that are zero. Zeros make it impossible to estimate a Gamma distribution.
  2. Fit a Poisson distribution to the goal distribution (i.e. all games) for each STK. For player i, the parameter for the Poisson distribution unique to him would be the mean of goals scored in all his games. Calculating the mean goals for each of the 677 selected STKs gives us a distribution of \(\lambda_i\) (below).
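The two steps above can be sketched as follows. The records are hypothetical and the data layout (a list of player and goals pairs, one per game) is an assumption made for the example; only the thresholds of 20 games and 1 goal come from the text.

```python
# Hypothetical player-game records: (player, goals in that game).
records = (
    [("Striker A", g) for g in [1, 0, 0, 2, 0] * 5]  # 25 games, 15 goals
    + [("Striker B", g) for g in [0, 0, 1, 0] * 6]   # 24 games, 6 goals
    + [("Striker C", 0) for _ in range(30)]          # 30 games, no goals
    + [("Striker D", g) for g in [1, 1, 0]]          # only 3 games
)

MIN_GAMES, MIN_GOALS = 20, 1  # thresholds from step 1

# Step 1: summarise games played and goals scored per player.
totals = {}
for player, goals in records:
    games_so_far, goals_so_far = totals.get(player, (0, 0))
    totals[player] = (games_so_far + 1, goals_so_far + goals)

# Step 2: the Poisson parameter for each retained player is his mean
# number of goals per game.
lambdas = {
    player: goals / games
    for player, (games, goals) in totals.items()
    if games >= MIN_GAMES and goals >= MIN_GOALS
}
print(lambdas)
```

Striker C is dropped because his empirical rate would be zero, and Striker D because three games are too few to say much about him.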

We then fit a Gamma distribution to this distribution using the fitdistr function. The resulting distribution of \(\lambda_i\) and the fitted Gamma distribution are shown in the graph below:

The Gamma distribution approximately fits the distribution of \(\lambda_i\). The parameters were:

\[ \begin{aligned} \alpha_0 &= 4.57 \\ \beta_0 &= 13.83 \end{aligned} \]

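\(\alpha_0\) can be interpreted as a prior number of goals scored, and \(\beta_0\) as a prior number of games played. In other words, before seeing any of his own data, every STK is treated as if he had scored about 4.57 goals in 13.83 games, which gives a prior mean of \(\alpha_0 / \beta_0 \approx 0.33\) goals per game.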
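For readers who want to try the fitting step without R, here is a rough Python sketch. Note that it swaps in a different technique: it uses the method of moments rather than the maximum-likelihood fit that fitdistr performs, so it would not reproduce the parameters above exactly. The list of scoring rates is hypothetical.

```python
from statistics import mean, variance

def fit_gamma_moments(values):
    """Method-of-moments estimates of the Gamma shape and rate."""
    m = mean(values)
    v = variance(values)  # sample variance
    # For a Gamma with shape a and rate b: mean = a / b, variance = a / b^2.
    return m * m / v, m / v

# Hypothetical per-striker scoring rates (goals per game), not real data.
lambdas = [0.12, 0.18, 0.22, 0.25, 0.28, 0.31, 0.35, 0.40, 0.48, 0.60, 0.71]

alpha0, beta0 = fit_gamma_moments(lambdas)
print(f"alpha0 = {alpha0:.2f}, beta0 = {beta0:.2f}")
print(f"prior mean = {alpha0 / beta0:.3f}")
```

By construction, the prior mean `alpha0 / beta0` from this method equals the average of the scoring rates that were fed in.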

Updating the Model

In this section, we update the model for all STKs using data from the 2014/2015 EPL season. Here’s an example of how we can do so. Suppose Player X played 33 out of 38 games in that season. We get random draws from a Poisson distribution with \(\lambda = 0.9\) to obtain the following performance graph:

That amounts to a total of 28 goals in 33 games - a prolific striker indeed. To update the Gamma distribution’s parameters, we would add the total number of goals to the first parameter \(\alpha_0\), and the total number of games to the second parameter \(\beta_0\):

\[ \begin{aligned} \alpha_X = \alpha_0 + 28 &= 4.57 + 28 = 32.57 \\ \beta_X = \beta_0 + 33 &= 13.83 + 33 = 46.83 \end{aligned} \]
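Why does simply adding goals and games work? The Gamma distribution is the conjugate prior of the Poisson distribution. As a sketch, write \(x_1, \dots, x_n\) for the goals scored in each of the player’s \(n\) games (this notation is introduced here for illustration), and drop all factors that do not depend on \(\lambda\):

\[ \begin{aligned} p(\lambda \mid x_1, \dots, x_n) &\propto \lambda^{\alpha_0 - 1} e^{-\beta_0 \lambda} \times \prod_{j=1}^{n} \lambda^{x_j} e^{-\lambda} \\ &= \lambda^{\alpha_0 + \sum_{j} x_j - 1} e^{-(\beta_0 + n) \lambda} \end{aligned} \]

The last line has the form of a Gamma distribution with first parameter \(\alpha_0 + \sum_j x_j\) (prior goals plus observed goals) and second parameter \(\beta_0 + n\) (prior games plus observed games).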

With these new parameters for our Gamma distribution, we can now plot the distribution of \(\lambda_i\), the updated parameter for the Poisson distribution of goals for Player X as of the end of the 2014/2015 season.

The mean number of goals per game (or empirical \(\lambda_i\)) for Player X was approximately 0.85 (28 goals in 33 games). In contrast, the mean of the posterior distribution for \(\lambda_i\), \(\alpha_X / \beta_X\), was only about 0.70. Although Player X had a superb season, his expected distribution of goals for the next season was not scaled up in proportion to his performance. This was because, from past data, we know that the average STK has a \(\lambda_i\) of approximately 0.33. Player X has to maintain good performance over multiple seasons for the model to increase his \(\lambda_i\) estimate. Next, we perform these same actions of updating the distribution of \(\lambda_i\) for all EPL STKs, using the data from the 2014/2015 season.
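The pull towards the average STK can be made explicit with a short Python sketch. It uses the prior parameters and Player X’s totals from this section; the function name `update_gamma` is mine. The point it illustrates is that the posterior mean is a weighted average of the prior mean and the player’s own mean, with the weight on his own data growing with the number of games he has played.

```python
def update_gamma(alpha0, beta0, goals, games):
    """Gamma-Poisson update: add goals to the first parameter, games to the second."""
    return alpha0 + goals, beta0 + games

alpha0, beta0 = 4.57, 13.83  # fitted prior parameters from the text
goals, games = 28, 33        # Player X's season

alpha_x, beta_x = update_gamma(alpha0, beta0, goals, games)

prior_mean = alpha0 / beta0
empirical_mean = goals / games
posterior_mean = alpha_x / beta_x

# Weight placed on the player's own data; the rest stays on the prior.
weight = games / (beta0 + games)
blended = (1 - weight) * prior_mean + weight * empirical_mean

print(f"prior mean     = {prior_mean:.3f}")
print(f"empirical mean = {empirical_mean:.3f}")
print(f"posterior mean = {posterior_mean:.3f}")
print(f"weight on own data = {weight:.3f}, blended mean = {blended:.3f}")
```

With only 33 games, a noticeable share of the weight still sits on the prior, which is why one superb season is not enough to move the estimate all the way to the player’s empirical rate.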

Analysis of Distributions

To start off, I use Roberto Firmino as an example to illustrate what updating the model implies. First, I update Firmino’s \(\lambda_i\) distribution with his data from the 2014/2015 season. I then run a simulation of 100 seasons for Firmino, assuming that he plays in 35 of 38 games in each of these hypothetical seasons. In each game, I draw a \(\lambda_i\) from his updated distribution, and use this \(\lambda_i\) to randomly draw a number of goals from a Poisson distribution. We end up with 100 different seasons, which we can interpret as the possibilities for Firmino in the 2015/2016 season.
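A simulation of this kind might look like the Python sketch below. The posterior parameters are hypothetical placeholders rather than Firmino’s actual values, and the Poisson sampler is a hand-written helper (Knuth's method) because the standard library does not provide one. As described above, each game draws a fresh scoring rate from the updated Gamma distribution and then a number of goals from a Poisson distribution.

```python
import math
import random
from collections import Counter

def draw_poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_season(alpha, beta, n_games, rng):
    """Return a Counter mapping goals-in-a-game to the number of such games."""
    season = Counter()
    for _ in range(n_games):
        # gammavariate takes shape and SCALE; beta is a rate, so invert it.
        lam = rng.gammavariate(alpha, 1.0 / beta)
        season[draw_poisson(lam, rng)] += 1
    return season

rng = random.Random(2015)
alpha_post, beta_post = 10.0, 40.0  # hypothetical posterior parameters
n_seasons, n_games = 100, 35
seasons = [simulate_season(alpha_post, beta_post, n_games, rng)
           for _ in range(n_seasons)]

# Average share of games with 0, 1, 2 and 3 goals across simulated seasons.
for k in range(4):
    share = sum(s[k] for s in seasons) / (n_seasons * n_games)
    print(f"{k} goals: {share:.3f} of games")
```

Each simulated season corresponds to one of the 100 hypothetical seasons, and its counts can be compared directly with a player’s actual goal distribution.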

The blue line shows Firmino’s goal distribution in the 2014/2015 season, the red line shows his goal distribution in the 2015/2016 season, and the grey lines are the goal distributions from the 100 simulated seasons with the updated Gamma distribution. From the graph, we see that the model underestimated his scoring ability. Specifically, it overestimated the number of games where he scored 0 or 1 goal, and underestimated the number of games in which he scored 2 goals.

In 2014/2015, Firmino scored an average of 0.21 goals per game (\(\lambda_i\)). In 2015/2016, he nearly doubled that figure to 0.40 goals per game. However, due to his poor showing in 2014/2015, the estimated mean was only 0.25. These results are consistent with the graphical interpretations. A simpler and more practical approach would be to compare the mean and 95% credible interval of the updated Gamma distribution for \(\lambda_i\) with the “true” parameters. We compare these parameters first for the 2014/2015 season (the equivalent of in-sample prediction), and then for the 2015/2016 season (the equivalent of out-of-sample prediction).
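One way to obtain the mean and a 95% interval for an updated Gamma distribution is sketched below in Python. Since the standard library has no Gamma quantile function, the interval is approximated by Monte Carlo sampling, which is a choice made for this sketch. The season totals of 7 goals in 33 games are an assumed input that is consistent with the 0.21 goals per game quoted above; the actual counts are not given in this post.

```python
import random

def gamma_interval(alpha, beta, level=0.95, n_draws=100_000, seed=0):
    """Approximate a central interval of Gamma(shape=alpha, rate=beta)."""
    rng = random.Random(seed)
    # gammavariate takes shape and SCALE; beta is a rate, so invert it.
    draws = sorted(rng.gammavariate(alpha, 1.0 / beta) for _ in range(n_draws))
    tail = (1 - level) / 2
    lower = draws[int(tail * n_draws)]
    upper = draws[int((1 - tail) * n_draws) - 1]
    return lower, upper

alpha0, beta0 = 4.57, 13.83  # fitted prior parameters from the text
goals, games = 7, 33         # ASSUMED season totals (about 0.21 per game)

alpha_post, beta_post = alpha0 + goals, beta0 + games
post_mean = alpha_post / beta_post
lower, upper = gamma_interval(alpha_post, beta_post)

print(f"posterior mean = {post_mean:.2f}")
print(f"approximate 95% interval = ({lower:.2f}, {upper:.2f})")
```

The “true” scoring rate from the following season can then be checked against this interval, rather than comparing whole goal distributions by eye.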

2014/2015 Season

Analysis here.

[SOME GRAPH TO COMPARE PREDICTED AND ACTUAL PARAMETERS]

2015/2016 Season

Analysis here.

Predictions for 2017/2018

Next, we establish the model for all players using data up to the 2014/2015 season and update the priors with data from the 2015/2016 season. Due to data constraints, we do not have data for the 2016/2017 season. Hence, we will use these “predicted” parameters for comparison with the true outcomes in the 2017/2018 season. I expected a high degree of inaccuracy in the results.

The graph below shows