Introduction

The Premier League(officially known as The Football Association Premier League Limited) [1] is the highest level of professional football in the English football league system. There are twenty clubs that compete against each other every season, playing in a total of 380 matches every year. The Fantasy Premier League (FPL) is the official fantasy football game of the English Premier League, with over 11.5 million players in 2023 [2].

The dataset used in this project is obtained from kaggle.com, collected by PAOLO MAZZA. [3] It is a basic dataset containing the aggregated fantasy premier league data taken from the official website Fantasy Premier League [4]. In FPL, players create a fictional team composed of real-life football players and earn points based on their performance statistics or perceived on-field contributions. [5]

This project aims to explore player performance statistics and their correlation to total points using EDA, ETL, and visualization. We will observe and analyze the results and try to identify better strategies for playing FPL.

Data Preparation

Load necessary libraries.

# install.packages("tidyverse")
# install.packages("viridis")
# install.packages("hrbrthemes")
# install.packages("gridExtra")

library(tidyverse)
library(viridis)
library(hrbrthemes)
library(gridExtra)

Load the dataset from a CSV file.

path <- "./FPL_Dataset_2022-2023.csv"
fpl_raw_data <- read.csv(path)

Exploratory Data Analysis

Overview Data

To begin, we need to review the input dataset. We can do this by using the glimpse() function to get a general overview of the raw data. Next, we can use the str() function to examine the dataset structure. Then, we can use the summary() function to get a summary of the dataset. Finally, we can use the head() function to take a look at the first six lines of the dataset.

glimpse(fpl_raw_data)
str(fpl_raw_data)

The glimpse() function and str() function show that the dataset has 778 rows and 75 columns and contains continuous and discrete variables of various data types including integer, character, double, etc. The dataset also has missing values(NA) that need to be handled.

summary(fpl_raw_data)

The summary() function shows that each row in the dataset represents a player, resulting in a total of 778 players for the 2022-23 season. The total point for each player ranges from 0 to 272, with an average of 40.78. The cost for each player ranges from 37 to 131, with a median of 45. According to the rule from the official FPL website [6], a player’s value should be between 3.0 and 14.0, and the total budget for 15 selected players is 100. We can assume that the cost values are 10 times larger for easier storage in the database. The cost values might be a useful statistic to explore further.

head(fpl_raw_data)

Using the head() function, we can view the default first 6 rows. Each row contains various details such as goals, assists, and other statistics.

Initial Transform

Before continuing our exploratory data analysis, we should make an initial transformation to make the dataset more suitable for further use. As discussed earlier, player costs range from 3.0 to 14.0, and we can separate them into three different value levels: Low, Middle, and Premium.

To explore the most important data in FPL, players’ total points, we should first examine the key performance data that can award players points. [7] Extract these data from the raw dataset and rename them. After this step, let’s take a look at the transformed data for its first 5 rows.

# get level function
get_cost_level <- function(cost_bin) {
  level <- levels(cut_interval((fpl_raw_data$now_cost) / 10, 3))
  case_when(
    cost_bin == level[1] ~ "Low",
    cost_bin == level[2] ~ "Middle",
    cost_bin == level[3] ~ "Premium",
    .default = NA
  )
}

# transform data
fpl_data <- fpl_raw_data[c(
  "web_name",
  "team",
  "position",
  "total_points",
  "points_per_game",
  "now_cost",
  "minutes",
  "goals_scored",
  "expected_goals",
  "assists",
  "expected_assists",
  "goals_conceded",
  "expected_goals_conceded",
  "clean_sheets",
  "penalties_order"
)] %>%
  rename(all_of(
    c(
      Player = "web_name",
      Club = "team",
      Position = "position",
      Points = "total_points",
      Points90 = "points_per_game",
      Cost = "now_cost",
      Minutes = "minutes",
      Goals = "goals_scored",
      xG = "expected_goals",
      Assists = "assists",
      xA = "expected_assists",
      GC = "goals_conceded",
      xGC = "expected_goals_conceded",
      CS = "clean_sheets",
      PK_Taker = "penalties_order"
    )
  )) %>%
  # cost is 10 times larger because it is easier to store an integer in the database
  # now we can return it to its actual value by dividing it by 10
  mutate(Cost = Cost / 10) %>%
  # separate the players into three groups based on their values
  mutate(Cost_bin = cut_interval(Cost, 3), .after = Cost) %>%
  # categorize player level by their values
  mutate(Level = get_cost_level(Cost_bin), .after = Cost_bin)

# create a constant for further use
Positions <-  c("FWD", "MID", "DEF", "GKP")

# show the table(first five rows)
knitr::kable(fpl_data[1:5,], caption =  "First 5 rows of fpl data")
First 5 rows of fpl data
Player Club Position Points Points90 Cost Cost_bin Level Minutes Goals xG Assists xA GC xGC CS PK_Taker
Xhaka Arsenal MID 153 4.1 4.8 [3.7,6.83] Low 2992 7 4.65 8 3.89 35 36.52 13 NA
Elneny Arsenal MID 6 1.2 4.1 [3.7,6.83] Low 111 0 0.00 0 0.04 2 1.29 0 NA
Holding Arsenal DEF 21 1.5 4.2 [3.7,6.83] Low 562 1 0.32 0 0.15 13 11.14 0 NA
Partey Arsenal MID 86 2.6 4.7 [3.7,6.83] Low 2480 3 2.59 0 2.17 28 32.27 11 NA
Ødegaard Arsenal MID 212 5.7 6.9 (6.83,9.97] Middle 3132 15 9.75 8 8.02 38 37.94 13 NA

Performance Data Cleaning and Transform

In the next stage, we will try to find out the performance of players to determine why elite players are able to score more points. One way to do this is to use expected values such as expected goals (xG) [8], expected assists (xA) [9], and expected goals conceded (xGC) [10]. Fortunately, all of this data is included in the raw dataset.

As players have different playing times, it is inaccurate to use actual or expected values directly. Therefore, we will introduce per 90 statistics, representing values per 90 minutes of play. We will transform Points, Goals, xG, Assists, xA, GoalsConcede and xGC into Points per 90, Goals per 90, xG per 90, Assists per 90, xA per 90, GoalsConcede per 90 and xGC per 90.

fpl_data <- fpl_data %>%
  mutate(Goals90 = Goals / (Minutes / 90), .after = Goals) %>%
  mutate(xG90 = xG / (Minutes / 90), .after = xG) %>%
  mutate(Assists90 = Assists / (Minutes / 90), .after = Assists) %>%
  mutate(xA90 = xA / (Minutes / 90), .after = xA) %>%
  mutate(GC90 = GC / (Minutes / 90), .after = GC) %>%
  mutate(xGC90 = xGC / (Minutes / 90), .after = xGC)

# new data table
knitr::kable(fpl_data[1:5, 10:17], caption = "First 5 rows of fpl data  with added per 90 values")
First 5 rows of fpl data with added per 90 values
Goals Goals90 xG xG90 Assists Assists90 xA xA90
7 0.2105615 4.65 0.1398730 8 0.2406417 3.89 0.1170120
0 0.0000000 0.00 0.0000000 0 0.0000000 0.04 0.0324324
1 0.1601423 0.32 0.0512456 0 0.0000000 0.15 0.0240214
3 0.1088710 2.59 0.0939919 0 0.0000000 2.17 0.0787500
15 0.4310345 9.75 0.2801724 8 0.2298851 8.02 0.2304598
# count missing values
missing_values <- apply(is.na(fpl_data), 2, sum)
na_values <- which(missing_values > 0)
na_values
##   Goals90      xG90 Assists90      xA90      GC90     xGC90  PK_Taker 
##        11        13        15        17        19        21        23
# if the actual value and expected value are both 0, then the actual value of 90 and expected value of 90 will result in NaN
# if the expected 90 values are NaN, it means that the player did not play a single minute
# replace NaN with NA
fpl_data$Goals90 <-
  ifelse(is.na(fpl_data$Goals90), 0, fpl_data$Goals90)
fpl_data$xG90 <- ifelse(is.na(fpl_data$xG90), 0, fpl_data$xG90)
fpl_data$Assists90 <-
  ifelse(is.na(fpl_data$Assists90), 0, fpl_data$Assists90)
fpl_data$xA90 <- ifelse(is.na(fpl_data$xA90), 0, fpl_data$xA90)
fpl_data$GC90 <- ifelse(is.na(fpl_data$GC90), 0, fpl_data$GC90)
fpl_data$xGC90 <- ifelse(is.na(fpl_data$xGC90), 0, fpl_data$xGC90)
# look at the missing values again, they have been cleaned
missing_values <- apply(is.na(fpl_data), 2, sum)
missing_values
##    Player      Club  Position    Points  Points90      Cost  Cost_bin     Level 
##         0         0         0         0         0         0         0         0 
##   Minutes     Goals   Goals90        xG      xG90   Assists Assists90        xA 
##         0         0         0         0         0         0         0         0 
##      xA90        GC      GC90       xGC     xGC90        CS  PK_Taker 
##         0         0         0         0         0         0       718
performance_data <- fpl_data[c(
  "Player",
  "Club",
  "Position",
  "Level",
  "Cost",
  "Points",
  "Goals",
  "Assists",
  "GC",
  "Minutes",
  "Points90",
  "Goals90",
  "xG90",
  "Assists90",
  "xA90",
  "GC90",
  "xGC90",
  "CS"
)]

Penalty Takers Data Cleaning and Transform

Penalties are the most likely opportunities to score during a football match, with each penalty shoot given an xG of 0.78 from the StatsBomb xG 2022 model. [11] Therefore, clubs should wisely choose their penalty takers. A player with a higher conversion ratio would be superior at finishing, increasing the likelihood of scoring the penalty kick.

If we are trying to analyze the data on penalty takers, we should make some cleanings and transformations first. The original dataset, penalty order, treats players who are not among the first three in their club’s penalty taking order as missing values. To solve this problem, we can split the players in each club into two groups: penalty takers(pk taker) and non-penalty takers(non pk taker). We can then calculate the goal performance for both groups.

# check NA
sum(is.na(fpl_data$PK_Taker))
## [1] 718
# divide players into two categories, penalty takers and non-penalty takers
# convert numerical values to logical values
pk_taker_data <- fpl_data %>%
  select(c("Player", "Club", "Goals90", "xG90", "PK_Taker")) %>%
  mutate(PK_Taker = ifelse(is.na(PK_Taker), FALSE, TRUE))
# group by pk_taker, observe the goals performance
# use a FUN = mean here, because each group may contain a different number of players
pk_taker_data <-
  aggregate(cbind(Goals90, xG90) ~ Club + PK_Taker,
            data = pk_taker_data,
            FUN = mean) %>%
  # calculate the different between Goals90 and xG90
  mutate(Diff90 = Goals90 - xG90) %>%
  arrange(Club, PK_Taker)

# show the table(first eight rows)
knitr::kable(pk_taker_data[1:8,], caption =  "First 8 rows of PK Taker data")
First 8 rows of PK Taker data
Club PK_Taker Goals90 xG90 Diff90
Arsenal FALSE 0.1102035 0.0819119 0.0282916
Arsenal TRUE 0.4531829 0.4136932 0.0394897
Aston Villa FALSE 0.0430851 0.0515743 -0.0084893
Aston Villa TRUE 0.2736389 0.3116063 -0.0379674
Bournemouth FALSE 0.0519630 0.0570702 -0.0051072
Bournemouth TRUE 0.0942408 0.1738094 -0.0795685
Brentford FALSE 0.0305403 0.0690410 -0.0385006
Brentford TRUE 0.3836538 0.3228561 0.0607976

Exploratory Points Distribution

Distribution of Points Among Players

The first thing we want to explore is the distribution of points for players, as points are what FPL is all about. To gain a clear insight, we can analyze players who played regularly, which means they played at least 1000 minutes in 2022-23 season.

fpl_data %>%
  filter(Minutes >= 1000) %>%
  ggplot(aes(x = Points)) +
  geom_density(fill = "#69b3a2",
               color = "#e9ecef",
               alpha = 0.6) +
  xlim(0, 300) +
  ggtitle("Points distribution of players") +
  theme_ipsum() +
  theme(plot.title = element_text(size = 20))

The density plot shows that players’ total points are centered around 50-100, although top players can score over 200 points in a season. According to FPL rules, players from different positions have different methods for calculating scoring points. Therefore, we can group players by position and observe their points distributions.

fpl_data %>%
  filter(Minutes >= 1000) %>%
  ggplot(aes(x = Points, group = Position, fill = Position)) +
  geom_density(adjust = 1.5, alpha = 0.6) +
  xlim(0, 300) +
  ggtitle("Points distribution of players") +
  labs(subtitle = "Group by Player Position") +
  theme_ipsum() +
  theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 14))

According to the multiple density plot above, it shows that the distribution of defenders is the most concentrated, and they scored fewer points than other positions. While goalkeepers have a wider range of points, even the best goalkeeper cannot score very high points. The distribution of midfielders and forwards is similar, but top forwards have the potential to score the most points in a season.

Top 30 Players

Our next step is to examine the top 30 players.

fpl_data %>%
  arrange(-Points) %>%
  head(30) %>%
  mutate(Player = paste0(Player, " - ", Cost, "m", sep = " ")) %>%
  mutate(Player = fct_reorder(Player, Points)) %>%
  ggplot(aes(x = Player, y = Points, fill = Position)) +
  geom_bar(stat = "identity",
           alpha = 0.8,
           width = 0.8) +
  geom_label(
    aes(y = Points, label = Points),
    position = position_stack(vjust = 1),
    size = 2,
    show.legend = F
  ) +
  ylim(0, 300) +
  coord_flip() +
  scale_color_brewer(palette = 3) +
  ggtitle("Top 30 Players with the Highest Points") +
  xlab("") +
  theme(plot.title = element_text(hjust = .5)) +
  theme_ipsum()

The horizontal barplot confirms our observations earlier that the top 6 players, all of whom scored more than 200 points, are midfielders and forwards. Additionally, players who cost more in the same position tend to score more points. This suggests a correlation between points, positions, and values. We should explored that in the next step.

Points, Positions and Values

fpl_data %>%
  ggplot(aes(x = Level, y = Points, fill = Position)) +
  geom_boxplot() +
  stat_summary(
    fun = mean,
    geom = "point",
    shape = 20,
    size = 10,
    color = "#ec7014",
    fill = "#ec7014"
  ) +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6) +
  ylim(0, 300) +
  xlab("Level") +
  ggtitle("Total points distribution across different levels") +
  theme_ipsum()

Based on the total points distribution across different levels, we can draw some observations. Goalkeepers have the lowest average cost among all positions. However, the boxplot shows that there are more outliers for GKP than for any other position. This could shows us that even if we spend less on GKP, they may have a higher chance of providing a better reward.

It can also be observed that higher level players score a higher mean of points. There may be a positive correlation between cost and points. All top level players are forwards and midfielders, as they have the most attacking resources available to help them score or assist a goal, making them scoring a higher number of FPL points.

Exploratory Actual and Expected Values

Managing Invalid Per 90 Values

Let’s begin by examining the distribution of actual per 90 values.

# create a new data frame for plot[Cost, Value, Type], the current one is not a tidy data
# reshape data here, use provide_long() to create a tidy data
performance_data[c("Minutes",
                   "Points90",
                   "Goals90",
                   "Assists90",
                   "GC90")] %>%
  pivot_longer (2:5, names_to = "Per90", values_to = "Value") %>%
  ggplot(aes(
    x = Per90,
    y = Value,
    fill = Per90,
    color = Per90
  )) +
  geom_boxplot() +
  scale_fill_viridis(discrete = T) +
  scale_color_viridis(discrete = T) +
  geom_jitter(alpha = .5, size = 1) +
  ggtitle("Distribution of per 90 Values") +
  xlab("") +
  ylab("Actual per 90") +
  theme_ipsum() +
  theme(legend.title = element_blank())

The boxplot shows that there are outliers present in the actual values that deviate significantly from the others. Let’s go back to the calculations. When calculating the per 90 values, it is important to consider that a player who has not played a total of 90 minutes in a season, which will make them have extremely high per 90 values. Therefore, it is necessary to exclude these values as they are invalid.

performance_data <- performance_data %>%
  mutate_all(~ ifelse(Minutes < 90, NA, .)) %>%
  na.omit()

Players’ Performance

To observe the correlation between a player’s cost and their performance, we can focus on players who have played more than 90 minutes and scored at least one goal or provided at least one assist for an entire season. These regular players are the foundation of every club’s attacking performance. We can use a scatter plot with the geom_smooth() function to observe any possible correlation. For a small dataset, we will use LOESS method to fit the smooth curve. [12]

# goals performance
p1 <- performance_data %>%
  filter(Goals > 1 & Minutes > 1000) %>%
  filter(Position %in% c("FWD", "MID")) %>%
  pivot_longer (12:13, names_to = "Performance", values_to = "Value") %>%
  select(c("Cost", "Performance", "Value")) %>%
  ggplot(aes(x = Cost, y = Value, color = Performance)) +
  geom_point() +
  geom_smooth(method = loess) +
  ylab("Per 90") +
  labs(color = "Attackers") +
  theme_ipsum()

# assists performance
p2 <- performance_data %>%
  filter(Assists > 1 & Minutes > 1000) %>%
  filter(Position %in% c("FWD", "MID")) %>%
  pivot_longer (14:15, names_to = "Performance", values_to = "Value") %>%
  select(c("Cost", "Performance", "Value")) %>%
  ggplot(aes(x = Cost, y = Value, color = Performance)) +
  geom_point() +
  geom_smooth(method = loess) +
  ylab("Per 90") +
  labs(color = "Attackers") +
  theme_ipsum()

# goalsConcede performance
p3 <- performance_data %>%
  filter(GC > 1 & Minutes > 1000) %>%
  filter(Position %in% c("DEF", "GKP")) %>%
  pivot_longer (16:17, names_to = "Performance", values_to = "Value") %>%
  select(c("Cost", "Performance", "Value")) %>%
  ggplot(aes(x = Cost, y = Value, color = Performance)) +
  geom_point() +
  geom_smooth(method = loess) +
  ylab("Per 90") +
  labs(color = "Defenders") +
  theme_ipsum()

suppressMessages({
  grid.arrange(p1, p2, p3, ncol = 2)
})

Examining the first row of the grid plot, we can observe the performance of attackers (forwards and midfielders) who played regular minutes in terms of Goals and Assists. The results shows that as the cost of players increases from low to premium level, their attacking performance also improves. This suggests that highly-priced players have a higher conversion ratio and are more skilled at finishing or assisting compared to other players. Therefore, investing in premium players could be a recommended strategy when playing FPL.

In the second row, we can see how defenders and goalkeepers perform. The chart shows that the actual and expected values are nearly the same across all levels. This means that investing heavily in defenders may not have better results than investing in low level ones. This may be because defense is a team effort, and having a couple of elite players may not make a significant difference.

Exploratory Club’s Penalty Takers

To visualize the performance of players who take penalty kicks and those who do not, we can use a stacked barplot.

A positive difference occurs when a player’s actual value is greater than their expected value, which means they performed better than most other players. Conversely, a negative difference occurs when a player’s actual value is less than their expected value, indicating that they performed worse than others. A higher diff value suggests that the player has better shooting technique and is more likely to score goals.

pk_taker_data %>%
  ggplot(aes(x = Club, y = Diff90, fill = PK_Taker)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_viridis(
    discrete = TRUE,
    alpha = 0.6,
    begin = 0.4,
    end = 0.8,
    labels = c("No", "Yes"),
    name = "PK Takers"
  ) +
  xlab("") +
  ggtitle("Comparison of PK and non-PK takers' performance") +
  theme_ipsum() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Based on the plot, it is clear that most of clubs made wise choices when selecting their penalty takers, as the performance of those who took penalties were better than those who did not. But some clubs like Arsenal, Leicester, and Liverpool, particularly Arsenal, may need to reconsider their penalty takers.

Exploratory General and Performance Statistics

As discussed at the very beginning, let’s go back for the top 30 players.

we can create a heatmap to display the correlation between player performance and positions with general statistics like points and minutes played. We will divide the statistics into three groups: general stats, which include Points90 and Minutes; offensive stats, which include Goals90, xG90, Assists90, and xA90; and defensive stats, which include GC90, xGC90, and CS.

plot_data <- performance_data %>%
  arrange(-Points) %>%
  head(30) %>%
  mutate(
    Position_name = paste(Position, Player, sep = " - "),
    Side_color = case_when(
      Position == Positions[1] ~ "#7fc97f",
      Position == Positions[2] ~ "#386cb0",
      Position == Positions[3] ~ "red",
      Position == Positions[4] ~ "#fdb462"
    )
  )

# create heatmap data
heatmap_data <- plot_data %>%
  select(Minutes:CS) %>%
  as.matrix()

# color, green for below average, blue for above average
color_1 <-
  colorRampPalette(colors = c('#00FF87', '#33CCFF'))(6)[1:5]
color_2 <- colorRampPalette(colors = c('#33CCFF', '#5190FF'))(6)

# classify general stat, attack stat and defense stat
col_side_color <-
  case_when(
    attr(heatmap_data, which = "dimnames")[[2]] %in% c("Goals90",
                                                       "xG90",
                                                       "Assists90",
                                                       "xA90") ~ "#ec7014",
    attr(heatmap_data, which = "dimnames")[[2]] %in% c("GC90",
                                                       "xGC90",
                                                       "CS") ~ "#737373",
    .default = "purple"
  )

# plot
heatmap(
  heatmap_data,
  labRow = plot_data$Position_name,
  scale = "column",
  col = c(color_1, color_2),
  cexRow = 1.2,
  cexCol = 1.2,
  RowSideColors = plot_data$Side_color,
  ColSideColors = col_side_color
)

# add legends
legend(
  "topleft",
  legend = c(
    "Position - FWD",
    "Position - MID",
    "Position - DEF",
    "Position - GKP",
    "General Stat",
    "Offensive Stat",
    "Defensive Stat",
    "Above average",
    "Average",
    "Below Average"
  ),
  cex = 0.7,
  fill = c(
    "#7fc97f",
    "#386cb0",
    "red",
    "#fdb462",
    "purple",
    "#ec7014",
    "#737373",
    "#5190FF",
    "#33CCFF",
    "#00FF87"
  )
)

The heatmap shows that attacking players have higher scores due to their strong attacking statistics. Elite players perform similarly to their expected values, Indicating their capability to translate expectations into actual results.

Among the top 30 players, defenders have strong performances in GC(Goals Conceded) and CS(Clean Sheet), and most of their points coming from defensive rewards. However, some defenders such as Tripper and Alexander-Arnold have outstanding assist performances, making them better picks.

These Observations can guide our FPL playing strategy. Firstly, select as many elite attackers as possible, because of their high conversion rates. Secondly, choose defenders who have a strong potential for assists.

Conclusion

This report examines the data from the 2022-23 season of FPL player. First, we find out a positive correlation between points and values by analyzing the points distribution. Then, we compare players’ actual and expected values, and find out that a player’s high conversion rate is likely to result in a high score. Lastly, we explore club penalty takers as well as player performance and general statistics.

We have some advice for Fantasy Premier League(FPL) players based on our exploratory findings. We should focus on premium attacking players and select defenders who have a high number of assists returns. And we should avoid wasting the budget on goalkeepers and regular defenders, as their rewards are similar across all price levels.

Reference

  1. “English football league system”. en.wikipedia.org.
  2. “How Many Play FPL?”. allaboutfpl.com.
  3. “Fantasy Premier League Dataset 2022-2023”. www.kaggle.com
  4. “Official Website Fantasy Premier League”. fantasy.premierleague.com
  5. “Fantasy football (association)”. en.wikipedia.org.
  6. “Official Website Fantasy Premier League Rules”. fantasy.premierleague.com
  7. “How are points awarded?”. fantasy.premierleague.com
  8. “expected-goals-xg-explained”. statsbomb.com
  9. “what-are-expected-assists-xa”. theanalyst.com
  10. “xG, xA, xGI and xGC: What are expected stats and who are the best FPL picks?”.
  11. “What Are Expected Goals (xG)?”. statsbomb.com
  12. “LOESS (aka LOWESS)”. nist.gov