Machine Learning Algorithms on English Premier League Data

This analysis groups English Premier League teams into clusters using their statistics from the current season. It uses k-means clustering, hierarchical clustering, principal component analysis, and correspondence analysis to separate teams that are fighting for the title, competing for Champions League places, finishing mid-table, or facing a relegation battle. The analysis aims to reveal the factors that influence team success and to provide a framework for predicting future results.

I downloaded the data for my project using the worldfootballR package, which sources its data from fbref.com. I am grateful to them for providing this valuable data for analysis.

For our analysis, we have selected a subset of columns from the original dataset that are relevant to our research question. These columns include the Squad name, Goals (Gls), Assists (Ast), Goals plus Assists (G_plus_A), Expected Goals (xG_Expected), Non-Penalty Expected Goals (npxG_Expected), Expected Assisted Goals (xAG_Expected), Non-Penalty Expected Goals plus Expected Assisted Goals (npxG_plus_xAG_Expected), Goals per 90 minutes (Gls_Per_Minutes), Assists per 90 minutes (Ast_Per_Minutes), and Goals plus Assists per 90 minutes (G_plus_A_Per_Minutes). These variables capture important aspects of a team’s offensive performance, such as scoring ability, creativity, and efficiency. By focusing on these variables, we hope to gain insights into which teams are likely to perform well in the league, qualify for the Champions League, and avoid relegation.

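# Packages used throughout the analysis
library(worldfootballR)    # fb_season_team_stats()
library(dplyr)             # filter(), select(), %>%
library(tibble)            # as_tibble(), column_to_rownames()
library(NbClust)           # NbClust()
library(factoextra)        # fviz_*() plotting helpers
library(FactoMineR)        # PCA(), CA()
library(flexclust)         # kcca()
library(FeatureImpCluster) # FeatureImpCluster()
library(data.table)        # as.data.table()
library(knitr)             # kable()
library(ggplot2)           # theme(), element_blank()

std_epl <- fb_season_team_stats("ENG", "M", 2023, "1st", "standard")
View(std_epl)

# Team-level attacking statistics, one row per squad
teams <- std_epl %>%
  filter(Team_or_Opponent == "team") %>%
  dplyr::select(Squad, Gls, Ast, G_plus_A, xG_Expected, npxG_Expected, xAG_Expected,
         npxG_plus_xAG_Expected, Gls_Per_Minutes,
         Ast_Per_Minutes, G_plus_A_Per_Minutes) %>%
  as_tibble()   # as.tibble() is deprecated

# Same data with squad names as row names, as the clustering functions expect
std_epl_df <- teams %>%
  column_to_rownames("Squad")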

K-Means Clustering

K-means clustering is a popular unsupervised machine learning technique that partitions a dataset into a predetermined number of clusters based on similarities in the data. In our case, we want to use k-means clustering to identify groups of English Premier League (EPL) teams based on their performance statistics. By doing so, we can see which teams are most similar to each other in terms of their performance metrics and potentially make predictions about which teams are likely to perform well or poorly in the upcoming season.
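As a rough illustration of the idea, here is a minimal sketch on made-up numbers (the `toy` data frame is hypothetical and is not fbref data). It assumes two choices that the analysis in this report does not necessarily make: standardizing the columns with `scale()` so that season totals do not swamp per-90 rates, and using `set.seed()` with `nstart = 25` so that the result is reproducible.

```r
# Toy data standing in for team statistics (hypothetical values, not fbref data).
set.seed(42)
toy <- data.frame(
  Gls = c(70, 68, 45, 44, 43, 30, 28, 27),
  Ast = c(50, 49, 30, 31, 29, 18, 17, 19),
  Gls_Per_Minutes = c(2.4, 2.3, 1.5, 1.5, 1.4, 1.0, 0.9, 0.9),
  row.names = paste0("Team", 1:8)
)

# Standardize so that per-90 rates are not swamped by raw season totals.
toy_scaled <- scale(toy)

# nstart = 25 reruns the algorithm from 25 random starts and keeps the best fit.
fit <- stats::kmeans(toy_scaled, centers = 3, nstart = 25)

# Cluster numbers are arbitrary labels: only the grouping itself is meaningful.
split(rownames(toy), fit$cluster)
```

Because the numbering can change from run to run, it is safer to describe clusters by their members than by their label.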

NbClust(std_epl_df, method = "complete", index = 'hartigan')$Best.nc #4

## Number_clusters     Value_Index 
##          4.0000         25.3361

k = stats::kmeans(std_epl_df, 4)

names(k)

## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

k$cluster

##         Arsenal     Aston Villa     Bournemouth       Brentford        Brighton 
##               1               2               3               2               4 
##         Chelsea  Crystal Palace         Everton          Fulham    Leeds United 
##               2               3               3               2               2 
##  Leicester City       Liverpool Manchester City  Manchester Utd   Newcastle Utd 
##               2               4               1               4               4 
## Nott'ham Forest     Southampton       Tottenham        West Ham          Wolves 
##               3               3               4               3               3

# Keep the labels in `teams` only, so that std_epl_df still holds just the
# ten performance metrics for the plots and analyses that follow.
teams$cluster <- unname(k$cluster)


#Visualize ============ #
#from the package factoextra; a ggplot extension

fviz_cluster(k, std_epl_df, repel = TRUE, ggtheme = theme_minimal()) +
  theme(legend.position = "none")

Using four clusters makes sense because we want to sort the teams into four groups: title contenders, teams competing for Champions League places, mid-table teams that should avoid relegation, and teams in the relegation battle. The Hartigan index reported by NbClust above also selects four clusters. Each cluster represents a distinct profile on performance metrics such as goals scored, assists made, expected goals, and expected assists. Grouping teams with similar profiles lets us compare them and estimate their likely outcomes in the league, which is difficult to do by looking at each team’s metrics in isolation.
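One common sanity check on the number of clusters is the "elbow" in the total within-cluster sum of squares. The sketch below uses simulated points (hypothetical, built to contain four groups) rather than the EPL data, and it is a different criterion from the Hartigan index used above, so treat it only as a complementary view.

```r
set.seed(1)
# Hypothetical two-column data with four well-separated groups of five points.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
toy <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(5, centers[i, 1], 0.5), rnorm(5, centers[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  stats::kmeans(toy, centers = k, nstart = 25)$tot.withinss
})
round(wss, 1)

# For this toy data the curve drops steeply up to k = 4 and then flattens.
plot(1:6, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```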

The cluster map shows this separation: the title contenders sit in their own cluster, as do the teams fighting for the Champions League positions, the mid-table teams, and the teams in the relegation battle. Let’s look at the same result in table form to see which teams belong to each cluster.

Table to show the Team Clusters

team_clust<- kable(teams[,c("Squad", "cluster")], caption = "Squad Clustering Results")

team_clust

Squad Clustering Results
Squad	cluster
Arsenal	1
Aston Villa	2
Bournemouth	3
Brentford	2
Brighton	4
Chelsea	2
Crystal Palace	3
Everton	3
Fulham	2
Leeds United	2
Leicester City	2
Liverpool	4
Manchester City	1
Manchester Utd	4
Newcastle Utd	4
Nott’ham Forest	3
Southampton	3
Tottenham	4
West Ham	3
Wolves	3

epl_df = std_epl_df

Based on the clustering analysis, we have identified four distinct groups of teams in the English Premier League. K-means numbers its clusters arbitrarily, so the labels here follow the table above. Cluster 1 contains Arsenal and Manchester City, the teams fighting for the title. Cluster 4 contains Brighton, Liverpool, Manchester Utd, Newcastle Utd, and Tottenham. These clubs are fighting for the European competition positions and are likely to finish in the Champions League places. Cluster 2 includes Aston Villa, Brentford, Chelsea, Fulham, Leeds United, and Leicester City. It is too early to say for Leicester City, but these teams are likely to finish mid-table and avoid relegation. Finally, Cluster 3 includes Bournemouth, Crystal Palace, Everton, Nott’ham Forest, Southampton, West Ham, and Wolves. These teams are likely to be in the relegation battle and will have to fight hard to stay in the league. The grouping is consistent with the current points table, which gives us more confidence in the analysis. We now move on to Principal Component Analysis (PCA) and feature importance.
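Since the cluster numbers carry no meaning of their own, one way to make the interpretation less error-prone is to rank the clusters by an outcome such as mean goals. The sketch below uses a hypothetical six-team table (the `toy` values and the `tier` column are illustrative assumptions, not part of the analysis above).

```r
# Hypothetical goal totals and cluster labels (not the real EPL output).
toy <- data.frame(
  Squad   = c("A", "B", "C", "D", "E", "F"),
  Gls     = c(70, 66, 40, 42, 25, 27),
  cluster = c(3, 3, 1, 1, 2, 2)
)

# Mean goals per cluster, sorted from highest to lowest.
cluster_means <- sort(tapply(toy$Gls, toy$cluster, mean), decreasing = TRUE)

# Tier 1 = highest-scoring cluster, whatever number k-means gave it.
tier_of_cluster <- setNames(seq_along(cluster_means), names(cluster_means))
toy$tier <- unname(tier_of_cluster[as.character(toy$cluster)])
toy
```

With a tier column like this, the written interpretation stays valid even if a rerun shuffles the raw cluster numbers.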

Hierarchical Clustering

Before we move to PCA and feature importance, let’s look at several ways to visualize these clusters using hierarchical clustering:

Cluster Dendrogram

A cluster dendrogram is a visual representation of the hierarchical clustering algorithm. It shows how the observations are grouped into clusters based on their similarity. The height at which two branches merge represents the distance between the clusters being joined: a merge higher up the tree indicates a larger distance, and a merge lower down indicates a smaller one.
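To make the merge heights concrete, here is a small base-R sketch on hypothetical values. The complete-linkage method and the scaling step are assumptions chosen for illustration, not necessarily the settings behind the figures in this report.

```r
# Hypothetical attacking totals for six teams (illustrative values only).
toy <- data.frame(
  Gls = c(70, 68, 45, 44, 28, 27),
  Ast = c(50, 49, 30, 31, 17, 19),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# Euclidean distances between teams, then complete-linkage clustering.
d <- dist(scale(toy))
hc <- hclust(d, method = "complete")

# Merge heights: each value is the distance at which two branches join.
hc$height

# Cut the tree into three groups and draw it.
groups <- cutree(hc, k = 3)
plot(hc, main = "Toy dendrogram")
groups
```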

The different hierarchical views help us see which teams are most similar to each other within the clusters. The following graphs and figures dive deeper.

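# `h` is not defined earlier in this report. As an assumption, build it with
# complete linkage (the method passed to NbClust above) on Euclidean distances.
h <- hclust(dist(std_epl_df), method = "complete")

suppressWarnings({

  fviz_dend(h, k = 4)

  })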

This cluster dendrogram shows some interesting comparisons. Starting with the fight for European competition, the graph (and some personal bias) suggests that Tottenham, who are currently 4th in the table (as of 25th March 2023), are in contention for their spot with Liverpool, who are 6th and just 7 points behind with 2 games in hand. Brighton (7th) under Roberto De Zerbi have been really impressive, and the media has covered this a lot. Compared with Manchester United (3rd), Brighton have scored 5 more goals and conceded 4 fewer. Still, I believe Manchester United will clinch third position given their current run of form. All four teams have Newcastle to worry about, who have the best defensive record in the league (defensive metrics are not considered in this analysis).

For the relegation battle, there are some interesting matchups in this graph. Crystal Palace (12th) are just 4 points above last-placed Southampton (both have played the same number of games) and have recently struggled to score goals and pick up points, which led to the sacking of their head coach. If Crystal Palace do not improve their goal scoring, they are certainly in contention to get relegated. Wolves and Nottingham Forest are separated by just 1 point. Both have scored the fewest goals in the league, but I would say Wolves have the upper hand because Nott’ham Forest have conceded the most goals. Everton and West Ham are the most interesting matchup to me, as both teams have struggled somewhat unexpectedly this season, scoring few goals and conceding costly ones. I would say neither of these teams will get relegated in the end, but nothing is certain in the Premier League. Maybe they need Nate the Great to help avoid getting relegated and letting AFC Richmond get the last laugh (just a Ted Lasso reference, please ignore).

Phylogenic Dendrogram

This is another way to visualize the clusters formed by hierarchical clustering. The dendrogram is drawn as a tree-like diagram of the kind used to show evolutionary relationships, here applied to the relationships among the clusters. It is very similar to the one above, and we can draw the same conclusions.

fviz_dend(h, k = 4, repel = TRUE,  type = "phylogenic")

Circular Dendrogram

Now I am just showing off. I swear this is the last.

fviz_dend(h, k = 4,  repel = TRUE, type = "circular")

Feature Importance

The FeatureImp plot is a graphical representation of the relative importance of the original variables in the clustering process. FeatureImpCluster measures importance by permutation: it shuffles the values of one variable at a time, reassigns the teams to clusters, and records the share of teams whose cluster changes (the misclassification rate). The plot shows one entry per variable, ordered by this rate. The more the assignments change when a variable is shuffled, the more that variable matters for separating the clusters. This lets us identify the variables that drive the clustering and helps in understanding the characteristics of each cluster.
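The permutation idea can be sketched by hand in base R. This is a simplified illustration on simulated data, assuming plain k-means and nearest-center reassignment. It is not the FeatureImpCluster implementation, and `nearest_center` is a hypothetical helper written for this example.

```r
set.seed(123)
# Hypothetical data: 'signal' separates two groups, 'noise' does not.
toy <- data.frame(
  signal = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  noise  = rnorm(40)
)

fit <- stats::kmeans(toy, centers = 2, nstart = 25)

# Hypothetical helper: assign each row to its nearest fitted center.
nearest_center <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums((as.matrix(x) - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
  })
  max.col(-d, ties.method = "first")
}

baseline <- nearest_center(toy, fit$centers)

# Shuffle one column at a time and measure how many rows switch cluster.
misclass <- sapply(names(toy), function(v) {
  mean(replicate(50, {
    shuffled <- toy
    shuffled[[v]] <- sample(shuffled[[v]])
    mean(nearest_center(shuffled, fit$centers) != baseline)
  }))
})
misclass
```

Shuffling `signal` moves many rows to the other cluster, while shuffling `noise` moves almost none, which is the pattern an importance plot summarizes.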

set.seed(12345)
result = kcca(std_epl_df,k=4)

FeatureImp_res = FeatureImpCluster(result,as.data.table(std_epl_df))
plot(FeatureImp_res)

The Feature Importance plot shows the most important variables for distinguishing between the clusters. According to the plot, the most important variable for our clustering analysis is “G_plus_A”, which stands for goals plus assists. The second most important variable is “npxG_plus_xAG_Expected”, which stands for non-penalty expected goals plus expected assisted goals, and the third is “Ast”, which stands for assists. This suggests that these three variables are the most useful for differentiating the teams’ performance and their placement in the clusters, so we can focus on them when comparing teams.

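epl.pca = PCA(epl_df, ncp=3, graph = FALSE)
epl.varcontrib <- get_pca_var(epl.pca)

# Five variables with the largest contribution to the first component.
# Sort first: head() on the unsorted matrix would just take the first five rows.
pc1_contrib <- sort(epl.varcontrib$contrib[, 1], decreasing = TRUE)
top_vars <- names(head(pc1_contrib, n = 5))

fviz_screeplot(epl.pca, addlabels = TRUE)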

The scree plot for the EPL dataset shows the proportion of variance explained by each principal component. The first principal component captures the majority of the variance in the data, with a sharp drop-off in explained variance for subsequent components. The next graphs show which variables drive these components.
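The percentages on a scree plot come straight from the component variances. A minimal sketch with base R's `prcomp()` on hypothetical, strongly correlated columns (not the EPL data, and not the FactoMineR call used above):

```r
# Hypothetical, strongly correlated attacking stats (illustrative values only).
toy <- data.frame(
  Gls = c(70, 68, 45, 44, 28, 27),
  Ast = c(50, 49, 30, 31, 17, 19),
  xG  = c(66, 64, 47, 43, 31, 29)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component (what a scree plot draws).
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_share, 1)
```

When the inputs are all measures of the same underlying attacking strength, as here, almost all of the variance lands on the first component.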

fviz_contrib(epl.pca, choice = "var", axes = 1, top = 15)

All of the variables shown above contribute heavily to the first principal component.

fviz_contrib(epl.pca, choice = "var", axes = 2, top = 10)

PCA - Variables

var_pca <-fviz_pca_var(epl.pca, select.var = list(name = top_vars))

var_pca + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none")

The fviz_pca_var plot shows how each variable relates to the first two principal components. Each variable is represented by an arrow that starts from the origin and points in the direction of its correlation with the two components.

The length of the arrow indicates how well the variable is represented by these two components, and the angle between two arrows indicates the correlation between the variables: a small angle means a strong positive correlation.

In this plot, all the variables point towards the right, which indicates that they are positively correlated with the first principal component. This means that teams that perform well on these variables score high on that component and are likely to have a higher overall performance in the English Premier League.
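The arrow coordinates in a variable plot are the correlations between each variable and the component scores. Here is a hedged base-R sketch on hypothetical data, using `prcomp()` rather than the FactoMineR object from this report:

```r
# Hypothetical, positively correlated attacking stats (illustrative values only).
toy <- data.frame(
  Gls = c(70, 68, 45, 44, 28, 27),
  Ast = c(50, 49, 30, 31, 17, 19),
  xG  = c(66, 64, 47, 43, 31, 29)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Correlation of each original variable with each component's scores.
var_cor <- cor(toy, pca$x)
round(var_cor[, 1:2], 2)

# For standardized data these equal the loadings scaled by the component sd.
scaled_loadings <- pca$rotation[, 1] * pca$sdev[1]
```

Note that the sign of a principal component is arbitrary, so arrows pointing right rather than left is a convention of the fit. What matters is that all variables point the same way along the first component.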

PCA - Biplot

suppressWarnings({
biplot <- fviz_pca_biplot(epl.pca, 
                repel = TRUE,
                col.var = "blue",
                col.ind = "black", 
                legend.title = "Best Performing metrics for squads in the EPL", 
                addEllipses = TRUE,
                max.overlaps = 25,
                select.var = list(name = top_vars))
})


biplot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none")

This graph overlays the teams (points) and the selected variables (arrows) on the first two principal components. A team that lies far along the direction of an arrow has a high value for that variable, and a team on the opposite side has a low value.

Arsenal and Manchester City are located in the top right quadrant (out of the charts, literally) of the graph because they have higher values for the variables that contribute to PC1, which are offensive metrics such as goals and assists. Conversely, the teams in the top left quadrant, which are in the relegation battle, have lower values for these metrics, which likely contributed to their poor performances.

Leicester City is located near the center of the graph, which means their attacking numbers are close to the league average on both components rather than standing out on any one metric. That matches their placement in the mid-table cluster, although, as noted earlier, their final position is harder to call. Overall, this graph provides insight into the relative importance of different variables in explaining the performance of EPL teams.

Perceptual Map

epl.ca = CA(epl_df, graph = FALSE) # from package FactoMineR

suppressWarnings({
plot(epl.ca) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none")
})

## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The CA factor map is a graphical representation of the results of a Correspondence Analysis (CA) performed on the EPL data. In this case, the CA factor map is used to visualize the relationships between the teams and their attributes.

Each point on the map represents a team or a variable, positioned on the first two CA dimensions. Points that lie close together have similar profiles, and a team that lies in the same direction from the origin as a variable has a relatively high share of that variable.
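For readers curious about what sits behind the map, the textbook CA computation can be sketched with a singular value decomposition in base R. The table `N` below is hypothetical, and this is an illustration of the general method rather than a reproduction of FactoMineR's internals.

```r
# Hypothetical non-negative table: rows are teams, columns are count-like stats.
N <- matrix(c(70, 50, 60,
              45, 30, 50,
              28, 17, 35),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("Gls", "Ast", "xG")))

P  <- N / sum(N)    # correspondence matrix
r  <- rowSums(P)    # row masses
cm <- colSums(P)    # column masses

# Standardized residuals from the independence model (outer product of masses).
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
sv <- svd(S)

# Principal coordinates for rows (teams) and columns (variables).
row_coord <- diag(1 / sqrt(r)) %*% sv$u %*% diag(sv$d)
col_coord <- diag(1 / sqrt(cm)) %*% sv$v %*% diag(sv$d)
rownames(row_coord) <- rownames(N)
rownames(col_coord) <- colnames(N)

# Total inertia: the quantity that the map's axes decompose.
total_inertia <- sum(sv$d^2)
round(row_coord[, 1:2], 3)
```

The total inertia equals the chi-squared statistic of the table divided by its grand total, which is why CA is usually described as a way of visualizing departures from independence.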

English Premier League 2023 Analysis

Akshay Kulkarni

2023-03-25