Project overview

Coaching staffs in football increasingly rely on statistics as performance indicators. For a long time, the predominant method of analysis was the so-called eye test, which meant simply watching players. Over time, and with the development of technology, detailed statistics came into use for analysing individual players. The goal of this project is to cluster football players into groups (playstyles) based on their statistics. The dataset used in this analysis contains statistics from the 2018/19 English Premier League season.

Introduction

Football analytics is an important tool for assessing player performance and detecting patterns that can help coaches, scouts, and analysts make sound judgments. In this study, I used clustering algorithms to categorize Premier League players from the 2018-2019 season by important performance indicators. To identify natural groupings among players based on their positions, playing styles, and contributions to their teams, I used k-means clustering.

The dataset contains a range of statistics, including goals, assists, appearances, minutes played, and disciplinary records. These measurements enable a full evaluation of player performance, both offensively and defensively. Clustering players according to these characteristics can show unique profiles such as productive forwards, creative midfielders, and reliable defenders. The goal of this study is to investigate how clustering techniques might be applied to player data to extract useful information. The analysis will provide a more detailed knowledge of how players compare to one another and how they fit into various tactical roles.

In this study, we will identify the most relevant performance measures for clustering and test multiple clustering approaches to determine the best methodology for this dataset.

I will also focus on statistics per 90 minutes, so that players with more playing time do not dominate the charts. That way, players are divided by their playstyle rather than by the minutes they spent on the pitch.
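To make the per-90 idea concrete, here is a minimal sketch of how a season total can be converted into a per-90-minutes rate. The toy data frame, its column names (`goals`, `minutes_played`) and the helper `per_90()` are assumptions made for illustration. The dataset used in this project already ships with per-90 columns, so this only shows the arithmetic behind them.

```r
# Toy data; the column names and values are assumptions made for this sketch.
players <- data.frame(
  full_name      = c("Player A", "Player B", "Player C"),
  goals          = c(10, 5, 0),
  minutes_played = c(2700, 900, 0)
)

# Scale a raw season total to a per-90-minutes rate.
# Players with no minutes get NA instead of a division by zero.
per_90 <- function(total, minutes) {
  ifelse(minutes > 0, total / minutes * 90, NA_real_)
}

players$goals_per_90 <- per_90(players$goals, players$minutes_played)
print(players)
```

In this toy example Player B scored fewer goals in total than Player A, but has the higher rate per 90 minutes. That is the effect the per-90 view is meant to expose.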

Dataset

The dataset utilized for this study includes information on Premier League players from the 2018-2019 season. It contains 465 observations (each representing a player) and 39 performance metrics. The data was imported using the line of code shown below:

Each observation corresponds to a distinct player, with variables representing various aspects of their performance, including minutes played, goals scored, assists, disciplinary records, and efficiency per 90 minutes. These factors provide a detailed breakdown of a player’s contributions to their team throughout the season.

Initially, the dataset included a greater range of statistics. However, many detailed statistics had defective data, which I decided to get rid of. Redundant columns, such as some situational data (for example, home versus away performance), were deleted to improve readability and simplify the study. The final dataset contains a succinct selection of attributes that best reflect the players’ playing styles and roles on the field.

In the following part, we will describe the selected variables in detail and explain their importance in understanding player performance.

Cleaning and organizing data

Packages used in the project

library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(ClusterR)

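Deleting columns with NA or “N/A” values

# Drop every column that contains at least one missing value (NA)
cleaned_data <- PLplayers[, colSums(is.na(PLplayers)) == 0]
# Among the remaining columns, keep only those without the literal string "N/A"
columns_to_keep <- colSums(cleaned_data == "N/A", na.rm = TRUE) == 0
cleaned_data <- cleaned_data[, columns_to_keep]

Deleting duplicated rows

# Keep the first row for each player name
cleaned_data_no_dups <- cleaned_data[!duplicated(cleaned_data$full_name), ]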

Setting players’ names as row names

rownames(cleaned_data_no_dups) <- cleaned_data_no_dups$full_name

Building new datasets based on positions for easier analysis

ForwardsPL <- cleaned_data_no_dups[cleaned_data_no_dups$position == "Forward", ]
MidfieldersPL <- cleaned_data_no_dups[cleaned_data_no_dups$position == "Midfielder", ]
DefendersPL <- cleaned_data_no_dups[cleaned_data_no_dups$position == "Defender", ]
GoalkeepersPL <- cleaned_data_no_dups[cleaned_data_no_dups$position == "Goalkeeper", ]

Number of clusters

Preparing the data for choosing the number of clusters by creating a dataset without character columns and other unnecessary variables

no_cluster <- cleaned_data_no_dups[, !names(cleaned_data_no_dups) %in% c("full_name", "age", "birthday_GMT", "season", "birthday", "league", "position", "Current.Club", "nationality")]
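As a possible alternative to listing the columns by name, the numeric columns can be selected programmatically and then standardised. The sketch below is not part of the original analysis. It uses a made-up toy table and assumes that standardising with `scale()` is wanted, so that variables measured on large scales (season totals) do not dominate the Euclidean distances used by k-means. Note that this approach would also keep numeric columns such as `age` unless they are dropped explicitly.

```r
# Toy stand-in for the cleaned player table; all names and values are made up.
toy <- data.frame(
  full_name     = c("Player A", "Player B", "Player C", "Player D"),
  position      = c("Forward", "Forward", "Defender", "Defender"),
  goals_overall = c(20, 12, 1, 0),
  goals_per_90  = c(0.70, 0.45, 0.03, 0.00),
  stringsAsFactors = FALSE
)

# Keep only numeric columns, whatever they happen to be called.
numeric_only <- toy[, sapply(toy, is.numeric), drop = FALSE]

# Standardise each column to mean 0 and standard deviation 1.
scaled <- scale(numeric_only)

print(colnames(scaled))
print(round(colMeans(scaled), 10))
```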

Creating a chart showing the optimal number of clusters for the dataset according to the silhouette method

nc <- fviz_nbclust(no_cluster, FUNcluster = kmeans, method = "silhouette")
nc

For most charts, I picked the number of clusters suggested by the graph above. In some cases, however, I chose three clusters to highlight interesting patterns in the data.
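The silhouette criterion behind this choice can be sketched by hand. The example below uses synthetic data rather than the player dataset, and the helper `avg_sil_width()` is a hypothetical name introduced here. It computes the average silhouette width of a k-means solution for several values of k and picks the k with the largest value, which is my understanding of what the silhouette method reports.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Synthetic data: two well-separated groups of 30 points each.
x <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 1), ncol = 2)
)
d <- dist(x)  # Euclidean distances between all points

# Average silhouette width for a k-means solution with k clusters.
avg_sil_width <- function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
}

ks <- 2:6
widths <- sapply(ks, avg_sil_width)
best_k <- ks[which.max(widths)]

print(data.frame(k = ks, avg_sil_width = round(widths, 3)))
print(best_k)
```

A silhouette width close to 1 means that points sit well inside their own cluster, while values near 0 indicate overlapping clusters. This is why the k with the highest average is preferred.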

Forwards

This chart differentiates forwards based on their number of goals and assists

kf_mg_ma <- eclust(ForwardsPL[, c("goals_overall", "assists_overall")], 'kmeans', hc_metric = 'euclidean', k=2)

The chart shows how different the roles of individual forwards are. We can reasonably assume that most players in the first cluster (the red one) play centrally, given their high number of goals. Players in the second cluster have a different role: their main task is to provide assists, acting as playmakers rather than finishers. These attackers mostly play on the sides of the pitch.

Midfielders

Chart showing midfielders’ playstyles.

km_g_c <- eclust(MidfieldersPL[, c('goals_involved_per_90_overall','cards_per_90_overall')], "kmeans",
hc_metric="euclidean",k=2)

Players with more goal contributions are more attacking, while those with more yellow or red cards can be seen as defensive midfielders.

To further analyze midfielders based on their attacking abilities, I used assists and goals to determine whether a player is a playmaker or prefers to finish attacks on his own.

km_a_g <- eclust(MidfieldersPL[, c("assists_per_90_overall", "goals_per_90_overall")], 'kmeans',
hc_metric = 'euclidean', k=2)

Here we can see two clusters dividing midfielders based on their assists and goals per 90 minutes.

Defenders

Now let’s analyze defenders’ roles depending on their offensive and defensive statistics.

kd_c_c <- eclust(DefendersPL[, c("assists_overall", "yellow_cards_overall")], 'kmeans', hc_metric = 'euclidean', k=3)

This chart shows that players in the first cluster provide many assists while collecting few yellow cards. These players are more engaged in attack than in defence, so their role can be described as “offensive fullbacks”. Meanwhile, players in the third cluster, with many yellow cards and no assists, are centre-backs who rarely take part in attacks.
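Interpretations like the one above can be checked numerically by looking at the cluster centres and at which players fall into each cluster. The sketch below uses base `kmeans()` on invented defender data. It assumes that the object returned by `eclust()` exposes the same `cluster` and `centers` components as a plain k-means result, which should be verified against the factoextra documentation before relying on it.

```r
set.seed(1)
# Toy defender data; names and numbers are invented for illustration.
defenders <- data.frame(
  assists      = c(8, 7, 9, 0, 1, 0),
  yellow_cards = c(1, 2, 1, 9, 8, 10),
  row.names    = c("Fullback A", "Fullback B", "Fullback C",
                   "Centreback A", "Centreback B", "Centreback C")
)

km <- kmeans(defenders, centers = 2, nstart = 25)

# Cluster centres: the mean of each variable within each cluster.
print(km$centers)

# Player names per cluster, taken from the row names.
members <- split(rownames(defenders), km$cluster)
print(members)

# Label of the cluster with the highest average number of assists.
attacking <- names(which.max(km$centers[, "assists"]))
print(members[[attacking]])
```

Because k-means numbers its clusters arbitrarily, identifying a cluster by its centre (here, the highest mean assists) is more robust than referring to it as "the first" or "the third" cluster.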