FIFA, also known as FIFA Football or FIFA Soccer, is a series of association football simulation video games released annually by Electronic Arts under the EA Sports label. Football video games such as Sensible Soccer, Kick Off and Match Day had been developed since the late 1980s and were already competitive in the games market when EA Sports announced a football game as the next addition to its label. The Guardian called the series "the slickest, most polished and by far the most popular football game around".
Source: Wikipedia
Data Source: Kaggle FIFA 20 Data Set
In professional football, it is not uncommon for a player to drop out of a team due to a contract transfer, contract expiration or medical reasons. Through our analysis, we will try to find suitable replacement players based on their skillset, strength and physical attributes. We will do this using cluster analysis, an unsupervised learning method. We will build two different models, one with K-means clustering and one with hierarchical clustering, and in the end compare the two to see which performs better.
We begin by loading the packages that will be required throughout the course of our analysis.
library(tidyverse)
library(ggfortify)
library(qdapTools)
library(caret)
library(tidyr)
library(corrplot)
library(ggpubr)
library(ggplot2)
library(PerformanceAnalytics)
library(factoextra)
library(fpc)
The data set was imported into RStudio using the code below.
fifa_data <- read.csv("fifa-20-complete-player-dataset/players_20.csv", stringsAsFactors = FALSE)
attach(fifa_data)  # attach the data frame so columns (e.g. player_positions) can be referenced by name
set.seed(123456)
Displaying the first 10 rows of the data:
head(fifa_data, 10)
Next, we use the str() function to examine the structure of the dataset.
str(fifa_data)
The dataset has 18278 observations and 104 variables. The dataset has both numeric and character variables.
glimpse(fifa_data)
Goalkeepers and outfield players have different sets of attributes. It therefore makes more sense to divide the whole dataset into two groups, outfield players and goalkeepers, so that each group is assigned only the attributes relevant to it and there are no null values. This will also help align our models with our problem statement.
data_gk <- subset(x=fifa_data, subset= player_positions == "GK", select = c(sofifa_id,short_name,age,height_cm,weight_kg,overall,potential,value_eur,wage_eur,player_positions,international_reputation,weak_foot,skill_moves,gk_diving,gk_handling,gk_kicking,gk_reflexes,gk_speed,gk_positioning,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes))
head(data_gk)
dim(data_gk)
glimpse(data_gk)
The goalkeeper dataset has 2036 observations and 53 variables.
data_outfield <- subset(x=fifa_data, subset= player_positions != "GK", select = c(sofifa_id,short_name,age,height_cm,weight_kg,overall,potential,value_eur,wage_eur,player_positions,international_reputation,weak_foot,skill_moves,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes))
head(data_outfield)
dim(data_outfield)
glimpse(data_outfield)
The Outfield dataset has 16242 observations and 53 variables.
Since we are using physical attributes, strength and skillset to find suitable replacement players, we will be filtering the datasets to retain only these variables.
The resulting datasets have a total of 47 variables each.
data_gk_int <- subset(data_gk,select = -c(short_name, player_positions, sofifa_id, value_eur, wage_eur, skill_moves))
glimpse(data_gk_int)
data_outfield_int <- subset(data_outfield, select = -c(short_name, player_positions, sofifa_id, value_eur, wage_eur, skill_moves))
glimpse(data_outfield_int)
Next, as part of data pre-processing, we use the scale() function to standardize the datasets. This function places continuous variables on unit scale by subtracting the mean of the variable and dividing the result by the variable's standard deviation (also sometimes called z-scoring, or simply scaling). The values in the transformed variable keep the same relationship to one another as in the untransformed variable, but the transformed variable has mean 0 and standard deviation 1.
data_gk_int <- scale(data_gk_int)
data_outfield_int <- scale(data_outfield_int)
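As a quick sanity check (a minimal sketch; the age column is chosen only for illustration), scale() matches manual z-scoring, and each scaled column should come out with mean 0 and standard deviation 1:
x <- data_gk$age
manual_z <- (x - mean(x)) / sd(x)              # z-score computed by hand
all.equal(as.numeric(scale(x)), manual_z)      # expected: TRUE
round(mean(data_gk_int[, "age"]), 10)          # expected: 0
sd(data_gk_int[, "age"])                       # expected: 1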
set.seed(123456)
A player may be listed at several positions, or at only one, as with goalkeepers. Note that when several positions are present, their names are separated by a comma and a space. What we can do here is encode, for each player, an indicator of each position the player occupies. This can be accomplished in a few steps.
In our case there are 15 distinct positions, so the resulting table will be N x 15.
print ("Players position")
## [1] "Players position"
head(player_positions)
## [1] "RW, CF, ST" "ST, LW" "LW, CAM" "GK" "LW, CF"
## [6] "CAM, CM"
player_positions <- gsub('\\s+', '', player_positions)  # remove white space from the position strings
players_role <- mtabulate(strsplit(player_positions, ","))  # one-hot encode player positions
print ("Number of entries (rows) and number of (columns) of the matrix")
## [1] "Number of entries (rows) and number of (columns) of the matrix"
dim (players_role)
## [1] 18278 15
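As a toy illustration of this one-hot step (the toy vector is hypothetical), mtabulate() applied to a split character vector produces one indicator column per position:
toy <- c("RW,CF", "GK", "CAM,CM")
mtabulate(strsplit(toy, ","))  # a 3 x 5 indicator table with columns CAM, CF, CM, GK, RW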
role_frequency <- t(sort(apply(players_role, 2, sum)))  # count how often each position appears
role_frequency <- gather(data.frame(role_frequency), Position, Count)  # reshape into long format for ggplot
role_frequency$Position <- reorder(role_frequency$Position, role_frequency$Count)  # reorder positions by ascending count
#barplot of players position frequency
ggplot(role_frequency, aes(x = Position, y = Count, fill = Position)) +
  geom_bar(stat = "identity", color = "black", fill = "#B03A2E") +
  coord_flip() +
  theme_minimal() +
  geom_text(aes(label = Count), hjust = 1, color = "#FDFEFE", size = 3.5) +
  ggtitle("Frequency of Positions Played") +
  xlab("Player Positions") +
  ylab("Count") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.ticks = element_blank())
To see how the values in the players_role table relate to one another, we can compute the correlation between its columns with the cor() function. It accepts only numeric columns, so make sure none of the columns contain strings.
We can plot the result with the corrplot() function from the corrplot package, which accepts various arguments. The corrplot:: prefix specifies the package namespace; this is helpful when two packages export different functions that share the same name.
role_cor <- cor(players_role) #correlation between players role
corrplot::corrplot(role_cor, hclust.method = "ward.D", order = "hclust", method = "square", diag = F, type = "lower" ) #correlation plot
We can see that central midfielders (CM) often also play as central defensive midfielders (CDM), and somewhat less often as central attacking midfielders (CAM). The left and right back positions are clearly positively correlated, as are the left and right midfield positions.
The striker position (ST) is negatively correlated with CM (central midfielder), CDM (central defensive midfielder), CB (centre back) and RB (right back), and a little less with LB (left back). Clearly, players in forward positions rarely play in defence.
data_outfield_Summary <- subset(x=data_outfield, select = c(age,height_cm,weight_kg, overall, potential,pace,shooting,passing,dribbling,defending,physic))
corr1 <- cor(data_outfield_Summary)
corrplot::corrplot(corr1, hclust.method = "ward.D", order = "hclust", method = "square",diag = F, type = "lower" ) #correlation plot for Outfield Players
#chart.Correlation(data_outfield_Summary, histogram=TRUE, pch=19)
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
More Information can be found on Wikipedia
K-Means Clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
More Information can be found on Wikipedia
In order to use the k-means method and plot the results, we can use the kmeans() function in R. It groups the data into a specified number of clusters (here centers = 2, 3, 4, 5). The nstart option allows the algorithm to attempt multiple initial configurations and report the best one.
To view the results, we can use the fviz_cluster() function, which provides a nice graph of the clusters. Since we usually have more than two dimensions (variables), fviz_cluster() performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.
gk2 <- kmeans(data_gk_int, centers = 2, nstart = 25)
gk3 <- kmeans(data_gk_int, centers = 3, nstart = 25)
gk4 <- kmeans(data_gk_int, centers = 4, nstart = 25)
gk5 <- kmeans(data_gk_int, centers = 5, nstart = 25)
# plots to compare
gkp1 <- fviz_cluster(gk2, geom = "point", data = data_gk_int) + ggtitle("k = 2")
gkp2 <- fviz_cluster(gk3, geom = "point", data = data_gk_int) + ggtitle("k = 3")
gkp3 <- fviz_cluster(gk4, geom = "point", data = data_gk_int) + ggtitle("k = 4")
gkp4 <- fviz_cluster(gk5, geom = "point", data = data_gk_int) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(gkp1, gkp2, gkp3, gkp4, nrow = 2)
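Since fviz_cluster() projects the data onto the first two principal components, it is worth checking how much variance those two components actually capture. A quick check with base R's prcomp() (the object name gk_pca is ours):
gk_pca <- prcomp(data_gk_int)
summary(gk_pca)$importance[, 1:2]  # standard deviation, proportion and cumulative variance for PC1-PC2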
To determine the number of clusters, we use a simple within-group sum of squares (elbow) method. The idea is that, if the plot is an arm, the elbow of the arm marks the optimal number of clusters, which is 3 in our case.
# Determine number of clusters
wss <- (nrow(data_gk_int) - 1) * sum(apply(data_gk_int, 2, var))
for (i in 2:12) wss[i] <- sum(kmeans(data_gk_int, centers = i)$withinss)
plot(1:12, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
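As an optional cross-check (factoextra is already loaded), fviz_nbclust() produces the same elbow curve in a single call:
fviz_nbclust(data_gk_int, kmeans, method = "wss", k.max = 12)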
library(fpc)
plotcluster(data_gk_int, gk3$cluster)
Next, we use the prediction strength with a cutoff value of 0.8. This recommends the largest number of clusters that leads to a prediction strength above 0.8, which is again 3 in our case.
prediction.strength(data_gk_int, Gmin=2, Gmax=15, M=10,cutoff=0.8)
## Prediction strength
## Clustering method: kmeans
## Maximum number of clusters: 15
## Resampled data sets: 10
## Mean pred.str. for numbers of clusters: 1 0.9465497 0.8432161 0.5933725 0.5177097 0.4696646 0.4155968 0.337572 0.3162195 0.3031324 0.303777 0.2817019 0.2539696 0.233146 0.2034054
## Cutoff value: 0.8
## Largest number of clusters better than cutoff: 3
#use the output of kmeans
gk3$centers
## age height_cm weight_kg overall potential
## 1 0.2722008 0.02397076 0.07642374 -0.07399651 -0.3579285
## 2 -0.8056819 -0.12273409 -0.30715291 -0.85725102 -0.2796168
## 3 0.6092431 0.12395932 0.28095721 1.29565980 1.0169297
## international_reputation weak_foot gk_diving gk_handling gk_kicking
## 1 -0.2350217 -0.06479749 -0.09311223 -0.1257872 -0.1234589
## 2 -0.2444925 -0.11650054 -0.79456140 -0.7315079 -0.6834791
## 3 0.7504065 0.27354667 1.24457994 1.2171482 1.1477927
## gk_reflexes gk_speed gk_positioning attacking_crossing
## 1 -0.08862617 0.3399101 -0.0388205 0.2646941
## 2 -0.79693478 -0.9586043 -0.8238856 -0.4995433
## 3 1.23981494 0.6963138 1.1877283 0.2069570
## attacking_finishing attacking_heading_accuracy attacking_short_passing
## 1 0.5003257 0.2688590 0.09051013
## 2 -0.9545552 -0.4840682 -0.49828966
## 3 0.4051983 0.1785306 0.51538745
## attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## 1 0.5142830 0.3294312 0.2206243 0.2378758
## 2 -0.9583390 -0.7315953 -0.4934196 -0.4716182
## 3 0.3854848 0.4067560 0.2771085 0.2167922
## skill_long_passing skill_ball_control movement_acceleration
## 1 0.00199558 0.2531884 0.3124755
## 2 -0.39767486 -0.7311943 -0.9113362
## 3 0.53637942 0.5419610 0.6809839
## movement_sprint_speed movement_agility movement_reactions
## 1 0.3375020 0.1554552 0.06137424
## 2 -0.9280013 -0.6517950 -0.87298190
## 3 0.6590510 0.6081713 1.07599202
## movement_balance power_shot_power power_jumping power_stamina
## 1 0.2324551 -0.1232354 0.02396498 0.2473939
## 2 -0.5164536 -0.6812621 -0.56030709 -0.7255783
## 3 0.2873176 1.1443847 0.71807279 0.5446529
## power_strength power_long_shots mentality_aggression
## 1 0.07560664 0.5100448 0.09599788
## 2 -0.49088483 -0.9914499 -0.47999630
## 3 0.53186919 0.4379863 0.48077923
## mentality_interceptions mentality_positioning mentality_vision
## 1 0.3898286 0.4477818 -0.1791500
## 2 -0.9282583 -0.9424443 -0.2905230
## 3 0.5662330 0.4823089 0.7134243
## mentality_penalties mentality_composure defending_marking
## 1 0.2483565 0.0386662 0.2111671
## 2 -0.6792757 -0.6779313 -0.5391643
## 3 0.4800728 0.8515986 0.3560557
## defending_standing_tackle defending_sliding_tackle goalkeeping_diving
## 1 0.2964850 0.3273071 -0.09311223
## 2 -0.5304807 -0.5343752 -0.79456140
## 3 0.1923582 0.1427675 1.24457994
## goalkeeping_handling goalkeeping_kicking goalkeeping_positioning
## 1 -0.1257872 -0.1234589 -0.0388205
## 2 -0.7315079 -0.6834791 -0.8238856
## 3 1.2171482 1.1477927 1.1877283
## goalkeeping_reflexes
## 1 -0.08862617
## 2 -0.79693478
## 3 1.23981494
Next, we perform cluster analysis of the outfield players using the kmeans() function with centers values of 2, 3, 4 and 5.
ok2 <- kmeans(data_outfield_int, centers = 2, nstart = 25)
ok3 <- kmeans(data_outfield_int, centers = 3, nstart = 25)
ok4 <- kmeans(data_outfield_int, centers = 4, nstart = 25)
ok5 <- kmeans(data_outfield_int, centers = 5, nstart = 25)
# plots to compare
op1 <- fviz_cluster(ok2, geom = "point", data = data_outfield_int) + ggtitle("k = 2")
op2 <- fviz_cluster(ok3, geom = "point", data = data_outfield_int) + ggtitle("k = 3")
op3 <- fviz_cluster(ok4, geom = "point", data = data_outfield_int) + ggtitle("k = 4")
op4 <- fviz_cluster(ok5, geom = "point", data = data_outfield_int) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(op1, op2, op3, op4, nrow = 2)
Using the within-group sum of squares (elbow) method, we again find the optimal number of clusters to be 3.
# Determine number of clusters
wss_o <- (nrow(data_outfield_int) - 1) * sum(apply(data_outfield_int, 2, var))
for (i in 2:12) wss_o[i] <- sum(kmeans(data_outfield_int, centers = i)$withinss)
plot(1:12, wss_o, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
library(fpc)
plotcluster(data_outfield_int, ok3$cluster)
Using the prediction strength function with a cutoff value of 0.8, we again find the largest number of clusters to be 3.
prediction.strength(data_outfield_int, Gmin=2, Gmax=15, M=10,cutoff=0.8)
## Prediction strength
## Clustering method: kmeans
## Maximum number of clusters: 15
## Resampled data sets: 10
## Mean pred.str. for numbers of clusters: 1 0.9557498 0.9313265 0.7640775 0.5269017 0.628195 0.4823138 0.6050884 0.4981196 0.4508408 0.3644348 0.3822084 0.334616 0.339305 0.2876533
## Cutoff value: 0.8
## Largest number of clusters better than cutoff: 3
#use the output of kmeans
ok3$centers
## age height_cm weight_kg overall potential
## 1 -0.08568948 0.5346909 0.41946219 -0.3798785 -0.3049445
## 2 -0.42969900 -0.3353175 -0.37083635 -0.5326929 -0.2225367
## 3 0.45351852 -0.1524315 -0.02391674 0.7925551 0.4541936
## international_reputation weak_foot pace shooting passing
## 1 -0.1971186 -0.37270635 -0.6383253 -1.1528294 -0.8197510
## 2 -0.2618142 0.04421257 0.3437641 0.3439705 -0.2548250
## 3 0.3982977 0.27446737 0.2321643 0.6650429 0.9161684
## dribbling defending physic attacking_crossing attacking_finishing
## 1 -1.0279982 0.6420237 0.3122345 -0.7007539 -1.1229190
## 2 0.1232610 -1.0483196 -0.7949969 -0.1831150 0.4803348
## 3 0.7558693 0.3900735 0.4428148 0.7523592 0.5188326
## attacking_heading_accuracy attacking_short_passing attacking_volleys
## 1 0.2615464 -0.5857615 -1.0113119
## 2 -0.5048306 -0.3728760 0.2651309
## 3 0.2279329 0.8239977 0.6159031
## skill_dribbling skill_curve skill_fk_accuracy skill_long_passing
## 1 -1.0238832 -0.94549696 -0.80054354 -0.4196781
## 2 0.1833426 0.01184035 -0.08322047 -0.4657584
## 3 0.6990784 0.78532239 0.74768830 0.7666449
## skill_ball_control movement_acceleration movement_sprint_speed
## 1 -0.89994868 -0.6525584 -0.5904491
## 2 -0.05558207 0.3472162 0.3219244
## 3 0.80682699 0.2410804 0.2112510
## movement_agility movement_reactions movement_balance power_shot_power
## 1 -0.8136287 -0.3707109 -0.6576659 -0.92999730
## 2 0.2986811 -0.5433700 0.2830223 0.08213972
## 3 0.4197332 0.7943154 0.3023568 0.70987977
## power_jumping power_stamina power_strength power_long_shots
## 1 0.1684715 -0.1782467 0.4084593 -1.0598994
## 2 -0.3729318 -0.5034700 -0.5813399 0.1597833
## 3 0.1892038 0.5969024 0.1721837 0.7503042
## mentality_aggression mentality_interceptions mentality_positioning
## 1 0.3103383 0.5856178 -1.0830421
## 2 -0.8551033 -1.0325543 0.2811159
## 3 0.4977601 0.4235577 0.6620911
## mentality_vision mentality_penalties mentality_composure
## 1 -0.97396117 -0.9072036 -0.5247257
## 2 0.01122459 0.2717303 -0.3940798
## 3 0.80982746 0.5224170 0.7914436
## defending_marking defending_standing_tackle defending_sliding_tackle
## 1 0.5870453 0.6466842 0.6604607
## 2 -0.9742263 -1.0114443 -0.9716489
## 3 0.3705854 0.3534209 0.3065035
## goalkeeping_diving goalkeeping_handling goalkeeping_kicking
## 1 -0.01851878 -0.01831813 -0.04045735
## 2 -0.04997539 -0.05347249 -0.04340598
## 3 0.05994464 0.06287972 0.07257961
## goalkeeping_positioning goalkeeping_reflexes
## 1 -0.02397761 -0.03111060
## 2 -0.05132719 -0.03430055
## 3 0.06573921 0.05663056
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: a "bottom-up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: a "top-down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
More Information can be found on Wikipedia
#Ward's Method (hierarchical clustering)
#Calculate the distance matrix
gk.dist = dist(data_gk_int)
#Obtain clusters using Ward's method ("ward" was renamed to "ward.D" in recent versions of R)
gk.hclust = hclust(gk.dist, method = "ward.D")
plot(gk.hclust)
rect.hclust(gk.hclust, h = 500)  # reuse the fitted tree rather than recomputing it
#Cut dendrogram at the 3 clusters level and obtain cluster membership
gk.3clust = cutree(gk.hclust,k=3)
#See exactly which items are in the third group
data_gk_int[gk.3clust==3,]
#Centroid Plot against 1st 2 discriminant functions
#Load the fpc library needed for plotcluster function
library(fpc)
plotcluster(data_gk_int, gk.3clust)
#Ward's Method (hierarchical clustering)
#Calculate the distance matrix
outfield.dist = dist(data_outfield_int)
#Obtain clusters using Ward's method
outfield.hclust = hclust(outfield.dist, method = "ward.D")
plot(outfield.hclust)
#Cut dendrogram at the 3 clusters level and obtain cluster membership
outfield.3clust = cutree(outfield.hclust,k=3)
#See exactly which items are in the third group
data_outfield_int[outfield.3clust==3,]
#Centroid Plot against 1st 2 discriminant functions
#Load the fpc library needed for plotcluster function
library(fpc)
plotcluster(data_outfield_int, outfield.3clust)
Hierarchical clustering builds clusters within clusters, and does not require a pre-specified number of clusters the way k-means does. Now we discuss this difference in detail. The most important difference is the hierarchy. Actually, there are two different approaches that fall under this name: top-down and bottom-up.
In top-down hierarchical clustering, we divide the data into 2 clusters (using k-means with k = 2, for example). Then, for each cluster, we can repeat this process, until all the clusters are too small or too similar for further clustering to make sense, or until we reach a preset number of clusters.
In bottom-up hierarchical clustering, we start with each data item having its own cluster. We then look for the two items that are most similar and combine them in a larger cluster. We keep repeating until all the clusters we have left are too dissimilar to be gathered together, or until we reach a preset number of clusters.
In k-means clustering, we try to identify the best way to divide the data into k sets simultaneously. A good approach is to take k items from the data set as initial cluster representatives, assign all items to the cluster whose representative is closest, and then recalculate the cluster means as new representatives, repeating until the assignment converges (all clusters stay the same).
Reference: StepupAnalytics.com
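To make the k-means procedure described above concrete, here is a minimal sketch of that iteration in plain R (illustrative only; the helper name simple_kmeans is ours, and base R's kmeans() adds safeguards such as empty-cluster handling and multiple restarts):
simple_kmeans <- function(X, k, iters = 25) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # k items as initial representatives
  assignment <- rep(0L, nrow(X))
  for (it in seq_len(iters)) {
    # squared Euclidean distance from every item to every representative
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    new_assignment <- max.col(-d)                     # index of the nearest representative
    if (all(new_assignment == assignment)) break      # converged: all clusters stay the same
    assignment <- new_assignment
    # recompute each cluster mean as its new representative
    centers <- apply(X, 2, function(col) tapply(col, assignment, mean))
  }
  list(cluster = assignment, centers = centers)
}
# e.g. simple_kmeans(data_gk_int, k = 3)$centers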
# Apply cutree() to gk.hclust : gk.cut
gk.cut <- cutree(gk.hclust , k = 3)
# Compare methods
table(gk3$cluster, gk.cut)
## gk.cut
## 1 2 3
## 1 27 800 49
## 2 0 55 613
## 3 390 83 19
# Apply cutree() to outfield.hclust : outfield.cut
outfield.cut <- cutree(outfield.hclust , k = 3)
# Compare methods
table(ok3$cluster, outfield.cut)
## outfield.cut
## 1 2 3
## 1 372 2479 2158
## 2 2552 1 2729
## 3 5652 108 191
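As an optional numeric summary of the agreement shown in the two cross-tabulations above, fpc's cluster.stats() reports a corrected (adjusted) Rand index between two partitions: 1 means identical clusterings, and values near 0 mean chance-level agreement (a sketch; the object names are ours):
# Corrected Rand index between the k-means and hierarchical partitions
gk_agreement <- cluster.stats(gk.dist, gk3$cluster, alt.clustering = gk.cut)
gk_agreement$corrected.rand
outfield_agreement <- cluster.stats(outfield.dist, ok3$cluster, alt.clustering = outfield.cut)
outfield_agreement$corrected.rand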
In professional football, it is not uncommon for a player to drop out of a team due to a contract transfer, contract expiration or medical reasons. Through our analysis, we tried to find suitable replacement players based on their skillset, strength and physical attributes. We performed cluster analysis, an unsupervised learning method, building two different models using K-means clustering and hierarchical clustering, and in the end compared the two to see which performs better.
The methodology involved cleaning the dataset and then splitting it into two parts - Goalkeepers and Outfield Players. Then, taking only the physical attributes, strength and skillset into consideration, we performed clustering analysis - K-means cluster analysis and hierarchical analysis - on both datasets separately, grouping players of a similar physical type together and making it easier for team management to find a suitable replacement in case of a contract transfer, contract expiration or medical reasons.
This will be useful for team management as it will narrow down their search for a suitable replacement.