This article uses dimension reduction techniques as a preliminary step before cluster analysis. The goal is to estimate the intrinsic dimension of a data set of footballers. The data set initially has more than 50 variables, but many of them appear to be correlated, so we examine how many dimensions are really needed to explain most of the variance.
The data set is available on Kaggle: https://www.kaggle.com/datasets/maso0dahmed/football-players-data
This comprehensive dataset offers detailed information on approximately
17,000 FIFA football players, meticulously scraped from SoFIFA.com. It
encompasses a wide array of player-specific data points, including but
not limited to player names, nationalities, clubs, player ratings,
potential, positions, ages, and various skill attributes. This dataset
is ideal for football enthusiasts, data analysts, and researchers
seeking to conduct in-depth analysis, statistical studies, or machine
learning projects related to football players’ performance,
characteristics, and career progressions.
The variable names are self-explanatory, so I list them without
further description.
List of variables: name, full_name, birth_date, age,
height_cm, weight_kgs, positions, nationality, overall_rating,
potential, value_euro, wage_euro, preferred_foot,
international_reputation(1-5), weak_foot(1-5), skill_moves(1-5),
body_type, release_clause_euro, national_team, national_rating,
national_team_position, national_jersey_number, crossing, finishing,
heading_accuracy, short_passing, volleys, dribbling, curve,
freekick_accuracy, long_passing, ball_control, acceleration,
sprint_speed, agility, reactions, balance, shot_power, jumping, stamina,
strength, long_shots, aggression, interceptions, positioning, vision,
penalties, composure, marking, standing_tackle and sliding_tackle
PCA is a linear dimension reduction technique. It performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. The method is based on the covariance matrix of the data: the eigenvectors that correspond to its largest eigenvalues (the principal components) capture a large fraction of the variance of the original data.
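The link between PCA and the covariance matrix can be checked on a small example. The sketch below is only an illustration: it uses the built-in USArrests data set (not the footballers data) and compares an eigendecomposition done "by hand" with the result of prcomp().

```r
# Toy data: the built-in USArrests data set (50 states, 4 numeric variables)
X <- scale(USArrests)                  # centre and scale each column

# PCA "by hand": eigendecomposition of the covariance matrix
eig <- eigen(cov(X))
explained <- eig$values / sum(eig$values)   # share of variance per component

# The same decomposition through prcomp() (data are already standardized)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Eigenvalues equal the squared standard deviations of the components
all.equal(eig$values, pca$sdev^2)

# Eigenvectors match the loadings up to an arbitrary sign flip
all.equal(abs(eig$vectors), abs(unname(pca$rotation)))
```

Both comparisons should agree up to numerical tolerance, which shows that prcomp() and the eigendecomposition describe the same components.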
t-SNE is a statistical method for visualizing high-dimensional data. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The interpretation is simple: points that are close to each other in the transformed space are likely to be similar in the original space.
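To give a feeling for what "similar objects" means here, the sketch below computes the Gaussian neighbour probabilities that t-SNE starts from in the high-dimensional space. It is a simplified illustration on three made-up points: the bandwidth sigma is fixed by assumption, whereas the actual algorithm tunes it for every point from the perplexity setting.

```r
# Three toy points in 3-D: a and b are close, c is far away
X <- rbind(a = c(0, 0, 0),
           b = c(1, 0, 0),
           c = c(5, 5, 5))

sigma <- 2                        # assumed fixed bandwidth (illustration only)
D2 <- as.matrix(dist(X))^2        # squared Euclidean distances

# Conditional similarities p(j | i): Gaussian kernel, a point is not its own neighbour
K <- exp(-D2 / (2 * sigma^2))
diag(K) <- 0
P <- K / rowSums(K)               # each row is a probability distribution

round(P, 3)
```

In the row for point a, almost all probability mass falls on b, the nearby point. The embedding step then looks for low-dimensional coordinates whose neighbour probabilities resemble these.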
Football is the most widely followed sport. To support its
development, various tools are used to describe the statistical events
that occur during high-level competitions. They help to
understand and identify the most important features related to
the success or failure of the observed team.
The article “In-game behaviour analysis of football players using machine learning techniques based on player statistics” used machine learning algorithms to determine the on-field playing positions of a group of football players based on their technical-tactical behaviour. To visualize the relationships between the points in a two-dimensional projection, the unsupervised t-SNE method was used. This dimension reduction technique revealed some grouping (4 groups). From the t-SNE plot it was observed, for example, that the central defenders are the farthest from the strikers, the wingers are the closest to the strikers, and the full-backs are placed among the central defenders, midfielders and wingers.
There is also important research on techniques that can be applied to athletic training and that can be useful in any organization that trains large numbers of athletes. In general, with the increasing use of wearable technology there are more parameters that can be used in K-means clustering. It might be a tool to create informed training groups that make the best use of the strength coaches' time and the athletes' training (Reuben et al., 2020). Wearable technology will give huge opportunities to coaches and staff. In a utopian scenario for football managers and their analysts, the data taken from players will be available live, which means that coaches will no longer have to rely on gut instinct alone. This will make it possible to maximize team utility and take the coach's role to a higher level (Creasey, 2015).
# Set the seed to make the code repeatable
set.seed(123)
# Libraries
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(stats)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.2
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(Rtsne)
## Warning: package 'Rtsne' was built under R version 4.3.2
# Read a data set
players <- read.csv("C:/Users/User/Desktop/Dimension Reduction/fifa_players.csv")
str(players)
## 'data.frame': 17954 obs. of 51 variables:
## $ name : chr "L. Messi" "C. Eriksen" "P. Pogba" "L. Insigne" ...
## $ full_name : chr "Lionel Andrés Messi Cuccittini" "Christian Dannemann Eriksen" "Paul Pogba" "Lorenzo Insigne" ...
## $ birth_date : chr "6/24/1987" "2/14/1992" "3/15/1993" "6/4/1991" ...
## $ age : int 31 27 25 27 27 27 20 30 32 32 ...
## $ height_cm : num 170 155 190 163 188 ...
## $ weight_kgs : num 72.1 76.2 83.9 59 88.9 92.1 73 69.9 92.1 77.1 ...
## $ positions : chr "CF,RW,ST" "CAM,RM,CM" "CM,CAM" "LW,ST" ...
## $ nationality : chr "Argentina" "Denmark" "France" "Italy" ...
## $ overall_rating : int 94 88 88 88 88 88 88 89 89 89 ...
## $ potential : int 94 89 91 88 91 90 95 89 89 89 ...
## $ value_euro : int 110500000 69500000 73000000 62000000 60000000 59500000 81000000 64500000 38000000 60000000 ...
## $ wage_euro : int 565000 205000 255000 165000 135000 215000 100000 300000 130000 200000 ...
## $ preferred_foot : chr "Left" "Right" "Right" "Right" ...
## $ international_reputation.1.5.: int 5 3 4 3 3 3 3 4 5 4 ...
## $ weak_foot.1.5. : int 4 5 4 4 3 3 4 4 4 4 ...
## $ skill_moves.1.5. : int 4 4 5 4 2 2 5 4 1 3 ...
## $ body_type : chr "Messi" "Lean" "Normal" "Normal" ...
## $ release_clause_euro : int 226500000 133800000 144200000 105400000 106500000 114500000 166100000 119300000 62700000 111000000 ...
## $ national_team : chr "Argentina" "Denmark" "France" "Italy" ...
## $ national_rating : int 82 78 84 83 NA 81 84 82 85 81 ...
## $ national_team_position : chr "RF" "CAM" "RDM" "LW" ...
## $ national_jersey_number : int 10 10 6 10 NA 4 10 11 1 21 ...
## $ crossing : int 86 88 80 86 30 53 77 70 15 70 ...
## $ finishing : int 95 81 75 77 22 52 88 93 13 89 ...
## $ heading_accuracy : int 70 52 75 56 83 83 77 77 25 89 ...
## $ short_passing : int 92 91 86 85 68 79 82 81 55 78 ...
## $ volleys : int 86 80 85 74 14 45 78 85 11 90 ...
## $ dribbling : int 97 84 87 90 69 70 90 89 30 80 ...
## $ curve : int 93 86 85 87 28 60 77 82 14 77 ...
## $ freekick_accuracy : int 94 87 82 77 28 70 63 73 11 76 ...
## $ long_passing : int 89 89 90 78 60 81 73 64 59 52 ...
## $ ball_control : int 96 91 90 93 63 76 91 89 46 82 ...
## $ acceleration : int 91 76 71 94 70 74 96 88 54 75 ...
## $ sprint_speed : int 86 73 79 86 75 77 96 80 60 76 ...
## $ agility : int 93 80 76 94 50 61 92 86 51 77 ...
## $ reactions : int 95 88 82 83 82 87 87 90 84 91 ...
## $ balance : int 95 81 66 93 40 49 83 91 35 59 ...
## $ shot_power : int 85 84 90 75 55 81 79 88 25 87 ...
## $ jumping : int 68 50 83 53 81 88 75 81 77 88 ...
## $ stamina : int 72 92 88 75 75 75 83 76 43 92 ...
## $ strength : int 66 58 87 44 94 92 71 73 80 78 ...
## $ long_shots : int 94 89 82 84 15 64 78 83 16 79 ...
## $ aggression : int 48 46 78 34 87 82 62 65 29 84 ...
## $ interceptions : int 22 56 64 26 88 88 38 24 30 48 ...
## $ positioning : int 94 84 82 83 24 41 88 92 12 93 ...
## $ vision : int 94 91 88 87 49 60 82 83 70 77 ...
## $ penalties : int 75 67 82 61 33 62 70 83 47 85 ...
## $ composure : int 96 88 87 83 80 87 86 90 70 82 ...
## $ marking : int 33 59 63 51 91 90 34 30 17 52 ...
## $ standing_tackle : int 28 57 67 24 88 89 34 20 10 45 ...
## $ sliding_tackle : int 26 22 67 22 87 84 32 12 11 39 ...
Firstly, we will check if there are any missing values.
# Number of observations in the data set
dim(players)[1]
## [1] 17954
# Number of missing values per column
sort(colSums(is.na(players)), decreasing = T)
## national_rating national_jersey_number
## 17097 17097
## release_clause_euro value_euro
## 1837 255
## wage_euro name
## 246 0
## full_name birth_date
## 0 0
## age height_cm
## 0 0
## weight_kgs positions
## 0 0
## nationality overall_rating
## 0 0
## potential preferred_foot
## 0 0
## international_reputation.1.5. weak_foot.1.5.
## 0 0
## skill_moves.1.5. body_type
## 0 0
## national_team national_team_position
## 0 0
## crossing finishing
## 0 0
## heading_accuracy short_passing
## 0 0
## volleys dribbling
## 0 0
## curve freekick_accuracy
## 0 0
## long_passing ball_control
## 0 0
## acceleration sprint_speed
## 0 0
## agility reactions
## 0 0
## balance shot_power
## 0 0
## jumping stamina
## 0 0
## strength long_shots
## 0 0
## aggression interceptions
## 0 0
## positioning vision
## 0 0
## penalties composure
## 0 0
## marking standing_tackle
## 0 0
## sliding_tackle
## 0
As we can see, a few columns contain missing values. The columns national_rating and national_jersey_number contain 17097 NAs out of 17954 records, so we drop them both. The remaining columns with missing values are release_clause_euro, value_euro and wage_euro. These have far fewer missing values, so we impute them with the column mean.
# Data set without columns national_rating and national_jersey_number
players <- subset(players, select = -c(national_rating, national_jersey_number))
# Now 3 columns with missing values are left: release_clause_euro, value_euro, wage_euro
columns_to_input_NA = c("release_clause_euro", "value_euro", "wage_euro")
# Loop to impute missing values with the column mean
for (column in columns_to_input_NA){
players[,column] <- ifelse(is.na(players[,column]), mean(players[,column], na.rm = T),
players[,column])
}
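Mean imputation can also be written without a loop, and for strongly skewed variables such as wages the median is a common alternative. The sketch below uses a made-up wage column (the values are assumptions for illustration, not taken from the data set) to show how differently the two choices behave when one value is extreme.

```r
# Toy wage column with one missing value and one extreme value
wages <- data.frame(wage_euro = c(1000, 2000, 3000, NA, 500000))

# Mean imputation, written without an explicit loop
mean_imputed <- wages$wage_euro
mean_imputed[is.na(mean_imputed)] <- mean(wages$wage_euro, na.rm = TRUE)

# Median imputation: less sensitive to the extreme value
median_imputed <- wages$wage_euro
median_imputed[is.na(median_imputed)] <- median(wages$wage_euro, na.rm = TRUE)

mean_imputed[4]     # 126500
median_imputed[4]   # 2500
```

Whether the mean or the median is preferable depends on the analysis. This article keeps the mean, as stated above.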
Now we visualize some variables to get familiar with the data
set.
The distribution of the variable age is right-skewed (there are more
extreme values in the right tail). We can also see that the largest
share of players in our data set is 21-27 years old.
# Distribution of the variable age
ggplot(data = players, aes(x = age)) +
geom_histogram(bins = 30, colour = "black", fill = "lightblue")
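Moving on to the variable height_cm, the distribution shows that the largest shares of players have a height of 181-186 cm or 151-156 cm. The second group is implausibly short for professional players (for example, C. Eriksen is listed at 155 cm), which probably points to a data-quality problem in the source rather than a real pattern.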
# Distribution of the variable height_cm
ggplot(data = players, aes(x = height_cm)) +
geom_histogram(bins = 10, colour = "black", fill = "lightblue")
# How many players prefer the left foot compared to the right?
ggplot(data = players, aes(x = preferred_foot)) +
geom_bar(fill = c("grey", "lightblue"))
In this step we prepare the data set for dimension reduction. To do so, we extract the numeric columns and standardize them. Standardization is important before PCA: to give every feature “a chance”, the data need to be scaled so that all features have equal variance.
# Preparing data for dimension reduction
numeric_columns = c()
for(colname in colnames(players)){
if (!is.character(players[, colname])){
numeric_columns <- append(numeric_columns, colname)
}
}
# Extracting numeric variables from the data set
players_numeric <- players[, numeric_columns]
# Statistical glimpse at the numeric data
summary(players_numeric)
## age height_cm weight_kgs overall_rating
## Min. :17.00 Min. :152.4 Min. : 49.9 Min. :47.00
## 1st Qu.:22.00 1st Qu.:154.9 1st Qu.: 69.9 1st Qu.:62.00
## Median :25.00 Median :175.3 Median : 74.8 Median :66.00
## Mean :25.57 Mean :174.9 Mean : 75.3 Mean :66.24
## 3rd Qu.:29.00 3rd Qu.:185.4 3rd Qu.: 79.8 3rd Qu.:71.00
## Max. :46.00 Max. :205.7 Max. :110.2 Max. :94.00
## potential value_euro wage_euro
## Min. :48.00 Min. : 10000 Min. : 1000
## 1st Qu.:67.00 1st Qu.: 325000 1st Qu.: 1000
## Median :71.00 Median : 725000 Median : 3000
## Mean :71.43 Mean : 2479280 Mean : 9902
## 3rd Qu.:75.00 3rd Qu.: 2300000 3rd Qu.: 9902
## Max. :95.00 Max. :110500000 Max. :565000
## international_reputation.1.5. weak_foot.1.5. skill_moves.1.5.
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :2.000
## Mean :1.112 Mean :2.946 Mean :2.361
## 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000
## release_clause_euro crossing finishing heading_accuracy
## Min. : 13000 Min. : 5.0 Min. : 2.00 Min. : 4.00
## 1st Qu.: 581000 1st Qu.:38.0 1st Qu.:30.00 1st Qu.:44.00
## Median : 1400000 Median :54.0 Median :49.00 Median :56.00
## Mean : 4622522 Mean :49.7 Mean :45.36 Mean :52.15
## 3rd Qu.: 4622522 3rd Qu.:64.0 3rd Qu.:62.00 3rd Qu.:64.00
## Max. :226500000 Max. :93.0 Max. :95.00 Max. :94.00
## short_passing volleys dribbling curve
## Min. : 7.00 Min. : 3.00 Min. : 4.00 Min. : 6.0
## 1st Qu.:53.00 1st Qu.:30.00 1st Qu.:49.00 1st Qu.:34.0
## Median :62.00 Median :44.00 Median :61.00 Median :49.0
## Mean :58.57 Mean :42.76 Mean :55.28 Mean :47.1
## 3rd Qu.:68.00 3rd Qu.:57.00 3rd Qu.:68.00 3rd Qu.:62.0
## Max. :93.00 Max. :90.00 Max. :97.00 Max. :94.0
## freekick_accuracy long_passing ball_control acceleration
## Min. : 3.00 Min. : 9.00 Min. : 5.00 Min. :12.0
## 1st Qu.:30.00 1st Qu.:43.00 1st Qu.:54.00 1st Qu.:57.0
## Median :41.00 Median :56.00 Median :63.00 Median :67.0
## Mean :42.69 Mean :52.67 Mean :58.22 Mean :64.7
## 3rd Qu.:56.00 3rd Qu.:64.00 3rd Qu.:69.00 3rd Qu.:75.0
## Max. :94.00 Max. :93.00 Max. :96.00 Max. :97.0
## sprint_speed agility reactions balance shot_power
## Min. :12.0 Min. :11.00 Min. :24.00 Min. :16.00 Min. : 2.00
## 1st Qu.:58.0 1st Qu.:55.00 1st Qu.:56.00 1st Qu.:56.00 1st Qu.:45.00
## Median :67.0 Median :66.00 Median :62.00 Median :66.00 Median :59.00
## Mean :64.8 Mean :63.38 Mean :61.82 Mean :63.87 Mean :55.32
## 3rd Qu.:75.0 3rd Qu.:74.00 3rd Qu.:68.00 3rd Qu.:74.00 3rd Qu.:68.00
## Max. :96.0 Max. :96.00 Max. :96.00 Max. :96.00 Max. :95.00
## jumping stamina strength long_shots
## Min. :15.00 Min. :12.00 Min. :20.00 Min. : 3.00
## 1st Qu.:58.00 1st Qu.:56.00 1st Qu.:58.00 1st Qu.:32.00
## Median :66.00 Median :66.00 Median :66.00 Median :51.00
## Mean :64.96 Mean :63.13 Mean :65.16 Mean :46.85
## 3rd Qu.:73.00 3rd Qu.:74.00 3rd Qu.:74.00 3rd Qu.:62.00
## Max. :95.00 Max. :97.00 Max. :97.00 Max. :94.00
## aggression interceptions positioning vision
## Min. :11.00 Min. : 3.00 Min. : 2.00 Min. :10.00
## 1st Qu.:44.00 1st Qu.:26.00 1st Qu.:38.00 1st Qu.:44.00
## Median :59.00 Median :52.00 Median :55.00 Median :55.00
## Mean :55.82 Mean :46.66 Mean :49.86 Mean :53.41
## 3rd Qu.:69.00 3rd Qu.:64.00 3rd Qu.:64.00 3rd Qu.:64.00
## Max. :95.00 Max. :92.00 Max. :95.00 Max. :94.00
## penalties composure marking standing_tackle
## Min. : 5.00 Min. :12.00 Min. : 3.00 Min. : 2.00
## 1st Qu.:38.00 1st Qu.:51.00 1st Qu.:30.00 1st Qu.:27.00
## Median :49.00 Median :60.00 Median :52.50 Median :55.00
## Mean :48.36 Mean :58.68 Mean :47.16 Mean :47.73
## 3rd Qu.:60.00 3rd Qu.:67.00 3rd Qu.:64.00 3rd Qu.:66.00
## Max. :92.00 Max. :96.00 Max. :94.00 Max. :93.00
## sliding_tackle
## Min. : 3.00
## 1st Qu.:24.00
## Median :52.00
## Mean :45.71
## 3rd Qu.:64.00
## Max. :90.00
# Standardization of the data
players_normalized <- as.data.frame(lapply(players_numeric, scale))
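A quick sanity check after standardization is that every column has mean 0 and standard deviation 1. The sketch below shows this on a small made-up data frame (the column values are assumptions for illustration) and compares scale() with the manual formula.

```r
# Two toy columns on very different scales
toy <- data.frame(height_cm  = c(170, 180, 190),
                  value_euro = c(1e5, 1e6, 1e7))

# Standardize: subtract the column mean, divide by the column standard deviation
z <- as.data.frame(scale(toy))

colMeans(z)      # approximately 0 for both columns
sapply(z, sd)    # 1 for both columns

# The same result by hand for one column
manual <- (toy$height_cm - mean(toy$height_cm)) / sd(toy$height_cm)
all.equal(z$height_cm, manual)
```

After this step both columns contribute on the same scale, even though the raw values differ by several orders of magnitude.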
As a preliminary step before PCA, we check the correlations between the variables. The correlation plot shows that many of the variables are strongly linearly correlated (Pearson correlation). For dimension reduction, this suggests that PCA will perform well on this data set: a few components should give a compact representation of the samples, close to their intrinsic dimension.
# Correlation plot
corrplot(cor(players_normalized, method = "pearson"), order ="alphabet", tl.cex=0.6, diag = F, method = "square", type = "upper")
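We calculate the covariance matrix and then its eigenvalues. The purpose of this step is to form a first impression before performing principal component analysis, which is based on the covariance matrix. Because the data are standardized, this covariance matrix coincides with the correlation matrix.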
# Calculate covariance matrix
players_normalized.cov <- cov(players_normalized)
# Calculate eigenvalues
players_normalized.eig <- eigen(players_normalized.cov)
# Eigenvalues
players_normalized.eig$values
## [1] 17.79950825 5.22103481 4.03978147 2.25898981 1.42468465 1.24156248
## [7] 0.93528607 0.82791238 0.66260601 0.57864851 0.45357312 0.43130198
## [13] 0.34899667 0.30009003 0.26842081 0.26023164 0.23818763 0.23177669
## [19] 0.21832030 0.20085740 0.19385009 0.19064559 0.18161265 0.16790215
## [25] 0.16509808 0.14015368 0.13076938 0.12016480 0.11302460 0.10170080
## [31] 0.08664520 0.07830514 0.06905778 0.06469179 0.06307170 0.05817814
## [37] 0.05130063 0.03564935 0.02398953 0.02241819
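Two properties make these eigenvalues easy to read. For standardized data the covariance matrix equals the correlation matrix, and the eigenvalues sum to the number of variables. The sketch below checks both on the built-in mtcars data set, which is used here only as a stand-in.

```r
# Stand-in data: built-in mtcars (11 numeric variables)
Z <- scale(mtcars)

# Covariance of standardized data equals the correlation of the raw data
all.equal(cov(Z), cor(mtcars))

# Eigenvalues sum to the number of variables (the trace of a correlation matrix)
ev <- eigen(cov(Z))$values
sum(ev)
ncol(mtcars)
```

Applied to the output above, this means the 40 eigenvalues add up to 40, so dividing an eigenvalue by 40 gives its share of the total variance.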
The dimension reduction technique used to find the intrinsic dimension of our data set is Principal Component Analysis (PCA).
How many principal components should we choose as a
representation?
We have to select an appropriate number based on the cumulative
explained variance and additional plots.
# Perform PCA
PCA <- prcomp(players_normalized, center = F, scale = F)
# Summary of principal components
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.219 2.2850 2.0099 1.50299 1.19360 1.11425 0.96710
## Proportion of Variance 0.445 0.1305 0.1010 0.05647 0.03562 0.03104 0.02338
## Cumulative Proportion 0.445 0.5755 0.6765 0.73298 0.76860 0.79964 0.82302
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.9099 0.81401 0.76069 0.67348 0.65674 0.59076 0.5478
## Proportion of Variance 0.0207 0.01657 0.01447 0.01134 0.01078 0.00872 0.0075
## Cumulative Proportion 0.8437 0.86028 0.87475 0.88609 0.89687 0.90560 0.9131
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.51809 0.51013 0.48804 0.48143 0.46725 0.44817 0.44028
## Proportion of Variance 0.00671 0.00651 0.00595 0.00579 0.00546 0.00502 0.00485
## Cumulative Proportion 0.91981 0.92632 0.93227 0.93806 0.94352 0.94854 0.95339
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.43663 0.42616 0.4098 0.40632 0.3744 0.36162 0.3466
## Proportion of Variance 0.00477 0.00454 0.0042 0.00413 0.0035 0.00327 0.0030
## Cumulative Proportion 0.95816 0.96270 0.9669 0.97102 0.9745 0.97780 0.9808
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.33619 0.31891 0.29436 0.27983 0.26279 0.25435 0.25114
## Proportion of Variance 0.00283 0.00254 0.00217 0.00196 0.00173 0.00162 0.00158
## Cumulative Proportion 0.98362 0.98617 0.98833 0.99029 0.99202 0.99363 0.99521
## PC36 PC37 PC38 PC39 PC40
## Standard deviation 0.24120 0.22650 0.18881 0.1549 0.14973
## Proportion of Variance 0.00145 0.00128 0.00089 0.0006 0.00056
## Cumulative Proportion 0.99667 0.99795 0.99884 0.9994 1.00000
We consider a principal component significant when it explains more than 10% of the total variance. Applying this rule to the summary above and to the scree plot below, we keep 3 principal components. Together they explain about 68% of the variance. This is a substantial reduction: from 40 dimensions to only 3.
# Plot of explained variance via principal components
fviz_eig(PCA, main = "Scree plot", barfill = "lightblue", barcolor = "lightblue")
get_eigenvalue(PCA)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 17.79950825 44.49877062 44.49877
## Dim.2 5.22103481 13.05258703 57.55136
## Dim.3 4.03978147 10.09945367 67.65081
## Dim.4 2.25898981 5.64747451 73.29829
## Dim.5 1.42468465 3.56171163 76.86000
## Dim.6 1.24156248 3.10390621 79.96390
## Dim.7 0.93528607 2.33821518 82.30212
## Dim.8 0.82791238 2.06978096 84.37190
## Dim.9 0.66260601 1.65651503 86.02841
## Dim.10 0.57864851 1.44662127 87.47504
## Dim.11 0.45357312 1.13393279 88.60897
## Dim.12 0.43130198 1.07825496 89.68722
## Dim.13 0.34899667 0.87249166 90.55972
## Dim.14 0.30009003 0.75022509 91.30994
## Dim.15 0.26842081 0.67105203 91.98099
## Dim.16 0.26023164 0.65057909 92.63157
## Dim.17 0.23818763 0.59546907 93.22704
## Dim.18 0.23177669 0.57944171 93.80648
## Dim.19 0.21832030 0.54580075 94.35228
## Dim.20 0.20085740 0.50214351 94.85443
## Dim.21 0.19385009 0.48462523 95.33905
## Dim.22 0.19064559 0.47661398 95.81567
## Dim.23 0.18161265 0.45403163 96.26970
## Dim.24 0.16790215 0.41975538 96.68945
## Dim.25 0.16509808 0.41274521 97.10220
## Dim.26 0.14015368 0.35038420 97.45258
## Dim.27 0.13076938 0.32692345 97.77951
## Dim.28 0.12016480 0.30041201 98.07992
## Dim.29 0.11302460 0.28256151 98.36248
## Dim.30 0.10170080 0.25425201 98.61673
## Dim.31 0.08664520 0.21661299 98.83334
## Dim.32 0.07830514 0.19576284 99.02911
## Dim.33 0.06905778 0.17264445 99.20175
## Dim.34 0.06469179 0.16172948 99.36348
## Dim.35 0.06307170 0.15767924 99.52116
## Dim.36 0.05817814 0.14544536 99.66661
## Dim.37 0.05130063 0.12825157 99.79486
## Dim.38 0.03564935 0.08912338 99.88398
## Dim.39 0.02398953 0.05997383 99.94395
## Dim.40 0.02241819 0.05604546 100.00000
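The 10% rule can be applied directly to the eigenvalues. The sketch below copies the six largest eigenvalues from the output above and uses the fact that the total variance of 40 standardized variables is 40. The variable names are chosen for illustration only.

```r
# Six largest eigenvalues, copied from the output above
eig_top <- c(17.79950825, 5.22103481, 4.03978147,
             2.25898981, 1.42468465, 1.24156248)

# Total variance of 40 standardized variables
total_variance <- 40

# Share of variance explained by each component
prop <- eig_top / total_variance

# Components that explain more than 10% of the variance
significant <- which(prop > 0.10)
significant

# Cumulative share explained by the retained components
cum_share <- cumsum(prop)[max(significant)]
round(cum_share, 4)
```

The third component passes the threshold only narrowly (about 10.1%), so a slightly stricter cut-off would keep two components instead of three.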
The blue dashed vertical line in the cumulative variance plot marks the number of principal components used in the further analysis.
# Cumulative variance plot (importance[3, ] is a proportion, so multiply by 100 to get %)
plot(100 * summary(PCA)$importance[3,], type="l", ylab = "Variance explained (%)",
xlab = "Number of principal components", main = "Cumulative variance explained (%)")
abline(v = 3, col = "lightblue", lwd = 2, lty = 2)
The output below lists the variables ordered from most to least important in terms of their contribution to the first principal component (by absolute loading). Bar plots then show which variables matter most for the first, second and third principal components.
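# Loadings of the first principal component
loading_scores_PC_1<-PCA$rotation[,1]
# Rank the variables by absolute loading (the sign of a component is arbitrary)
fac_scores_PC_1<-abs(loading_scores_PC_1)
fac_scores_PC_1_ranked<-names(sort(fac_scores_PC_1, decreasing=T))
# Loadings ordered from the largest to the smallest absolute value
PCA$rotation[fac_scores_PC_1_ranked, 1]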
## ball_control dribbling
## -0.22551237 -0.21963050
## short_passing positioning
## -0.21761239 -0.20858994
## curve crossing
## -0.20858367 -0.20796421
## long_shots shot_power
## -0.20763032 -0.20496157
## skill_moves.1.5. volleys
## -0.19792516 -0.19696285
## long_passing freekick_accuracy
## -0.19654245 -0.19498512
## finishing penalties
## -0.19006914 -0.18837896
## vision stamina
## -0.18543801 -0.18523694
## composure agility
## -0.18137241 -0.17728369
## acceleration sprint_speed
## -0.16884020 -0.16588564
## heading_accuracy balance
## -0.15485224 -0.15172009
## aggression overall_rating
## -0.14374574 -0.13642395
## reactions marking
## -0.13431661 -0.11669634
## interceptions standing_tackle
## -0.11165036 -0.10839623
## sliding_tackle value_euro
## -0.10055821 -0.09971576
## potential release_clause_euro
## -0.09624621 -0.09446493
## weak_foot.1.5. wage_euro
## -0.09248866 -0.09184609
## height_cm weight_kgs
## 0.08296352 0.07939504
## international_reputation.1.5. jumping
## -0.07475227 -0.06496160
## age strength
## -0.04085795 -0.02902935
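Because the loadings of one component have unit length, the squared loading of a variable, times 100, is its percentage contribution to that component. The sketch below uses four loadings copied from the output above and compares them with the contribution expected if all 40 variables contributed equally. As far as I know this is the reference level that contribution bar plots usually mark.

```r
# Four PC1 loadings, copied from the output above
loadings_pc1 <- c(ball_control  = -0.22551237,
                  dribbling     = -0.21963050,
                  short_passing = -0.21761239,
                  strength      = -0.02902935)

# Contribution of each variable to PC1, in percent (the sign does not matter)
contrib <- 100 * loadings_pc1^2
round(contrib, 2)

# Contribution expected under equal shares of 40 variables
expected <- 100 / 40
contrib > expected
```

ball_control contributes roughly twice the expected share, whereas strength contributes almost nothing to the first component.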
For the first principal component, the 5 variables that contribute
most to the explained variance are:
- ball_control
- dribbling
- short_passing
- positioning
- curve
fviz_contrib(PCA, "var", axes=1, xtickslab.rt=90, color = "lightblue", fill = "lightblue")
For the second principal component the 5 top variables
are:
- strength
- interceptions
- standing_tackle
- sliding_tackle
- marking
fviz_contrib(PCA, "var", axes=2, xtickslab.rt=90, color = "lightblue", fill = "lightblue")
For the third principal component the list is as
follows:
- value_euro
- release_clause_euro
- wage_euro
- international_reputation.1.5.
- sliding_tackle
fviz_contrib(PCA, "var", axes=3, xtickslab.rt=90, color = "lightblue", fill = "lightblue")
The correlation biplot for the first two principal components gives a lot of information about the variables. Positively correlated variables are grouped together, whereas negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants). The distance between a variable and the origin measures the quality of that variable on the factor map: variables that are far from the origin are well represented.
fviz_pca_var(PCA, col.var="lightblue", repel = T)
Based on the analysis above, we keep 3 principal components
as the intrinsic dimensions.
Let’s check the quality of the representation using cos2 (squared
coordinates).
# PCA with 3 principal components
res.pca_3comp <- PCA(players_normalized, scale.unit = F, ncp = 3, graph = F)
var <- get_pca_var(res.pca_3comp)
# Visualize the cos2 of variables on all the dimensions
corrplot(var$cos2, is.corr = F)
A high cos2 indicates a good representation of the variable on the principal component, whereas a low cos2 indicates that the variable is not well represented by the PCs. The same holds for the correlation circle plot: the closer a variable is to the circle of correlations (the higher its cos2), the better its representation on the factor map and the more important it is for interpreting these components.
# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca_3comp, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = T, alpha.var = "cos2")
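For standardized data, the cos2 of a variable on a component is its squared correlation with that component. The sketch below reproduces this in base R on the built-in USArrests data set, used here only as a stand-in, and shows that the cos2 values of a variable sum to 1 when all components are kept.

```r
# Stand-in data: built-in USArrests, standardized
X <- scale(USArrests)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Variable coordinates: loadings multiplied by the component standard deviations
coord <- sweep(pca$rotation, 2, pca$sdev, FUN = "*")

# For standardized data these coordinates are variable-component correlations
all.equal(coord, cor(X, pca$x))

# cos2: squared coordinates
cos2 <- coord^2
rowSums(cos2)            # 1 for every variable when all components are kept
rowSums(cos2[, 1:2])     # quality of representation on the first two components
```

A variable whose cos2 over the retained components is far below 1 is poorly represented by them, which is the reading suggested for the plots above.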
# PCA with 3 components and varimax rotation (psych package)
rot_pca <- principal(players_normalized, nfactors=3, rotate="varimax")
rot_pca
## Principal Components Analysis
## Call: principal(r = players_normalized, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 h2 u2 com
## age -0.05 0.24 0.35 0.18 0.816 1.8
## height_cm -0.51 0.01 0.22 0.31 0.691 1.4
## weight_kgs -0.58 0.10 0.34 0.46 0.539 1.7
## overall_rating 0.22 0.27 0.81 0.77 0.227 1.4
## potential 0.21 0.06 0.59 0.40 0.600 1.3
## value_euro 0.15 0.01 0.85 0.74 0.263 1.1
## wage_euro 0.11 0.04 0.81 0.66 0.338 1.0
## international_reputation.1.5. 0.07 0.03 0.71 0.52 0.484 1.0
## weak_foot.1.5. 0.39 0.00 0.17 0.18 0.819 1.4
## skill_moves.1.5. 0.82 0.14 0.23 0.75 0.251 1.2
## release_clause_euro 0.13 0.01 0.82 0.69 0.313 1.1
## crossing 0.81 0.36 0.13 0.80 0.197 1.4
## finishing 0.87 -0.05 0.24 0.82 0.184 1.2
## heading_accuracy 0.39 0.65 0.20 0.61 0.390 1.8
## short_passing 0.74 0.50 0.26 0.87 0.133 2.1
## volleys 0.84 0.05 0.29 0.79 0.213 1.2
## dribbling 0.91 0.24 0.15 0.92 0.084 1.2
## curve 0.85 0.20 0.23 0.81 0.188 1.3
## freekick_accuracy 0.77 0.21 0.23 0.70 0.303 1.3
## long_passing 0.62 0.54 0.25 0.74 0.259 2.3
## ball_control 0.85 0.38 0.23 0.92 0.083 1.5
## acceleration 0.81 0.11 -0.05 0.67 0.333 1.0
## sprint_speed 0.77 0.15 -0.03 0.61 0.390 1.1
## agility 0.84 0.06 0.02 0.72 0.284 1.0
## reactions 0.25 0.28 0.72 0.65 0.347 1.6
## balance 0.78 0.05 -0.13 0.63 0.371 1.1
## shot_power 0.78 0.25 0.29 0.76 0.243 1.5
## jumping 0.09 0.38 0.14 0.17 0.828 1.4
## stamina 0.61 0.60 0.09 0.73 0.271 2.0
## strength -0.25 0.52 0.36 0.47 0.531 2.3
## long_shots 0.85 0.15 0.27 0.82 0.180 1.3
## aggression 0.26 0.82 0.18 0.77 0.233 1.3
## interceptions 0.09 0.93 0.07 0.88 0.116 1.0
## positioning 0.89 0.13 0.20 0.85 0.145 1.1
## vision 0.73 0.08 0.37 0.68 0.319 1.5
## penalties 0.79 0.09 0.25 0.70 0.304 1.2
## composure 0.50 0.37 0.57 0.71 0.293 2.7
## marking 0.14 0.91 0.04 0.85 0.150 1.1
## standing_tackle 0.10 0.94 0.00 0.90 0.103 1.0
## sliding_tackle 0.07 0.93 -0.03 0.88 0.123 1.0
##
## RC1 RC2 RC3
## SS loadings 14.34 6.83 5.89
## Proportion Var 0.36 0.17 0.15
## Cumulative Var 0.36 0.53 0.68
## Proportion Explained 0.53 0.25 0.22
## Cumulative Proportion 0.53 0.78 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 116327.5 with prob < 0
##
## Fit based upon off diagonal values = 0.98
Complexity and uniqueness of the variables
In simple terms, the complexity of a variable tells us how many
components are needed to account for that variable. The lower the
complexity the better, as it makes the components easier to interpret.
For intuition: a complexity close to 1 means that the variable's
loading on only one component is relatively large and the remaining
ones are close to zero.
Variables in our initial data set that have the highest complexity
are:
- composure
- long_passing
- strength
- short_passing
- stamina
# Complexity plot
plot(rot_pca$complexity, ylab = "Complexity")
abline(h = mean(rot_pca$complexity), col = "lightblue", lwd = 1, lty = 2)
# Data frame with complexity and uniqueness values for every variable
comp_uniq <- data.frame(complexity = rot_pca$complexity, uniqueness = rot_pca$uniquenesses)
# Top 5 complexity values
head(comp_uniq[order(comp_uniq$complexity, decreasing = T),], 5)
## complexity uniqueness
## composure 2.729481 0.2929731
## long_passing 2.302919 0.2585065
## strength 2.266275 0.5307218
## short_passing 2.050271 0.1328398
## stamina 2.042112 0.2710557
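The complexity values reported by psych are Hofmann's index, which for loadings a on the retained components is (sum of a^2)^2 / (sum of a^4). The sketch below recomputes it from the rounded loadings in the table above, so the results agree with the reported values only approximately. The function name is chosen for illustration.

```r
# Hofmann's complexity index for one variable's loadings
hofmann_complexity <- function(a) sum(a^2)^2 / sum(a^4)

# Rounded loadings (RC1, RC2, RC3) copied from the table above
composure     <- c(0.50, 0.37, 0.57)
interceptions <- c(0.09, 0.93, 0.07)

hofmann_complexity(composure)       # close to the reported 2.73
hofmann_complexity(interceptions)   # close to 1: loads on a single component
```

composure spreads its loadings over all three components and is therefore hard to assign to one of them, while interceptions belongs almost entirely to RC2.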
The uniqueness of a variable is the proportion of its variance that is not shared with the other variables (not explained by the retained components). We would like uniqueness to be low, because it is then easier to reduce the space to a smaller number of dimensions (we retain a larger share of the variance). A low uniqueness also means that the variable carries little additional information relative to the other variables in the model.
Variables in our initial data set that have the highest uniqueness
are:
- jumping
- weak_foot.1.5.
- age
- height_cm
- potential
# Uniqueness plot
plot(rot_pca$uniquenesses, ylab = "Uniqueness")
abline(h = mean(rot_pca$uniquenesses), col = "lightblue", lwd = 1, lty = 2)
# Top 5 uniqueness values
head(comp_uniq[order(comp_uniq$uniqueness, decreasing = T),], 5)
## complexity uniqueness
## jumping 1.375521 0.8283828
## weak_foot.1.5. 1.385093 0.8189360
## age 1.811937 0.8164792
## height_cm 1.376557 0.6914984
## potential 1.268423 0.5999152
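Uniqueness (u2) is one minus the communality (h2), and the communality is the sum of squared loadings on the retained components. The sketch below recomputes both from the rounded loadings in the rotated table above, so small differences from the reported values are expected.

```r
# Rounded loadings (RC1, RC2, RC3) copied from the rotated table above
jumping         <- c(0.09, 0.38, 0.14)
standing_tackle <- c(0.10, 0.94, 0.00)

# Communality: variance of the variable explained by the 3 components
h2 <- c(jumping = sum(jumping^2), standing_tackle = sum(standing_tackle^2))

# Uniqueness: variance left unexplained
u2 <- 1 - h2
round(u2, 3)
```

About 83% of the variance of jumping is not captured by the three components, against roughly 10% for standing_tackle.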
As a second dimension reduction method we use t-SNE (t-Distributed Stochastic Neighbor Embedding). This method is helpful for getting a first look at how well the data set can be clustered. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data in a low-dimensional space for visualization (in our case two dimensions, which we visualize immediately). The nonlinear mapping is constructed so that points which are close to each other in the low-dimensional (usually 2 or 3) space are also similar in the high-dimensional input space. The t-SNE plot gives an indication of how many clusters to expect from clustering techniques that minimize the within-cluster sum of squares.
# Perform t-SNE dimension reduction technique
tsne <- Rtsne(players_normalized)
# Change 2D output to data frame
tsne_df <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2])
# Visualization in 2D scatter plot
ggplot(tsne_df, aes(x = x, y = y)) +
geom_point(color = "tomato") +
ggtitle("t-SNE plot") +
theme(plot.title = element_text(hjust = 0.5))
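A visual impression of the number of clusters can be compared with the within-cluster sum of squares for different numbers of clusters. The sketch below is a possible follow-up and not part of the analysis above: it runs k-means on the first two principal component scores of the built-in iris data, and the range of k and the nstart value are assumptions.

```r
set.seed(123)

# Stand-in data: standardized numeric columns of iris
X <- scale(iris[, 1:4])

# Scores on the first two principal components
scores <- prcomp(X, center = FALSE, scale. = FALSE)$x[, 1:2]

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) {
  kmeans(scores, centers = k, nstart = 20)$tot.withinss
})

# An "elbow" in this curve suggests a reasonable number of clusters
plot(1:6, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```

For the footballers data the same idea would apply to the scores of the three retained components, for example `PCA$x[, 1:3]`.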
Initially the data set contains 40 numeric variables, whereas after principal component analysis we keep only 3 dimensions, which together explain approximately 68% of the total variance. The variables with the highest influence on the first principal component seem to be the ones most important for offensive players (ball_control, dribbling, curve). For the second principal component, the abilities with the highest influence are best suited to defensive players (strength, interceptions, standing_tackle). The attributes with the highest influence on the third principal component go beyond football skills: they are closely connected with contracts and marketing (value_euro, release_clause_euro, wage_euro, international_reputation.1.5.).