A. Introduction
In this assignment, I apply dimension reduction to the FIFA 19 player dataset. The dataset covers more than 18,000 players, each described by 34 skill metrics. The goal is to reduce those metrics to a smaller number of dimensions.
B. Data preparation
In the data preparation step, I first remove the players with NA values. I then select columns 55 to 88, which hold the players' skill metrics, for the analysis. The goal is to reduce these 34 dimensions to a smaller number of dimensions.
getwd()
## [1] "G:/My Drive/UL/Assignment 2"
setwd("G:/My Drive/UL/Assignment 2")
df <- read.csv("data.csv")
#Check dataset
str(df)
## 'data.frame': 18207 obs. of 89 variables:
## $ ï.. : int 0 1 2 3 4 5 6 7 8 9 ...
## $ ID : int 158023 20801 190871 193080 192985 183277 177003 176580 155862 200389 ...
## $ Name : chr "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "De Gea" ...
## $ Age : int 31 33 26 27 27 27 32 31 32 25 ...
## $ Photo : chr "https://cdn.sofifa.org/players/4/19/158023.png" "https://cdn.sofifa.org/players/4/19/20801.png" "https://cdn.sofifa.org/players/4/19/190871.png" "https://cdn.sofifa.org/players/4/19/193080.png" ...
## $ Nationality : chr "Argentina" "Portugal" "Brazil" "Spain" ...
## $ Flag : chr "https://cdn.sofifa.org/flags/52.png" "https://cdn.sofifa.org/flags/38.png" "https://cdn.sofifa.org/flags/54.png" "https://cdn.sofifa.org/flags/45.png" ...
## $ Overall : int 94 94 92 91 91 91 91 91 91 90 ...
## $ Potential : int 94 94 93 93 92 91 91 91 91 93 ...
## $ Club : chr "FC Barcelona" "Juventus" "Paris Saint-Germain" "Manchester United" ...
## $ Club.Logo : chr "https://cdn.sofifa.org/teams/2/light/241.png" "https://cdn.sofifa.org/teams/2/light/45.png" "https://cdn.sofifa.org/teams/2/light/73.png" "https://cdn.sofifa.org/teams/2/light/11.png" ...
## $ Value : chr "€110.5M" "€77M" "€118.5M" "€72M" ...
## $ Wage : chr "€565K" "€405K" "€290K" "€260K" ...
## $ Special : int 2202 2228 2143 1471 2281 2142 2280 2346 2201 1331 ...
## $ Preferred.Foot : chr "Left" "Right" "Right" "Right" ...
## $ International.Reputation: int 5 5 5 4 4 4 4 5 4 3 ...
## $ Weak.Foot : int 4 4 5 3 5 4 4 4 3 3 ...
## $ Skill.Moves : int 4 5 5 1 4 4 4 3 3 1 ...
## $ Work.Rate : chr "Medium/ Medium" "High/ Low" "High/ Medium" "Medium/ Medium" ...
## $ Body.Type : chr "Messi" "C. Ronaldo" "Neymar" "Lean" ...
## $ Real.Face : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Position : chr "RF" "ST" "LW" "GK" ...
## $ Jersey.Number : int 10 7 10 1 7 10 10 9 15 1 ...
## $ Joined : chr "Jul 1, 2004" "Jul 10, 2018" "Aug 3, 2017" "Jul 1, 2011" ...
## $ Loaned.From : chr "" "" "" "" ...
## $ Contract.Valid.Until : chr "2021" "2022" "2022" "2020" ...
## $ Height : chr "5'7" "6'2" "5'9" "6'4" ...
## $ Weight : chr "159lbs" "183lbs" "150lbs" "168lbs" ...
## $ LS : chr "88+2" "91+3" "84+3" "" ...
## $ ST : chr "88+2" "91+3" "84+3" "" ...
## $ RS : chr "88+2" "91+3" "84+3" "" ...
## $ LW : chr "92+2" "89+3" "89+3" "" ...
## $ LF : chr "93+2" "90+3" "89+3" "" ...
## $ CF : chr "93+2" "90+3" "89+3" "" ...
## $ RF : chr "93+2" "90+3" "89+3" "" ...
## $ RW : chr "92+2" "89+3" "89+3" "" ...
## $ LAM : chr "93+2" "88+3" "89+3" "" ...
## $ CAM : chr "93+2" "88+3" "89+3" "" ...
## $ RAM : chr "93+2" "88+3" "89+3" "" ...
## $ LM : chr "91+2" "88+3" "88+3" "" ...
## $ LCM : chr "84+2" "81+3" "81+3" "" ...
## $ CM : chr "84+2" "81+3" "81+3" "" ...
## $ RCM : chr "84+2" "81+3" "81+3" "" ...
## $ RM : chr "91+2" "88+3" "88+3" "" ...
## $ LWB : chr "64+2" "65+3" "65+3" "" ...
## $ LDM : chr "61+2" "61+3" "60+3" "" ...
## $ CDM : chr "61+2" "61+3" "60+3" "" ...
## $ RDM : chr "61+2" "61+3" "60+3" "" ...
## $ RWB : chr "64+2" "65+3" "65+3" "" ...
## $ LB : chr "59+2" "61+3" "60+3" "" ...
## $ LCB : chr "47+2" "53+3" "47+3" "" ...
## $ CB : chr "47+2" "53+3" "47+3" "" ...
## $ RCB : chr "47+2" "53+3" "47+3" "" ...
## $ RB : chr "59+2" "61+3" "60+3" "" ...
## $ Crossing : int 84 84 79 17 93 81 86 77 66 13 ...
## $ Finishing : int 95 94 87 13 82 84 72 93 60 11 ...
## $ HeadingAccuracy : int 70 89 62 21 55 61 55 77 91 15 ...
## $ ShortPassing : int 90 81 84 50 92 89 93 82 78 29 ...
## $ Volleys : int 86 87 84 13 82 80 76 88 66 13 ...
## $ Dribbling : int 97 88 96 18 86 95 90 87 63 12 ...
## $ Curve : int 93 81 88 21 85 83 85 86 74 13 ...
## $ FKAccuracy : int 94 76 87 19 83 79 78 84 72 14 ...
## $ LongPassing : int 87 77 78 51 91 83 88 64 77 26 ...
## $ BallControl : int 96 94 95 42 91 94 93 90 84 16 ...
## $ Acceleration : int 91 89 94 57 78 94 80 86 76 43 ...
## $ SprintSpeed : int 86 91 90 58 76 88 72 75 75 60 ...
## $ Agility : int 91 87 96 60 79 95 93 82 78 67 ...
## $ Reactions : int 95 96 94 90 91 90 90 92 85 86 ...
## $ Balance : int 95 70 84 43 77 94 94 83 66 49 ...
## $ ShotPower : int 85 95 80 31 91 82 79 86 79 22 ...
## $ Jumping : int 68 95 61 67 63 56 68 69 93 76 ...
## $ Stamina : int 72 88 81 43 90 83 89 90 84 41 ...
## $ Strength : int 59 79 49 64 75 66 58 83 83 78 ...
## $ LongShots : int 94 93 82 12 91 80 82 85 59 12 ...
## $ Aggression : int 48 63 56 38 76 54 62 87 88 34 ...
## $ Interceptions : int 22 29 36 30 61 41 83 41 90 19 ...
## $ Positioning : int 94 95 89 12 87 87 79 92 60 11 ...
## $ Vision : int 94 82 87 68 94 89 92 84 63 70 ...
## $ Penalties : int 75 85 81 40 79 86 82 85 75 11 ...
## $ Composure : int 96 95 94 68 88 91 84 85 82 70 ...
## $ Marking : int 33 28 27 15 68 34 60 62 87 27 ...
## $ StandingTackle : int 28 31 24 21 58 27 76 45 92 12 ...
## $ SlidingTackle : int 26 23 33 13 51 22 73 38 91 18 ...
## $ GKDiving : int 6 7 9 90 15 11 13 27 11 86 ...
## $ GKHandling : int 11 11 9 85 13 12 9 25 8 92 ...
## $ GKKicking : int 15 15 15 87 5 6 7 31 9 78 ...
## $ GKPositioning : int 14 14 15 88 10 8 14 33 7 88 ...
## $ GKReflexes : int 8 11 11 94 13 8 9 37 11 89 ...
## $ Release.Clause : chr "€226.5M" "€127.1M" "€228.1M" "€138.6M" ...
#Clean NA
df <- na.omit(df)
#Select the continuous variables that describe the characteristics of players
main <- df[,55:88]
summary(main)
## Crossing Finishing HeadingAccuracy ShortPassing Volleys
## Min. : 5.00 Min. : 2.00 Min. : 4.0 Min. : 7.0 Min. : 4.00
## 1st Qu.:38.00 1st Qu.:30.00 1st Qu.:44.0 1st Qu.:54.0 1st Qu.:30.00
## Median :54.00 Median :49.00 Median :56.0 Median :62.0 Median :44.00
## Mean :49.74 Mean :45.55 Mean :52.3 Mean :58.7 Mean :42.91
## 3rd Qu.:64.00 3rd Qu.:62.00 3rd Qu.:64.0 3rd Qu.:68.0 3rd Qu.:57.00
## Max. :93.00 Max. :95.00 Max. :94.0 Max. :93.0 Max. :90.00
## Dribbling Curve FKAccuracy LongPassing
## Min. : 4.00 Min. : 6.00 Min. : 3.00 Min. : 9.00
## 1st Qu.:49.00 1st Qu.:34.00 1st Qu.:31.00 1st Qu.:43.00
## Median :61.00 Median :48.00 Median :41.00 Median :56.00
## Mean :55.38 Mean :47.18 Mean :42.87 Mean :52.72
## 3rd Qu.:68.00 3rd Qu.:62.00 3rd Qu.:57.00 3rd Qu.:64.00
## Max. :97.00 Max. :94.00 Max. :94.00 Max. :93.00
## BallControl Acceleration SprintSpeed Agility Reactions
## Min. : 5.00 Min. :12.00 Min. :12.00 Min. :14.0 Min. :21.00
## 1st Qu.:54.00 1st Qu.:57.00 1st Qu.:57.00 1st Qu.:55.0 1st Qu.:56.00
## Median :63.00 Median :67.00 Median :67.00 Median :66.0 Median :62.00
## Mean :58.37 Mean :64.61 Mean :64.73 Mean :63.5 Mean :61.84
## 3rd Qu.:69.00 3rd Qu.:75.00 3rd Qu.:75.00 3rd Qu.:74.0 3rd Qu.:68.00
## Max. :96.00 Max. :97.00 Max. :96.00 Max. :96.0 Max. :96.00
## Balance ShotPower Jumping Stamina
## Min. :16.00 Min. : 2.00 Min. :15.00 Min. :12.00
## 1st Qu.:56.00 1st Qu.:45.00 1st Qu.:58.00 1st Qu.:56.00
## Median :66.00 Median :59.00 Median :66.00 Median :66.00
## Mean :63.96 Mean :55.47 Mean :65.09 Mean :63.22
## 3rd Qu.:74.00 3rd Qu.:68.00 3rd Qu.:73.00 3rd Qu.:74.00
## Max. :96.00 Max. :95.00 Max. :95.00 Max. :96.00
## Strength LongShots Aggression Interceptions Positioning
## Min. :17.00 Min. : 3.00 Min. :11.00 Min. : 3.0 Min. : 2.00
## 1st Qu.:58.00 1st Qu.:33.00 1st Qu.:44.00 1st Qu.:26.0 1st Qu.:38.00
## Median :67.00 Median :51.00 Median :59.00 Median :52.0 Median :55.00
## Mean :65.32 Mean :47.11 Mean :55.88 Mean :46.7 Mean :49.96
## 3rd Qu.:74.00 3rd Qu.:62.00 3rd Qu.:69.00 3rd Qu.:64.0 3rd Qu.:64.00
## Max. :97.00 Max. :94.00 Max. :95.00 Max. :92.0 Max. :95.00
## Vision Penalties Composure Marking StandingTackle
## Min. :10.00 Min. : 5.00 Min. : 3.00 Min. : 3.00 Min. : 2.0
## 1st Qu.:44.00 1st Qu.:39.00 1st Qu.:51.00 1st Qu.:30.00 1st Qu.:27.0
## Median :55.00 Median :49.00 Median :60.00 Median :53.00 Median :55.0
## Mean :53.41 Mean :48.55 Mean :58.65 Mean :47.29 Mean :47.7
## 3rd Qu.:64.00 3rd Qu.:60.00 3rd Qu.:67.00 3rd Qu.:64.00 3rd Qu.:66.0
## Max. :94.00 Max. :92.00 Max. :96.00 Max. :94.00 Max. :93.0
## SlidingTackle GKDiving GKHandling GKKicking
## Min. : 3.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.:24.00 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00
## Median :52.00 Median :11.00 Median :11.00 Median :11.00
## Mean :45.67 Mean :16.62 Mean :16.39 Mean :16.23
## 3rd Qu.:64.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :91.00 Max. :90.00 Max. :92.00 Max. :91.00
## GKPositioning GKReflexes
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :16.39 Mean :16.71
## 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :90.00 Max. :94.00
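Selecting by position (55:88) depends on the exact column order of the CSV file. A minimal sketch of the same cleaning and selection done by column name is shown below. It assumes a small made-up table (`toy`) in place of the real data; only the column names are taken from the `str(df)` output above.

```r
# Toy stand-in for the FIFA table; the values are made up for illustration.
toy <- data.frame(
  Name = c("A", "B", "C"),
  Crossing = c(84L, NA, 79L),
  Finishing = c(95L, 94L, 87L),
  GKReflexes = c(8L, 11L, 11L),
  Release.Clause = c("x", "y", "z")
)

# Drop rows that contain NA, as in the report.
toy <- na.omit(toy)

# Locate the first and last skill column by name instead of by position.
first <- which(names(toy) == "Crossing")
last <- which(names(toy) == "GKReflexes")
skills <- toy[, first:last]

# Sanity check: every selected column should be numeric.
stopifnot(all(sapply(skills, is.numeric)))
```

With the real data, the same two `which()` calls would be expected to return 55 and 88.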
Next, I check the correlations between the variables and standardize the data.
library(corrplot)
#check correlations between variables
corr_simple <- function(data=main,sig=0.5){
  #use the data argument (the original computed cor(main) and ignored it)
  corr <- cor(data)
  #prepare to drop duplicates and correlations of 1
  corr[lower.tri(corr,diag=TRUE)] <- NA
  #drop perfect correlations
  corr[corr == 1] <- NA
  #turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  #remove the NA values from above
  corr <- na.omit(corr)
  #keep only correlations whose absolute value exceeds the threshold
  corr <- subset(corr, abs(Freq) > sig)
  #sort by highest absolute correlation
  corr <- corr[order(-abs(corr$Freq)),]
  #turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  #plot correlations visually
  corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ")
}
corr_simple()
library(caret)
# standardize the data: center each variable to mean 0 and scale it to sd 1
main_preproc <- preProcess(main, method=c("center", "scale"))
main_z <- predict(main_preproc, main)
With 34 variables, a plot of every pairwise correlation would be unreadable, so I only show correlations whose absolute value is greater than 0.5 and that are not equal to 1. The map shows that “composure”, which is the ability to play under pressure, and “positioning” are positively correlated with many other skills. Goalkeeper is a special position on the field: its particular skills are negatively correlated with the skills of the other positions.
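The standardization step can be checked with a short sketch. It assumes the built-in `USArrests` dataset as a stand-in for the 34 skill columns and uses base R's `scale()` in place of `preProcess()`; both center by the column mean and divide by the column standard deviation.

```r
# Built-in dataset used as a stand-in for the skill columns (assumption).
x <- USArrests

# Manual z-scores: subtract the column mean, then divide by the column sd.
centered <- sweep(as.matrix(x), 2, colMeans(x), "-")
z_manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# Base R helper that performs the same centering and scaling.
z_scale <- scale(x)
max_diff <- max(abs(z_manual - z_scale))

# PCA on already standardized data, with no further centering or scaling ...
pca_a <- prcomp(z_scale, center = FALSE, scale. = FALSE)
# ... gives the same standard deviations as PCA that standardizes internally.
pca_b <- prcomp(x, center = TRUE, scale. = TRUE)
sdev_diff <- max(abs(pca_a$sdev - pca_b$sdev))
```

Both differences should be zero up to floating-point error, which is why standardizing first and then calling `prcomp()` with `center = FALSE` and `scale. = FALSE` is a valid order of operations.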
C. PCA application
Because of the size of the dataset, MDS is very expensive in terms of computing cost, since it works on the distances between all pairs of players. Therefore, in this assignment, I choose the PCA method.
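A rough back-of-the-envelope sketch of that cost is given below. It assumes the 18207 rows reported by `str(df)` (the count after removing NA rows is slightly lower) and 8 bytes per stored distance; the real memory use of an MDS routine may be higher.

```r
# Number of players as reported by str(df), before NA removal (assumption).
n <- 18207
# Number of skill variables used in the analysis.
p <- 34

# MDS needs one distance per pair of players.
n_pairs <- n * (n - 1) / 2

# Memory for those distances stored as 8-byte doubles, in GiB.
gib <- n_pairs * 8 / 1024^3

# PCA only needs the p x p correlation matrix of the variables.
pca_entries <- p * p
```

Here `n_pairs` is about 166 million distances (roughly 1.2 GiB), while the matrix that PCA decomposes has only 1156 entries.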
#apply PCA (main_z is already centered and scaled, so prcomp does not repeat it)
main_z.pca <- prcomp(main_z, center = FALSE, scale. = FALSE)
summary(main_z.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.3419 2.2158 1.58430 1.30628 1.14426 0.79216 0.66251
## Proportion of Variance 0.5545 0.1444 0.07382 0.05019 0.03851 0.01846 0.01291
## Cumulative Proportion 0.5545 0.6989 0.77269 0.82288 0.86139 0.87985 0.89276
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.59924 0.54908 0.52604 0.49891 0.48207 0.47250 0.45325
## Proportion of Variance 0.01056 0.00887 0.00814 0.00732 0.00683 0.00657 0.00604
## Cumulative Proportion 0.90332 0.91218 0.92032 0.92764 0.93448 0.94105 0.94709
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.44747 0.42779 0.4125 0.37214 0.35811 0.35283 0.34384
## Proportion of Variance 0.00589 0.00538 0.0050 0.00407 0.00377 0.00366 0.00348
## Cumulative Proportion 0.95298 0.95836 0.9634 0.96744 0.97121 0.97487 0.97835
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.32802 0.30413 0.28872 0.26866 0.26127 0.25516 0.24776
## Proportion of Variance 0.00316 0.00272 0.00245 0.00212 0.00201 0.00191 0.00181
## Cumulative Proportion 0.98151 0.98423 0.98669 0.98881 0.99082 0.99273 0.99454
## PC29 PC30 PC31 PC32 PC33 PC34
## Standard deviation 0.1937 0.1931 0.17616 0.17426 0.16318 0.15144
## Proportion of Variance 0.0011 0.0011 0.00091 0.00089 0.00078 0.00067
## Cumulative Proportion 0.9956 0.9967 0.99765 0.99854 0.99933 1.00000
library(factoextra)
# plot the variables on the first two principal components
fviz_pca_var(main_z.pca, col.var="steelblue")
# scree plot with eigenvalues on the y-axis
fviz_eig(main_z.pca, choice='eigenvalue')
eig.val <- get_eigenvalue(main_z.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 18.85180556 55.44648694 55.44649
## Dim.2 4.90975256 14.44044871 69.88694
## Dim.3 2.51001675 7.38240221 77.26934
## Dim.4 1.70636728 5.01872730 82.28807
## Dim.5 1.30932681 3.85096121 86.13903
## Dim.6 0.62751831 1.84564208 87.98467
## Dim.7 0.43892341 1.29095120 89.27562
## Dim.8 0.35909034 1.05614805 90.33177
## Dim.9 0.30148807 0.88672962 91.21850
## Dim.10 0.27671920 0.81388001 92.03238
## Dim.11 0.24890902 0.73208534 92.76446
## Dim.12 0.23238685 0.68349072 93.44795
## Dim.13 0.22325759 0.65663998 94.10459
## Dim.14 0.20543604 0.60422363 94.70882
## Dim.15 0.20022505 0.58889719 95.29771
## Dim.16 0.18300700 0.53825590 95.83597
## Dim.17 0.17016869 0.50049615 96.33647
## Dim.18 0.13848727 0.40731551 96.74378
## Dim.19 0.12824515 0.37719161 97.12097
## Dim.20 0.12448956 0.36614576 97.48712
## Dim.21 0.11822780 0.34772881 97.83485
## Dim.22 0.10759798 0.31646466 98.15131
## Dim.23 0.09249322 0.27203888 98.42335
## Dim.24 0.08336001 0.24517651 98.66853
## Dim.25 0.07218008 0.21229437 98.88082
## Dim.26 0.06826044 0.20076600 99.08159
## Dim.27 0.06510532 0.19148625 99.27307
## Dim.28 0.06138729 0.18055086 99.45363
## Dim.29 0.03751863 0.11034891 99.56397
## Dim.30 0.03728700 0.10966764 99.67364
## Dim.31 0.03103274 0.09127276 99.76491
## Dim.32 0.03036686 0.08931430 99.85423
## Dim.33 0.02662815 0.07831808 99.93255
## Dim.34 0.02293397 0.06745286 100.00000
Following the Kaiser criterion, the first 5 dimensions are selected for further evaluation because their eigenvalues are greater than the mean of all eigenvalues, which is 1 for standardized data. Together they account for about 86% of the variance. In addition, Dim.1 and Dim.2 are the two most important dimensions and deserve the most attention, as they account for almost 70% of the variance.
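The selection rule can be reproduced from the table above. The sketch below copies the six largest eigenvalues from that output and assumes that the eigenvalues of 34 standardized variables sum to 34, so their mean is 1.

```r
# Six largest eigenvalues, copied from the eig.val output above.
eig_top <- c(18.85180556, 4.90975256, 2.51001675,
             1.70636728, 1.30932681, 0.62751831)

# Number of standardized variables; the eigenvalues sum to p, so their mean is 1.
p <- 34
kaiser_threshold <- 1

# Kaiser criterion: keep the dimensions with an above-average eigenvalue.
n_keep <- sum(eig_top > kaiser_threshold)

# Cumulative share of variance, in percent.
cum_pct <- cumsum(eig_top) / p * 100
```

`n_keep` is 5, `cum_pct[2]` is about 69.89 and `cum_pct[5]` is about 86.14, in line with the cumulative.variance.percent column.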
a<-fviz_contrib(main_z.pca, "var", axes=1, xtickslab.rt=90)
a
The variables below contribute the most to Dim.1; in other words, Dim.1 largely reflects the values of these variables.
BallControl
Crossing
Curve
Dribbling
FKAccuracy
GKDiving
GKHandling
GKKicking
GKPositioning
GKReflexes
LongPassing
LongShots
Penalties
Positioning
ShortPassing
ShotPower
Stamina
Volleys
This dimension can be called “Attack”. BallControl, Dribbling and ShortPassing are the most important variables in this dimension.
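The contribution plots can be approximated by hand. The sketch below assumes the built-in `USArrests` dataset as a stand-in and assumes that, for a `prcomp` fit, the contribution of a variable to a component equals its squared loading in percent, with the reference line of `fviz_contrib()` at the uniform share of 100 divided by the number of variables. Contributions are unsigned, so the signs of the loadings are inspected separately.

```r
# Built-in dataset used as a stand-in for the skill columns (assumption).
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution of each variable to each component, in percent.
# Each column of the rotation matrix has unit length, so columns sum to 100.
contrib <- pca$rotation^2 * 100

# Expected contribution if all variables contributed equally.
expected <- 100 / nrow(pca$rotation)

# Variables whose contribution to PC1 exceeds the uniform share.
above <- rownames(contrib)[contrib[, "PC1"] > expected]

# Loading signs show which variables pull the component in opposite directions.
signs <- sign(pca$rotation[, "PC1"])
```

For the 34 skill variables the uniform share would be 100/34, about 2.94%. Checking `signs` for Dim.1 would show whether the goalkeeper variables load in the opposite direction to the outfield skills.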
b<-fviz_contrib(main_z.pca, "var", axes=2, xtickslab.rt=90)
b
The variables below contribute the most to Dim.2.
Aggression
Finishing
Interceptions
Marking
SlidingTackle
StandingTackle
Strength
Volleys
This dimension can be called “Defence”, as the 4 most important variables contributing to it are SlidingTackle, StandingTackle, Interceptions and Marking.
c<-fviz_contrib(main_z.pca, "var", axes=3, xtickslab.rt=90)
c
The variables below contribute the most to Dim.3.
Acceleration
Composure
GKDiving
GKHandling
GKKicking
GKPositioning
GKReflexes
Reactions
SprintSpeed
Strength
Vision
This dimension can be called “Goalkeeper”. Alongside the set of goalkeeper skills, it includes Reactions and Composure, which are also important for a goalkeeper. These two variables contribute the most to this dimension.
d<-fviz_contrib(main_z.pca, "var", axes=4, xtickslab.rt=90)
d
The variables below contribute the most to Dim.4.
Acceleration
Agility
Balance
Crossing
Finishing
HeadingAccuracy
Interceptions
Penalties
SlidingTackle
Strength
The 4 attributes contributing the most are Agility, Balance, HeadingAccuracy and Strength, so this dimension can be called “Physique”.
e<-fviz_contrib(main_z.pca, "var", axes=5, xtickslab.rt=90)
e
The variables below contribute the most to Dim.5.
Acceleration
Agility
FKAccuracy
Jumping
LongPassing
SprintSpeed
Stamina
Strength
The top contributors are Acceleration, Jumping and SprintSpeed. Therefore, this dimension can be named “Athletics”.
D. Conclusion
With PCA, the 34 metrics can be reduced to 5 dimensions: Attack, Defence, Goalkeeper, Physique and Athletics. A player can be evaluated either in detail through the 34 attributes or more quickly through these 5 dimensions.
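As a closing illustration, the sketch below shows how the reduced representation could be extracted and how much information it keeps. It assumes the built-in `USArrests` dataset and `k = 2` components as stand-ins; for the FIFA data the same steps would use `main_z.pca` and `k = 5`.

```r
# Built-in dataset used as a stand-in for the skill columns (assumption).
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
k <- 2

# Reduced representation: one row per observation, k columns.
scores <- pca$x[, 1:k]

# Share of the total variance retained by the first k components.
retained <- sum(pca$sdev[1:k]^2) / sum(pca$sdev^2)

# Approximate reconstruction of the standardized data from k components.
z <- scale(USArrests)
z_hat <- scores %*% t(pca$rotation[, 1:k])

# Relative squared reconstruction error; it equals the variance that was dropped.
err <- sum((z - z_hat)^2) / sum(z^2)
```

The reconstruction error matches `1 - retained`, so the 5 dimensions kept in this report would be expected to lose about 14% of the variance of the standardized skill metrics.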