A. Introduction

In this assignment, I apply dimension reduction to analyse the FIFA 19 player dataset. The dataset describes more than 18,000 players, each rated on 34 skill metrics. My aim is to reduce those metrics to a smaller number of dimensions.

B. Data preparation

In the data preparation step, I first remove the players with missing (NA) values. I then select columns 55 to 88, which hold the skill metrics of the players, for analysis. My goal is to reduce these 34 dimensions to a lower number of dimensions.

getwd()
## [1] "G:/My Drive/UL/Assignment 2"
setwd("G:/My Drive/UL/Assignment 2")
df <- read.csv("data.csv")
#Check dataset
str(df)
## 'data.frame':    18207 obs. of  89 variables:
##  $ ï..                     : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ ID                      : int  158023 20801 190871 193080 192985 183277 177003 176580 155862 200389 ...
##  $ Name                    : chr  "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "De Gea" ...
##  $ Age                     : int  31 33 26 27 27 27 32 31 32 25 ...
##  $ Photo                   : chr  "https://cdn.sofifa.org/players/4/19/158023.png" "https://cdn.sofifa.org/players/4/19/20801.png" "https://cdn.sofifa.org/players/4/19/190871.png" "https://cdn.sofifa.org/players/4/19/193080.png" ...
##  $ Nationality             : chr  "Argentina" "Portugal" "Brazil" "Spain" ...
##  $ Flag                    : chr  "https://cdn.sofifa.org/flags/52.png" "https://cdn.sofifa.org/flags/38.png" "https://cdn.sofifa.org/flags/54.png" "https://cdn.sofifa.org/flags/45.png" ...
##  $ Overall                 : int  94 94 92 91 91 91 91 91 91 90 ...
##  $ Potential               : int  94 94 93 93 92 91 91 91 91 93 ...
##  $ Club                    : chr  "FC Barcelona" "Juventus" "Paris Saint-Germain" "Manchester United" ...
##  $ Club.Logo               : chr  "https://cdn.sofifa.org/teams/2/light/241.png" "https://cdn.sofifa.org/teams/2/light/45.png" "https://cdn.sofifa.org/teams/2/light/73.png" "https://cdn.sofifa.org/teams/2/light/11.png" ...
##  $ Value                   : chr  "€110.5M" "€77M" "€118.5M" "€72M" ...
##  $ Wage                    : chr  "€565K" "€405K" "€290K" "€260K" ...
##  $ Special                 : int  2202 2228 2143 1471 2281 2142 2280 2346 2201 1331 ...
##  $ Preferred.Foot          : chr  "Left" "Right" "Right" "Right" ...
##  $ International.Reputation: int  5 5 5 4 4 4 4 5 4 3 ...
##  $ Weak.Foot               : int  4 4 5 3 5 4 4 4 3 3 ...
##  $ Skill.Moves             : int  4 5 5 1 4 4 4 3 3 1 ...
##  $ Work.Rate               : chr  "Medium/ Medium" "High/ Low" "High/ Medium" "Medium/ Medium" ...
##  $ Body.Type               : chr  "Messi" "C. Ronaldo" "Neymar" "Lean" ...
##  $ Real.Face               : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Position                : chr  "RF" "ST" "LW" "GK" ...
##  $ Jersey.Number           : int  10 7 10 1 7 10 10 9 15 1 ...
##  $ Joined                  : chr  "Jul 1, 2004" "Jul 10, 2018" "Aug 3, 2017" "Jul 1, 2011" ...
##  $ Loaned.From             : chr  "" "" "" "" ...
##  $ Contract.Valid.Until    : chr  "2021" "2022" "2022" "2020" ...
##  $ Height                  : chr  "5'7" "6'2" "5'9" "6'4" ...
##  $ Weight                  : chr  "159lbs" "183lbs" "150lbs" "168lbs" ...
##  $ LS                      : chr  "88+2" "91+3" "84+3" "" ...
##  $ ST                      : chr  "88+2" "91+3" "84+3" "" ...
##  $ RS                      : chr  "88+2" "91+3" "84+3" "" ...
##  $ LW                      : chr  "92+2" "89+3" "89+3" "" ...
##  $ LF                      : chr  "93+2" "90+3" "89+3" "" ...
##  $ CF                      : chr  "93+2" "90+3" "89+3" "" ...
##  $ RF                      : chr  "93+2" "90+3" "89+3" "" ...
##  $ RW                      : chr  "92+2" "89+3" "89+3" "" ...
##  $ LAM                     : chr  "93+2" "88+3" "89+3" "" ...
##  $ CAM                     : chr  "93+2" "88+3" "89+3" "" ...
##  $ RAM                     : chr  "93+2" "88+3" "89+3" "" ...
##  $ LM                      : chr  "91+2" "88+3" "88+3" "" ...
##  $ LCM                     : chr  "84+2" "81+3" "81+3" "" ...
##  $ CM                      : chr  "84+2" "81+3" "81+3" "" ...
##  $ RCM                     : chr  "84+2" "81+3" "81+3" "" ...
##  $ RM                      : chr  "91+2" "88+3" "88+3" "" ...
##  $ LWB                     : chr  "64+2" "65+3" "65+3" "" ...
##  $ LDM                     : chr  "61+2" "61+3" "60+3" "" ...
##  $ CDM                     : chr  "61+2" "61+3" "60+3" "" ...
##  $ RDM                     : chr  "61+2" "61+3" "60+3" "" ...
##  $ RWB                     : chr  "64+2" "65+3" "65+3" "" ...
##  $ LB                      : chr  "59+2" "61+3" "60+3" "" ...
##  $ LCB                     : chr  "47+2" "53+3" "47+3" "" ...
##  $ CB                      : chr  "47+2" "53+3" "47+3" "" ...
##  $ RCB                     : chr  "47+2" "53+3" "47+3" "" ...
##  $ RB                      : chr  "59+2" "61+3" "60+3" "" ...
##  $ Crossing                : int  84 84 79 17 93 81 86 77 66 13 ...
##  $ Finishing               : int  95 94 87 13 82 84 72 93 60 11 ...
##  $ HeadingAccuracy         : int  70 89 62 21 55 61 55 77 91 15 ...
##  $ ShortPassing            : int  90 81 84 50 92 89 93 82 78 29 ...
##  $ Volleys                 : int  86 87 84 13 82 80 76 88 66 13 ...
##  $ Dribbling               : int  97 88 96 18 86 95 90 87 63 12 ...
##  $ Curve                   : int  93 81 88 21 85 83 85 86 74 13 ...
##  $ FKAccuracy              : int  94 76 87 19 83 79 78 84 72 14 ...
##  $ LongPassing             : int  87 77 78 51 91 83 88 64 77 26 ...
##  $ BallControl             : int  96 94 95 42 91 94 93 90 84 16 ...
##  $ Acceleration            : int  91 89 94 57 78 94 80 86 76 43 ...
##  $ SprintSpeed             : int  86 91 90 58 76 88 72 75 75 60 ...
##  $ Agility                 : int  91 87 96 60 79 95 93 82 78 67 ...
##  $ Reactions               : int  95 96 94 90 91 90 90 92 85 86 ...
##  $ Balance                 : int  95 70 84 43 77 94 94 83 66 49 ...
##  $ ShotPower               : int  85 95 80 31 91 82 79 86 79 22 ...
##  $ Jumping                 : int  68 95 61 67 63 56 68 69 93 76 ...
##  $ Stamina                 : int  72 88 81 43 90 83 89 90 84 41 ...
##  $ Strength                : int  59 79 49 64 75 66 58 83 83 78 ...
##  $ LongShots               : int  94 93 82 12 91 80 82 85 59 12 ...
##  $ Aggression              : int  48 63 56 38 76 54 62 87 88 34 ...
##  $ Interceptions           : int  22 29 36 30 61 41 83 41 90 19 ...
##  $ Positioning             : int  94 95 89 12 87 87 79 92 60 11 ...
##  $ Vision                  : int  94 82 87 68 94 89 92 84 63 70 ...
##  $ Penalties               : int  75 85 81 40 79 86 82 85 75 11 ...
##  $ Composure               : int  96 95 94 68 88 91 84 85 82 70 ...
##  $ Marking                 : int  33 28 27 15 68 34 60 62 87 27 ...
##  $ StandingTackle          : int  28 31 24 21 58 27 76 45 92 12 ...
##  $ SlidingTackle           : int  26 23 33 13 51 22 73 38 91 18 ...
##  $ GKDiving                : int  6 7 9 90 15 11 13 27 11 86 ...
##  $ GKHandling              : int  11 11 9 85 13 12 9 25 8 92 ...
##  $ GKKicking               : int  15 15 15 87 5 6 7 31 9 78 ...
##  $ GKPositioning           : int  14 14 15 88 10 8 14 33 7 88 ...
##  $ GKReflexes              : int  8 11 11 94 13 8 9 37 11 89 ...
##  $ Release.Clause          : chr  "€226.5M" "€127.1M" "€228.1M" "€138.6M" ...
#Clean NA
df <- na.omit(df)

#Select the continuous variables that describe the characteristics of players
main <- df[,55:88]
summary(main)
##     Crossing       Finishing     HeadingAccuracy  ShortPassing     Volleys     
##  Min.   : 5.00   Min.   : 2.00   Min.   : 4.0    Min.   : 7.0   Min.   : 4.00  
##  1st Qu.:38.00   1st Qu.:30.00   1st Qu.:44.0    1st Qu.:54.0   1st Qu.:30.00  
##  Median :54.00   Median :49.00   Median :56.0    Median :62.0   Median :44.00  
##  Mean   :49.74   Mean   :45.55   Mean   :52.3    Mean   :58.7   Mean   :42.91  
##  3rd Qu.:64.00   3rd Qu.:62.00   3rd Qu.:64.0    3rd Qu.:68.0   3rd Qu.:57.00  
##  Max.   :93.00   Max.   :95.00   Max.   :94.0    Max.   :93.0   Max.   :90.00  
##    Dribbling         Curve         FKAccuracy     LongPassing   
##  Min.   : 4.00   Min.   : 6.00   Min.   : 3.00   Min.   : 9.00  
##  1st Qu.:49.00   1st Qu.:34.00   1st Qu.:31.00   1st Qu.:43.00  
##  Median :61.00   Median :48.00   Median :41.00   Median :56.00  
##  Mean   :55.38   Mean   :47.18   Mean   :42.87   Mean   :52.72  
##  3rd Qu.:68.00   3rd Qu.:62.00   3rd Qu.:57.00   3rd Qu.:64.00  
##  Max.   :97.00   Max.   :94.00   Max.   :94.00   Max.   :93.00  
##   BallControl     Acceleration    SprintSpeed       Agility       Reactions    
##  Min.   : 5.00   Min.   :12.00   Min.   :12.00   Min.   :14.0   Min.   :21.00  
##  1st Qu.:54.00   1st Qu.:57.00   1st Qu.:57.00   1st Qu.:55.0   1st Qu.:56.00  
##  Median :63.00   Median :67.00   Median :67.00   Median :66.0   Median :62.00  
##  Mean   :58.37   Mean   :64.61   Mean   :64.73   Mean   :63.5   Mean   :61.84  
##  3rd Qu.:69.00   3rd Qu.:75.00   3rd Qu.:75.00   3rd Qu.:74.0   3rd Qu.:68.00  
##  Max.   :96.00   Max.   :97.00   Max.   :96.00   Max.   :96.0   Max.   :96.00  
##     Balance        ShotPower        Jumping         Stamina     
##  Min.   :16.00   Min.   : 2.00   Min.   :15.00   Min.   :12.00  
##  1st Qu.:56.00   1st Qu.:45.00   1st Qu.:58.00   1st Qu.:56.00  
##  Median :66.00   Median :59.00   Median :66.00   Median :66.00  
##  Mean   :63.96   Mean   :55.47   Mean   :65.09   Mean   :63.22  
##  3rd Qu.:74.00   3rd Qu.:68.00   3rd Qu.:73.00   3rd Qu.:74.00  
##  Max.   :96.00   Max.   :95.00   Max.   :95.00   Max.   :96.00  
##     Strength       LongShots       Aggression    Interceptions   Positioning   
##  Min.   :17.00   Min.   : 3.00   Min.   :11.00   Min.   : 3.0   Min.   : 2.00  
##  1st Qu.:58.00   1st Qu.:33.00   1st Qu.:44.00   1st Qu.:26.0   1st Qu.:38.00  
##  Median :67.00   Median :51.00   Median :59.00   Median :52.0   Median :55.00  
##  Mean   :65.32   Mean   :47.11   Mean   :55.88   Mean   :46.7   Mean   :49.96  
##  3rd Qu.:74.00   3rd Qu.:62.00   3rd Qu.:69.00   3rd Qu.:64.0   3rd Qu.:64.00  
##  Max.   :97.00   Max.   :94.00   Max.   :95.00   Max.   :92.0   Max.   :95.00  
##      Vision        Penalties       Composure        Marking      StandingTackle
##  Min.   :10.00   Min.   : 5.00   Min.   : 3.00   Min.   : 3.00   Min.   : 2.0  
##  1st Qu.:44.00   1st Qu.:39.00   1st Qu.:51.00   1st Qu.:30.00   1st Qu.:27.0  
##  Median :55.00   Median :49.00   Median :60.00   Median :53.00   Median :55.0  
##  Mean   :53.41   Mean   :48.55   Mean   :58.65   Mean   :47.29   Mean   :47.7  
##  3rd Qu.:64.00   3rd Qu.:60.00   3rd Qu.:67.00   3rd Qu.:64.00   3rd Qu.:66.0  
##  Max.   :94.00   Max.   :92.00   Max.   :96.00   Max.   :94.00   Max.   :93.0  
##  SlidingTackle      GKDiving       GKHandling      GKKicking    
##  Min.   : 3.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:24.00   1st Qu.: 8.00   1st Qu.: 8.00   1st Qu.: 8.00  
##  Median :52.00   Median :11.00   Median :11.00   Median :11.00  
##  Mean   :45.67   Mean   :16.62   Mean   :16.39   Mean   :16.23  
##  3rd Qu.:64.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00  
##  Max.   :91.00   Max.   :90.00   Max.   :92.00   Max.   :91.00  
##  GKPositioning     GKReflexes   
##  Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 8.00   1st Qu.: 8.00  
##  Median :11.00   Median :11.00  
##  Mean   :16.39   Mean   :16.71  
##  3rd Qu.:14.00   3rd Qu.:14.00  
##  Max.   :90.00   Max.   :94.00
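
Selecting columns by position (55 to 88) depends on the exact column order of the file. As a minimal sketch, the same cleaning and selection can be written with column names instead. The toy data frame and its values below are made up for illustration; only the column names come from the dataset.

```r
# Toy data frame standing in for the FIFA table (hypothetical values).
toy <- data.frame(
  Name           = c("P1", "P2", "P3"),
  Crossing       = c(84, NA, 79),
  Finishing      = c(95, 94, 87),
  GKReflexes     = c(8, 11, 11),
  Release.Clause = c("a", "b", "c")
)

# Drop rows with any missing value, as na.omit(df) does in the report.
toy <- na.omit(toy)

# Locate the first and last skill columns by name instead of by position.
first_col <- match("Crossing", names(toy))
last_col  <- match("GKReflexes", names(toy))
skills <- toy[, first_col:last_col]

# Every selected column should be numeric before running PCA.
stopifnot(all(sapply(skills, is.numeric)))
print(dim(skills))
```

On the real data, the same two match() calls would give 55 and 88 if the column order matches the str() output shown earlier.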

Next, I check the relationships between the variables and normalize the data.

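#check relations between variables
library(corrplot)  #provides corrplot()
#reshape2 must be installed; acast() is called below as reshape2::acast()
corr_simple <- function(data=main,sig=0.5){
  #use the function argument rather than the global object
  corr <- cor(data)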
  #prepare to drop duplicates and correlations of 1     
  corr[lower.tri(corr,diag=TRUE)] <- NA 
  #drop perfect correlations
  corr[corr == 1] <- NA 
  #turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  #remove the NA values from above 
  corr <- na.omit(corr) 
  #keep correlations whose absolute value exceeds the threshold sig
  corr <- subset(corr, abs(Freq) > sig) 
  #sort by highest correlation
  corr <- corr[order(-abs(corr$Freq)),] 
  #turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  
  #plot correlations visually
  corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ")
}
corr_simple()

# normalization of data (z-scores: mean 0, standard deviation 1)
library(caret)  #provides preProcess()
main_preproc <- preProcess(main, method=c("center", "scale"))
main_z <- predict(main_preproc, main)
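
The centring and scaling step can be checked by hand. This is a minimal sketch in base R, assuming the "center" and "scale" methods subtract the column mean and divide by the sample standard deviation (as base R's sd() computes it). The four rows of ratings are a small illustrative sample, not the full dataset.

```r
# Small illustrative sample of two skill ratings.
x <- data.frame(Crossing = c(84, 79, 17, 93), Finishing = c(95, 87, 13, 82))

# z-score: subtract the column mean, divide by the column standard deviation.
z_manual <- as.data.frame(lapply(x, function(col) (col - mean(col)) / sd(col)))

# Base R's scale() performs the same centring and scaling.
z_scale <- as.data.frame(scale(x))

print(round(colMeans(z_manual), 10))   # column means after centring
print(sapply(z_manual, sd))            # column standard deviations after scaling
```

After this step every variable has the same scale, so no single metric dominates the principal components only because of its range.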

Showing all 34 variables at once would make the plot unreadable, so I display only the correlations whose absolute value is greater than 0.5, excluding perfect correlations of 1. The map shows that “Composure”, the ability to play under pressure, and “Positioning” are positively correlated with many other skills. Goalkeeper is a special position: the goalkeeper-specific skills are negatively correlated with the skills used in other positions.
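
The filtering step on its own can be sketched in base R, without the plotting packages. The built-in mtcars data stands in for the player metrics here (an assumption made only so the example runs without the FIFA file). It shows that strong negative correlations pass the filter as well as positive ones.

```r
# Built-in mtcars stands in for the player metrics (illustration only).
corr <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Keep each pair once: blank out the lower triangle and the diagonal.
corr[lower.tri(corr, diag = TRUE)] <- NA

# Long format: one row per variable pair.
pairs <- na.omit(as.data.frame(as.table(corr)))

# Retain pairs whose absolute correlation exceeds the 0.5 threshold,
# sorted from strongest to weakest.
strong <- subset(pairs, abs(Freq) > 0.5)
strong <- strong[order(-abs(strong$Freq)), ]
print(strong)
```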

C. PCA application

With more than 18,000 players, MDS is very expensive in terms of computing cost, because it works on the distances between every pair of players. Therefore, in this assignment, I choose the PCA method.
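
A rough back-of-the-envelope calculation illustrates the difference in scale. It assumes distances are stored as double-precision values (8 bytes each) and uses the 18,207 rows of the raw file; the row count after removing NA values is not shown in this report and would be somewhat lower.

```r
n <- 18207                        # players in the raw file, before removing NA rows
n_pairs <- n * (n - 1) / 2        # distinct player pairs in a distance matrix
bytes   <- n_pairs * 8            # one double-precision value per pair
gib     <- bytes / 1024^3

# PCA on p variables only needs a p x p covariance (or correlation) matrix.
p <- 34
pca_entries <- p * p

cat(sprintf("pairwise distances: %.0f (about %.2f GiB)\n", n_pairs, gib))
cat(sprintf("correlation matrix entries: %d\n", pca_entries))
```

The distance matrix grows with the square of the number of players, while the matrix PCA decomposes depends only on the number of variables.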

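#apply PCA (the data are already centred and scaled, so prcomp does not repeat it)
library(factoextra)  #provides fviz_*() and get_eigenvalue()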
main_z.pca <- prcomp(main_z, center = FALSE,scale. = FALSE)
summary(main_z.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     4.3419 2.2158 1.58430 1.30628 1.14426 0.79216 0.66251
## Proportion of Variance 0.5545 0.1444 0.07382 0.05019 0.03851 0.01846 0.01291
## Cumulative Proportion  0.5545 0.6989 0.77269 0.82288 0.86139 0.87985 0.89276
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.59924 0.54908 0.52604 0.49891 0.48207 0.47250 0.45325
## Proportion of Variance 0.01056 0.00887 0.00814 0.00732 0.00683 0.00657 0.00604
## Cumulative Proportion  0.90332 0.91218 0.92032 0.92764 0.93448 0.94105 0.94709
##                           PC15    PC16   PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.44747 0.42779 0.4125 0.37214 0.35811 0.35283 0.34384
## Proportion of Variance 0.00589 0.00538 0.0050 0.00407 0.00377 0.00366 0.00348
## Cumulative Proportion  0.95298 0.95836 0.9634 0.96744 0.97121 0.97487 0.97835
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.32802 0.30413 0.28872 0.26866 0.26127 0.25516 0.24776
## Proportion of Variance 0.00316 0.00272 0.00245 0.00212 0.00201 0.00191 0.00181
## Cumulative Proportion  0.98151 0.98423 0.98669 0.98881 0.99082 0.99273 0.99454
##                          PC29   PC30    PC31    PC32    PC33    PC34
## Standard deviation     0.1937 0.1931 0.17616 0.17426 0.16318 0.15144
## Proportion of Variance 0.0011 0.0011 0.00091 0.00089 0.00078 0.00067
## Cumulative Proportion  0.9956 0.9967 0.99765 0.99854 0.99933 1.00000
#plot
fviz_pca_var(main_z.pca, col.var="steelblue")
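
The “Proportion of Variance” row in the summary above follows directly from the standard deviations. As a small sketch, the first six values are copied from that output; the total variance of 34 assumes all 34 variables were standardised to unit variance.

```r
# First six standard deviations copied from the summary output above.
sdev <- c(4.3419, 2.2158, 1.58430, 1.30628, 1.14426, 0.79216)

# An eigenvalue is the squared standard deviation of its component.
eigenvalues <- sdev^2

# With 34 standardised variables the total variance is 34.
total_variance <- 34
prop_var <- eigenvalues / total_variance

print(round(prop_var, 4))
print(round(cumsum(prop_var), 4))
```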

# visualisation of quality
fviz_eig(main_z.pca, choice='eigenvalue') # eigenvalues on y-axis

eig.val<-get_eigenvalue(main_z.pca)
eig.val
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  18.85180556      55.44648694                    55.44649
## Dim.2   4.90975256      14.44044871                    69.88694
## Dim.3   2.51001675       7.38240221                    77.26934
## Dim.4   1.70636728       5.01872730                    82.28807
## Dim.5   1.30932681       3.85096121                    86.13903
## Dim.6   0.62751831       1.84564208                    87.98467
## Dim.7   0.43892341       1.29095120                    89.27562
## Dim.8   0.35909034       1.05614805                    90.33177
## Dim.9   0.30148807       0.88672962                    91.21850
## Dim.10  0.27671920       0.81388001                    92.03238
## Dim.11  0.24890902       0.73208534                    92.76446
## Dim.12  0.23238685       0.68349072                    93.44795
## Dim.13  0.22325759       0.65663998                    94.10459
## Dim.14  0.20543604       0.60422363                    94.70882
## Dim.15  0.20022505       0.58889719                    95.29771
## Dim.16  0.18300700       0.53825590                    95.83597
## Dim.17  0.17016869       0.50049615                    96.33647
## Dim.18  0.13848727       0.40731551                    96.74378
## Dim.19  0.12824515       0.37719161                    97.12097
## Dim.20  0.12448956       0.36614576                    97.48712
## Dim.21  0.11822780       0.34772881                    97.83485
## Dim.22  0.10759798       0.31646466                    98.15131
## Dim.23  0.09249322       0.27203888                    98.42335
## Dim.24  0.08336001       0.24517651                    98.66853
## Dim.25  0.07218008       0.21229437                    98.88082
## Dim.26  0.06826044       0.20076600                    99.08159
## Dim.27  0.06510532       0.19148625                    99.27307
## Dim.28  0.06138729       0.18055086                    99.45363
## Dim.29  0.03751863       0.11034891                    99.56397
## Dim.30  0.03728700       0.10966764                    99.67364
## Dim.31  0.03103274       0.09127276                    99.76491
## Dim.32  0.03036686       0.08931430                    99.85423
## Dim.33  0.02662815       0.07831808                    99.93255
## Dim.34  0.02293397       0.06745286                   100.00000

By the Kaiser criterion, the first 5 dimensions are selected for further evaluation, as their eigenvalues are greater than the mean of all eigenvalues (which is 1 for standardized variables). In addition, Dim.1 and Dim.2 are the two most important dimensions and deserve the most attention, as together they account for almost 70% of the variance (69.9%).
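
The selection rule can be reproduced from the eigenvalue table. In this sketch the 34 eigenvalues are copied from the output above and rounded to four decimals, so the mean is only approximately 1.

```r
# Eigenvalues copied from the get_eigenvalue() table above, rounded to 4 decimals.
eig <- c(18.8518, 4.9098, 2.5100, 1.7064, 1.3093, 0.6275, 0.4389, 0.3591,
         0.3015, 0.2767, 0.2489, 0.2324, 0.2233, 0.2054, 0.2002, 0.1830,
         0.1702, 0.1385, 0.1282, 0.1245, 0.1182, 0.1076, 0.0925, 0.0834,
         0.0722, 0.0683, 0.0651, 0.0614, 0.0375, 0.0373, 0.0310, 0.0304,
         0.0266, 0.0229)

# Kaiser criterion: keep components whose eigenvalue exceeds the mean eigenvalue.
threshold <- mean(eig)            # close to 1 because the variables are standardised
keep <- which(eig > threshold)

# Share of the total variance retained by the kept components.
cum_pct <- 100 * cumsum(eig) / sum(eig)

print(keep)
print(round(cum_pct[max(keep)], 1))
```

The fifth eigenvalue (about 1.31) is above the threshold and the sixth (about 0.63) is below it, which is why the cut falls after Dim.5.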

  1. Focusing on Dim.1
a<-fviz_contrib(main_z.pca, "var", axes=1, xtickslab.rt=90)
a

The contribution plot shows which variables contribute the most to Dim.1; in other words, Dim.1 largely reflects the values of these variables.

This dimension can be called “Attack”. BallControl, Dribbling and ShortPassing are the most important variables in this dimension.
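
The contributions can also be computed by hand, as a sketch. It assumes that, for prcomp output, the contribution of a variable to a component (in percent) equals 100 times its squared loading, and that the reference line in the contribution plot marks the uniform share of 100 divided by the number of variables. The built-in USArrests data stands in for the player metrics so the example runs without the FIFA file.

```r
# Built-in USArrests stands in for the player metrics (illustration only).
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each column of pca$rotation is a unit vector, so squared loadings sum to 1.
contrib <- 100 * pca$rotation^2

# Expected contribution if all variables contributed equally.
expected <- 100 / nrow(pca$rotation)

# Contributions to the first component, largest first.
pc1 <- sort(contrib[, "PC1"], decreasing = TRUE)
print(round(pc1, 1))

# Variables above the uniform share.
print(names(pc1)[pc1 > expected])
```

With 34 variables, the uniform share would be 100 / 34, or about 2.9%, so a variable above that level contributes more than average to the dimension.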

  2. Focusing on Dim.2
b<-fviz_contrib(main_z.pca, "var", axes=2, xtickslab.rt=90)
b

The contribution plot shows which variables contribute the most to Dim.2.

This dimension can be called “Defence”, as the 4 most important variables contributing to it are SlidingTackle, StandingTackle, Interceptions and Marking.

  3. Focusing on Dim.3
#named contrib_dim3 rather than c to avoid masking base R's c()
contrib_dim3<-fviz_contrib(main_z.pca, "var", axes=3, xtickslab.rt=90)
contrib_dim3

The contribution plot shows which variables contribute the most to Dim.3.

This dimension can be called “Goalkeeper”. Alongside the set of goalkeeper skills, it includes Reactions and Composure, which are also important to a goalkeeper. In addition, those 2 variables contribute the most to this dimension.

  4. Focusing on Dim.4
d<-fviz_contrib(main_z.pca, "var", axes=4, xtickslab.rt=90)
d

In Dim.4, the following variables contribute the most:
- Acceleration
- Agility
- Balance
- Crossing
- Finishing

The 4 attributes contributing the most are Agility, Balance, HeadingAccuracy and Strength. So, this dimension can be called “Physique”.

  5. Focusing on Dim.5
e<-fviz_contrib(main_z.pca, "var", axes=5, xtickslab.rt=90)
e

The contribution plot shows which variables contribute the most to Dim.5.

The top contributors are Acceleration, Jumping and SprintSpeed. Therefore, the dimension can be named “Athletics”.

D. Conclusion

Thanks to the PCA method, the 34 metrics can be reduced to 5 dimensions: Attack, Defence, Goalkeeper, Physique and Athletics, which together retain about 86% of the variance. Players can be evaluated either by looking at the 34 detailed attributes or, more quickly, by analysing these 5 dimensions.
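
To actually work with the reduced representation, each observation needs its coordinates on the retained components. This is a minimal sketch using the built-in USArrests data in place of the standardised player metrics, keeping 2 components because the toy data has only 4 variables. The column labels are hypothetical placeholders, analogous to naming the FIFA dimensions.

```r
# Built-in USArrests stands in for the standardised player metrics (illustration only).
z <- scale(USArrests)
pca <- prcomp(z, center = FALSE, scale. = FALSE)

# Keep the first k components; the report keeps 5, this toy data allows k = 2.
k <- 2
scores <- as.data.frame(pca$x[, 1:k])

# Hypothetical labels, analogous to naming the FIFA dimensions.
names(scores) <- c("Dim1_label", "Dim2_label")

# Scores are the standardised data multiplied by the loadings.
manual <- z %*% pca$rotation[, 1:k]

print(head(scores, 3))
```

For the FIFA data the corresponding object would be the first five columns of main_z.pca$x, one row per player.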