Dimension Reduction

Introduction

In this project, we apply dimension reduction techniques to football player attributes. We use two techniques, PCA and t-SNE, and compare them on two criteria:

Running time;
How well each method separates player positions in a visualization.

Libraries

The code in this report calls functions from the following packages (the list is inferred from those calls):

library(dplyr)       # left_join()
library(corrplot)    # corrplot()
library(factoextra)  # fviz_eig(), fviz_contrib()
library(gridExtra)   # grid.arrange()
library(Rtsne)       # Rtsne()

Data Understanding

Our data consists of detailed attributes for every player registered in the latest edition of the FIFA 19 database. It was obtained from Kaggle (K. Gadiya 2019) and prepared for analysis in my previous project (https://rpubs.com/nijat_g/data_prep).

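# read.csv2() assumes a decimal comma (dec = ","), which is likely why
# decimal columns such as Height are read as character below.
data <- read.csv2("Prepared_data.csv", sep = ",")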
str(data)

## 'data.frame':    18207 obs. of  58 variables:
##  $ ID                             : int  158023 20801 190871 193080 192985 183277 177003 176580 155862 200389 ...
##  $ Name                           : chr  "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "De Gea" ...
##  $ Age                            : int  31 33 26 27 27 27 32 31 32 25 ...
##  $ Nationality                    : chr  "Argentina" "Portugal" "Brazil" "Spain" ...
##  $ Overall                        : int  94 94 92 91 91 91 91 91 91 90 ...
##  $ Potential                      : int  94 94 93 93 92 91 91 91 91 93 ...
##  $ Club                           : chr  "FC Barcelona" "Juventus" "Paris Saint-Germain" "Manchester United" ...
##  $ Value.in.million.euros         : chr  "110.5" "77" "118.5" "72" ...
##  $ Wage.in.thousand.euros         : int  565 405 290 260 355 340 420 455 380 94 ...
##  $ Special                        : int  2202 2228 2143 1471 2281 2142 2280 2346 2201 1331 ...
##  $ Preferred.Foot                 : chr  "Left" "Right" "Right" "Right" ...
##  $ International.Reputation       : int  5 5 5 4 4 4 4 5 4 3 ...
##  $ Weak.Foot                      : int  4 4 5 3 5 4 4 4 3 3 ...
##  $ Skill.Moves                    : int  4 5 5 1 4 4 4 3 3 1 ...
##  $ Work.Rate                      : chr  "Medium/ Medium" "High/ Low" "High/ Medium" "Medium/ Medium" ...
##  $ Real.Face                      : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Position                       : chr  "RF" "ST" "LW" "GK" ...
##  $ Jersey.Number                  : int  10 7 10 1 7 10 10 9 15 1 ...
##  $ Joined                         : chr  "1-Jul-04" "10-Jul-18" "3-Aug-17" "1-Jul-11" ...
##  $ Loaned.From                    : chr  "" "" "" "" ...
##  $ Contract.Valid.Until           : chr  "2021" "2022" "2022" "2020" ...
##  $ Weight                         : int  159 183 150 168 154 163 146 190 181 192 ...
##  $ Crossing                       : int  84 84 79 17 93 81 86 77 66 13 ...
##  $ Finishing                      : int  95 94 87 13 82 84 72 93 60 11 ...
##  $ HeadingAccuracy                : int  70 89 62 21 55 61 55 77 91 15 ...
##  $ ShortPassing                   : int  90 81 84 50 92 89 93 82 78 29 ...
##  $ Volleys                        : int  86 87 84 13 82 80 76 88 66 13 ...
##  $ Dribbling                      : int  97 88 96 18 86 95 90 87 63 12 ...
##  $ Curve                          : int  93 81 88 21 85 83 85 86 74 13 ...
##  $ FKAccuracy                     : int  94 76 87 19 83 79 78 84 72 14 ...
##  $ LongPassing                    : int  87 77 78 51 91 83 88 64 77 26 ...
##  $ BallControl                    : int  96 94 95 42 91 94 93 90 84 16 ...
##  $ Acceleration                   : int  91 89 94 57 78 94 80 86 76 43 ...
##  $ SprintSpeed                    : int  86 91 90 58 76 88 72 75 75 60 ...
##  $ Agility                        : int  91 87 96 60 79 95 93 82 78 67 ...
##  $ Reactions                      : int  95 96 94 90 91 90 90 92 85 86 ...
##  $ Balance                        : int  95 70 84 43 77 94 94 83 66 49 ...
##  $ ShotPower                      : int  85 95 80 31 91 82 79 86 79 22 ...
##  $ Jumping                        : int  68 95 61 67 63 56 68 69 93 76 ...
##  $ Stamina                        : int  72 88 81 43 90 83 89 90 84 41 ...
##  $ Strength                       : int  59 79 49 64 75 66 58 83 83 78 ...
##  $ LongShots                      : int  94 93 82 12 91 80 82 85 59 12 ...
##  $ Aggression                     : int  48 63 56 38 76 54 62 87 88 34 ...
##  $ Interceptions                  : int  22 29 36 30 61 41 83 41 90 19 ...
##  $ Positioning                    : int  94 95 89 12 87 87 79 92 60 11 ...
##  $ Vision                         : int  94 82 87 68 94 89 92 84 63 70 ...
##  $ Penalties                      : int  75 85 81 40 79 86 82 85 75 11 ...
##  $ Composure                      : int  96 95 94 68 88 91 84 85 82 70 ...
##  $ Marking                        : int  33 28 27 15 68 34 60 62 87 27 ...
##  $ StandingTackle                 : int  28 31 24 21 58 27 76 45 92 12 ...
##  $ SlidingTackle                  : int  26 23 33 13 51 22 73 38 91 18 ...
##  $ GKDiving                       : int  6 7 9 90 15 11 13 27 11 86 ...
##  $ GKHandling                     : int  11 11 9 85 13 12 9 25 8 92 ...
##  $ GKKicking                      : int  15 15 15 87 5 6 7 31 9 78 ...
##  $ GKPositioning                  : int  14 14 15 88 10 8 14 33 7 88 ...
##  $ GKReflexes                     : int  8 11 11 94 13 8 9 37 11 89 ...
##  $ Release.clause.in.million.euros: chr  "226.5" "127.1" "228.1" "138.6" ...
##  $ Height                         : chr  "170.18" "187.96" "175.26" "193.04" ...

For this project, we will use only the Position column and the player attribute columns. I will also remove rows that contain missing values so that the models in later steps can be built.

data <- data[,c(17, 23:56)]
data <- data[complete.cases(data),]
dim(data)

## [1] 18159    35
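
Selecting columns by position works, but it breaks silently if the column order ever changes. It is also worth knowing that complete.cases() only detects NA, so an empty string in a character column such as Position would not be dropped. The sketch below illustrates both points on a small made-up data frame. The `players` table and its values are hypothetical, and whether the real data contains empty positions is not shown in this report.

```r
# Hypothetical toy data standing in for the prepared FIFA table.
players <- data.frame(
  Position = c("ST", "GK", "CB", ""),
  Crossing = c(80, 15, 40, 55),
  GKDiving = c(10, 88, NA, 12),
  Club     = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Select columns by name instead of by index.
keep <- c("Position", "Crossing", "GKDiving")
subset_df <- players[, keep]

# complete.cases() only looks for NA, so the empty Position survives here.
n_before <- sum(complete.cases(subset_df))

# Recode empty strings to NA so they are treated as missing too.
subset_df$Position[subset_df$Position == ""] <- NA
subset_df <- subset_df[complete.cases(subset_df), ]
dim(subset_df)
```

In this toy case three rows pass the first check, and two remain once the empty position is recoded to NA.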

summary(data)

##    Position            Crossing       Finishing     HeadingAccuracy
##  Length:18159       Min.   : 5.00   Min.   : 2.00   Min.   : 4.0   
##  Class :character   1st Qu.:38.00   1st Qu.:30.00   1st Qu.:44.0   
##  Mode  :character   Median :54.00   Median :49.00   Median :56.0   
##                     Mean   :49.73   Mean   :45.55   Mean   :52.3   
##                     3rd Qu.:64.00   3rd Qu.:62.00   3rd Qu.:64.0   
##                     Max.   :93.00   Max.   :95.00   Max.   :94.0   
##   ShortPassing      Volleys        Dribbling         Curve      
##  Min.   : 7.00   Min.   : 4.00   Min.   : 4.00   Min.   : 6.00  
##  1st Qu.:54.00   1st Qu.:30.00   1st Qu.:49.00   1st Qu.:34.00  
##  Median :62.00   Median :44.00   Median :61.00   Median :48.00  
##  Mean   :58.69   Mean   :42.91   Mean   :55.37   Mean   :47.17  
##  3rd Qu.:68.00   3rd Qu.:57.00   3rd Qu.:68.00   3rd Qu.:62.00  
##  Max.   :93.00   Max.   :90.00   Max.   :97.00   Max.   :94.00  
##    FKAccuracy     LongPassing     BallControl     Acceleration  
##  Min.   : 3.00   Min.   : 9.00   Min.   : 5.00   Min.   :12.00  
##  1st Qu.:31.00   1st Qu.:43.00   1st Qu.:54.00   1st Qu.:57.00  
##  Median :41.00   Median :56.00   Median :63.00   Median :67.00  
##  Mean   :42.86   Mean   :52.71   Mean   :58.37   Mean   :64.61  
##  3rd Qu.:57.00   3rd Qu.:64.00   3rd Qu.:69.00   3rd Qu.:75.00  
##  Max.   :94.00   Max.   :93.00   Max.   :96.00   Max.   :97.00  
##   SprintSpeed       Agility       Reactions        Balance        ShotPower    
##  Min.   :12.00   Min.   :14.0   Min.   :21.00   Min.   :16.00   Min.   : 2.00  
##  1st Qu.:57.00   1st Qu.:55.0   1st Qu.:56.00   1st Qu.:56.00   1st Qu.:45.00  
##  Median :67.00   Median :66.0   Median :62.00   Median :66.00   Median :59.00  
##  Mean   :64.73   Mean   :63.5   Mean   :61.84   Mean   :63.97   Mean   :55.46  
##  3rd Qu.:75.00   3rd Qu.:74.0   3rd Qu.:68.00   3rd Qu.:74.00   3rd Qu.:68.00  
##  Max.   :96.00   Max.   :96.0   Max.   :96.00   Max.   :96.00   Max.   :95.00  
##     Jumping         Stamina         Strength       LongShots    
##  Min.   :15.00   Min.   :12.00   Min.   :17.00   Min.   : 3.00  
##  1st Qu.:58.00   1st Qu.:56.00   1st Qu.:58.00   1st Qu.:33.00  
##  Median :66.00   Median :66.00   Median :67.00   Median :51.00  
##  Mean   :65.09   Mean   :63.22   Mean   :65.31   Mean   :47.11  
##  3rd Qu.:73.00   3rd Qu.:74.00   3rd Qu.:74.00   3rd Qu.:62.00  
##  Max.   :95.00   Max.   :96.00   Max.   :97.00   Max.   :94.00  
##    Aggression    Interceptions   Positioning        Vision       Penalties    
##  Min.   :11.00   Min.   : 3.0   Min.   : 2.00   Min.   :10.0   Min.   : 5.00  
##  1st Qu.:44.00   1st Qu.:26.0   1st Qu.:38.00   1st Qu.:44.0   1st Qu.:39.00  
##  Median :59.00   Median :52.0   Median :55.00   Median :55.0   Median :49.00  
##  Mean   :55.87   Mean   :46.7   Mean   :49.96   Mean   :53.4   Mean   :48.55  
##  3rd Qu.:69.00   3rd Qu.:64.0   3rd Qu.:64.00   3rd Qu.:64.0   3rd Qu.:60.00  
##  Max.   :95.00   Max.   :92.0   Max.   :95.00   Max.   :94.0   Max.   :92.00  
##    Composure        Marking      StandingTackle SlidingTackle      GKDiving    
##  Min.   : 3.00   Min.   : 3.00   Min.   : 2.0   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:51.00   1st Qu.:30.00   1st Qu.:27.0   1st Qu.:24.00   1st Qu.: 8.00  
##  Median :60.00   Median :53.00   Median :55.0   Median :52.00   Median :11.00  
##  Mean   :58.65   Mean   :47.28   Mean   :47.7   Mean   :45.66   Mean   :16.62  
##  3rd Qu.:67.00   3rd Qu.:64.00   3rd Qu.:66.0   3rd Qu.:64.00   3rd Qu.:14.00  
##  Max.   :96.00   Max.   :94.00   Max.   :93.0   Max.   :91.00   Max.   :90.00  
##    GKHandling      GKKicking     GKPositioning     GKReflexes   
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 8.00   1st Qu.: 8.00   1st Qu.: 8.00   1st Qu.: 8.00  
##  Median :11.00   Median :11.00   Median :11.00   Median :11.00  
##  Mean   :16.39   Mean   :16.23   Mean   :16.39   Mean   :16.71  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00  
##  Max.   :92.00   Max.   :91.00   Max.   :90.00   Max.   :94.00

Each player has 34 attributes, each measured on a scale from 0 to 100. The image below explains the position abbreviations.

Positions

pl_cor <- cor(data[-1], method="pearson") 
corrplot(pl_cor, type = "lower", order ="alphabet", tl.cex=0.6)

From the correlation matrix, we can see that some variables are highly correlated. For example, a player with high ball control tends to have high dribbling ability. Conversely, a player with high goalkeeper abilities has probably not trained much for dribbling, which is why his dribbling rating is lower.
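
Reading strong correlations off a plot of 34 variables is hard, so it can help to list the strongest pairs as a table. The following is a minimal sketch on simulated data, not on the FIFA data itself. The `toy` data frame is hypothetical and is built so that Dribbling follows BallControl while GKDiving opposes it.

```r
set.seed(1)
n <- 200
ball_control <- rnorm(n, 60, 10)

# Hypothetical ratings with a known correlation structure.
toy <- data.frame(
  BallControl = ball_control,
  Dribbling   = ball_control + rnorm(n, 0, 3),        # tracks BallControl
  GKDiving    = 100 - ball_control + rnorm(n, 0, 3),  # opposes BallControl
  Jumping     = rnorm(n, 65, 10)                      # unrelated noise
)

cor_mat <- cor(toy, method = "pearson")

# Keep each pair once by blanking the diagonal and the upper triangle.
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA
cor_pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]

# Sort by absolute correlation, strongest first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 3)
```

The same steps should work on `pl_cor`, since it is an ordinary correlation matrix with row and column names.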

PCA

First, I will build a PCA model. Before building the model, however, I want to map each player’s position to a broader role to simplify the analysis and comparison:

ST - Striker
MF - Midfielder
DF - Defender
GK - Goalkeeper

Role <- data.frame(Position = c("LS","ST","RS",
                                "LW","LF","CF","RF","RW",
                                "LAM","RAM","CAM",
                                "LM","LCM","CM","RCM","RM",
                                "LWB","LDM","CDM","RDM","RWB",
                                "LB","LCB","CB","RCB","RB",
                                "GK"), 
                   Role = c(rep("ST",8),rep("MF",13),rep("DF",5),"GK"))

data <- left_join(data,Role, by = "Position")
data <- data[,-1]
data$Role <- as.factor(data$Role)
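
A left join leaves Role as NA for any Position that is missing from the lookup table, so it is worth checking for unmatched rows after a step like this. The sketch below shows one way to do that on made-up data. It uses base R match() as a stand-in for the join, and the `role_map` and `players` tables are hypothetical.

```r
# Hypothetical lookup table and player list.
role_map <- data.frame(
  Position = c("ST", "CM", "CB", "GK"),
  Role     = c("ST", "MF", "DF", "GK"),
  stringsAsFactors = FALSE
)
players <- data.frame(
  Position = c("ST", "CB", "GK", "", "CM"),
  stringsAsFactors = FALSE
)

# match() returns NA for positions absent from the lookup table,
# which mirrors what a left join does with unmatched keys.
players$Role <- role_map$Role[match(players$Position, role_map$Position)]

# Count roles, including rows that did not match.
table(players$Role, useNA = "ifany")

# List the position codes that failed to match.
unmatched <- unique(players$Position[is.na(players$Role)])
unmatched
```

On the real data, the equivalent check would be `table(data$Role, useNA = "ifany")` right after the join.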

Now we can apply PCA to the data for dimension reduction.

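# prcomp() centers the variables by default; scale. is left at FALSE,
# so the components are expressed in the original rating units.
pl_pca <- prcomp(data[,-35])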
summary(pl_pca)

## Importance of components:
##                            PC1     PC2      PC3      PC4      PC5      PC6
## Standard deviation     74.4271 42.1554 23.47371 20.60433 15.86298 10.74088
## Proportion of Variance  0.5704  0.1830  0.05674  0.04372  0.02591  0.01188
## Cumulative Proportion   0.5704  0.7534  0.81018  0.85389  0.87981  0.89169
##                           PC7     PC8    PC9    PC10   PC11    PC12    PC13
## Standard deviation     9.9989 8.90424 8.7571 8.37101 8.3051 7.84047 7.11906
## Proportion of Variance 0.0103 0.00816 0.0079 0.00722 0.0071 0.00633 0.00522
## Cumulative Proportion  0.9020 0.91015 0.9180 0.92526 0.9324 0.93869 0.94391
##                           PC14    PC15    PC16    PC17    PC18   PC19    PC20
## Standard deviation     6.90321 6.81307 6.55929 6.42496 6.12829 6.0756 5.89232
## Proportion of Variance 0.00491 0.00478 0.00443 0.00425 0.00387 0.0038 0.00358
## Cumulative Proportion  0.94882 0.95360 0.95803 0.96228 0.96615 0.9699 0.97353
##                           PC21    PC22    PC23    PC24    PC25   PC26    PC27
## Standard deviation     5.83809 5.56297 5.41587 5.29599 4.96105 4.6206 3.92163
## Proportion of Variance 0.00351 0.00319 0.00302 0.00289 0.00253 0.0022 0.00158
## Cumulative Proportion  0.97704 0.98022 0.98324 0.98613 0.98867 0.9909 0.99245
##                           PC28   PC29    PC30    PC31    PC32   PC33    PC34
## Standard deviation     3.91130 3.2681 3.23119 3.22023 3.06146 2.9554 2.90530
## Proportion of Variance 0.00158 0.0011 0.00108 0.00107 0.00097 0.0009 0.00087
## Cumulative Proportion  0.99402 0.9951 0.99620 0.99727 0.99823 0.9991 1.00000

biplot(pl_pca)

From the biplot, we can see that the attributes of specific positions have been separated. Goalkeeper attributes are grouped on the right, and defender attributes are grouped in the upper left corner. However, midfielder and striker attributes are not well separated.

The next step is to choose the number of components. For this we will use two methods:

Proportion of variance explained
Scree plot of eigenvalues

In both cases, we will use the elbow method.

data_var <- pl_pca$sdev^2
pve <- data_var/sum(data_var)
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

According to the above charts, the optimal number of components is 3. Now let’s look at the eigenvalues.

data.cov<-cov(data[-35])
data.eigen<-eigen(data.cov)
data.eigen$values

##  [1] 5539.397902 1777.080007  551.014923  424.538279  251.634135  115.366501
##  [7]   99.978610   79.285417   76.687440   70.073848   68.974331   61.472938
## [13]   50.681062   47.654322   46.417951   43.024341   41.280080   37.555985
## [19]   36.912503   34.719391   34.083292   30.946630   29.331613   28.047540
## [25]   24.611998   21.349737   15.379158   15.298232   10.680724   10.440612
## [31]   10.369864    9.372549    8.734375    8.440744

A common rule of thumb (the Kaiser criterion) is to keep the components whose eigenvalues are greater than 1. However, that rule assumes PCA on standardized variables, where every variable has unit variance. Here the PCA was run on the unscaled covariance matrix, so the eigenvalues are in squared rating units and all of them exceed 1, which makes the rule uninformative. That’s why I will use the elbow method here again.
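
The eigenvalue-greater-than-one rule is usually stated for PCA on standardized variables, that is, on the correlation matrix, where the eigenvalues sum to the number of variables. The sketch below contrasts the two cases. It uses the built-in USArrests data purely as a stand-in, so the counts it produces say nothing about the FIFA data.

```r
# USArrests ships with base R and is used here only as a stand-in.
pca_cov <- prcomp(USArrests, scale. = FALSE)  # covariance-based, unscaled
pca_cor <- prcomp(USArrests, scale. = TRUE)   # correlation-based, standardized

eig_cov <- pca_cov$sdev^2
eig_cor <- pca_cor$sdev^2

# With standardized variables the eigenvalues sum to the number of variables,
# so 1 is the variance of an "average" original variable.
sum(eig_cor)

# Number of components the rule would keep in each case.
sum(eig_cov > 1)
sum(eig_cor > 1)
```

Applying the same idea here would mean calling `prcomp(data[,-35], scale. = TRUE)` and inspecting `sdev^2`, which is a different model from the unscaled PCA used in this report.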

fviz_eig(pl_pca, choice='eigenvalue')

According to this chart, the optimal number of components is also 3.
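
The elbow is read by eye, so a cumulative-variance threshold can serve as a cross-check. The sketch below reuses the eigenvalues printed above and finds the smallest number of components that reaches a given share of the variance. The 80% and 90% thresholds are my own assumption, not something fixed by the analysis.

```r
# Eigenvalues copied from the data.eigen$values output above.
ev <- c(5539.397902, 1777.080007, 551.014923, 424.538279, 251.634135,
        115.366501, 99.978610, 79.285417, 76.687440, 70.073848,
        68.974331, 61.472938, 50.681062, 47.654322, 46.417951,
        43.024341, 41.280080, 37.555985, 36.912503, 34.719391,
        34.083292, 30.946630, 29.331613, 28.047540, 24.611998,
        21.349737, 15.379158, 15.298232, 10.680724, 10.440612,
        10.369864, 9.372549, 8.734375, 8.440744)

pve     <- ev / sum(ev)   # proportion of variance explained
cum_pve <- cumsum(pve)    # cumulative proportion

# Smallest number of components reaching each (assumed) threshold.
k80 <- which(cum_pve >= 0.80)[1]
k90 <- which(cum_pve >= 0.90)[1]
c(k80 = k80, k90 = k90)
```

With these eigenvalues, three components pass the 80% mark, which agrees with the elbow reading, while seven are needed to pass 90%.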

After determining the optimal number of components, I will visualize the result to see how well the model separates the different player roles based on their attributes.

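# Name the colors by factor level: indexing with a factor uses its integer
# level codes, so the names must follow the level order to be meaningful.
colors <- rainbow(nlevels(data$Role))
names(colors) <- levels(data$Role)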
plot(pl_pca$x[,1:2], t='n', main="PCA")
text(pl_pca$x[,1:2], labels=data$Role, col=colors[data$Role])

If we visualize the first two components, we can see that the model separates goalkeepers from the other roles perfectly. The other roles form groups, but there is no clear separation among them.

plot(pl_pca$x[,2:3], t='n', main="PCA")
text(pl_pca$x[,2:3], labels=data$Role, col=colors[data$Role])

If we visualize the second and third components, the separation is even worse. To find the main reason for this, let’s look at how the variables contribute to the different components.

PC1 <- fviz_contrib(pl_pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pl_pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(pl_pca, choice = "var", axes = 3)
grid.arrange(PC1, PC2, PC3)

The variables (attributes) that contribute most to the first component are attacker attributes. For the second component, the largest contributions come from defender attributes. That’s why the visualization of the first two components can clearly differentiate attackers and defenders, as they are two “opposite” roles. However, because midfielders have both defender and attacker attributes, the model was not able to separate them.

The third component is driven mostly by goalkeeper attributes, along with some defender attributes. This is the reason for the model’s worse performance on the second and third components.
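
To make the idea of a variable's contribution concrete, one common definition is its squared loading on a component, expressed as a percentage. The sketch below computes this by hand on the built-in iris measurements, which are only a stand-in. I assume this corresponds to what the contribution plots report for a prcomp object, but I have not verified that against the package internals.

```r
# Built-in iris measurements, used only as a stand-in for the player data.
pca <- prcomp(iris[, 1:4])

# Each column of `rotation` is a unit-length loading vector, so the
# squared loadings of one component sum to 1 (100 as a percentage).
contrib <- 100 * pca$rotation^2
colSums(contrib)

# Variables ordered by their contribution to the first component.
pc1_rank <- sort(contrib[, "PC1"], decreasing = TRUE)
pc1_rank

# Reference line often drawn on such plots: the contribution each
# variable would have if all contributed equally.
expected <- 100 / nrow(contrib)
expected
```

For the model in this report, the same calculation would be `100 * pl_pca$rotation^2`.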

t-SNE

For t-SNE, I will set the number of output dimensions to three, following the PCA results, limit the maximum number of iterations to 500, and disable the “check_duplicates” parameter, because our data consists of integers and several players can have identical values.

tsne <- Rtsne(data[,-35], dims = 3, perplexity=30, verbose=TRUE, max_iter = 500, check_duplicates = FALSE)

## Performing PCA
## Read the 18159 x 34 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
##  - point 10000 of 18159
## Done in 13.97 seconds (sparsity = 0.007553)!
## Learning embedding...
## Iteration 50: error is 102.983237 (50 iterations in 9.22 seconds)
## Iteration 100: error is 99.641786 (50 iterations in 11.33 seconds)
## Iteration 150: error is 89.695916 (50 iterations in 8.92 seconds)
## Iteration 200: error is 89.149931 (50 iterations in 8.85 seconds)
## Iteration 250: error is 89.011659 (50 iterations in 9.38 seconds)
## Iteration 300: error is 3.603248 (50 iterations in 7.92 seconds)
## Iteration 350: error is 3.290715 (50 iterations in 7.60 seconds)
## Iteration 400: error is 3.124611 (50 iterations in 7.67 seconds)
## Iteration 450: error is 3.011999 (50 iterations in 7.53 seconds)
## Iteration 500: error is 2.929062 (50 iterations in 7.68 seconds)
## Fitting performed in 86.11 seconds.

plot(tsne$Y[,1:2], t='n', main="t-SNE")
text(tsne$Y[,1:2], labels=data$Role, col=colors[data$Role])

plot(tsne$Y[,2:3], t='n', main="t-SNE")
text(tsne$Y[,2:3], labels=data$Role, col=colors[data$Role])

As the previous graphs show, t-SNE performs worse than PCA here. This is most likely because of the “duplication” problem in our data.
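
One way to see how much duplication is actually present is to count rows that are exact copies of an earlier row. The sketch below does this on simulated integer ratings. The `toy` data is hypothetical and deliberately uses a narrow range so that duplicates are guaranteed, which says nothing about how many occur in the real data.

```r
set.seed(42)

# Hypothetical integer ratings: 500 players, 3 attributes with only
# 6 possible values each, so identical rows must occur.
toy <- data.frame(
  a = sample(50:55, 500, replace = TRUE),
  b = sample(50:55, 500, replace = TRUE),
  c = sample(50:55, 500, replace = TRUE)
)

# Rows identical to an earlier row.
n_dup <- sum(duplicated(toy))

# One representative per distinct attribute pattern.
toy_unique <- toy[!duplicated(toy), ]

c(duplicates = n_dup, unique_rows = nrow(toy_unique))
```

Running `sum(duplicated(data[,-35]))` on the player attributes would give the corresponding count for this analysis and help judge whether duplication is a plausible explanation.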

Before concluding, I want to compare the running time of the two models.

exeTimePCA<- system.time(prcomp(data[,-35]))
exeTimetSNE<- system.time(Rtsne(data[,-35], dims = 3, perplexity=30, verbose=TRUE, max_iter = 500, check_duplicates = FALSE))

## Performing PCA
## Read the 18159 x 34 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
##  - point 10000 of 18159
## Done in 13.70 seconds (sparsity = 0.007553)!
## Learning embedding...
## Iteration 50: error is 102.983218 (50 iterations in 10.75 seconds)
## Iteration 100: error is 100.936801 (50 iterations in 47.30 seconds)
## Iteration 150: error is 89.990146 (50 iterations in 16.46 seconds)
## Iteration 200: error is 89.290048 (50 iterations in 13.63 seconds)
## Iteration 250: error is 89.109278 (50 iterations in 21.50 seconds)
## Iteration 300: error is 3.586800 (50 iterations in 12.16 seconds)
## Iteration 350: error is 3.276531 (50 iterations in 8.77 seconds)
## Iteration 400: error is 3.112004 (50 iterations in 8.15 seconds)
## Iteration 450: error is 3.000232 (50 iterations in 8.21 seconds)
## Iteration 500: error is 2.916969 (50 iterations in 8.06 seconds)
## Fitting performed in 155.01 seconds.

exeTimePCA;exeTimetSNE

##    user  system elapsed 
##    0.12    0.01    0.14

##    user  system elapsed 
##  172.84    0.44  173.86

t-SNE takes much longer to run than PCA: about 174 seconds of elapsed time versus 0.14 seconds.
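
A single system.time() call can be noisy: the two t-SNE runs above reported fitting times of 86.11 and 155.01 seconds with the same settings. Repeating the measurement and taking the median gives a steadier figure. The sketch below shows the pattern for PCA on a simulated matrix. The matrix size and the number of repetitions are arbitrary assumptions.

```r
set.seed(1)

# Simulated stand-in: 2000 rows and 34 numeric columns.
x <- matrix(rnorm(2000 * 34), ncol = 34)

# Time the same call several times and keep only the elapsed seconds.
reps <- 5
elapsed <- replicate(reps, system.time(prcomp(x))[["elapsed"]])

# The median is less sensitive to one slow run than the mean.
median(elapsed)
```

The same wrapper could be put around the Rtsne() call, although with run times of a minute or more a small number of repetitions is probably all that is practical.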

Conclusion

I have built PCA and t-SNE models on the given dataset for dimension reduction. In conclusion, even though neither result was perfect, PCA is better than t-SNE in this case: it takes far less time to run and separates the player roles more clearly.