In this project, we will apply dimension reduction techniques to football player attributes. We will use two techniques, PCA and t-SNE, and at the end compare them on two characteristics:
Running time;
How well each method separates the different player positions in a visualization.
Our data consists of detailed attributes for every player registered in the latest edition of the FIFA 19 database. It was obtained from Kaggle (K.Gadiya 2019) and prepared for analysis in my previous project (https://rpubs.com/nijat_g/data_prep).
data <- read.csv2("Prepared_data.csv", sep = ",")
str(data)
## 'data.frame': 18207 obs. of 58 variables:
## $ ID : int 158023 20801 190871 193080 192985 183277 177003 176580 155862 200389 ...
## $ Name : chr "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "De Gea" ...
## $ Age : int 31 33 26 27 27 27 32 31 32 25 ...
## $ Nationality : chr "Argentina" "Portugal" "Brazil" "Spain" ...
## $ Overall : int 94 94 92 91 91 91 91 91 91 90 ...
## $ Potential : int 94 94 93 93 92 91 91 91 91 93 ...
## $ Club : chr "FC Barcelona" "Juventus" "Paris Saint-Germain" "Manchester United" ...
## $ Value.in.million.euros : chr "110.5" "77" "118.5" "72" ...
## $ Wage.in.thousand.euros : int 565 405 290 260 355 340 420 455 380 94 ...
## $ Special : int 2202 2228 2143 1471 2281 2142 2280 2346 2201 1331 ...
## $ Preferred.Foot : chr "Left" "Right" "Right" "Right" ...
## $ International.Reputation : int 5 5 5 4 4 4 4 5 4 3 ...
## $ Weak.Foot : int 4 4 5 3 5 4 4 4 3 3 ...
## $ Skill.Moves : int 4 5 5 1 4 4 4 3 3 1 ...
## $ Work.Rate : chr "Medium/ Medium" "High/ Low" "High/ Medium" "Medium/ Medium" ...
## $ Real.Face : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Position : chr "RF" "ST" "LW" "GK" ...
## $ Jersey.Number : int 10 7 10 1 7 10 10 9 15 1 ...
## $ Joined : chr "1-Jul-04" "10-Jul-18" "3-Aug-17" "1-Jul-11" ...
## $ Loaned.From : chr "" "" "" "" ...
## $ Contract.Valid.Until : chr "2021" "2022" "2022" "2020" ...
## $ Weight : int 159 183 150 168 154 163 146 190 181 192 ...
## $ Crossing : int 84 84 79 17 93 81 86 77 66 13 ...
## $ Finishing : int 95 94 87 13 82 84 72 93 60 11 ...
## $ HeadingAccuracy : int 70 89 62 21 55 61 55 77 91 15 ...
## $ ShortPassing : int 90 81 84 50 92 89 93 82 78 29 ...
## $ Volleys : int 86 87 84 13 82 80 76 88 66 13 ...
## $ Dribbling : int 97 88 96 18 86 95 90 87 63 12 ...
## $ Curve : int 93 81 88 21 85 83 85 86 74 13 ...
## $ FKAccuracy : int 94 76 87 19 83 79 78 84 72 14 ...
## $ LongPassing : int 87 77 78 51 91 83 88 64 77 26 ...
## $ BallControl : int 96 94 95 42 91 94 93 90 84 16 ...
## $ Acceleration : int 91 89 94 57 78 94 80 86 76 43 ...
## $ SprintSpeed : int 86 91 90 58 76 88 72 75 75 60 ...
## $ Agility : int 91 87 96 60 79 95 93 82 78 67 ...
## $ Reactions : int 95 96 94 90 91 90 90 92 85 86 ...
## $ Balance : int 95 70 84 43 77 94 94 83 66 49 ...
## $ ShotPower : int 85 95 80 31 91 82 79 86 79 22 ...
## $ Jumping : int 68 95 61 67 63 56 68 69 93 76 ...
## $ Stamina : int 72 88 81 43 90 83 89 90 84 41 ...
## $ Strength : int 59 79 49 64 75 66 58 83 83 78 ...
## $ LongShots : int 94 93 82 12 91 80 82 85 59 12 ...
## $ Aggression : int 48 63 56 38 76 54 62 87 88 34 ...
## $ Interceptions : int 22 29 36 30 61 41 83 41 90 19 ...
## $ Positioning : int 94 95 89 12 87 87 79 92 60 11 ...
## $ Vision : int 94 82 87 68 94 89 92 84 63 70 ...
## $ Penalties : int 75 85 81 40 79 86 82 85 75 11 ...
## $ Composure : int 96 95 94 68 88 91 84 85 82 70 ...
## $ Marking : int 33 28 27 15 68 34 60 62 87 27 ...
## $ StandingTackle : int 28 31 24 21 58 27 76 45 92 12 ...
## $ SlidingTackle : int 26 23 33 13 51 22 73 38 91 18 ...
## $ GKDiving : int 6 7 9 90 15 11 13 27 11 86 ...
## $ GKHandling : int 11 11 9 85 13 12 9 25 8 92 ...
## $ GKKicking : int 15 15 15 87 5 6 7 31 9 78 ...
## $ GKPositioning : int 14 14 15 88 10 8 14 33 7 88 ...
## $ GKReflexes : int 8 11 11 94 13 8 9 37 11 89 ...
## $ Release.clause.in.million.euros: chr "226.5" "127.1" "228.1" "138.6" ...
## $ Height : chr "170.18" "187.96" "175.26" "193.04" ...
For this project, we will use only Position and player attributes columns. Moreover, I will remove rows that contain missing values from our dataset in order to be able to build models in further steps.
data <- data[,c(17, 23:56)]
data <- data[complete.cases(data),]
dim(data)
## [1] 18159 35
summary(data)
## Position Crossing Finishing HeadingAccuracy
## Length:18159 Min. : 5.00 Min. : 2.00 Min. : 4.0
## Class :character 1st Qu.:38.00 1st Qu.:30.00 1st Qu.:44.0
## Mode :character Median :54.00 Median :49.00 Median :56.0
## Mean :49.73 Mean :45.55 Mean :52.3
## 3rd Qu.:64.00 3rd Qu.:62.00 3rd Qu.:64.0
## Max. :93.00 Max. :95.00 Max. :94.0
## ShortPassing Volleys Dribbling Curve
## Min. : 7.00 Min. : 4.00 Min. : 4.00 Min. : 6.00
## 1st Qu.:54.00 1st Qu.:30.00 1st Qu.:49.00 1st Qu.:34.00
## Median :62.00 Median :44.00 Median :61.00 Median :48.00
## Mean :58.69 Mean :42.91 Mean :55.37 Mean :47.17
## 3rd Qu.:68.00 3rd Qu.:57.00 3rd Qu.:68.00 3rd Qu.:62.00
## Max. :93.00 Max. :90.00 Max. :97.00 Max. :94.00
## FKAccuracy LongPassing BallControl Acceleration
## Min. : 3.00 Min. : 9.00 Min. : 5.00 Min. :12.00
## 1st Qu.:31.00 1st Qu.:43.00 1st Qu.:54.00 1st Qu.:57.00
## Median :41.00 Median :56.00 Median :63.00 Median :67.00
## Mean :42.86 Mean :52.71 Mean :58.37 Mean :64.61
## 3rd Qu.:57.00 3rd Qu.:64.00 3rd Qu.:69.00 3rd Qu.:75.00
## Max. :94.00 Max. :93.00 Max. :96.00 Max. :97.00
## SprintSpeed Agility Reactions Balance ShotPower
## Min. :12.00 Min. :14.0 Min. :21.00 Min. :16.00 Min. : 2.00
## 1st Qu.:57.00 1st Qu.:55.0 1st Qu.:56.00 1st Qu.:56.00 1st Qu.:45.00
## Median :67.00 Median :66.0 Median :62.00 Median :66.00 Median :59.00
## Mean :64.73 Mean :63.5 Mean :61.84 Mean :63.97 Mean :55.46
## 3rd Qu.:75.00 3rd Qu.:74.0 3rd Qu.:68.00 3rd Qu.:74.00 3rd Qu.:68.00
## Max. :96.00 Max. :96.0 Max. :96.00 Max. :96.00 Max. :95.00
## Jumping Stamina Strength LongShots
## Min. :15.00 Min. :12.00 Min. :17.00 Min. : 3.00
## 1st Qu.:58.00 1st Qu.:56.00 1st Qu.:58.00 1st Qu.:33.00
## Median :66.00 Median :66.00 Median :67.00 Median :51.00
## Mean :65.09 Mean :63.22 Mean :65.31 Mean :47.11
## 3rd Qu.:73.00 3rd Qu.:74.00 3rd Qu.:74.00 3rd Qu.:62.00
## Max. :95.00 Max. :96.00 Max. :97.00 Max. :94.00
## Aggression Interceptions Positioning Vision Penalties
## Min. :11.00 Min. : 3.0 Min. : 2.00 Min. :10.0 Min. : 5.00
## 1st Qu.:44.00 1st Qu.:26.0 1st Qu.:38.00 1st Qu.:44.0 1st Qu.:39.00
## Median :59.00 Median :52.0 Median :55.00 Median :55.0 Median :49.00
## Mean :55.87 Mean :46.7 Mean :49.96 Mean :53.4 Mean :48.55
## 3rd Qu.:69.00 3rd Qu.:64.0 3rd Qu.:64.00 3rd Qu.:64.0 3rd Qu.:60.00
## Max. :95.00 Max. :92.0 Max. :95.00 Max. :94.0 Max. :92.00
## Composure Marking StandingTackle SlidingTackle GKDiving
## Min. : 3.00 Min. : 3.00 Min. : 2.0 Min. : 3.00 Min. : 1.00
## 1st Qu.:51.00 1st Qu.:30.00 1st Qu.:27.0 1st Qu.:24.00 1st Qu.: 8.00
## Median :60.00 Median :53.00 Median :55.0 Median :52.00 Median :11.00
## Mean :58.65 Mean :47.28 Mean :47.7 Mean :45.66 Mean :16.62
## 3rd Qu.:67.00 3rd Qu.:64.00 3rd Qu.:66.0 3rd Qu.:64.00 3rd Qu.:14.00
## Max. :96.00 Max. :94.00 Max. :93.0 Max. :91.00 Max. :90.00
## GKHandling GKKicking GKPositioning GKReflexes
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.00 Median :11.00 Median :11.00 Median :11.00
## Mean :16.39 Mean :16.23 Mean :16.39 Mean :16.71
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :92.00 Max. :91.00 Max. :90.00 Max. :94.00
Each player has 34 attributes, each measured on a scale from 0 to 100. The image below shows the meaning of the position abbreviations.
library(corrplot)  # provides corrplot()
pl_cor <- cor(data[-1], method="pearson")
corrplot(pl_cor, type = "lower", order ="alphabet", tl.cex=0.6)
From the correlation matrix, we can observe that some variables are highly correlated. For example, a player with high ball control tends to have high dribbling ability as well. Conversely, a player with high goalkeeper abilities is unlikely to have trained for dribbling, so his dribbling ability tends to be low.
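To make this kind of observation concrete, the strongest pairwise correlations can be listed directly from the correlation matrix instead of being read off the plot. The following is a minimal sketch on simulated ratings; the data frame `toy` and its values are assumptions for illustration and are not taken from the FIFA 19 data.

```r
# Simulated ratings (hypothetical values, not the FIFA 19 data)
set.seed(1)
n <- 200
skill <- rnorm(n)  # latent outfield skill shared by the three variables
toy <- data.frame(
  BallControl = 60 + 10 * skill + rnorm(n, sd = 3),
  Dribbling   = 58 + 10 * skill + rnorm(n, sd = 3),
  GKDiving    = 20 - 8 * skill + rnorm(n, sd = 3)
)

toy_cor <- cor(toy, method = "pearson")

# Keep each pair once (lower triangle), then sort by absolute correlation
idx <- which(lower.tri(toy_cor), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(toy_cor)[idx[, "row"]],
  var2 = colnames(toy_cor)[idx[, "col"]],
  r    = toy_cor[idx]
)
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

The same steps should work on `pl_cor`, where the top rows would show the attribute pairs with the strongest positive or negative relationship.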
First, I will build a PCA model. However, before building the model, I want to map each player's position to a broader role to simplify the analysis and comparison:
ST - Striker
MF - Midfielder
DF - Defender
GK - Goalkeeper
Role <- data.frame(Position = c("LS","ST","RS",
"LW","LF","CF","RF","RW",
"LAM","RAM","CAM",
"LM","LCM","CM","RCM","RM",
"LWB","LDM","CDM","RDM","RWB",
"LB","LCB","CB","RCB","RB",
"GK"),
Role = c(rep("ST",8),rep("MF",13),rep("DF",5),"GK"))
library(dplyr)  # provides left_join()
data <- left_join(data,Role, by = "Position")
data <- data[,-1]
data$Role <- as.factor(data$Role)
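A left join keeps every player row, so any position that is absent from the lookup table silently receives a missing role. The sketch below shows one way to check for this. It uses base R `match()` as a stand-in for `left_join()`, and the small lookup table and player positions are hypothetical, not the full mapping above.

```r
# Hypothetical mini lookup table and player positions (not the full mapping)
role_lookup <- data.frame(
  Position = c("ST", "CM", "CB", "GK"),
  Role     = c("ST", "MF", "DF", "GK"),
  stringsAsFactors = FALSE
)
players <- data.frame(
  Position = c("ST", "GK", "CM", "", "CB"),
  stringsAsFactors = FALSE
)

# Base-R equivalent of a left join on Position
players$Role <- role_lookup$Role[match(players$Position, role_lookup$Position)]

# Rows whose position had no match end up with Role = NA
unmatched <- players[is.na(players$Role), , drop = FALSE]
print(table(players$Role, useNA = "ifany"))
print(unmatched)
```

Running a check like `sum(is.na(data$Role))` after the join would show whether any players were left without a role before the model is built.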
Now we can apply PCA to the data for dimension reduction.
pl_pca <- prcomp(data[,-35])
summary(pl_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 74.4271 42.1554 23.47371 20.60433 15.86298 10.74088
## Proportion of Variance 0.5704 0.1830 0.05674 0.04372 0.02591 0.01188
## Cumulative Proportion 0.5704 0.7534 0.81018 0.85389 0.87981 0.89169
## PC7 PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 9.9989 8.90424 8.7571 8.37101 8.3051 7.84047 7.11906
## Proportion of Variance 0.0103 0.00816 0.0079 0.00722 0.0071 0.00633 0.00522
## Cumulative Proportion 0.9020 0.91015 0.9180 0.92526 0.9324 0.93869 0.94391
## PC14 PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 6.90321 6.81307 6.55929 6.42496 6.12829 6.0756 5.89232
## Proportion of Variance 0.00491 0.00478 0.00443 0.00425 0.00387 0.0038 0.00358
## Cumulative Proportion 0.94882 0.95360 0.95803 0.96228 0.96615 0.9699 0.97353
## PC21 PC22 PC23 PC24 PC25 PC26 PC27
## Standard deviation 5.83809 5.56297 5.41587 5.29599 4.96105 4.6206 3.92163
## Proportion of Variance 0.00351 0.00319 0.00302 0.00289 0.00253 0.0022 0.00158
## Cumulative Proportion 0.97704 0.98022 0.98324 0.98613 0.98867 0.9909 0.99245
## PC28 PC29 PC30 PC31 PC32 PC33 PC34
## Standard deviation 3.91130 3.2681 3.23119 3.22023 3.06146 2.9554 2.90530
## Proportion of Variance 0.00158 0.0011 0.00108 0.00107 0.00097 0.0009 0.00087
## Cumulative Proportion 0.99402 0.9951 0.99620 0.99727 0.99823 0.9991 1.00000
biplot(pl_pca)
From the biplot, we can observe that the attributes of specific positions have been separated. Goalkeeper attributes are clustered on the right, and defender attributes are clustered in the upper left corner. However, midfielder and striker attributes are not well separated.
The next step is to choose the number of components. For this we will use two methods:
Proportion of variance explained
Scree plot of eigenvalues
In both cases, we will use the elbow method.
data_var <- pl_pca$sdev^2
pve <- data_var/sum(data_var)
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
According to the above charts, the optimal number of components is 3. Now let’s look at the eigenvalues.
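As a cross-check on the elbow reading, one can also ask for the smallest number of components whose cumulative proportion of variance reaches a chosen threshold. The sketch below reuses the standard deviations printed by `summary(pl_pca)` above; the 80% threshold is an assumption chosen for illustration, not a value taken from the original analysis.

```r
# Standard deviations copied from the summary(pl_pca) output above
sdev <- c(74.4271, 42.1554, 23.47371, 20.60433, 15.86298, 10.74088,
          9.9989, 8.90424, 8.7571, 8.37101, 8.3051, 7.84047, 7.11906,
          6.90321, 6.81307, 6.55929, 6.42496, 6.12829, 6.0756, 5.89232,
          5.83809, 5.56297, 5.41587, 5.29599, 4.96105, 4.6206, 3.92163,
          3.91130, 3.2681, 3.23119, 3.22023, 3.06146, 2.9554, 2.90530)

pve <- sdev^2 / sum(sdev^2)   # proportion of variance explained per component
cum_pve <- cumsum(pve)

threshold <- 0.80             # assumed cutoff, for illustration only
k <- which(cum_pve >= threshold)[1]
print(k)
print(round(cum_pve[1:5], 4))
```

With these values, three components is the first count to pass 80%, which agrees with the elbow reading.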
data.cov<-cov(data[-35])
data.eigen<-eigen(data.cov)
data.eigen$values
## [1] 5539.397902 1777.080007 551.014923 424.538279 251.634135 115.366501
## [7] 99.978610 79.285417 76.687440 70.073848 68.974331 61.472938
## [13] 50.681062 47.654322 46.417951 43.024341 41.280080 37.555985
## [19] 36.912503 34.719391 34.083292 30.946630 29.331613 28.047540
## [25] 24.611998 21.349737 15.379158 15.298232 10.680724 10.440612
## [31] 10.369864 9.372549 8.734375 8.440744
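A common rule of thumb (the Kaiser criterion) keeps only the components whose eigenvalues are greater than 1. However, as you can see, all of our eigenvalues are greater than one. This is because the rule assumes standardized variables (PCA on the correlation matrix), where the average eigenvalue is 1. Here PCA was run on the unscaled attributes, which are on a 0 to 100 scale, so the eigenvalues are correspondingly large. That’s why I will use the elbow method here again.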
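The effect of scaling on the eigenvalue rule can be seen on a small example. The following sketch compares covariance-based and correlation-based eigenvalues on simulated ratings; the data frame `toy` and its columns are hypothetical, and `scale. = TRUE` is an alternative to the unscaled PCA used in this analysis, not what was run above.

```r
set.seed(42)
n <- 500
skill <- rnorm(n)
# Hypothetical ratings on a 0-100-like scale (not the FIFA 19 data)
toy <- data.frame(
  a = 60 + 12 * skill + rnorm(n, sd = 4),
  b = 55 + 12 * skill + rnorm(n, sd = 4),
  c = 50 + rnorm(n, sd = 10),
  d = 45 + rnorm(n, sd = 10)
)

eig_cov <- prcomp(toy)$sdev^2                 # covariance-based, as in this analysis
eig_cor <- prcomp(toy, scale. = TRUE)$sdev^2  # correlation-based (standardized)

print(round(eig_cov, 2))   # all far above 1 because of the measurement scale
print(round(eig_cor, 2))   # these sum to the number of variables
print(mean(eig_cor))       # average eigenvalue is 1 after standardization
print(sum(eig_cor > 1))    # number of components kept by the eigenvalue rule
```

Only in the standardized case does "greater than 1" mean "explains more than an average single variable", which is what the rule is meant to capture.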
library(factoextra)  # provides fviz_eig() and fviz_contrib()
fviz_eig(pl_pca, choice='eigenvalue')
According to this chart, the optimal number of components is also 3.
After determining the optimal number of components, I will visualize the result to see how well the model separates the different player roles based on their attributes.
# One color per role, named by factor level so that indexing by the factor matches the names
colors <- rainbow(nlevels(data$Role))
names(colors) <- levels(data$Role)
plot(pl_pca$x[,1:2], t='n', main="PCA")
text(pl_pca$x[,1:2], labels=data$Role, col=colors[data$Role])
If we visualize the first two components of the model, we can see that it separates goalkeepers perfectly from the other roles. Although the other roles form groups, there is no clear separation among them.
plot(pl_pca$x[,2:3], t='n', main="PCA")
text(pl_pca$x[,2:3], labels=data$Role, col=colors[data$Role])
If we visualize the second and third components, the separation is even worse. To find the main reason behind this, let’s look at the variable contributions to the different components.
PC1 <- fviz_contrib(pl_pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pl_pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(pl_pca, choice = "var", axes = 3)
library(gridExtra)  # provides grid.arrange()
grid.arrange(PC1, PC2, PC3)
The variables (attributes) that contribute most to the first component are the attributes of attackers. For the second one, the most contributions come from the attributes of defenders. That’s why the visualization of the first two components can clearly differentiate attackers and defenders as they are two “opposite” roles. However, as midfielders have both defender and attacker attributes, our model was not able to differentiate them.
The third component is driven mostly by goalkeeper attributes, together with some defender attributes. This is the reason behind our model’s worse performance when plotting the second and third components.
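The contribution percentages can also be computed by hand from the PCA loadings, which helps when reading these charts. The sketch below assumes that a variable's contribution to a component is its squared loading expressed as a percentage, which to my understanding corresponds to what `fviz_contrib()` plots for a `prcomp` object. The simulated data frame `toy` is hypothetical.

```r
set.seed(7)
n <- 300
att <- rnorm(n)   # latent attacking skill
def <- rnorm(n)   # latent defending skill
# Hypothetical ratings (not the FIFA 19 data)
toy <- data.frame(
  Finishing      = 50 + 15 * att + rnorm(n, sd = 4),
  Dribbling      = 55 + 15 * att + rnorm(n, sd = 4),
  StandingTackle = 48 + 15 * def + rnorm(n, sd = 4),
  Marking        = 47 + 15 * def + rnorm(n, sd = 4)
)

pca <- prcomp(toy)

# Each column of `rotation` has unit length, so the squared loadings
# of one component sum to 1; multiplying by 100 gives percentages.
contrib <- 100 * pca$rotation^2
print(round(contrib[, 1:2], 1))
print(colSums(contrib))   # each component's contributions sum to 100
```

Applied to `pl_pca$rotation`, the same calculation would give a table of contributions for all 34 attributes, which can be sorted per component.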
For building t-SNE, I will set the number of output dimensions to three based on the PCA results, limit the maximum number of iterations to 500, and disable the “check_duplicates” parameter of the function, because our data consists of integers and several players can have identical attribute values.
library(Rtsne)  # provides Rtsne()
tsne <- Rtsne(data[,-35], dims = 3, perplexity=30, verbose=TRUE, max_iter = 500, check_duplicates = FALSE)
## Performing PCA
## Read the 18159 x 34 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## - point 10000 of 18159
## Done in 13.97 seconds (sparsity = 0.007553)!
## Learning embedding...
## Iteration 50: error is 102.983237 (50 iterations in 9.22 seconds)
## Iteration 100: error is 99.641786 (50 iterations in 11.33 seconds)
## Iteration 150: error is 89.695916 (50 iterations in 8.92 seconds)
## Iteration 200: error is 89.149931 (50 iterations in 8.85 seconds)
## Iteration 250: error is 89.011659 (50 iterations in 9.38 seconds)
## Iteration 300: error is 3.603248 (50 iterations in 7.92 seconds)
## Iteration 350: error is 3.290715 (50 iterations in 7.60 seconds)
## Iteration 400: error is 3.124611 (50 iterations in 7.67 seconds)
## Iteration 450: error is 3.011999 (50 iterations in 7.53 seconds)
## Iteration 500: error is 2.929062 (50 iterations in 7.68 seconds)
## Fitting performed in 86.11 seconds.
plot(tsne$Y[,1:2], t='n', main="t-SNE")
text(tsne$Y[,1:2], labels=data$Role, col=colors[data$Role])
plot(tsne$Y[,2:3], t='n', main="t-SNE")
text(tsne$Y[,2:3], labels=data$Role, col=colors[data$Role])
As the previous graphs show, t-SNE separates the roles less clearly than PCA. This is mostly because of the “duplication” problem in our data.
Before concluding, I want to compare the running time of the two models.
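How many rows are exact duplicates can be counted with base R before deciding how to handle them. The following is a minimal sketch on a hypothetical toy data frame; removing duplicates is shown only as an option and is not what was done in this analysis.

```r
# Hypothetical integer ratings; rows 1 and 3 are identical
toy <- data.frame(
  Crossing  = c(60L, 45L, 60L, 70L),
  Finishing = c(55L, 30L, 55L, 65L)
)

dup <- duplicated(toy)       # TRUE for a row that repeats an earlier row
print(sum(dup))              # number of duplicated rows

toy_unique <- toy[!dup, ]    # de-duplicated copy, original left untouched
print(nrow(toy_unique))
```

On the real data, `sum(duplicated(data[,-35]))` would report how many players share an identical attribute vector with an earlier player.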
exeTimePCA<- system.time(prcomp(data[,-35]))
exeTimetSNE<- system.time(Rtsne(data[,-35], dims = 3, perplexity=30, verbose=TRUE, max_iter = 500, check_duplicates = FALSE))
## Performing PCA
## Read the 18159 x 34 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## - point 10000 of 18159
## Done in 13.70 seconds (sparsity = 0.007553)!
## Learning embedding...
## Iteration 50: error is 102.983218 (50 iterations in 10.75 seconds)
## Iteration 100: error is 100.936801 (50 iterations in 47.30 seconds)
## Iteration 150: error is 89.990146 (50 iterations in 16.46 seconds)
## Iteration 200: error is 89.290048 (50 iterations in 13.63 seconds)
## Iteration 250: error is 89.109278 (50 iterations in 21.50 seconds)
## Iteration 300: error is 3.586800 (50 iterations in 12.16 seconds)
## Iteration 350: error is 3.276531 (50 iterations in 8.77 seconds)
## Iteration 400: error is 3.112004 (50 iterations in 8.15 seconds)
## Iteration 450: error is 3.000232 (50 iterations in 8.21 seconds)
## Iteration 500: error is 2.916969 (50 iterations in 8.06 seconds)
## Fitting performed in 155.01 seconds.
exeTimePCA;exeTimetSNE
## user system elapsed
## 0.12 0.01 0.14
## user system elapsed
## 172.84 0.44 173.86
t-SNE takes far longer to run than PCA: about 174 seconds of elapsed time versus 0.14 seconds.
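A single `system.time()` call can be noisy, as the two t-SNE runs above (86 and 155 seconds of fitting) suggest. A hedged sketch of a more stable measurement is to repeat the call and take the median. Here `time_median` is a hypothetical helper, not part of any package, and the matrix `x` is a simulated stand-in for the attribute data.

```r
# Time a function several times and report the median elapsed seconds.
# `time_median` is a hypothetical helper written for this sketch.
time_median <- function(f, reps = 5) {
  elapsed <- vapply(
    seq_len(reps),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

set.seed(1)
x <- matrix(rnorm(2000 * 34), ncol = 34)  # simulated stand-in for the attributes

t_pca <- time_median(function() prcomp(x))
print(t_pca)
```

The same helper could wrap the `Rtsne()` call, although with a run time of minutes a small number of repetitions is more practical.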
I have built PCA and t-SNE models on the given dataset for dimension reduction. In conclusion, even though neither result was perfect, PCA is the better choice in this case: it runs much faster and separates the player roles more clearly.