Introduction

The aim of this article is to use dimension reduction as a preprocessing step for cluster analysis. The goal is to extract the intrinsic dimension of a data set of footballers. The data set initially has more than 50 variables, but most of them appear to be correlated, so we will check how many dimensions are really needed to explain most of the variance.

Dataset

The data set comes from Kaggle: https://www.kaggle.com/datasets/maso0dahmed/football-players-data
This comprehensive dataset offers detailed information on approximately 17,000 FIFA football players, meticulously scraped from SoFIFA.com. It encompasses a wide array of player-specific data points, including but not limited to player names, nationalities, clubs, player ratings, potential, positions, ages, and various skill attributes. This dataset is ideal for football enthusiasts, data analysts, and researchers seeking to conduct in-depth analysis, statistical studies, or machine learning projects related to football players’ performance, characteristics, and career progressions.

The variable names are self-explanatory, so they are listed without further description.
List of variables: name, full_name, birth_date, age, height_cm, weight_kgs, positions, nationality, overall_rating, potential, value_euro, wage_euro, preferred_foot, international_reputation(1-5), weak_foot(1-5), skill_moves(1-5), body_type, release_clause_euro, national_team, national_rating, national_team_position, national_jersey_number, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, freekick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, composure, marking, standing_tackle and sliding_tackle

Methodology

PCA is a linear dimension reduction technique. It performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. The method is based on the covariance matrix of the data: its eigenvectors define the principal components, and the eigenvectors that correspond to the largest eigenvalues capture a large fraction of the variance of the original data.
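The link between PCA and the covariance matrix can be illustrated with a minimal sketch. It uses the built-in USArrests data as a stand-in (an assumption made only for illustration, not the footballers data) and compares a manual eigen-decomposition with prcomp.

```r
# Stand-in data: built-in USArrests, standardized (illustrative assumption)
X <- scale(USArrests)

# Eigen-decomposition of the covariance matrix
S <- cov(X)
e <- eigen(S)

# Project the data onto the eigenvectors to obtain the component scores
scores <- X %*% e$vectors

# The same decomposition via prcomp (data are already standardized)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Eigenvalues equal the variances of the principal components
same_variances <- isTRUE(all.equal(e$values, pca$sdev^2))

# Eigenvectors equal the loadings up to an arbitrary sign per component
same_loadings <- isTRUE(all.equal(abs(unname(e$vectors)),
                                  abs(unname(pca$rotation))))

# The variance of each score column is the corresponding eigenvalue
score_variances <- unname(apply(scores, 2, var))
```

Both comparisons are made up to numerical tolerance, and the loadings are compared in absolute value because the sign of each component is arbitrary.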

t-SNE is a statistical method well suited to visualizing high-dimensional data. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points. The interpretation is simple: points that are close to each other in the transformed space are similar in the original space.
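As a rough sketch of how such an embedding can be obtained, the example below runs Rtsne on the built-in iris measurements. The data set and the perplexity value are assumptions chosen for illustration only; they are not the settings used for the footballers data.

```r
library(Rtsne)

set.seed(123)

# Stand-in data: iris measurements without duplicated rows
# (Rtsne checks for duplicates by default and stops if it finds any)
keep <- !duplicated(iris[, 1:4])
X <- scale(as.matrix(iris[keep, 1:4]))
species <- iris$Species[keep]

# Two-dimensional t-SNE embedding; perplexity = 30 is an illustrative choice
tsne_out <- Rtsne(X, dims = 2, perplexity = 30)

# One row of 2D coordinates per observation
embedding <- tsne_out$Y

# Nearby points in this plot correspond to similar flowers
plot(embedding, col = as.integer(species), pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```

Only local neighbourhoods should be read from such a plot; distances between far-apart groups are not directly interpretable.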

Literature overview

Football is the most widely followed sport. For its development, different tools are used to describe the statistical events that occur during high-level sports competitions. They make it possible to understand and identify the most important features related to the success or failure of the observed team.

In the article “In-game behaviour analysis of football players using machine learning techniques based on player statistics”, machine learning algorithms were used to determine the on-field playing positions of a group of football players based on their technical-tactical behaviour. To visualize connections between the points in a two-dimensional projection, the unsupervised learning method t-SNE was used. This dimension reduction technique revealed some grouping (4 groups). From the t-SNE plot it was observed, for example, that the central defenders are the farthest from the strikers, the wingers are the closest to the strikers, and the full-backs are placed among central defenders, midfielders and wingers.

There is important research on techniques that can be applied to athletic training and that can also be useful in any organization that trains large numbers of athletes. In general, as the use of wearable technology increases, more parameters become available for K-means clustering. It might be a tool to create informed training groups that make the best use of the strength coach's time and the athletes' training (Reuben et al., 2020). Wearable technology will give huge opportunities to coaches and staff. In a utopian scenario for football managers and their analysts, the data taken from players will be available live, which means that coaches will no longer have to rely on gut instinct alone. It will make it possible to maximize team utility and take the coach's role to a higher level (Creasey, 2015).

# Set the seed to make the code repeatable
set.seed(123)
# Libraries
library(ggplot2)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

library(stats)
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(FactoMineR)

## Warning: package 'FactoMineR' was built under R version 4.3.2

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.3.2

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(Rtsne)

## Warning: package 'Rtsne' was built under R version 4.3.2

# Read a data set
players <- read.csv("C:/Users/User/Desktop/Dimension Reduction/fifa_players.csv")
str(players)

## 'data.frame':    17954 obs. of  51 variables:
##  $ name                         : chr  "L. Messi" "C. Eriksen" "P. Pogba" "L. Insigne" ...
##  $ full_name                    : chr  "Lionel Andrés Messi Cuccittini" "Christian  Dannemann Eriksen" "Paul Pogba" "Lorenzo Insigne" ...
##  $ birth_date                   : chr  "6/24/1987" "2/14/1992" "3/15/1993" "6/4/1991" ...
##  $ age                          : int  31 27 25 27 27 27 20 30 32 32 ...
##  $ height_cm                    : num  170 155 190 163 188 ...
##  $ weight_kgs                   : num  72.1 76.2 83.9 59 88.9 92.1 73 69.9 92.1 77.1 ...
##  $ positions                    : chr  "CF,RW,ST" "CAM,RM,CM" "CM,CAM" "LW,ST" ...
##  $ nationality                  : chr  "Argentina" "Denmark" "France" "Italy" ...
##  $ overall_rating               : int  94 88 88 88 88 88 88 89 89 89 ...
##  $ potential                    : int  94 89 91 88 91 90 95 89 89 89 ...
##  $ value_euro                   : int  110500000 69500000 73000000 62000000 60000000 59500000 81000000 64500000 38000000 60000000 ...
##  $ wage_euro                    : int  565000 205000 255000 165000 135000 215000 100000 300000 130000 200000 ...
##  $ preferred_foot               : chr  "Left" "Right" "Right" "Right" ...
##  $ international_reputation.1.5.: int  5 3 4 3 3 3 3 4 5 4 ...
##  $ weak_foot.1.5.               : int  4 5 4 4 3 3 4 4 4 4 ...
##  $ skill_moves.1.5.             : int  4 4 5 4 2 2 5 4 1 3 ...
##  $ body_type                    : chr  "Messi" "Lean" "Normal" "Normal" ...
##  $ release_clause_euro          : int  226500000 133800000 144200000 105400000 106500000 114500000 166100000 119300000 62700000 111000000 ...
##  $ national_team                : chr  "Argentina" "Denmark" "France" "Italy" ...
##  $ national_rating              : int  82 78 84 83 NA 81 84 82 85 81 ...
##  $ national_team_position       : chr  "RF" "CAM" "RDM" "LW" ...
##  $ national_jersey_number       : int  10 10 6 10 NA 4 10 11 1 21 ...
##  $ crossing                     : int  86 88 80 86 30 53 77 70 15 70 ...
##  $ finishing                    : int  95 81 75 77 22 52 88 93 13 89 ...
##  $ heading_accuracy             : int  70 52 75 56 83 83 77 77 25 89 ...
##  $ short_passing                : int  92 91 86 85 68 79 82 81 55 78 ...
##  $ volleys                      : int  86 80 85 74 14 45 78 85 11 90 ...
##  $ dribbling                    : int  97 84 87 90 69 70 90 89 30 80 ...
##  $ curve                        : int  93 86 85 87 28 60 77 82 14 77 ...
##  $ freekick_accuracy            : int  94 87 82 77 28 70 63 73 11 76 ...
##  $ long_passing                 : int  89 89 90 78 60 81 73 64 59 52 ...
##  $ ball_control                 : int  96 91 90 93 63 76 91 89 46 82 ...
##  $ acceleration                 : int  91 76 71 94 70 74 96 88 54 75 ...
##  $ sprint_speed                 : int  86 73 79 86 75 77 96 80 60 76 ...
##  $ agility                      : int  93 80 76 94 50 61 92 86 51 77 ...
##  $ reactions                    : int  95 88 82 83 82 87 87 90 84 91 ...
##  $ balance                      : int  95 81 66 93 40 49 83 91 35 59 ...
##  $ shot_power                   : int  85 84 90 75 55 81 79 88 25 87 ...
##  $ jumping                      : int  68 50 83 53 81 88 75 81 77 88 ...
##  $ stamina                      : int  72 92 88 75 75 75 83 76 43 92 ...
##  $ strength                     : int  66 58 87 44 94 92 71 73 80 78 ...
##  $ long_shots                   : int  94 89 82 84 15 64 78 83 16 79 ...
##  $ aggression                   : int  48 46 78 34 87 82 62 65 29 84 ...
##  $ interceptions                : int  22 56 64 26 88 88 38 24 30 48 ...
##  $ positioning                  : int  94 84 82 83 24 41 88 92 12 93 ...
##  $ vision                       : int  94 91 88 87 49 60 82 83 70 77 ...
##  $ penalties                    : int  75 67 82 61 33 62 70 83 47 85 ...
##  $ composure                    : int  96 88 87 83 80 87 86 90 70 82 ...
##  $ marking                      : int  33 59 63 51 91 90 34 30 17 52 ...
##  $ standing_tackle              : int  28 57 67 24 88 89 34 20 10 45 ...
##  $ sliding_tackle               : int  26 22 67 22 87 84 32 12 11 39 ...

EDA

Firstly, we will check if there are any missing values.

# Number of observations in the data set
dim(players)[1]

## [1] 17954

# Number of missing values per column
sort(colSums(is.na(players)), decreasing = T)

##               national_rating        national_jersey_number 
##                         17097                         17097 
##           release_clause_euro                    value_euro 
##                          1837                           255 
##                     wage_euro                          name 
##                           246                             0 
##                     full_name                    birth_date 
##                             0                             0 
##                           age                     height_cm 
##                             0                             0 
##                    weight_kgs                     positions 
##                             0                             0 
##                   nationality                overall_rating 
##                             0                             0 
##                     potential                preferred_foot 
##                             0                             0 
## international_reputation.1.5.                weak_foot.1.5. 
##                             0                             0 
##              skill_moves.1.5.                     body_type 
##                             0                             0 
##                 national_team        national_team_position 
##                             0                             0 
##                      crossing                     finishing 
##                             0                             0 
##              heading_accuracy                 short_passing 
##                             0                             0 
##                       volleys                     dribbling 
##                             0                             0 
##                         curve             freekick_accuracy 
##                             0                             0 
##                  long_passing                  ball_control 
##                             0                             0 
##                  acceleration                  sprint_speed 
##                             0                             0 
##                       agility                     reactions 
##                             0                             0 
##                       balance                    shot_power 
##                             0                             0 
##                       jumping                       stamina 
##                             0                             0 
##                      strength                    long_shots 
##                             0                             0 
##                    aggression                 interceptions 
##                             0                             0 
##                   positioning                        vision 
##                             0                             0 
##                     penalties                     composure 
##                             0                             0 
##                       marking               standing_tackle 
##                             0                             0 
##                sliding_tackle 
##                             0

As we can see, there are missing values in a few columns. The columns national_rating and national_jersey_number contain 17097 NAs out of 17954 records, so both are dropped. The remaining columns with missing values are release_clause_euro, value_euro and wage_euro. These have far fewer missing values, so we will impute them with the mean.

# Data set without columns national_rating and national_jersey_number
players <- subset(players, select = -c(national_rating, national_jersey_number))
# Now there are 3 columns with missing values left - release_clause_euro, value_euro, wage_euro
columns_to_input_NA = c("release_clause_euro", "value_euro", "wage_euro")
# Loop to impute missing values with the column mean
for (column in columns_to_input_NA){
  players[,column] <- ifelse(is.na(players[,column]), mean(players[,column], na.rm = T), 
                           players[,column])
}
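To make the effect of mean imputation concrete, here is a small sketch on a toy vector. The values are hypothetical and not taken from the data set, and the median variant is shown only as a possible alternative, not as the approach used in this article.

```r
# Hypothetical right-skewed wages with two missing entries
wage <- c(1000, 1000, 3000, NA, 9000, NA, 50000)

# Mean imputation, as in the loop above
wage_mean <- ifelse(is.na(wage), mean(wage, na.rm = TRUE), wage)

# Median imputation, an alternative that is less sensitive to extreme values
wage_median <- ifelse(is.na(wage), median(wage, na.rm = TRUE), wage)

# Compare the filled-in values under both strategies
data.frame(original = wage, mean_imputed = wage_mean, median_imputed = wage_median)
```

For strongly right-skewed variables the mean lies well above the median (for wage_euro the summary further down reports a mean of 9902 against a median of 3000), so the imputed values sit in the upper part of the distribution.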

Now we will visualize some variables to get familiar with the data set.
From the histogram of the variable age we can see that the distribution is right-skewed (there are more extreme values in the right tail). We can also observe that the greatest share of players in our data set are 21-27 years old.

# Distribution of the variable age
ggplot(data = players, aes(x = age)) +
  geom_histogram(bins = 30, colour = "black", fill = "lightblue")

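Moving on to the variable height_cm, the histogram shows that the greatest share of football players are either 181-186 cm or 151-156 cm tall. The second group is implausibly short for professional players and may point to a data quality problem in the source.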

# Distribution of the variable height_cm
ggplot(data = players, aes(x = height_cm)) +
  geom_histogram(bins = 10, colour = "black", fill = "lightblue")

# How many players prefer the left foot compared to the right?
ggplot(data = players, aes(x = preferred_foot)) +
  geom_bar(fill = c("grey", "lightblue"))

In this step we will prepare our data set for dimension reduction. To do so, we extract the numeric columns from the data set and standardize them. Standardization is important before PCA: to give every feature “a chance”, the data need to be scaled so that all features have equal variance.

# Preparing data for dimension reduction: keep the names of all non-character columns
numeric_columns <- names(players)[!sapply(players, is.character)]
# Extracting numeric variables from the data set
players_numeric <- players[, numeric_columns]
# Statistical glimpse at the numeric data
summary(players_numeric)

##       age          height_cm       weight_kgs    overall_rating 
##  Min.   :17.00   Min.   :152.4   Min.   : 49.9   Min.   :47.00  
##  1st Qu.:22.00   1st Qu.:154.9   1st Qu.: 69.9   1st Qu.:62.00  
##  Median :25.00   Median :175.3   Median : 74.8   Median :66.00  
##  Mean   :25.57   Mean   :174.9   Mean   : 75.3   Mean   :66.24  
##  3rd Qu.:29.00   3rd Qu.:185.4   3rd Qu.: 79.8   3rd Qu.:71.00  
##  Max.   :46.00   Max.   :205.7   Max.   :110.2   Max.   :94.00  
##    potential       value_euro          wage_euro     
##  Min.   :48.00   Min.   :    10000   Min.   :  1000  
##  1st Qu.:67.00   1st Qu.:   325000   1st Qu.:  1000  
##  Median :71.00   Median :   725000   Median :  3000  
##  Mean   :71.43   Mean   :  2479280   Mean   :  9902  
##  3rd Qu.:75.00   3rd Qu.:  2300000   3rd Qu.:  9902  
##  Max.   :95.00   Max.   :110500000   Max.   :565000  
##  international_reputation.1.5. weak_foot.1.5.  skill_moves.1.5.
##  Min.   :1.000                 Min.   :1.000   Min.   :1.000   
##  1st Qu.:1.000                 1st Qu.:3.000   1st Qu.:2.000   
##  Median :1.000                 Median :3.000   Median :2.000   
##  Mean   :1.112                 Mean   :2.946   Mean   :2.361   
##  3rd Qu.:1.000                 3rd Qu.:3.000   3rd Qu.:3.000   
##  Max.   :5.000                 Max.   :5.000   Max.   :5.000   
##  release_clause_euro    crossing      finishing     heading_accuracy
##  Min.   :    13000   Min.   : 5.0   Min.   : 2.00   Min.   : 4.00   
##  1st Qu.:   581000   1st Qu.:38.0   1st Qu.:30.00   1st Qu.:44.00   
##  Median :  1400000   Median :54.0   Median :49.00   Median :56.00   
##  Mean   :  4622522   Mean   :49.7   Mean   :45.36   Mean   :52.15   
##  3rd Qu.:  4622522   3rd Qu.:64.0   3rd Qu.:62.00   3rd Qu.:64.00   
##  Max.   :226500000   Max.   :93.0   Max.   :95.00   Max.   :94.00   
##  short_passing      volleys        dribbling         curve     
##  Min.   : 7.00   Min.   : 3.00   Min.   : 4.00   Min.   : 6.0  
##  1st Qu.:53.00   1st Qu.:30.00   1st Qu.:49.00   1st Qu.:34.0  
##  Median :62.00   Median :44.00   Median :61.00   Median :49.0  
##  Mean   :58.57   Mean   :42.76   Mean   :55.28   Mean   :47.1  
##  3rd Qu.:68.00   3rd Qu.:57.00   3rd Qu.:68.00   3rd Qu.:62.0  
##  Max.   :93.00   Max.   :90.00   Max.   :97.00   Max.   :94.0  
##  freekick_accuracy  long_passing    ball_control    acceleration 
##  Min.   : 3.00     Min.   : 9.00   Min.   : 5.00   Min.   :12.0  
##  1st Qu.:30.00     1st Qu.:43.00   1st Qu.:54.00   1st Qu.:57.0  
##  Median :41.00     Median :56.00   Median :63.00   Median :67.0  
##  Mean   :42.69     Mean   :52.67   Mean   :58.22   Mean   :64.7  
##  3rd Qu.:56.00     3rd Qu.:64.00   3rd Qu.:69.00   3rd Qu.:75.0  
##  Max.   :94.00     Max.   :93.00   Max.   :96.00   Max.   :97.0  
##   sprint_speed     agility        reactions        balance        shot_power   
##  Min.   :12.0   Min.   :11.00   Min.   :24.00   Min.   :16.00   Min.   : 2.00  
##  1st Qu.:58.0   1st Qu.:55.00   1st Qu.:56.00   1st Qu.:56.00   1st Qu.:45.00  
##  Median :67.0   Median :66.00   Median :62.00   Median :66.00   Median :59.00  
##  Mean   :64.8   Mean   :63.38   Mean   :61.82   Mean   :63.87   Mean   :55.32  
##  3rd Qu.:75.0   3rd Qu.:74.00   3rd Qu.:68.00   3rd Qu.:74.00   3rd Qu.:68.00  
##  Max.   :96.0   Max.   :96.00   Max.   :96.00   Max.   :96.00   Max.   :95.00  
##     jumping         stamina         strength       long_shots   
##  Min.   :15.00   Min.   :12.00   Min.   :20.00   Min.   : 3.00  
##  1st Qu.:58.00   1st Qu.:56.00   1st Qu.:58.00   1st Qu.:32.00  
##  Median :66.00   Median :66.00   Median :66.00   Median :51.00  
##  Mean   :64.96   Mean   :63.13   Mean   :65.16   Mean   :46.85  
##  3rd Qu.:73.00   3rd Qu.:74.00   3rd Qu.:74.00   3rd Qu.:62.00  
##  Max.   :95.00   Max.   :97.00   Max.   :97.00   Max.   :94.00  
##    aggression    interceptions    positioning        vision     
##  Min.   :11.00   Min.   : 3.00   Min.   : 2.00   Min.   :10.00  
##  1st Qu.:44.00   1st Qu.:26.00   1st Qu.:38.00   1st Qu.:44.00  
##  Median :59.00   Median :52.00   Median :55.00   Median :55.00  
##  Mean   :55.82   Mean   :46.66   Mean   :49.86   Mean   :53.41  
##  3rd Qu.:69.00   3rd Qu.:64.00   3rd Qu.:64.00   3rd Qu.:64.00  
##  Max.   :95.00   Max.   :92.00   Max.   :95.00   Max.   :94.00  
##    penalties       composure        marking      standing_tackle
##  Min.   : 5.00   Min.   :12.00   Min.   : 3.00   Min.   : 2.00  
##  1st Qu.:38.00   1st Qu.:51.00   1st Qu.:30.00   1st Qu.:27.00  
##  Median :49.00   Median :60.00   Median :52.50   Median :55.00  
##  Mean   :48.36   Mean   :58.68   Mean   :47.16   Mean   :47.73  
##  3rd Qu.:60.00   3rd Qu.:67.00   3rd Qu.:64.00   3rd Qu.:66.00  
##  Max.   :92.00   Max.   :96.00   Max.   :94.00   Max.   :93.00  
##  sliding_tackle 
##  Min.   : 3.00  
##  1st Qu.:24.00  
##  Median :52.00  
##  Mean   :45.71  
##  3rd Qu.:64.00  
##  Max.   :90.00

# Standardization of the data (each column is centered to mean 0 and scaled to standard deviation 1)
players_normalized <- as.data.frame(scale(players_numeric))
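What standardization does can be checked with a short sketch. It uses the built-in USArrests data as a stand-in (an assumption for illustration) and shows that, after scaling, the covariance matrix coincides with the correlation matrix of the original data.

```r
# Stand-in data: built-in USArrests (illustrative assumption)
X <- USArrests

# Center each column to mean 0 and scale it to standard deviation 1
Z <- scale(X)

# Column means are (numerically) zero and standard deviations are one
col_means <- colMeans(Z)
col_sds <- apply(Z, 2, sd)

# Covariance of the standardized data equals the correlation of the raw data
cov_equals_cor <- isTRUE(all.equal(cov(Z), cor(X)))
```

This is why a PCA on standardized data is equivalent to a PCA on the correlation matrix: no single variable dominates simply because it is measured on a larger scale.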

As a preliminary step before PCA, we check the correlations between the variables. From the correlation plot we can see that most of the variables are highly linearly correlated (Pearson correlation). In terms of dimension reduction, this suggests that PCA will perform well on this data set. Dimension reduction should then yield the most compact representation of the samples, i.e. their intrinsic dimension.

# Correlation plot
corrplot(cor(players_normalized, method = "pearson"), order ="alphabet", tl.cex=0.6, diag = F, method = "square", type = "upper")

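We calculate the covariance matrix and then its eigenvalues. Because the data are standardized, this covariance matrix equals the correlation matrix. The purpose of this step is to get a first impression before performing principal component analysis, which is based on the covariance matrix.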

# Calculate covariance matrix
players_normalized.cov <- cov(players_normalized)
# Calculate eigenvalues
players_normalized.eig <- eigen(players_normalized.cov)
# Eigenvalues
players_normalized.eig$values

##  [1] 17.79950825  5.22103481  4.03978147  2.25898981  1.42468465  1.24156248
##  [7]  0.93528607  0.82791238  0.66260601  0.57864851  0.45357312  0.43130198
## [13]  0.34899667  0.30009003  0.26842081  0.26023164  0.23818763  0.23177669
## [19]  0.21832030  0.20085740  0.19385009  0.19064559  0.18161265  0.16790215
## [25]  0.16509808  0.14015368  0.13076938  0.12016480  0.11302460  0.10170080
## [31]  0.08664520  0.07830514  0.06905778  0.06469179  0.06307170  0.05817814
## [37]  0.05130063  0.03564935  0.02398953  0.02241819
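The eigenvalues can be turned into shares of explained variance by hand. The sketch below copies the six largest eigenvalues from the output above and relies on the fact that the eigenvalues of a correlation matrix sum to the number of variables, here 40.

```r
# Six largest eigenvalues, copied from the output above
ev_top <- c(17.79950825, 5.22103481, 4.03978147,
            2.25898981, 1.42468465, 1.24156248)

# Number of standardized variables; the eigenvalues of a correlation
# matrix sum to this value (its trace)
p <- 40

# Share of total variance explained by each of the first six components
share <- ev_top / p

# Cumulative share explained by the first k components
cum_share <- cumsum(share)

round(100 * share, 2)
round(100 * cum_share, 2)

# A common alternative rule of thumb (Kaiser criterion, not the rule used
# in this article): count the components with an eigenvalue above 1
n_kaiser <- sum(ev_top > 1)
```

Only the first six eigenvalues exceed 1, so the Kaiser count is not affected by leaving out the remaining 34 values.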

Dimension reduction

The dimension reduction technique used to extract the intrinsic dimension of our data set is Principal Component Analysis (PCA).

PCA

How many principal components should we choose as a representation?
We have to choose an appropriate number based on the cumulative variance explained and additional plots.

# Perform PCA (the data are already standardized, so no further centering or scaling)
PCA <- prcomp(players_normalized, center = FALSE, scale. = FALSE)
# Summary of principal components
summary(PCA)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     4.219 2.2850 2.0099 1.50299 1.19360 1.11425 0.96710
## Proportion of Variance 0.445 0.1305 0.1010 0.05647 0.03562 0.03104 0.02338
## Cumulative Proportion  0.445 0.5755 0.6765 0.73298 0.76860 0.79964 0.82302
##                           PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     0.9099 0.81401 0.76069 0.67348 0.65674 0.59076 0.5478
## Proportion of Variance 0.0207 0.01657 0.01447 0.01134 0.01078 0.00872 0.0075
## Cumulative Proportion  0.8437 0.86028 0.87475 0.88609 0.89687 0.90560 0.9131
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.51809 0.51013 0.48804 0.48143 0.46725 0.44817 0.44028
## Proportion of Variance 0.00671 0.00651 0.00595 0.00579 0.00546 0.00502 0.00485
## Cumulative Proportion  0.91981 0.92632 0.93227 0.93806 0.94352 0.94854 0.95339
##                           PC22    PC23   PC24    PC25   PC26    PC27   PC28
## Standard deviation     0.43663 0.42616 0.4098 0.40632 0.3744 0.36162 0.3466
## Proportion of Variance 0.00477 0.00454 0.0042 0.00413 0.0035 0.00327 0.0030
## Cumulative Proportion  0.95816 0.96270 0.9669 0.97102 0.9745 0.97780 0.9808
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.33619 0.31891 0.29436 0.27983 0.26279 0.25435 0.25114
## Proportion of Variance 0.00283 0.00254 0.00217 0.00196 0.00173 0.00162 0.00158
## Cumulative Proportion  0.98362 0.98617 0.98833 0.99029 0.99202 0.99363 0.99521
##                           PC36    PC37    PC38   PC39    PC40
## Standard deviation     0.24120 0.22650 0.18881 0.1549 0.14973
## Proportion of Variance 0.00145 0.00128 0.00089 0.0006 0.00056
## Cumulative Proportion  0.99667 0.99795 0.99884 0.9994 1.00000

For this analysis, we consider a principal component significant when it explains more than 10% of the total variance. Applying this rule to the scree plot below, we keep 3 principal components. Together they explain about 68% of the total variance. This is a substantial reduction: from 40 dimensions we obtain only 3 intrinsic dimensions.

# Plot of explained variance via principal components
fviz_eig(PCA, main = "Scree plot", barfill = "lightblue", barcolor = "lightblue")

get_eigenvalue(PCA)

##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  17.79950825      44.49877062                    44.49877
## Dim.2   5.22103481      13.05258703                    57.55136
## Dim.3   4.03978147      10.09945367                    67.65081
## Dim.4   2.25898981       5.64747451                    73.29829
## Dim.5   1.42468465       3.56171163                    76.86000
## Dim.6   1.24156248       3.10390621                    79.96390
## Dim.7   0.93528607       2.33821518                    82.30212
## Dim.8   0.82791238       2.06978096                    84.37190
## Dim.9   0.66260601       1.65651503                    86.02841
## Dim.10  0.57864851       1.44662127                    87.47504
## Dim.11  0.45357312       1.13393279                    88.60897
## Dim.12  0.43130198       1.07825496                    89.68722
## Dim.13  0.34899667       0.87249166                    90.55972
## Dim.14  0.30009003       0.75022509                    91.30994
## Dim.15  0.26842081       0.67105203                    91.98099
## Dim.16  0.26023164       0.65057909                    92.63157
## Dim.17  0.23818763       0.59546907                    93.22704
## Dim.18  0.23177669       0.57944171                    93.80648
## Dim.19  0.21832030       0.54580075                    94.35228
## Dim.20  0.20085740       0.50214351                    94.85443
## Dim.21  0.19385009       0.48462523                    95.33905
## Dim.22  0.19064559       0.47661398                    95.81567
## Dim.23  0.18161265       0.45403163                    96.26970
## Dim.24  0.16790215       0.41975538                    96.68945
## Dim.25  0.16509808       0.41274521                    97.10220
## Dim.26  0.14015368       0.35038420                    97.45258
## Dim.27  0.13076938       0.32692345                    97.77951
## Dim.28  0.12016480       0.30041201                    98.07992
## Dim.29  0.11302460       0.28256151                    98.36248
## Dim.30  0.10170080       0.25425201                    98.61673
## Dim.31  0.08664520       0.21661299                    98.83334
## Dim.32  0.07830514       0.19576284                    99.02911
## Dim.33  0.06905778       0.17264445                    99.20175
## Dim.34  0.06469179       0.16172948                    99.36348
## Dim.35  0.06307170       0.15767924                    99.52116
## Dim.36  0.05817814       0.14544536                    99.66661
## Dim.37  0.05130063       0.12825157                    99.79486
## Dim.38  0.03564935       0.08912338                    99.88398
## Dim.39  0.02398953       0.05997383                    99.94395
## Dim.40  0.02241819       0.05604546                   100.00000

The blue vertical line in the cumulative variance plot marks the number of principal components used in further analysis.

# Cumulative variance plot (cumulative proportion converted to percent)
plot(100 * summary(PCA)$importance[3,], type = "l", ylab = "Variance explained (%)", 
     xlab = "Number of principal components", main = "Cumulative variance explained (%)")
abline(v = 3, col = "lightblue", lwd = 2, lty = 2)
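The 10% rule used to pick the number of components can be written as a few lines of code. This is a sketch on the built-in USArrests data (an assumption for illustration); the variable names are hypothetical and the threshold is the one stated in the text.

```r
# Stand-in data: PCA on standardized USArrests (illustrative assumption)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Keep every component that explains more than 10% of the variance
threshold <- 0.10
n_keep <- sum(prop_var > threshold)

# Cumulative variance explained by the retained components
cum_var_kept <- cumsum(prop_var)[n_keep]
```

Applied to the footballers PCA, the same computation should keep the first three components, since the fourth explains only about 5.6% of the variance.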

The table below lists the variables ordered from most to least significant in terms of their contribution to the variance explained by the first principal component. We also use bar plots to visualize which variables are the most important for the first, second and third principal components.

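# Loadings of the first principal component
loading_scores_PC_1 <- PCA$rotation[,1]
# Rank variables by absolute loading (the sign of a component is arbitrary)
fac_scores_PC_1 <- abs(loading_scores_PC_1)
fac_scores_PC_1_ranked <- names(sort(fac_scores_PC_1, decreasing=T))
# Signed loadings, ordered from largest to smallest absolute value
PCA$rotation[fac_scores_PC_1_ranked, 1]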

##                  ball_control                     dribbling 
##                   -0.22551237                   -0.21963050 
##                 short_passing                   positioning 
##                   -0.21761239                   -0.20858994 
##                         curve                      crossing 
##                   -0.20858367                   -0.20796421 
##                    long_shots                    shot_power 
##                   -0.20763032                   -0.20496157 
##              skill_moves.1.5.                       volleys 
##                   -0.19792516                   -0.19696285 
##                  long_passing             freekick_accuracy 
##                   -0.19654245                   -0.19498512 
##                     finishing                     penalties 
##                   -0.19006914                   -0.18837896 
##                        vision                       stamina 
##                   -0.18543801                   -0.18523694 
##                     composure                       agility 
##                   -0.18137241                   -0.17728369 
##                  acceleration                  sprint_speed 
##                   -0.16884020                   -0.16588564 
##              heading_accuracy                       balance 
##                   -0.15485224                   -0.15172009 
##                    aggression                overall_rating 
##                   -0.14374574                   -0.13642395 
##                     reactions                       marking 
##                   -0.13431661                   -0.11669634 
##                 interceptions               standing_tackle 
##                   -0.11165036                   -0.10839623 
##                sliding_tackle                    value_euro 
##                   -0.10055821                   -0.09971576 
##                     potential           release_clause_euro 
##                   -0.09624621                   -0.09446493 
##                weak_foot.1.5.                     wage_euro 
##                   -0.09248866                   -0.09184609 
##                     height_cm                    weight_kgs 
##                    0.08296352                    0.07939504 
## international_reputation.1.5.                       jumping 
##                   -0.07475227                   -0.06496160 
##                           age                      strength 
##                   -0.04085795                   -0.02902935

For the first principal component, the 5 variables that contribute most to the explained variance are:
- ball_control
- dribbling
- short_passing
- positioning
- curve

fviz_contrib(PCA, "var", axes=1, xtickslab.rt=90, color = "lightblue", fill = "lightblue")
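The ranking by absolute loading and the contribution bar plot are closely related. As a sketch, and to my understanding of how contributions are defined for a single axis, the contribution of a variable in percent is its squared loading times 100. The example uses the built-in USArrests data as a stand-in (an assumption for illustration).

```r
# Stand-in data: PCA on standardized USArrests (illustrative assumption)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings of the first principal component (a unit-length vector)
loadings_pc1 <- pca$rotation[, 1]

# Contribution of each variable to PC1 in percent
contrib_pc1 <- 100 * loadings_pc1^2

# Variables ordered from largest to smallest contribution
sort(contrib_pc1, decreasing = TRUE)

# Contributions sum to 100 because the loading vector has length 1
total_contrib <- sum(contrib_pc1)

# With p variables, an equal share would be 100 / p percent each
expected_equal_share <- 100 / length(loadings_pc1)
```

Squaring removes the sign, so this ranking is identical to the ranking by absolute loading used above.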

For the second principal component the 5 top variables are:
- strength
- interceptions
- standing_tackle
- sliding_tackle
- marking

fviz_contrib(PCA, "var", axes=2, xtickslab.rt=90, color = "lightblue", fill = "lightblue")

For the third principal component, the top five variables are:
- value_euro
- release_clause_euro
- wage_euro
- international_reputation.1.5.
- sliding_tackle

fviz_contrib(PCA, "var", axes=3, xtickslab.rt=90, color = "lightblue", fill = "lightblue")

The correlation circle for the first two principal components tells us a lot about the variables. Positively correlated variables point in similar directions and group together, whereas negatively correlated variables sit on opposite sides of the plot origin (in opposed quadrants). The distance between a variable and the origin measures how well that variable is represented on the factor map: the farther a variable lies from the origin, the better its representation.

fviz_pca_var(PCA, col.var="lightblue", repel = T)
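To make the reading of the correlation circle concrete, here is a minimal sketch in base R. It assumes standardized input and uses the built-in mtcars data as a stand-in for players_normalized; prcomp is used instead of the PCA object from this article. The coordinates of a variable on the map are its loadings scaled by the component standard deviations, and these equal the correlations between the variable and the component scores.

```r
# Stand-in data (assumption): built-in mtcars replaces players_normalized
X <- scale(mtcars)
pca <- prcomp(X)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# The same numbers are the correlations between variables and component scores
var_cor <- cor(X, pca$x)
max(abs(var_coord - var_cor))  # numerically zero

# Distance of every variable from the origin on the PC1-PC2 map;
# it cannot exceed 1, the radius of the correlation circle
dist_origin <- sqrt(rowSums(var_coord[, 1:2]^2))
sort(dist_origin, decreasing = TRUE)
```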

Based on the analysis above, we keep three principal components as the intrinsic dimensions of the data.
Let’s check the quality of the representation using cos2 (the squared coordinates of the variables).

# PCA with 3 principal components
res.pca_3comp <- PCA(players_normalized, scale.unit = F, ncp = 3, graph = F)
var <- get_pca_var(res.pca_3comp)
# Visualize the cos2 of variables on all the dimensions 
corrplot(var$cos2, is.corr = F)

A high cos2 indicates that the variable is well represented on the principal component, whereas a low cos2 indicates that the components do not capture the variable well. The same can be read from the correlation circle: the closer a variable lies to the circle (the higher its cos2), the better its representation on the factor map and the more it matters when interpreting the components.

# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca_3comp, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = T, alpha.var = "cos2")
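The cos2 values themselves are easy to compute by hand. The following sketch is an illustration only: it assumes standardized data, uses the built-in mtcars set in place of players_normalized, and relies on base R's prcomp. Over all components the cos2 of a variable sums to 1, so the sum over the retained components is the share of that variable's variance that is kept.

```r
# Stand-in data (assumption): built-in mtcars replaces players_normalized
X <- scale(mtcars)
pca <- prcomp(X)

# Squared variable coordinates give cos2
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)
cos2 <- var_coord^2

# Over all components every variable sums to 1
rowSums(cos2)

# Quality of representation on the first three components,
# sorted from the worst to the best represented variable
quality_3 <- rowSums(cos2[, 1:3])
sort(quality_3)
```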

Rotated PCA

rot_pca <- principal(players_normalized, nfactors=3, rotate="varimax")
rot_pca

## Principal Components Analysis
## Call: principal(r = players_normalized, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                                 RC1   RC2   RC3   h2    u2 com
## age                           -0.05  0.24  0.35 0.18 0.816 1.8
## height_cm                     -0.51  0.01  0.22 0.31 0.691 1.4
## weight_kgs                    -0.58  0.10  0.34 0.46 0.539 1.7
## overall_rating                 0.22  0.27  0.81 0.77 0.227 1.4
## potential                      0.21  0.06  0.59 0.40 0.600 1.3
## value_euro                     0.15  0.01  0.85 0.74 0.263 1.1
## wage_euro                      0.11  0.04  0.81 0.66 0.338 1.0
## international_reputation.1.5.  0.07  0.03  0.71 0.52 0.484 1.0
## weak_foot.1.5.                 0.39  0.00  0.17 0.18 0.819 1.4
## skill_moves.1.5.               0.82  0.14  0.23 0.75 0.251 1.2
## release_clause_euro            0.13  0.01  0.82 0.69 0.313 1.1
## crossing                       0.81  0.36  0.13 0.80 0.197 1.4
## finishing                      0.87 -0.05  0.24 0.82 0.184 1.2
## heading_accuracy               0.39  0.65  0.20 0.61 0.390 1.8
## short_passing                  0.74  0.50  0.26 0.87 0.133 2.1
## volleys                        0.84  0.05  0.29 0.79 0.213 1.2
## dribbling                      0.91  0.24  0.15 0.92 0.084 1.2
## curve                          0.85  0.20  0.23 0.81 0.188 1.3
## freekick_accuracy              0.77  0.21  0.23 0.70 0.303 1.3
## long_passing                   0.62  0.54  0.25 0.74 0.259 2.3
## ball_control                   0.85  0.38  0.23 0.92 0.083 1.5
## acceleration                   0.81  0.11 -0.05 0.67 0.333 1.0
## sprint_speed                   0.77  0.15 -0.03 0.61 0.390 1.1
## agility                        0.84  0.06  0.02 0.72 0.284 1.0
## reactions                      0.25  0.28  0.72 0.65 0.347 1.6
## balance                        0.78  0.05 -0.13 0.63 0.371 1.1
## shot_power                     0.78  0.25  0.29 0.76 0.243 1.5
## jumping                        0.09  0.38  0.14 0.17 0.828 1.4
## stamina                        0.61  0.60  0.09 0.73 0.271 2.0
## strength                      -0.25  0.52  0.36 0.47 0.531 2.3
## long_shots                     0.85  0.15  0.27 0.82 0.180 1.3
## aggression                     0.26  0.82  0.18 0.77 0.233 1.3
## interceptions                  0.09  0.93  0.07 0.88 0.116 1.0
## positioning                    0.89  0.13  0.20 0.85 0.145 1.1
## vision                         0.73  0.08  0.37 0.68 0.319 1.5
## penalties                      0.79  0.09  0.25 0.70 0.304 1.2
## composure                      0.50  0.37  0.57 0.71 0.293 2.7
## marking                        0.14  0.91  0.04 0.85 0.150 1.1
## standing_tackle                0.10  0.94  0.00 0.90 0.103 1.0
## sliding_tackle                 0.07  0.93 -0.03 0.88 0.123 1.0
## 
##                         RC1  RC2  RC3
## SS loadings           14.34 6.83 5.89
## Proportion Var         0.36 0.17 0.15
## Cumulative Var         0.36 0.53 0.68
## Proportion Explained   0.53 0.25 0.22
## Cumulative Proportion  0.53 0.78 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  116327.5  with prob <  0 
## 
## Fit based upon off diagonal values = 0.98
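The summary rows of this output follow directly from the loading matrix. The sketch below shows the arithmetic with base R only. As assumptions, mtcars stands in for players_normalized, and stats::varimax is applied to unrotated prcomp loadings instead of calling principal. Unlike principal, this sketch does not reorder the rotated components by size.

```r
# Stand-in data (assumption): built-in mtcars replaces players_normalized
X <- scale(mtcars)
pca <- prcomp(X)
k <- 3

# Unrotated loadings of the first k components, then varimax rotation
unrot <- sweep(pca$rotation[, 1:k], 2, pca$sdev[1:k], `*`)
rot <- unclass(varimax(unrot)$loadings)

# The rotation leaves the communality (h2) of every variable unchanged
h2_unrot <- rowSums(unrot^2)
h2_rot <- rowSums(rot^2)

# Summary rows as in the printed output
ss_loadings <- colSums(rot^2)                      # SS loadings
prop_var <- ss_loadings / ncol(X)                  # Proportion Var
cum_var <- cumsum(prop_var)                        # Cumulative Var
prop_explained <- ss_loadings / sum(ss_loadings)   # Proportion Explained
round(rbind(ss_loadings, prop_var, cum_var, prop_explained), 2)
```

In the output above the same arithmetic gives, for example, 14.34 / 40 ≈ 0.36 for the proportion of variance of RC1. The rotation only redistributes variance among the three components, so the cumulative 68% is the same as before the rotation.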

Complexity and uniqueness of the variables
In simple words, the complexity of a variable tells us how many factors are needed to account for it, that is, on how many factors the variable loads substantially. The lower the complexity the better, because the factors become easier to interpret. To get the intuition: a complexity close to 1 means that the variable has a relatively large loading on only one factor and its remaining loadings are close to zero.

Variables in our initial data set that have the highest complexity are:
- composure
- long_passing
- strength
- short_passing
- stamina

# Complexity plot
plot(rot_pca$complexity, ylab = "Complexity")
abline(h = mean(rot_pca$complexity), col = "lightblue", lwd = 1, lty = 2)

# Data frame with complexity and uniqueness values for every variable
comp_uniq <- data.frame(complexity = rot_pca$complexity, uniqueness = rot_pca$uniquenesses)
# Top 5 complexity values
head(comp_uniq[order(comp_uniq$complexity, decreasing = T),], 5)

##               complexity uniqueness
## composure       2.729481  0.2929731
## long_passing    2.302919  0.2585065
## strength        2.266275  0.5307218
## short_passing   2.050271  0.1328398
## stamina         2.042112  0.2710557
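The complexity reported by principal is Hofmann's index, the squared sum of squared loadings divided by the sum of fourth powers of the loadings. The sketch below recomputes it from the rounded loadings printed in the output above, so the results agree with the table only up to rounding. The helper name hofmann_complexity is made up for this illustration.

```r
# Hofmann's complexity index for one variable's loadings
# (helper name is hypothetical, introduced for this example)
hofmann_complexity <- function(a) sum(a^2)^2 / sum(a^4)

# Rounded loadings taken from the rotated PCA output above
composure     <- c(RC1 = 0.50, RC2 = 0.37, RC3 = 0.57)
interceptions <- c(RC1 = 0.09, RC2 = 0.93, RC3 = 0.07)

hofmann_complexity(composure)      # about 2.7: spread over all three components
hofmann_complexity(interceptions)  # about 1.0: loads on RC2 only
```

The index equals 1 when a variable loads on a single component and reaches the number of components when all loadings are equal in size.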

The uniqueness of a variable is the proportion of its variance that is not shared with the other variables, i.e. the part that the retained components do not reproduce (the u2 column, equal to 1 - h2). We want uniqueness to stay low, because the space is then easier to reduce to a smaller number of dimensions (we retain a larger share of the variance). A low uniqueness also means that the variable carries little additional information beyond the other variables in the model.

Variables in our initial data set that have the highest uniqueness are:
- jumping
- weak_foot.1.5.
- age
- height_cm
- potential

# Uniqueness plot
plot(rot_pca$uniqueness, ylab = "Uniqueness")
abline(h = mean(rot_pca$uniqueness), col = "lightblue", lwd = 1, lty = 2)

# Top 5 uniqueness values
head(comp_uniq[order(comp_uniq$uniqueness, decreasing = T),], 5)

##                complexity uniqueness
## jumping          1.375521  0.8283828
## weak_foot.1.5.   1.385093  0.8189360
## age              1.811937  0.8164792
## height_cm        1.376557  0.6914984
## potential        1.268423  0.5999152
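Uniqueness can be checked by hand in the same way: it is one minus the communality, and the communality is the sum of squared loadings of a variable over the retained components. The sketch below uses the rounded loadings printed in the rotated PCA output, so the values match the table only up to rounding.

```r
# Rounded loadings taken from the rotated PCA output above
loadings <- rbind(
  jumping         = c(RC1 = 0.09, RC2 = 0.38, RC3 = 0.14),
  standing_tackle = c(RC1 = 0.10, RC2 = 0.94, RC3 = 0.00)
)

# Communality (h2): variance reproduced by the three components
h2 <- rowSums(loadings^2)

# Uniqueness (u2): variance left unexplained
u2 <- 1 - h2
round(cbind(h2, u2), 3)
```

For jumping this gives a uniqueness of about 0.83, which is why it tops the list: the three components reproduce only about 17% of its variance.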

t-SNE

As a second dimension reduction method we use t-SNE (t-Distributed Stochastic Neighbor Embedding). It is helpful for a first look at the clustering structure of the data set. t-SNE is a nonlinear dimensionality reduction technique that embeds high-dimensional data in a low-dimensional space for visualization (in our case two dimensions, which we plot immediately). The mapping is built so that points lying close to each other in the low-dimensional (usually 2- or 3-dimensional) space are also similar in the high-dimensional input space. The t-SNE plot gives a rough idea of how many clusters to expect before we apply clustering techniques based on minimizing the within-cluster sum of squares.

# Perform t-SNE dimension reduction technique
# (t-SNE is stochastic, so fix the seed to make the plot reproducible)
set.seed(123)
tsne <- Rtsne(players_normalized)
# Change 2D output to data frame
tsne_df <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2])
# Visualization in 2D scatter plot
ggplot(tsne_df, aes(x = x, y = y)) +
  geom_point(color = "tomato") +
  ggtitle("t-SNE plot") + 
  theme(plot.title = element_text(hjust = 0.5))
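The visual impression from the plot can be cross-checked with the within-cluster sum of squares mentioned above. The following is only a sketch under stated assumptions: the matrix emb is synthetic 2D data with three groups that stands in for the t-SNE coordinates (tsne$Y), and k-means from base R is used as the clustering technique. Distances between groups in a t-SNE embedding are not faithful to the original space, so the result should be read as a heuristic hint rather than a final answer.

```r
set.seed(42)

# Synthetic stand-in for the 2D embedding (assumption): three groups of 100 points
centers <- rbind(c(-10, 0), c(10, 0), c(0, 15))
emb <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(100, centers[i, 1], 2),
        y = rnorm(100, centers[i, 2], 2))
}))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(emb, centers = k, nstart = 20)$tot.withinss
})
round(wss)
```

Plotting wss against k gives the usual elbow curve; the point after which the decrease flattens suggests the number of clusters.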

Summary

Initially the data set contains 40 numeric variables. With principal component analysis we reduce it to only 3 dimensions, which together explain approximately 68% of the total variance. The variables with the highest contribution to the first principal component are those that seem most important for offensive players (ball_control, dribbling, curve). For the second principal component, the most influential attributes are those best suited to defensive players (strength, interceptions, standing_tackle). The attributes with the highest contribution to the third principal component go beyond football skills: they are closely connected with contracts and marketing (value_euro, release_clause_euro, wage_euro, international_reputation.1.5.).

Sources

García-Aliaga, A., Marquina, M., Coterón, J., Rodríguez-González, A., & Luengo-Sánchez, S. (2021). In-game behaviour analysis of football players using machine learning techniques based on player statistics. International Journal of Sports Science & Coaching, 16(1), 148-157. https://doi.org/10.1177/1747954120959762
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).
Shelly, Z., Burch, R., Tian, W., Strawderman, L., Piroli, A., & Bichey, C. (2020). Using K-means clustering to create training groups for elite American football student-athletes based on game demands. International Journal of Kinesiology and Sports Science, 8, 47. https://doi.org/10.7575/aiac.ijkss.v.8n.2p.47
Creasey, S. (2015). Wearable technology will up the game for sports data analytics. Computer Weekly.
https://en.wikipedia.org/wiki/Dimensionality_reduction
https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding

Dimension Reduction

Hubert Magdziak

2023-12-30