Introduction

Dimension reduction is the process of reducing the number of variables or features in a dataset while retaining as much information as possible. It is commonly used in data analysis, machine learning, and statistics to simplify models and improve their efficiency. There are many techniques for dimension reduction, including principal component analysis (PCA), t-SNE, and linear discriminant analysis (LDA).

Dataset

In this project, we use principal component analysis (PCA) as our dimension reduction technique. Our dataset is the “FIFA WC 2022 Players Stats,” which contains various statistics on soccer players participating in the 2022 FIFA World Cup. By applying PCA to this dataset, we aim to reduce the number of features while preserving as much information as possible. This will allow us to analyze the data more efficiently and gain valuable insights into the performance of the players in the tournament.

Loading the required libraries and dataset

library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)
## corrplot 0.92 loaded
library(DT)
mydata<- read.csv("C:\\Users\\PC\\AppData\\Local\\Temp\\Rar$DIa18832.48154\\FIFA WC 2022 Players Stats.csv")

Overviewing the dataset

datatable(mydata, options = list(scrollX = TRUE))
str(mydata)
## 'data.frame':    814 obs. of  18 variables:
##  $ Nationality                : chr  "Argentina" "Argentina" "Argentina" "Argentina" ...
##  $ FIFA.Ranking               : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ National.Team.Kit.Sponsor  : chr  "Adidas" "Adidas" "Adidas" "Adidas" ...
##  $ Position                   : chr  "GK" "GK" "GK" "DF" ...
##  $ National.Team.Jersey.Number: int  23 1 12 19 8 3 6 26 4 2 ...
##  $ Player.DOB                 : chr  "Sep 2, 1992" "Oct 16, 1986" "May 20, 1992" "Feb 12, 1988" ...
##  $ Club                       : chr  "Aston Villa" "River" "Villarreal" "Benfica" ...
##  $ Player.Name                : chr  "Emiliano Martinez" "Franco Armani" "Geronimo Rulli" "Nicolas Otamendi" ...
##  $ Appearances                : chr  "7" "0" "0" "7" ...
##  $ Goals.Scored               : chr  "0" "-" "-" "0" ...
##  $ Assists.Provided           : chr  "0" "-" "-" "1" ...
##  $ Dribbles.per.90            : chr  "0.00" "-" "-" "0.33" ...
##  $ Interceptions.per.90       : chr  "0.00" "-" "-" "1.17" ...
##  $ Tackles.per.90             : chr  "0.00" "-" "-" "1.30" ...
##  $ Total.Duels.Won.per.90     : chr  "0.65" "-" "-" "7.17" ...
##  $ Save.Percentage            : chr  "46.67%" "-" "-" "-" ...
##  $ Clean.Sheets               : chr  "43%" "-" "-" "-" ...
##  $ Brand.Sponsor.Brand.Used   : chr  "Adidas" "Nike" "Adidas" "Nike" ...

Preparing the dataset

Converting character data into numeric data

Converting character data to numeric data is an important step in dimension reduction analysis because most dimension reduction techniques, such as principal component analysis (PCA), are designed to work with numeric data. Character data is essentially categorical, so it must be converted to numeric form before we can perform the mathematical operations, such as matrix multiplication and eigendecomposition, that underlie many dimension reduction techniques. So, we apply the following conversions to our dataset:

# Convert the performance columns from character to numeric;
# the "-" placeholders cannot be parsed and therefore become NA
mydata <- mydata %>%
  mutate(
    Appearances = as.numeric(Appearances),
    Assists.Provided = as.numeric(Assists.Provided),
    Goals.Scored = as.numeric(Goals.Scored),
    Dribbles.per.90 = as.numeric(Dribbles.per.90),
    Interceptions.per.90 = as.numeric(Interceptions.per.90),
    Tackles.per.90 = as.numeric(Tackles.per.90),
    Total.Duels.Won.per.90 = as.numeric(Total.Duels.Won.per.90)
  )
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
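As a side note, the same conversion can be written more compactly with dplyr’s across(); this is an equivalent sketch, not a change to the pipeline above:

# Equivalent conversion using across(); the "-" placeholders still coerce to NA
mydata <- mydata %>%
  mutate(across(
    c(Appearances, Goals.Scored, Assists.Provided, Dribbles.per.90,
      Interceptions.per.90, Tackles.per.90, Total.Duels.Won.per.90),
    as.numeric
  ))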

Processing the dataset

As we mentioned above, because PCA is designed to work with numeric data, we need to select only the variables whose values are numeric. The coercion warnings above are expected: the “-” placeholders in the raw data cannot be parsed as numbers and become NA. We should also omit rows containing NAs, if there are any.

mydata1 <- select(mydata, c(9:15))  # columns 9-15 hold the seven numeric performance variables
any(is.na(mydata1))
## [1] TRUE
mydata2 <- na.omit(mydata1)  # drop rows that contain missing values
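Equivalently, the same columns can be selected by name, which is more robust than positional indices if the column order ever changes; a small sketch using the names shown by str() above:

# Select the seven performance columns by name instead of position
mydata1 <- select(mydata, Appearances:Total.Duels.Won.per.90)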

Dimensions of the dataset

dim(mydata2)
## [1] 529   7
str(mydata2)
## 'data.frame':    529 obs. of  7 variables:
##  $ Appearances           : num  7 7 6 6 3 7 4 7 5 5 ...
##  $ Goals.Scored          : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ Assists.Provided      : num  0 1 0 0 0 1 0 0 0 0 ...
##  $ Dribbles.per.90       : num  0 0.33 1.45 0.48 0 0.32 0.77 0.16 0.3 0.4 ...
##  $ Interceptions.per.90  : num  0 1.17 0.48 2.17 0 0.47 0.77 0.49 0.9 1.21 ...
##  $ Tackles.per.90        : num  0 1.3 2.9 1.69 0 1.42 2.31 0.82 1.5 4.02 ...
##  $ Total.Duels.Won.per.90: num  0.65 7.17 7.97 5.07 3.16 1.58 5.39 5.09 4.19 9.24 ...
##  - attr(*, "na.action")= 'omit' Named int [1:285] 2 3 10 22 26 27 28 29 52 53 ...
##   ..- attr(*, "names")= chr [1:285] "2" "3" "10" "22" ...

Correlation matrix

We can use the corrplot() function in dimension reduction analysis to visualize the correlations between the variables in a dataset. In particular, it can produce a correlation matrix plot that shows the pairwise correlations between the variables.

corrplot(cor(mydata2, use = "complete.obs"), method = "number", type = "upper",
         diag = FALSE, tl.col = "black", tl.srt = 30, tl.cex = 0.9,
         number.cex = 0.85, title = "fifa2022", mar = c(0, 0, 1, 0))

In our example, some pairs of variables are positively correlated: as one variable increases, the other tends to increase as well. The correlations are not very strong, however, with coefficients ranging from 0.02 to 0.43, which suggests that the variables are related but that other factors may matter more in determining their values. A few pairs of variables are also negatively correlated, meaning that as one increases, the other tends to decrease.
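The range quoted above can be checked directly from the correlation matrix; a small sketch (cors is a local name introduced here):

# Range of the pairwise correlations, excluding the diagonal
cors <- cor(mydata2)
range(cors[upper.tri(cors)])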

Perform PCA on the data

By performing PCA on the dataset, we can identify the most important dimensions of variation in the data and reduce the dimensionality of the dataset by retaining only the principal components that explain the most variation. This can be useful for visualizing the data in lower-dimensional space and for reducing the computational complexity of subsequent analyses.

# scale = TRUE standardizes each variable before PCA, so that variables
# measured on different scales contribute equally
pca_result <- prcomp(mydata2, scale = TRUE)
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.3311 1.1992 1.0060 0.9672 0.8346 0.78936 0.72327
## Proportion of Variance 0.2531 0.2054 0.1446 0.1336 0.0995 0.08901 0.07473
## Cumulative Proportion  0.2531 0.4586 0.6031 0.7368 0.8363 0.92527 1.00000

Based on the results provided, PC1 has the highest proportion of variance, which suggests that it is the most important component in explaining the variability in the data. However, it is also important to consider the cumulative proportions, which indicate that the first three PCs explain over 60% of the total variability in the data.

Optimal number of components

Kaiser criterion

In dimension reduction analysis, the Kaiser criterion is a commonly used method for deciding how many principal components to retain from a PCA. The criterion retains only the principal components whose eigenvalues are greater than or equal to 1: a component with an eigenvalue below 1 explains less variance than a single standardized original variable, so it contributes little beyond noise and is not useful for reducing the dimensionality of the data.

# Eigenvalues are the squared standard deviations of the principal components
eigenvalues <- pca_result$sdev^2

# Kaiser criterion: keep components whose eigenvalue is at least 1
num_PCs <- sum(eigenvalues >= 1)

pca_retained <- pca_result$x[, 1:num_PCs]
fviz_eig(pca_result, choice = "eigenvalue", ncp = 7, barfill = "dodgerblue2",
         barcolor = "dodgerblue3", linecolor = "firebrick3", addlabels = TRUE,
         main = "Eigenvalues")

The plot indicates that the optimal number of components is 3, since only the first three components have eigenvalues greater than or equal to 1.
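For reference, factoextra’s get_eigenvalue() tabulates the same quantities in one place: the eigenvalue of each component together with the individual and cumulative percentages of explained variance.

# One table: eigenvalue, variance.percent, cumulative.variance.percent per PC
get_eigenvalue(pca_result)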

Percentage of explained variance

Percentage of explained variance is another important metric in dimension reduction analysis, particularly in techniques like principal component analysis (PCA) and factor analysis.

In PCA, the percentage of explained variance for each principal component can be calculated, and a scree plot can be used to determine the optimal number of components to retain. In this example, we calculate the cumulative percentage of explained variance and use it as a guide for selecting the number of components to retain.

In general, a higher percentage of explained variance indicates that the principal components or reduced dimensions are more important in explaining the variability in the data, and a lower percentage suggests that the dimensions are less important.

explained_variance <- pca_result$sdev^2
total_variance <- sum(explained_variance)
explained_variance_ratio <- explained_variance / total_variance

# Running total of the variance explained by the first k components
cumulative_explained_variance_ratio <- cumsum(explained_variance_ratio)

plot(cumulative_explained_variance_ratio, type = "b",
     xlab = "Number of Components", ylab = "Cumulative Explained Variance",
     main = "Cumulative Explained Variance by Number of Components")

threshold <- 0.60  # require at least 60% of the variance to be retained

# First component count whose cumulative explained variance exceeds the threshold
n_components <- which(cumulative_explained_variance_ratio > threshold)[1]
n_components
## [1] 3

The result of this analysis, which requires a minimum threshold of 60% cumulative explained variance, indicates that 3 principal components are needed to capture a substantial amount of the variability in the original dataset. This finding is consistent with the Kaiser criterion.

Plotting

plot(explained_variance_ratio, type = "b", xlab = "Component",
     ylab = "Explained Variance", main = "Scree Plot")

The scree plot displays the percentage of variance explained by each principal component in descending order. It is used to identify the optimal number of principal components to retain. In this specific analysis, the scree plot shows that the first three principal components explain a relatively large proportion of the variability in the data, with the remaining components explaining smaller amounts. This is consistent with the earlier finding that three components capture a substantial amount of the variation in the original dataset.
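As a cross-check, the same scree plot can be drawn with factoextra, labeled with the percentage of variance each component explains; a sketch:

# Percentage-of-variance scree plot with labels
fviz_eig(pca_result, choice = "variance", addlabels = TRUE,
         main = "Percentage of Explained Variance")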

fviz_pca_var(pca_result, col.var = "contrib",
             gradient.cols = c("turquoise", "turquoise4", "black"), repel = TRUE)

The variable factor map (correlation circle) displays the contribution of each variable to the principal components shown: variables with high absolute coordinates contribute more strongly to the corresponding component, and the color gradient reflects the size of each variable's contribution. In this plot, we can see which variables are strongly associated with each of the first two principal components.

fviz_pca_ind(pca_result, col.ind = "contrib", geom = "point",
             gradient.cols = c("springgreen", "springgreen4", "black"))

The individual factor map displays the contribution of each individual observation to the principal components: observations far from the origin contribute more strongly, and the color of each point indicates its degree of contribution.
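To see which players dominate, the contributions can also be pulled out numerically with factoextra’s get_pca_ind(); this sketch ranks observations by a simple unweighted sum of their contributions to the first three components (ind_contrib is a local name introduced here):

# Rank observations by their combined contribution to PC1-PC3
ind_contrib <- get_pca_ind(pca_result)$contrib
head(sort(rowSums(ind_contrib[, 1:3]), decreasing = TRUE))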

# Contribution of each variable to PC1, PC2, and PC3, one panel per component
PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1, fill = "green", color = "green")
PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2, fill = "yellow", color = "yellow")
PC3 <- fviz_contrib(pca_result, choice = "var", axes = 3, fill = "red", color = "red")

grid.arrange(PC1, PC2, PC3)

# Extract the variable-level PCA results (coordinates, cos2, contributions)
pca_var <- get_pca_var(pca_result)

The above plots show the contribution of the individual variables to each of the three selected principal components.

The representation below shows the overall contribution of the individual variables across all three components.

fviz_contrib(pca_result, "var", axes = 1:3, fill = "darkgoldenrod", color = "darkgoldenrod4")

Conclusion

In this project, we used principal component analysis (PCA) for dimension reduction on the “FIFA WC 2022 Players Stats” dataset. After analyzing the data with PCA, we determined the number of principal components to retain using both the Kaiser criterion and the percentage of explained variance; both methods suggested retaining three principal components.

We also created plots and maps showing the contribution of each variable and each observation to the first three principal components. These visualizations helped us identify which variables and observations drive the patterns observed in the data.