Bernadette Mutsvagiwa

Dimension Reduction Analysis of FIFA Players Dataset

Introduction

Modern football analytics datasets contain a very large number of performance attributes for each player, such as passing, shooting, defending, physicality, and technical skills. While these features provide rich information, their high dimensionality makes direct visualization and interpretation difficult. Unsupervised learning techniques, particularly dimension reduction methods, allow us to summarize these complex datasets in a small number of informative components while preserving the underlying patterns.

The objective of this project is to apply three unsupervised dimension reduction techniques, namely Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), to a FIFA players dataset. The analysis focuses on discovering hidden structure, understanding relationships between players, and visualizing similarities without using any labeled outcomes.

Loading Required Libraries

Before working with the dataset, it is necessary to load several libraries that support data manipulation, visualization, and dimension reduction. These packages provide reliable and widely used implementations of the required algorithms.

#Loading required libraries

library(tidyverse)
library(FactoMineR)
library(factoextra)
library(Rtsne)
library(umap)
library(GGally)
library(ggcorrplot)

Dataset Loading and Initial Inspection

The FIFA players dataset is loaded using the read.csv() function. This dataset contains player-level attributes representing technical skills, physical traits, and overall performance ratings.

The output of str() and dim() shows that the dataset contains 17,954 players and 51 variables, including both numeric and categorical (character) variables. Since dimension reduction techniques require numeric inputs, preprocessing is essential.

#Loading the dataset 

setwd("/Users/benna/Documents/Unsupervised Learning Project")

fifa <- read.csv("fifa_players.csv")

str(fifa)
dim(fifa)

'data.frame':   17954 obs. of  51 variables:
 $ name                         : chr  "L. Messi" "C. Eriksen" "P. Pogba" "L. Insigne" ...
 $ full_name                    : chr  "Lionel Andrés Messi Cuccittini" "Christian  Dannemann Eriksen" "Paul Pogba" "Lorenzo Insigne" ...
 $ birth_date                   : chr  "6/24/1987" "2/14/1992" "3/15/1993" "6/4/1991" ...
 $ age                          : int  31 27 25 27 27 27 20 30 32 32 ...
 $ height_cm                    : num  170 155 190 163 188 ...
 $ weight_kgs                   : num  72.1 76.2 83.9 59 88.9 92.1 73 69.9 92.1 77.1 ...
 $ positions                    : chr  "CF,RW,ST" "CAM,RM,CM" "CM,CAM" "LW,ST" ...
 $ nationality                  : chr  "Argentina" "Denmark" "France" "Italy" ...
 $ overall_rating               : int  94 88 88 88 88 88 88 89 89 89 ...
 $ potential                    : int  94 89 91 88 91 90 95 89 89 89 ...
 $ value_euro                   : int  110500000 69500000 73000000 62000000 60000000 59500000 81000000 64500000 38000000 60000000 ...
 $ wage_euro                    : int  565000 205000 255000 165000 135000 215000 100000 300000 130000 200000 ...
 $ preferred_foot               : chr  "Left" "Right" "Right" "Right" ...
 $ international_reputation.1.5.: int  5 3 4 3 3 3 3 4 5 4 ...
 $ weak_foot.1.5.               : int  4 5 4 4 3 3 4 4 4 4 ...
 $ skill_moves.1.5.             : int  4 4 5 4 2 2 5 4 1 3 ...
 $ body_type                    : chr  "Messi" "Lean" "Normal" "Normal" ...
 $ release_clause_euro          : int  226500000 133800000 144200000 105400000 106500000 114500000 166100000 119300000 62700000 111000000 ...
 $ national_team                : chr  "Argentina" "Denmark" "France" "Italy" ...
 $ national_rating              : int  82 78 84 83 NA 81 84 82 85 81 ...
 $ national_team_position       : chr  "RF" "CAM" "RDM" "LW" ...
 $ national_jersey_number       : int  10 10 6 10 NA 4 10 11 1 21 ...
 $ crossing                     : int  86 88 80 86 30 53 77 70 15 70 ...
 $ finishing                    : int  95 81 75 77 22 52 88 93 13 89 ...
 $ heading_accuracy             : int  70 52 75 56 83 83 77 77 25 89 ...
 $ short_passing                : int  92 91 86 85 68 79 82 81 55 78 ...
 $ volleys                      : int  86 80 85 74 14 45 78 85 11 90 ...
 $ dribbling                    : int  97 84 87 90 69 70 90 89 30 80 ...
 $ curve                        : int  93 86 85 87 28 60 77 82 14 77 ...
 $ freekick_accuracy            : int  94 87 82 77 28 70 63 73 11 76 ...
 $ long_passing                 : int  89 89 90 78 60 81 73 64 59 52 ...
 $ ball_control                 : int  96 91 90 93 63 76 91 89 46 82 ...
 $ acceleration                 : int  91 76 71 94 70 74 96 88 54 75 ...
 $ sprint_speed                 : int  86 73 79 86 75 77 96 80 60 76 ...
 $ agility                      : int  93 80 76 94 50 61 92 86 51 77 ...
 $ reactions                    : int  95 88 82 83 82 87 87 90 84 91 ...
 $ balance                      : int  95 81 66 93 40 49 83 91 35 59 ...
 $ shot_power                   : int  85 84 90 75 55 81 79 88 25 87 ...
 $ jumping                      : int  68 50 83 53 81 88 75 81 77 88 ...
 $ stamina                      : int  72 92 88 75 75 75 83 76 43 92 ...
 $ strength                     : int  66 58 87 44 94 92 71 73 80 78 ...
 $ long_shots                   : int  94 89 82 84 15 64 78 83 16 79 ...
 $ aggression                   : int  48 46 78 34 87 82 62 65 29 84 ...
 $ interceptions                : int  22 56 64 26 88 88 38 24 30 48 ...
 $ positioning                  : int  94 84 82 83 24 41 88 92 12 93 ...
 $ vision                       : int  94 91 88 87 49 60 82 83 70 77 ...
 $ penalties                    : int  75 67 82 61 33 62 70 83 47 85 ...
 $ composure                    : int  96 88 87 83 80 87 86 90 70 82 ...
 $ marking                      : int  33 59 63 51 91 90 34 30 17 52 ...
 $ standing_tackle              : int  28 57 67 24 88 89 34 20 10 45 ...
 $ sliding_tackle               : int  26 22 67 22 87 84 32 12 11 39 ...

[1] 17954    51

Data Cleaning and Preprocessing

Selection of Numeric Variables

Dimension reduction methods such as PCA, t-SNE, and UMAP operate on numeric data. Therefore, all non-numeric variables are removed from the dataset. This step ensures that the analysis focuses solely on quantitative player attributes.

fifa_numeric <- fifa %>%
  select(where(is.numeric))

Handling Missing Values

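Missing values can distort distance-based and variance-based methods. To avoid this, rows containing missing values are removed. This is a strict filter: the t-SNE log later in this report shows a 789 x 42 input matrix, so only 789 of the 17,954 players remain after this step.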

sum(is.na(fifa_numeric))
fifa_numeric <- na.omit(fifa_numeric)

[1] 36532
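
Because na.omit() drops a whole row for a single missing value, it can help to see which columns are responsible before filtering. The str() output above shows NA entries in national_rating and national_jersey_number, so those columns are likely candidates. The following is a minimal sketch on a made-up toy data frame (the values and the names toy, na_per_column, and keep are illustrative assumptions, not part of the original analysis):

```r
# Toy data frame standing in for fifa_numeric (values are made up).
toy <- data.frame(
  overall_rating  = c(94, 88, 75, 70, 66),
  national_rating = c(82, NA, NA, 84, NA),
  crossing        = c(86, 88, NA, 70, 55)
)

# Number of missing values in each column, largest first.
na_per_column <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(na_per_column)

# Logical index of the rows that na.omit() would keep.
keep <- complete.cases(toy)
cat("Rows kept:", sum(keep), "of", nrow(toy), "\n")

# Subsetting with the index keeps the same rows as na.omit().
toy_complete <- toy[keep, ]
```

Running colSums(is.na(fifa_numeric)) on the real data would show whether a few sparsely populated columns drive most of the row loss.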

Data Scaling

The dataset contains attributes measured on different scales, such as height and skill ratings (0–100). Without scaling, variables with larger numeric ranges would dominate the analysis. Standardization is applied so that each variable has a mean of zero and a standard deviation of one.

fifa_scaled <- scale(fifa_numeric)
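
To make the effect of standardization concrete, the sketch below applies scale() to a small made-up data frame and checks that every column ends up with mean zero and standard deviation one. It also reproduces one column by hand using z = (x - mean(x)) / sd(x). The toy values and variable names are assumptions chosen for illustration.

```r
# Toy data with two variables on different scales (values are made up).
toy <- data.frame(
  height_cm = c(170, 155, 190, 163, 188),
  dribbling = c(97, 84, 87, 90, 69)
)

# scale() centers each column and divides it by its standard deviation.
toy_scaled <- scale(toy)

# After scaling, column means are (numerically) zero and column SDs are one.
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# The same transformation written out by hand for one column.
manual <- (toy$height_cm - mean(toy$height_cm)) / sd(toy$height_cm)
same_as_scale <- isTRUE(all.equal(as.numeric(toy_scaled[, "height_cm"]), manual))
print(same_as_scale)
```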

Exploratory Data Analysis - Distribution of overall player ratings

Before applying dimension reduction, it is helpful to examine the distribution of player quality as measured by overall ratings.

ggplot(fifa, aes(x = overall_rating)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Overall Player Ratings",
       x = "Overall Rating",
       y = "Frequency")

This histogram shows that most players cluster around average ratings, with fewer elite players at the top end of the scale.

Correlation Analysis of Player Attributes

Highly correlated features suggest redundancy and reinforce the need for dimension reduction.

cor_matrix <- cor(fifa_numeric)

ggcorrplot(cor_matrix,
           hc.order = TRUE,
           type = "lower",
           lab = FALSE)

The correlation heatmap reveals strong relationships among technical skills, physical attributes, and defensive metrics, indicating that many variables convey overlapping information.
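
A heatmap gives an overall impression, but a ranked list of the strongest pairs can be easier to read. The sketch below builds such a list from a correlation matrix using simulated data; the variable names only mimic the FIFA attributes, and top_pairs is a hypothetical name introduced here.

```r
# Simulated data: two attributes share a common "skill" factor, one does not.
set.seed(1)
n <- 200
skill <- rnorm(n)
toy <- data.frame(
  dribbling    = skill + rnorm(n, sd = 0.3),
  ball_control = skill + rnorm(n, sd = 0.3),
  strength     = rnorm(n)
)

cor_toy <- cor(toy)

# Take the upper triangle so that each pair of variables appears only once.
idx <- which(upper.tri(cor_toy), arr.ind = TRUE)
top_pairs <- data.frame(
  var1 = rownames(cor_toy)[idx[, "row"]],
  var2 = colnames(cor_toy)[idx[, "col"]],
  r    = cor_toy[idx]
)

# Sort by absolute correlation, strongest pair first.
top_pairs <- top_pairs[order(abs(top_pairs$r), decreasing = TRUE), ]
print(top_pairs)
```

Applied to cor_matrix from the analysis above, the same steps would list which player attributes overlap most.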

Principal Component Analysis

Principal Component Analysis is a linear dimension reduction technique that transforms the original variables into a new set of uncorrelated components ordered by variance explained.

pca_model <- PCA(fifa_scaled, scale.unit = FALSE, graph = FALSE)
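
The definition above can be checked numerically. The sketch below uses base R's prcomp() on the built-in USArrests data, both of which are stand-ins chosen so the example runs without the FIFA file or FactoMineR. It shows that the component variances of standardized data equal the eigenvalues of the correlation matrix, and that the resulting component scores are uncorrelated.

```r
# Standardize a small built-in dataset (a stand-in for fifa_scaled).
x <- scale(USArrests)

# PCA on already standardized data, so no further centering or scaling.
pca <- prcomp(x, center = FALSE, scale. = FALSE)

# Eigen-decomposition of the correlation matrix of the raw data.
eig <- eigen(cor(USArrests))

# Component variances match the eigenvalues.
variances_match <- isTRUE(all.equal(pca$sdev^2, eig$values))
print(variances_match)

# Correlations between component scores: identity matrix up to rounding.
scores_cor <- cor(pca$x)
print(round(scores_cor, 6))
```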

Scree Plot: Variance Explained by Components

The scree plot visualizes how much variance each principal component explains.

fviz_eig(pca_model, addlabels = TRUE, ylim = c(0, 40))

This plot shows that the first few components capture a substantial proportion of the total variance, suggesting that dimensionality can be significantly reduced without major information loss.
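
The phrase "first few components" can be made precise with the cumulative proportion of variance. The sketch below computes it with base R's prcomp() on the built-in USArrests data (a stand-in for the FIFA data), and finds the smallest number of components that reaches an 80% threshold. The threshold is an arbitrary assumption for illustration.

```r
# PCA on a built-in dataset, standardizing the variables first.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Running total across components.
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Smallest number of components whose cumulative share reaches 80%.
n_components <- which(cum_var >= 0.80)[1]
print(n_components)
```

For a FactoMineR model, the eigenvalue table with percentage and cumulative percentage of variance should be available as pca_model$eig, which can be inspected in the same way.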

Variable Contribution to Principal Components

Understanding which attributes contribute most to each principal component helps interpret the reduced dimensions.

fviz_pca_var(pca_model,
             col.var = "contrib",
             gradient.cols = c("blue", "orange", "red"),
             repel = TRUE)

The plot indicates that attacking, passing, and dribbling attributes heavily influence the first component, while physical and defensive traits contribute more strongly to subsequent components.
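
Contributions can also be read as numbers instead of colors. For a component with unit-length loadings, the contribution of a variable is its squared loading expressed as a percentage, so the contributions within each component sum to 100. The sketch below illustrates this with prcomp() on USArrests; both are assumptions standing in for the FactoMineR model and the FIFA data.

```r
# PCA on a built-in dataset, standardizing the variables first.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared loadings as percentages: contribution of each variable to each PC.
contrib <- 100 * pca$rotation^2

# Within each component the contributions add up to 100.
print(colSums(contrib))

# Variables ranked by their contribution to the first component.
pc1_ranked <- sort(contrib[, "PC1"], decreasing = TRUE)
print(round(pc1_ranked, 1))
```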

PCA Projection of Players

This visualization projects players onto the first two principal components.

fviz_pca_ind(pca_model,
             geom = "point",
             pointsize = 1,
             alpha.ind = 0.4,
             col.ind = "steelblue")

The spread of points illustrates natural groupings among players based on their overall skill profiles.

PCA Colored by Overall Rating

To further interpret the PCA results, players are colored by their overall rating.

# Ratings come from fifa_numeric so they match the rows kept after na.omit()
fviz_pca_ind(pca_model,
             geom = "point",
             col.ind = fifa_numeric$overall_rating,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Overall Rating")

Higher-rated players tend to cluster together, indicating that PCA successfully captures meaningful performance structure.

t-Distributed Stochastic Neighbor Embedding (t-SNE) - Applying t-SNE

t-SNE is a nonlinear dimension reduction technique that preserves local neighborhood structures, making it especially effective for visualization.

set.seed(123)

tsne_model <- Rtsne(fifa_scaled,
                    dims = 2,
                    perplexity = 30,
                    verbose = TRUE,
                    max_iter = 500)

Performing PCA
Read the 789 x 42 data matrix successfully!
OpenMP is working. 1 threads.
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.11 seconds (sparsity = 0.151574)!
Learning embedding...
Iteration 50: error is 64.847376 (50 iterations in 0.07 seconds)
Iteration 100: error is 62.355790 (50 iterations in 0.06 seconds)
Iteration 150: error is 62.203169 (50 iterations in 0.06 seconds)
Iteration 200: error is 62.133096 (50 iterations in 0.06 seconds)
Iteration 250: error is 62.088778 (50 iterations in 0.06 seconds)
Iteration 300: error is 1.126051 (50 iterations in 0.05 seconds)
Iteration 350: error is 1.044306 (50 iterations in 0.05 seconds)
Iteration 400: error is 1.020957 (50 iterations in 0.05 seconds)
Iteration 450: error is 1.007123 (50 iterations in 0.07 seconds)
Iteration 500: error is 0.999851 (50 iterations in 0.05 seconds)
Fitting performed in 0.60 seconds.
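
Rtsne checks its input before fitting. To the best of my understanding, it stops when the matrix contains duplicate rows (with the default check_duplicates = TRUE) and when the perplexity is too large for the number of rows, which requires 3 * perplexity to be at most nrow - 1. The sketch below is a small base R pre-check written under those assumptions; check_tsne_input is a hypothetical helper, not part of the Rtsne package, and the data are simulated to match the 789 x 42 shape reported in the log.

```r
# Hypothetical helper: summarize whether a matrix is ready for Rtsne().
check_tsne_input <- function(x, perplexity = 30) {
  x <- as.matrix(x)
  list(
    n_rows        = nrow(x),
    n_duplicates  = sum(duplicated(x)),            # duplicate rows
    perplexity_ok = 3 * perplexity <= nrow(x) - 1  # assumed Rtsne rule
  )
}

# Simulated matrix with the same shape as the one in the log above.
set.seed(123)
toy <- matrix(rnorm(789 * 42), nrow = 789)

check_full  <- check_tsne_input(toy, perplexity = 30)
check_small <- check_tsne_input(toy[1:50, ], perplexity = 30)

str(check_full)   # 789 rows: perplexity 30 is acceptable
str(check_small)  # 50 rows: perplexity 30 is too large
```

With 789 players, a perplexity of 30 sits comfortably within the limit, which is consistent with the run above completing without error.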

t-SNE Visualization

# Ratings come from fifa_numeric so they match the rows kept after na.omit()
tsne_df <- data.frame(
  Dim1 = tsne_model$Y[,1],
  Dim2 = tsne_model$Y[,2],
  Rating = fifa_numeric$overall_rating
)

ggplot(tsne_df, aes(Dim1, Dim2, color = Rating)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal() +
  labs(title = "t-SNE Visualization of FIFA Players")

The resulting plot reveals compact clusters of players with similar attribute profiles, particularly among elite and lower-rated players.

Uniform Manifold Approximation and Projection (UMAP) - Applying UMAP

UMAP is a more recent nonlinear technique that preserves local neighborhood structure while typically retaining more of the global data structure than t-SNE.

set.seed(123)

umap_model <- umap(fifa_scaled)

# Ratings come from fifa_numeric so they match the rows kept after na.omit()
umap_df <- data.frame(
  UMAP1 = umap_model$layout[,1],
  UMAP2 = umap_model$layout[,2],
  Rating = fifa_numeric$overall_rating
)
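
The call above relies on the package defaults. As a hedged sketch of how the main settings could be adjusted, the example below copies umap.defaults, changes n_neighbors, min_dist, and random_state, and passes the result through the config argument. The specific values are arbitrary assumptions, the data are simulated, and the code skips itself when the umap package is not installed.

```r
# Simulated data standing in for fifa_scaled.
set.seed(123)
toy <- matrix(rnorm(200 * 5), nrow = 200)

layout_dim <- NULL
if (requireNamespace("umap", quietly = TRUE)) {
  # Start from the package defaults and override selected settings.
  cfg <- umap::umap.defaults
  cfg$n_neighbors  <- 10   # size of the local neighborhood (illustrative)
  cfg$min_dist     <- 0.3  # how tightly points may be packed (illustrative)
  cfg$random_state <- 123  # fixed seed inside the configuration

  fit <- umap::umap(toy, config = cfg)

  # The embedding is stored in $layout, one row per observation.
  layout_dim <- dim(fit$layout)
  print(layout_dim)
} else {
  message("Package 'umap' is not installed; skipping the example.")
}
```

Smaller n_neighbors values emphasize local detail, while larger values give more weight to the overall shape of the data, so it is worth comparing a few settings before interpreting clusters.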

UMAP Visualization

ggplot(umap_df, aes(UMAP1, UMAP2, color = Rating)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "darkgreen", high = "red") +
  theme_minimal() +
  labs(title = "UMAP Projection of FIFA Players")

As with t-SNE, the UMAP plot shows compact clusters of players with similar attribute profiles, particularly among elite and lower-rated players.

Comparative Discussion of Methods and Conclusion

PCA provides interpretability and explains variance but is limited to linear relationships. t-SNE excels at uncovering fine-grained clusters but can distort global structure. UMAP aims to balance local and global structure, which often makes it a good choice for visualizing complex datasets such as FIFA player attributes.
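
The comparison above is qualitative. One possible way to put a number on how well an embedding preserves local structure is the share of each point's k nearest neighbors in the original space that are still among its k nearest neighbors in the 2D embedding. The sketch below implements this in base R and applies it to a two-component PCA of USArrests. The functions knn_sets and knn_overlap are hypothetical helpers introduced here, and this measure is only one of several that could be used.

```r
# Indices of the k nearest neighbors of every row (Euclidean distance).
knn_sets <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf  # a point is not its own neighbor
  lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k)])
}

# Average fraction of neighbors shared between two representations.
knn_overlap <- function(high, low, k = 10) {
  nn_high <- knn_sets(high, k)
  nn_low  <- knn_sets(low, k)
  shared  <- mapply(function(a, b) length(intersect(a, b)), nn_high, nn_low)
  mean(shared) / k
}

# Example: standardized data versus its first two principal components.
x      <- scale(USArrests)
pca_2d <- prcomp(x)$x[, 1:2]

overlap_pca  <- knn_overlap(x, pca_2d, k = 5)
overlap_self <- knn_overlap(x, x, k = 5)  # identical inputs give 1

print(overlap_pca)
print(overlap_self)
```

The same function could be called with fifa_scaled as the first argument and tsne_model$Y or umap_model$layout as the second, giving one comparable score per method.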

In conclusion, this project demonstrates how unsupervised dimension reduction techniques can effectively summarize and visualize high-dimensional football analytics data. The FIFA dataset exhibits strong internal structure driven by technical, physical, and tactical attributes. PCA offers valuable interpretability, while t-SNE and UMAP reveal nonlinear patterns and player groupings. These techniques provide powerful tools for player analysis, scouting, and performance evaluation.