Bernadette Mutsvagiwa
Dimension Reduction Analysis of FIFA Players Dataset
Introduction
Modern football analytics datasets contain a very large number of performance attributes for each player, such as passing, shooting, defending, physicality, and technical skills. While these features provide rich information, their high dimensionality makes direct visualization and interpretation difficult. Unsupervised learning techniques, particularly dimension reduction methods, allow us to summarize these complex datasets into a small number of informative components while preserving underlying patterns.
The objective of this project is to apply three unsupervised dimension reduction techniques to a FIFA players dataset: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). The analysis focuses on discovering hidden structures, understanding relationships between players, and visualizing similarities without using any labeled outcomes.
Loading Required Libraries
Before working with the dataset, it is necessary to load several libraries that support data manipulation, visualization, and dimension reduction. These packages provide reliable and widely used implementations of the required algorithms.
#Loading required libraries
library(tidyverse)
library(FactoMineR)
library(factoextra)
library(Rtsne)
library(umap)
library(GGally)
library(ggcorrplot)
Dataset Loading and Initial Inspection
The FIFA players dataset is loaded using the read.csv() function. This dataset contains player-level attributes representing technical skills, physical traits, and overall performance ratings.
Inspecting it with str() and dim() shows that the dataset contains 17,954 players and 51 variables, of which 42 are numeric and 9 are character columns such as name, positions, and nationality. Since dimension reduction techniques require numeric inputs, preprocessing is essential.
#Loading the dataset
setwd("/Users/benna/Documents/Unsupervised Learning Project")
fifa <- read.csv("fifa_players.csv")
str(fifa)
dim(fifa)
'data.frame': 17954 obs. of 51 variables:
$ name : chr "L. Messi" "C. Eriksen" "P. Pogba" "L. Insigne" ...
$ full_name : chr "Lionel Andrés Messi Cuccittini" "Christian Dannemann Eriksen" "Paul Pogba" "Lorenzo Insigne" ...
$ birth_date : chr "6/24/1987" "2/14/1992" "3/15/1993" "6/4/1991" ...
$ age : int 31 27 25 27 27 27 20 30 32 32 ...
$ height_cm : num 170 155 190 163 188 ...
$ weight_kgs : num 72.1 76.2 83.9 59 88.9 92.1 73 69.9 92.1 77.1 ...
$ positions : chr "CF,RW,ST" "CAM,RM,CM" "CM,CAM" "LW,ST" ...
$ nationality : chr "Argentina" "Denmark" "France" "Italy" ...
$ overall_rating : int 94 88 88 88 88 88 88 89 89 89 ...
$ potential : int 94 89 91 88 91 90 95 89 89 89 ...
$ value_euro : int 110500000 69500000 73000000 62000000 60000000 59500000 81000000 64500000 38000000 60000000 ...
$ wage_euro : int 565000 205000 255000 165000 135000 215000 100000 300000 130000 200000 ...
$ preferred_foot : chr "Left" "Right" "Right" "Right" ...
$ international_reputation.1.5.: int 5 3 4 3 3 3 3 4 5 4 ...
$ weak_foot.1.5. : int 4 5 4 4 3 3 4 4 4 4 ...
$ skill_moves.1.5. : int 4 4 5 4 2 2 5 4 1 3 ...
$ body_type : chr "Messi" "Lean" "Normal" "Normal" ...
$ release_clause_euro : int 226500000 133800000 144200000 105400000 106500000 114500000 166100000 119300000 62700000 111000000 ...
$ national_team : chr "Argentina" "Denmark" "France" "Italy" ...
$ national_rating : int 82 78 84 83 NA 81 84 82 85 81 ...
$ national_team_position : chr "RF" "CAM" "RDM" "LW" ...
$ national_jersey_number : int 10 10 6 10 NA 4 10 11 1 21 ...
$ crossing : int 86 88 80 86 30 53 77 70 15 70 ...
$ finishing : int 95 81 75 77 22 52 88 93 13 89 ...
$ heading_accuracy : int 70 52 75 56 83 83 77 77 25 89 ...
$ short_passing : int 92 91 86 85 68 79 82 81 55 78 ...
$ volleys : int 86 80 85 74 14 45 78 85 11 90 ...
$ dribbling : int 97 84 87 90 69 70 90 89 30 80 ...
$ curve : int 93 86 85 87 28 60 77 82 14 77 ...
$ freekick_accuracy : int 94 87 82 77 28 70 63 73 11 76 ...
$ long_passing : int 89 89 90 78 60 81 73 64 59 52 ...
$ ball_control : int 96 91 90 93 63 76 91 89 46 82 ...
$ acceleration : int 91 76 71 94 70 74 96 88 54 75 ...
$ sprint_speed : int 86 73 79 86 75 77 96 80 60 76 ...
$ agility : int 93 80 76 94 50 61 92 86 51 77 ...
$ reactions : int 95 88 82 83 82 87 87 90 84 91 ...
$ balance : int 95 81 66 93 40 49 83 91 35 59 ...
$ shot_power : int 85 84 90 75 55 81 79 88 25 87 ...
$ jumping : int 68 50 83 53 81 88 75 81 77 88 ...
$ stamina : int 72 92 88 75 75 75 83 76 43 92 ...
$ strength : int 66 58 87 44 94 92 71 73 80 78 ...
$ long_shots : int 94 89 82 84 15 64 78 83 16 79 ...
$ aggression : int 48 46 78 34 87 82 62 65 29 84 ...
$ interceptions : int 22 56 64 26 88 88 38 24 30 48 ...
$ positioning : int 94 84 82 83 24 41 88 92 12 93 ...
$ vision : int 94 91 88 87 49 60 82 83 70 77 ...
$ penalties : int 75 67 82 61 33 62 70 83 47 85 ...
$ composure : int 96 88 87 83 80 87 86 90 70 82 ...
$ marking : int 33 59 63 51 91 90 34 30 17 52 ...
$ standing_tackle : int 28 57 67 24 88 89 34 20 10 45 ...
$ sliding_tackle : int 26 22 67 22 87 84 32 12 11 39 ...
[1] 17954 51
Data Cleaning and Preprocessing
Selection of Numeric Variables
Dimension reduction methods such as PCA, t-SNE, and UMAP operate on numeric data. Therefore, all non-numeric variables are removed from the dataset. This step ensures that the analysis focuses solely on quantitative player attributes.
fifa_numeric <- fifa %>%
  select(where(is.numeric))
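Handling Missing Values
Missing values can distort distance-based and variance-based methods, so rows containing any missing value are removed. This filter is strict: the t-SNE log further below reports a 789 x 42 input matrix, meaning only 789 of the 17,954 players have complete records across all 42 numeric columns.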
sum(is.na(fifa_numeric))
fifa_numeric <- na.omit(fifa_numeric)
[1] 36532
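Because na.omit() drops a player as soon as any single column is NA, it can help to see which columns carry the missing values before filtering. The following is a minimal sketch on a small invented data frame (the column names mirror the dataset, but `toy` and its values are illustrative assumptions, not FIFA data):

```r
# Toy stand-in for fifa_numeric; values are invented for illustration only
toy <- data.frame(
  overall_rating  = c(94, 88, 88, 70, 65),
  national_rating = c(82, 78, NA, NA, NA),
  crossing        = c(86, 88, 80, NA, 55)
)

# Missing values per column, largest first
na_per_column <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(na_per_column)

# Rows that survive na.omit() are exactly the complete cases
toy_complete <- na.omit(toy)
rows_kept    <- nrow(toy_complete)
rows_dropped <- nrow(toy) - rows_kept
cat("kept:", rows_kept, "dropped:", rows_dropped, "\n")
```

If one or two columns turn out to account for most of the NA count, dropping those columns instead of rows may retain far more players; whether that is appropriate depends on the goal of the analysis.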
Data Scaling
The dataset contains attributes measured on different scales, such as height and skill ratings (0–100). Without scaling, variables with larger numeric ranges would dominate the analysis. Standardization is applied so that each variable has a mean of zero and a standard deviation of one.
fifa_scaled <- scale(fifa_numeric)
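To make the effect of standardization concrete, the sketch below applies scale() to three columns taken from the first five players in the structure output above and checks the result against a manual z-score. The object names (`toy`, `manual_height`) are introduced here purely for illustration.

```r
# First five values of three columns from the str() output shown earlier
toy <- cbind(
  height_cm  = c(170, 155, 190, 163, 188),
  value_euro = c(110500000, 69500000, 73000000, 62000000, 60000000),
  dribbling  = c(97, 84, 87, 90, 69)
)

toy_scaled <- scale(toy)  # centers each column, then divides by its sample SD

# Manual z-score for one column, to show what scale() computes
h <- toy[, "height_cm"]
manual_height <- (h - mean(h)) / sd(h)

col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))  # each column is centered at zero
print(col_sds)               # each column has unit standard deviation
```

After scaling, a column measured in euros and a column measured on a 0–100 scale contribute on equal footing to distances and variances.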
Exploratory Data Analysis - Distribution of Overall Player Ratings
Before applying dimension reduction, it is helpful to examine the distribution of player quality as measured by overall ratings.
ggplot(fifa, aes(x = overall_rating)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Overall Player Ratings",
       x = "Overall Rating",
       y = "Frequency")
This histogram shows that most players cluster around average ratings, with fewer elite players at the top end of the scale.
Correlation Analysis of Player Attributes
Highly correlated features suggest redundancy and reinforce the need for dimension reduction.
cor_matrix <- cor(fifa_numeric)
ggcorrplot(cor_matrix,
           hc.order = TRUE,
           type = "lower",
           lab = FALSE)
The correlation heatmap reveals strong relationships among technical skills, physical attributes, and defensive metrics, indicating that many variables convey overlapping information.
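A heatmap shows redundancy visually; listing the strongest pairs makes it explicit. The sketch below does this on invented toy data with one deliberately redundant pair. The 0.8 cutoff and all object names are illustrative assumptions, not values used in this analysis.

```r
# Toy data with one deliberately redundant pair (invented values)
set.seed(1)
n <- 200
dribbling    <- rnorm(n, mean = 60, sd = 15)
ball_control <- dribbling + rnorm(n, sd = 3)   # nearly a copy of dribbling
strength     <- rnorm(n, mean = 65, sd = 12)   # unrelated to the others
toy <- data.frame(dribbling, ball_control, strength)

cor_toy <- cor(toy)

# Keep each pair once by blanking the diagonal and the upper triangle
cor_toy[upper.tri(cor_toy, diag = TRUE)] <- NA
pairs_df <- as.data.frame(as.table(cor_toy), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]

# Pairs above an illustrative threshold of |r| > 0.8
strong_pairs <- pairs_df[abs(pairs_df$r) > 0.8, ]
print(strong_pairs)
```

On the real data the same steps could be applied to cor_matrix to list which attribute pairs are close to interchangeable.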
Principal Component Analysis
Principal Component Analysis is a linear dimension reduction technique that transforms the original variables into a new set of uncorrelated components ordered by variance explained.
pca_model <- PCA(fifa_scaled, scale.unit = FALSE, graph = FALSE)
Scree Plot: Variance Explained by Components
The scree plot visualizes how much variance each principal component explains.
fviz_eig(pca_model, addlabels = TRUE, ylim = c(0, 40))
This plot shows that the first few components capture a substantial proportion of the total variance, suggesting that dimensionality can be significantly reduced without major information loss.
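How many components to keep can also be read from the cumulative variance rather than from the plot alone. The sketch below uses base R's prcomp() instead of FactoMineR::PCA() so that it needs no extra packages; the toy data, the two hidden factors, and the 80% threshold are all assumptions made for illustration.

```r
# Toy data: 6 variables driven by 2 hidden factors plus noise (invented)
set.seed(42)
n <- 300
attack  <- rnorm(n)
defence <- rnorm(n)
toy <- data.frame(
  finishing       = attack  + rnorm(n, sd = 0.3),
  dribbling       = attack  + rnorm(n, sd = 0.3),
  long_shots      = attack  + rnorm(n, sd = 0.3),
  marking         = defence + rnorm(n, sd = 0.3),
  standing_tackle = defence + rnorm(n, sd = 0.3),
  interceptions   = defence + rnorm(n, sd = 0.3)
)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance per component, and the running total
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Smallest number of components reaching an illustrative 80% threshold
n_keep <- which(cum_var >= 0.80)[1]
cat("components needed for 80% variance:", n_keep, "\n")
```

For the FactoMineR model in this report, the equivalent table should be available through get_eigenvalue(pca_model) from factoextra.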
Variable Contribution to Principal Components
Understanding which attributes contribute most to each principal component helps interpret the reduced dimensions.
fviz_pca_var(pca_model,
             col.var = "contrib",
             gradient.cols = c("blue", "orange", "red"),
             repel = TRUE)
The plot indicates that attacking, passing, and dribbling attributes heavily influence the first component, while physical and defensive traits contribute more strongly to subsequent components.
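The contribution of a variable to a component can be computed directly from the loadings. The sketch below, again using base R's prcomp() on invented toy data, takes the contribution as 100 times the squared loading; this should correspond to the "contrib" percentages that factoextra reports, though that equivalence is stated here as an assumption rather than checked against the package.

```r
# Toy data: three attack-like and two defence-like variables (invented)
set.seed(7)
n <- 300
attack  <- rnorm(n)
defence <- rnorm(n)
toy <- data.frame(
  finishing  = attack  + rnorm(n, sd = 0.3),
  dribbling  = attack  + rnorm(n, sd = 0.3),
  long_shots = attack  + rnorm(n, sd = 0.3),
  marking    = defence + rnorm(n, sd = 0.3),
  tackling   = defence + rnorm(n, sd = 0.3)
)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings are unit-length vectors, so squared loadings sum to 1 per component
contrib <- 100 * pca_toy$rotation^2
print(round(contrib, 1))

# Two largest contributors to each of the first two components
top_contrib <- apply(contrib[, 1:2], 2, function(col) {
  names(sort(col, decreasing = TRUE))[1:2]
})
print(top_contrib)
```

Reading the table column by column gives the same information as the colour scale in the variable plot, but in a form that can be sorted and reported.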
PCA Projection of Players
This visualization projects players onto the first two principal components.
fviz_pca_ind(pca_model,
             geom = "point",
             pointsize = 1,
             alpha.ind = 0.4,
             col.ind = "steelblue")
The spread of points illustrates natural groupings among players based on their overall skill profiles.
PCA Colored by Overall Rating
To further interpret the PCA results, players are colored by their overall rating.
fviz_pca_ind(pca_model,
             geom = "point",
             # take ratings from the NA-filtered data so they match the PCA rows
             col.ind = fifa_numeric$overall_rating,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Overall Rating")
Higher-rated players tend to cluster together, indicating that PCA successfully captures meaningful performance structure.
t-Distributed Stochastic Neighbor Embedding (t-SNE) - Apply t-SNE
t-SNE is a nonlinear dimension reduction technique that preserves local neighborhood structures, making it especially effective for visualization.
set.seed(123)
tsne_model <- Rtsne(fifa_scaled,
                    dims = 2,
                    perplexity = 30,
                    verbose = TRUE,
                    max_iter = 500)
Performing PCA
Read the 789 x 42 data matrix successfully!
OpenMP is working. 1 threads.
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.11 seconds (sparsity = 0.151574)!
Learning embedding...
Iteration 50: error is 64.847376 (50 iterations in 0.07 seconds)
Iteration 100: error is 62.355790 (50 iterations in 0.06 seconds)
Iteration 150: error is 62.203169 (50 iterations in 0.06 seconds)
Iteration 200: error is 62.133096 (50 iterations in 0.06 seconds)
Iteration 250: error is 62.088778 (50 iterations in 0.06 seconds)
Iteration 300: error is 1.126051 (50 iterations in 0.05 seconds)
Iteration 350: error is 1.044306 (50 iterations in 0.05 seconds)
Iteration 400: error is 1.020957 (50 iterations in 0.05 seconds)
Iteration 450: error is 1.007123 (50 iterations in 0.07 seconds)
Iteration 500: error is 0.999851 (50 iterations in 0.05 seconds)
Fitting performed in 0.60 seconds.
t-SNE Visualization
# Rating comes from fifa_numeric so it has the same rows as the embedding
tsne_df <- data.frame(
  Dim1 = tsne_model$Y[, 1],
  Dim2 = tsne_model$Y[, 2],
  Rating = fifa_numeric$overall_rating
)
ggplot(tsne_df, aes(Dim1, Dim2, color = Rating)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal() +
  labs(title = "t-SNE Visualization of FIFA Players")
The resulting plot reveals compact clusters of players with similar attribute profiles, particularly among elite and lower-rated players.
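t-SNE layouts depend on the perplexity setting, so it is common to check whether the clusters persist across a few values. The sketch below runs Rtsne on invented two-group toy data for three perplexities; the values 5, 15, and 30 are illustrative choices rather than recommendations from this analysis, and the block skips the fit if the Rtsne package is not installed.

```r
# Toy data: two well-separated groups in 10 dimensions (invented values)
set.seed(123)
n_per_group <- 60
toy <- rbind(
  matrix(rnorm(n_per_group * 10, mean = 0), ncol = 10),
  matrix(rnorm(n_per_group * 10, mean = 4), ncol = 10)
)
group <- rep(c("A", "B"), each = n_per_group)

perplexities <- c(5, 15, 30)   # illustrative values only
embeddings <- list()

if (requireNamespace("Rtsne", quietly = TRUE)) {
  for (p in perplexities) {
    set.seed(123)  # same seed so the runs differ only in perplexity
    fit <- Rtsne::Rtsne(toy, dims = 2, perplexity = p,
                        max_iter = 500, verbose = FALSE)
    embeddings[[as.character(p)]] <- data.frame(
      Dim1 = fit$Y[, 1], Dim2 = fit$Y[, 2],
      group = group, perplexity = p
    )
  }
  # One long data frame, ready for a faceted plot
  tsne_all <- do.call(rbind, embeddings)
  print(table(tsne_all$perplexity))
}
```

The combined data frame can be plotted with ggplot2 using facet_wrap(~ perplexity) to compare the layouts side by side. Groupings that appear at every perplexity are more likely to reflect real structure than artifacts of one setting.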
Uniform Manifold Approximation and Projection (UMAP) - Applying UMAP
UMAP is a more recent nonlinear technique that, like t-SNE, preserves local neighborhoods, while typically retaining more of the global structure of the data.
set.seed(123)
umap_model <- umap(fifa_scaled)
# Rating comes from fifa_numeric so it has the same rows as the embedding
umap_df <- data.frame(
  UMAP1 = umap_model$layout[, 1],
  UMAP2 = umap_model$layout[, 2],
  Rating = fifa_numeric$overall_rating
)
UMAP Visualization
ggplot(umap_df, aes(UMAP1, UMAP2, color = Rating)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "darkgreen", high = "red") +
  theme_minimal() +
  labs(title = "UMAP Projection of FIFA Players")
As with t-SNE, the UMAP plot reveals compact clusters of players with similar attribute profiles, particularly among elite and lower-rated players.
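The UMAP call in this report relies on the package defaults. The sketch below shows how the main settings could be adjusted through a copy of umap.defaults; the specific values (30 neighbours, a minimum distance of 0.3) and the toy data are assumptions chosen for illustration, and the fit is skipped if the umap package is not installed.

```r
# Toy data: two well-separated groups in 10 dimensions (invented values)
set.seed(123)
toy <- rbind(
  matrix(rnorm(60 * 10, mean = 0), ncol = 10),
  matrix(rnorm(60 * 10, mean = 4), ncol = 10)
)

if (requireNamespace("umap", quietly = TRUE)) {
  custom <- umap::umap.defaults
  custom$n_neighbors  <- 30    # larger values emphasise broader structure
  custom$min_dist     <- 0.3   # larger values spread points out more
  custom$random_state <- 123   # fixes the layout across runs

  fit <- umap::umap(toy, config = custom)
  layout_dim <- dim(fit$layout)  # one row per observation, two columns
  print(layout_dim)
}
```

As with perplexity in t-SNE, comparing a few settings of n_neighbors is a reasonable way to check that the visible groupings are not specific to one configuration.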
Comparative Discussion of Methods and Conclusion
PCA provides interpretability and explains variance but is limited to linear relationships. t-SNE excels at uncovering fine-grained clusters but can distort global structure, so distances between clusters should not be over-interpreted. UMAP aims to balance local and global structure, which often makes it well suited to visualizing complex datasets like FIFA player attributes.
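One way to put a number on how well an embedding preserves local structure is to measure, for each point, what fraction of its k nearest neighbours in the original space are still among its k nearest neighbours in two dimensions. The helper below is a minimal base R sketch of that idea; `knn_overlap` is a hypothetical function written for this illustration, and the toy check uses PCA scores as a stand-in for any embedding.

```r
# Mean fraction of each point's k nearest neighbours (original space) that
# are also among its k nearest neighbours in the low-dimensional embedding
knn_overlap <- function(high, low, k = 10) {
  nn <- function(m) {
    d <- as.matrix(dist(m))
    diag(d) <- Inf                      # a point is not its own neighbour
    lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k)])
  }
  nn_high <- nn(high)
  nn_low  <- nn(low)
  mean(mapply(function(a, b) length(intersect(a, b)) / k, nn_high, nn_low))
}

# Toy check on invented data, with PCA scores standing in for an embedding
set.seed(123)
toy <- scale(matrix(rnorm(200 * 8), ncol = 8))
pca_2d <- prcomp(toy)$x[, 1:2]

overlap_pca  <- knn_overlap(toy, pca_2d, k = 10)
overlap_self <- knn_overlap(toy, toy, k = 10)   # identical spaces give 1
print(c(pca = overlap_pca, self = overlap_self))
```

On the real data the same function could be called as knn_overlap(fifa_scaled, pca_model$ind$coord[, 1:2]), knn_overlap(fifa_scaled, tsne_model$Y), and knn_overlap(fifa_scaled, umap_model$layout) to compare the three methods on a common scale. Note that it builds a full distance matrix, which is manageable for 789 players but would not scale to very large datasets.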
In conclusion, this project demonstrates how unsupervised dimension reduction techniques can effectively summarize and visualize high-dimensional football analytics data. The FIFA dataset exhibits strong internal structure driven by technical, physical, and tactical attributes. PCA offers valuable interpretability, while t-SNE and UMAP reveal nonlinear patterns and player groupings. These techniques provide powerful tools for player analysis, scouting, and performance evaluation.