This report presents a complete Principal Component Analysis (PCA) and Factor Analysis (FA) of the FIFA 23 Complete Player Dataset (Stefano Leone, 2022), which contains 147,400 player records. A simple random sample of 5,000 players was drawn with set.seed(123). The analysis uses 37 numeric variables: sub-attributes covering attacking, skill, movement, power, mentality, and defending capabilities, plus the overall and potential ratings, three 1-to-5 ratings, height, weight, and age.
Key Findings:
df_raw <- fread("fifa23_clean_numeric.csv")
# Sub-attributes only -- the 6 main aggregate attributes (pace, shooting,
# passing, dribbling, defending, physic) are arithmetic means of their
# sub-components, creating perfect multicollinearity that invalidates KMO
kolom_analisis <- c(
"overall", "potential",
"attacking_crossing", "attacking_finishing",
"attacking_heading_accuracy", "attacking_short_passing", "attacking_volleys",
"skill_dribbling", "skill_curve", "skill_fk_accuracy",
"skill_long_passing", "skill_ball_control",
"movement_acceleration", "movement_sprint_speed",
"movement_agility", "movement_reactions", "movement_balance",
"power_shot_power", "power_jumping", "power_stamina",
"power_strength", "power_long_shots",
"mentality_aggression", "mentality_interceptions",
"mentality_positioning", "mentality_vision",
"mentality_penalties", "mentality_composure",
"defending_marking_awareness", "defending_standing_tackle",
"defending_sliding_tackle",
"skill_moves", "weak_foot", "international_reputation",
"height_cm", "weight_kg", "age"
)
kolom_analisis <- intersect(kolom_analisis, names(df_raw))
data_pca <- df_raw %>% select(all_of(kolom_analisis)) %>% na.omit()
set.seed(123)
data_pca <- data_pca %>% sample_n(5000)
cat("Dataset Dimensions (raw):", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")
## Dataset Dimensions (raw): 147400 rows x 45 columns
## Analysis Sample: 5000 rows x 37 columns
data.frame(
Information = c("Source", "Author", "Year",
"Total Records (Raw)", "Analysis Sample",
"Variables Used", "Sampling Method", "Random Seed"),
Detail = c(
"Kaggle -- FIFA 23 Complete Player Dataset",
"Stefano Leone",
"2022",
"147,400 players",
"5,000 players (random sample without replacement)",
paste0(ncol(data_pca), " numeric sub-attributes"),
"Simple Random Sampling",
"set.seed(123)"
)
) %>%
kable(caption = "Table 1. Dataset Summary Information",
col.names = c("Information", "Detail")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")

| Information | Detail |
|---|---|
| Source | Kaggle – FIFA 23 Complete Player Dataset |
| Author | Stefano Leone |
| Year | 2022 |
| Total Records (Raw) | 147,400 players |
| Analysis Sample | 5,000 players (random sample without replacement) |
| Variables Used | 37 numeric sub-attributes |
| Sampling Method | Simple Random Sampling |
| Random Seed | set.seed(123) |
data.frame(
No = 1:ncol(data_pca),
Variable = names(data_pca),
Category = case_when(
names(data_pca) %in% c("overall", "potential") ~ "Overall",
grepl("attacking_", names(data_pca)) ~ "Attacking",
grepl("skill_dribbling|skill_curve|skill_fk|skill_long|skill_ball",
names(data_pca)) ~ "Skill",
grepl("movement_", names(data_pca)) ~ "Movement",
grepl("power_", names(data_pca)) ~ "Power",
grepl("mentality_", names(data_pca)) ~ "Mentality",
grepl("defending_", names(data_pca)) ~ "Defending",
TRUE ~ "Physical / Misc"
),
Scale = case_when(
names(data_pca) %in%
c("skill_moves", "weak_foot", "international_reputation") ~ "1-5 (discrete)",
names(data_pca) == "height_cm" ~ "cm",
names(data_pca) == "weight_kg" ~ "kg",
names(data_pca) == "age" ~ "years",
TRUE ~ "1-100 (continuous)"
)
) %>%
kable(caption = "Table 2. Research Variables (37 Sub-Attributes Used in Analysis)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")

| No | Variable | Category | Scale |
|---|---|---|---|
| 1 | overall | Overall | 1-100 (continuous) |
| 2 | potential | Overall | 1-100 (continuous) |
| 3 | attacking_crossing | Attacking | 1-100 (continuous) |
| 4 | attacking_finishing | Attacking | 1-100 (continuous) |
| 5 | attacking_heading_accuracy | Attacking | 1-100 (continuous) |
| 6 | attacking_short_passing | Attacking | 1-100 (continuous) |
| 7 | attacking_volleys | Attacking | 1-100 (continuous) |
| 8 | skill_dribbling | Skill | 1-100 (continuous) |
| 9 | skill_curve | Skill | 1-100 (continuous) |
| 10 | skill_fk_accuracy | Skill | 1-100 (continuous) |
| 11 | skill_long_passing | Skill | 1-100 (continuous) |
| 12 | skill_ball_control | Skill | 1-100 (continuous) |
| 13 | movement_acceleration | Movement | 1-100 (continuous) |
| 14 | movement_sprint_speed | Movement | 1-100 (continuous) |
| 15 | movement_agility | Movement | 1-100 (continuous) |
| 16 | movement_reactions | Movement | 1-100 (continuous) |
| 17 | movement_balance | Movement | 1-100 (continuous) |
| 18 | power_shot_power | Power | 1-100 (continuous) |
| 19 | power_jumping | Power | 1-100 (continuous) |
| 20 | power_stamina | Power | 1-100 (continuous) |
| 21 | power_strength | Power | 1-100 (continuous) |
| 22 | power_long_shots | Power | 1-100 (continuous) |
| 23 | mentality_aggression | Mentality | 1-100 (continuous) |
| 24 | mentality_interceptions | Mentality | 1-100 (continuous) |
| 25 | mentality_positioning | Mentality | 1-100 (continuous) |
| 26 | mentality_vision | Mentality | 1-100 (continuous) |
| 27 | mentality_penalties | Mentality | 1-100 (continuous) |
| 28 | mentality_composure | Mentality | 1-100 (continuous) |
| 29 | defending_marking_awareness | Defending | 1-100 (continuous) |
| 30 | defending_standing_tackle | Defending | 1-100 (continuous) |
| 31 | defending_sliding_tackle | Defending | 1-100 (continuous) |
| 32 | skill_moves | Physical / Misc | 1-5 (discrete) |
| 33 | weak_foot | Physical / Misc | 1-5 (discrete) |
| 34 | international_reputation | Physical / Misc | 1-5 (discrete) |
| 35 | height_cm | Physical / Misc | cm |
| 36 | weight_kg | Physical / Misc | kg |
| 37 | age | Physical / Misc | years |
desc_df <- data.frame(
Variable = names(data_pca),
N = sapply(data_pca, length),
Mean = round(sapply(data_pca, mean), 2),
Median = round(sapply(data_pca, median), 2),
SD = round(sapply(data_pca, sd), 2),
Min = round(sapply(data_pca, min), 2),
Max = round(sapply(data_pca, max), 2)
)
desc_df %>%
kable(caption = "Table 3. Descriptive Statistics -- All 37 Variables (n = 5,000)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
scroll_box(width = "100%", height = "420px")

| Variable | N | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|---|
| overall | 5000 | 65.93 | 66 | 6.76 | 46 | 89 |
| potential | 5000 | 71.07 | 71 | 6.21 | 49 | 91 |
| attacking_crossing | 5000 | 53.86 | 56 | 13.38 | 15 | 90 |
| attacking_finishing | 5000 | 50.70 | 53 | 15.93 | 15 | 91 |
| attacking_heading_accuracy | 5000 | 56.45 | 57 | 11.40 | 19 | 90 |
| attacking_short_passing | 5000 | 62.98 | 64 | 9.32 | 23 | 92 |
| attacking_volleys | 5000 | 46.34 | 46 | 14.47 | 12 | 88 |
| skill_dribbling | 5000 | 61.32 | 63 | 11.78 | 20 | 95 |
| skill_curve | 5000 | 51.94 | 52 | 14.37 | 17 | 93 |
| skill_fk_accuracy | 5000 | 46.79 | 45 | 14.31 | 12 | 94 |
| skill_long_passing | 5000 | 56.94 | 58 | 11.58 | 20 | 90 |
| skill_ball_control | 5000 | 63.39 | 64 | 9.64 | 27 | 94 |
| movement_acceleration | 5000 | 68.35 | 69 | 11.27 | 27 | 94 |
| movement_sprint_speed | 5000 | 68.41 | 69 | 11.21 | 29 | 96 |
| movement_agility | 5000 | 66.66 | 68 | 12.06 | 26 | 94 |
| movement_reactions | 5000 | 61.91 | 62 | 8.71 | 32 | 91 |
| movement_balance | 5000 | 67.20 | 68 | 12.11 | 28 | 95 |
| power_shot_power | 5000 | 59.12 | 61 | 13.00 | 18 | 90 |
| power_jumping | 5000 | 65.76 | 67 | 11.88 | 30 | 93 |
| power_stamina | 5000 | 67.10 | 68 | 11.38 | 28 | 94 |
| power_strength | 5000 | 65.59 | 67 | 12.78 | 27 | 96 |
| power_long_shots | 5000 | 51.33 | 54 | 15.62 | 12 | 89 |
| mentality_aggression | 5000 | 59.22 | 60 | 13.59 | 20 | 94 |
| mentality_interceptions | 5000 | 50.78 | 56 | 18.20 | 10 | 88 |
| mentality_positioning | 5000 | 55.62 | 58 | 14.05 | 12 | 91 |
| mentality_vision | 5000 | 56.22 | 57 | 12.49 | 17 | 91 |
| mentality_penalties | 5000 | 51.50 | 51 | 12.19 | 18 | 91 |
| mentality_composure | 5000 | 60.25 | 60 | 10.11 | 30 | 93 |
| defending_marking_awareness | 5000 | 50.88 | 55 | 17.19 | 10 | 88 |
| defending_standing_tackle | 5000 | 52.66 | 59 | 18.22 | 10 | 88 |
| defending_sliding_tackle | 5000 | 50.22 | 56 | 18.09 | 10 | 87 |
| skill_moves | 5000 | 2.56 | 2 | 0.65 | 2 | 5 |
| weak_foot | 5000 | 3.01 | 3 | 0.65 | 1 | 5 |
| international_reputation | 5000 | 1.09 | 1 | 0.34 | 1 | 5 |
| height_cm | 5000 | 180.62 | 180 | 6.66 | 156 | 206 |
| weight_kg | 5000 | 74.26 | 74 | 6.65 | 54 | 101 |
| age | 5000 | 25.02 | 25 | 4.51 | 16 | 39 |
Technical Interpretation – Descriptive Statistics:
Movement attributes (acceleration, sprint_speed, agility, reactions, balance) record some of the highest means among the 1-100 attributes (61.91 to 68.41; only potential, at 71.07, exceeds the movement maximum), with moderate standard deviations (8.71 to 12.11). This is consistent with a player population dominated by young-to-prime-age outfield players whose athletic movement qualities are well developed. The limited dispersion in this category signals moderate homogeneity: most players cluster within a similar movement ability range.
Defending attributes (mentality_interceptions, defending_marking_awareness, defending_standing_tackle, defending_sliding_tackle) have the largest standard deviations in the entire dataset (17.19 to 18.22). This is analytically important: it is not noise but a position-driven bimodal distribution. Attackers and wingers receive values of 10 to 30 on these attributes while defenders and defensive midfielders receive 60 to 88. High variance here provides strong discriminatory signal for factor separation in PCA/FA.
Scale heterogeneity is a critical preprocessing concern. skill_moves, weak_foot, and international_reputation operate on a 1 to 5 discrete scale with very low standard deviations (0.34 to 0.65), while height_cm (156 to 206 cm) and weight_kg (54 to 101 kg) are physical measurements with their own units. Without standardization, variables with larger numeric ranges would dominate PC directions regardless of their true variance structure. Applying scale. = TRUE in prcomp() resolves this by converting all variables to mean = 0, SD = 1 before eigendecomposition.
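The effect of standardization can be sketched on simulated data. The example below is a minimal illustration, not the report's analysis: the data frame `toy` and its columns `rating`, `stars`, and `height` are hypothetical stand-ins for a 1-100 attribute, a 1-5 rating, and a height in centimetres. It compares the first-component loadings with and without scaling, and checks that the component variances from prcomp(scale. = TRUE) match the eigenvalues of the correlation matrix.

```r
# Simulated stand-ins for three differently scaled variables (hypothetical data)
set.seed(1)
n      <- 500
latent <- rnorm(n)
toy <- data.frame(
  rating = 60 + 10 * latent + rnorm(n, sd = 5),    # roughly a 1-100 scale
  stars  = 3 + 0.5 * latent + rnorm(n, sd = 0.4),  # roughly a 1-5 scale
  height = 180 + rnorm(n, sd = 7)                  # centimetres, unrelated to latent
)

# PCA on raw values: PC1 is driven by the variables with the largest variance
pca_raw <- prcomp(toy, scale. = FALSE)

# PCA on standardized values: every variable enters with variance 1
pca_scaled <- prcomp(toy, scale. = TRUE)

# With scale. = TRUE the component variances equal the eigenvalues
# of the correlation matrix
eig_cor <- eigen(cor(toy))$values

print(round(pca_raw$rotation[, 1], 3))
print(round(pca_scaled$rotation[, 1], 3))
print(all.equal(pca_scaled$sdev^2, eig_cor))
```

In the unscaled fit the small-range variable `stars` contributes almost nothing to PC1 even though it is strongly correlated with `rating`; after scaling, the two share the first component.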
potential (mean = 71.07) consistently exceeds overall (mean = 65.93) by approximately 5 points on average, consistent with FIFA 23 game logic where potential represents the development ceiling. This gap is largest for players under age 21 and near zero for veterans whose career has peaked.
international_reputation has the lowest mean (1.09) and lowest SD (0.34), meaning the vast majority of the 5,000 sampled players have a reputation rating of exactly 1. This near-constant distribution is the primary reason international_reputation will show the second-lowest communality (h2 = 0.264) in the PCA solution. Standardization rescales its variance to 1 like every other variable, but a variable that barely varies can correlate only weakly with the others, so the retained components reproduce little of its variance.
data_long <- data_pca %>%
tidyr::pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "Value") %>%
mutate(Category = case_when(
Variable %in% c("overall", "potential") ~ "Overall",
grepl("attacking_", Variable) ~ "Attacking",
grepl("skill_dribbling|skill_curve|skill_fk|skill_long|skill_ball", Variable) ~ "Skill",
grepl("movement_", Variable) ~ "Movement",
grepl("power_", Variable) ~ "Power",
grepl("mentality_", Variable) ~ "Mentality",
grepl("defending_", Variable) ~ "Defending",
TRUE ~ "Physical / Misc"
))
ggplot(data_long, aes(x = Variable, y = Value, fill = Category)) +
geom_boxplot(outlier.size = 0.4, outlier.alpha = 0.3, linewidth = 0.4) +
facet_wrap(~ Category, scales = "free", ncol = 2) +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 10) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
strip.background = element_rect(fill = "#34495E", color = "#34495E"),
strip.text = element_text(color = "white", face = "bold"),
legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold", size = 12)
) +
labs(
title = "Figure 1. Distribution of FIFA 23 Player Attributes by Category",
x = NULL, y = "Value"
)

Figure 1. Boxplot Distribution of All 37 FIFA 23 Player Attributes by Category. Wide IQR indicates high variance and strong discriminatory potential for PCA.
Technical Interpretation – Distribution Patterns:
The Defending category shows the most analytically informative distributions. The boxplots for mentality_interceptions, defending_standing_tackle, defending_sliding_tackle, and defending_marking_awareness display the widest IQRs in the dataset, with values running from 10 to 88 and medians of 55 to 59. This broad spread is consistent with a population that mixes attack-minded players (scoring roughly 10 to 30) with defenders and defensive midfielders (scoring roughly 60 to 88), creating high variance that PCA will leverage to define its second principal component.
The Movement category displays left-skewed (negatively skewed) distributions with outliers concentrated at the low end. The bulk of players cluster between 60 and 80, with a lower tail of low-mobility players (goalkeepers, physically large central defenders, older veterans). This skew does not invalidate PCA: as a descriptive technique, PCA relies on the linear correlation structure and does not require normality.
Physical/Misc category reveals three structurally distinct distributions: (1) height_cm and weight_kg follow near-normal distributions consistent with the real anthropometric distribution of professional footballers; (2) skill_moves and weak_foot follow discrete distributions concentrated at values 2 to 3 with sharp cutoffs at the boundaries; (3) international_reputation is extremely right-skewed with virtually all mass at value 1, confirming its near-constant nature that will produce low communality in PCA.
The Attacking and Skill categories are mostly left-skewed: for most variables the mean falls below the median (for example attacking_finishing, mean 50.70 versus median 53), and most medians sit above the scale midpoint of 50. The exceptions are attacking_volleys (median 46) and skill_fk_accuracy (median 45, mean 46.79), whose medians fall below the midpoint. Low-end outliers in these categories are typically defensive specialists, to whom the game's rating system assigns low attacking values.
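Skew direction can be checked numerically rather than read off the boxplots. The following is a minimal sketch on simulated data: the helper `skewness()` and the vectors `left_tail` and `right_tail` are hypothetical and are not part of the report's code. A negative value indicates a longer lower tail (left skew), a positive value a longer upper tail (right skew).

```r
# Moment-based sample skewness: negative = left-skewed, positive = right-skewed
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(2)
# Hypothetical rating with a long lower tail (bulk near 70, a few low values)
left_tail <- pmax(1, round(75 - rexp(2000, rate = 1 / 8)))
# Hypothetical 1-5 rating concentrated at the minimum with a long upper tail
right_tail <- pmin(5, 1 + rpois(2000, lambda = 0.1))

skew_table <- data.frame(
  variable = c("left_tail", "right_tail"),
  mean     = c(mean(left_tail), mean(right_tail)),
  median   = c(median(left_tail), median(right_tail)),
  skewness = c(skewness(left_tail), skewness(right_tail))
)
print(skew_table)
```

Applied column by column (for example with sapply over the analysis data frame), the same helper would give a skewness value for each of the 37 variables.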
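Three conditions must ALL be satisfied before proceeding with PCA/FA:
Correlation matrix – more than 30% of unique variable pairs show |r| > 0.3
KMO / MSA – the overall KMO and every individual MSAi are at least 0.50
Bartlett's Test of Sphericity – p-value < 0.05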
The order tested here is Correlation first, then KMO (iterative variable removal if needed), then Bartlett on the cleaned dataset. This sequence ensures that the Bartlett test is computed on the same variable set confirmed adequate by KMO.
var_vals <- apply(data_pca, 2, var, na.rm = TRUE)
zero_var <- names(var_vals[var_vals < 1e-10])
if (length(zero_var) > 0) {
cat("Dropped (zero variance):", paste(zero_var, collapse = ", "), "\n")
data_pca <- data_pca %>% select(-all_of(zero_var))
} else {
cat("Pre-clean Step 1: No zero-variance variables found. All 37 retained.\n")
}
## Pre-clean Step 1: No zero-variance variables found. All 37 retained.
cor_tmp <- cor(data_pca, use = "complete.obs")
perf_idx <- which(abs(cor_tmp) > 0.9999 & upper.tri(cor_tmp), arr.ind = TRUE)
if (nrow(perf_idx) > 0) {
drop_perf <- unique(colnames(cor_tmp)[perf_idx[, 2]])
cat("Dropped (perfect correlation):", paste(drop_perf, collapse = ", "), "\n")
data_pca <- data_pca %>% select(-all_of(drop_perf))
} else {
cat("Pre-clean Step 2: No perfectly correlated pairs found. All 37 retained.\n")
}
## Pre-clean Step 2: No perfectly correlated pairs found. All 37 retained.
## Variables entering assumption testing: 37
Technical note on pre-cleaning: Zero-variance variables produce undefined Pearson correlations (division by SD = 0), and perfectly correlated variables make the correlation matrix singular (determinant = 0), so it cannot be inverted for KMO and ln|R| in Bartlett's statistic is undefined. Neither condition is present here. Note that the pairwise screen detects only dependence between two variables; an aggregate that is a linear combination of several sub-attributes would make the matrix singular without any single |r| reaching 1, which is why the 6 aggregate attributes were excluded at the variable-selection stage.
Requirement: As a rule of thumb, at least 30% of all unique variable pairs should show |r| > 0.3. If fewer than 30% reach this magnitude, the variables do not share enough common variance to justify extracting latent factors.
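The difference between pairwise correlation and linear dependence across several variables can be illustrated with a small simulation. This is a sketch under stated assumptions: the columns `a`, `b`, `c`, and `aggregate_score` are hypothetical, and the aggregate is constructed as an exact row mean purely for illustration.

```r
set.seed(3)
n <- 1000
# Three hypothetical sub-attributes and an aggregate that is their exact mean
sub <- data.frame(a = rnorm(n, 60, 10), b = rnorm(n, 55, 12), c = rnorm(n, 65, 8))
with_agg <- cbind(sub, aggregate_score = rowMeans(sub))

R <- cor(with_agg)

# Pairwise screen, same idea as a |r| > 0.9999 check
max_offdiag <- max(abs(R[upper.tri(R)]))

# Rank and smallest eigenvalue expose the exact linear dependence
rank_R  <- qr(R)$rank
min_eig <- min(eigen(R, symmetric = TRUE, only.values = TRUE)$values)

cat("Largest pairwise |r|:", round(max_offdiag, 3), "\n")
cat("Rank of R:", rank_R, "of", ncol(R), "\n")
cat("Smallest eigenvalue:", signif(min_eig, 3), "\n")
```

No pair comes close to |r| = 1, yet the matrix is rank-deficient, so a rank or eigenvalue check is the more general safeguard.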
Pearson correlation formula: \[r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\]
Total unique pairs for p = 37 variables: \(\binom{37}{2} = \frac{37 \times 36}{2} = 666\) pairs.
mat_corr <- round(cor(data_pca, use = "complete.obs"), 3)
if (any(is.na(mat_corr))) {
na_vars <- names(which(colSums(is.na(mat_corr)) > 0))
data_pca <- data_pca %>% select(-all_of(na_vars))
mat_corr <- round(cor(data_pca, use = "complete.obs"), 3)
}
n_var <- ncol(data_pca)
n_pairs <- n_var * (n_var - 1) / 2
mat_abs <- abs(mat_corr); diag(mat_abs) <- 0
n_sig <- sum(mat_abs > 0.3) / 2
pct_sig <- round(n_sig / n_pairs * 100, 1)
data.frame(
Metric = c("Total variables (p)",
"Total unique pairs [p(p-1)/2]",
"Pairs with |r| > 0.3",
"Percentage significant",
"Minimum requirement",
"Decision"),
Value = c(n_var, n_pairs, n_sig,
paste0(pct_sig, "%"),
"> 30%",
"PASS -- sufficient shared variance exists for PCA/FA")
) %>%
kable(caption = "Table 4. Correlation Matrix Assessment") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
row_spec(6, bold = TRUE, color = "darkgreen")

| Metric | Value |
|---|---|
| Total variables (p) | 37 |
| Total unique pairs [p(p-1)/2] | 666 |
| Pairs with \|r\| > 0.3 | 318 |
| Percentage significant | 47.7% |
| Minimum requirement | > 30% |
| Decision | PASS – sufficient shared variance exists for PCA/FA |
corrplot(
mat_corr,
method = "color", type = "upper",
tl.cex = 0.65, tl.col = "black",
col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
title = "FIFA 23 Player Attribute Correlation Matrix (37 Variables)",
mar = c(0, 0, 2, 0)
)

Figure 2. Correlation Matrix Heatmap. Red = positive correlation, Blue = negative correlation. Intensity reflects magnitude of |r|.
Technical Interpretation – Correlation Structure:
Quantitative result: 318 out of 666 unique pairs (47.7%) exceed |r| = 0.3, which is 17.7 percentage points above the minimum threshold. This confirms substantial shared variance among the 37 attributes and provides the statistical foundation for latent factor extraction.
Heatmap block structure: The correlation matrix reveals a well-defined two-block architecture. The upper-left block (attacking and skill variables) shows predominantly deep red cells, indicating high positive intercorrelations. For example, skill_dribbling and skill_ball_control (r ~= 0.83), power_shot_power and power_long_shots (r ~= 0.78), and mentality_vision and attacking_short_passing (r ~= 0.72) cluster tightly together. This entire block will constitute RC1 (Technical and Attacking Ability) in the rotated FA solution.
The lower-right cluster (defending_standing_tackle, defending_sliding_tackle, defending_marking_awareness, mentality_interceptions) shows extremely high intercorrelations ranging from 0.88 to 0.95. These four variables measure essentially the same underlying defensive competency from slightly different angles, which is why they will produce near-perfect loadings (0.93 to 0.944) on RC2 in VARIMAX.
Cross-block negative correlations (blue cells connecting the attacking cluster to the defending cluster, r approximately -0.40 to -0.60) are not statistical artifacts. They reflect a deliberate structural feature of FIFA 23’s design: the game assigns attacking specialists systematically low defending values and vice versa, embedding an inherent bipolar specialization axis into the data that PCA will capture as the contrast between PC1 and PC2.
Physical variables (height_cm, weight_kg, power_strength) form a smaller positive cluster among themselves, with near-zero or negative correlations against movement attributes (agility, balance, sprint_speed), consistent with the real biomechanical trade-off between mass and mobility.
KMO (Kaiser-Meyer-Olkin) quantifies what proportion of variable variance is due to common underlying factors versus unique variance. Mathematically:
\[KMO = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} a_{ij}^2}\]
where \(r_{ij}\) is the observed correlation and \(a_{ij}\) is the partial correlation between variables i and j controlling for all others. A high KMO means partial correlations are small relative to observed correlations, confirming that a common factor structure drives the intercorrelations rather than unique pairwise relationships.
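The formula can be evaluated directly from a correlation matrix. The sketch below is a base-R illustration of the definition above on simulated data; `kmo_manual()` and the data frame `toy` are hypothetical names, and the function is not the psych package's implementation. Partial correlations are obtained from the inverse correlation matrix P as a_ij = -P_ij / sqrt(P_ii P_jj).

```r
# Overall KMO and per-variable MSA from a correlation matrix, base R only
kmo_manual <- function(R) {
  P <- solve(R)                              # inverse correlation matrix
  A <- -P / sqrt(outer(diag(P), diag(P)))    # partial correlations a_ij
  diag(R) <- 0                               # keep only the i != j terms
  diag(A) <- 0
  list(
    overall = sum(R^2) / (sum(R^2) + sum(A^2)),
    msa_i   = colSums(R^2) / (colSums(R^2) + colSums(A^2))
  )
}

set.seed(4)
n <- 1000
f <- rnorm(n)                                # one hypothetical common factor
toy <- data.frame(
  x1 = f + rnorm(n, sd = 0.5),
  x2 = f + rnorm(n, sd = 0.5),
  x3 = f + rnorm(n, sd = 0.5),
  x4 = f + rnorm(n, sd = 0.5),
  noise = rnorm(n)                           # variable unrelated to the factor
)

res <- kmo_manual(cor(toy))
print(round(res$overall, 3))
print(round(res$msa_i, 3))
```

The per-variable values in `msa_i` apply the same ratio to one column at a time, which is the quantity the removal procedure compares against 0.50.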
Kaiser (1974) classification:
| KMO Value | Classification |
|---|---|
| >= 0.90 | Marvelous |
| >= 0.80 | Meritorious |
| >= 0.70 | Middling |
| >= 0.60 | Mediocre |
| >= 0.50 | Miserable |
| < 0.50 | Unacceptable (remove variable) |
Procedure: Variables with individual MSAi < 0.50 are removed one at a time (lowest MSAi first), recalculating KMO after each removal until all MSAi >= 0.50.
kmo_res <- KMO(mat_corr)
msa_val <- round(kmo_res$MSA, 3)
kmo_kat <- ifelse(msa_val >= 0.90, "Marvelous",
ifelse(msa_val >= 0.80, "Meritorious",
ifelse(msa_val >= 0.70, "Middling",
ifelse(msa_val >= 0.60, "Mediocre",
ifelse(msa_val >= 0.50, "Miserable", "Unacceptable")))))
cat("Overall KMO/MSA:", msa_val, "--", kmo_kat, "\n")
## Overall KMO/MSA: 0.942 -- Marvelous
msa_df <- data.frame(
Variable = names(kmo_res$MSAi),
MSA = round(kmo_res$MSAi, 3),
Category = ifelse(kmo_res$MSAi >= 0.90, "Marvelous",
ifelse(kmo_res$MSAi >= 0.80, "Meritorious",
ifelse(kmo_res$MSAi >= 0.70, "Middling",
ifelse(kmo_res$MSAi >= 0.50, "Acceptable", "Drop (< 0.50)"))))
) %>% arrange(MSA)
msa_df %>%
kable(caption = "Table 5. Individual MSA Values per Variable (sorted ascending)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
column_spec(2, bold = TRUE) %>%
column_spec(3, bold = TRUE,
color = ifelse(msa_df$MSA >= 0.90, "#1B5E20",
ifelse(msa_df$MSA >= 0.80, "#2E7D32",
ifelse(msa_df$MSA >= 0.70, "#558B2F",
ifelse(msa_df$MSA >= 0.50, "#F57F17", "red")))))

| Variable | MSA | Category |
|---|---|---|
| age | 0.708 | Middling |
| potential | 0.766 | Middling |
| movement_sprint_speed | 0.849 | Meritorious |
| movement_acceleration | 0.874 | Meritorious |
| power_jumping | 0.879 | Meritorious |
| overall | 0.881 | Meritorious |
| defending_standing_tackle | 0.889 | Meritorious |
| defending_sliding_tackle | 0.899 | Meritorious |
| height_cm | 0.900 | Marvelous |
| weight_kg | 0.913 | Marvelous |
| power_strength | 0.919 | Marvelous |
| attacking_heading_accuracy | 0.921 | Marvelous |
| skill_long_passing | 0.938 | Marvelous |
| movement_balance | 0.940 | Marvelous |
| power_stamina | 0.943 | Marvelous |
| attacking_short_passing | 0.945 | Marvelous |
| mentality_interceptions | 0.946 | Marvelous |
| skill_fk_accuracy | 0.953 | Marvelous |
| power_long_shots | 0.957 | Marvelous |
| attacking_crossing | 0.958 | Marvelous |
| attacking_finishing | 0.960 | Marvelous |
| mentality_positioning | 0.963 | Marvelous |
| movement_reactions | 0.964 | Marvelous |
| defending_marking_awareness | 0.965 | Marvelous |
| skill_curve | 0.966 | Marvelous |
| skill_ball_control | 0.966 | Marvelous |
| movement_agility | 0.968 | Marvelous |
| international_reputation | 0.968 | Marvelous |
| power_shot_power | 0.969 | Marvelous |
| mentality_aggression | 0.969 | Marvelous |
| skill_dribbling | 0.970 | Marvelous |
| mentality_penalties | 0.970 | Marvelous |
| mentality_vision | 0.976 | Marvelous |
| attacking_volleys | 0.980 | Marvelous |
| mentality_composure | 0.984 | Marvelous |
| skill_moves | 0.986 | Marvelous |
| weak_foot | 0.989 | Marvelous |
drop_log <- c()
data_ok <- data_pca
iter <- 0
max_iter <- ncol(data_pca) - 5
cat("--- Iterative Variable Removal Check ---\n")
## --- Iterative Variable Removal Check ---
repeat {
iter <- iter + 1
if (iter > max_iter) { cat("Max iterations reached.\n"); break }
mc <- tryCatch(round(cor(data_ok, use = "complete.obs"), 3),
error = function(e) NULL)
if (is.null(mc)) { cat("Singular matrix: stopping.\n"); break }
det_val <- tryCatch(det(mc), error = function(e) NA)
if (is.na(det_val) || det_val < 1e-15) {
cat("Near-singular matrix (det < 1e-15): stopping.\n")
cat("Note: this is expected for a 37-variable matrix with KMO = 0.942.",
"The near-zero determinant is caused by high intercorrelations,",
"not by data quality problems.\n")
break
}
kmo_tmp <- tryCatch(KMO(mc), error = function(e) NULL)
if (is.null(kmo_tmp)) { cat("KMO failed: stopping.\n"); break }
msa_clean <- kmo_tmp$MSAi[!is.na(kmo_tmp$MSAi)]
if (length(msa_clean) == 0) break
min_msa <- min(msa_clean)
min_var <- names(which.min(msa_clean))
if (min_msa >= 0.5) {
cat("All variables have MSA >= 0.50. No removal needed.\n")
break
}
cat(sprintf("Dropping '%s' (MSA = %.3f)\n", min_var, min_msa))
drop_log <- c(drop_log, min_var)
data_ok <- data_ok %>% select(-all_of(min_var))
}
## Near-singular matrix (det < 1e-15): stopping.
## Note: this is expected for a 37-variable matrix with KMO = 0.942. The near-zero determinant is caused by high intercorrelations, not by data quality problems.
mat_corr_ok <- round(cor(data_ok, use = "complete.obs"), 3)
kmo_final <- KMO(mat_corr_ok)
data_final <- data_ok
final_kmo_kat <- ifelse(kmo_final$MSA >= 0.90, "Marvelous",
ifelse(kmo_final$MSA >= 0.80, "Meritorious",
ifelse(kmo_final$MSA >= 0.70, "Middling",
ifelse(kmo_final$MSA >= 0.60, "Mediocre",
ifelse(kmo_final$MSA >= 0.50, "Miserable", "Unacceptable")))))
data.frame(
Metric = c("Initial Variables",
"Variables Removed (MSA < 0.50)",
"Final Variables",
"Final Overall KMO/MSA",
"Classification",
"Decision"),
Value = c(ncol(data_pca),
length(drop_log),
ncol(data_final),
round(kmo_final$MSA, 3),
final_kmo_kat,
"PASS -- all variables have MSA >= 0.50, proceed with PCA")
) %>%
kable(caption = "Table 6. KMO/MSA Final Summary After Iterative Removal Check") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
row_spec(6, bold = TRUE, color = "darkgreen")

| Metric | Value |
|---|---|
| Initial Variables | 37 |
| Variables Removed (MSA < 0.50) | 0 |
| Final Variables | 37 |
| Final Overall KMO/MSA | 0.942 |
| Classification | Marvelous |
| Decision | PASS – all variables have MSA >= 0.50, proceed with PCA |
Technical Interpretation – KMO:
Overall KMO = 0.942 (Marvelous) means that the squared partial correlations are very small relative to the squared observed correlations: almost all of the association between any two variables is shared with the remaining variables rather than unique to the pair. KMO is a ratio of squared correlations, not a percentage of variance explained, but a value this close to 1 indicates a strong and recoverable latent structure.
Individual MSA values range from 0.708 (age, Middling) to 0.989 (weak_foot, Marvelous). Age has the lowest MSA because it correlates with ability attributes only indirectly and non-linearly: young players may have high potential but modest current ratings, while older players show the opposite pattern. After controlling for all other variables, age retains non-trivial unique partial correlations, modestly reducing its MSA. Despite this, 0.708 comfortably clears the 0.50 threshold.
The near-singular matrix message in the iterative loop is a computational stopping condition, not a data problem. With 37 highly intercorrelated variables, the determinant of the correlation matrix falls below the loop's guard threshold of 1e-15, so the loop exits on its first pass before it evaluates any MSAi. The conclusion that no variable needs to be removed therefore rests on Table 5, where KMO() was computed on the same matrix and every MSAi is at least 0.708. Zero variables were removed and the final dataset is identical to the input.
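When a determinant is this small, working on the log scale is usually more stable than comparing det() against a fixed threshold. The sketch below uses simulated data (the matrix `toy` is hypothetical) and base R's determinant() with logarithm = TRUE, cross-checked against the sum of log eigenvalues. It is one possible way to inspect near-singularity, not the approach used in the loop above.

```r
set.seed(5)
n <- 2000
p <- 30
f <- rnorm(n)
# Hypothetical data: 30 noisy indicators of one factor, so |R| is extremely small
toy <- sapply(seq_len(p), function(j) f + rnorm(n, sd = 0.2))
R   <- cor(toy)

raw_det <- det(R)                             # tiny but still representable here
ld      <- determinant(R, logarithm = TRUE)   # list with modulus (log|det|) and sign
log_det <- as.numeric(ld$modulus)

# Cross-check against the eigenvalues: log|R| = sum(log(lambda_i))
log_det_eig <- sum(log(eigen(R, symmetric = TRUE, only.values = TRUE)$values))

cat("det(R):        ", signif(raw_det, 3), "\n")
cat("log|R| (LU):   ", round(log_det, 3), "\n")
cat("log|R| (eigen):", round(log_det_eig, 3), "\n")
```

A very negative log-determinant signals strong intercorrelation, while a positive smallest eigenvalue confirms the matrix is still invertible.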
Bartlett’s Test formally tests H0: the population correlation matrix equals the identity matrix (R = I), meaning all off-diagonal correlations are exactly zero. If H0 were true, no shared variance would exist and PCA/FA would be meaningless.
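Hypotheses:
H0: \(R = I\) (all off-diagonal population correlations are zero)
H1: \(R \neq I\) (at least one off-diagonal population correlation is non-zero)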
Test statistic: \[\chi^2 = -\left[(n-1) - \frac{2p+5}{6}\right] \ln|R|\]
where n = sample size, p = number of variables, |R| = determinant of the correlation matrix. The term (2p+5)/6 is a bias correction for small samples. Degrees of freedom = p(p-1)/2.
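The test statistic can be computed by hand from a correlation matrix and a sample size. The sketch below is a minimal base-R illustration on simulated data; `bartlett_sphericity()` and the data frames `correlated` and `independent` are hypothetical names introduced here, and the log-determinant is taken with determinant(..., logarithm = TRUE) for numerical stability.

```r
# Bartlett's test of sphericity from a correlation matrix R and sample size n
bartlett_sphericity <- function(R, n) {
  p       <- ncol(R)
  log_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  chisq   <- -((n - 1) - (2 * p + 5) / 6) * log_det
  df      <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

set.seed(6)
n <- 300
f <- rnorm(n)
# Three variables sharing one factor versus three unrelated variables
correlated  <- data.frame(a = f + rnorm(n), b = f + rnorm(n), c = f + rnorm(n))
independent <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))

res_cor <- bartlett_sphericity(cor(correlated), n)
res_ind <- bartlett_sphericity(cor(independent), n)
print(unlist(res_cor))
print(unlist(res_ind))
```

The correlated set yields a much larger statistic than the independent set, which is the behaviour the formula is designed to capture.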
Decision rule: Reject H0 if p-value < 0.05.
Note: Bartlett is computed on mat_corr_ok (the KMO-cleaned matrix) and data_final (n after KMO cleaning) to ensure consistency between the two tests.
n_obs <- nrow(data_final)
bart_res <- cortest.bartlett(mat_corr_ok, n = n_obs, diag = TRUE)
p_val <- bart_res$p.value
p_display <- ifelse(is.na(p_val) || p_val == 0,
"approx. 0 (< 2.2e-16, machine precision limit in R)",
format(p_val, scientific = TRUE))
data.frame(
Statistic = c("Chi-square statistic",
"Degrees of freedom [p(p-1)/2]",
"p-value",
"Significance level (alpha)",
"Sample size (n)",
"Decision"),
Value = c(
formatC(bart_res$chisq, format = "f", digits = 4),
bart_res$df,
p_display,
"0.05",
n_obs,
"REJECT H0 -- correlation matrix is significantly non-identity"
)
) %>%
kable(caption = "Table 7. Bartlett's Test of Sphericity Results") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
row_spec(6, bold = TRUE, color = "darkgreen")

| Statistic | Value |
|---|---|
| Chi-square statistic | 218585.7995 |
| Degrees of freedom [p(p-1)/2] | 666 |
| p-value | approx. 0 (< 2.2e-16, machine precision limit in R) |
| Significance level (alpha) | 0.05 |
| Sample size (n) | 5000 |
| Decision | REJECT H0 – correlation matrix is significantly non-identity |
Technical Interpretation – Bartlett’s Test:
The chi-squared statistic of 218,585.80 with 666 degrees of freedom is extraordinarily large. For context, the critical value at alpha = 0.05 with 666 df is approximately 737.6. The observed statistic exceeds this critical value by a factor of approximately 296, meaning the evidence against H0 is overwhelming by any standard.
Why is the statistic so large? The Bartlett formula includes the term ln|R|. When variables are highly intercorrelated (KMO = 0.942), the correlation matrix determinant approaches zero, making ln|R| a large negative number. Multiplied by a negative sign and the large sample correction term (n - 1 = 4,999), this produces an enormous positive chi-squared value. A larger Bartlett statistic therefore indicates stronger intercorrelation structure.
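As a rough consistency check, the reported statistic can be inverted to recover the implied determinant, using the values from Table 7 (n = 5000, p = 37). The figures below are approximate and are derived from the reported chi-squared value, not recomputed from the data:

\[\ln|R| = -\frac{\chi^2}{(n-1) - \frac{2p+5}{6}} = -\frac{218585.7995}{4999 - \frac{79}{6}} \approx -\frac{218585.80}{4985.83} \approx -43.84, \qquad |R| \approx e^{-43.84} \approx 9 \times 10^{-20}\]

A determinant of this order is well below the 1e-15 guard used in the iterative KMO loop, which agrees with the near-singular message printed there.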
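The p-value appears as exactly 0 in R because chi-sq = 218,585.80 lies so far into the tail of the chi-squared distribution with 666 df that the tail probability underflows double-precision arithmetic, whose smallest positive normalized value is about 2.2e-308. The 2.2e-16 quoted in the table is R's machine epsilon (.Machine$double.eps), the conventional floor below which R prints p-values as "< 2.2e-16"; it is not the smallest representable number. This is not a computational error; the true p-value is positive but indistinguishable from zero for practical purposes.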
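A short check of these tail quantities, using only the statistic and degrees of freedom reported in Table 7, could look like the sketch below. The object names are illustrative; `log.p = TRUE` asks `pchisq()` for the natural log of the tail probability, which stays finite even when the probability itself underflows.

```r
# Sketch: inspecting the Bartlett tail probability without underflow (base R).
chi_sq <- 218585.7995   # chi-square statistic from Table 7
df     <- 666           # degrees of freedom from Table 7

crit  <- qchisq(0.95, df = df)                        # critical value at alpha = 0.05
ratio <- chi_sq / crit                                # how far the statistic exceeds it
p_raw <- pchisq(chi_sq, df = df, lower.tail = FALSE)  # underflows to exactly 0
log_p <- pchisq(chi_sq, df = df, lower.tail = FALSE, log.p = TRUE)

c(critical_value = crit,
  ratio          = ratio,
  p_raw          = p_raw,
  log10_p        = log_p / log(10),                   # order of magnitude of the p-value
  double_eps     = .Machine$double.eps,               # about 2.2e-16
  double_xmin    = .Machine$double.xmin)              # about 2.2e-308
```

Reporting `log10_p` alongside the "approx. 0" label is one way to show how small the p-value is without implying that it is literally zero.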
Practical conclusion: H0 is rejected decisively. The 37 FIFA 23 sub-attributes do not vary independently of one another. They share systematic variance attributable to a smaller number of latent ability dimensions, so PCA and FA are statistically justified.
data.frame(
Assumption = c("1. Correlation Matrix",
"2. KMO / MSA",
"3. Bartlett's Test"),
Result = c("318 / 666 pairs (47.7%) with |r| > 0.3",
"Overall KMO = 0.942 (Marvelous); min MSAi = 0.708 (age)",
"chi-sq = 218,585.80; df = 666; p ~= 0"),
Requirement = c("> 30% of pairs",
"Overall and all MSAi >= 0.50",
"p < 0.05"),
Status = c("PASS", "PASS", "PASS")
) %>%
kable(caption = "Table 8. Summary -- All Three PCA/FA Assumptions Satisfied") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(4, bold = TRUE, color = "darkgreen") %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")
| Assumption | Result | Requirement | Status |
|---|---|---|---|
| 1. Correlation Matrix | 318 / 666 pairs (47.7%) with \|r\| > 0.3 | > 30% of pairs | PASS |
| 2. KMO / MSA | Overall KMO = 0.942 (Marvelous); min MSAi = 0.708 (age) | Overall and all MSAi >= 0.50 | PASS |
| 3. Bartlett's Test | chi-sq = 218,585.80; df = 666; p ~= 0 | p < 0.05 | PASS |
Two analytical goals of PCA in this study:
Latent Structure Identification – determine whether the 37 observed attributes are manifestations of fewer underlying latent ability dimensions (e.g., “offensive skill”, “physical build”, “defensive competence”)
Dimensionality Reduction – compress the 37-variable space into k orthogonal principal components that collectively retain the majority of total variance, enabling more parsimonious player profiling
data.frame(
Criterion = c("Number of Variables (p)", "Number of Observations (n)",
"Obs/Variable Ratio (n/p)", "Variable Type",
"Analysis Type", "Missing Values"),
Value = c(ncol(data_pca), nrow(data_pca),
paste0(round(nrow(data_pca) / ncol(data_pca), 1), " : 1"),
"All numeric (continuous or discrete ordinal)",
"R-type (correlations among variables)",
"None (na.omit applied prior to sampling)"),
Requirement = c(">= 10", ">= 50 (100+ preferred)", ">= 5 : 1",
"Required for Pearson correlation", "Standard for PCA",
"Must be handled"),
Status = rep("PASS", 6)
) %>%
kable(caption = "Table 9. PCA Design Criteria Checklist") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(4, bold = TRUE, color = "darkgreen") %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")
| Criterion | Value | Requirement | Status |
|---|---|---|---|
| Number of Variables (p) | 37 | >= 10 | PASS |
| Number of Observations (n) | 5000 | >= 50 (100+ preferred) | PASS |
| Obs/Variable Ratio (n/p) | 135.1 : 1 | >= 5 : 1 | PASS |
| Variable Type | All numeric (continuous or discrete ordinal) | Required for Pearson correlation | PASS |
| Analysis Type | R-type (correlations among variables) | Standard for PCA | PASS |
| Missing Values | None (na.omit applied prior to sampling) | Must be handled | PASS |
The n/p ratio of 135.1:1 far exceeds even the demanding 100:1 benchmark cited in Hair et al. (2019) as “excellent.” A ratio this high supports stable correlation estimates with narrow confidence intervals, an eigendecomposition that is less prone to fitting sample-specific noise, and a factor structure that is more likely to replicate in independent samples. Replication is confirmed by the split-sample validation results in Section D.
Three convergent criteria for deciding how many components to retain:
Kaiser’s Rule (Latent Root): retain PC_i if eigenvalue lambda_i > 1.0. Rationale: a component must explain more variance than a single standardized variable (which has variance = 1 by definition) to be worth retaining.
Cumulative Variance Criterion: retain until cumulative explained variance >= 60% to 70%. This ensures the retained solution adequately represents the original data.
Scree Test: identify the elbow in the eigenvalue plot where the curve flattens. Components before the elbow carry systematic variance; components after it carry mostly noise.
Agreement across all three criteria strengthens the retention decision.
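The first two criteria reduce to a few lines of arithmetic on the eigenvalues of the correlation matrix. The sketch below illustrates this on simulated data; the names `eig_val`, `prop_var`, and `cum_var` mirror those used in this report, but how the report itself derives them is an assumption here, and the toy matrix `X` is hypothetical.

```r
# Sketch: Kaiser's Rule and the cumulative-variance criterion on toy data (base R).
set.seed(42)
n  <- 400
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
  sapply(1:4, function(j) 0.8 * f1 + 0.6 * rnorm(n)),  # block driven by factor 1
  sapply(1:4, function(j) 0.8 * f2 + 0.6 * rnorm(n)),  # block driven by factor 2
  matrix(rnorm(n * 2), n, 2)                           # two pure-noise variables
)

# Eigenvalues of the correlation matrix sum to the number of variables
eig_val  <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
prop_var <- eig_val / sum(eig_val)
cum_var  <- cumsum(prop_var)

n_kaiser <- sum(eig_val > 1)             # Kaiser's Rule
n_varpc  <- which(cum_var >= 0.70)[1]    # first component reaching 70% cumulative

data.frame(Component  = paste0("PC", seq_along(eig_val)),
           Eigenvalue = round(eig_val, 3),
           Cumulative = round(cum_var, 3))
c(kaiser = n_kaiser, cumulative_70 = n_varpc)
```

The scree test has no equivalent one-line rule, which is why it is read from the plot and treated as supporting evidence.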
n_kaiser <- sum(eig_val > 1)
n_varpc <- which(cum_var >= 0.70)[1]
n_comp <- n_kaiser
eig_df <- data.frame(
Component = paste0("PC", 1:length(eig_val)),
Eigenvalue = round(eig_val, 4),
Variance = paste0(round(prop_var * 100, 2), "%"),
Cumulative = paste0(round(cum_var * 100, 2), "%"),
Decision = ifelse(eig_val > 1, "Retain", "Drop")
)
head(eig_df, 15) %>%
kable(caption = "Table 10. Eigenvalue Table -- Top 15 Components",
col.names = c("Component", "Eigenvalue", "Variance (%)",
"Cumulative (%)", "Decision")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
row_spec(1:5, background = "#EBF5FB") %>%
column_spec(5, bold = TRUE,
color = ifelse(head(eig_df, 15)$Decision == "Retain",
"darkgreen", "red"))
| Component | Eigenvalue | Variance (%) | Cumulative (%) | Decision |
|---|---|---|---|---|
| PC1 | 13.2199 | 35.73% | 35.73% | Retain |
| PC2 | 7.5333 | 20.36% | 56.09% | Retain |
| PC3 | 3.6463 | 9.85% | 65.94% | Retain |
| PC4 | 1.9089 | 5.16% | 71.1% | Retain |
| PC5 | 1.4580 | 3.94% | 75.04% | Retain |
| PC6 | 0.9733 | 2.63% | 77.67% | Drop |
| PC7 | 0.8720 | 2.36% | 80.03% | Drop |
| PC8 | 0.7915 | 2.14% | 82.17% | Drop |
| PC9 | 0.6164 | 1.67% | 83.84% | Drop |
| PC10 | 0.5420 | 1.46% | 85.3% | Drop |
| PC11 | 0.4600 | 1.24% | 86.54% | Drop |
| PC12 | 0.4582 | 1.24% | 87.78% | Drop |
| PC13 | 0.4165 | 1.13% | 88.91% | Drop |
| PC14 | 0.3902 | 1.05% | 89.96% | Drop |
| PC15 | 0.3420 | 0.92% | 90.89% | Drop |
data.frame(
Criterion = c("Kaiser's Rule (eigenvalue > 1)",
"Cumulative Variance (>= 70%)",
"Scree Test (visual elbow)"),
Components = c(n_kaiser, n_varpc,
"3 to 5 (elbow visible between PC3 and PC4)"),
Cumulative = c(
paste0(round(cum_var[n_kaiser] * 100, 2), "%"),
paste0(round(cum_var[n_varpc] * 100, 2), "%"),
paste0(round(cum_var[3] * 100, 2), "% to ",
round(cum_var[5] * 100, 2), "%")
),
Role = c("Primary (definitive)", "Supporting", "Supporting")
) %>%
kable(caption = "Table 11. Component Retention Criteria Comparison",
col.names = c("Criterion", "Components Retained",
"Cumulative Variance", "Role")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
row_spec(1, bold = TRUE, background = "#EBF5FB") %>%
column_spec(1, bold = TRUE)
| Criterion | Components Retained | Cumulative Variance | Role |
|---|---|---|---|
| Kaiser’s Rule (eigenvalue > 1) | 5 | 75.04% | Primary (definitive) |
| Cumulative Variance (>= 70%) | 4 | 71.1% | Supporting |
| Scree Test (visual elbow) | 3 to 5 (elbow visible between PC3 and PC4) | 65.94% to 75.04% | Supporting |
fviz_eig(
pc, ncp = 15, addlabels = TRUE,
barfill = "#3498DB", barcolor = "#2980B9", linecolor = "#E74C3C",
main = "Scree Plot -- PCA FIFA 23 Player Attributes"
) +
geom_hline(yintercept = 100 / ncol(data_final),
linetype = "dashed", color = "#E74C3C", linewidth = 0.9) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
Figure 3. Scree Plot. The dashed red line marks 100/37 = 2.70% of variance, which is equivalent to eigenvalue = 1 (Kaiser threshold). The elbow is visible between PC3 and PC4; Kaiser’s Rule retains 5 components.
Technical Interpretation – Component Retention:
Kaiser’s Rule retains 5 components: PC1 (eigenvalue = 13.22), PC2 (7.53), PC3 (3.65), PC4 (1.91), PC5 (1.46). PC6 (eigenvalue = 0.97) falls just below the 1.0 cutoff, and the gap of 0.48 eigenvalue units between PC5 and PC6 gives a reasonably well-defined boundary.
Variance decomposition: PC1 alone accounts for 35.73% of total variance – unusually dominant, indicating a single strong latent dimension (general technical and attacking quality) that differentiates players more powerfully than any other axis. PC1 and PC2 together explain 56.09%, meaning the attacking-defending specialization bipolar dimension (PC2) adds another 20.36%. The remaining three components (PC3 to PC5) contribute 9.85%, 5.16%, and 3.94%, capturing progressively narrower but theoretically meaningful dimensions (physical build, speed, and developmental trajectory).
Scree test: The plot shows a pronounced steep descent from PC1 to PC3, then a clear change in slope (elbow) between PC3 and PC4. Strictly interpreted, the scree elbow suggests retaining 3 components. However, Kaiser’s Rule extending to 5 is well justified because PC4 (eigenvalue = 1.91) and PC5 (eigenvalue = 1.46) each explain more variance than any single standardized variable, and they capture theoretically distinct dimensions (speed and career stage) that would be lost in a 3-component solution.
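Retention decision: the three criteria point to between 3 and 5 components (Kaiser’s Rule: 5; cumulative variance >= 70%: 4; scree test: 3 to 5). Following Kaiser’s Rule as the primary criterion, k = 5 components are retained, explaining 75.04% of total variance.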
loadings_mat <- pc$rotation[, 1:n_comp] %*% diag(sqrt(eig_val[1:n_comp]))
colnames(loadings_mat) <- paste0("PC", 1:n_comp)
h2 <- rowSums(loadings_mat^2)
load_df <- as.data.frame(round(loadings_mat, 3))
load_df$h2 <- round(h2, 3)
load_df <- load_df %>% arrange(desc(h2))
load_df %>%
kable(caption = "Table 12. Unrotated Loading Matrix with Communalities (h2). Sorted by h2 descending. L_ij = e_ij x sqrt(lambda_j) = correlation between variable i and PC j.",
col.names = c(paste0("PC", 1:n_comp), "h2")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
column_spec(n_comp + 2, bold = TRUE,
color = ifelse(load_df$h2 >= 0.70, "darkgreen",
ifelse(load_df$h2 >= 0.50, "#E67E22", "red"))) %>%
scroll_box(width = "100%", height = "450px")
| Variable | PC1 | PC2 | PC3 | PC4 | PC5 | h2 |
|---|---|---|---|---|---|---|
| defending_standing_tackle | 0.134 | -0.821 | -0.469 | -0.106 | -0.034 | 0.925 |
| defending_sliding_tackle | 0.175 | -0.799 | -0.491 | -0.092 | -0.024 | 0.919 |
| mentality_interceptions | 0.091 | -0.836 | -0.442 | -0.110 | 0.023 | 0.915 |
| defending_marking_awareness | 0.108 | -0.841 | -0.416 | -0.101 | 0.010 | 0.902 |
| overall | -0.713 | -0.568 | 0.046 | 0.216 | -0.085 | 0.887 |
| attacking_finishing | -0.812 | 0.307 | 0.341 | -0.009 | 0.031 | 0.871 |
| potential | -0.473 | -0.282 | -0.069 | 0.391 | -0.639 | 0.869 |
| movement_acceleration | -0.449 | 0.437 | -0.386 | 0.561 | 0.084 | 0.862 |
| skill_ball_control | -0.896 | -0.165 | -0.004 | 0.054 | -0.165 | 0.859 |
| attacking_short_passing | -0.756 | -0.458 | -0.113 | -0.091 | -0.189 | 0.838 |
| skill_dribbling | -0.896 | 0.117 | -0.073 | 0.060 | -0.105 | 0.837 |
| power_long_shots | -0.867 | 0.113 | 0.210 | -0.144 | 0.077 | 0.835 |
| mentality_vision | -0.880 | -0.042 | -0.048 | -0.180 | -0.056 | 0.814 |
| attacking_volleys | -0.806 | 0.207 | 0.333 | -0.060 | 0.051 | 0.809 |
| movement_agility | -0.650 | 0.394 | -0.403 | 0.211 | 0.154 | 0.808 |
| power_strength | 0.099 | -0.668 | 0.512 | 0.234 | 0.161 | 0.798 |
| mentality_positioning | -0.855 | 0.202 | 0.133 | 0.016 | 0.076 | 0.796 |
| movement_sprint_speed | -0.374 | 0.350 | -0.278 | 0.665 | 0.083 | 0.790 |
| skill_long_passing | -0.631 | -0.503 | -0.236 | -0.242 | -0.149 | 0.788 |
| age | -0.304 | -0.403 | 0.197 | -0.298 | 0.630 | 0.780 |
| movement_reactions | -0.658 | -0.563 | 0.072 | 0.121 | -0.030 | 0.771 |
| skill_curve | -0.855 | 0.053 | -0.054 | -0.169 | 0.035 | 0.766 |
| mentality_composure | -0.753 | -0.434 | 0.092 | 0.018 | -0.049 | 0.766 |
| power_shot_power | -0.805 | -0.041 | 0.312 | -0.063 | 0.039 | 0.752 |
| height_cm | 0.315 | -0.457 | 0.612 | 0.160 | -0.207 | 0.751 |
| movement_balance | -0.503 | 0.394 | -0.545 | 0.016 | 0.181 | 0.739 |
| attacking_heading_accuracy | -0.069 | -0.640 | 0.489 | 0.268 | 0.075 | 0.732 |
| skill_fk_accuracy | -0.781 | 0.018 | 0.018 | -0.340 | 0.077 | 0.732 |
| mentality_penalties | -0.717 | 0.192 | 0.398 | -0.120 | 0.066 | 0.729 |
| attacking_crossing | -0.755 | -0.056 | -0.316 | -0.081 | 0.056 | 0.682 |
| weight_kg | 0.200 | -0.487 | 0.616 | 0.141 | -0.041 | 0.678 |
| mentality_aggression | -0.120 | -0.783 | -0.071 | 0.087 | 0.175 | 0.670 |
| power_stamina | -0.383 | -0.402 | -0.236 | 0.318 | 0.330 | 0.574 |
| power_jumping | 0.030 | -0.430 | 0.114 | 0.420 | 0.434 | 0.564 |
| skill_moves | -0.714 | 0.198 | 0.015 | 0.018 | -0.061 | 0.553 |
| international_reputation | -0.378 | -0.270 | 0.118 | -0.066 | -0.174 | 0.264 |
| weak_foot | -0.361 | 0.072 | 0.058 | -0.023 | -0.020 | 0.140 |
Technical Interpretation – Unrotated Loading Matrix:
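Each cell L_ij is the Pearson correlation between variable i and principal component j, ranging from -1 to +1. The overall sign of a component is arbitrary: multiplying every loading on a component by -1 gives an equivalent solution. What carries meaning is the absolute magnitude of each loading and the relative signs of loadings within the same component, since variables with opposite signs sit at opposite poles of that component.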
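Both properties, loadings as correlations and the arbitrariness of the overall sign, can be verified numerically. The sketch below uses simulated data and hypothetical object names (`pc_toy`, `load_toy`); it assumes the PCA is run on standardized variables with `prcomp()`, as the loading formula in Table 12 implies.

```r
# Sketch: loadings equal variable-component correlations; signs flip per component.
set.seed(7)
n <- 300
g <- rnorm(n)
X <- cbind(a = g + rnorm(n), b = g + rnorm(n), c = -g + rnorm(n), d = rnorm(n))

pc_toy   <- prcomp(X, center = TRUE, scale. = TRUE)
eig_toy  <- pc_toy$sdev^2                                # eigenvalues
load_toy <- pc_toy$rotation %*% diag(sqrt(eig_toy))      # L_ij = e_ij * sqrt(lambda_j)
cor_toy  <- cor(X, pc_toy$x)                             # correlations with PC scores

max(abs(load_toy - cor_toy))                             # numerically zero

# Flipping the sign of one eigenvector flips that whole column of loadings,
# but leaves every communality (row sum of squared loadings) unchanged.
flipped      <- load_toy
flipped[, 1] <- -flipped[, 1]
all.equal(rowSums(flipped^2), rowSums(load_toy^2))
```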
PC1 shows high absolute loadings (|L| > 0.70) on nearly all technical and attacking variables: skill_dribbling (-0.896), skill_ball_control (-0.896), mentality_vision (-0.880), power_long_shots (-0.867), skill_curve (-0.855), mentality_positioning (-0.855). These loadings are all negative, which reflects only the arbitrary orientation of this eigenvector; the defending and body-size variables load weakly positive. PC1 represents the dimension of general technical mastery and attacking capability: it primarily separates high-ability technical players from low-ability or defensively specialized ones.
PC2 loads strongly on defending attributes: defending_marking_awareness (-0.841), mentality_interceptions (-0.836), defending_standing_tackle (-0.821), mentality_aggression (-0.783), with attacking attributes loading positively or near-zero. PC2 represents the defending specialization axis, orthogonal to PC1 by construction: PC1 and PC2 scores are uncorrelated, so a player's PC1 score carries no linear information about their PC2 score.
PC3 captures physical build: height_cm (0.612), weight_kg (0.616), power_strength (0.512) load positively, while movement_agility (-0.403) and movement_balance (-0.545) load negatively. This component encodes the biomechanical trade-off between body mass and mobility.
PC4 isolates pure speed: movement_sprint_speed (0.665) and movement_acceleration (0.561) dominate; the only other loadings above 0.35 in absolute value are power_jumping (0.420) and potential (0.391). Speed as a standalone latent dimension, orthogonal to technical quality and physical bulk, is well established in sports science literature.
PC5 captures the career development axis: potential (-0.639) and age (0.630) load in opposing directions, encoding how far a player’s current ability is from their projected ceiling.
Communalities show that 29 of 37 variables (78.4%) are well represented (h2 >= 0.70) and a further six are adequately represented (0.50 <= h2 < 0.70). The two poorly represented variables are weak_foot (h2 = 0.140) and international_reputation (h2 = 0.264). These attributes measure individual-specific traits (lateralization and commercial profile) that are largely unrelated to the five retained components, and their communalities will not improve with rotation, because an orthogonal rotation leaves h2 unchanged.
fviz_pca_biplot(
pc,
axes = c(1, 2),
geom.ind = "point",
col.ind = "steelblue",
alpha.ind = 0.15,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE,
label = "var",
title = "Biplot PCA FIFA 23 (PC1 vs PC2)"
) + theme_minimal()
Figure 4. PCA Biplot (PC1 vs PC2). Arrows = variable loadings. Points = individual player scores. Arrow direction and length indicate correlation with the component axes.
Technical Interpretation – Biplot:
Arrows pointing in the same general direction indicate positively correlated variables that will load on the same factor. The attacking/skill arrows (skill_dribbling, skill_ball_control, power_long_shots, mentality_vision) all point toward the left of the PC1 axis, forming a tight bundle confirming strong positive intercorrelation within this cluster. The defending arrows (defending_standing_tackle, defending_marking_awareness, mentality_interceptions) are nearly perpendicular to the attacking arrows; given their Table 12 loadings (small positive values on PC1, about -0.82 to -0.84 on PC2), they point along the negative PC2 direction, slightly to the right. This near-perpendicularity anticipates the separation of RC1 and RC2 in the rotated solution: attacking ability and defensive ability are approximately uncorrelated latent dimensions.
The player point cloud is concentrated near the origin (average players on both dimensions) with extensions along the attacking and defending arrow directions. Because the sign of each component is arbitrary, positions are best read relative to the arrows rather than to fixed quadrants: players far along the attacking bundle are technically elite attackers, players far along the defending bundle are defensive specialists, and players displaced along both at once (the direction of the overall arrow, which loads -0.713 on PC1 and -0.568 on PC2) are the rarer complete players combining both dimensions at high levels.
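The geometric claim about arrow directions can be checked from the loadings alone. The sketch below takes the PC1 and PC2 loadings of one attacking and one defending variable from Table 12 and computes the angle between them in the PC1-PC2 plane; `angle_deg()` is a hypothetical helper written for this illustration.

```r
# Sketch: angle between two biplot arrows, from their Table 12 loadings.
load_12 <- rbind(
  skill_dribbling           = c(PC1 = -0.896, PC2 =  0.117),
  defending_standing_tackle = c(PC1 =  0.134, PC2 = -0.821)
)

# Angle in degrees between two 2-D vectors via the cosine formula
angle_deg <- function(u, v) {
  cos_uv <- sum(u * v) / sqrt(sum(u^2) * sum(v^2))
  acos(cos_uv) * 180 / pi
}

theta <- angle_deg(load_12["skill_dribbling", ],
                   load_12["defending_standing_tackle", ])
theta   # roughly 107 degrees
```

An angle near 90 degrees corresponds to a near-zero correlation between the two variables as represented in this plane; the same helper can be applied to any other pair of rows in Table 12.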
fviz_contrib(
pc, choice = "var", axes = 1:2, top = 20,
fill = "#3498DB", color = "#1A5276"
) +
labs(title = "Top 20 Variable Contributions to PC1 and PC2 Combined") +
theme_minimal()
Figure 5. Top 20 Variables Contributing to PC1 and PC2. The red dashed line marks the expected equal contribution level (100/37 = 2.7%).
Factor Analysis with VARIMAX rotation builds on the PCA solution by applying an orthogonal rotation to the 5 retained component axes. The rotation is performed with psych::principal(), which rotates principal components (hence the RC labels) rather than estimating a separate common factor model. The rotation maximizes the variance of squared loadings within each factor (the VARIMAX criterion), aiming for a “simple structure” where each variable loads highly on one factor and near-zero on all others. Total variance explained (75.04%) and each variable's communality are unchanged by rotation; only the distribution of variance across factors changes, which improves interpretability.
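The invariance of total explained variance under an orthogonal rotation can be demonstrated on toy data. In the sketch below, base R's `stats::varimax()` stands in for the rotation step; the simulated matrix `X` and the object names are assumptions made for illustration and are not the report's objects.

```r
# Sketch: an orthogonal (varimax) rotation redistributes variance across factors
# but leaves communalities and total explained variance unchanged.
set.seed(11)
n  <- 400
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
  sapply(1:3, function(j) 0.7 * f1 + 0.3 * f2 + 0.6 * rnorm(n)),
  sapply(1:3, function(j) 0.3 * f1 + 0.7 * f2 + 0.6 * rnorm(n))
)
colnames(X) <- paste0("v", 1:6)

pc_toy  <- prcomp(X, scale. = TRUE)
k       <- 2
L_unrot <- pc_toy$rotation[, 1:k] %*% diag(pc_toy$sdev[1:k])   # unrotated loadings
L_rot   <- unclass(varimax(L_unrot, normalize = TRUE)$loadings) # rotated loadings

h2_unrot <- rowSums(L_unrot^2)
h2_rot   <- rowSums(L_rot^2)

# Share of total variance per factor, before and after rotation
rbind(unrotated = colSums(L_unrot^2), rotated = colSums(L_rot^2)) / ncol(X)

all.equal(h2_unrot, h2_rot, check.attributes = FALSE)   # communalities unchanged
all.equal(sum(L_unrot^2), sum(L_rot^2))                 # total variance unchanged
```

The per-factor shares in the two rows differ while their sums agree, which is why variance percentages reported for unrotated components cannot be carried over to rotated factors.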
The common factor model: \[X_j = \lambda_{j1}F_1 + \lambda_{j2}F_2 + \cdots + \lambda_{jk}F_k + \varepsilon_j\]
where \(F_i\) are unobservable latent factors, \(\lambda_{ji}\) are factor loadings, and \(\varepsilon_j\) is unique variance specific to variable j (not shared with any factor).
Loading interpretation threshold (Hair et al., 2019):
| Absolute Loading | Interpretation |
|---|---|
| >= 0.70 | Strongly significant |
| >= 0.50 | Practically significant |
| >= 0.40 | Acceptable (n >= 200) |
| < 0.40 | Not considered dominant |
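These thresholds can be applied mechanically to any loading. The sketch below uses a hypothetical helper, `classify_loading()`, built on base R's `cut()`; the example values are taken from Table 13.

```r
# Sketch: labelling loadings with the threshold bands from the table above.
classify_loading <- function(x) {
  cut(abs(x),
      breaks = c(0, 0.40, 0.50, 0.70, Inf),
      labels = c("Not considered dominant",
                 "Acceptable (n >= 200)",
                 "Practically significant",
                 "Strongly significant"),
      right = FALSE)   # intervals are [lower, upper), so 0.70 counts as strong
}

ex <- c(power_long_shots = 0.885,   # RC1
        movement_agility = 0.551,   # RC4
        power_stamina    = 0.443,   # RC2
        weak_foot        = 0.358,   # RC1
        age              = -0.713)  # RC5; the sign is ignored via abs()

data.frame(loading = ex, band = classify_loading(ex))
```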
fa_rot <- principal(
data_final,
nfactors = n_comp,
rotate = "varimax",
scores = TRUE
)
loadings_rot <- fa_rot$loadings[, 1:n_comp]
class(loadings_rot) <- "matrix"
h2_rot <- fa_rot$communality
load_rot_df <- data.frame(round(loadings_rot, 3), h2 = round(h2_rot, 3)) %>%
arrange(desc(h2))
load_rot_df %>%
kable(caption = "Table 13. VARIMAX Rotated Loading Matrix with Communalities (h2). Sorted by h2 descending.",
col.names = c(paste0("RC", 1:n_comp), "h2")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
column_spec(n_comp + 2, bold = TRUE,
color = ifelse(load_rot_df$h2 >= 0.70, "darkgreen",
ifelse(load_rot_df$h2 >= 0.50, "#E67E22", "red"))) %>%
scroll_box(width = "100%", height = "450px")
| Variable | RC1 | RC2 | RC3 | RC4 | RC5 | h2 |
|---|---|---|---|---|---|---|
| defending_standing_tackle | -0.139 | 0.944 | 0.084 | -0.081 | 0.027 | 0.925 |
| defending_sliding_tackle | -0.187 | 0.935 | 0.066 | -0.067 | 0.024 | 0.919 |
| mentality_interceptions | -0.094 | 0.944 | 0.107 | -0.063 | -0.030 | 0.915 |
| defending_marking_awareness | -0.105 | 0.931 | 0.134 | -0.072 | -0.019 | 0.902 |
| overall | 0.700 | 0.440 | 0.336 | 0.260 | 0.150 | 0.887 |
| attacking_finishing | 0.822 | -0.417 | -0.019 | 0.136 | -0.044 | 0.871 |
| potential | 0.431 | 0.237 | 0.173 | 0.218 | 0.741 | 0.869 |
| movement_acceleration | 0.220 | -0.190 | -0.386 | 0.774 | 0.174 | 0.862 |
| skill_ball_control | 0.872 | 0.158 | -0.006 | 0.196 | 0.191 | 0.859 |
| attacking_short_passing | 0.765 | 0.470 | 0.042 | 0.031 | 0.167 | 0.838 |
| skill_dribbling | 0.834 | -0.036 | -0.210 | 0.270 | 0.151 | 0.837 |
| power_long_shots | 0.885 | -0.168 | -0.062 | 0.070 | -0.120 | 0.835 |
| mentality_vision | 0.875 | 0.107 | -0.185 | 0.056 | 0.018 | 0.814 |
| attacking_volleys | 0.830 | -0.324 | 0.014 | 0.090 | -0.081 | 0.809 |
| movement_agility | 0.471 | -0.101 | -0.522 | 0.551 | -0.002 | 0.808 |
| power_strength | -0.002 | 0.238 | 0.848 | 0.013 | -0.150 | 0.798 |
| mentality_positioning | 0.824 | -0.215 | -0.111 | 0.234 | -0.055 | 0.796 |
| movement_sprint_speed | 0.156 | -0.192 | -0.212 | 0.804 | 0.194 | 0.790 |
| skill_long_passing | 0.650 | 0.589 | -0.062 | -0.083 | 0.089 | 0.788 |
| age | 0.372 | 0.269 | 0.239 | -0.057 | -0.713 | 0.780 |
| movement_reactions | 0.666 | 0.430 | 0.326 | 0.178 | 0.063 | 0.771 |
| skill_curve | 0.835 | 0.031 | -0.233 | 0.102 | -0.061 | 0.766 |
| mentality_composure | 0.771 | 0.326 | 0.222 | 0.114 | 0.050 | 0.766 |
| power_shot_power | 0.844 | -0.108 | 0.135 | 0.065 | -0.075 | 0.752 |
| height_cm | -0.168 | 0.007 | 0.801 | -0.239 | 0.158 | 0.751 |
| movement_balance | 0.336 | -0.005 | -0.678 | 0.401 | -0.080 | 0.739 |
| attacking_heading_accuracy | 0.153 | 0.229 | 0.806 | 0.063 | -0.051 | 0.732 |
| skill_fk_accuracy | 0.808 | 0.036 | -0.210 | -0.072 | -0.168 | 0.732 |
| mentality_penalties | 0.769 | -0.345 | 0.061 | 0.002 | -0.125 | 0.729 |
| attacking_crossing | 0.679 | 0.254 | -0.325 | 0.225 | -0.026 | 0.682 |
| weight_kg | -0.062 | 0.036 | 0.802 | -0.173 | -0.001 | 0.678 |
| mentality_aggression | 0.130 | 0.680 | 0.399 | 0.106 | -0.142 | 0.670 |
| power_stamina | 0.273 | 0.443 | 0.117 | 0.509 | -0.176 | 0.574 |
| power_jumping | -0.079 | 0.249 | 0.482 | 0.423 | -0.290 | 0.564 |
| skill_moves | 0.675 | -0.152 | -0.181 | 0.187 | 0.084 | 0.553 |
| international_reputation | 0.428 | 0.173 | 0.165 | -0.085 | 0.130 | 0.264 |
| weak_foot | 0.358 | -0.078 | -0.050 | 0.053 | 0.015 | 0.140 |
Technical Interpretation – VARIMAX Rotated Loading Matrix:
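After VARIMAX rotation, most variables load on a single dominant factor with small loadings on the others, giving a much cleaner structure than the unrotated solution; the main exceptions are the cross-loading variables listed in Table 14. The technical and attacking variables now load positively on RC1 (the negative signs of unrotated PC1 disappear), which makes interpretation more straightforward. Negative dominant loadings remain where a factor is bipolar, for example movement_balance (-0.678) on RC3 and age (-0.713) on RC5.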
RC1 – Technical and Attacking Ability: Dominant variables (|L| >= 0.80): power_long_shots (0.885), mentality_vision (0.875), skill_ball_control (0.872), power_shot_power (0.844), skill_dribbling (0.834), skill_curve (0.835), attacking_volleys (0.830), mentality_positioning (0.824), attacking_finishing (0.822), skill_fk_accuracy (0.808). All 10 variables load above 0.80 on RC1 and below 0.45 on all other factors, demonstrating excellent simple structure. RC1 represents the comprehensive offensive and technical toolkit: the ability to shoot with power and accuracy, dribble past opponents, place the ball precisely, create chances through vision, and position intelligently to receive passes and score goals. Players with very high RC1 scores are creative attacking players such as classic number 10s and technically gifted forwards.
RC2 – Defending and Aggression: Dominant variables: mentality_interceptions (0.944), defending_standing_tackle (0.944), defending_sliding_tackle (0.935), defending_marking_awareness (0.931). These four loadings, ranging from 0.931 to 0.944, are near-perfect and indicate that the four attributes effectively measure the same underlying defensive competence construct from slightly different operational angles: winning the ball back through anticipation (interceptions), challenging in the tackle (standing and sliding), and reading the opponent’s movement (marking awareness). mentality_aggression (0.680) adds the dimension of competitive intensity and pressing quality. The negative loading of attacking_finishing (-0.417) on RC2 quantifies a systematic trade-off in the FIFA 23 ratings: players who are defensively excellent tend to have lower attacking finishing.
RC3 – Physical Strength and Aerial Ability: Dominant variables: power_strength (0.848), attacking_heading_accuracy (0.806), weight_kg (0.802), height_cm (0.801). The inclusion of attacking_heading_accuracy in this physical factor rather than RC1 is analytically informative: heading ability in FIFA 23 is determined primarily by physical attributes (height, jumping ability, strength to win aerial contests) rather than by technical skill. This is why VARIMAX correctly assigns it to RC3 rather than the technical factor. The negative loadings of movement_agility (-0.522) and movement_balance (-0.678) on RC3 quantify the known biomechanical cost of large body mass: taller, heavier, stronger players trade off mobility and balance to gain physical dominance.
RC4 – Speed and Stamina: Dominant variables: movement_sprint_speed (0.804), movement_acceleration (0.774), movement_agility (0.551), power_stamina (0.509). RC4 isolates the “athletic engine” of a player: peak running speed, rate of acceleration to reach that speed, directional quickness, and aerobic endurance. The combination of sprint speed with stamina is practically coherent: in modern high-intensity football, a player who is fast but lacks stamina cannot sustain their speed advantage for 90 minutes. RC4 is uncorrelated with RC1 (technical skill) by construction, and the two speed variables load only weakly on RC1 (movement_sprint_speed 0.156, movement_acceleration 0.220): a technically average but explosively fast player scores high on RC4 regardless of their RC1 score.
RC5 – Youth Potential vs. Experience: Defining contrast: potential (0.741) versus age (-0.713). This factor does not describe what a player can do right now but rather where they are in their career development trajectory. A high RC5 score indicates a young player with a large gap between potential ceiling and current overall rating (developmental upside). A low or negative RC5 score indicates a veteran player whose current ability is at or near their potential ceiling (peak or post-peak career stage). RC5 is unique among the five in being a temporal dimension rather than a contemporaneous ability dimension.
dom_list <- lapply(1:n_comp, function(f) {
ld <- loadings_rot[, f]
dom <- sort(ld[abs(ld) >= 0.40], decreasing = TRUE)
data.frame(Factor = paste0("RC", f),
Variable = names(dom),
Loading = round(dom, 3))
})
do.call(rbind, dom_list) %>%
kable(caption = "Table 14. Dominant Loadings per Factor (|loading| >= 0.40, sorted descending within factor)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
column_spec(1, bold = TRUE) %>%
column_spec(3, bold = TRUE,
color = ifelse(
abs(do.call(rbind, dom_list)$Loading) >= 0.70,
"darkgreen", "#E67E22"))
| | Factor | Variable | Loading |
|---|---|---|---|
| power_long_shots | RC1 | power_long_shots | 0.885 |
| mentality_vision | RC1 | mentality_vision | 0.875 |
| skill_ball_control | RC1 | skill_ball_control | 0.872 |
| power_shot_power | RC1 | power_shot_power | 0.844 |
| skill_curve | RC1 | skill_curve | 0.835 |
| skill_dribbling | RC1 | skill_dribbling | 0.834 |
| attacking_volleys | RC1 | attacking_volleys | 0.830 |
| mentality_positioning | RC1 | mentality_positioning | 0.824 |
| attacking_finishing | RC1 | attacking_finishing | 0.822 |
| skill_fk_accuracy | RC1 | skill_fk_accuracy | 0.808 |
| mentality_composure | RC1 | mentality_composure | 0.771 |
| mentality_penalties | RC1 | mentality_penalties | 0.769 |
| attacking_short_passing | RC1 | attacking_short_passing | 0.765 |
| overall | RC1 | overall | 0.700 |
| attacking_crossing | RC1 | attacking_crossing | 0.679 |
| skill_moves | RC1 | skill_moves | 0.675 |
| movement_reactions | RC1 | movement_reactions | 0.666 |
| skill_long_passing | RC1 | skill_long_passing | 0.650 |
| movement_agility | RC1 | movement_agility | 0.471 |
| potential | RC1 | potential | 0.431 |
| international_reputation | RC1 | international_reputation | 0.428 |
| defending_standing_tackle | RC2 | defending_standing_tackle | 0.944 |
| mentality_interceptions | RC2 | mentality_interceptions | 0.944 |
| defending_sliding_tackle | RC2 | defending_sliding_tackle | 0.935 |
| defending_marking_awareness | RC2 | defending_marking_awareness | 0.931 |
| mentality_aggression | RC2 | mentality_aggression | 0.680 |
| skill_long_passing1 | RC2 | skill_long_passing | 0.589 |
| attacking_short_passing1 | RC2 | attacking_short_passing | 0.470 |
| power_stamina | RC2 | power_stamina | 0.443 |
| overall1 | RC2 | overall | 0.440 |
| movement_reactions1 | RC2 | movement_reactions | 0.430 |
| attacking_finishing1 | RC2 | attacking_finishing | -0.417 |
| power_strength | RC3 | power_strength | 0.848 |
| attacking_heading_accuracy | RC3 | attacking_heading_accuracy | 0.806 |
| weight_kg | RC3 | weight_kg | 0.802 |
| height_cm | RC3 | height_cm | 0.801 |
| power_jumping | RC3 | power_jumping | 0.482 |
| movement_agility1 | RC3 | movement_agility | -0.522 |
| movement_balance | RC3 | movement_balance | -0.678 |
| movement_sprint_speed | RC4 | movement_sprint_speed | 0.804 |
| movement_acceleration | RC4 | movement_acceleration | 0.774 |
| movement_agility2 | RC4 | movement_agility | 0.551 |
| power_stamina1 | RC4 | power_stamina | 0.509 |
| power_jumping1 | RC4 | power_jumping | 0.423 |
| movement_balance1 | RC4 | movement_balance | 0.401 |
| potential1 | RC5 | potential | 0.741 |
| age | RC5 | age | -0.713 |
load_heat <- loadings_rot[, 1:n_comp]
corrplot(
load_heat,
is.corr = FALSE,
method = "color",
tl.cex = 0.7, tl.col = "black",
col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
title = "VARIMAX Loading Heatmap -- 5 Rotated Factors",
mar = c(0, 0, 2, 0),
col.lim = c(-1, 1) # recent corrplot releases use col.lim; older ones named this argument cl.lim
)
Figure 6. VARIMAX Loading Heatmap. Red = high positive loading, Blue = high negative loading. Each row should ideally show one dark cell (simple structure).
Technical Interpretation – VARIMAX Heatmap:
The heatmap is a visual test of simple structure quality. In perfect simple structure: every row (variable) has exactly one dark cell (one dominant factor), and every column (factor) has a clearly defined block of dark cells (a coherent variable cluster).
RC1 column shows a solid dark red block across all attacking and skill variables in the upper portion of the heatmap, with white or near-white cells in all other columns for these rows. This confirms textbook simple structure for the technical/attacking cluster.
RC2 column shows a concentrated dark red block for the four defending variables at the lower portion, with near-white cells for all non-defending variables. The clean separation of this column from RC1 quantifies the orthogonality between technical and defensive specialization.
RC3 column shows red for the height/weight/strength cluster and blue for agility/balance variables, encoding the physical build bipolar dimension.
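RC4 and RC5 show more diffuse coloring because speed and career stage are inherently less tightly clustered constructs. This is acceptable and expected: RC4 and RC5 are narrower, more specific dimensions and account for smaller shares of the rotated variance than RC1 to RC3. The 5.16% and 3.94% figures in Table 10 apply to unrotated PC4 and PC5 and do not carry over to the rotated factors, because rotation redistributes variance across them.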
Cross-loading rows (variables with multiple colored cells): skill_long_passing shows moderate loadings on both RC1 (0.650) and RC2 (0.589), reflecting its dual role as a technical attribute used by both creative playmakers and defensive midfielders. movement_agility appears in RC1, RC3, and RC4, reflecting its multidimensional nature. These cross-loadings are not failures of the rotation but genuine reflections of the multifaceted nature of certain specific attributes.
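Cross-loading variables can also be flagged programmatically instead of being read off the heatmap. The sketch below copies three rows of Table 13 into a small matrix (`rot_ex`, a hypothetical name) and counts the loadings at or above the 0.40 threshold in each row.

```r
# Sketch: flagging cross-loading variables from rotated loadings (rows from Table 13).
rot_ex <- rbind(
  skill_long_passing        = c( 0.650,  0.589, -0.062, -0.083,  0.089),
  movement_agility          = c( 0.471, -0.101, -0.522,  0.551, -0.002),
  defending_standing_tackle = c(-0.139,  0.944,  0.084, -0.081,  0.027)
)
colnames(rot_ex) <- paste0("RC", 1:5)

cutoff        <- 0.40
n_salient     <- rowSums(abs(rot_ex) >= cutoff)    # salient loadings per variable
cross_loaders <- names(n_salient)[n_salient >= 2]  # variables loading on 2+ factors

n_salient
# For each cross-loader, list the factors it loads on
lapply(cross_loaders, function(v) colnames(rot_ex)[abs(rot_ex[v, ]) >= cutoff])
```

Applied to the full `loadings_rot` matrix, the same counting would reproduce the repeated variable entries visible in Table 14.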
comm_df <- data.frame(
Variable = names(h2_rot),
Communality = round(h2_rot, 3),
Representation = ifelse(h2_rot >= 0.70, "Well explained (h2 >= 0.70)",
ifelse(h2_rot >= 0.50, "Adequately explained (0.50 <= h2 < 0.70)",
"Poorly explained (h2 < 0.50)"))
) %>% arrange(Communality)
comm_df %>%
kable(caption = "Table 15. Variable Communalities -- Proportion of Variance Explained by 5-Factor Solution (sorted ascending)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
column_spec(2, bold = TRUE) %>%
column_spec(3, bold = TRUE,
color = ifelse(comm_df$Communality >= 0.70, "darkgreen",
ifelse(comm_df$Communality >= 0.50, "#E67E22", "red")))
| | Variable | Communality | Representation |
|---|---|---|---|
| weak_foot | weak_foot | 0.140 | Poorly explained (h2 < 0.50) |
| international_reputation | international_reputation | 0.264 | Poorly explained (h2 < 0.50) |
| skill_moves | skill_moves | 0.553 | Adequately explained (0.50 <= h2 < 0.70) |
| power_jumping | power_jumping | 0.564 | Adequately explained (0.50 <= h2 < 0.70) |
| power_stamina | power_stamina | 0.574 | Adequately explained (0.50 <= h2 < 0.70) |
| mentality_aggression | mentality_aggression | 0.670 | Adequately explained (0.50 <= h2 < 0.70) |
| weight_kg | weight_kg | 0.678 | Adequately explained (0.50 <= h2 < 0.70) |
| attacking_crossing | attacking_crossing | 0.682 | Adequately explained (0.50 <= h2 < 0.70) |
| mentality_penalties | mentality_penalties | 0.729 | Well explained (h2 >= 0.70) |
| attacking_heading_accuracy | attacking_heading_accuracy | 0.732 | Well explained (h2 >= 0.70) |
| skill_fk_accuracy | skill_fk_accuracy | 0.732 | Well explained (h2 >= 0.70) |
| movement_balance | movement_balance | 0.739 | Well explained (h2 >= 0.70) |
| height_cm | height_cm | 0.751 | Well explained (h2 >= 0.70) |
| power_shot_power | power_shot_power | 0.752 | Well explained (h2 >= 0.70) |
| skill_curve | skill_curve | 0.766 | Well explained (h2 >= 0.70) |
| mentality_composure | mentality_composure | 0.766 | Well explained (h2 >= 0.70) |
| movement_reactions | movement_reactions | 0.771 | Well explained (h2 >= 0.70) |
| age | age | 0.780 | Well explained (h2 >= 0.70) |
| skill_long_passing | skill_long_passing | 0.788 | Well explained (h2 >= 0.70) |
| movement_sprint_speed | movement_sprint_speed | 0.790 | Well explained (h2 >= 0.70) |
| mentality_positioning | mentality_positioning | 0.796 | Well explained (h2 >= 0.70) |
| power_strength | power_strength | 0.798 | Well explained (h2 >= 0.70) |
| movement_agility | movement_agility | 0.808 | Well explained (h2 >= 0.70) |
| attacking_volleys | attacking_volleys | 0.809 | Well explained (h2 >= 0.70) |
| mentality_vision | mentality_vision | 0.814 | Well explained (h2 >= 0.70) |
| power_long_shots | power_long_shots | 0.835 | Well explained (h2 >= 0.70) |
| skill_dribbling | skill_dribbling | 0.837 | Well explained (h2 >= 0.70) |
| attacking_short_passing | attacking_short_passing | 0.838 | Well explained (h2 >= 0.70) |
| skill_ball_control | skill_ball_control | 0.859 | Well explained (h2 >= 0.70) |
| movement_acceleration | movement_acceleration | 0.862 | Well explained (h2 >= 0.70) |
| potential | potential | 0.869 | Well explained (h2 >= 0.70) |
| attacking_finishing | attacking_finishing | 0.871 | Well explained (h2 >= 0.70) |
| overall | overall | 0.887 | Well explained (h2 >= 0.70) |
| defending_marking_awareness | defending_marking_awareness | 0.902 | Well explained (h2 >= 0.70) |
| mentality_interceptions | mentality_interceptions | 0.915 | Well explained (h2 >= 0.70) |
| defending_sliding_tackle | defending_sliding_tackle | 0.919 | Well explained (h2 >= 0.70) |
| defending_standing_tackle | defending_standing_tackle | 0.925 | Well explained (h2 >= 0.70) |
Technical Interpretation – Communalities:
Communality (h2) is the proportion of a variable’s total variance explained by the 5 retained factors. h2 = 1 means the variable is perfectly explained; h2 = 0 means it is completely unique.
29 of 37 variables (78.4%) achieve h2 >= 0.70, indicating good coverage. The highest communalities belong to the defending cluster: defending_standing_tackle (0.925), defending_sliding_tackle (0.919), mentality_interceptions (0.915), defending_marking_awareness (0.902). These high values reflect the very strong intercorrelation within the defending cluster: more than 90% of each variable's variance is captured by the five factors (predominantly RC2), leaving less than 10% as unique residual.
weak_foot (h2 = 0.140): Only 14% of weak_foot variance is shared with the 5 factors. Weak-foot ability (the rating of the non-dominant foot) appears to be largely an individual characteristic that is only weakly related to the five ability dimensions. A player can have skill_dribbling = 90 and weak_foot = 3, or skill_dribbling = 55 and weak_foot = 5. The relationship between overall technical quality and weak foot rating is weak and inconsistent, which is why the 5-factor model captures so little of it.
international_reputation (h2 = 0.264): Only 26.4% is explained. Reputation in FIFA 23 plausibly reflects commercial partnerships, historical peak performance, nationality and league visibility, and media prominence – factors with no direct mapping to any of the five ability dimensions. A veteran star may retain reputation = 4 as their actual attributes decline, and a technically exceptional but commercially overlooked player may hold reputation = 1 despite high ability scores.
These two variables are retained but carry minimal interpretive weight when discussing the factor structure. Their low communalities are substantively meaningful, not analytical failures.
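As a minimal sketch of how a communality is computed, the base-R example below uses simulated data rather than the FIFA sample. The names `X_sim`, `load_k`, and `h2_sim` are hypothetical and are not objects from this report. It derives unrotated component loadings from a correlation matrix and takes the row sums of squared loadings over the retained components, which is the definition given above; a variable unrelated to the retained components ends up with a low h2, much like weak_foot.

```r
set.seed(1)
n <- 500
# Two simulated latent traits drive four of the five variables
f1 <- rnorm(n); f2 <- rnorm(n)
X_sim <- cbind(
  a1 = f1 + rnorm(n, sd = 0.4),
  a2 = f1 + rnorm(n, sd = 0.4),
  d1 = f2 + rnorm(n, sd = 0.4),
  d2 = f2 + rnorm(n, sd = 0.4),
  noise = rnorm(n)              # unrelated to both traits
)

eig <- eigen(cor(X_sim))
# Component loadings = eigenvectors scaled by sqrt(eigenvalue)
load_all <- eig$vectors %*% diag(sqrt(eig$values))
rownames(load_all) <- colnames(X_sim)

k <- 2                                   # number of retained components
load_k <- load_all[, 1:k, drop = FALSE]
h2_sim <- rowSums(load_k^2)              # communality per variable
h2_full <- rowSums(load_all^2)           # all components kept: every h2 equals 1
print(round(h2_sim, 3))
```

Because an orthogonal rotation such as VARIMAX leaves row sums of squared loadings unchanged, the same values would be obtained from the rotated loadings.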
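set.seed(42)
# Randomly assign half of the rows to subsample 1; the remainder form subsample 2
n_half <- floor(nrow(data_final) / 2)
idx_split <- sample(seq_len(nrow(data_final)), n_half)
data_s1 <- data_final[idx_split, ]
data_s2 <- data_final[-idx_split, ]
# Refit the VARIMAX-rotated principal-component solution on each half
fa_s1 <- principal(data_s1, nfactors = n_comp, rotate = "varimax")
fa_s2 <- principal(data_s2, nfactors = n_comp, rotate = "varimax")
# Row 2 of Vaccounted is "Proportion Var"
var_s1 <- round(fa_s1$Vaccounted[2, 1:n_comp] * 100, 2)
var_s2 <- round(fa_s2$Vaccounted[2, 1:n_comp] * 100, 2)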
val_df <- data.frame(
Factor = paste0("RC", 1:n_comp),
Sample_1 = var_s1,
Sample_2 = var_s2,
Difference = round(abs(var_s1 - var_s2), 2),
Stable = ifelse(abs(var_s1 - var_s2) < 5, "Stable", "Unstable")
)
val_df %>%
kable(caption = "Table 16. Split-Sample Validation (n = 2,500 each, set.seed = 42)",
col.names = c("Factor", "Sample 1 Var (%)", "Sample 2 Var (%)",
"Difference (%)", "Stable (diff < 5%)?")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
column_spec(4, bold = TRUE) %>%
column_spec(5, bold = TRUE,
color = ifelse(val_df$Stable == "Stable", "darkgreen", "red"))
| | Factor | Sample 1 Var (%) | Sample 2 Var (%) | Difference (%) | Stable (diff < 5%)? |
|---|---|---|---|---|---|
| RC1 | RC1 | 33.27 | 33.75 | 0.48 | Stable |
| RC2 | RC2 | 17.21 | 16.67 | 0.54 | Stable |
| RC3 | RC3 | 12.85 | 12.88 | 0.03 | Stable |
| RC4 | RC4 | 7.21 | 7.97 | 0.76 | Stable |
| RC5 | RC5 | 4.28 | 4.10 | 0.18 | Stable |
max_diff <- max(val_df$Difference)
cat(sprintf("\nMaximum variance difference across all factors: %.2f%%\n", max_diff))
##
## Maximum variance difference across all factors: 0.76%
cat(ifelse(max_diff < 5,
"STABLE: solution generalizes reliably beyond this specific sample.",
"WARNING: solution may be unstable."))
## STABLE: solution generalizes reliably beyond this specific sample.
Technical Interpretation – Split-Sample Validation:
Split-sample validation tests whether the 5-factor structure is a stable property of the sampled FIFA 23 players or an artifact of the specific 5,000-player random sample. The procedure splits the data into two non-overlapping n = 2,500 subsamples (set.seed = 42 for reproducibility), refits the VARIMAX-rotated principal() solution separately on each half, and compares the percentage of variance explained by each factor. If the structure were overfitted, substantial variance differences would be expected between the two halves. Note that this check compares variance shares only; it does not by itself confirm that the same variables load on the same factor in both halves.
Maximum difference = 0.76 percentage points (RC4). All five factors show differences of less than 1 percentage point, far below the 5-point stability threshold. This result has several implications:
RC1 (33.27% vs. 33.75%, diff = 0.48%): The dominant Technical and Attacking Ability factor is virtually identical across both halves. The near-perfect replication confirms that the attacking/technical ability dimension is the strongest and most stable structural feature of the FIFA 23 player population.
RC2 (17.21% vs. 16.67%, diff = 0.54%): The defending dimension replicates equally well, consistent with the near-perfect loadings (0.93 to 0.944) that define it.
RC3 (12.85% vs. 12.88%, diff = 0.03%): The smallest difference in the entire solution. One plausible explanation is that physical measurements (height, weight, strength) are more objectively assigned in FIFA than technical ratings, giving a very consistent distribution across subsamples.
RC4 (7.21% vs. 7.97%, diff = 0.76%): The largest difference, plausibly reflecting slight sampling variability in the proportion of speed-specialist wingers and full-backs between the two halves.
RC5 (4.28% vs. 4.10%, diff = 0.18%): Despite being the most conceptually unusual factor (career stage rather than ability), it replicates with excellent stability.
Practical conclusion: The variance profile of the 5-factor VARIMAX solution is stable across random halves of the sample, which supports its robustness. Factor scores computed from this solution are therefore a reasonable basis for downstream analyses (player clustering, position classification, market value prediction), with the caveat that the check covers variance shares rather than loading patterns.
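Comparing variance shares can be complemented by comparing the loading patterns themselves. One common measure is Tucker's congruence coefficient, phi = sum(x * y) / sqrt(sum(x^2) * sum(y^2)), computed between loading columns from the two halves. The sketch below is illustrative only: `tucker_phi` is a hypothetical helper, and `L1` and `L2` are made-up loading matrices, not values from this report. In practice one would pass the unclassed loading matrices of `fa_s1` and `fa_s2`; the psych package also provides `factor.congruence()` for this purpose.

```r
# Tucker's congruence coefficient between every pair of loading columns
tucker_phi <- function(A, B) {
  A <- as.matrix(A); B <- as.matrix(B)
  num <- crossprod(A, B)                            # sums of cross-products
  den <- sqrt(outer(colSums(A^2), colSums(B^2)))    # product of column norms
  num / den
}

# Hypothetical loadings: 4 variables on 2 factors in each half
L1 <- matrix(c(0.90, 0.85, 0.10, 0.05,
               0.05, 0.10, 0.88, 0.92), ncol = 2,
             dimnames = list(paste0("v", 1:4), c("RC1", "RC2")))
L2 <- matrix(c(0.88, 0.87, 0.12, 0.02,
               0.08, 0.06, 0.90, 0.89), ncol = 2,
             dimnames = list(paste0("v", 1:4), c("RC1", "RC2")))

phi <- tucker_phi(L1, L2)
print(round(phi, 3))   # diagonal: matching factors; off-diagonal: mismatched pairs
```

A frequently cited rule of thumb treats diagonal values of about 0.95 or higher as indicating equivalent factors. A large off-diagonal value would instead suggest that factors have swapped order between the halves, which a position-by-position variance comparison cannot detect.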
factor_scores <- as.data.frame(fa_rot$scores)
colnames(factor_scores) <- paste0("Factor_", 1:n_comp)
fwrite(factor_scores, "fifa23_factor_scores.csv")
pc_scores <- as.data.frame(pc$x[, 1:n_comp])
fwrite(pc_scores, "fifa23_pc_scores.csv")
cat("Factor scores saved: fifa23_factor_scores.csv\n")
cat("PC scores saved: fifa23_pc_scores.csv\n")
## Factor scores saved: fifa23_factor_scores.csv
## PC scores saved: fifa23_pc_scores.csv
round(sapply(factor_scores, function(x)
c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))), 3) %>%
t() %>%
kable(caption = "Table 17. Factor Score Distribution (standardized: mean ~= 0, SD ~= 1)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
column_spec(1, bold = TRUE)
| | Mean | SD | Min | Max |
|---|---|---|---|---|
| Factor_1 | 0 | 1 | -2.820 | 3.794 |
| Factor_2 | 0 | 1 | -2.773 | 2.367 |
| Factor_3 | 0 | 1 | -2.787 | 3.618 |
| Factor_4 | 0 | 1 | -4.706 | 3.204 |
| Factor_5 | 0 | 1 | -3.055 | 3.188 |
data.frame(
Metric = c(
"Dataset",
"Total Raw Records",
"Analysis Sample",
"Variables Used",
"Significant Correlation Pairs",
"Bartlett Test Result",
"Overall KMO",
"KMO Classification",
"Variables Removed (MSA < 0.50)",
"Components Retained (Kaiser Rule)",
"Total Variance Explained",
"Variance by PC1 alone",
"Variance by PC1 and PC2",
"VARIMAX Factors",
"Split-Sample Max Difference"
),
Value = c(
"FIFA 23 Complete Player Dataset (Kaggle, 2022)",
"147,400 players",
"5,000 players (set.seed = 123)",
paste0(ncol(data_final),
" numeric sub-attributes (6 aggregates excluded)"),
"318 / 666 pairs (47.7%) with |r| > 0.3",
"chi-sq = 218,585.80; df = 666; p ~= 0 (< 2.2e-16)",
"0.942",
"Marvelous (>= 0.90)",
"0 variables removed (minimum MSAi = 0.708)",
paste0(n_comp, " components (PC1 to PC", n_comp, ")"),
paste0(round(cum_var[n_comp] * 100, 2), "% of total variance"),
paste0(round(prop_var[1] * 100, 2), "%"),
paste0(round(cum_var[2] * 100, 2), "%"),
paste0("RC1: Technical-Attacking | RC2: Defensive-Aggression | ",
"RC3: Physical-Aerial | RC4: Speed-Stamina | ",
"RC5: Youth Potential vs. Experience"),
paste0(max(val_df$Difference),
"% (RC4) -- well below 5% stability threshold")
)
) %>%
kable(caption = "Table 18. Complete PCA and FA Results Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE, width = "22em") %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
row_spec(c(7, 8, 9, 10, 11, 14, 15), bold = TRUE, background = "#EBF5FB")
| Metric | Value |
|---|---|
| Dataset | FIFA 23 Complete Player Dataset (Kaggle, 2022) |
| Total Raw Records | 147,400 players |
| Analysis Sample | 5,000 players (set.seed = 123) |
| Variables Used | 37 numeric sub-attributes (6 aggregates excluded) |
| Significant Correlation Pairs | 318 / 666 pairs (47.7%) with \|r\| > 0.3 |
| Bartlett Test Result | chi-sq = 218,585.80; df = 666; p ~= 0 (< 2.2e-16) |
| Overall KMO | 0.942 |
| KMO Classification | Marvelous (>= 0.90) |
| Variables Removed (MSA < 0.50) | 0 variables removed (minimum MSAi = 0.708) |
| Components Retained (Kaiser Rule) | 5 components (PC1 to PC5) |
| Total Variance Explained | 75.04% of total variance |
| Variance by PC1 alone | 35.73% |
| Variance by PC1 and PC2 | 56.09% |
| VARIMAX Factors | RC1: Technical-Attacking \| RC2: Defensive-Aggression \| RC3: Physical-Aerial \| RC4: Speed-Stamina \| RC5: Youth Potential vs. Experience |
| Split-Sample Max Difference | 0.76% (RC4) – well below 5% stability threshold |
KMO = 0.942 means that the partial correlations among the variables are small relative to their zero-order correlations, so most shared variance can be attributed to common factors. KMO is an index between 0 and 1, not a percentage of variance, and 0.942 places this dataset in the highest practical category for PCA/FA. Bartlett’s chi-sq = 218,585.80 (df = 666, p ~= 0) rejects the identity matrix hypothesis with a test statistic roughly 300 times larger than the critical value. Together, these results indicate that the FIFA 23 player attribute data has a strong latent factor structure that is well suited to factoring. The exclusion of 6 aggregate attributes was methodologically essential: because they are computed from the sub-attributes, retaining them would have produced a singular or near-singular correlation matrix, for which the KMO statistic is not reliable.
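To make the two adequacy statistics concrete, here is a minimal base-R sketch of their textbook definitions, run on simulated data. The functions `kmo_overall` and `bartlett_sphericity` and the objects `X_sim` and `R_sim` are hypothetical names introduced for illustration; the report itself uses `KMO()` and `cortest.bartlett()` from psych. The overall KMO is the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations, and Bartlett's statistic is -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom.

```r
# Overall KMO from a correlation matrix R
kmo_overall <- function(R) {
  Rinv <- solve(R)
  # Partial correlation of each pair given all remaining variables
  P <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))
  off <- row(R) != col(R)
  sum(R[off]^2) / (sum(R[off]^2) + sum(P[off]^2))
}

# Bartlett's test of sphericity (H0: R is an identity matrix)
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  log_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  stat <- -(n - 1 - (2 * p + 5) / 6) * log_det
  df <- p * (p - 1) / 2
  c(chisq = stat, df = df,
    p_value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(2)
n <- 400
f <- rnorm(n)
# Six simulated variables sharing one common factor
X_sim <- sapply(1:6, function(j) f + rnorm(n, sd = 0.7))
R_sim <- cor(X_sim)

kmo_sim <- kmo_overall(R_sim)
bart_sim <- bartlett_sphericity(R_sim, n)
df_37 <- 37 * 36 / 2     # degrees of freedom with 37 variables, as in this report
print(round(kmo_sim, 3))
print(bart_sim)
print(df_37)
```

The last line reproduces the df = 666 reported in Table 18. The log-determinant is taken via `determinant()` because the raw determinant of a large, highly correlated matrix can underflow.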
The reduction from 37 original attributes to 5 principal components retaining 75.04% of total variance achieves a 7.4:1 dimensionality compression with 24.96% information loss. Before rotation, the components explain 35.73% (PC1), 20.36% (PC2), 9.85% (PC3), 5.16% (PC4), and 3.94% (PC5) of total variance. VARIMAX rotation redistributes this variance across the five interpreted dimensions without changing the 75.04% total: RC1 – Technical and Attacking Ability, RC2 – Defensive and Aggression, RC3 – Physical Strength and Aerial, RC4 – Speed and Stamina, and RC5 – Youth Potential vs. Experience (roughly 33%, 17%, 13%, 7-8%, and 4% in the split-half fits of Table 16). Each dimension is interpretable in terms of well-established football analytics concepts, which suggests that the analysis has recovered genuine latent structure rather than mathematical noise.
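The distinction between unrotated and rotated variance shares can be checked with a small base-R sketch. It uses simulated data and hypothetical names (`X_sim`, `load_pc`, `load_rc`), not the report's objects, and applies `stats::varimax()` to the retained component loadings. The point it illustrates is that an orthogonal rotation changes how variance is divided among the retained dimensions but leaves both the total and each variable's communality unchanged.

```r
set.seed(3)
n <- 600
f1 <- rnorm(n); f2 <- rnorm(n)
# Five simulated variables with some cross-loading between two traits
X_sim <- cbind(
  t1 = f1 + rnorm(n, sd = 0.5),
  t2 = f1 + rnorm(n, sd = 0.5),
  t3 = f1 + 0.3 * f2 + rnorm(n, sd = 0.5),
  d1 = f2 + rnorm(n, sd = 0.5),
  d2 = f2 - 0.3 * f1 + rnorm(n, sd = 0.5)
)
p <- ncol(X_sim)
k <- 2                                             # retained components

eig <- eigen(cor(X_sim))
load_pc <- eig$vectors[, 1:k] %*% diag(sqrt(eig$values[1:k]))
load_rc <- unclass(varimax(load_pc)$loadings)      # orthogonal VARIMAX rotation

share_pc <- colSums(load_pc^2) / p   # variance share per unrotated component
share_rc <- colSums(load_rc^2) / p   # variance share per rotated component
print(round(rbind(unrotated = share_pc, rotated = share_rc), 4))
print(c(total_unrotated = sum(share_pc), total_rotated = sum(share_rc)))
```

The first unrotated component always has the largest possible share, so the leading rotated dimension can only match or fall below it. This is consistent with PC1 explaining 35.73% here while RC1 explains about 33% in the split-half fits.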
The negative correlations (-0.40 to -0.60) between attacking/skill variables and defending variables, together with the clean separation of the two clusters onto RC1 and RC2, indicate that technical attacking ability and defensive competence form distinct dimensions in FIFA 23. Because VARIMAX is an orthogonal rotation, RC1 and RC2 are uncorrelated by construction; the substantive evidence is that each cluster loads strongly on its own factor and only weakly on the other. This has direct analytical implications: any method that combines these attributes into a single “overall quality” metric risks conflating two structurally distinct player types. The 5-factor solution separates them, enabling meaningful comparison of attackers and defenders within their respective specialization dimensions.
weak_foot (h2 = 0.140) and international_reputation (h2 = 0.264) have very low communalities because they measure constructs largely unrelated to the five ability dimensions. Their low communalities are not data quality failures but evidence that these two attributes vary mostly along other dimensions (plausibly individual lateralization for weak_foot; commercial and historical reputation for international_reputation). A sixth factor specifically targeting reputation and marketability might capture international_reputation, but this is beyond the current analysis scope.
All five factors replicate within 1 percentage point of variance explained across two non-overlapping n = 2,500 subsamples, far below the 5-point stability threshold. The smallest difference is for RC3 (0.03), possibly reflecting the objective nature of physical measurements. The largest is for RC4 (0.76), plausibly due to subsample variation in speed-specialist positions. Overall, the variance profile of the factor structure appears to be a robust property of the sampled FIFA 23 players rather than a sampling artifact, which supports using the factor scores in downstream analyses.
psych – KMO(), cortest.bartlett(), principal()
factoextra – fviz_eig(), fviz_pca_biplot(), fviz_contrib()
corrplot – corrplot()
kableExtra – table styling
ggplot2 + tidyr – distribution visualization
data.table – fread(), fwrite()
Analysis Date: March 01, 2026
Multivariate Analysis – FIFA 23 PCA and FA Full Report
INT2024 | Course Lecturer: Ulfa Siti Nuraini, S.Stat., M.Stat.