1 Executive Summary

This report presents a Principal Component Analysis (PCA) and Factor Analysis (FA) of the FIFA 23 Complete Player Dataset (Stefano Leone, 2022), which contains 147,400 player records. A random sample of 5,000 players was drawn (set.seed(123)) and analysed on 37 numeric sub-attributes covering attacking, skill, movement, power, mentality, and defending capabilities.

Key Findings:

  • 47.7% of variable pairs show |r| > 0.3: data contains sufficient shared variance for PCA/FA
  • KMO = 0.942 (Marvelous): all 37 variables have individual MSA >= 0.50, no removal needed
  • Bartlett chi-sq = 218,585.80, p ~= 0: correlation structure is non-trivial and significant
  • 5 principal components retained via Kaiser’s Rule, explaining 75.04% of total variance
  • VARIMAX rotation identified 5 interpretable factors: Technical and Attacking Ability (RC1), Defensive and Aggression (RC2), Physical Strength and Aerial (RC3), Speed and Stamina (RC4), Youth Potential vs. Experience (RC5)
  • Split-sample validation confirms solution stability (max difference = 0.76% across all 5 factors)

2 A. Data Characteristics

2.1 Dataset Overview

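# Packages used throughout this report
library(data.table)   # fread()
library(dplyr)        # %>%, select(), sample_n(), case_when(), arrange()
library(tidyr)        # pivot_longer()
library(ggplot2)      # boxplots
library(knitr)        # kable()
library(kableExtra)   # kable_styling(), column_spec(), row_spec(), scroll_box()
library(corrplot)     # corrplot()
library(psych)        # KMO(), cortest.bartlett()

df_raw <- fread("fifa23_clean_numeric.csv")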

# Sub-attributes only -- the 6 main aggregate attributes (pace, shooting,
# passing, dribbling, defending, physic) are arithmetic means of their
# sub-components, creating perfect multicollinearity that invalidates KMO
kolom_analisis <- c(
  "overall", "potential",
  "attacking_crossing", "attacking_finishing",
  "attacking_heading_accuracy", "attacking_short_passing", "attacking_volleys",
  "skill_dribbling", "skill_curve", "skill_fk_accuracy",
  "skill_long_passing", "skill_ball_control",
  "movement_acceleration", "movement_sprint_speed",
  "movement_agility", "movement_reactions", "movement_balance",
  "power_shot_power", "power_jumping", "power_stamina",
  "power_strength", "power_long_shots",
  "mentality_aggression", "mentality_interceptions",
  "mentality_positioning", "mentality_vision",
  "mentality_penalties", "mentality_composure",
  "defending_marking_awareness", "defending_standing_tackle",
  "defending_sliding_tackle",
  "skill_moves", "weak_foot", "international_reputation",
  "height_cm", "weight_kg", "age"
)

kolom_analisis <- intersect(kolom_analisis, names(df_raw))
data_pca <- df_raw %>% select(all_of(kolom_analisis)) %>% na.omit()

set.seed(123)
data_pca <- data_pca %>% sample_n(5000)

cat("Dataset Dimensions (raw):", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")
## Dataset Dimensions (raw): 147400 rows x 45 columns
cat("Analysis Sample:", nrow(data_pca), "rows x", ncol(data_pca), "columns\n")
## Analysis Sample: 5000 rows x 37 columns
data.frame(
  Information = c("Source", "Author", "Year",
                  "Total Records (Raw)", "Analysis Sample",
                  "Variables Used", "Sampling Method", "Random Seed"),
  Detail = c(
    "Kaggle -- FIFA 23 Complete Player Dataset",
    "Stefano Leone",
    "2022",
    "147,400 players",
    "5,000 players (random sample without replacement)",
    paste0(ncol(data_pca), " numeric sub-attributes"),
    "Simple Random Sampling",
    "set.seed(123)"
  )
) %>%
  kable(caption = "Table 1. Dataset Summary Information",
        col.names = c("Information", "Detail")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Table 1. Dataset Summary Information
Information Detail
Source Kaggle – FIFA 23 Complete Player Dataset
Author Stefano Leone
Year 2022
Total Records (Raw) 147,400 players
Analysis Sample 5,000 players (random sample without replacement)
Variables Used 37 numeric sub-attributes
Sampling Method Simple Random Sampling
Random Seed set.seed(123)

2.2 Variable List

data.frame(
  No       = 1:ncol(data_pca),
  Variable = names(data_pca),
  Category = case_when(
    names(data_pca) %in% c("overall", "potential") ~ "Overall",
    grepl("attacking_", names(data_pca))           ~ "Attacking",
    grepl("skill_dribbling|skill_curve|skill_fk|skill_long|skill_ball",
          names(data_pca))                         ~ "Skill",
    grepl("movement_", names(data_pca))            ~ "Movement",
    grepl("power_", names(data_pca))               ~ "Power",
    grepl("mentality_", names(data_pca))           ~ "Mentality",
    grepl("defending_", names(data_pca))           ~ "Defending",
    TRUE                                           ~ "Physical / Misc"
  ),
  Scale = case_when(
    names(data_pca) %in%
      c("skill_moves", "weak_foot", "international_reputation") ~ "1-5 (discrete)",
    names(data_pca) == "height_cm" ~ "cm",
    names(data_pca) == "weight_kg" ~ "kg",
    names(data_pca) == "age"       ~ "years",
    TRUE                           ~ "1-100 (continuous)"
  )
) %>%
  kable(caption = "Table 2. Research Variables (37 Sub-Attributes Used in Analysis)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Table 2. Research Variables (37 Sub-Attributes Used in Analysis)
No Variable Category Scale
1 overall Overall 1-100 (continuous)
2 potential Overall 1-100 (continuous)
3 attacking_crossing Attacking 1-100 (continuous)
4 attacking_finishing Attacking 1-100 (continuous)
5 attacking_heading_accuracy Attacking 1-100 (continuous)
6 attacking_short_passing Attacking 1-100 (continuous)
7 attacking_volleys Attacking 1-100 (continuous)
8 skill_dribbling Skill 1-100 (continuous)
9 skill_curve Skill 1-100 (continuous)
10 skill_fk_accuracy Skill 1-100 (continuous)
11 skill_long_passing Skill 1-100 (continuous)
12 skill_ball_control Skill 1-100 (continuous)
13 movement_acceleration Movement 1-100 (continuous)
14 movement_sprint_speed Movement 1-100 (continuous)
15 movement_agility Movement 1-100 (continuous)
16 movement_reactions Movement 1-100 (continuous)
17 movement_balance Movement 1-100 (continuous)
18 power_shot_power Power 1-100 (continuous)
19 power_jumping Power 1-100 (continuous)
20 power_stamina Power 1-100 (continuous)
21 power_strength Power 1-100 (continuous)
22 power_long_shots Power 1-100 (continuous)
23 mentality_aggression Mentality 1-100 (continuous)
24 mentality_interceptions Mentality 1-100 (continuous)
25 mentality_positioning Mentality 1-100 (continuous)
26 mentality_vision Mentality 1-100 (continuous)
27 mentality_penalties Mentality 1-100 (continuous)
28 mentality_composure Mentality 1-100 (continuous)
29 defending_marking_awareness Defending 1-100 (continuous)
30 defending_standing_tackle Defending 1-100 (continuous)
31 defending_sliding_tackle Defending 1-100 (continuous)
32 skill_moves Physical / Misc 1-5 (discrete)
33 weak_foot Physical / Misc 1-5 (discrete)
34 international_reputation Physical / Misc 1-5 (discrete)
35 height_cm Physical / Misc cm
36 weight_kg Physical / Misc kg
37 age Physical / Misc years

2.3 Descriptive Statistics

desc_df <- data.frame(
  Variable = names(data_pca),
  N        = sapply(data_pca, length),
  Mean     = round(sapply(data_pca, mean),   2),
  Median   = round(sapply(data_pca, median), 2),
  SD       = round(sapply(data_pca, sd),     2),
  Min      = round(sapply(data_pca, min),    2),
  Max      = round(sapply(data_pca, max),    2)
)

desc_df %>%
  kable(caption = "Table 3. Descriptive Statistics -- All 37 Variables (n = 5,000)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  scroll_box(width = "100%", height = "420px")
Table 3. Descriptive Statistics – All 37 Variables (n = 5,000)
Variable N Mean Median SD Min Max
overall overall 5000 65.93 66 6.76 46 89
potential potential 5000 71.07 71 6.21 49 91
attacking_crossing attacking_crossing 5000 53.86 56 13.38 15 90
attacking_finishing attacking_finishing 5000 50.70 53 15.93 15 91
attacking_heading_accuracy attacking_heading_accuracy 5000 56.45 57 11.40 19 90
attacking_short_passing attacking_short_passing 5000 62.98 64 9.32 23 92
attacking_volleys attacking_volleys 5000 46.34 46 14.47 12 88
skill_dribbling skill_dribbling 5000 61.32 63 11.78 20 95
skill_curve skill_curve 5000 51.94 52 14.37 17 93
skill_fk_accuracy skill_fk_accuracy 5000 46.79 45 14.31 12 94
skill_long_passing skill_long_passing 5000 56.94 58 11.58 20 90
skill_ball_control skill_ball_control 5000 63.39 64 9.64 27 94
movement_acceleration movement_acceleration 5000 68.35 69 11.27 27 94
movement_sprint_speed movement_sprint_speed 5000 68.41 69 11.21 29 96
movement_agility movement_agility 5000 66.66 68 12.06 26 94
movement_reactions movement_reactions 5000 61.91 62 8.71 32 91
movement_balance movement_balance 5000 67.20 68 12.11 28 95
power_shot_power power_shot_power 5000 59.12 61 13.00 18 90
power_jumping power_jumping 5000 65.76 67 11.88 30 93
power_stamina power_stamina 5000 67.10 68 11.38 28 94
power_strength power_strength 5000 65.59 67 12.78 27 96
power_long_shots power_long_shots 5000 51.33 54 15.62 12 89
mentality_aggression mentality_aggression 5000 59.22 60 13.59 20 94
mentality_interceptions mentality_interceptions 5000 50.78 56 18.20 10 88
mentality_positioning mentality_positioning 5000 55.62 58 14.05 12 91
mentality_vision mentality_vision 5000 56.22 57 12.49 17 91
mentality_penalties mentality_penalties 5000 51.50 51 12.19 18 91
mentality_composure mentality_composure 5000 60.25 60 10.11 30 93
defending_marking_awareness defending_marking_awareness 5000 50.88 55 17.19 10 88
defending_standing_tackle defending_standing_tackle 5000 52.66 59 18.22 10 88
defending_sliding_tackle defending_sliding_tackle 5000 50.22 56 18.09 10 87
skill_moves skill_moves 5000 2.56 2 0.65 2 5
weak_foot weak_foot 5000 3.01 3 0.65 1 5
international_reputation international_reputation 5000 1.09 1 0.34 1 5
height_cm height_cm 5000 180.62 180 6.66 156 206
weight_kg weight_kg 5000 74.26 74 6.65 54 101
age age 5000 25.02 25 4.51 16 39

Technical Interpretation – Descriptive Statistics:

Movement attributes (acceleration, sprint_speed, agility, reactions, balance) record some of the highest means among the 1 to 100 rated attributes (61.91 to 68.41; only potential, at 71.07, exceeds the top movement means) with relatively narrow standard deviations (8.71 to 12.11). This is consistent with a player population dominated by young-to-prime-age outfield players for whom athletic movement qualities are well developed. The narrow dispersion in this category signals moderate homogeneity: most outfield players cluster within a similar movement ability range.

Defending attributes (mentality_interceptions, defending_marking_awareness, defending_standing_tackle, defending_sliding_tackle) have the largest standard deviations in the entire dataset (17.19 to 18.22). This is analytically important: the spread is consistent with a position-driven, bimodal distribution rather than noise. Attackers and wingers typically receive low values (roughly 10 to 30) on these attributes, while defenders and defensive midfielders receive high ones (roughly 60 to 88). Because every variable is standardized before extraction, it is not the raw variance itself but the strong, position-driven correlation among these four attributes that provides the discriminatory signal for factor separation in PCA/FA.

Scale heterogeneity is a critical preprocessing concern. skill_moves, weak_foot, and international_reputation operate on a 1 to 5 discrete scale with very low standard deviations (0.34 to 0.65), while height_cm (156 to 206 cm) and weight_kg (54 to 101 kg) are physical measurements with their own units. Without standardization, variables with larger variances (here the 1 to 100 ratings, with SDs up to 18.22) would dominate the PC directions simply because of their measurement scale. Applying scale. = TRUE in prcomp() resolves this by converting all variables to mean = 0, SD = 1 before eigendecomposition, which is equivalent to running PCA on the correlation matrix.
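The effect of scaling can be illustrated with a minimal sketch. The data below are simulated stand-ins, not the FIFA 23 sample, and the names rating, stars and height_cm are hypothetical:

```r
# Toy illustration only: simulated stand-ins, not the FIFA 23 sample.
set.seed(1)
n <- 500
rating    <- rnorm(n, mean = 60, sd = 12)                                # 1-100 style attribute
stars     <- pmin(pmax(round(rating / 25 + rnorm(n, sd = 0.5)), 1), 5)  # 1-5 discrete scale
height_cm <- rnorm(n, mean = 180, sd = 6.5)                              # physical measurement
toy <- data.frame(rating, stars, height_cm)

pca_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)  # covariance-based PCA
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)   # correlation-based PCA

# Absolute PC1 loadings: without scaling, PC1 is driven almost entirely by
# 'rating', the variable with the largest variance.
round(abs(pca_raw$rotation[, 1]), 3)
round(abs(pca_scaled$rotation[, 1]), 3)

# With scale. = TRUE every variable contributes variance 1,
# so the eigenvalues sum to the number of variables (3 here).
sum(pca_scaled$sdev^2)
```

In the scaled solution the 1 to 5 variable can load on PC1 alongside the rating it is correlated with, which the unscaled solution hides.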

potential (mean = 71.07) exceeds overall (mean = 65.93) by about 5 points on average, consistent with FIFA 23 game logic where potential represents the development ceiling. This gap is expected to be largest for players under age 21 and near zero for veterans whose careers have peaked.

international_reputation has the lowest mean (1.09) and lowest SD (0.34), meaning the vast majority of the 5,000 sampled players have a reputation rating of exactly 1. This near-constant distribution is the primary reason international_reputation will show the second-lowest communality (h2 = 0.264) in the PCA solution: after standardization it carries unit variance like every other variable, but a variable that barely varies across players can correlate only weakly with the rest, so little of its variance is shared with the retained components.

2.4 Distribution Visualization

data_long <- data_pca %>%
  tidyr::pivot_longer(cols = everything(),
                      names_to  = "Variable",
                      values_to = "Value") %>%
  mutate(Category = case_when(
    Variable %in% c("overall", "potential")                                            ~ "Overall",
    grepl("attacking_", Variable)                                                      ~ "Attacking",
    grepl("skill_dribbling|skill_curve|skill_fk|skill_long|skill_ball", Variable)     ~ "Skill",
    grepl("movement_", Variable)                                                       ~ "Movement",
    grepl("power_", Variable)                                                          ~ "Power",
    grepl("mentality_", Variable)                                                      ~ "Mentality",
    grepl("defending_", Variable)                                                      ~ "Defending",
    TRUE                                                                               ~ "Physical / Misc"
  ))

ggplot(data_long, aes(x = Variable, y = Value, fill = Category)) +
  geom_boxplot(outlier.size = 0.4, outlier.alpha = 0.3, linewidth = 0.4) +
  facet_wrap(~ Category, scales = "free", ncol = 2) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x      = element_text(angle = 45, hjust = 1, size = 7),
    strip.background = element_rect(fill = "#34495E", color = "#34495E"),
    strip.text       = element_text(color = "white", face = "bold"),
    legend.position  = "none",
    plot.title       = element_text(hjust = 0.5, face = "bold", size = 12)
  ) +
  labs(
    title = "Figure 1. Distribution of FIFA 23 Player Attributes by Category",
    x = NULL, y = "Value"
  )
Figure 1. Boxplot Distribution of All 37 FIFA 23 Player Attributes by Category. Wide IQR indicates high variance and strong discriminatory potential for PCA.

Technical Interpretation – Distribution Patterns:

The Defending category shows the most analytically informative distributions. The boxplots for mentality_interceptions, defending_standing_tackle, defending_sliding_tackle, and defending_marking_awareness display very wide IQRs, with values ranging from 10 to 88 and medians between 55 and 59. This spread is consistent with a population that mixes attackers (scoring roughly 10 to 30) and defenders (scoring roughly 60 to 88), a contrast that PCA will leverage to define its second principal component.

The Movement category displays left-skewed (negatively skewed) distributions with outliers concentrated at the low end. The bulk of players cluster between 60 and 80, with a lower tail of low-mobility players (goalkeepers, physically large central defenders, older veterans). This skew does not violate PCA assumptions since PCA only requires linear correlation structure, not normality.

Physical/Misc category reveals three structurally distinct distributions: (1) height_cm and weight_kg follow near-normal distributions consistent with the real anthropometric distribution of professional footballers; (2) skill_moves and weak_foot follow discrete distributions concentrated at values 2 to 3 with sharp cutoffs at the boundaries; (3) international_reputation is extremely right-skewed with virtually all mass at value 1, confirming its near-constant nature that will produce low communality in PCA.

The Attacking and Skill categories are mostly left-skewed: for most of these attributes the mean sits below the median (for example attacking_finishing, mean 50.70 vs. median 53), suggesting a tail of low values rather than of high ones. attacking_volleys and skill_fk_accuracy are the exceptions, with means at or slightly above their medians. Low-end outliers in these categories are typically defensive specialists receiving minimal investment in attacking attributes from the game’s design system.
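The direction of skew can be checked numerically rather than read off a boxplot. The following is a minimal sketch on simulated ratings (not the FIFA 23 sample); skewness() is a hypothetical helper defined here, not a function used elsewhere in the report:

```r
# Toy illustration only: simulated ratings, not the FIFA 23 sample.
set.seed(2)
# Most values sit high, with a tail toward low values (left / negative skew)
left_skewed <- 95 - rgamma(2000, shape = 2, scale = 8)

# Moment-based sample skewness (hypothetical helper for illustration)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skewness(left_skewed)                     # negative when the tail is on the low side
mean(left_skewed) < median(left_skewed)   # the low tail pulls the mean below the median
```

Applied column by column (for example with sapply(data_pca, skewness), assuming data_pca from the earlier chunk), the sign of the result indicates on which side each attribute's tail lies.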


3 B. Assumptions

Three assumptions must ALL be satisfied before PCA/FA is valid:

  1. Correlation Matrix – at least 30% of all variable pairs must show |r| > 0.3
  2. KMO / Measure of Sampling Adequacy (MSA) – overall KMO >= 0.50; all individual MSAi >= 0.50
  3. Bartlett’s Test of Sphericity – p-value < 0.05 (reject identity matrix hypothesis)

The order tested here is Correlation first, then KMO (iterative variable removal if needed), then Bartlett on the cleaned dataset. This sequence ensures that the Bartlett test is computed on the same variable set confirmed adequate by KMO.

3.1 Pre-Cleaning

var_vals <- apply(data_pca, 2, var, na.rm = TRUE)
zero_var <- names(var_vals[var_vals < 1e-10])
if (length(zero_var) > 0) {
  cat("Dropped (zero variance):", paste(zero_var, collapse = ", "), "\n")
  data_pca <- data_pca %>% select(-all_of(zero_var))
} else {
  cat("Pre-clean Step 1: No zero-variance variables found. All 37 retained.\n")
}
## Pre-clean Step 1: No zero-variance variables found. All 37 retained.
cor_tmp  <- cor(data_pca, use = "complete.obs")
perf_idx <- which(abs(cor_tmp) > 0.9999 & upper.tri(cor_tmp), arr.ind = TRUE)
if (nrow(perf_idx) > 0) {
  drop_perf <- unique(colnames(cor_tmp)[perf_idx[, 2]])
  cat("Dropped (perfect correlation):", paste(drop_perf, collapse = ", "), "\n")
  data_pca  <- data_pca %>% select(-all_of(drop_perf))
} else {
  cat("Pre-clean Step 2: No perfectly correlated pairs found. All 37 retained.\n")
}
## Pre-clean Step 2: No perfectly correlated pairs found. All 37 retained.
cat("Variables entering assumption testing:", ncol(data_pca))
## Variables entering assumption testing: 37

Technical note on pre-cleaning: Zero-variance variables produce undefined Pearson correlations (division by SD = 0), and perfectly correlated variables make the correlation matrix singular (determinant = 0), so it cannot be inverted to obtain the partial correlations used by KMO, and ln|R| in Bartlett's statistic is undefined. Neither condition is present here, which is consistent with the decision to exclude the 6 aggregate attributes in order to avoid the multicollinearity they would introduce.
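Both failure modes can be reproduced on a tiny example. The sketch below uses a hand-built 3 x 3 correlation matrix (hypothetical values) in which variables 1 and 3 are perfectly correlated; it shows that the determinant is zero and inversion fails, while the eigenvalues can still be computed:

```r
# Hand-built correlation matrix: variables 1 and 3 are perfectly correlated.
R_sing <- matrix(c(1.0, 0.4, 1.0,
                   0.4, 1.0, 0.4,
                   1.0, 0.4, 1.0), nrow = 3, byrow = TRUE)

det(R_sing)                                            # 0: the matrix is singular
eig_vals <- eigen(R_sing, symmetric = TRUE)$values     # computable; smallest is ~0
inv <- tryCatch(solve(R_sing), error = function(e) NULL)
is.null(inv)                                           # TRUE: inversion fails

# A constant column has SD = 0, so its Pearson correlation is undefined (NA)
r_const <- suppressWarnings(cor(c(5, 5, 5, 5), c(1, 2, 3, 4)))
is.na(r_const)
```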

3.2 Assumption 1 – Correlation Matrix

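Requirement: At least 30% of all unique variable pairs must show |r| > 0.3. If fewer than 30% reach this magnitude, the variables do not share enough common variance to justify extracting latent factors. Pairs labelled "significant" in the table below are those with |r| > 0.3; the criterion is the size of the correlation, not a hypothesis test.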

Pearson correlation formula: \[r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\]

Total unique pairs for p = 37 variables: \(\binom{37}{2} = \frac{37 \times 36}{2} = 666\) pairs.
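As a quick sanity check, the Pearson formula and the pair count can be verified directly. This is a minimal sketch on made-up vectors (x and y are hypothetical values, not taken from the dataset):

```r
# Toy vectors (hypothetical values, not taken from the dataset)
x <- c(55, 62, 70, 48, 81, 66)
y <- c(50, 65, 72, 45, 78, 60)

# Numerator and denominator terms of the Pearson formula
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

all.equal(r_manual, cor(x, y))   # TRUE: the formula matches cor()
choose(37, 2)                    # 666 unique pairs for p = 37
```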

mat_corr <- round(cor(data_pca, use = "complete.obs"), 3)

if (any(is.na(mat_corr))) {
  na_vars  <- names(which(colSums(is.na(mat_corr)) > 0))
  data_pca <- data_pca %>% select(-all_of(na_vars))
  mat_corr <- round(cor(data_pca, use = "complete.obs"), 3)
}

n_var   <- ncol(data_pca)
n_pairs <- n_var * (n_var - 1) / 2
mat_abs <- abs(mat_corr); diag(mat_abs) <- 0
n_sig   <- sum(mat_abs > 0.3) / 2
pct_sig <- round(n_sig / n_pairs * 100, 1)

data.frame(
  Metric = c("Total variables (p)",
             "Total unique pairs [p(p-1)/2]",
             "Pairs with |r| > 0.3",
             "Percentage significant",
             "Minimum requirement",
             "Decision"),
  Value  = c(n_var, n_pairs, n_sig,
             paste0(pct_sig, "%"),
             "> 30%",
             "PASS -- sufficient shared variance exists for PCA/FA")
) %>%
  kable(caption = "Table 4. Correlation Matrix Assessment") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(6, bold = TRUE, color = "darkgreen")
Table 4. Correlation Matrix Assessment
Metric Value
Total variables (p) 37
Total unique pairs [p(p-1)/2] 666
Pairs with |r| > 0.3 318
Percentage significant 47.7%
Minimum requirement > 30%
Decision PASS – sufficient shared variance exists for PCA/FA
corrplot(
  mat_corr,
  method = "color", type = "upper",
  tl.cex = 0.65, tl.col = "black",
  col    = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
  title  = "FIFA 23 Player Attribute Correlation Matrix (37 Variables)",
  mar    = c(0, 0, 2, 0)
)
Figure 2. Correlation Matrix Heatmap. Red = positive correlation, Blue = negative correlation. Intensity reflects magnitude of |r|.

Technical Interpretation – Correlation Structure:

Quantitative result: 318 out of 666 unique pairs (47.7%) exceed |r| = 0.3, which is 17.7 percentage points above the minimum threshold. This confirms substantial shared variance among the 37 attributes and provides the statistical foundation for latent factor extraction.

Heatmap block structure: The correlation matrix reveals a well-defined block structure. The upper-left block (attacking and skill variables) shows predominantly deep red cells, indicating high positive intercorrelations. For example, skill_dribbling and skill_ball_control (r ~= 0.83), power_shot_power and power_long_shots (r ~= 0.78), and mentality_vision and attacking_short_passing (r ~= 0.72) cluster tightly together. This entire block will constitute RC1 (Technical and Attacking Ability) in the rotated FA solution.

The lower-right cluster (defending_standing_tackle, defending_sliding_tackle, defending_marking_awareness, mentality_interceptions) shows extremely high intercorrelations ranging from 0.88 to 0.95. These four variables measure essentially the same underlying defensive competency from slightly different angles, which is why they will produce near-perfect loadings (0.93 to 0.944) on RC2 in VARIMAX.

Cross-block negative correlations (blue cells connecting the attacking cluster to the defending cluster, r approximately -0.40 to -0.60) are not statistical artifacts. They reflect a deliberate structural feature of FIFA 23’s design: the game assigns attacking specialists systematically low defending values and vice versa, embedding an inherent bipolar specialization axis into the data that PCA will capture as the contrast between PC1 and PC2.

Physical variables (height_cm, weight_kg, power_strength) form a smaller positive cluster among themselves, with near-zero or negative correlations against movement attributes (agility, balance, sprint_speed), consistent with the real biomechanical trade-off between mass and mobility.

3.3 Assumption 2 – KMO / Measure of Sampling Adequacy

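KMO (Kaiser-Meyer-Olkin) compares the size of the observed correlations with the size of the partial correlations, indicating how far the intercorrelations can be attributed to common underlying factors rather than to isolated pairwise relationships. Mathematically: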

\[KMO = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} a_{ij}^2}\]

where \(r_{ij}\) is the observed correlation and \(a_{ij}\) is the partial correlation between variables i and j controlling for all others. A high KMO means partial correlations are small relative to observed correlations, confirming that a common factor structure drives the intercorrelations rather than unique pairwise relationships.
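The formula can be sketched in base R by taking the partial correlations from the inverse of the correlation matrix. This is an illustrative implementation under that assumption, not the code of psych::KMO(); kmo_sketch() is a hypothetical helper, and the 4 x 4 equicorrelation matrix is a made-up example:

```r
# Illustrative KMO from a correlation matrix R (hypothetical helper).
kmo_sketch <- function(R) {
  R_inv <- solve(R)
  # Partial correlation of i and j given all other variables:
  # a_ij = -R_inv[i, j] / sqrt(R_inv[i, i] * R_inv[j, j])
  A <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
  diag(R) <- 0   # keep only the i != j terms
  diag(A) <- 0
  r2 <- R^2
  a2 <- A^2
  list(
    overall = sum(r2) / (sum(r2) + sum(a2)),              # overall KMO
    MSAi    = colSums(r2) / (colSums(r2) + colSums(a2))   # per-variable MSA
  )
}

# Made-up example: 4 variables, every pair correlated at 0.5
R_toy <- matrix(0.5, 4, 4)
diag(R_toy) <- 1
res <- kmo_sketch(R_toy)
res$overall   # 0.8: each partial correlation is 0.25, so 0.25 / (0.25 + 0.0625)
res$MSAi
```

Because KMO is a ratio of squared correlations to squared correlations plus squared partial correlations, it rises as the partial correlations shrink relative to the observed ones.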

Kaiser (1974) classification:

KMO Value Classification
>= 0.90 Marvelous
>= 0.80 Meritorious
>= 0.70 Middling
>= 0.60 Mediocre
>= 0.50 Miserable
< 0.50 Unacceptable (remove variable)

Procedure: Variables with individual MSAi < 0.50 are removed one at a time (lowest MSAi first), recalculating KMO after each removal until all MSAi >= 0.50.

kmo_res <- KMO(mat_corr)
msa_val <- round(kmo_res$MSA, 3)
kmo_kat <- ifelse(msa_val >= 0.90, "Marvelous",
           ifelse(msa_val >= 0.80, "Meritorious",
           ifelse(msa_val >= 0.70, "Middling",
           ifelse(msa_val >= 0.60, "Mediocre",
           ifelse(msa_val >= 0.50, "Miserable", "Unacceptable")))))

cat("Overall KMO/MSA:", msa_val, "--", kmo_kat, "\n")
## Overall KMO/MSA: 0.942 -- Marvelous
msa_df <- data.frame(
  Variable = names(kmo_res$MSAi),
  MSA      = round(kmo_res$MSAi, 3),
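  # Same Kaiser (1974) labels as the overall KMO classification above
  Category = ifelse(kmo_res$MSAi >= 0.90, "Marvelous",
             ifelse(kmo_res$MSAi >= 0.80, "Meritorious",
             ifelse(kmo_res$MSAi >= 0.70, "Middling",
             ifelse(kmo_res$MSAi >= 0.60, "Mediocre",
             ifelse(kmo_res$MSAi >= 0.50, "Miserable", "Drop (< 0.50)")))))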
) %>% arrange(MSA)

msa_df %>%
  kable(caption = "Table 5. Individual MSA Values per Variable (sorted ascending)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(msa_df$MSA >= 0.90, "#1B5E20",
                      ifelse(msa_df$MSA >= 0.80, "#2E7D32",
                      ifelse(msa_df$MSA >= 0.70, "#558B2F",
                      ifelse(msa_df$MSA >= 0.50, "#F57F17", "red")))))
Table 5. Individual MSA Values per Variable (sorted ascending)
Variable MSA Category
age age 0.708 Middling
potential potential 0.766 Middling
movement_sprint_speed movement_sprint_speed 0.849 Meritorious
movement_acceleration movement_acceleration 0.874 Meritorious
power_jumping power_jumping 0.879 Meritorious
overall overall 0.881 Meritorious
defending_standing_tackle defending_standing_tackle 0.889 Meritorious
defending_sliding_tackle defending_sliding_tackle 0.899 Meritorious
height_cm height_cm 0.900 Marvelous
weight_kg weight_kg 0.913 Marvelous
power_strength power_strength 0.919 Marvelous
attacking_heading_accuracy attacking_heading_accuracy 0.921 Marvelous
skill_long_passing skill_long_passing 0.938 Marvelous
movement_balance movement_balance 0.940 Marvelous
power_stamina power_stamina 0.943 Marvelous
attacking_short_passing attacking_short_passing 0.945 Marvelous
mentality_interceptions mentality_interceptions 0.946 Marvelous
skill_fk_accuracy skill_fk_accuracy 0.953 Marvelous
power_long_shots power_long_shots 0.957 Marvelous
attacking_crossing attacking_crossing 0.958 Marvelous
attacking_finishing attacking_finishing 0.960 Marvelous
mentality_positioning mentality_positioning 0.963 Marvelous
movement_reactions movement_reactions 0.964 Marvelous
defending_marking_awareness defending_marking_awareness 0.965 Marvelous
skill_curve skill_curve 0.966 Marvelous
skill_ball_control skill_ball_control 0.966 Marvelous
movement_agility movement_agility 0.968 Marvelous
international_reputation international_reputation 0.968 Marvelous
power_shot_power power_shot_power 0.969 Marvelous
mentality_aggression mentality_aggression 0.969 Marvelous
skill_dribbling skill_dribbling 0.970 Marvelous
mentality_penalties mentality_penalties 0.970 Marvelous
mentality_vision mentality_vision 0.976 Marvelous
attacking_volleys attacking_volleys 0.980 Marvelous
mentality_composure mentality_composure 0.984 Marvelous
skill_moves skill_moves 0.986 Marvelous
weak_foot weak_foot 0.989 Marvelous
drop_log <- c()
data_ok  <- data_pca
iter     <- 0
max_iter <- ncol(data_pca) - 5

cat("--- Iterative Variable Removal Check ---\n")
## --- Iterative Variable Removal Check ---
repeat {
  iter <- iter + 1
  if (iter > max_iter) { cat("Max iterations reached.\n"); break }

  mc <- tryCatch(round(cor(data_ok, use = "complete.obs"), 3),
                 error = function(e) NULL)
  if (is.null(mc)) { cat("Singular matrix: stopping.\n"); break }

  det_val <- tryCatch(det(mc), error = function(e) NA)
  if (is.na(det_val) || det_val < 1e-15) {
    cat("Near-singular matrix (det < 1e-15): stopping.\n")
    cat("Note: this is expected for a 37-variable matrix with KMO = 0.942.",
        "The near-zero determinant is caused by high intercorrelations,",
        "not by data quality problems.\n")
    break
  }

  kmo_tmp <- tryCatch(KMO(mc), error = function(e) NULL)
  if (is.null(kmo_tmp)) { cat("KMO failed: stopping.\n"); break }

  msa_clean <- kmo_tmp$MSAi[!is.na(kmo_tmp$MSAi)]
  if (length(msa_clean) == 0) break

  min_msa <- min(msa_clean)
  min_var <- names(which.min(msa_clean))
  if (min_msa >= 0.5) {
    cat("All variables have MSA >= 0.50. No removal needed.\n")
    break
  }

  cat(sprintf("Dropping '%s' (MSA = %.3f)\n", min_var, min_msa))
  drop_log <- c(drop_log, min_var)
  data_ok  <- data_ok %>% select(-all_of(min_var))
}
## Near-singular matrix (det < 1e-15): stopping.
## Note: this is expected for a 37-variable matrix with KMO = 0.942. The near-zero determinant is caused by high intercorrelations, not by data quality problems.
mat_corr_ok <- round(cor(data_ok, use = "complete.obs"), 3)
kmo_final   <- KMO(mat_corr_ok)
data_final  <- data_ok

final_kmo_kat <- ifelse(kmo_final$MSA >= 0.90, "Marvelous",
                 ifelse(kmo_final$MSA >= 0.80, "Meritorious",
                 ifelse(kmo_final$MSA >= 0.70, "Middling",
                 ifelse(kmo_final$MSA >= 0.60, "Mediocre",
                 ifelse(kmo_final$MSA >= 0.50, "Miserable", "Unacceptable")))))

data.frame(
  Metric = c("Initial Variables",
             "Variables Removed (MSA < 0.50)",
             "Final Variables",
             "Final Overall KMO/MSA",
             "Classification",
             "Decision"),
  Value  = c(ncol(data_pca),
             length(drop_log),
             ncol(data_final),
             round(kmo_final$MSA, 3),
             final_kmo_kat,
             "PASS -- all variables have MSA >= 0.50, proceed with PCA")
) %>%
  kable(caption = "Table 6. KMO/MSA Final Summary After Iterative Removal Check") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(6, bold = TRUE, color = "darkgreen")
Table 6. KMO/MSA Final Summary After Iterative Removal Check
Metric Value
Initial Variables 37
Variables Removed (MSA < 0.50) 0
Final Variables 37
Final Overall KMO/MSA 0.942
Classification Marvelous
Decision PASS – all variables have MSA >= 0.50, proceed with PCA

Technical Interpretation – KMO:

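Overall KMO = 0.942 (Marvelous) means that the squared partial correlations are very small relative to the squared observed correlations: once the other variables are controlled for, little pairwise association remains. KMO is a ratio of correlation magnitudes, not a share of variance, so it should not be read as "94.2% common variance"; the share of variance captured by the retained components is reported separately (75.04%). A value this high indicates that the data has a strong and recoverable latent structure.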

Individual MSA values range from 0.708 (age, Middling) to 0.989 (weak_foot, Marvelous). Age has the lowest MSA, plausibly because it relates to ability attributes only indirectly and non-linearly: young players may have high potential but modest current ratings, while older players show the opposite pattern. After controlling for all other variables, age retains non-trivial partial correlations, modestly reducing its MSA. Despite this, 0.708 comfortably clears the 0.50 threshold.

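The near-singular matrix message in the iterative loop comes from the loop's determinant guard, not from a data problem. With 37 strongly intercorrelated variables the determinant of the correlation matrix is tiny (the Bartlett statistic reported in Table 7 implies ln|R| of about -43.8, i.e. |R| on the order of 1e-19), so the det < 1e-15 guard fires on the first iteration and the loop exits before it evaluates any MSA value. A determinant this small is expected for a large correlation matrix because it is the product of 37 eigenvalues, most of which are below 1; it does not by itself mean the matrix cannot be inverted, and KMO() had already run successfully on the same matrix (Table 5). The conclusion that no variable needs removal therefore rests on Table 5, where the smallest MSAi is 0.708: zero variables were removed and the final dataset is identical to the input.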
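One way to check whether a small determinant actually signals numerical trouble is to look at the smallest eigenvalue or the reciprocal condition number instead. The sketch below uses a hypothetical 37 x 37 equicorrelation matrix (every pair at r = 0.7), not the FIFA correlation matrix:

```r
# Hypothetical 37 x 37 equicorrelation matrix, for illustration only.
p    <- 37
rho  <- 0.7
R_eq <- matrix(rho, p, p)
diag(R_eq) <- 1

det(R_eq) < 1e-15                           # TRUE: the determinant is tiny
ev <- eigen(R_eq, symmetric = TRUE)$values
min(ev)                                     # about 0.3, far from zero
max(ev) / min(ev)                           # modest condition number
rcond(R_eq)                                 # well above machine epsilon
R_inv <- solve(R_eq)                        # inversion succeeds
max(abs(R_eq %*% R_inv - diag(p)))          # product is numerically the identity
```

A guard based on rcond() or on the smallest eigenvalue would therefore distinguish a genuinely singular matrix from one whose determinant is small only because it is the product of many eigenvalues below 1.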

3.4 Assumption 3 – Bartlett’s Test of Sphericity

Bartlett’s Test formally tests H0: the population correlation matrix equals the identity matrix (R = I), meaning all off-diagonal correlations are exactly zero. If H0 were true, no shared variance would exist and PCA/FA would be meaningless.

Hypotheses:

  • H0: R = I (all variables are uncorrelated in the population)
  • H1: R != I (at least some variables are correlated)

Test statistic: \[\chi^2 = -\left[(n-1) - \frac{2p+5}{6}\right] \ln|R|\]

where n = sample size, p = number of variables, |R| = determinant of the correlation matrix. The term (2p+5)/6 is a bias correction for small samples. Degrees of freedom = p(p-1)/2.
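
As a minimal sketch of the formula above, the helper below computes the statistic directly from a correlation matrix. `bartlett_sphericity` is a hypothetical name introduced for illustration and the inputs are simulated; the report itself uses `cortest.bartlett()`.

```r
# Hypothetical helper implementing the Bartlett statistic from the formula above.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  # Work with log|R| directly; avoids underflow when the determinant is tiny
  ln_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  chisq  <- -((n - 1) - (2 * p + 5) / 6) * ln_det
  df     <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

set.seed(1)
X_ind   <- matrix(rnorm(300 * 5), ncol = 5)            # simulated independent columns
res_ind <- bartlett_sphericity(cor(X_ind), n = 300)    # small statistic expected
res_id  <- bartlett_sphericity(diag(5), n = 300)       # exact identity: statistic is 0
```

Computing the log-determinant rather than the determinant itself matters for data like this report's, where |R| is far too small to take a stable logarithm of after the fact.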

Decision rule: Reject H0 if p-value < 0.05.

Note: Bartlett is computed on mat_corr_ok (the KMO-cleaned matrix) and data_final (n after KMO cleaning) to ensure consistency between the two tests.

n_obs    <- nrow(data_final)
bart_res <- cortest.bartlett(mat_corr_ok, n = n_obs, diag = TRUE)
p_val    <- bart_res$p.value
p_display <- ifelse(is.na(p_val) || p_val == 0,
                    "approx. 0 (< 2.2e-16, machine precision limit in R)",
                    format(p_val, scientific = TRUE))

data.frame(
  Statistic = c("Chi-square statistic",
                "Degrees of freedom [p(p-1)/2]",
                "p-value",
                "Significance level (alpha)",
                "Sample size (n)",
                "Decision"),
  Value = c(
    formatC(bart_res$chisq, format = "f", digits = 4),
    bart_res$df,
    p_display,
    "0.05",
    n_obs,
    "REJECT H0 -- correlation matrix is significantly non-identity"
  )
) %>%
  kable(caption = "Table 7. Bartlett's Test of Sphericity Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(6, bold = TRUE, color = "darkgreen")
Table 7. Bartlett’s Test of Sphericity Results
Statistic Value
Chi-square statistic 218585.7995
Degrees of freedom [p(p-1)/2] 666
p-value approx. 0 (< 2.2e-16, machine precision limit in R)
Significance level (alpha) 0.05
Sample size (n) 5000
Decision REJECT H0 – correlation matrix is significantly non-identity

Technical Interpretation – Bartlett’s Test:

The chi-squared statistic of 218,585.80 with 666 degrees of freedom is extraordinarily large. For context, the critical value at alpha = 0.05 with 666 df is approximately 727 (qchisq(0.95, 666) in R). The observed statistic exceeds this critical value by a factor of roughly 300, so the evidence against H0 is overwhelming by any standard.
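
The comparison with the critical value can be reproduced from the reported statistic alone. The snippet below is a small check using base R; the input values are copied from Table 7.

```r
chisq_obs <- 218585.7995          # reported Bartlett statistic (Table 7)
df        <- 37 * 36 / 2          # p(p-1)/2 with p = 37
crit_05   <- qchisq(0.95, df)     # upper 5% critical value
ratio     <- chisq_obs / crit_05  # how many times the statistic exceeds it
c(df = df, critical = crit_05, ratio = ratio)
```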

Why is the statistic so large? The Bartlett formula includes the term ln|R|. When variables are highly intercorrelated, the determinant of the correlation matrix approaches zero, making ln|R| a large negative number. The leading minus sign and the multiplier (n - 1) - (2p + 5)/6 = 4,999 - 13.17 = 4,985.83 turn this into an enormous positive chi-squared value; the reported statistic implies ln|R| of about -43.84. For fixed n and p, a larger Bartlett statistic therefore indicates a stronger intercorrelation structure.
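
The arithmetic can be run backwards to see what determinant the reported statistic implies. This is a worked check under the assumption that the statistic follows the formula given earlier; all inputs come from Table 7.

```r
n <- 5000; p <- 37
chisq_obs  <- 218585.7995
multiplier <- (n - 1) - (2 * p + 5) / 6   # sample-size term in the formula
ln_det_R   <- -chisq_obs / multiplier     # implied ln|R|
det_R      <- exp(ln_det_R)               # implied determinant of R
c(multiplier = multiplier, ln_det_R = ln_det_R, det_R = det_R)
```

The implied determinant is many orders of magnitude below 1e-15, which is consistent with the near-singular matrix message discussed under the KMO results.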

The p-value appears as exactly 0 in R because chi-sq = 218,585.80 lies so far into the upper tail of the chi-squared distribution with 666 df that the tail probability is smaller than any positive double-precision number and underflows to 0. The "< 2.2e-16" label in Table 7 refers to machine epsilon (.Machine$double.eps), the conventional reporting floor for p-values in R; it is not the smallest number R can represent. This is not a computational error: the true p-value is positive but indistinguishable from zero at any practical precision.
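
The underflow can be seen directly in base R. The sketch below evaluates the tail probability on the ordinary scale and on the log scale, and prints the two machine constants that are easy to confuse; it uses only the reported statistic and degrees of freedom.

```r
chisq_obs <- 218585.7995
df        <- 666

p_plain <- pchisq(chisq_obs, df, lower.tail = FALSE)                # underflows to 0
p_log   <- pchisq(chisq_obs, df, lower.tail = FALSE, log.p = TRUE)  # natural log of p

eps  <- .Machine$double.eps    # machine epsilon: relative precision of doubles
xmin <- .Machine$double.xmin   # smallest normalized positive double

c(p_plain = p_plain, log_p = p_log, eps = eps, xmin = xmin)
```

Working on the log scale keeps the result finite, which is a convenient way to report or compare p-values that are too small to store directly.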

Practical conclusion: H0 is rejected decisively. The 37 FIFA 23 sub-attributes do not vary independently of one another; they share systematic variance that a smaller number of latent ability dimensions can plausibly account for. On this criterion, PCA and FA are statistically justified.

3.5 Assumption Summary

data.frame(
  Assumption  = c("1. Correlation Matrix",
                  "2. KMO / MSA",
                  "3. Bartlett's Test"),
  Result      = c("318 / 666 pairs (47.7%) with |r| > 0.3",
                  "Overall KMO = 0.942 (Marvelous); min MSAi = 0.708 (age)",
                  "chi-sq = 218,585.80; df = 666; p ~= 0"),
  Requirement = c("> 30% of pairs",
                  "Overall and all MSAi >= 0.50",
                  "p < 0.05"),
  Status      = c("PASS", "PASS", "PASS")
) %>%
  kable(caption = "Table 8. Summary -- All Three PCA/FA Assumptions Satisfied") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(4, bold = TRUE, color = "darkgreen") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Table 8. Summary – All Three PCA/FA Assumptions Satisfied
Assumption Result Requirement Status
1. Correlation Matrix 318 / 666 pairs (47.7%) with |r| > 0.3 > 30% of pairs PASS
2. KMO / MSA Overall KMO = 0.942 (Marvelous); min MSAi = 0.708 (age) Overall and all MSAi >= 0.50 PASS
3. Bartlett’s Test chi-sq = 218,585.80; df = 666; p ~= 0 p < 0.05 PASS

4 C. Principal Component Analysis (PCA)

4.1 Objective and Design

Two analytical goals of PCA in this study:

  1. Latent Structure Identification – determine whether the 37 observed attributes are manifestations of fewer underlying latent ability dimensions (e.g., “offensive skill”, “physical build”, “defensive competence”)

  2. Dimensionality Reduction – compress the 37-variable space into k orthogonal principal components that collectively retain the majority of total variance, enabling more parsimonious player profiling

data.frame(
  Criterion   = c("Number of Variables (p)", "Number of Observations (n)",
                  "Obs/Variable Ratio (n/p)", "Variable Type",
                  "Analysis Type", "Missing Values"),
  Value       = c(ncol(data_pca), nrow(data_pca),
                  paste0(round(nrow(data_pca) / ncol(data_pca), 1), " : 1"),
                  "All numeric (continuous or discrete ordinal)",
                  "R-type (correlations among variables)",
                  "None (na.omit applied prior to sampling)"),
  Requirement = c(">= 10", ">= 50 (100+ preferred)", ">= 5 : 1",
                  "Required for Pearson correlation", "Standard for PCA",
                  "Must be handled"),
  Status      = rep("PASS", 6)
) %>%
  kable(caption = "Table 9. PCA Design Criteria Checklist") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(4, bold = TRUE, color = "darkgreen") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Table 9. PCA Design Criteria Checklist
Criterion Value Requirement Status
Number of Variables (p) 37 >= 10 PASS
Number of Observations (n) 5000 >= 50 (100+ preferred) PASS
Obs/Variable Ratio (n/p) 135.1 : 1 >= 5 : 1 PASS
Variable Type All numeric (continuous or discrete ordinal) Required for Pearson correlation PASS
Analysis Type R-type (correlations among variables) Standard for PCA PASS
Missing Values None (na.omit applied prior to sampling) Must be handled PASS

The n/p ratio of 135.1:1 exceeds even the demanding 100:1 benchmark cited in Hair et al. (2019) as “excellent.” A ratio this large supports stable correlation estimates with narrow confidence intervals and reduces the risk that the eigendecomposition fits sample-specific noise. It makes a replicable factor structure more likely but does not by itself guarantee one; replication is checked empirically by the split-sample validation results in Section D.

4.2 Running PCA

pc       <- prcomp(data_final, scale. = TRUE, center = TRUE)
eig_val  <- pc$sdev^2
prop_var <- eig_val / sum(eig_val)
cum_var  <- cumsum(prop_var)

4.3 Component Retention

Three criteria are compared when deciding how many components to retain:

  1. Kaiser’s Rule (Latent Root): retain PC_i if eigenvalue lambda_i > 1.0. Rationale: a component must explain more variance than a single standardized variable (which has variance = 1 by definition) to be worth retaining.

  2. Cumulative Variance Criterion: retain until cumulative explained variance >= 60% to 70%. This ensures the retained solution adequately represents the original data.

  3. Scree Test: identify the elbow in the eigenvalue plot where the curve flattens. Components before the elbow carry systematic variance; components after it carry mostly noise.

Agreement across all three criteria strengthens the retention decision.

n_kaiser <- sum(eig_val > 1)
n_varpc  <- which(cum_var >= 0.70)[1]
n_comp   <- n_kaiser

eig_df <- data.frame(
  Component  = paste0("PC", 1:length(eig_val)),
  Eigenvalue = round(eig_val, 4),
  Variance   = paste0(round(prop_var * 100, 2), "%"),
  Cumulative = paste0(round(cum_var  * 100, 2), "%"),
  Decision   = ifelse(eig_val > 1, "Retain", "Drop")
)

head(eig_df, 15) %>%
  kable(caption = "Table 10. Eigenvalue Table -- Top 15 Components",
        col.names = c("Component", "Eigenvalue", "Variance (%)",
                      "Cumulative (%)", "Decision")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(1:5, background = "#EBF5FB") %>%
  column_spec(5, bold = TRUE,
              color = ifelse(head(eig_df, 15)$Decision == "Retain",
                             "darkgreen", "red"))
Table 10. Eigenvalue Table – Top 15 Components
Component Eigenvalue Variance (%) Cumulative (%) Decision
PC1 13.2199 35.73% 35.73% Retain
PC2 7.5333 20.36% 56.09% Retain
PC3 3.6463 9.85% 65.94% Retain
PC4 1.9089 5.16% 71.1% Retain
PC5 1.4580 3.94% 75.04% Retain
PC6 0.9733 2.63% 77.67% Drop
PC7 0.8720 2.36% 80.03% Drop
PC8 0.7915 2.14% 82.17% Drop
PC9 0.6164 1.67% 83.84% Drop
PC10 0.5420 1.46% 85.3% Drop
PC11 0.4600 1.24% 86.54% Drop
PC12 0.4582 1.24% 87.78% Drop
PC13 0.4165 1.13% 88.91% Drop
PC14 0.3902 1.05% 89.96% Drop
PC15 0.3420 0.92% 90.89% Drop
data.frame(
  Criterion = c("Kaiser's Rule (eigenvalue > 1)",
                "Cumulative Variance (>= 70%)",
                "Scree Test (visual elbow)"),
  Components = c(n_kaiser, n_varpc,
                 "3 to 5 (elbow visible between PC3 and PC4)"),
  Cumulative = c(
    paste0(round(cum_var[n_kaiser] * 100, 2), "%"),
    paste0(round(cum_var[n_varpc]  * 100, 2), "%"),
    paste0(round(cum_var[3] * 100, 2), "% to ",
           round(cum_var[5] * 100, 2), "%")
  ),
  Role = c("Primary (definitive)", "Supporting", "Supporting")
) %>%
  kable(caption = "Table 11. Component Retention Criteria Comparison",
        col.names = c("Criterion", "Components Retained",
                      "Cumulative Variance", "Role")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(1, bold = TRUE, background = "#EBF5FB") %>%
  column_spec(1, bold = TRUE)
Table 11. Component Retention Criteria Comparison
Criterion Components Retained Cumulative Variance Role
Kaiser’s Rule (eigenvalue > 1) 5 75.04% Primary (definitive)
Cumulative Variance (>= 70%) 4 71.1% Supporting
Scree Test (visual elbow) 3 to 5 (elbow visible between PC3 and PC4) 65.94% to 75.04% Supporting
fviz_eig(
  pc, ncp = 15, addlabels = TRUE,
  barfill  = "#3498DB", barcolor = "#2980B9", linecolor = "#E74C3C",
  main = "Scree Plot -- PCA FIFA 23 Player Attributes"
) +
  geom_hline(yintercept = 100 / ncol(data_final),
             linetype = "dashed", color = "#E74C3C", linewidth = 0.9) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
Figure 3. Scree Plot. Dashed red line marks eigenvalue = 1 (Kaiser threshold, equal to 100/37 = 2.70% of variance on the plotted scale). The elbow is visible between PC3 and PC4; Kaiser’s Rule retains two further components (PC4 and PC5), both with eigenvalues above 1.

Technical Interpretation – Component Retention:

Kaiser’s Rule retains 5 components: PC1 (eigenvalue = 13.22), PC2 (7.53), PC3 (3.65), PC4 (1.91), PC5 (1.46). PC6 (eigenvalue = 0.97) falls just below 1.0, and the 0.48 drop from PC5 to PC6 is much larger than the 0.10 drop from PC6 to PC7, which gives a reasonably well-defined boundary.

Variance decomposition: PC1 alone accounts for 35.73% of total variance – unusually dominant, indicating a single strong latent dimension (general technical and attacking quality) that differentiates players more powerfully than any other axis. PC1 and PC2 together explain 56.09%, meaning the attacking-defending specialization bipolar dimension (PC2) adds another 20.36%. The remaining three components (PC3 to PC5) contribute 9.85%, 5.16%, and 3.94%, capturing progressively narrower but theoretically meaningful dimensions (physical build, speed, and developmental trajectory).

Scree test: The plot shows a pronounced steep descent from PC1 to PC3, then a clear change in slope (elbow) between PC3 and PC4. Strictly interpreted, the scree elbow suggests retaining 3 components. However, Kaiser’s Rule extending to 5 is well justified because PC4 (eigenvalue = 1.91) and PC5 (eigenvalue = 1.46) each explain more variance than any single standardized variable, and they capture theoretically distinct dimensions (speed and career stage) that would be lost in a 3-component solution.

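Retention decision: the three criteria do not agree exactly. Kaiser’s Rule gives 5 components, the 70% cumulative-variance criterion gives 4, and the scree elbow gives 3. Kaiser’s Rule is treated as the primary criterion, so k = 5 components are retained, explaining 75.04% of total variance; this choice also satisfies the cumulative-variance requirement.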
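
The first two criteria can be re-derived from the published eigenvalues alone. The sketch below uses the leading eigenvalues copied from Table 10 and assumes p = 37 standardized variables, so that the full set of eigenvalues sums to 37; the variable names are introduced here for illustration.

```r
# Leading eigenvalues copied from Table 10 (PC1 to PC8).
eig_top <- c(13.2199, 7.5333, 3.6463, 1.9089, 1.4580, 0.9733, 0.8720, 0.7915)
p       <- 37
cum_top <- cumsum(eig_top) / p             # cumulative share of total variance

n_kaiser <- sum(eig_top > 1)               # Kaiser's Rule
n_var70  <- which(cum_top >= 0.70)[1]      # first k reaching 70% cumulative variance
drops    <- -diff(eig_top)                 # successive drops, a numeric view of the scree

c(kaiser = n_kaiser, var70 = n_var70)      # 5 and 4
round(drops, 2)
```

The drops shrink sharply after the third component, which is the numeric counterpart of the elbow seen in the scree plot.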

4.4 Component Loading Matrix

loadings_mat <- pc$rotation[, 1:n_comp] %*% diag(sqrt(eig_val[1:n_comp]))
colnames(loadings_mat) <- paste0("PC", 1:n_comp)
h2 <- rowSums(loadings_mat^2)

load_df <- as.data.frame(round(loadings_mat, 3))
load_df$h2 <- round(h2, 3)
load_df <- load_df %>% arrange(desc(h2))

load_df %>%
  kable(caption = "Table 12. Unrotated Loading Matrix with Communalities (h2). Sorted by h2 descending. L_ij = e_ij x sqrt(lambda_j) = correlation between variable i and PC j.",
        col.names = c(paste0("PC", 1:n_comp), "h2")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(n_comp + 2, bold = TRUE,
              color = ifelse(load_df$h2 >= 0.70, "darkgreen",
                      ifelse(load_df$h2 >= 0.50, "#E67E22", "red"))) %>%
  scroll_box(width = "100%", height = "450px")
Table 12. Unrotated Loading Matrix with Communalities (h2). Sorted by h2 descending. L_ij = e_ij x sqrt(lambda_j) = correlation between variable i and PC j.
PC1 PC2 PC3 PC4 PC5 h2
defending_standing_tackle 0.134 -0.821 -0.469 -0.106 -0.034 0.925
defending_sliding_tackle 0.175 -0.799 -0.491 -0.092 -0.024 0.919
mentality_interceptions 0.091 -0.836 -0.442 -0.110 0.023 0.915
defending_marking_awareness 0.108 -0.841 -0.416 -0.101 0.010 0.902
overall -0.713 -0.568 0.046 0.216 -0.085 0.887
attacking_finishing -0.812 0.307 0.341 -0.009 0.031 0.871
potential -0.473 -0.282 -0.069 0.391 -0.639 0.869
movement_acceleration -0.449 0.437 -0.386 0.561 0.084 0.862
skill_ball_control -0.896 -0.165 -0.004 0.054 -0.165 0.859
attacking_short_passing -0.756 -0.458 -0.113 -0.091 -0.189 0.838
skill_dribbling -0.896 0.117 -0.073 0.060 -0.105 0.837
power_long_shots -0.867 0.113 0.210 -0.144 0.077 0.835
mentality_vision -0.880 -0.042 -0.048 -0.180 -0.056 0.814
attacking_volleys -0.806 0.207 0.333 -0.060 0.051 0.809
movement_agility -0.650 0.394 -0.403 0.211 0.154 0.808
power_strength 0.099 -0.668 0.512 0.234 0.161 0.798
mentality_positioning -0.855 0.202 0.133 0.016 0.076 0.796
movement_sprint_speed -0.374 0.350 -0.278 0.665 0.083 0.790
skill_long_passing -0.631 -0.503 -0.236 -0.242 -0.149 0.788
age -0.304 -0.403 0.197 -0.298 0.630 0.780
movement_reactions -0.658 -0.563 0.072 0.121 -0.030 0.771
skill_curve -0.855 0.053 -0.054 -0.169 0.035 0.766
mentality_composure -0.753 -0.434 0.092 0.018 -0.049 0.766
power_shot_power -0.805 -0.041 0.312 -0.063 0.039 0.752
height_cm 0.315 -0.457 0.612 0.160 -0.207 0.751
movement_balance -0.503 0.394 -0.545 0.016 0.181 0.739
attacking_heading_accuracy -0.069 -0.640 0.489 0.268 0.075 0.732
skill_fk_accuracy -0.781 0.018 0.018 -0.340 0.077 0.732
mentality_penalties -0.717 0.192 0.398 -0.120 0.066 0.729
attacking_crossing -0.755 -0.056 -0.316 -0.081 0.056 0.682
weight_kg 0.200 -0.487 0.616 0.141 -0.041 0.678
mentality_aggression -0.120 -0.783 -0.071 0.087 0.175 0.670
power_stamina -0.383 -0.402 -0.236 0.318 0.330 0.574
power_jumping 0.030 -0.430 0.114 0.420 0.434 0.564
skill_moves -0.714 0.198 0.015 0.018 -0.061 0.553
international_reputation -0.378 -0.270 0.118 -0.066 -0.174 0.264
weak_foot -0.361 0.072 0.058 -0.023 -0.020 0.140

Technical Interpretation – Unrotated Loading Matrix:

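Each cell L_ij is the Pearson correlation between variable i and principal component j, ranging from -1 to +1. The overall sign of each component is arbitrary: multiplying an entire column by -1 gives an equivalent solution, so whether a column is mostly negative or mostly positive carries no substantive meaning. Relative signs within a column do matter: variables with opposite signs on the same component lie at opposite ends of that dimension.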
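
Both properties can be verified on a small example. The sketch below uses the built-in USArrests data rather than the FIFA sample, purely as an illustration: it checks that loadings computed as eigenvector times the square root of the eigenvalue equal the variable-to-score correlations, and that flipping the sign of a whole component leaves communalities unchanged.

```r
# Illustration on built-in data (not the FIFA sample).
pc_demo  <- prcomp(USArrests, scale. = TRUE, center = TRUE)
load_all <- pc_demo$rotation %*% diag(pc_demo$sdev)    # e_ij * sqrt(lambda_j)

# Loadings equal the correlations between each variable and each PC score
cor_demo <- cor(USArrests, pc_demo$x)
max_diff <- max(abs(load_all - cor_demo))              # numerically zero

# Flipping the sign of an entire component changes no communality
flipped      <- load_all
flipped[, 1] <- -flipped[, 1]
same_h2 <- all.equal(rowSums(load_all[, 1:2]^2), rowSums(flipped[, 1:2]^2))
c(max_diff = max_diff, same_h2 = isTRUE(same_h2))
```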

PC1 shows high absolute loadings (|L| > 0.70) on nearly all technical and attacking variables: skill_dribbling (-0.896), skill_ball_control (-0.896), mentality_vision (-0.880), power_long_shots (-0.867), skill_curve (-0.855), mentality_positioning (-0.855). All of these loadings are negative, which reflects only the arbitrary orientation of this eigenvector. PC1 represents the dimension of general technical mastery and attacking capability: it primarily separates high-ability technical players from low-ability or defensively specialized ones.

PC2 loads strongly on defending attributes: defending_marking_awareness (-0.841), mentality_interceptions (-0.836), defending_standing_tackle (-0.821), mentality_aggression (-0.783), while shooting-type attacking attributes load positively (attacking_finishing 0.307) or near zero. PC2 represents the defending specialization axis. Its scores are uncorrelated with PC1 scores by construction: knowing how technically skilled a player is (PC1 score) gives no linear information about their position on the defending axis (PC2 score).

PC3 captures physical build: height_cm (0.612), weight_kg (0.616), power_strength (0.512) load positively, while movement_agility (-0.403) and movement_balance (-0.545) load negatively. This component encodes the biomechanical trade-off between body mass and mobility.

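PC4 is dominated by speed: movement_sprint_speed (0.665) and movement_acceleration (0.561) have the largest loadings, followed by power_jumping (0.420) and potential (0.391); most other variables load below 0.30 in absolute value. Speed therefore emerges as a dimension of its own, uncorrelated with the technical-quality and physical-bulk components, which is in line with how speed is usually treated in sports science.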

PC5 captures the career development axis: potential (-0.639) and age (0.630) load in opposing directions, encoding how far a player’s current ability is from their projected ceiling.

Communalities show that 29 of 37 variables (78.4%) are well represented (h2 >= 0.70), and a further 6 fall between 0.50 and 0.70. The two poorly represented variables are weak_foot (h2 = 0.140) and international_reputation (h2 = 0.264). These attributes measure individual-specific traits (lateralization and commercial profile) that are only weakly related to the five retained dimensions. Their communalities will not improve with rotation, because an orthogonal rotation leaves every communality unchanged.
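
The communality bands can be tallied directly from the h2 column. The sketch below copies the 37 values from Table 12 in table order and counts how many fall in each of the three bands used for the colour coding; the object names are introduced here for illustration.

```r
# Communalities (h2) copied from Table 12, in table order.
h2_tab <- c(0.925, 0.919, 0.915, 0.902, 0.887, 0.871, 0.869, 0.862, 0.859,
            0.838, 0.837, 0.835, 0.814, 0.809, 0.808, 0.798, 0.796, 0.790,
            0.788, 0.780, 0.771, 0.766, 0.766, 0.752, 0.751, 0.739, 0.732,
            0.732, 0.729, 0.682, 0.678, 0.670, 0.574, 0.564, 0.553, 0.264,
            0.140)

# Left-closed bands: [0, 0.50), [0.50, 0.70), [0.70, 1)
bands <- cut(h2_tab, breaks = c(0, 0.50, 0.70, 1),
             labels = c("< 0.50", "0.50 to 0.70", ">= 0.70"), right = FALSE)
band_counts <- table(bands)
band_counts
round(100 * mean(h2_tab >= 0.70), 1)   # share of well-represented variables, in percent
```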

4.5 Biplot

fviz_pca_biplot(
  pc,
  axes      = c(1, 2),
  geom.ind  = "point",
  col.ind   = "steelblue",
  alpha.ind = 0.15,
  col.var   = "contrib",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel     = FALSE,
  label     = "var",
  title     = "Biplot PCA FIFA 23 (PC1 vs PC2)"
) + theme_minimal()
Figure 4. PCA Biplot (PC1 vs PC2). Arrows = variable loadings. Points = individual player scores. Arrow direction and length indicate correlation with the component axes.

Technical Interpretation – Biplot:

Arrows pointing in the same general direction indicate positively correlated variables that tend to load on the same factor. The attacking/skill arrows (skill_dribbling, skill_ball_control, power_long_shots, mentality_vision) all point toward the left of the PC1 axis, forming a tight bundle that reflects strong positive intercorrelation within this cluster. Given the loadings in Table 12, the defending arrows (defending_standing_tackle, defending_marking_awareness, mentality_interceptions) point along the negative PC2 direction with a small positive PC1 component, at roughly a right angle to the attacking arrows. This near-perpendicularity is the geometric counterpart of the separation between RC1 and RC2 in the FA solution: attacking ability and defensive ability behave as uncorrelated latent dimensions.

The player point cloud is concentrated near the origin (average players on both dimensions). Following the arrow directions, technically strong attacking players extend toward negative PC1 and defensive specialists toward negative PC2. Players with strongly negative scores on both components would be rare complete midfielders combining both dimensions at high levels.
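
The geometry of the arrows can be checked numerically from the first two loading columns. The sketch below takes the (PC1, PC2) loadings of one attacking and one defending variable from Table 12 and computes the angle between them and the direction of each arrow; `angle_deg` and `direction_deg` are hypothetical helpers written for this illustration.

```r
# (PC1, PC2) loadings copied from Table 12 for two representative variables.
dribbling <- c(-0.896,  0.117)   # skill_dribbling
tackle    <- c( 0.134, -0.821)   # defending_standing_tackle

# Angle between two arrows, in degrees
angle_deg <- function(a, b) {
  acos(sum(a * b) / sqrt(sum(a^2) * sum(b^2))) * 180 / pi
}
# Direction of an arrow: 0 = along +PC1, 90 = along +PC2, negative = below the PC1 axis
direction_deg <- function(v) atan2(v[2], v[1]) * 180 / pi

ang   <- angle_deg(dribbling, tackle)
dir_d <- direction_deg(dribbling)
dir_t <- direction_deg(tackle)
round(c(angle = ang, dribbling = dir_d, tackle = dir_t), 1)
```

An angle somewhat above 90 degrees corresponds to a weak negative association between the two variables within the PC1-PC2 plane.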

4.6 Variable Contribution

fviz_contrib(
  pc, choice = "var", axes = 1:2, top = 20,
  fill = "#3498DB", color = "#1A5276"
) +
  labs(title = "Top 20 Variable Contributions to PC1 and PC2 Combined") +
  theme_minimal()
Figure 5. Top 20 Variables Contributing to PC1 and PC2. The red dashed line marks the expected equal contribution level (100/37 = 2.7%).
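
For readers who want to see how such a contribution chart can be computed, the sketch below reproduces the usual definition on the built-in USArrests data: a variable's contribution to one component is its squared eigenvector entry in percent, and contributions to two components are combined with eigenvalue weights. That this matches what fviz_contrib() plots is an assumption based on the factoextra documentation, not something verified against this report's figure.

```r
# Illustration on built-in data (not the FIFA sample).
pc_demo <- prcomp(USArrests, scale. = TRUE)
eig     <- pc_demo$sdev^2
contrib <- 100 * pc_demo$rotation^2          # per-component contributions; columns sum to 100

# Combined contribution to PC1 and PC2, weighting each axis by its eigenvalue
contrib_12 <- (contrib[, 1] * eig[1] + contrib[, 2] * eig[2]) / (eig[1] + eig[2])
expected   <- 100 / ncol(USArrests)          # equal-contribution reference level

sort(contrib_12, decreasing = TRUE)
expected
```

Variables above the reference level contribute more than they would if all variables contributed equally; for the 37 variables in this report that level is 100/37, about 2.7%.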


5 D. Factor Analysis (FA)

5.1 VARIMAX Rotation

Factor Analysis with VARIMAX rotation builds on the PCA solution by applying an orthogonal rotation to the 5 component axes. The rotation is carried out with principal() from the psych package, so strictly speaking the rotated factors (RC1 to RC5) are rotated principal components; the common factor model below serves as the conceptual framing. The rotation maximizes the variance of squared loadings within each factor (the VARIMAX criterion), aiming for a “simple structure” where each variable loads highly on one factor and near-zero on all others. Total variance explained (75.04%) is unchanged by rotation; only the distribution across factors changes to improve interpretability.

The common factor model: \[X_j = \lambda_{j1}F_1 + \lambda_{j2}F_2 + \cdots + \lambda_{jk}F_k + \varepsilon_j\]

where \(F_i\) are unobservable latent factors, \(\lambda_{ji}\) are factor loadings, and \(\varepsilon_j\) is unique variance specific to variable j (not shared with any factor).
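
The claim that an orthogonal rotation redistributes variance without changing the total can be checked on a small example. The sketch below uses the built-in mtcars data and stats::varimax() instead of psych::principal(), so it illustrates the principle rather than reproducing this report's solution.

```r
# Illustration on built-in data with base R's varimax (not psych::principal).
pc_demo <- prcomp(mtcars, scale. = TRUE)
k       <- 3
L_unrot <- pc_demo$rotation[, 1:k] %*% diag(pc_demo$sdev[1:k])
L_rot   <- unclass(varimax(L_unrot, normalize = TRUE)$loadings)

# Communalities (row sums of squares) are unchanged by an orthogonal rotation
h2_same <- all.equal(rowSums(L_unrot^2), rowSums(L_rot^2))

# Variance per factor (column sums of squares) is redistributed; the total is preserved
ss_unrot <- colSums(L_unrot^2)
ss_rot   <- colSums(L_rot^2)
rbind(unrotated = ss_unrot, rotated = ss_rot)
c(total_unrot = sum(ss_unrot), total_rot = sum(ss_rot))
```

After rotation the first factor typically gives up some variance to the later ones, which is why per-factor shares from the unrotated eigenvalue table no longer apply to the rotated factors.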

Loading interpretation threshold (Hair et al., 2019):

Absolute Loading Interpretation
>= 0.70 Strongly significant
>= 0.50 Practically significant
>= 0.40 Acceptable (n >= 200)
< 0.40 Not considered dominant
fa_rot <- principal(
  data_final,
  nfactors = n_comp,
  rotate   = "varimax",
  scores   = TRUE
)

loadings_rot <- fa_rot$loadings[, 1:n_comp]
class(loadings_rot) <- "matrix"
h2_rot <- fa_rot$communality

load_rot_df <- data.frame(round(loadings_rot, 3), h2 = round(h2_rot, 3)) %>%
  arrange(desc(h2))

load_rot_df %>%
  kable(caption = "Table 13. VARIMAX Rotated Loading Matrix with Communalities (h2). Sorted by h2 descending.",
        col.names = c(paste0("RC", 1:n_comp), "h2")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(n_comp + 2, bold = TRUE,
              color = ifelse(load_rot_df$h2 >= 0.70, "darkgreen",
                      ifelse(load_rot_df$h2 >= 0.50, "#E67E22", "red"))) %>%
  scroll_box(width = "100%", height = "450px")
Table 13. VARIMAX Rotated Loading Matrix with Communalities (h2). Sorted by h2 descending.
RC1 RC2 RC3 RC4 RC5 h2
defending_standing_tackle -0.139 0.944 0.084 -0.081 0.027 0.925
defending_sliding_tackle -0.187 0.935 0.066 -0.067 0.024 0.919
mentality_interceptions -0.094 0.944 0.107 -0.063 -0.030 0.915
defending_marking_awareness -0.105 0.931 0.134 -0.072 -0.019 0.902
overall 0.700 0.440 0.336 0.260 0.150 0.887
attacking_finishing 0.822 -0.417 -0.019 0.136 -0.044 0.871
potential 0.431 0.237 0.173 0.218 0.741 0.869
movement_acceleration 0.220 -0.190 -0.386 0.774 0.174 0.862
skill_ball_control 0.872 0.158 -0.006 0.196 0.191 0.859
attacking_short_passing 0.765 0.470 0.042 0.031 0.167 0.838
skill_dribbling 0.834 -0.036 -0.210 0.270 0.151 0.837
power_long_shots 0.885 -0.168 -0.062 0.070 -0.120 0.835
mentality_vision 0.875 0.107 -0.185 0.056 0.018 0.814
attacking_volleys 0.830 -0.324 0.014 0.090 -0.081 0.809
movement_agility 0.471 -0.101 -0.522 0.551 -0.002 0.808
power_strength -0.002 0.238 0.848 0.013 -0.150 0.798
mentality_positioning 0.824 -0.215 -0.111 0.234 -0.055 0.796
movement_sprint_speed 0.156 -0.192 -0.212 0.804 0.194 0.790
skill_long_passing 0.650 0.589 -0.062 -0.083 0.089 0.788
age 0.372 0.269 0.239 -0.057 -0.713 0.780
movement_reactions 0.666 0.430 0.326 0.178 0.063 0.771
skill_curve 0.835 0.031 -0.233 0.102 -0.061 0.766
mentality_composure 0.771 0.326 0.222 0.114 0.050 0.766
power_shot_power 0.844 -0.108 0.135 0.065 -0.075 0.752
height_cm -0.168 0.007 0.801 -0.239 0.158 0.751
movement_balance 0.336 -0.005 -0.678 0.401 -0.080 0.739
attacking_heading_accuracy 0.153 0.229 0.806 0.063 -0.051 0.732
skill_fk_accuracy 0.808 0.036 -0.210 -0.072 -0.168 0.732
mentality_penalties 0.769 -0.345 0.061 0.002 -0.125 0.729
attacking_crossing 0.679 0.254 -0.325 0.225 -0.026 0.682
weight_kg -0.062 0.036 0.802 -0.173 -0.001 0.678
mentality_aggression 0.130 0.680 0.399 0.106 -0.142 0.670
power_stamina 0.273 0.443 0.117 0.509 -0.176 0.574
power_jumping -0.079 0.249 0.482 0.423 -0.290 0.564
skill_moves 0.675 -0.152 -0.181 0.187 0.084 0.553
international_reputation 0.428 0.173 0.165 -0.085 0.130 0.264
weak_foot 0.358 -0.078 -0.050 0.053 0.015 0.140

Technical Interpretation – VARIMAX Rotated Loading Matrix:

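After VARIMAX rotation, most variables load mainly on a single factor with small loadings on the others, which approximates simple structure. Ten variables still have loadings of 0.40 or more in absolute value on two or three factors (for example skill_long_passing, movement_agility, and power_stamina), as listed in Table 14. The technical and attacking variables now load positively on RC1 (the all-negative orientation of unrotated PC1 disappears), making interpretation more straightforward; negative dominant loadings remain where a factor is bipolar, for example age on RC5 (-0.713) and movement_balance on RC3 (-0.678).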

RC1 – Technical and Attacking Ability: Dominant variables (|L| >= 0.80): power_long_shots (0.885), mentality_vision (0.875), skill_ball_control (0.872), power_shot_power (0.844), skill_curve (0.835), skill_dribbling (0.834), attacking_volleys (0.830), mentality_positioning (0.824), attacking_finishing (0.822), skill_fk_accuracy (0.808). All 10 variables load above 0.80 on RC1 and below 0.45 in absolute value on all other factors, demonstrating excellent simple structure within this cluster. RC1 represents the comprehensive offensive and technical toolkit: the ability to shoot with power and accuracy, dribble past opponents, place the ball precisely, create chances through vision, and position intelligently to receive passes and score goals. Players with very high RC1 scores are creative attacking players such as classic number 10s and technically gifted forwards.

RC2 – Defensive and Aggression: Dominant variables: mentality_interceptions (0.944), defending_standing_tackle (0.944), defending_sliding_tackle (0.935), defending_marking_awareness (0.931). These four loadings, between 0.931 and 0.944, are very high and indicate that the four attributes are effectively measuring the same underlying defensive competence construct from slightly different operational angles: winning the ball back through anticipation (interceptions), challenging in the tackle (standing and sliding), and reading the opponent’s movement (marking awareness). mentality_aggression (0.680) adds the competitive intensity and pressing quality dimension. The negative loading of attacking_finishing (-0.417) on RC2 reflects a systematic trade-off in the FIFA 23 ratings: players who are defensively excellent tend to have lower attacking finishing.

RC3 – Physical Strength and Aerial Ability: Dominant variables: power_strength (0.848), attacking_heading_accuracy (0.806), weight_kg (0.802), height_cm (0.801). The inclusion of attacking_heading_accuracy in this physical factor rather than RC1 is analytically informative: in this dataset, heading accuracy covaries more strongly with physical attributes (height, weight, strength, and to a lesser extent jumping) than with technical skill, which is why the rotation places it on RC3 rather than on the technical factor. The negative loadings of movement_agility (-0.522) and movement_balance (-0.678) on RC3 quantify the known biomechanical cost of large body mass: taller, heavier, stronger players trade off mobility and balance to gain physical dominance.

RC4 – Speed and Stamina: Dominant variables: movement_sprint_speed (0.804), movement_acceleration (0.774), movement_agility (0.551), power_stamina (0.509). RC4 isolates the “athletic engine” of a player: peak running speed, rate of acceleration to reach that speed, directional quickness, and aerobic endurance. The combination of sprint speed with stamina is practically coherent: in modern high-intensity football, a player who is fast but lacks stamina cannot sustain their speed advantage for 90 minutes. RC4 scores are uncorrelated with RC1 scores (technical skill): a technically average but explosively fast player scores high on RC4 regardless of their RC1 score. This separation is visible in the low RC1 loadings of movement_sprint_speed (0.156) and movement_acceleration (0.220).

RC5 – Youth Potential vs. Experience: Defining contrast: potential (0.741) versus age (-0.713). This factor does not describe what a player can do right now but rather where they are in their career development trajectory. A high RC5 score indicates a young player with a large gap between potential ceiling and current overall rating (developmental upside). A low or negative RC5 score indicates a veteran player whose current ability is at or near their potential ceiling (peak or post-peak career stage). RC5 is unique among the five in being a temporal dimension rather than a contemporaneous ability dimension.

5.2 Dominant Loadings per Factor

dom_list <- lapply(1:n_comp, function(f) {
  ld  <- loadings_rot[, f]
  dom <- sort(ld[abs(ld) >= 0.40], decreasing = TRUE)
  # row.names = NULL stops the variable names leaking in as (duplicated) row names
  data.frame(Factor    = paste0("RC", f),
             Variable  = names(dom),
             Loading   = round(dom, 3),
             row.names = NULL)
})
# Bind once and reuse, so the table and its colour vector stay in sync
dom_df <- do.call(rbind, dom_list)

dom_df %>%
  kable(caption = "Table 14. Dominant Loadings per Factor (|loading| >= 0.40, sorted descending within factor)",
        row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(abs(dom_df$Loading) >= 0.70,
                             "darkgreen", "#E67E22"))
Table 14. Dominant Loadings per Factor (|loading| >= 0.40, sorted descending within factor)
Factor Variable Loading
RC1 power_long_shots 0.885
RC1 mentality_vision 0.875
RC1 skill_ball_control 0.872
RC1 power_shot_power 0.844
RC1 skill_curve 0.835
RC1 skill_dribbling 0.834
RC1 attacking_volleys 0.830
RC1 mentality_positioning 0.824
RC1 attacking_finishing 0.822
RC1 skill_fk_accuracy 0.808
RC1 mentality_composure 0.771
RC1 mentality_penalties 0.769
RC1 attacking_short_passing 0.765
RC1 overall 0.700
RC1 attacking_crossing 0.679
RC1 skill_moves 0.675
RC1 movement_reactions 0.666
RC1 skill_long_passing 0.650
RC1 movement_agility 0.471
RC1 potential 0.431
RC1 international_reputation 0.428
RC2 defending_standing_tackle 0.944
RC2 mentality_interceptions 0.944
RC2 defending_sliding_tackle 0.935
RC2 defending_marking_awareness 0.931
RC2 mentality_aggression 0.680
RC2 skill_long_passing 0.589
RC2 attacking_short_passing 0.470
RC2 power_stamina 0.443
RC2 overall 0.440
RC2 movement_reactions 0.430
RC2 attacking_finishing -0.417
RC3 power_strength 0.848
RC3 attacking_heading_accuracy 0.806
RC3 weight_kg 0.802
RC3 height_cm 0.801
RC3 power_jumping 0.482
RC3 movement_agility -0.522
RC3 movement_balance -0.678
RC4 movement_sprint_speed 0.804
RC4 movement_acceleration 0.774
RC4 movement_agility 0.551
RC4 power_stamina 0.509
RC4 power_jumping 0.423
RC4 movement_balance 0.401
RC5 potential 0.741
RC5 age -0.713

5.3 VARIMAX Loading Heatmap

load_heat <- loadings_rot[, 1:n_comp]
corrplot(
  load_heat,
  is.corr   = FALSE,
  method    = "color",
  tl.cex    = 0.7, tl.col = "black",
  col       = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
  title     = "VARIMAX Loading Heatmap -- 5 Rotated Factors",
  mar       = c(0, 0, 2, 0),
  cl.lim    = c(-1, 1)
)
Figure 6. VARIMAX Loading Heatmap. Red = high positive loading, Blue = high negative loading. Each row should ideally show one dark cell (simple structure).


Technical Interpretation – VARIMAX Heatmap:

The heatmap is a visual test of simple structure quality. In perfect simple structure: every row (variable) has exactly one dark cell (one dominant factor), and every column (factor) has a clearly defined block of dark cells (a coherent variable cluster).

RC1 column shows a solid dark red block across the attacking and skill variables in the upper portion of the heatmap. Most of these rows are near-white in the other columns, which is close to simple structure for the technical/attacking cluster; the exceptions are the passing and finishing variables, which also reach |loading| >= 0.40 on RC2 (Table 14).

RC2 column shows a concentrated dark red block for the four defending variables at the lower portion, with much lighter cells for most non-defending variables. The separation of this block from the RC1 block shows that the defending variables form their own cluster. Note that VARIMAX is an orthogonal rotation, so RC1 and RC2 are uncorrelated by construction; the heatmap is informative about the loading pattern, not about factor correlations.

RC3 column shows red for the height/weight/strength cluster and blue for agility/balance variables, encoding the physical build bipolar dimension.

RC4 and RC5 show more diffuse coloring because speed and career stage are less tightly clustered constructs. This is expected: they are the two smallest rotated factors (roughly 7 to 8% and 4% of variance in the split-sample halves of Table 16) because they are narrower, more specific dimensions.

Cross-loading rows (variables with multiple colored cells): skill_long_passing shows moderate loadings on both RC1 (0.650) and RC2 (0.589), reflecting its dual role as a technical attribute used by both creative playmakers and defensive midfielders. movement_agility appears in RC1, RC3, and RC4, reflecting its multidimensional nature. These cross-loadings are not failures of the rotation but genuine reflections of the multifaceted nature of certain specific attributes.
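The cross-loading check described above can also be done numerically instead of by eye. The following is a minimal sketch, not the report's own code: `count_salient()` is a hypothetical helper, the salient values are copied from Table 14, and entries below the 0.40 cut-off (which Table 14 does not report) are set to 0 purely as placeholders.

```r
# Illustrative loading matrix. Salient values (|loading| >= 0.40) come from
# Table 14; sub-threshold entries are unknown here and set to 0 (assumption).
L <- matrix(
  c(0.650, 0.589,  0.000, 0.000, 0.000,   # skill_long_passing
    0.471, 0.000, -0.522, 0.551, 0.000,   # movement_agility
    0.000, 0.944,  0.000, 0.000, 0.000),  # defending_standing_tackle
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("skill_long_passing", "movement_agility", "defending_standing_tackle"),
    paste0("RC", 1:5)
  )
)

# Hypothetical helper: number of factors on which each variable loads saliently
count_salient <- function(loadings, cut = 0.40) {
  rowSums(abs(loadings) >= cut)
}

n_salient     <- count_salient(L)
cross_loaders <- names(n_salient)[n_salient > 1]

print(n_salient)      # salient factors per variable
print(cross_loaders)  # variables that depart from simple structure
```

With the report's objects, the same helper could presumably be applied to `loadings_rot[, 1:n_comp]` to list every cross-loading variable at once.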

5.4 Communality Analysis

comm_df <- data.frame(
  Variable       = names(h2_rot),
  Communality    = round(h2_rot, 3),
  Representation = ifelse(h2_rot >= 0.70, "Well explained (h2 >= 0.70)",
                   ifelse(h2_rot >= 0.50, "Adequately explained (0.50 <= h2 < 0.70)",
                   "Poorly explained (h2 < 0.50)"))
) %>% arrange(Communality)

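comm_df %>%
  # row.names = FALSE drops the duplicated row-name column, so that
  # column_spec(2) and column_spec(3) target Communality and Representation
  kable(caption = "Table 15. Variable Communalities -- Proportion of Variance Explained by 5-Factor Solution (sorted ascending)",
        row.names = FALSE) %>%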
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(comm_df$Communality >= 0.70, "darkgreen",
                      ifelse(comm_df$Communality >= 0.50, "#E67E22", "red")))
Table 15. Variable Communalities – Proportion of Variance Explained by 5-Factor Solution (sorted ascending)
Variable Communality Representation
weak_foot 0.140 Poorly explained (h2 < 0.50)
international_reputation 0.264 Poorly explained (h2 < 0.50)
skill_moves 0.553 Adequately explained (0.50 <= h2 < 0.70)
power_jumping 0.564 Adequately explained (0.50 <= h2 < 0.70)
power_stamina 0.574 Adequately explained (0.50 <= h2 < 0.70)
mentality_aggression 0.670 Adequately explained (0.50 <= h2 < 0.70)
weight_kg 0.678 Adequately explained (0.50 <= h2 < 0.70)
attacking_crossing 0.682 Adequately explained (0.50 <= h2 < 0.70)
mentality_penalties 0.729 Well explained (h2 >= 0.70)
attacking_heading_accuracy 0.732 Well explained (h2 >= 0.70)
skill_fk_accuracy 0.732 Well explained (h2 >= 0.70)
movement_balance 0.739 Well explained (h2 >= 0.70)
height_cm 0.751 Well explained (h2 >= 0.70)
power_shot_power 0.752 Well explained (h2 >= 0.70)
skill_curve 0.766 Well explained (h2 >= 0.70)
mentality_composure 0.766 Well explained (h2 >= 0.70)
movement_reactions 0.771 Well explained (h2 >= 0.70)
age 0.780 Well explained (h2 >= 0.70)
skill_long_passing 0.788 Well explained (h2 >= 0.70)
movement_sprint_speed 0.790 Well explained (h2 >= 0.70)
mentality_positioning 0.796 Well explained (h2 >= 0.70)
power_strength 0.798 Well explained (h2 >= 0.70)
movement_agility 0.808 Well explained (h2 >= 0.70)
attacking_volleys 0.809 Well explained (h2 >= 0.70)
mentality_vision 0.814 Well explained (h2 >= 0.70)
power_long_shots 0.835 Well explained (h2 >= 0.70)
skill_dribbling 0.837 Well explained (h2 >= 0.70)
attacking_short_passing 0.838 Well explained (h2 >= 0.70)
skill_ball_control 0.859 Well explained (h2 >= 0.70)
movement_acceleration 0.862 Well explained (h2 >= 0.70)
potential 0.869 Well explained (h2 >= 0.70)
attacking_finishing 0.871 Well explained (h2 >= 0.70)
overall 0.887 Well explained (h2 >= 0.70)
defending_marking_awareness 0.902 Well explained (h2 >= 0.70)
mentality_interceptions 0.915 Well explained (h2 >= 0.70)
defending_sliding_tackle 0.919 Well explained (h2 >= 0.70)
defending_standing_tackle 0.925 Well explained (h2 >= 0.70)

Technical Interpretation – Communalities:

Communality (h2) is the proportion of a variable’s total variance explained by the 5 retained factors. h2 = 1 means the variable is perfectly explained; h2 = 0 means it is completely unique.
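As a small worked illustration of this definition: for an orthogonal solution, a variable's communality is the sum of its squared loadings across the retained factors. The sketch below uses a hypothetical helper `communality()` and a made-up one-row loading matrix to show the computation, then uses two values reported in this section (the RC2 loading in Table 14 and h2 in Table 15 for defending_standing_tackle) to show how much of that communality a single factor contributes.

```r
# Hypothetical helper: communality = row-wise sum of squared loadings
communality <- function(loadings) rowSums(loadings^2)

# Made-up loadings for one variable on two factors (illustration only)
toy <- matrix(c(0.6, 0.8), nrow = 1,
              dimnames = list("toy_variable", c("F1", "F2")))
h2_toy <- communality(toy)   # 0.6^2 + 0.8^2
print(h2_toy)

# Reported values for defending_standing_tackle (Tables 14 and 15)
rc2_loading <- 0.944
h2_reported <- 0.925

# Share of the communality contributed by RC2 alone
rc2_share <- rc2_loading^2 / h2_reported
print(round(rc2_share, 3))
```

The squared RC2 loading is 0.891, so RC2 alone accounts for about 96% of this variable's communality; the remainder comes from its sub-threshold loadings on the other four factors.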

29 of 37 variables (78.4%) achieve h2 >= 0.70, a further 6 fall between 0.50 and 0.70, and only 2 fall below 0.50, indicating good overall coverage. The highest communalities belong to the defending cluster: defending_standing_tackle (0.925), defending_sliding_tackle (0.919), mentality_interceptions (0.915), defending_marking_awareness (0.902). These high values reflect the very strong intercorrelation within the defending cluster: most of their variance is captured by a single factor (RC2), leaving little unique residual.

weak_foot (h2 = 0.140): Only 14% of weak_foot variance is shared with the 5 factors. The weak foot rating (the ability of the non-dominant foot) is plausibly driven by individual anatomical and practice-history characteristics that are only weakly related to the five ability dimensions. A player can have skill_dribbling = 90 and weak_foot = 3, or skill_dribbling = 55 and weak_foot = 5. The relationship between overall technical quality and weak foot rating is weak, which is why the 5-factor model captures so little of it.

international_reputation (h2 = 0.264): Only 26.4% is explained. A plausible explanation is that reputation in FIFA 23 also reflects things such as historical peak performance, league visibility, and media prominence, which do not map directly onto any of the five dimensions. A veteran star may retain reputation = 4 as their actual attributes decline, and a technically strong but less visible player may hold reputation = 1 despite high ability scores.

These two variables are retained but carry minimal interpretive weight when discussing the factor structure. Their low communalities are substantively meaningful, not analytical failures.

5.5 Split-Sample Validation

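set.seed(42)
# Randomly assign half of the rows to subsample 1; the remaining rows form subsample 2
idx_split <- sample(seq_len(nrow(data_final)), floor(nrow(data_final) / 2))
data_s1   <- data_final[idx_split, ]
data_s2   <- data_final[-idx_split, ]

# Refit the rotated n_comp-component solution separately on each half
fa_s1 <- principal(data_s1, nfactors = n_comp, rotate = "varimax")
fa_s2 <- principal(data_s2, nfactors = n_comp, rotate = "varimax")

# Row 2 of Vaccounted is "Proportion Var" for each rotated component
var_s1 <- round(fa_s1$Vaccounted[2, 1:n_comp] * 100, 2)
var_s2 <- round(fa_s2$Vaccounted[2, 1:n_comp] * 100, 2)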

val_df <- data.frame(
  Factor     = paste0("RC", 1:n_comp),
  Sample_1   = var_s1,
  Sample_2   = var_s2,
  Difference = round(abs(var_s1 - var_s2), 2),
  Stable     = ifelse(abs(var_s1 - var_s2) < 5, "Stable", "Unstable")
)

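val_df %>%
  # row.names = FALSE drops the duplicated RC row-name column, so that
  # column_spec(4) and column_spec(5) target Difference and Stable
  kable(caption = "Table 16. Split-Sample Validation (n = 2,500 each, set.seed = 42)",
        row.names = FALSE,
        col.names = c("Factor", "Sample 1 Var (%)", "Sample 2 Var (%)",
                      "Difference (%)", "Stable (diff < 5%)?")) %>%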
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(4, bold = TRUE) %>%
  column_spec(5, bold = TRUE,
              color = ifelse(val_df$Stable == "Stable", "darkgreen", "red"))
Table 16. Split-Sample Validation (n = 2,500 each, set.seed = 42)
Factor Sample 1 Var (%) Sample 2 Var (%) Difference (%) Stable (diff < 5%)?
RC1 33.27 33.75 0.48 Stable
RC2 17.21 16.67 0.54 Stable
RC3 12.85 12.88 0.03 Stable
RC4 7.21 7.97 0.76 Stable
RC5 4.28 4.10 0.18 Stable
max_diff <- max(val_df$Difference)
cat(sprintf("\nMaximum variance difference across all factors: %.2f%%\n", max_diff))
## 
## Maximum variance difference across all factors: 0.76%
cat(ifelse(max_diff < 5,
           "STABLE: solution generalizes reliably beyond this specific sample.",
           "WARNING: solution may be unstable."))
## STABLE: solution generalizes reliably beyond this specific sample.

Technical Interpretation – Split-Sample Validation:

Split-sample validation tests whether the 5-factor structure is a stable property of the FIFA 23 player population or an artifact of the specific 5,000-player random sample. The procedure splits the data into two disjoint n = 2,500 subsamples (set.seed = 42 for reproducibility), refits the rotated solution separately on each half, and compares the proportion of variance explained by each factor. If the structure were overfitted, substantial variance differences would be expected between the two halves. Note that this check compares variance proportions only; it does not compare the loading patterns themselves.

Maximum difference = 0.76% (RC4). All five factors show differences of less than 1%, far below the 5% stability threshold. This result has several important implications:

RC1 (33.27% vs. 33.75%, diff = 0.48%): The dominant Technical and Attacking Ability factor is virtually identical in size across both halves. The close replication is consistent with the attacking/technical ability dimension being the strongest structural feature of the FIFA 23 player population.

RC2 (17.21% vs. 16.67%, diff = 0.54%): The defending dimension replicates equally well, consistent with the very high loadings (0.931 to 0.944) that define it.

RC3 (12.85% vs. 12.88%, diff = 0.03%): The most stable factor in the entire solution. One plausible reason is that physical measurements (height, weight, strength) are assigned more objectively in FIFA than technical ratings, which would make their distribution more consistent from one subsample to the next.

RC4 (7.21% vs. 7.97%, diff = 0.76%): The largest difference, plausibly reflecting slight sampling variability in the proportion of speed-specialist wingers and full-backs between the two halves.

RC5 (4.28% vs. 4.10%, diff = 0.18%): Despite being the most conceptually unusual factor (career stage rather than ability), it replicates with excellent stability.

Practical conclusion: The 5-factor VARIMAX solution appears robust. The variance accounted for by each factor is nearly identical in the two halves, which supports using the factor scores in downstream analyses (player clustering, position classification, market value prediction). Because the comparison covers variance proportions rather than loadings, a loading-level comparison between the two halves would strengthen this conclusion further.
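The variance comparison above could be complemented by comparing the loading patterns of the two halves directly. One common choice is Tucker's congruence coefficient, which is the cosine between two loading columns. The sketch below is a base R illustration with a hypothetical helper `congruence()` and made-up toy loadings; it is not part of the report's analysis and its numbers say nothing about the FIFA 23 solution.

```r
# Tucker's congruence coefficient between every pair of columns of two
# loading matrices that list the same variables in the same row order.
congruence <- function(A, B) {
  num <- crossprod(A, B)                            # sums of cross-products
  den <- sqrt(outer(colSums(A^2), colSums(B^2)))    # product of column norms
  num / den
}

# Toy loadings for 4 variables on 2 factors (made up for illustration)
A <- cbind(F1 = c(0.8, 0.7, 0.1, 0.0), F2 = c(0.1, 0.0, 0.9, 0.8))
B <- cbind(F1 = c(0.7, 0.8, 0.0, 0.1), F2 = c(0.0, 0.1, 0.8, 0.9))

phi <- congruence(A, B)
print(round(phi, 3))   # diagonal: matching factors; off-diagonal: mismatched
```

With the objects from the split-sample chunk, the call would presumably be `congruence(unclass(fa_s1$loadings), unclass(fa_s2$loadings))`; the psych package also offers `factor.congruence()` for the same purpose. Values close to 1 on the diagonal would indicate that the same factors were recovered in both halves.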

5.6 Factor Scores

factor_scores <- as.data.frame(fa_rot$scores)
colnames(factor_scores) <- paste0("Factor_", 1:n_comp)

fwrite(factor_scores, "fifa23_factor_scores.csv")

pc_scores <- as.data.frame(pc$x[, 1:n_comp])
fwrite(pc_scores, "fifa23_pc_scores.csv")

cat("Factor scores saved: fifa23_factor_scores.csv\n")
## Factor scores saved: fifa23_factor_scores.csv
cat("PC scores saved: fifa23_pc_scores.csv\n")
## PC scores saved: fifa23_pc_scores.csv
round(sapply(factor_scores, function(x)
  c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))), 3) %>%
  t() %>%
  kable(caption = "Table 17. Factor Score Distribution (standardized: mean ~= 0, SD ~= 1)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(1, bold = TRUE)
Table 17. Factor Score Distribution (standardized: mean ~= 0, SD ~= 1)
Mean SD Min Max
Factor_1 0 1 -2.820 3.794
Factor_2 0 1 -2.773 2.367
Factor_3 0 1 -2.787 3.618
Factor_4 0 1 -4.706 3.204
Factor_5 0 1 -3.055 3.188
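Table 17 shows mean 0 and SD 1 for every factor. For an orthogonally rotated component solution the scores are also mutually uncorrelated, which can be verified in the same way. The following base R sketch reproduces that property on the built-in `mtcars` data using `prcomp()` and `varimax()`; it is an illustration under the assumption that rotated scores are formed as unit-variance PC scores multiplied by the rotation matrix, and it does not use the report's `fa_rot` object.

```r
# Standardize a few built-in variables (stand-in for the report's data_final)
X <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
k <- 2                                    # number of retained components

pc <- prcomp(X)                           # PCA on already standardized data
A  <- pc$rotation[, 1:k] %*% diag(pc$sdev[1:k])   # unrotated loadings
vm <- varimax(A)                          # orthogonal VARIMAX rotation

# Unit-variance PC scores, then rotate them with the same rotation matrix
F_unrot <- sweep(pc$x[, 1:k], 2, pc$sdev[1:k], "/")
F_rot   <- F_unrot %*% vm$rotmat

score_mean <- colMeans(F_rot)             # expected: ~0
score_sd   <- apply(F_rot, 2, sd)         # expected: ~1
score_cor  <- cor(F_rot)                  # expected: ~identity matrix

print(round(score_mean, 6))
print(round(score_sd, 6))
print(round(score_cor, 6))
```

On the report's own objects, `round(cor(factor_scores), 3)` would be the corresponding one-line check.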

6 Comprehensive Summary

6.1 Complete Results Overview

data.frame(
  Metric = c(
    "Dataset",
    "Total Raw Records",
    "Analysis Sample",
    "Variables Used",
    "Significant Correlation Pairs",
    "Bartlett Test Result",
    "Overall KMO",
    "KMO Classification",
    "Variables Removed (MSA < 0.50)",
    "Components Retained (Kaiser Rule)",
    "Total Variance Explained",
    "Variance by PC1 alone",
    "Variance by PC1 and PC2",
    "VARIMAX Factors",
    "Split-Sample Max Difference"
  ),
  Value = c(
    "FIFA 23 Complete Player Dataset (Kaggle, 2022)",
    "147,400 players",
    "5,000 players (set.seed = 123)",
    paste0(ncol(data_final),
           " numeric sub-attributes (6 aggregates excluded)"),
    "318 / 666 pairs (47.7%) with |r| > 0.3",
    "chi-sq = 218,585.80; df = 666; p ~= 0 (< 2.2e-16)",
    "0.942",
    "Marvelous (>= 0.90)",
    "0 variables removed (minimum MSAi = 0.708)",
    paste0(n_comp, " components (PC1 to PC", n_comp, ")"),
    paste0(round(cum_var[n_comp] * 100, 2), "% of total variance"),
    paste0(round(prop_var[1] * 100, 2), "%"),
    paste0(round(cum_var[2] * 100, 2), "%"),
    paste0("RC1: Technical-Attacking | RC2: Defensive-Aggression | ",
           "RC3: Physical-Aerial | RC4: Speed-Stamina | ",
           "RC5: Youth Potential vs. Experience"),
    paste0(max(val_df$Difference),
           "% (RC4) -- well below 5% stability threshold")
  )
) %>%
  kable(caption = "Table 18. Complete PCA and FA Results Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "22em") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(c(7, 8, 9, 10, 11, 14, 15), bold = TRUE, background = "#EBF5FB")
Table 18. Complete PCA and FA Results Summary
Metric Value
Dataset FIFA 23 Complete Player Dataset (Kaggle, 2022)
Total Raw Records 147,400 players
Analysis Sample 5,000 players (set.seed = 123)
Variables Used 37 numeric sub-attributes (6 aggregates excluded)
Significant Correlation Pairs 318 / 666 pairs (47.7%) with |r| > 0.3
Bartlett Test Result chi-sq = 218,585.80; df = 666; p ~= 0 (< 2.2e-16)
Overall KMO 0.942
KMO Classification Marvelous (>= 0.90)
Variables Removed (MSA < 0.50) 0 variables removed (minimum MSAi = 0.708)
Components Retained (Kaiser Rule) 5 components (PC1 to PC5)
Total Variance Explained 75.04% of total variance
Variance by PC1 alone 35.73%
Variance by PC1 and PC2 56.09%
VARIMAX Factors RC1: Technical-Attacking | RC2: Defensive-Aggression | RC3: Physical-Aerial | RC4: Speed-Stamina | RC5: Youth Potential vs. Experience
Split-Sample Max Difference 0.76% (RC4) – well below 5% stability threshold

6.2 Key Findings

6.2.1 Finding 1: Exceptional Data Suitability (KMO = 0.942)

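KMO = 0.942 means that the squared correlations among the variables are large relative to their squared partial correlations: most pairwise association is shared with the other variables rather than specific to individual pairs. This places the dataset in the highest practical category for PCA/FA. (KMO is a ratio of summed squared correlations to summed squared correlations plus summed squared partial correlations, so it should not be read as a percentage of variance.) Bartlett’s chi-sq = 218,585.80 (df = 666, p ~= 0) rejects the identity matrix hypothesis with a test statistic roughly 300 times larger than the chi-square critical value. Together, these results indicate that the FIFA 23 player attribute data has a strong latent factor structure. The exclusion of 6 aggregate attributes was methodologically essential: because each aggregate is built from its sub-attributes, retaining them would have produced a singular or near-singular correlation matrix and undermined the KMO and Bartlett diagnostics.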
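Both diagnostics can be written out in a few lines of base R, which makes their definitions concrete. The sketch below is an illustration on the built-in `mtcars` data, not the report's code (the report uses `psych::KMO()` and `psych::cortest.bartlett()`); `bartlett_sphericity()` and `kmo_overall()` are hypothetical helper names, and the 0.05 significance level for the critical value is an assumption.

```r
# Bartlett's test of sphericity: chi-sq = -(n - 1 - (2p + 5) / 6) * log|R|
bartlett_sphericity <- function(R, n, alpha = 0.05) {
  p      <- ncol(R)
  logdet <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  stat   <- -(n - 1 - (2 * p + 5) / 6) * logdet
  df     <- p * (p - 1) / 2
  list(chisq    = stat,
       df       = df,
       p.value  = pchisq(stat, df, lower.tail = FALSE),
       critical = qchisq(1 - alpha, df))
}

# Overall KMO: squared correlations vs. squared partial correlations
kmo_overall <- function(R) {
  Q <- cov2cor(solve(R))   # off-diagonals are partial correlations up to sign
  diag(Q) <- 0
  diag(R) <- 0
  sum(R^2) / (sum(R^2) + sum(Q^2))
}

X   <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
R   <- cor(X)
res <- bartlett_sphericity(R, nrow(X))
kmo <- kmo_overall(R)

print(res)
print(kmo)

# Degrees of freedom for the report's 37 variables
df_report <- 37 * 36 / 2
print(df_report)
```

The last line reproduces the df = 666 reported for the 37-variable analysis. Dividing the reported statistic by `qchisq(1 - alpha, 666)` gives the ratio to the critical value at whichever significance level is chosen.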


6.2.2 Finding 2: Five Latent Dimensions Explain 75% of Variance (7.4:1 Compression Ratio)

Reducing 37 original attributes to 5 components that retain 75.04% of total variance gives a 7.4:1 dimensionality compression (37 / 5) at the cost of 24.96% of the variance. Before rotation, the components explain 35.73% (PC1), 20.36% (PC2), 9.85% (PC3), 5.16% (PC4), and 3.94% (PC5). VARIMAX redistributes that same 75.04% across the rotated factors, so these percentages should not be read as the shares of RC1 to RC5; in the split-sample halves (Table 16) the rotated factors account for roughly 33%, 17%, 13%, 7 to 8%, and 4%. The five rotated dimensions are: RC1 – Technical and Attacking Ability, RC2 – Defensive and Aggression, RC3 – Physical Strength and Aerial, RC4 – Speed and Stamina, and RC5 – Youth Potential vs. Experience. Each dimension is interpretable in terms of well-established football analytics concepts, which suggests that the solution reflects genuine latent structure rather than noise.


6.2.3 Finding 3: Attacking and Defending Load on Separate Dimensions

The attacking/skill variables and the defending variables are negatively correlated at the variable level (-0.40 to -0.60), and after rotation they load on different factors, RC1 and RC2. Because VARIMAX is an orthogonal rotation, RC1 and RC2 scores are uncorrelated by construction; the substantive finding is that the two variable clusters separate largely onto different factors, with cross-loadings limited to a few variables such as passing, reactions, overall, and finishing (Table 14). This has direct analytical implications: any method that combines these attributes into a single “overall quality” metric conflates two structurally distinct player types. The 5-factor solution separates them, enabling meaningful comparison of attackers and defenders within their respective specialization dimensions.


6.2.4 Finding 4: Two Variables Structurally Outside the Five-Factor Model

weak_foot (h2 = 0.140) and international_reputation (h2 = 0.264) have low communalities, meaning that they share little variance with the five retained dimensions. This is not a data quality failure; it indicates that these two attributes largely reflect something other than the five factors (plausibly individual lateralization for weak_foot, and visibility and career history for international_reputation). A sixth factor targeting reputation might capture international_reputation, but this is beyond the current analysis scope.


6.2.5 Finding 5: Solution is Stable and Generalizable (Max Difference = 0.76%)

All five factors replicate within 1% variance difference across two disjoint n = 2,500 subsamples, far below the 5% stability threshold. The most stable factor is RC3 (diff = 0.03%), plausibly reflecting the more objective nature of physical measurements. The least stable is RC4 (diff = 0.76%), plausibly due to subsample variation in speed-specialist positions. Overall, the amount of variance carried by each factor looks like a robust property of the FIFA 23 player population rather than a sampling artifact, which supports using the factor scores in downstream analyses.


7 References

7.1 Data Source

  • Leone, S. (2022). FIFA 23 Complete Player Dataset. Kaggle.

7.2 Methodology and Theory

  • Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate Data Analysis (8th ed.). Cengage Learning.
  • Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39(1), 31-36.
  • Xiong, S. (2026). A unified framework of principal component analysis and factor analysis. Journal of Multivariate Analysis, 211, 105529. https://doi.org/10.1016/j.jmva.2025.105529
  • Jewsbury, P. A., & Johnson, M. S. (2025). Principal component analysis on the covariance matrix for data reduction in large-scale assessments. Large-Scale Assessments in Education, 13(1), 30. https://doi.org/10.1186/s40536-025-00264-9
  • Woo, K., & Kim, K. (2024). Profiling the socioeconomic characteristics, dietary intake, and health status of Korean older adults. Epidemiology and Health, 46, e2024043. https://doi.org/10.4178/epih.e2024043
  • Ameliya, A., Piliang, Y. K. A., Hidayah, A., & Hasibuan, E. S. H. (2026). Penerapan principal component analysis untuk menentukan faktor-faktor yang mempengaruhi kemiskinan di Sumatera Utara. Algoritma, 4(1), 1-19. https://doi.org/10.62383/algoritma.v4i1.890

7.3 R Packages

  • psych – KMO(), cortest.bartlett(), principal()
  • factoextra – fviz_eig(), fviz_pca_biplot(), fviz_contrib()
  • corrplot – corrplot()
  • kableExtra – table styling
  • ggplot2 + tidyr – distribution visualization
  • data.table – fread(), fwrite()
  • Base R – prcomp(), cor(), det()

Report Generated Using R Markdown

Analysis Date: March 01, 2026

Multivariate Analysis – FIFA 23 PCA and FA Full Report

INT2024 | Course Lecturer: Ulfa Siti Nuraini, S.Stat., M.Stat.