1 Executive Summary

This report presents a Principal Component Analysis (PCA) and Factor Analysis (FA) of the FIFA 23 Complete Player Dataset (Stefano Leone, 2022), which contains 147,400 player records. A simple random sample of 5,000 players was drawn (set.seed(123)) and analyzed on 37 numeric variables covering attacking, skill, movement, power, mentality, and defending attributes, together with the overall and potential ratings and basic physical descriptors.

Key Findings:

47.7% of variable pairs show |r| > 0.3: data contains sufficient shared variance for PCA/FA
KMO = 0.942 (Marvelous): all 37 variables have individual MSA >= 0.50, no removal needed
Bartlett chi-sq = 218,585.80, p ~= 0: correlation structure is non-trivial and significant
5 principal components retained via Kaiser’s Rule, explaining 75.04% of total variance
VARIMAX rotation identified 5 interpretable factors: Technical and Attacking Ability (RC1), Defensive and Aggression (RC2), Physical Strength and Aerial (RC3), Speed and Stamina (RC4), Youth Potential vs. Experience (RC5)
Split-sample validation confirms solution stability (max difference = 0.76% across all 5 factors)

2 A. Data Characteristics

2.1 Dataset Overview

df_raw <- fread("fifa23_clean_numeric.csv")

# Sub-attributes plus overall, potential, and basic descriptors. The 6 main
# aggregate attributes (pace, shooting, passing, dribbling, defending, physic)
# are excluded: they are arithmetic means of their sub-components, creating
# perfect multicollinearity that invalidates KMO
kolom_analisis <- c(
  "overall", "potential",
  "attacking_crossing", "attacking_finishing",
  "attacking_heading_accuracy", "attacking_short_passing", "attacking_volleys",
  "skill_dribbling", "skill_curve", "skill_fk_accuracy",
  "skill_long_passing", "skill_ball_control",
  "movement_acceleration", "movement_sprint_speed",
  "movement_agility", "movement_reactions", "movement_balance",
  "power_shot_power", "power_jumping", "power_stamina",
  "power_strength", "power_long_shots",
  "mentality_aggression", "mentality_interceptions",
  "mentality_positioning", "mentality_vision",
  "mentality_penalties", "mentality_composure",
  "defending_marking_awareness", "defending_standing_tackle",
  "defending_sliding_tackle",
  "skill_moves", "weak_foot", "international_reputation",
  "height_cm", "weight_kg", "age"
)

kolom_analisis <- intersect(kolom_analisis, names(df_raw))
data_pca <- df_raw %>% select(all_of(kolom_analisis)) %>% na.omit()

set.seed(123)
data_pca <- data_pca %>% sample_n(5000)

cat("Dataset Dimensions (raw):", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")

## Dataset Dimensions (raw): 147400 rows x 45 columns

cat("Analysis Sample:", nrow(data_pca), "rows x", ncol(data_pca), "columns\n")

## Analysis Sample: 5000 rows x 37 columns

data.frame(
  Information = c("Source", "Author", "Year",
                  "Total Records (Raw)", "Analysis Sample",
                  "Variables Used", "Sampling Method", "Random Seed"),
  Detail = c(
    "Kaggle -- FIFA 23 Complete Player Dataset",
    "Stefano Leone",
    "2022",
    "147,400 players",
    "5,000 players (random sample without replacement)",
    paste0(ncol(data_pca), " numeric sub-attributes"),
    "Simple Random Sampling",
    "set.seed(123)"
  )
) %>%
  kable(caption = "Table 1. Dataset Summary Information",
        col.names = c("Information", "Detail")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")

Table 1. Dataset Summary Information
Information	Detail
Source	Kaggle – FIFA 23 Complete Player Dataset
Author	Stefano Leone
Year	2022
Total Records (Raw)	147,400 players
Analysis Sample	5,000 players (random sample without replacement)
Variables Used	37 numeric sub-attributes
Sampling Method	Simple Random Sampling
Random Seed	set.seed(123)

2.2 Variable List

data.frame(
  No       = 1:ncol(data_pca),
  Variable = names(data_pca),
  Category = case_when(
    names(data_pca) %in% c("overall", "potential") ~ "Overall",
    grepl("attacking_", names(data_pca))           ~ "Attacking",
    grepl("skill_dribbling|skill_curve|skill_fk|skill_long|skill_ball",
          names(data_pca))                         ~ "Skill",
    grepl("movement_", names(data_pca))            ~ "Movement",
    grepl("power_", names(data_pca))               ~ "Power",
    grepl("mentality_", names(data_pca))           ~ "Mentality",
    grepl("defending_", names(data_pca))           ~ "Defending",
    TRUE                                           ~ "Physical / Misc"
  ),
  Scale = case_when(
    names(data_pca) %in%
      c("skill_moves", "weak_foot", "international_reputation") ~ "1-5 (discrete)",
    names(data_pca) == "height_cm" ~ "cm",
    names(data_pca) == "weight_kg" ~ "kg",
    names(data_pca) == "age"       ~ "years",
    TRUE                           ~ "1-100 (continuous)"
  )
) %>%
  kable(caption = "Table 2. Research Variables (37 Sub-Attributes Used in Analysis)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")

Table 2. Research Variables (37 Sub-Attributes Used in Analysis)
No	Variable	Category	Scale
1	overall	Overall	1-100 (continuous)
2	potential	Overall	1-100 (continuous)
3	attacking_crossing	Attacking	1-100 (continuous)
4	attacking_finishing	Attacking	1-100 (continuous)
5	attacking_heading_accuracy	Attacking	1-100 (continuous)
6	attacking_short_passing	Attacking	1-100 (continuous)
7	attacking_volleys	Attacking	1-100 (continuous)
8	skill_dribbling	Skill	1-100 (continuous)
9	skill_curve	Skill	1-100 (continuous)
10	skill_fk_accuracy	Skill	1-100 (continuous)
11	skill_long_passing	Skill	1-100 (continuous)
12	skill_ball_control	Skill	1-100 (continuous)
13	movement_acceleration	Movement	1-100 (continuous)
14	movement_sprint_speed	Movement	1-100 (continuous)
15	movement_agility	Movement	1-100 (continuous)
16	movement_reactions	Movement	1-100 (continuous)
17	movement_balance	Movement	1-100 (continuous)
18	power_shot_power	Power	1-100 (continuous)
19	power_jumping	Power	1-100 (continuous)
20	power_stamina	Power	1-100 (continuous)
21	power_strength	Power	1-100 (continuous)
22	power_long_shots	Power	1-100 (continuous)
23	mentality_aggression	Mentality	1-100 (continuous)
24	mentality_interceptions	Mentality	1-100 (continuous)
25	mentality_positioning	Mentality	1-100 (continuous)
26	mentality_vision	Mentality	1-100 (continuous)
27	mentality_penalties	Mentality	1-100 (continuous)
28	mentality_composure	Mentality	1-100 (continuous)
29	defending_marking_awareness	Defending	1-100 (continuous)
30	defending_standing_tackle	Defending	1-100 (continuous)
31	defending_sliding_tackle	Defending	1-100 (continuous)
32	skill_moves	Physical / Misc	1-5 (discrete)
33	weak_foot	Physical / Misc	1-5 (discrete)
34	international_reputation	Physical / Misc	1-5 (discrete)
35	height_cm	Physical / Misc	cm
36	weight_kg	Physical / Misc	kg
37	age	Physical / Misc	years

2.3 Descriptive Statistics

desc_df <- data.frame(
  Variable = names(data_pca),
  N        = sapply(data_pca, length),
  Mean     = round(sapply(data_pca, mean),   2),
  Median   = round(sapply(data_pca, median), 2),
  SD       = round(sapply(data_pca, sd),     2),
  Min      = round(sapply(data_pca, min),    2),
  Max      = round(sapply(data_pca, max),    2)
)

desc_df %>%
  kable(caption = "Table 3. Descriptive Statistics -- All 37 Variables (n = 5,000)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  scroll_box(width = "100%", height = "420px")

Table 3. Descriptive Statistics – All 37 Variables (n = 5,000)
	Variable	N	Mean	Median	SD	Min	Max
overall	overall	5000	65.93	66	6.76	46	89
potential	potential	5000	71.07	71	6.21	49	91
attacking_crossing	attacking_crossing	5000	53.86	56	13.38	15	90
attacking_finishing	attacking_finishing	5000	50.70	53	15.93	15	91
attacking_heading_accuracy	attacking_heading_accuracy	5000	56.45	57	11.40	19	90
attacking_short_passing	attacking_short_passing	5000	62.98	64	9.32	23	92
attacking_volleys	attacking_volleys	5000	46.34	46	14.47	12	88
skill_dribbling	skill_dribbling	5000	61.32	63	11.78	20	95
skill_curve	skill_curve	5000	51.94	52	14.37	17	93
skill_fk_accuracy	skill_fk_accuracy	5000	46.79	45	14.31	12	94
skill_long_passing	skill_long_passing	5000	56.94	58	11.58	20	90
skill_ball_control	skill_ball_control	5000	63.39	64	9.64	27	94
movement_acceleration	movement_acceleration	5000	68.35	69	11.27	27	94
movement_sprint_speed	movement_sprint_speed	5000	68.41	69	11.21	29	96
movement_agility	movement_agility	5000	66.66	68	12.06	26	94
movement_reactions	movement_reactions	5000	61.91	62	8.71	32	91
movement_balance	movement_balance	5000	67.20	68	12.11	28	95
power_shot_power	power_shot_power	5000	59.12	61	13.00	18	90
power_jumping	power_jumping	5000	65.76	67	11.88	30	93
power_stamina	power_stamina	5000	67.10	68	11.38	28	94
power_strength	power_strength	5000	65.59	67	12.78	27	96
power_long_shots	power_long_shots	5000	51.33	54	15.62	12	89
mentality_aggression	mentality_aggression	5000	59.22	60	13.59	20	94
mentality_interceptions	mentality_interceptions	5000	50.78	56	18.20	10	88
mentality_positioning	mentality_positioning	5000	55.62	58	14.05	12	91
mentality_vision	mentality_vision	5000	56.22	57	12.49	17	91
mentality_penalties	mentality_penalties	5000	51.50	51	12.19	18	91
mentality_composure	mentality_composure	5000	60.25	60	10.11	30	93
defending_marking_awareness	defending_marking_awareness	5000	50.88	55	17.19	10	88
defending_standing_tackle	defending_standing_tackle	5000	52.66	59	18.22	10	88
defending_sliding_tackle	defending_sliding_tackle	5000	50.22	56	18.09	10	87
skill_moves	skill_moves	5000	2.56	2	0.65	2	5
weak_foot	weak_foot	5000	3.01	3	0.65	1	5
international_reputation	international_reputation	5000	1.09	1	0.34	1	5
height_cm	height_cm	5000	180.62	180	6.66	156	206
weight_kg	weight_kg	5000	74.26	74	6.65	54	101
age	age	5000	25.02	25	4.51	16	39

Technical Interpretation – Descriptive Statistics:

Movement attributes (acceleration, sprint_speed, agility, reactions, balance) record some of the highest means among the 1-100 attributes (61.91 to 68.41), with relatively narrow standard deviations (8.71 to 12.11); only potential (71.07) has a higher mean than movement_sprint_speed. This is consistent with a player population dominated by young-to-prime-age outfield players for whom athletic movement qualities are well developed. The narrow dispersion in this category signals moderate homogeneity: most outfield players cluster within a similar movement ability range.

Defending attributes (mentality_interceptions, defending_marking_awareness, defending_standing_tackle, defending_sliding_tackle) have the largest standard deviations in the entire dataset (17.19 to 18.22). This is analytically important: it is not noise but a position-driven bimodal distribution. Attackers and wingers receive values of 10 to 30 on these attributes while defenders and defensive midfielders receive 60 to 88. High variance here provides strong discriminatory signal for factor separation in PCA/FA.

Scale heterogeneity is a critical preprocessing concern. skill_moves, weak_foot, and international_reputation operate on a 1 to 5 discrete scale with very low standard deviations (0.34 to 0.65), while height_cm (156 to 206 cm) and weight_kg (54 to 101 kg) are physical measurements with their own units. Without standardization, variables with larger raw variances would dominate the PC directions simply because of their units, not because of their correlation structure. Applying scale. = TRUE in prcomp() resolves this by converting all variables to mean = 0, SD = 1 before eigendecomposition, which is equivalent to running PCA on the correlation matrix.
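A minimal sketch of the effect of standardization, using simulated toy data rather than the FIFA sample (the column names rating, stars, and height are hypothetical). It checks that with scale. = TRUE the component variances from prcomp() match the eigenvalues of the correlation matrix, so a 1-100 variable cannot outweigh a 1-5 variable through units alone.

```r
set.seed(1)
n <- 500
latent <- rnorm(n)
toy <- data.frame(
  rating = 60 + 12 * latent + rnorm(n, sd = 6),    # roughly a 1-100 style scale
  stars  = 3 + 0.5 * latent + rnorm(n, sd = 0.4),  # roughly a 1-5 style scale
  height = 180 + rnorm(n, sd = 7)                  # unrelated, measured in cm
)

pca_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Unscaled: variance shares are dominated by the column with the largest raw variance
share_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)

# Scaled: component variances equal the eigenvalues of the correlation matrix
eig_cor <- eigen(cor(toy), symmetric = TRUE)$values
max_gap <- max(abs(pca_scaled$sdev^2 - eig_cor))

print(round(share_raw, 3))
print(round(pca_scaled$sdev^2, 3))
print(max_gap)
```

In the scaled solution the component variances sum to the number of variables, which is the property Kaiser's Rule (eigenvalue > 1) relies on.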

potential (mean = 71.07) consistently exceeds overall (mean = 65.93) by approximately 5 points on average, consistent with FIFA 23 game logic where potential represents the development ceiling. This gap is largest for players under age 21 and near zero for veterans whose career has peaked.

international_reputation has the lowest mean (1.09) and lowest SD (0.34), meaning the vast majority of the 5,000 sampled players have a reputation rating of exactly 1. This near-constant distribution is the likely reason international_reputation shows the second-lowest communality (h2 = 0.264) in the PCA solution. Because all variables are standardized, the issue is not its small raw variance; rather, a variable concentrated on a single value correlates only weakly with the other attributes, so the retained components reproduce little of its variance.

2.4 Distribution Visualization

data_long <- data_pca %>%
  tidyr::pivot_longer(cols = everything(),
                      names_to  = "Variable",
                      values_to = "Value") %>%
  mutate(Category = case_when(
    Variable %in% c("overall", "potential")                                            ~ "Overall",
    grepl("attacking_", Variable)                                                      ~ "Attacking",
    grepl("skill_dribbling|skill_curve|skill_fk|skill_long|skill_ball", Variable)     ~ "Skill",
    grepl("movement_", Variable)                                                       ~ "Movement",
    grepl("power_", Variable)                                                          ~ "Power",
    grepl("mentality_", Variable)                                                      ~ "Mentality",
    grepl("defending_", Variable)                                                      ~ "Defending",
    TRUE                                                                               ~ "Physical / Misc"
  ))

ggplot(data_long, aes(x = Variable, y = Value, fill = Category)) +
  geom_boxplot(outlier.size = 0.4, outlier.alpha = 0.3, linewidth = 0.4) +
  facet_wrap(~ Category, scales = "free", ncol = 2) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x      = element_text(angle = 45, hjust = 1, size = 7),
    strip.background = element_rect(fill = "#34495E", color = "#34495E"),
    strip.text       = element_text(color = "white", face = "bold"),
    legend.position  = "none",
    plot.title       = element_text(hjust = 0.5, face = "bold", size = 12)
  ) +
  labs(
    title = "Figure 1. Distribution of FIFA 23 Player Attributes by Category",
    x = NULL, y = "Value"
  )

Figure 1. Boxplot Distribution of All 37 FIFA 23 Player Attributes by Category. Wide IQR indicates high variance and strong discriminatory potential for PCA.

Technical Interpretation – Distribution Patterns:

Defending category shows the most analytically informative distributions. The boxplots for mentality_interceptions, defending_standing_tackle, defending_sliding_tackle, and defending_marking_awareness display wide IQRs within an observed range of roughly 10 to 88, with medians of 55 to 59. This spread is consistent with a population that mixes attacking players (scoring roughly 10 to 30) with defensive players (scoring roughly 60 to 88), creating high variance that PCA can leverage to define its second principal component.

Movement category displays left-skewed (negatively skewed) distributions, with outliers concentrated at the low end and means slightly below medians. The bulk of players cluster between 60 and 80, with a lower tail of low-mobility players (goalkeepers, physically large central defenders, older veterans). This skew does not violate PCA assumptions since PCA only requires linear correlation structure, not normality.
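The direction of skew can be checked numerically instead of being read off a boxplot: a tail toward low values gives a negative skewness coefficient, and a tail toward high values gives a positive one. The sketch below is illustrative only; skew_coef is a hypothetical helper (moment-based skewness in base R) and both variables are simulated, not taken from the FIFA sample.

```r
# Hypothetical helper: moment-based sample skewness, base R only
skew_coef <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

set.seed(2)
# Toy "movement-like" variable: bulk near 70, with a tail toward low values
low_tail <- 95 - rgamma(2000, shape = 4, scale = 6)
# Toy "reputation-like" variable: most mass at 1, a few higher values
high_tail <- sample(1:5, 2000, replace = TRUE,
                    prob = c(0.92, 0.05, 0.02, 0.007, 0.003))

skew_low  <- skew_coef(low_tail)   # negative: left (negative) skew
skew_high <- skew_coef(high_tail)  # positive: right (positive) skew
print(c(low_tail = skew_low, high_tail = skew_high))
```

Applied to the real data, the same helper could be run column by column, for example with sapply(data_pca, skew_coef).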

Physical/Misc category reveals three structurally distinct distributions: (1) height_cm and weight_kg follow near-normal distributions consistent with the real anthropometric distribution of professional footballers; (2) skill_moves and weak_foot follow discrete distributions concentrated at values 2 to 3 with sharp cutoffs at the boundaries; (3) international_reputation is extremely right-skewed with virtually all mass at value 1, confirming its near-constant nature that will produce low communality in PCA.

Attacking and Skill categories are mostly left-skewed: the mean falls below the median for most variables (for example, attacking_crossing has mean 53.86 and median 56), with attacking_volleys and skill_fk_accuracy as exceptions. Low-end outliers in these categories are typically defensive specialists receiving minimal investment in attacking attributes from the game’s design system.

3 B. Assumptions

Three assumptions must ALL be satisfied before PCA/FA is valid:

Correlation Matrix – at least 30% of all variable pairs must show |r| > 0.3
KMO / Measure of Sampling Adequacy (MSA) – overall KMO >= 0.50; all individual MSAi >= 0.50
Bartlett’s Test of Sphericity – p-value < 0.05 (reject identity matrix hypothesis)

The order tested here is Correlation first, then KMO (iterative variable removal if needed), then Bartlett on the cleaned dataset. This sequence ensures that the Bartlett test is computed on the same variable set confirmed adequate by KMO.

3.1 Pre-Cleaning

var_vals <- apply(data_pca, 2, var, na.rm = TRUE)
zero_var <- names(var_vals[var_vals < 1e-10])
if (length(zero_var) > 0) {
  cat("Dropped (zero variance):", paste(zero_var, collapse = ", "), "\n")
  data_pca <- data_pca %>% select(-all_of(zero_var))
} else {
  cat("Pre-clean Step 1: No zero-variance variables found. All 37 retained.\n")
}

## Pre-clean Step 1: No zero-variance variables found. All 37 retained.

cor_tmp  <- cor(data_pca, use = "complete.obs")
perf_idx <- which(abs(cor_tmp) > 0.9999 & upper.tri(cor_tmp), arr.ind = TRUE)
if (nrow(perf_idx) > 0) {
  drop_perf <- unique(colnames(cor_tmp)[perf_idx[, 2]])
  cat("Dropped (perfect correlation):", paste(drop_perf, collapse = ", "), "\n")
  data_pca  <- data_pca %>% select(-all_of(drop_perf))
} else {
  cat("Pre-clean Step 2: No perfectly correlated pairs found. All 37 retained.\n")
}

## Pre-clean Step 2: No perfectly correlated pairs found. All 37 retained.

cat("Variables entering assumption testing:", ncol(data_pca))

## Variables entering assumption testing: 37

Technical note on pre-cleaning: Zero-variance variables produce undefined Pearson correlations (division by SD = 0), and perfectly correlated variables make the correlation matrix singular (determinant = 0), so it cannot be inverted for KMO. Neither condition is present here, which is consistent with the exclusion of the 6 aggregate attributes having removed the exact linear dependencies that would otherwise have invalidated the analysis.
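The singularity problem can be illustrated with a small simulation. This is a sketch under a stated assumption: the column agg is a hypothetical aggregate defined as the exact mean of three simulated sub-components, which is not necessarily how the FIFA aggregates are computed.

```r
set.seed(3)
n <- 300
sub <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))

# Hypothetical aggregate defined as the exact mean of its sub-components
with_agg <- cbind(sub, agg = rowMeans(sub))

R_sub <- cor(sub)
R_agg <- cor(with_agg)

min_eig_sub <- min(eigen(R_sub, symmetric = TRUE)$values)
min_eig_agg <- min(eigen(R_agg, symmetric = TRUE)$values)

# The smallest eigenvalue collapses to (numerically) zero once the exact
# linear combination is added, so R_agg cannot be inverted reliably
print(c(min_eig_sub = min_eig_sub, min_eig_agg = min_eig_agg))
rank_agg <- qr(R_agg)$rank
print(rank_agg)  # rank 3 rather than 4
```

Note that no single pair in this toy matrix is perfectly correlated, so a pairwise |r| > 0.9999 screen alone would not detect this kind of dependency.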

3.2 Assumption 1 – Correlation Matrix

Requirement: At least 30% of all unique variable pairs must show |r| > 0.3. This is a magnitude threshold, not a significance test: if fewer than 30% of pairs exceed it, the variables do not share enough common variance to justify extracting latent factors.

Pearson correlation formula: \[r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\]

Total unique pairs for p = 37 variables: $\binom{37}{2} = \frac{37 \times 36}{2} = 666$ pairs.
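As a quick check of the two formulas above, the following sketch writes out Pearson's r term by term and compares it with cor(), then counts the unique pairs for p = 37. The helper pearson_manual and the simulated x and y are illustrative names, not part of the report's pipeline.

```r
# Pearson r written out term by term, compared with stats::cor()
pearson_manual <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

set.seed(4)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.8)

r_manual  <- pearson_manual(x, y)
r_builtin <- cor(x, y)

# Number of unique pairs among p = 37 variables
n_pairs_37 <- choose(37, 2)

print(c(manual = r_manual, builtin = r_builtin))
print(n_pairs_37)  # 666
```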

mat_corr <- round(cor(data_pca, use = "complete.obs"), 3)

if (any(is.na(mat_corr))) {
  na_vars  <- names(which(colSums(is.na(mat_corr)) > 0))
  data_pca <- data_pca %>% select(-all_of(na_vars))
  mat_corr <- round(cor(data_pca, use = "complete.obs"), 3)
}

n_var   <- ncol(data_pca)
n_pairs <- n_var * (n_var - 1) / 2
mat_abs <- abs(mat_corr); diag(mat_abs) <- 0
n_sig   <- sum(mat_abs > 0.3) / 2
pct_sig <- round(n_sig / n_pairs * 100, 1)

data.frame(
  Metric = c("Total variables (p)",
             "Total unique pairs [p(p-1)/2]",
             "Pairs with |r| > 0.3",
             "Percentage significant",
             "Minimum requirement",
             "Decision"),
  Value  = c(n_var, n_pairs, n_sig,
             paste0(pct_sig, "%"),
             "> 30%",
             "PASS -- sufficient shared variance exists for PCA/FA")
) %>%
  kable(caption = "Table 4. Correlation Matrix Assessment") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(6, bold = TRUE, color = "darkgreen")

Table 4. Correlation Matrix Assessment
Metric	Value
Total variables (p)	37
Total unique pairs [p(p-1)/2]	666
Pairs with |r| > 0.3	318
Percentage significant	47.7%
Minimum requirement	> 30%
Decision	PASS – sufficient shared variance exists for PCA/FA

corrplot(
  mat_corr,
  method = "color", type = "upper",
  tl.cex = 0.65, tl.col = "black",
  col    = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
  title  = "FIFA 23 Player Attribute Correlation Matrix (37 Variables)",
  mar    = c(0, 0, 2, 0)
)

Figure 2. Correlation Matrix Heatmap. Red = positive correlation, Blue = negative correlation. Intensity reflects magnitude of |r|.

Technical Interpretation – Correlation Structure:

Quantitative result: 318 out of 666 unique pairs (47.7%) exceed |r| = 0.3, which is 17.7 percentage points above the minimum threshold. This confirms substantial shared variance among the 37 attributes and provides the statistical foundation for latent factor extraction.

Heatmap block structure: The correlation matrix reveals a well-defined two-block architecture. The upper-left block (attacking and skill variables) shows predominantly deep red cells, indicating high positive intercorrelations. For example, skill_dribbling and skill_ball_control (r ~= 0.83), power_shot_power and power_long_shots (r ~= 0.78), and mentality_vision and attacking_short_passing (r ~= 0.72) cluster tightly together. This entire block will constitute RC1 (Technical and Attacking Ability) in the rotated FA solution.

The lower-right cluster (defending_standing_tackle, defending_sliding_tackle, defending_marking_awareness, mentality_interceptions) shows extremely high intercorrelations ranging from 0.88 to 0.95. These four variables measure essentially the same underlying defensive competency from slightly different angles, which is why they will produce near-perfect loadings (0.93 to 0.944) on RC2 in VARIMAX.

Cross-block negative correlations (blue cells connecting the attacking cluster to the defending cluster, r approximately -0.40 to -0.60) are not statistical artifacts. They reflect a deliberate structural feature of FIFA 23’s design: the game assigns attacking specialists systematically low defending values and vice versa, embedding an inherent bipolar specialization axis into the data that PCA will capture as the contrast between PC1 and PC2.

Physical variables (height_cm, weight_kg, power_strength) form a smaller positive cluster among themselves, with near-zero or negative correlations against movement attributes (agility, balance, sprint_speed), consistent with the real biomechanical trade-off between mass and mobility.

3.3 Assumption 2 – KMO / Measure of Sampling Adequacy

KMO (Kaiser-Meyer-Olkin) compares the size of the observed correlations among the variables with the size of their partial correlations. Mathematically:

\[KMO = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} a_{ij}^2}\]

where $r_{ij}$ is the observed correlation and $a_{ij}$ is the partial correlation between variables i and j controlling for all others. A high KMO means partial correlations are small relative to observed correlations, confirming that a common factor structure drives the intercorrelations rather than unique pairwise relationships.
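The formula can be reproduced in a few lines of base R. The sketch below is an illustrative re-implementation, not the psych::KMO() routine used in this report: kmo_manual is a hypothetical helper that obtains the partial correlations from the inverse correlation matrix, and the data are simulated from a single shared factor.

```r
# Hypothetical helper: KMO and per-variable MSA from a correlation matrix R
kmo_manual <- function(R) {
  Q <- solve(R)                               # inverse correlation matrix
  A <- -Q / sqrt(outer(diag(Q), diag(Q)))     # partial correlations a_ij
  diag(A) <- 0
  R0 <- R
  diag(R0) <- 0                               # keep only the i != j terms
  r2 <- R0^2
  a2 <- A^2
  list(
    overall = sum(r2) / (sum(r2) + sum(a2)),
    msa_i   = rowSums(r2) / (rowSums(r2) + rowSums(a2))
  )
}

set.seed(5)
n <- 1000
f <- rnorm(n)                                 # one shared latent factor
toy <- sapply(1:6, function(j) 0.8 * f + rnorm(n, sd = 0.6))
colnames(toy) <- paste0("v", 1:6)

res <- kmo_manual(cor(toy))
print(round(res$overall, 3))
print(round(res$msa_i, 3))
```

Because all six toy variables share one factor, the partial correlations are small once the others are controlled for, and the overall value lands near the top of Kaiser's scale.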

Kaiser (1974) classification:

KMO Value	Classification
>= 0.90	Marvelous
>= 0.80	Meritorious
>= 0.70	Middling
>= 0.60	Mediocre
>= 0.50	Miserable
< 0.50	Unacceptable (remove variable)

Procedure: Variables with individual MSAi < 0.50 are removed one at a time (lowest MSAi first), recalculating KMO after each removal until all MSAi >= 0.50.

kmo_res <- KMO(mat_corr)
msa_val <- round(kmo_res$MSA, 3)
kmo_kat <- ifelse(msa_val >= 0.90, "Marvelous",
           ifelse(msa_val >= 0.80, "Meritorious",
           ifelse(msa_val >= 0.70, "Middling",
           ifelse(msa_val >= 0.60, "Mediocre",
           ifelse(msa_val >= 0.50, "Miserable", "Unacceptable")))))

cat("Overall KMO/MSA:", msa_val, "--", kmo_kat, "\n")

## Overall KMO/MSA: 0.942 -- Marvelous

msa_df <- data.frame(
  Variable = names(kmo_res$MSAi),
  MSA      = round(kmo_res$MSAi, 3),
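  # Labels follow the Kaiser (1974) classification used for the overall KMO
  Category = ifelse(kmo_res$MSAi >= 0.90, "Marvelous",
             ifelse(kmo_res$MSAi >= 0.80, "Meritorious",
             ifelse(kmo_res$MSAi >= 0.70, "Middling",
             ifelse(kmo_res$MSAi >= 0.60, "Mediocre",
             ifelse(kmo_res$MSAi >= 0.50, "Miserable", "Drop (< 0.50)")))))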
) %>% arrange(MSA)

msa_df %>%
  kable(caption = "Table 5. Individual MSA Values per Variable (sorted ascending)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(msa_df$MSA >= 0.90, "#1B5E20",
                      ifelse(msa_df$MSA >= 0.80, "#2E7D32",
                      ifelse(msa_df$MSA >= 0.70, "#558B2F",
                      ifelse(msa_df$MSA >= 0.50, "#F57F17", "red")))))

Table 5. Individual MSA Values per Variable (sorted ascending)
	Variable	MSA	Category
age	age	0.708	Middling
potential	potential	0.766	Middling
movement_sprint_speed	movement_sprint_speed	0.849	Meritorious
movement_acceleration	movement_acceleration	0.874	Meritorious
power_jumping	power_jumping	0.879	Meritorious
overall	overall	0.881	Meritorious
defending_standing_tackle	defending_standing_tackle	0.889	Meritorious
defending_sliding_tackle	defending_sliding_tackle	0.899	Meritorious
height_cm	height_cm	0.900	Marvelous
weight_kg	weight_kg	0.913	Marvelous
power_strength	power_strength	0.919	Marvelous
attacking_heading_accuracy	attacking_heading_accuracy	0.921	Marvelous
skill_long_passing	skill_long_passing	0.938	Marvelous
movement_balance	movement_balance	0.940	Marvelous
power_stamina	power_stamina	0.943	Marvelous
attacking_short_passing	attacking_short_passing	0.945	Marvelous
mentality_interceptions	mentality_interceptions	0.946	Marvelous
skill_fk_accuracy	skill_fk_accuracy	0.953	Marvelous
power_long_shots	power_long_shots	0.957	Marvelous
attacking_crossing	attacking_crossing	0.958	Marvelous
attacking_finishing	attacking_finishing	0.960	Marvelous
mentality_positioning	mentality_positioning	0.963	Marvelous
movement_reactions	movement_reactions	0.964	Marvelous
defending_marking_awareness	defending_marking_awareness	0.965	Marvelous
skill_curve	skill_curve	0.966	Marvelous
skill_ball_control	skill_ball_control	0.966	Marvelous
movement_agility	movement_agility	0.968	Marvelous
international_reputation	international_reputation	0.968	Marvelous
power_shot_power	power_shot_power	0.969	Marvelous
mentality_aggression	mentality_aggression	0.969	Marvelous
skill_dribbling	skill_dribbling	0.970	Marvelous
mentality_penalties	mentality_penalties	0.970	Marvelous
mentality_vision	mentality_vision	0.976	Marvelous
attacking_volleys	attacking_volleys	0.980	Marvelous
mentality_composure	mentality_composure	0.984	Marvelous
skill_moves	skill_moves	0.986	Marvelous
weak_foot	weak_foot	0.989	Marvelous

drop_log <- c()
data_ok  <- data_pca
iter     <- 0
max_iter <- ncol(data_pca) - 5

cat("--- Iterative Variable Removal Check ---\n")

## --- Iterative Variable Removal Check ---

repeat {
  iter <- iter + 1
  if (iter > max_iter) { cat("Max iterations reached.\n"); break }

  mc <- tryCatch(round(cor(data_ok, use = "complete.obs"), 3),
                 error = function(e) NULL)
  if (is.null(mc)) { cat("Singular matrix: stopping.\n"); break }

  det_val <- tryCatch(det(mc), error = function(e) NA)
  if (is.na(det_val) || det_val < 1e-15) {
    cat("Near-singular matrix (det < 1e-15): stopping.\n")
    cat("Note: this is expected for a 37-variable matrix with KMO = 0.942.",
        "The near-zero determinant is caused by high intercorrelations,",
        "not by data quality problems.\n")
    break
  }

  kmo_tmp <- tryCatch(KMO(mc), error = function(e) NULL)
  if (is.null(kmo_tmp)) { cat("KMO failed: stopping.\n"); break }

  msa_clean <- kmo_tmp$MSAi[!is.na(kmo_tmp$MSAi)]
  if (length(msa_clean) == 0) break

  min_msa <- min(msa_clean)
  min_var <- names(which.min(msa_clean))
  if (min_msa >= 0.5) {
    cat("All variables have MSA >= 0.50. No removal needed.\n")
    break
  }

  cat(sprintf("Dropping '%s' (MSA = %.3f)\n", min_var, min_msa))
  drop_log <- c(drop_log, min_var)
  data_ok  <- data_ok %>% select(-all_of(min_var))
}

## Near-singular matrix (det < 1e-15): stopping.
## Note: this is expected for a 37-variable matrix with KMO = 0.942. The near-zero determinant is caused by high intercorrelations, not by data quality problems.

mat_corr_ok <- round(cor(data_ok, use = "complete.obs"), 3)
kmo_final   <- KMO(mat_corr_ok)
data_final  <- data_ok

final_kmo_kat <- ifelse(kmo_final$MSA >= 0.90, "Marvelous",
                 ifelse(kmo_final$MSA >= 0.80, "Meritorious",
                 ifelse(kmo_final$MSA >= 0.70, "Middling",
                 ifelse(kmo_final$MSA >= 0.60, "Mediocre",
                 ifelse(kmo_final$MSA >= 0.50, "Miserable", "Unacceptable")))))

data.frame(
  Metric = c("Initial Variables",
             "Variables Removed (MSA < 0.50)",
             "Final Variables",
             "Final Overall KMO/MSA",
             "Classification",
             "Decision"),
  Value  = c(ncol(data_pca),
             length(drop_log),
             ncol(data_final),
             round(kmo_final$MSA, 3),
             final_kmo_kat,
             "PASS -- all variables have MSA >= 0.50, proceed with PCA")
) %>%
  kable(caption = "Table 6. KMO/MSA Final Summary After Iterative Removal Check") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(6, bold = TRUE, color = "darkgreen")

Table 6. KMO/MSA Final Summary After Iterative Removal Check
Metric	Value
Initial Variables	37
Variables Removed (MSA < 0.50)	0
Final Variables	37
Final Overall KMO/MSA	0.942
Classification	Marvelous
Decision	PASS – all variables have MSA >= 0.50, proceed with PCA

Technical Interpretation – KMO:

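Overall KMO = 0.942 (Marvelous) means that the squared partial correlations are very small relative to the squared observed correlations: the observed correlations account for 94.2% of the combined total in the KMO ratio. This is not the share of variance explained by common factors (the five retained components explain 75.04%). A value this high indicates that the correlations are driven by shared structure rather than isolated pairwise relationships, so the data have a strong and recoverable latent structure.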

Individual MSA values range from 0.708 (age, Middling) to 0.989 (weak_foot, Marvelous). Age has the lowest MSA because it correlates with ability attributes only indirectly and non-linearly: young players may have high potential but modest current ratings, while older players show the opposite pattern. After controlling for all other variables, age retains non-trivial unique partial correlations, modestly reducing its MSA. Despite this, 0.708 comfortably clears the 0.50 threshold.

The near-singular matrix message in the iterative loop is a computational stopping condition, not a data problem. The determinant of a correlation matrix is the product of its eigenvalues, so for 37 intercorrelated variables it falls below the loop's 1e-15 guard even though the matrix is still invertible (KMO was computed on the same matrix above without difficulty). Because the guard is checked before the MSA comparison, the loop exits without evaluating any variable for removal; the conclusion that no removal is needed therefore rests on Table 5, where the lowest individual MSA is 0.708. Zero variables were removed and the final dataset is identical to the input.
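A small simulation illustrates why a tiny determinant does not by itself imply a singular matrix. This is a sketch on simulated data (37 toy variables driven by 5 latent factors, not the FIFA sample); the smallest eigenvalue and the condition number are shown as possible alternative guards, which is a suggestion and not what the loop above uses.

```r
set.seed(6)
p <- 37
n <- 2000
# Toy data: 37 variables sharing 5 latent factors plus noise
Fmat <- matrix(rnorm(n * 5), n, 5)
L    <- matrix(runif(p * 5, -0.8, 0.8), p, 5)
X    <- Fmat %*% t(L) + matrix(rnorm(n * p, sd = 0.4), n, p)
R    <- cor(X)

ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

log_det  <- sum(log(ev))        # log-determinant = sum of log-eigenvalues
min_eig  <- min(ev)             # distance from exact singularity
cond_num <- max(ev) / min(ev)   # condition number of R

print(c(log_det = log_det, min_eig = min_eig, cond_num = cond_num))

# The matrix inverts without error despite the very small determinant
invertible <- !inherits(try(solve(R), silent = TRUE), "try-error")
print(invertible)
```

With many eigenvalues below 1, their product shrinks geometrically with p, while the smallest eigenvalue stays well away from zero.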

3.4 Assumption 3 – Bartlett’s Test of Sphericity

Bartlett’s Test formally tests H0: the population correlation matrix equals the identity matrix (R = I), meaning all off-diagonal correlations are exactly zero. If H0 were true, no shared variance would exist and PCA/FA would be meaningless.

Hypotheses:

H0: R = I (all variables are uncorrelated in the population)
H1: R != I (at least some variables are correlated)

Test statistic: \[\chi^2 = -\left[(n-1) - \frac{2p+5}{6}\right] \ln|R|\]

where n = sample size, p = number of variables, |R| = determinant of the correlation matrix. The term (2p+5)/6 is a bias correction for small samples. Degrees of freedom = p(p-1)/2.
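The test statistic can be computed directly from its definition. The sketch below is a base R illustration, not the psych::cortest.bartlett() call used in this report: bartlett_manual is a hypothetical helper, and the 2 x 2 example matrix is chosen so the determinant is easy to verify by hand.

```r
# Hypothetical helper: Bartlett's test of sphericity from a correlation matrix
bartlett_manual <- function(R, n) {
  p <- ncol(R)
  # log-determinant via determinant() avoids underflow when |R| is tiny
  ln_R  <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * ln_R
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Two variables with r = 0.5 and n = 100: |R| = 1 - 0.5^2 = 0.75
R2  <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
out <- bartlett_manual(R2, n = 100)
print(out)
```

Working on the log scale matters for large p, where |R| itself can be too small to represent usefully.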

Decision rule: Reject H0 if p-value < 0.05.

Note: Bartlett is computed on mat_corr_ok (the KMO-cleaned matrix) and data_final (n after KMO cleaning) to ensure consistency between the two tests.
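The test statistic above can be sketched in base R as follows. This is a minimal illustration, not the report's own code: bartlett_sphericity() is a hypothetical helper name, and the mtcars subset is used only as stand-in data. Under these assumptions it should agree with what cortest.bartlett() reports for the same matrix and n.

```r
# Minimal sketch of Bartlett's sphericity statistic from a correlation matrix.
# bartlett_sphericity() is a hypothetical helper, not part of the report.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  # determinant(..., logarithm = TRUE) returns ln|R| directly, which avoids
  # underflow when |R| is extremely small
  ln_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  chisq  <- -((n - 1) - (2 * p + 5) / 6) * ln_det
  df     <- p * (p - 1) / 2
  list(chisq   = chisq,
       df      = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Stand-in data: four correlated columns from the built-in mtcars dataset
R_demo <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
res    <- bartlett_sphericity(R_demo, n = nrow(mtcars))
res$df   # p(p-1)/2 with p = 4
```

Working with ln|R| rather than |R| matters for matrices like the one in this report, where the determinant itself is close to the limits of double precision.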

n_obs    <- nrow(data_final)
bart_res <- cortest.bartlett(mat_corr_ok, n = n_obs, diag = TRUE)
p_val    <- bart_res$p.value
p_display <- ifelse(is.na(p_val) || p_val == 0,
                    "approx. 0 (< 2.2e-16, machine precision limit in R)",
                    format(p_val, scientific = TRUE))

data.frame(
  Statistic = c("Chi-square statistic",
                "Degrees of freedom [p(p-1)/2]",
                "p-value",
                "Significance level (alpha)",
                "Sample size (n)",
                "Decision"),
  Value = c(
    formatC(bart_res$chisq, format = "f", digits = 4),
    bart_res$df,
    p_display,
    "0.05",
    n_obs,
    "REJECT H0 -- correlation matrix is significantly non-identity"
  )
) %>%
  kable(caption = "Table 7. Bartlett's Test of Sphericity Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(6, bold = TRUE, color = "darkgreen")

Table 7. Bartlett’s Test of Sphericity Results
Statistic	Value
Chi-square statistic	218585.7995
Degrees of freedom [p(p-1)/2]	666
p-value	approx. 0 (< 2.2e-16, machine precision limit in R)
Significance level (alpha)	0.05
Sample size (n)	5000
Decision	REJECT H0 – correlation matrix is significantly non-identity

Technical Interpretation – Bartlett’s Test:

The chi-squared statistic of 218,585.80 with 666 degrees of freedom is extraordinarily large. For context, the critical value at alpha = 0.05 with 666 df is approximately 727. The observed statistic exceeds this critical value by a factor of approximately 300, meaning the evidence against H0 is overwhelming by any standard.

Why is the statistic so large? The Bartlett formula includes the term ln|R|. When variables are highly intercorrelated (KMO = 0.942), the correlation matrix determinant approaches zero, making ln|R| a large negative number. Multiplied by a negative sign and the large sample correction term (n - 1 = 4,999), this produces an enormous positive chi-squared value. For a fixed sample size, a larger Bartlett statistic therefore indicates stronger intercorrelation structure. The statistic also grows roughly linearly with n, so with n = 5,000 even modest correlations would lead to rejection; the test establishes that correlations exist, not how strong they are.
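The size of ln|R| implied by the reported statistic can be recovered by inverting the Bartlett formula. The sketch below uses only the values printed in Table 7 (n, p, and the chi-squared statistic); the variable names are illustrative.

```r
# Back out ln|R| from the reported Bartlett statistic (values from Table 7)
n     <- 5000
p     <- 37
chisq <- 218585.7995

multiplier <- (n - 1) - (2 * p + 5) / 6   # sample-size term in the formula
ln_det     <- -chisq / multiplier         # implied ln|R|, about -43.8
det_R      <- exp(ln_det)                 # implied |R|, far below 1e-15

c(multiplier = multiplier, ln_det = ln_det, det_R = det_R)
```

An implied determinant this small is consistent with the near-singular matrix message discussed in the KMO section.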

The p-value appears as exactly 0 in R because chi-sq = 218,585.80 is so far into the tail of the chi-squared distribution with 666 df that the tail probability underflows double precision: it is far below the smallest positive normalized double (about 2.2e-308), so R returns 0. The "< 2.2e-16" shown in Table 7 follows R's reporting convention, where 2.2e-16 is machine epsilon (.Machine$double.eps), not the smallest representable number. This is not a computational error; the true p-value is positive but indistinguishable from zero for any practical purpose.
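Both the critical value and the tail probability can be checked with base R. This is a sketch using qchisq() and pchisq() with the statistic and degrees of freedom from Table 7; requesting the log of the p-value is one way to see how small it is without underflow.

```r
chisq <- 218585.7995   # Table 7
df    <- 666

crit  <- qchisq(0.95, df)   # critical value at alpha = 0.05
ratio <- chisq / crit       # how many times the statistic exceeds it

# On the natural scale the upper-tail probability underflows to 0 ...
p_raw <- pchisq(chisq, df, lower.tail = FALSE)
# ... but its logarithm is still representable
log_p <- pchisq(chisq, df, lower.tail = FALSE, log.p = TRUE)

c(crit = crit, ratio = ratio, p_raw = p_raw, log_p = log_p)
```

Reporting log_p (or simply "p < 2.2e-16") is generally preferable to printing a literal 0.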

Practical conclusion: H0 is rejected decisively. The 37 FIFA 23 sub-attributes do not vary independently of one another. They share systematic variance attributable to a smaller number of latent ability dimensions. On this criterion, PCA and FA are statistically justified.

3.5 Assumption Summary

data.frame(
  Assumption  = c("1. Correlation Matrix",
                  "2. KMO / MSA",
                  "3. Bartlett's Test"),
  Result      = c("318 / 666 pairs (47.7%) with |r| > 0.3",
                  "Overall KMO = 0.942 (Marvelous); min MSAi = 0.708 (age)",
                  "chi-sq = 218,585.80; df = 666; p ~= 0"),
  Requirement = c("> 30% of pairs",
                  "Overall and all MSAi >= 0.50",
                  "p < 0.05"),
  Status      = c("PASS", "PASS", "PASS")
) %>%
  kable(caption = "Table 8. Summary -- All Three PCA/FA Assumptions Satisfied") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(4, bold = TRUE, color = "darkgreen") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")

Table 8. Summary – All Three PCA/FA Assumptions Satisfied
Assumption	Result	Requirement	Status
Correlation Matrix	318 / 666 pairs (47.7%) with |r| > 0.3	> 30% of pairs	PASS
KMO / MSA	Overall KMO = 0.942 (Marvelous); min MSAi = 0.708 (age)	Overall and all MSAi >= 0.50	PASS
Bartlett’s Test	chi-sq = 218,585.80; df = 666; p ~= 0	p < 0.05	PASS

4 C. Principal Component Analysis (PCA)

4.1 Objective and Design

Two analytical goals of PCA in this study:

Latent Structure Identification – determine whether the 37 observed attributes are manifestations of fewer underlying latent ability dimensions (e.g., “offensive skill”, “physical build”, “defensive competence”)
Dimensionality Reduction – compress the 37-variable space into k orthogonal principal components that collectively retain the majority of total variance, enabling more parsimonious player profiling

data.frame(
  Criterion   = c("Number of Variables (p)", "Number of Observations (n)",
                  "Obs/Variable Ratio (n/p)", "Variable Type",
                  "Analysis Type", "Missing Values"),
  Value       = c(ncol(data_pca), nrow(data_pca),
                  paste0(round(nrow(data_pca) / ncol(data_pca), 1), " : 1"),
                  "All numeric (continuous or discrete ordinal)",
                  "R-type (correlations among variables)",
                  "None (na.omit applied prior to sampling)"),
  Requirement = c(">= 10", ">= 50 (100+ preferred)", ">= 5 : 1",
                  "Required for Pearson correlation", "Standard for PCA",
                  "Must be handled"),
  Status      = rep("PASS", 6)
) %>%
  kable(caption = "Table 9. PCA Design Criteria Checklist") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(4, bold = TRUE, color = "darkgreen") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")

Table 9. PCA Design Criteria Checklist
Criterion	Value	Requirement	Status
Number of Variables (p)	37	>= 10	PASS
Number of Observations (n)	5000	>= 50 (100+ preferred)	PASS
Obs/Variable Ratio (n/p)	135.1 : 1	>= 5 : 1	PASS
Variable Type	All numeric (continuous or discrete ordinal)	Required for Pearson correlation	PASS
Analysis Type	R-type (correlations among variables)	Standard for PCA	PASS
Missing Values	None (na.omit applied prior to sampling)	Must be handled	PASS

The n/p ratio of 135.1:1 far exceeds even the demanding 100:1 benchmark cited in Hair et al. (2019) as “excellent.” A ratio this high supports stable correlation estimates with narrow confidence intervals, reliable eigendecomposition with little overfitting to sample-specific noise, and a factor structure that is likely to replicate in independent samples. Replication is checked formally by the split-sample validation in Section D.

4.2 Running PCA

pc       <- prcomp(data_final, scale. = TRUE, center = TRUE)
eig_val  <- pc$sdev^2
prop_var <- eig_val / sum(eig_val)
cum_var  <- cumsum(prop_var)

4.3 Component Retention

Three criteria for deciding how many components to retain:

Kaiser’s Rule (Latent Root): retain PC_i if eigenvalue lambda_i > 1.0. Rationale: a component must explain more variance than a single standardized variable (which has variance = 1 by definition) to be worth retaining.
Cumulative Variance Criterion: retain until cumulative explained variance >= 60% to 70%. This ensures the retained solution adequately represents the original data.
Scree Test: identify the elbow in the eigenvalue plot where the curve flattens. Components before the elbow carry systematic variance; components after it carry mostly noise.

Agreement across all three criteria strengthens the retention decision.

n_kaiser <- sum(eig_val > 1)
n_varpc  <- which(cum_var >= 0.70)[1]
n_comp   <- n_kaiser

eig_df <- data.frame(
  Component  = paste0("PC", 1:length(eig_val)),
  Eigenvalue = round(eig_val, 4),
  Variance   = paste0(round(prop_var * 100, 2), "%"),
  Cumulative = paste0(round(cum_var  * 100, 2), "%"),
  Decision   = ifelse(eig_val > 1, "Retain", "Drop")
)

head(eig_df, 15) %>%
  kable(caption = "Table 10. Eigenvalue Table -- Top 15 Components",
        col.names = c("Component", "Eigenvalue", "Variance (%)",
                      "Cumulative (%)", "Decision")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(1:5, background = "#EBF5FB") %>%
  column_spec(5, bold = TRUE,
              color = ifelse(head(eig_df, 15)$Decision == "Retain",
                             "darkgreen", "red"))

Table 10. Eigenvalue Table – Top 15 Components
Component	Eigenvalue	Variance (%)	Cumulative (%)	Decision
PC1	13.2199	35.73%	35.73%	Retain
PC2	7.5333	20.36%	56.09%	Retain
PC3	3.6463	9.85%	65.94%	Retain
PC4	1.9089	5.16%	71.1%	Retain
PC5	1.4580	3.94%	75.04%	Retain
PC6	0.9733	2.63%	77.67%	Drop
PC7	0.8720	2.36%	80.03%	Drop
PC8	0.7915	2.14%	82.17%	Drop
PC9	0.6164	1.67%	83.84%	Drop
PC10	0.5420	1.46%	85.3%	Drop
PC11	0.4600	1.24%	86.54%	Drop
PC12	0.4582	1.24%	87.78%	Drop
PC13	0.4165	1.13%	88.91%	Drop
PC14	0.3902	1.05%	89.96%	Drop
PC15	0.3420	0.92%	90.89%	Drop

data.frame(
  Criterion = c("Kaiser's Rule (eigenvalue > 1)",
                "Cumulative Variance (>= 70%)",
                "Scree Test (visual elbow)"),
  Components = c(n_kaiser, n_varpc,
                 "3 to 5 (elbow visible between PC3 and PC4)"),
  Cumulative = c(
    paste0(round(cum_var[n_kaiser] * 100, 2), "%"),
    paste0(round(cum_var[n_varpc]  * 100, 2), "%"),
    paste0(round(cum_var[3] * 100, 2), "% to ",
           round(cum_var[5] * 100, 2), "%")
  ),
  Role = c("Primary (definitive)", "Supporting", "Supporting")
) %>%
  kable(caption = "Table 11. Component Retention Criteria Comparison",
        col.names = c("Criterion", "Components Retained",
                      "Cumulative Variance", "Role")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(1, bold = TRUE, background = "#EBF5FB") %>%
  column_spec(1, bold = TRUE)

Table 11. Component Retention Criteria Comparison
Criterion	Components Retained	Cumulative Variance	Role
Kaiser’s Rule (eigenvalue > 1)	5	75.04%	Primary (definitive)
Cumulative Variance (>= 70%)	4	71.1%	Supporting
Scree Test (visual elbow)	3 to 5 (elbow visible between PC3 and PC4)	65.94% to 75.04%	Supporting

fviz_eig(
  pc, ncp = 15, addlabels = TRUE,
  barfill  = "#3498DB", barcolor = "#2980B9", linecolor = "#E74C3C",
  main = "Scree Plot -- PCA FIFA 23 Player Attributes"
) +
  geom_hline(yintercept = 100 / ncol(data_final),
             linetype = "dashed", color = "#E74C3C", linewidth = 0.9) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Figure 3. Scree Plot. The y-axis shows the percentage of variance explained; the dashed red line at 100/37 = 2.70% corresponds to eigenvalue = 1 (Kaiser threshold). The elbow is visible between PC3 and PC4, while five components lie above the Kaiser threshold.

Technical Interpretation – Component Retention:

Kaiser’s Rule retains 5 components: PC1 (eigenvalue = 13.22), PC2 (7.53), PC3 (3.65), PC4 (1.91), PC5 (1.46). PC6 (eigenvalue = 0.97) falls just below 1.0, although the gap of 0.48 eigenvalue units from PC5 still gives a reasonably well-defined boundary.

Variance decomposition: PC1 alone accounts for 35.73% of total variance – unusually dominant, indicating a single strong latent dimension (general technical and attacking quality) that differentiates players more powerfully than any other axis. PC1 and PC2 together explain 56.09%, meaning the attacking-defending specialization bipolar dimension (PC2) adds another 20.36%. The remaining three components (PC3 to PC5) contribute 9.85%, 5.16%, and 3.94%, capturing progressively narrower but theoretically meaningful dimensions (physical build, speed, and developmental trajectory).

Scree test: The plot shows a pronounced steep descent from PC1 to PC3, then a clear change in slope (elbow) between PC3 and PC4. Strictly interpreted, the scree elbow suggests retaining 3 components. However, Kaiser’s Rule extending to 5 is well justified because PC4 (eigenvalue = 1.91) and PC5 (eigenvalue = 1.46) each explain more variance than any single standardized variable, and they capture theoretically distinct dimensions (speed and career stage) that would be lost in a 3-component solution.

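Retention decision: The three criteria do not agree exactly (Kaiser's Rule: 5; cumulative variance >= 70%: 4; scree elbow: 3). However, k = 5 satisfies the cumulative variance criterion and lies within the 3-to-5 range read from the scree plot. With Kaiser's Rule as the primary criterion, k = 5 components are retained, explaining 75.04% of total variance.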
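As a quick check, the retention rules can be re-applied to the rounded eigenvalues printed in Table 10. This is a sketch on the published values only (the first eight eigenvalues and p = 37), not a re-run of the analysis; variable names are illustrative.

```r
# First eight eigenvalues as printed in Table 10
eig_top <- c(13.2199, 7.5333, 3.6463, 1.9089, 1.4580, 0.9733, 0.8720, 0.7915)
p       <- 37   # number of variables = sum of all eigenvalues

# Kaiser's Rule: eigenvalue > 1
n_kaiser <- sum(eig_top > 1)

# Cumulative variance: first component count reaching 70%
cum_share <- cumsum(eig_top) / p
n_var70   <- which(cum_share >= 0.70)[1]

# Scree heuristic: successive drops in eigenvalue; the elbow is where the
# drops become small relative to the earlier ones
drops <- -diff(eig_top)

list(n_kaiser = n_kaiser, n_var70 = n_var70,
     cum_share = round(cum_share, 4), drops = round(drops, 4))
```

The successive drops shrink sharply after the third component, which is the numerical counterpart of the elbow read from the scree plot.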

4.4 Component Loading Matrix

loadings_mat <- pc$rotation[, 1:n_comp] %*% diag(sqrt(eig_val[1:n_comp]))
colnames(loadings_mat) <- paste0("PC", 1:n_comp)
h2 <- rowSums(loadings_mat^2)

load_df <- as.data.frame(round(loadings_mat, 3))
load_df$h2 <- round(h2, 3)
load_df <- load_df %>% arrange(desc(h2))

load_df %>%
  kable(caption = "Table 12. Unrotated Loading Matrix with Communalities (h2). Sorted by h2 descending. L_ij = e_ij x sqrt(lambda_j) = correlation between variable i and PC j.",
        col.names = c(paste0("PC", 1:n_comp), "h2")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(n_comp + 2, bold = TRUE,
              color = ifelse(load_df$h2 >= 0.70, "darkgreen",
                      ifelse(load_df$h2 >= 0.50, "#E67E22", "red"))) %>%
  scroll_box(width = "100%", height = "450px")

Table 12. Unrotated Loading Matrix with Communalities (h2). Sorted by h2 descending. L_ij = e_ij x sqrt(lambda_j) = correlation between variable i and PC j.
	PC1	PC2	PC3	PC4	PC5	h2
defending_standing_tackle	0.134	-0.821	-0.469	-0.106	-0.034	0.925
defending_sliding_tackle	0.175	-0.799	-0.491	-0.092	-0.024	0.919
mentality_interceptions	0.091	-0.836	-0.442	-0.110	0.023	0.915
defending_marking_awareness	0.108	-0.841	-0.416	-0.101	0.010	0.902
overall	-0.713	-0.568	0.046	0.216	-0.085	0.887
attacking_finishing	-0.812	0.307	0.341	-0.009	0.031	0.871
potential	-0.473	-0.282	-0.069	0.391	-0.639	0.869
movement_acceleration	-0.449	0.437	-0.386	0.561	0.084	0.862
skill_ball_control	-0.896	-0.165	-0.004	0.054	-0.165	0.859
attacking_short_passing	-0.756	-0.458	-0.113	-0.091	-0.189	0.838
skill_dribbling	-0.896	0.117	-0.073	0.060	-0.105	0.837
power_long_shots	-0.867	0.113	0.210	-0.144	0.077	0.835
mentality_vision	-0.880	-0.042	-0.048	-0.180	-0.056	0.814
attacking_volleys	-0.806	0.207	0.333	-0.060	0.051	0.809
movement_agility	-0.650	0.394	-0.403	0.211	0.154	0.808
power_strength	0.099	-0.668	0.512	0.234	0.161	0.798
mentality_positioning	-0.855	0.202	0.133	0.016	0.076	0.796
movement_sprint_speed	-0.374	0.350	-0.278	0.665	0.083	0.790
skill_long_passing	-0.631	-0.503	-0.236	-0.242	-0.149	0.788
age	-0.304	-0.403	0.197	-0.298	0.630	0.780
movement_reactions	-0.658	-0.563	0.072	0.121	-0.030	0.771
skill_curve	-0.855	0.053	-0.054	-0.169	0.035	0.766
mentality_composure	-0.753	-0.434	0.092	0.018	-0.049	0.766
power_shot_power	-0.805	-0.041	0.312	-0.063	0.039	0.752
height_cm	0.315	-0.457	0.612	0.160	-0.207	0.751
movement_balance	-0.503	0.394	-0.545	0.016	0.181	0.739
attacking_heading_accuracy	-0.069	-0.640	0.489	0.268	0.075	0.732
skill_fk_accuracy	-0.781	0.018	0.018	-0.340	0.077	0.732
mentality_penalties	-0.717	0.192	0.398	-0.120	0.066	0.729
attacking_crossing	-0.755	-0.056	-0.316	-0.081	0.056	0.682
weight_kg	0.200	-0.487	0.616	0.141	-0.041	0.678
mentality_aggression	-0.120	-0.783	-0.071	0.087	0.175	0.670
power_stamina	-0.383	-0.402	-0.236	0.318	0.330	0.574
power_jumping	0.030	-0.430	0.114	0.420	0.434	0.564
skill_moves	-0.714	0.198	0.015	0.018	-0.061	0.553
international_reputation	-0.378	-0.270	0.118	-0.066	-0.174	0.264
weak_foot	-0.361	0.072	0.058	-0.023	-0.020	0.140

Technical Interpretation – Unrotated Loading Matrix:

Each cell L_ij is the Pearson correlation between variable i and principal component j, ranging from -1 to +1. The overall orientation of each component is arbitrary: multiplying an entire column by -1 yields an equally valid solution. Relative signs within a column do carry meaning, since variables with opposite signs move in opposite directions along that component.
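The reading of loadings as correlations can be verified on any standardized PCA. The sketch below uses the built-in USArrests dataset as stand-in data (an assumption for illustration; it is not the FIFA data) and also shows that flipping the sign of a whole component leaves communalities unchanged.

```r
# Stand-in data: built-in USArrests, standardized inside prcomp()
X       <- USArrests
pc_demo <- prcomp(X, scale. = TRUE, center = TRUE)

# Loadings = eigenvector entries scaled by the component standard deviation
L_demo <- pc_demo$rotation %*% diag(pc_demo$sdev)
colnames(L_demo) <- colnames(pc_demo$rotation)

# Direct correlations between each variable and each component's scores
r_demo <- cor(X, pc_demo$x)
max_gap <- max(abs(L_demo - r_demo))   # numerically zero

# Flipping a whole column changes signs but not communalities
L_flip <- L_demo
L_flip[, 1] <- -L_flip[, 1]
same_h2 <- all.equal(rowSums(L_flip[, 1:2]^2), rowSums(L_demo[, 1:2]^2))
```

The equality holds because the variables are standardized (scale. = TRUE); with unscaled data the loadings would need to be divided by each variable's standard deviation.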

PC1 shows high absolute loadings (|L| > 0.70) on nearly all technical and attacking variables: skill_dribbling (-0.896), skill_ball_control (-0.896), mentality_vision (-0.880), power_long_shots (-0.867), skill_curve (-0.855), mentality_positioning (-0.855). All of these loadings are negative, which reflects the arbitrary orientation of this eigenvector: higher technical ability corresponds to a more negative PC1 score. PC1 represents the dimension of general technical mastery and attacking capability: it primarily separates high-ability technical players from low-ability or defensively specialized ones.

PC2 loads strongly on defending attributes: defending_marking_awareness (-0.841), mentality_interceptions (-0.836), defending_standing_tackle (-0.821), mentality_aggression (-0.783), while finishing-type attacking attributes load positively (attacking_finishing 0.307) or near zero. PC2 represents the defending specialization axis. Because principal component scores are uncorrelated by construction, a player's PC1 score carries no linear information about their PC2 score.

PC3 captures physical build: height_cm (0.612), weight_kg (0.616), power_strength (0.512) load positively, while movement_agility (-0.403) and movement_balance (-0.545) load negatively. This component encodes the biomechanical trade-off between body mass and mobility.

PC4 isolates pure speed: movement_sprint_speed (0.665) and movement_acceleration (0.561) dominate. The next-largest loadings are power_jumping (0.420) and potential (0.391), and most remaining variables fall below 0.30 in absolute value. Speed as a standalone latent dimension, orthogonal to technical quality and physical bulk, is well established in sports science literature.

PC5 captures the career development axis: potential and age load in opposing directions, encoding how far a player’s current ability is from their projected ceiling.

Communalities show that 29 of 37 variables (78.4%) are well represented (h2 >= 0.70), and a further 6 are adequately represented (0.50 <= h2 < 0.70). The two poorly represented variables are weak_foot (h2 = 0.140) and international_reputation (h2 = 0.264). These attributes measure individual-specific traits (lateralization and commercial profile) that are largely unrelated to the five latent ability dimensions. Because communalities are invariant under orthogonal rotation, VARIMAX will not improve them.
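The communality tally can be reproduced from the h2 column of Table 12. The sketch below simply re-enters the 37 printed values (rounded to three decimals) and bins them with the same cut-offs used for the colour coding; the object names are illustrative.

```r
# h2 values as printed in Table 12, in table order
h2_tab <- c(0.925, 0.919, 0.915, 0.902, 0.887, 0.871, 0.869, 0.862, 0.859,
            0.838, 0.837, 0.835, 0.814, 0.809, 0.808, 0.798, 0.796, 0.790,
            0.788, 0.780, 0.771, 0.766, 0.766, 0.752, 0.751, 0.739, 0.732,
            0.732, 0.729, 0.682, 0.678, 0.670, 0.574, 0.564, 0.553, 0.264,
            0.140)

# Bands: [0, 0.50), [0.50, 0.70), [0.70, 1)
band <- cut(h2_tab, breaks = c(0, 0.50, 0.70, 1), right = FALSE,
            labels = c("poor (< 0.50)", "adequate (0.50-0.70)",
                       "well (>= 0.70)"))
counts <- table(band)

# Mean communality equals the share of total variance retained
mean_h2 <- mean(h2_tab)

counts
round(100 * mean_h2, 2)
```

The mean communality matching the cumulative variance of the five retained components is a useful consistency check between Table 10 and Table 12.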

4.5 Biplot

fviz_pca_biplot(
  pc,
  axes      = c(1, 2),
  geom.ind  = "point",
  col.ind   = "steelblue",
  alpha.ind = 0.15,
  col.var   = "contrib",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel     = FALSE,
  label     = "var",
  title     = "Biplot PCA FIFA 23 (PC1 vs PC2)"
) + theme_minimal()

Figure 4. PCA Biplot (PC1 vs PC2). Arrows = variable loadings. Points = individual player scores. Arrow direction and length indicate correlation with the component axes.

Technical Interpretation – Biplot:

Arrows pointing in the same general direction indicate positively correlated variables that tend to load on the same factor. The attacking/skill arrows (skill_dribbling, skill_ball_control, power_long_shots, mentality_vision) all point toward the left of the PC1 axis, forming a tight bundle that confirms strong positive intercorrelation within this cluster. Given the loadings in Table 12 (PC1 near zero, PC2 between -0.82 and -0.84), the defending arrows (defending_standing_tackle, defending_marking_awareness, mentality_interceptions) point toward the negative end of the PC2 axis, nearly perpendicular to the attacking arrows. This near-perpendicularity shows that the two clusters are close to uncorrelated in the PC1-PC2 plane, and it anticipates the separation of RC1 and RC2 in the rotated solution: technical/attacking ability and defensive ability emerge as uncorrelated latent dimensions.

The player point cloud is concentrated near the origin (average players on both dimensions). A player's score moves in the direction of the arrows for the attributes on which they are strong, so technically strong attacking players extend toward the left and slightly toward positive PC2 (attacking_finishing loads 0.307 on PC2), while defensive specialists extend toward the negative end of PC2. Players strong on both dimensions, such as complete midfielders, fall toward the negative ends of both axes and are comparatively rare.

4.6 Variable Contribution

fviz_contrib(
  pc, choice = "var", axes = 1:2, top = 20,
  fill = "#3498DB", color = "#1A5276"
) +
  labs(title = "Top 20 Variable Contributions to PC1 and PC2 Combined") +
  theme_minimal()

Figure 5. Top 20 Variables Contributing to PC1 and PC2. The red dashed line marks the expected equal contribution level (100/37 = 2.7%).

5 D. Factor Analysis (FA)

5.1 VARIMAX Rotation

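Factor Analysis with VARIMAX rotation builds on the PCA solution by applying an orthogonal rotation to the 5 retained component axes. As implemented here with principal(), this is a rotated principal components solution rather than a common factor model with separately estimated unique variances; the factor model below serves as the conceptual frame for interpreting the loadings. The rotation maximizes the variance of squared loadings within each factor (the VARIMAX criterion), aiming for a “simple structure” where each variable loads highly on one factor and near-zero on all others. Total variance explained (75.04%) is unchanged by rotation; only the distribution across factors changes to improve interpretability.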

The common factor model: \[X_j = \lambda_{j1}F_1 + \lambda_{j2}F_2 + \cdots + \lambda_{jk}F_k + \varepsilon_j\]

where $F_i$ are unobservable latent factors, $\lambda_{ji}$ are factor loadings, and $\varepsilon_j$ is unique variance specific to variable j (not shared with any factor).
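That an orthogonal rotation redistributes variance across factors without changing communalities or the total can be illustrated with base R. This sketch uses stats::varimax() on the built-in USArrests data as a stand-in; it is not the principal() call used in this report, and the object names are illustrative.

```r
# Unrotated loadings for the first two components of a standardized PCA
pc_demo <- prcomp(USArrests, scale. = TRUE)
L_unrot <- pc_demo$rotation[, 1:2] %*% diag(pc_demo$sdev[1:2])

# Orthogonal VARIMAX rotation (base R implementation)
rot   <- varimax(L_unrot)
L_rot <- unclass(rot$loadings)

# Communalities (row sums of squared loadings) before and after rotation
h2_unrot <- rowSums(L_unrot^2)
h2_rot   <- rowSums(L_rot^2)

# Variance per factor (column sums) changes; the total does not
ss_unrot <- colSums(L_unrot^2)
ss_rot   <- colSums(L_rot^2)

rbind(unrotated = ss_unrot, rotated = ss_rot)
```

The same logic is why the rotated factors in this report jointly explain the same 75.04% as the five unrotated components, while the per-factor shares differ from those in Table 10.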

Loading interpretation threshold (Hair et al., 2019):

Absolute Loading	Interpretation
>= 0.70	Strongly significant
>= 0.50	Practically significant
>= 0.40	Acceptable (n >= 200)
< 0.40	Not considered dominant

fa_rot <- principal(
  data_final,
  nfactors = n_comp,
  rotate   = "varimax",
  scores   = TRUE
)

loadings_rot <- fa_rot$loadings[, 1:n_comp]
class(loadings_rot) <- "matrix"
h2_rot <- fa_rot$communality

load_rot_df <- data.frame(round(loadings_rot, 3), h2 = round(h2_rot, 3)) %>%
  arrange(desc(h2))

load_rot_df %>%
  kable(caption = "Table 13. VARIMAX Rotated Loading Matrix with Communalities (h2). Sorted by h2 descending.",
        col.names = c(paste0("RC", 1:n_comp), "h2")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(n_comp + 2, bold = TRUE,
              color = ifelse(load_rot_df$h2 >= 0.70, "darkgreen",
                      ifelse(load_rot_df$h2 >= 0.50, "#E67E22", "red"))) %>%
  scroll_box(width = "100%", height = "450px")

Table 13. VARIMAX Rotated Loading Matrix with Communalities (h2). Sorted by h2 descending.
	RC1	RC2	RC3	RC4	RC5	h2
defending_standing_tackle	-0.139	0.944	0.084	-0.081	0.027	0.925
defending_sliding_tackle	-0.187	0.935	0.066	-0.067	0.024	0.919
mentality_interceptions	-0.094	0.944	0.107	-0.063	-0.030	0.915
defending_marking_awareness	-0.105	0.931	0.134	-0.072	-0.019	0.902
overall	0.700	0.440	0.336	0.260	0.150	0.887
attacking_finishing	0.822	-0.417	-0.019	0.136	-0.044	0.871
potential	0.431	0.237	0.173	0.218	0.741	0.869
movement_acceleration	0.220	-0.190	-0.386	0.774	0.174	0.862
skill_ball_control	0.872	0.158	-0.006	0.196	0.191	0.859
attacking_short_passing	0.765	0.470	0.042	0.031	0.167	0.838
skill_dribbling	0.834	-0.036	-0.210	0.270	0.151	0.837
power_long_shots	0.885	-0.168	-0.062	0.070	-0.120	0.835
mentality_vision	0.875	0.107	-0.185	0.056	0.018	0.814
attacking_volleys	0.830	-0.324	0.014	0.090	-0.081	0.809
movement_agility	0.471	-0.101	-0.522	0.551	-0.002	0.808
power_strength	-0.002	0.238	0.848	0.013	-0.150	0.798
mentality_positioning	0.824	-0.215	-0.111	0.234	-0.055	0.796
movement_sprint_speed	0.156	-0.192	-0.212	0.804	0.194	0.790
skill_long_passing	0.650	0.589	-0.062	-0.083	0.089	0.788
age	0.372	0.269	0.239	-0.057	-0.713	0.780
movement_reactions	0.666	0.430	0.326	0.178	0.063	0.771
skill_curve	0.835	0.031	-0.233	0.102	-0.061	0.766
mentality_composure	0.771	0.326	0.222	0.114	0.050	0.766
power_shot_power	0.844	-0.108	0.135	0.065	-0.075	0.752
height_cm	-0.168	0.007	0.801	-0.239	0.158	0.751
movement_balance	0.336	-0.005	-0.678	0.401	-0.080	0.739
attacking_heading_accuracy	0.153	0.229	0.806	0.063	-0.051	0.732
skill_fk_accuracy	0.808	0.036	-0.210	-0.072	-0.168	0.732
mentality_penalties	0.769	-0.345	0.061	0.002	-0.125	0.729
attacking_crossing	0.679	0.254	-0.325	0.225	-0.026	0.682
weight_kg	-0.062	0.036	0.802	-0.173	-0.001	0.678
mentality_aggression	0.130	0.680	0.399	0.106	-0.142	0.670
power_stamina	0.273	0.443	0.117	0.509	-0.176	0.574
power_jumping	-0.079	0.249	0.482	0.423	-0.290	0.564
skill_moves	0.675	-0.152	-0.181	0.187	0.084	0.553
international_reputation	0.428	0.173	0.165	-0.085	0.130	0.264
weak_foot	0.358	-0.078	-0.050	0.053	0.015	0.140

Technical Interpretation – VARIMAX Rotated Loading Matrix:

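After VARIMAX rotation, most variables load on a single dominant factor with small loadings on the others, giving an approximately simple structure; the exceptions are discussed with the heatmap in Section 5.3. The uniformly negative PC1 loadings of the technical variables become positive on RC1, which makes interpretation more straightforward. Negative dominant loadings remain where they are substantively meaningful, for example age on RC5 (-0.713) and movement_balance on RC3 (-0.678).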

RC1 – Technical and Attacking Ability: Dominant variables (|L| >= 0.80): power_long_shots (0.885), mentality_vision (0.875), skill_ball_control (0.872), power_shot_power (0.844), skill_dribbling (0.834), skill_curve (0.835), attacking_volleys (0.830), mentality_positioning (0.824), attacking_finishing (0.822), skill_fk_accuracy (0.808). All 10 variables load above 0.80 on RC1 and below 0.45 on all other factors, demonstrating excellent simple structure. RC1 represents the comprehensive offensive and technical toolkit: the ability to shoot with power and accuracy, dribble past opponents, place the ball precisely, create chances through vision, and position intelligently to receive passes and score goals. Players with very high RC1 scores are creative attacking players such as classic number 10s and technically gifted forwards.

RC2 – Defensive and Aggression: Dominant variables: mentality_interceptions (0.944), defending_standing_tackle (0.944), defending_sliding_tackle (0.935), defending_marking_awareness (0.931). These four loadings at 0.93 to 0.944 are near-perfect and indicate that these four attributes are effectively measuring the same underlying defensive competence construct from slightly different operational angles: winning the ball back through anticipation (interceptions), challenging in the tackle (standing and sliding), and reading the opponent’s movement (marking awareness). mentality_aggression (0.680) adds the competitive intensity and pressing quality dimension. The negative loading of attacking_finishing (-0.417) on RC2 quantifies the systematic trade-off in FIFA 23 design: players who are defensively excellent tend to have systematically lower attacking finishing.

RC3 – Physical Strength and Aerial Ability: Dominant variables: power_strength (0.848), attacking_heading_accuracy (0.806), weight_kg (0.802), height_cm (0.801). The inclusion of attacking_heading_accuracy in this physical factor rather than RC1 is analytically informative: in this dataset heading accuracy covaries far more strongly with physical attributes (height, weight, strength) than with technical skill, loading only 0.153 on RC1. This is why VARIMAX assigns it to RC3 rather than the technical factor. The negative loadings of movement_agility (-0.522) and movement_balance (-0.678) on RC3 reflect the association between larger body mass and reduced mobility: taller, heavier, stronger players tend to have lower agility and balance ratings.

RC4 – Speed and Stamina: Dominant variables: movement_sprint_speed (0.804), movement_acceleration (0.774), movement_agility (0.551), power_stamina (0.509). RC4 isolates the “athletic engine” of a player: peak running speed, rate of acceleration to reach that speed, directional quickness, and aerobic endurance. The combination of sprint speed with stamina is practically coherent: in modern high-intensity football, a player who is fast but lacks stamina cannot sustain their speed advantage for 90 minutes. RC4 is uncorrelated with RC1 (technical skill) by construction under an orthogonal rotation, and the speed variables carry only small RC1 loadings (sprint speed 0.156, acceleration 0.220): a technically average but explosively fast player scores high on RC4 regardless of their RC1 score.

RC5 – Youth Potential vs. Experience: Defining contrast: potential (0.741) versus age (-0.713). This factor does not describe what a player can do right now but rather where they are in their career development trajectory. A high RC5 score indicates a young player with a large gap between potential ceiling and current overall rating (developmental upside). A low or negative RC5 score indicates a veteran player whose current ability is at or near their potential ceiling (peak or post-peak career stage). RC5 is unique among the five in being a temporal dimension rather than a contemporaneous ability dimension.

5.2 Dominant Loadings per Factor

dom_list <- lapply(1:n_comp, function(f) {
  ld  <- loadings_rot[, f]
  dom <- sort(ld[abs(ld) >= 0.40], decreasing = TRUE)
  # row.names = NULL prevents rbind() from creating suffixed duplicates
  # (e.g. "overall1") when a variable is salient on several factors
  data.frame(Factor    = paste0("RC", f),
             Variable  = names(dom),
             Loading   = round(dom, 3),
             row.names = NULL)
})
dom_df <- do.call(rbind, dom_list)

dom_df %>%
  kable(caption = "Table 14. Dominant Loadings per Factor (|loading| >= 0.40, sorted descending within factor)",
        row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(1, bold = TRUE) %>%
  # Column 3 is Loading now that the row-name column is suppressed
  column_spec(3, bold = TRUE,
              color = ifelse(abs(dom_df$Loading) >= 0.70,
                             "darkgreen", "#E67E22"))

Table 14. Dominant Loadings per Factor (|loading| >= 0.40, sorted descending within factor)
Factor	Variable	Loading
RC1	power_long_shots	0.885
RC1	mentality_vision	0.875
RC1	skill_ball_control	0.872
RC1	power_shot_power	0.844
RC1	skill_curve	0.835
RC1	skill_dribbling	0.834
RC1	attacking_volleys	0.830
RC1	mentality_positioning	0.824
RC1	attacking_finishing	0.822
RC1	skill_fk_accuracy	0.808
RC1	mentality_composure	0.771
RC1	mentality_penalties	0.769
RC1	attacking_short_passing	0.765
RC1	overall	0.700
RC1	attacking_crossing	0.679
RC1	skill_moves	0.675
RC1	movement_reactions	0.666
RC1	skill_long_passing	0.650
RC1	movement_agility	0.471
RC1	potential	0.431
RC1	international_reputation	0.428
RC2	defending_standing_tackle	0.944
RC2	mentality_interceptions	0.944
RC2	defending_sliding_tackle	0.935
RC2	defending_marking_awareness	0.931
RC2	mentality_aggression	0.680
RC2	skill_long_passing	0.589
RC2	attacking_short_passing	0.470
RC2	power_stamina	0.443
RC2	overall	0.440
RC2	movement_reactions	0.430
RC2	attacking_finishing	-0.417
RC3	power_strength	0.848
RC3	attacking_heading_accuracy	0.806
RC3	weight_kg	0.802
RC3	height_cm	0.801
RC3	power_jumping	0.482
RC3	movement_agility	-0.522
RC3	movement_balance	-0.678
RC4	movement_sprint_speed	0.804
RC4	movement_acceleration	0.774
RC4	movement_agility	0.551
RC4	power_stamina	0.509
RC4	power_jumping	0.423
RC4	movement_balance	0.401
RC5	potential	0.741
RC5	age	-0.713

5.3 VARIMAX Loading Heatmap

load_heat <- loadings_rot[, 1:n_comp]
corrplot(
  load_heat,
  is.corr   = FALSE,
  method    = "color",
  tl.cex    = 0.7, tl.col = "black",
  col       = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
  title     = "VARIMAX Loading Heatmap -- 5 Rotated Factors",
  mar       = c(0, 0, 2, 0),
  col.lim   = c(-1, 1)   # named cl.lim in corrplot releases before 0.90
)

Figure 6. VARIMAX Loading Heatmap. Red = high positive loading, Blue = high negative loading. Each row should ideally show one dark cell (simple structure).

Technical Interpretation – VARIMAX Heatmap:

The heatmap is a visual test of simple structure quality. In perfect simple structure: every row (variable) has exactly one dark cell (one dominant factor), and every column (factor) has a clearly defined block of dark cells (a coherent variable cluster).

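RC1 column shows a solid dark red block across the attacking and skill variables in the upper portion of the heatmap, with mostly pale cells in the other columns for these rows. This is close to simple structure for the technical/attacking cluster; the main exceptions are skill_long_passing, attacking_short_passing, and attacking_finishing, which also carry moderate RC2 loadings (0.589, 0.470, and -0.417).

RC2 column shows a concentrated dark red block for the three defending variables and mentality_interceptions at the lower portion (loadings 0.931 to 0.944), a weaker band for mentality_aggression (0.680), and pale cells for most other variables. Because VARIMAX is an orthogonal rotation, RC1 and RC2 are uncorrelated by construction; what the heatmap adds is that the technical and defensive variables fall cleanly onto these two separate axes.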

RC3 column shows red for the height/weight/strength cluster and blue for agility/balance variables, encoding the physical build bipolar dimension.

RC4 and RC5 show more diffuse coloring because speed and career stage are less tightly clustered constructs. This is acceptable and expected: they are the narrowest, most specific dimensions and carry the smallest variance shares (roughly 7 to 8% and 4% after rotation, per the split-sample estimates in Table 16; the unrotated PC4 and PC5 account for 5.16% and 3.94%).

Cross-loading rows (variables with multiple colored cells): skill_long_passing shows moderate loadings on both RC1 (0.650) and RC2 (0.589), reflecting its dual role as a technical attribute used by both creative playmakers and defensive midfielders. movement_agility appears in RC1, RC3, and RC4, reflecting its multidimensional nature. These cross-loadings are not failures of the rotation but genuine reflections of the multifaceted nature of certain specific attributes.
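The cross-loading check described above can be sketched in a few lines of R. This is a minimal illustration, not the report's own code: the matrix `loadings_demo` is hypothetical and contains only the loadings quoted in this section, with every cell that is not reported set to 0 as a placeholder, and the 0.40 salience cutoff is an assumption based on the smallest values shown in the loading table.

```r
# Illustrative loadings: non-zero cells are values quoted in this section;
# zero cells are placeholders for unreported loadings, NOT actual results.
rc <- paste0("RC", 1:5)
loadings_demo <- matrix(
  c(0.650, 0.589,  0.000, 0.000,  0.000,   # skill_long_passing
    0.471, 0.000, -0.522, 0.551,  0.000,   # movement_agility
    0.431, 0.000,  0.000, 0.000,  0.741,   # potential
    0.000, 0.000,  0.000, 0.000, -0.713),  # age
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("skill_long_passing", "movement_agility", "potential", "age"), rc
  )
)

cutoff <- 0.40                                   # assumed salience threshold

# Count salient loadings per variable and find the dominant factor
n_salient     <- rowSums(abs(loadings_demo) >= cutoff)
dominant      <- rc[apply(abs(loadings_demo), 1, which.max)]
cross_loaders <- names(n_salient)[n_salient > 1]

print(data.frame(n_salient, dominant))
print(cross_loaders)
```

With the report's objects, the same logic would presumably be applied to `loadings_rot[, 1:n_comp]` in place of the placeholder matrix.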

5.4 Communality Analysis

comm_df <- data.frame(
  Variable       = names(h2_rot),
  Communality    = round(h2_rot, 3),
  Representation = ifelse(h2_rot >= 0.70, "Well explained (h2 >= 0.70)",
                   ifelse(h2_rot >= 0.50, "Adequately explained (0.50 <= h2 < 0.70)",
                   "Poorly explained (h2 < 0.50)"))
) %>% arrange(Communality)

comm_df %>%
  kable(caption = "Table 15. Variable Communalities -- Proportion of Variance Explained by 5-Factor Solution (sorted ascending)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(comm_df$Communality >= 0.70, "darkgreen",
                      ifelse(comm_df$Communality >= 0.50, "#E67E22", "red")))

Table 15. Variable Communalities – Proportion of Variance Explained by 5-Factor Solution (sorted ascending)
	Variable	Communality	Representation
weak_foot	weak_foot	0.140	Poorly explained (h2 < 0.50)
international_reputation	international_reputation	0.264	Poorly explained (h2 < 0.50)
skill_moves	skill_moves	0.553	Adequately explained (0.50 <= h2 < 0.70)
power_jumping	power_jumping	0.564	Adequately explained (0.50 <= h2 < 0.70)
power_stamina	power_stamina	0.574	Adequately explained (0.50 <= h2 < 0.70)
mentality_aggression	mentality_aggression	0.670	Adequately explained (0.50 <= h2 < 0.70)
weight_kg	weight_kg	0.678	Adequately explained (0.50 <= h2 < 0.70)
attacking_crossing	attacking_crossing	0.682	Adequately explained (0.50 <= h2 < 0.70)
mentality_penalties	mentality_penalties	0.729	Well explained (h2 >= 0.70)
attacking_heading_accuracy	attacking_heading_accuracy	0.732	Well explained (h2 >= 0.70)
skill_fk_accuracy	skill_fk_accuracy	0.732	Well explained (h2 >= 0.70)
movement_balance	movement_balance	0.739	Well explained (h2 >= 0.70)
height_cm	height_cm	0.751	Well explained (h2 >= 0.70)
power_shot_power	power_shot_power	0.752	Well explained (h2 >= 0.70)
skill_curve	skill_curve	0.766	Well explained (h2 >= 0.70)
mentality_composure	mentality_composure	0.766	Well explained (h2 >= 0.70)
movement_reactions	movement_reactions	0.771	Well explained (h2 >= 0.70)
age	age	0.780	Well explained (h2 >= 0.70)
skill_long_passing	skill_long_passing	0.788	Well explained (h2 >= 0.70)
movement_sprint_speed	movement_sprint_speed	0.790	Well explained (h2 >= 0.70)
mentality_positioning	mentality_positioning	0.796	Well explained (h2 >= 0.70)
power_strength	power_strength	0.798	Well explained (h2 >= 0.70)
movement_agility	movement_agility	0.808	Well explained (h2 >= 0.70)
attacking_volleys	attacking_volleys	0.809	Well explained (h2 >= 0.70)
mentality_vision	mentality_vision	0.814	Well explained (h2 >= 0.70)
power_long_shots	power_long_shots	0.835	Well explained (h2 >= 0.70)
skill_dribbling	skill_dribbling	0.837	Well explained (h2 >= 0.70)
attacking_short_passing	attacking_short_passing	0.838	Well explained (h2 >= 0.70)
skill_ball_control	skill_ball_control	0.859	Well explained (h2 >= 0.70)
movement_acceleration	movement_acceleration	0.862	Well explained (h2 >= 0.70)
potential	potential	0.869	Well explained (h2 >= 0.70)
attacking_finishing	attacking_finishing	0.871	Well explained (h2 >= 0.70)
overall	overall	0.887	Well explained (h2 >= 0.70)
defending_marking_awareness	defending_marking_awareness	0.902	Well explained (h2 >= 0.70)
mentality_interceptions	mentality_interceptions	0.915	Well explained (h2 >= 0.70)
defending_sliding_tackle	defending_sliding_tackle	0.919	Well explained (h2 >= 0.70)
defending_standing_tackle	defending_standing_tackle	0.925	Well explained (h2 >= 0.70)

Technical Interpretation – Communalities:

Communality (h2) is the proportion of a variable’s total variance explained by the 5 retained factors. h2 = 1 means the variable is perfectly explained; h2 = 0 means it is completely unique.
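As a worked check of this definition, a communality is the sum of a variable's squared loadings across the retained factors. The sketch below uses only the three salient loadings reported for movement_agility; its loadings on RC2 and RC5 are not shown in the loading table, so the partial sum is expected to fall slightly short of the value in Table 15. The loadings are rounded, so the arithmetic is approximate.

```r
# Salient loadings of movement_agility as reported in the rotated loading table
agility_loadings <- c(RC1 = 0.471, RC3 = -0.522, RC4 = 0.551)

# Communality contribution from these three factors: sum of squared loadings
h2_partial  <- sum(agility_loadings^2)

# Communality reported in Table 15 (all five factors)
h2_reported <- 0.808

round(h2_partial, 3)                 # share explained by RC1, RC3 and RC4
round(h2_reported - h2_partial, 3)   # remainder attributable to RC2 and RC5
```

The three salient loadings account for almost all of the reported communality, which is why the small unreported loadings can be ignored when interpreting this variable.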

29 of 37 variables (78.4%) achieve h2 >= 0.70, and a further 6 fall between 0.50 and 0.70, confirming good coverage. The highest communalities belong to the defending cluster: defending_standing_tackle (0.925), defending_sliding_tackle (0.919), mentality_interceptions (0.915), defending_marking_awareness (0.902). These values reflect the very high intercorrelation within the defending cluster: most of their variance is captured by a single factor (RC2), leaving little unique residual.

weak_foot (h2 = 0.140): Only 14% of weak_foot variance is shared with the 5 factors. The weak-foot rating (proficiency with the non-dominant foot) appears to be largely an individual characteristic that is only weakly related to the ability dimensions. A player can have skill_dribbling = 90 and weak_foot = 3, or skill_dribbling = 55 and weak_foot = 5. With so little consistent relationship between technical quality and weak-foot rating, the 5-factor model captures little of its variance.

international_reputation (h2 = 0.264): Only 26.4% is explained. Reputation in FIFA 23 plausibly reflects commercial partnerships, historical peak performance, nationality and league visibility, and media prominence, none of which maps directly onto the five ability dimensions. A veteran star may retain reputation = 4 as their actual attributes decline, while a technically exceptional but commercially overlooked player may hold reputation = 1 despite high ability scores.

These two variables are retained but carry minimal interpretive weight when discussing the factor structure. Their low communalities are substantively meaningful, not analytical failures.

5.5 Split-Sample Validation

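set.seed(42)
# Randomly assign half of the rows to subsample 1; the remaining rows form subsample 2
idx_split <- sample(seq_len(nrow(data_final)), floor(nrow(data_final) / 2))
data_s1   <- data_final[idx_split, ]
data_s2   <- data_final[-idx_split, ]

# Refit the VARIMAX-rotated principal-component solution on each half
fa_s1 <- principal(data_s1, nfactors = n_comp, rotate = "varimax")
fa_s2 <- principal(data_s2, nfactors = n_comp, rotate = "varimax")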

var_s1 <- round(fa_s1$Vaccounted[2, 1:n_comp] * 100, 2)
var_s2 <- round(fa_s2$Vaccounted[2, 1:n_comp] * 100, 2)

val_df <- data.frame(
  Factor     = paste0("RC", 1:n_comp),
  Sample_1   = var_s1,
  Sample_2   = var_s2,
  Difference = round(abs(var_s1 - var_s2), 2),
  Stable     = ifelse(abs(var_s1 - var_s2) < 5, "Stable", "Unstable")
)

val_df %>%
  kable(caption = "Table 16. Split-Sample Validation (n = 2,500 each, set.seed = 42)",
        col.names = c("Factor", "Sample 1 Var (%)", "Sample 2 Var (%)",
                      "Difference (%)", "Stable (diff < 5%)?")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(4, bold = TRUE) %>%
  column_spec(5, bold = TRUE,
              color = ifelse(val_df$Stable == "Stable", "darkgreen", "red"))

Table 16. Split-Sample Validation (n = 2,500 each, set.seed = 42)
	Factor	Sample 1 Var (%)	Sample 2 Var (%)	Difference (%)	Stable (diff < 5%)?
RC1	RC1	33.27	33.75	0.48	Stable
RC2	RC2	17.21	16.67	0.54	Stable
RC3	RC3	12.85	12.88	0.03	Stable
RC4	RC4	7.21	7.97	0.76	Stable
RC5	RC5	4.28	4.10	0.18	Stable

max_diff <- max(val_df$Difference)
cat(sprintf("\nMaximum variance difference across all factors: %.2f%%\n", max_diff))

## 
## Maximum variance difference across all factors: 0.76%

cat(ifelse(max_diff < 5,
           "STABLE: solution generalizes reliably beyond this specific sample.",
           "WARNING: solution may be unstable."))

## STABLE: solution generalizes reliably beyond this specific sample.

Technical Interpretation – Split-Sample Validation:

Split-sample validation tests whether the 5-factor structure is a stable property of the FIFA 23 player population or an artifact of the specific 5,000-player random sample. The procedure splits the data into two disjoint subsamples of n = 2,500 (set.seed = 42 for reproducibility), refits the VARIMAX-rotated principal-component solution with principal() on each half, and compares the proportion of variance explained by each factor. If the structure were overfitted, substantial variance differences would be expected between the two halves. Note that this comparison covers variance shares only; it does not by itself verify that the same variables load on the same factors in both halves.

Maximum difference = 0.76 percentage points (RC4). All five factors differ by less than 1 percentage point, far below the 5-point stability threshold. Factor by factor:

RC1 (33.27% vs. 33.75%, diff = 0.48%): The dominant Technical and Attacking Ability factor is virtually identical across both halves, consistent with it being the strongest structural feature of the FIFA 23 player population.

RC2 (17.21% vs. 16.67%, diff = 0.54%): The defending dimension replicates equally well, consistent with the very high loadings (0.931 to 0.944) that define it.

RC3 (12.85% vs. 12.88%, diff = 0.03%): The smallest difference in the entire solution. Plausibly, physical measurements (height, weight, strength) are assigned more objectively in FIFA than technical ratings, producing a consistent distribution across subsamples.

RC4 (7.21% vs. 7.97%, diff = 0.76%): The largest difference, plausibly reflecting slight sampling variability in the proportion of speed-specialist wingers and full-backs between the two halves.

RC5 (4.28% vs. 4.10%, diff = 0.18%): Despite being the most conceptually unusual factor (career stage rather than ability), it replicates with excellent stability.

Practical conclusion: On this evidence the 5-factor VARIMAX solution is stable across subsamples. Factor scores computed from it are a reasonable basis for downstream analyses (player clustering, position classification, market value prediction), with little indication that the variance structure is sample-dependent.
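A complementary check, not part of the original analysis, is to compare the loading patterns of the two halves directly using Tucker's congruence coefficient. The sketch below is a minimal base-R illustration on simulated stand-in data (two latent factors, six indicators), not the FIFA sample; the helper `rotated_loadings()` and all variable names are hypothetical.

```r
set.seed(1)

# Simulated stand-in data (NOT the FIFA sample): two latent factors, six indicators
n     <- 2000
f1    <- rnorm(n)
f2    <- rnorm(n)
noise <- function() rnorm(n, sd = 0.6)
x <- cbind(a1 = 0.8 * f1 + noise(), a2 = 0.8 * f1 + noise(), a3 = 0.8 * f1 + noise(),
           d1 = 0.8 * f2 + noise(), d2 = 0.8 * f2 + noise(), d3 = 0.8 * f2 + noise())

# Rotated component loadings: eigenvectors scaled by component SDs, then VARIMAX
rotated_loadings <- function(data, k) {
  pc  <- prcomp(data, scale. = TRUE)
  raw <- pc$rotation[, 1:k] %*% diag(pc$sdev[1:k], nrow = k)
  unclass(varimax(raw)$loadings)
}

# Two disjoint halves, each fitted separately
idx <- sample(seq_len(n), n %/% 2)
A   <- rotated_loadings(x[idx, ],  k = 2)
B   <- rotated_loadings(x[-idx, ], k = 2)

# Tucker's congruence between every factor in half 1 and every factor in half 2
congruence <- crossprod(A, B) / sqrt(outer(colSums(A^2), colSums(B^2)))
print(round(congruence, 3))

# Best match per factor; absolute value because factor order and sign are arbitrary
best_match <- apply(abs(congruence), 1, max)
print(round(best_match, 3))
```

Values close to 1 indicate that a factor has essentially the same loading pattern in both halves. With the report's objects, `psych::factor.congruence(fa_s1, fa_s2)` should give the analogous matrix, although that call was not run here.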

5.6 Factor Scores

factor_scores <- as.data.frame(fa_rot$scores)
colnames(factor_scores) <- paste0("Factor_", 1:n_comp)

fwrite(factor_scores, "fifa23_factor_scores.csv")

pc_scores <- as.data.frame(pc$x[, 1:n_comp])
fwrite(pc_scores, "fifa23_pc_scores.csv")

cat("Factor scores saved: fifa23_factor_scores.csv\n")

## Factor scores saved: fifa23_factor_scores.csv

cat("PC scores saved: fifa23_pc_scores.csv\n")

## PC scores saved: fifa23_pc_scores.csv

round(sapply(factor_scores, function(x)
  c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))), 3) %>%
  t() %>%
  kable(caption = "Table 17. Factor Score Distribution (standardized: mean ~= 0, SD ~= 1)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  column_spec(1, bold = TRUE)

Table 17. Factor Score Distribution (standardized: mean ~= 0, SD ~= 1)
	Mean	SD	Min	Max
Factor_1	0	1	-2.820	3.794
Factor_2	0	1	-2.773	2.367
Factor_3	0	1	-2.787	3.618
Factor_4	0	1	-4.706	3.204
Factor_5	0	1	-3.055	3.188
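The mean of 0 and SD of 1 in Table 17 follow from how rotated component scores are constructed. The sketch below illustrates the general property in base R on simulated stand-in data; it is not the internal code of psych::principal(), and all object names are hypothetical.

```r
set.seed(2)

# Simulated stand-in data (NOT the FIFA sample) with two correlated variable pairs
x <- matrix(rnorm(500 * 6), ncol = 6)
x[, 2] <- x[, 1] + 0.5 * x[, 2]
x[, 4] <- x[, 3] + 0.5 * x[, 4]

k  <- 2
pc <- prcomp(x, scale. = TRUE)

# Standardized component scores: each PC score divided by its standard deviation
z_scores <- sweep(pc$x[, 1:k], 2, pc$sdev[1:k], "/")

# VARIMAX rotation of the loadings; the same orthogonal matrix rotates the scores
raw_loadings <- pc$rotation[, 1:k] %*% diag(pc$sdev[1:k], nrow = k)
rot          <- varimax(raw_loadings)
rot_scores   <- z_scores %*% rot$rotmat

score_means <- colMeans(rot_scores)        # zero up to floating-point error
score_sds   <- apply(rot_scores, 2, sd)    # one up to floating-point error
score_cor   <- cor(rot_scores)[1, 2]       # zero: orthogonal rotation keeps scores uncorrelated

print(round(c(score_means, score_sds, score_cor), 3))
```

Because the rotation matrix is orthogonal, the rotated scores inherit zero means, unit variances, and zero correlations from the standardized component scores, so only the Min and Max columns carry information about individual factors.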

6 Comprehensive Summary

6.1 Complete Results Overview

data.frame(
  Metric = c(
    "Dataset",
    "Total Raw Records",
    "Analysis Sample",
    "Variables Used",
    "Significant Correlation Pairs",
    "Bartlett Test Result",
    "Overall KMO",
    "KMO Classification",
    "Variables Removed (MSA < 0.50)",
    "Components Retained (Kaiser Rule)",
    "Total Variance Explained",
    "Variance by PC1 alone",
    "Variance by PC1 and PC2",
    "VARIMAX Factors",
    "Split-Sample Max Difference"
  ),
  Value = c(
    "FIFA 23 Complete Player Dataset (Kaggle, 2022)",
    "147,400 players",
    "5,000 players (set.seed = 123)",
    paste0(ncol(data_final),
           " numeric sub-attributes (6 aggregates excluded)"),
    "318 / 666 pairs (47.7%) with |r| > 0.3",
    "chi-sq = 218,585.80; df = 666; p ~= 0 (< 2.2e-16)",
    "0.942",
    "Marvelous (>= 0.90)",
    "0 variables removed (minimum MSAi = 0.708)",
    paste0(n_comp, " components (PC1 to PC", n_comp, ")"),
    paste0(round(cum_var[n_comp] * 100, 2), "% of total variance"),
    paste0(round(prop_var[1] * 100, 2), "%"),
    paste0(round(cum_var[2] * 100, 2), "%"),
    paste0("RC1: Technical-Attacking | RC2: Defensive-Aggression | ",
           "RC3: Physical-Aerial | RC4: Speed-Stamina | ",
           "RC5: Youth Potential vs. Experience"),
    paste0(max(val_df$Difference),
           "% (RC4) -- well below 5% stability threshold")
  )
) %>%
  kable(caption = "Table 18. Complete PCA and FA Results Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "22em") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(c(7, 8, 9, 10, 11, 14, 15), bold = TRUE, background = "#EBF5FB")

Table 18. Complete PCA and FA Results Summary
Metric	Value
Dataset	FIFA 23 Complete Player Dataset (Kaggle, 2022)
Total Raw Records	147,400 players
Analysis Sample	5,000 players (set.seed = 123)
Variables Used	37 numeric sub-attributes (6 aggregates excluded)
Significant Correlation Pairs	318 / 666 pairs (47.7%) with |r| > 0.3
Bartlett Test Result	chi-sq = 218,585.80; df = 666; p ~= 0 (< 2.2e-16)
Overall KMO	0.942
KMO Classification	Marvelous (>= 0.90)
Variables Removed (MSA < 0.50)	0 variables removed (minimum MSAi = 0.708)
Components Retained (Kaiser Rule)	5 components (PC1 to PC5)
Total Variance Explained	75.04% of total variance
Variance by PC1 alone	35.73%
Variance by PC1 and PC2	56.09%
VARIMAX Factors	RC1: Technical-Attacking | RC2: Defensive-Aggression | RC3: Physical-Aerial | RC4: Speed-Stamina | RC5: Youth Potential vs. Experience
Split-Sample Max Difference	0.76% (RC4) – well below 5% stability threshold

6.2 Key Findings

6.2.1 Finding 1: Exceptional Data Suitability (KMO = 0.942)

KMO = 0.942 means that the partial correlations among the 37 variables are small relative to their zero-order correlations, so most of the shared variance is attributable to common factors; this places the dataset in the highest practical category for PCA/FA. Bartlett’s chi-sq = 218,585.80 (df = 666, p ~= 0) rejects the identity matrix hypothesis with a test statistic roughly 300 times the 5% critical value (about 727). Together, these results indicate that the FIFA 23 player attribute data has a strong, recoverable latent structure. The exclusion of 6 aggregate attributes was methodologically essential: because they are computed from the sub-attributes, retaining them would have made the correlation matrix singular or nearly so, undermining the KMO and Bartlett diagnostics.
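For reference, the overall KMO statistic is the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations. The sketch below implements this definition in base R on simulated stand-in data (not the FIFA sample) and also computes the 5% chi-square critical value for df = 666 against the reported Bartlett statistic. The function `kmo_overall()` is a hypothetical helper, not the psych::KMO() implementation.

```r
# Overall KMO from a correlation matrix R
kmo_overall <- function(R) {
  R_inv   <- solve(R)
  # Partial correlations given all other variables (from the inverse matrix)
  partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
  off     <- row(R) != col(R)                 # off-diagonal cells only
  sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))
}

set.seed(3)
# Simulated stand-in data (NOT the FIFA sample): one common factor, five indicators
f        <- rnorm(1000)
x        <- sapply(1:5, function(i) 0.8 * f + rnorm(1000, sd = 0.6))
kmo_demo <- kmo_overall(cor(x))
print(round(kmo_demo, 3))

# Bartlett's test: 5% critical value for df = 666 and ratio to the reported statistic
crit_05 <- qchisq(0.95, df = 666)
ratio   <- 218585.80 / crit_05
print(round(c(critical_value = crit_05, ratio = ratio), 1))
```

A KMO near 1 arises when partial correlations are close to zero, meaning that pairwise associations largely disappear once the remaining variables are controlled for, which is the pattern expected under a common-factor structure.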

6.2.2 Finding 2: Five Latent Dimensions Explain 75% of Variance (7.4:1 Compression Ratio)

The reduction from 37 original attributes to 5 components retaining 75.04% of total variance achieves a 7.4:1 dimensionality compression with 24.96% information loss. Before rotation, the five components explain 35.73%, 20.36%, 9.85%, 5.16%, and 3.94% of variance (PC1 to PC5). VARIMAX redistributes this same 75.04% across the rotated factors, which are interpreted as RC1 – Technical and Attacking Ability, RC2 – Defensive and Aggression, RC3 – Physical Strength and Aerial, RC4 – Speed and Stamina, and RC5 – Youth Potential vs. Experience; the split-sample estimates in Table 16 put their shares at roughly 33%, 17%, 13%, 7 to 8%, and 4%. Each dimension is interpretable in terms of established football analytics concepts, which supports the view that the analysis has recovered genuine latent structure rather than noise.
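The arithmetic behind these figures can be verified directly. The sketch below assumes the five percentages quoted in this paragraph are the per-component shares of the unrotated solution, which is consistent with Table 18 reporting 35.73% for PC1 alone and 56.09% for PC1 and PC2; the object names are hypothetical.

```r
# Per-component variance shares (%) as quoted in the text
pc_var <- c(PC1 = 35.73, PC2 = 20.36, PC3 = 9.85, PC4 = 5.16, PC5 = 3.94)

cum_pct        <- cumsum(pc_var)          # cumulative variance explained
total_retained <- sum(pc_var)             # variance retained by 5 components
info_loss      <- 100 - total_retained    # variance not captured
compression    <- 37 / length(pc_var)     # original variables per component

print(round(cum_pct, 2))
print(round(c(total = total_retained, loss = info_loss, ratio = compression), 2))
```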

6.2.3 Finding 3: Attacking and Defending Load on Separate Orthogonal Dimensions

The negative correlations (-0.40 to -0.60) between attacking/skill variables and defending variables show that, at the level of individual attributes, players rated highly on one tend to be rated lower on the other. In the VARIMAX solution the two clusters load on separate components, RC1 and RC2, whose scores are uncorrelated by construction because the rotation is orthogonal. Orthogonality is therefore imposed by the method and is not in itself evidence of statistical independence; what the solution does establish is that the technical and defensive clusters are distinct enough to define separate axes. This has direct analytical implications: any method that combines these attributes into a single “overall quality” metric conflates two structurally distinct player types. The 5-factor solution separates them, enabling meaningful comparison of attackers and defenders within their respective specialization dimensions.

6.2.4 Finding 4: Two Variables Structurally Outside the Five-Factor Model

weak_foot (h2 = 0.140) and international_reputation (h2 = 0.264) have critically low communalities because they measure constructs genuinely orthogonal to the five ability dimensions. Their low communalities are not data quality failures but analytical evidence that these two attributes operate on different latent dimensions entirely (individual lateralization for weak_foot; commercial and historical reputation for international_reputation). A sixth factor specifically targeting reputation and marketability might capture international_reputation, but this is beyond the current analysis scope.

6.2.5 Finding 5: Solution is Stable and Generalizable (Max Difference = 0.76%)

All five factors replicate within 1 percentage point of variance explained across two disjoint n = 2,500 subsamples, far below the 5-point stability threshold. The most stable factor is RC3 (diff = 0.03%), plausibly reflecting the more objective nature of physical measurements. The least stable is RC4 (diff = 0.76%), plausibly due to subsample variation in speed-specialist positions. Overall, the variance structure appears to be a stable property of the FIFA 23 player population rather than a sampling artifact, which supports using the factor scores in downstream analyses.

7 References

7.1 Data Source

Leone, S. (2022). FIFA 23 Complete Player Dataset. Kaggle. https://www.kaggle.com/datasets/stefanoleone992/fifa-23-complete-player-dataset

7.2 Methodology and Theory

Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate Data Analysis (8th ed.). Cengage Learning.
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39(1), 31-36.
Xiong, S. (2026). A unified framework of principal component analysis and factor analysis. Journal of Multivariate Analysis, 211, 105529. https://doi.org/10.1016/j.jmva.2025.105529
Jewsbury, P. A., & Johnson, M. S. (2025). Principal component analysis on the covariance matrix for data reduction in large-scale assessments. Large-Scale Assessments in Education, 13(1), 30. https://doi.org/10.1186/s40536-025-00264-9
Woo, K., & Kim, K. (2024). Profiling the socioeconomic characteristics, dietary intake, and health status of Korean older adults. Epidemiology and Health, 46, e2024043. https://doi.org/10.4178/epih.e2024043
Ameliya, A., Piliang, Y. K. A., Hidayah, A., & Hasibuan, E. S. H. (2026). Penerapan principal component analysis untuk menentukan faktor-faktor yang mempengaruhi kemiskinan di Sumatera Utara. Algoritma, 4(1), 1-19. https://doi.org/10.62383/algoritma.v4i1.890

7.3 R Packages

psych – KMO(), cortest.bartlett(), principal()
factoextra – fviz_eig(), fviz_pca_biplot(), fviz_contrib()
corrplot – corrplot()
kableExtra – table styling
ggplot2 + tidyr – distribution visualization
data.table – fread(), fwrite()
Base R – prcomp(), cor(), det()

Report Generated Using R Markdown

Analysis Date: March 01, 2026

Multivariate Analysis – FIFA 23 PCA and FA Full Report

INT2024 | Course Lecturer: Ulfa Siti Nuraini, S.Stat., M.Stat.

PCA and FA: FIFA 23 Complete Player Dataset

Dimas Rafi Izzulhaq (24031554084), Rizqi Aqilah Cahayani Yuniarto (24031554087)

2026-03-01