Dataset Overview

This dataset originates from the UCI Machine Learning Repository and contains clinical records of patients evaluated for gallstone disease. The data was collected to identify physiological and biochemical patterns potentially associated with gallstone formation, through a combination of body composition measurements (via Bioelectrical Impedance Analysis) and standard blood biomarker panels.

General Information

Attribute Details
Source UCI Machine Learning Repository
Clinical context Gallstone disease evaluation
Total observations 319 patients
Total variables (raw) 40 (numeric + categorical)
Numeric variables analyzed 32 (after excluding categoricals and binary variables)
Final variables after KMO selection 29

Excluded Categorical Variables

Seven categorical variables were excluded prior to PCA/EFA, as both techniques require continuous numeric data:

Variable Description
Gallstone Status Diagnosis label (target variable)
Gender Patient sex
Comorbidity General comorbidity status
CAD Coronary Artery Disease
Hypothyroidism Hypothyroidism diagnosis
Hyperlipidemia Hyperlipidemia diagnosis
DM Diabetes Mellitus

Numeric Variables Analyzed

Variables fall into two functional groups:

Group 1 - Body Composition (BCA via BIA)
Measured using Bioelectrical Impedance Analysis (BIA), a non-invasive method estimating fluid distribution, fat mass, and lean mass: Age, Height, Weight, BMI, Total Body Water (TBW), Extracellular Water (ECW), Intracellular Water (ICW), ECF/TBW Ratio, Total Body Fat Ratio (TBFR), Lean Mass (LM), Body Protein Content, Visceral Fat Rating (VFR), Bone Mass (BM), Muscle Mass (MM), Obesity (%), Total Fat Content (TFC), Visceral Fat Area (VFA), Visceral Muscle Area (VMA), Hepatic Fat Accumulation (HFA)

Group 2 - Blood Biomarkers
Standard laboratory panel reflecting metabolic, hepatic, renal, and inflammatory function: Glucose, Total Cholesterol (TC), LDL, HDL, Triglyceride, AST, ALT, ALP, Creatinine, GFR, CRP, Hemoglobin (HGB), Vitamin D

Clinical Relevance: Body composition and blood biomarkers are closely interrelated. Visceral obesity influences lipid profiles, insulin resistance, and hepatic function. This complex, intercorrelated variable set makes PCA and EFA ideal tools for uncovering the underlying latent structure empirically.


Setup & Data Loading

library(readxl)
library(psych)
library(corrplot)
library(ggplot2)
library(dplyr)
library(factoextra)
library(tidyr)
library(tibble)
library(GPArotation)
library(ggrepel)
library(knitr)
library(kableExtra)

out_path <- getwd()
data_raw <- read_excel("dataset-uci.xlsx")

categorical_vars <- c("Gallstone Status", "Gender", "Comorbidity",
                      "CAD", "Hypothyroidism", "Hyperlipidemia", "DM")

data_num <- data_raw %>%
  select(-any_of(categorical_vars)) %>%
  select(where(is.numeric)) %>%
  select(where(~ n_distinct(.) > 2)) %>%
  na.omit()

n <- nrow(data_num)
p <- ncol(data_num)

Stage 1: Sample Size & Variable Ratio

Before running PCA or EFA, sample adequacy must be verified. Two criteria are evaluated: the minimum sample size and the observations-to-variables ratio. Both directly affect the stability and generalizability of extracted components or factors.

Criteria evaluated:

  • Minimum sample size (n ≥ 50): Smaller samples produce unstable loading estimates. A sample of n ≥ 200 is considered excellent for stable factor solutions (Goretzko & Bühner, 2020).
  • Observations-to-variables ratio (≥ 5:1): A ratio of 10:1 or higher is ideal (Watkins, 2021). At this level, the correlation matrix is stable and reliably invertible, which is a mathematical prerequisite for factor extraction.
cat(sprintf("Observations : %d\n", n))
## Observations : 319
cat(sprintf("Variables    : %d\n", p))
## Variables    : 32
cat(sprintf("Ratio        : %.2f:1  [%s]\n",
            n / p, ifelse(n >= 50 & (n / p) >= 5, "PASSED", "FAILED")))
## Ratio        : 9.97:1  [PASSED]

Result: With 319 observations and 32 variables, the ratio is 9.97:1 - nearly ideal at 10:1. This sample size is fully sufficient for stable factor extraction and generalization of results (Goretzko & Bühner, 2020).


Stage 2: Descriptive Statistics

Descriptive statistics provide an overview of the distribution of each variable, including skewness and kurtosis. These are particularly important when using Maximum Likelihood (ML) extraction in EFA, which formally assumes multivariate normality. Severe departures from normality can inflate factor loadings and distort communality estimates.

desc <- data.frame(
  Variable = colnames(data_num),
  Mean     = round(sapply(data_num, mean, na.rm = TRUE), 3),
  SD       = round(sapply(data_num, sd,   na.rm = TRUE), 3),
  Min      = round(sapply(data_num, min,  na.rm = TRUE), 3),
  Max      = round(sapply(data_num, max,  na.rm = TRUE), 3),
  Skewness = round(sapply(data_num, psych::skew),    3),
  Kurtosis = round(sapply(data_num, psych::kurtosi), 3)
)

desc %>%
  kbl(caption = "Descriptive Statistics - All Numeric Variables") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 12) %>%
  column_spec(6, color = ifelse(abs(desc$Skewness) > 2, "#c0392b", "#27ae60"),
              bold = ifelse(abs(desc$Skewness) > 2, TRUE, FALSE)) %>%
  column_spec(7, color = ifelse(abs(desc$Kurtosis) > 7, "#c0392b", "#27ae60"),
              bold = ifelse(abs(desc$Kurtosis) > 7, TRUE, FALSE))
Descriptive Statistics - All Numeric Variables
Variable Mean SD Min Max Skewness Kurtosis
Age Age 48.069 12.115 20.00 96.00 0.133 0.349
Height Height 167.157 10.053 145.00 191.00 -0.080 -0.701
Weight Weight 80.565 15.709 42.90 143.50 0.428 0.315
Body Mass Index (BMI) Body Mass Index (BMI) 28.877 5.314 17.40 49.70 0.663 1.133
Total Body Water (TBW) Total Body Water (TBW) 40.588 7.930 13.00 66.20 0.206 -0.378
Extracellular Water (ECW) Extracellular Water (ECW) 17.071 3.162 9.00 27.80 0.020 -0.371
Intracellular Water (ICW) Intracellular Water (ICW) 23.634 5.349 13.80 57.10 0.943 3.468
Extracellular Fluid/Total Body Water (ECF/TBW) Extracellular Fluid/Total Body Water (ECF/TBW) 42.212 3.244 29.23 52.00 -0.509 1.374
Total Body Fat Ratio (TBFR) (%) Total Body Fat Ratio (TBFR) (%) 28.275 8.444 6.30 50.92 0.133 -0.525
Lean Mass (LM) (%) Lean Mass (LM) (%) 71.638 8.438 48.99 93.67 -0.124 -0.507
Body Protein Content (Protein) (%) Body Protein Content (Protein) (%) 15.939 2.335 5.56 24.81 -0.047 1.820
Visceral Fat Rating (VFR) Visceral Fat Rating (VFR) 9.078 4.333 1.00 31.00 0.790 2.134
Bone Mass (BM) Bone Mass (BM) 2.803 0.509 1.40 4.00 0.205 -0.825
Muscle Mass (MM) Muscle Mass (MM) 54.273 10.604 4.70 78.80 -0.102 0.359
Obesity (%) Obesity (%) 35.850 109.800 0.40 1954.00 16.715 288.973
Total Fat Content (TFC) Total Fat Content (TFC) 23.488 9.608 3.10 62.50 0.802 1.115
Visceral Fat Area (VFA) Visceral Fat Area (VFA) 12.172 5.262 0.90 41.00 1.048 2.919
Visceral Muscle Area (VMA) (Kg) Visceral Muscle Area (VMA) (Kg) 30.403 4.461 18.90 41.10 -0.056 -0.473
Hepatic Fat Accumulation (HFA) Hepatic Fat Accumulation (HFA) 1.150 1.059 0.00 4.00 0.174 -1.357
Glucose Glucose 108.689 44.849 69.00 575.00 5.882 45.419
Total Cholesterol (TC) Total Cholesterol (TC) 203.495 45.759 60.00 360.00 0.430 0.485
Low Density Lipoprotein (LDL) Low Density Lipoprotein (LDL) 126.652 38.541 11.00 293.00 0.535 1.080
High Density Lipoprotein (HDL) High Density Lipoprotein (HDL) 49.476 17.719 25.00 273.00 6.468 77.178
Triglyceride Triglyceride 144.502 97.904 1.39 838.00 2.760 12.649
Aspartat Aminotransferaz (AST) Aspartat Aminotransferaz (AST) 21.685 16.698 8.00 195.00 6.929 60.062
Alanin Aminotransferaz (ALT) Alanin Aminotransferaz (ALT) 26.856 27.884 3.00 372.00 7.209 76.924
Alkaline Phosphatase (ALP) Alkaline Phosphatase (ALP) 73.113 24.181 7.00 197.00 0.788 2.507
Creatinine Creatinine 0.801 0.176 0.46 1.46 0.615 0.135
Glomerular Filtration Rate (GFR) Glomerular Filtration Rate (GFR) 100.819 16.971 10.60 132.00 -1.796 6.391
C-Reactive Protein (CRP) C-Reactive Protein (CRP) 1.854 4.990 0.00 43.40 5.360 33.233
Hemoglobin (HGB) Hemoglobin (HGB) 14.418 1.776 8.50 18.80 -0.381 0.133
Vitamin D Vitamin D 21.401 9.982 3.50 53.10 0.276 -0.238

Interpretation: Variables are highlighted in red when |Skewness| > 2 or |Kurtosis| > 7, indicating substantial departure from normality.

Most body composition variables exhibit approximately symmetric distributions (skewness ≈ 0), which is expected in clinical cohort data. However, several blood biomarkers show extreme non-normality:

  • Obesity (%): Skewness = 16.72, Kurtosis = 288.97 - an extreme outlier-driven distribution. The maximum value of 1954% is pathologically extreme and suggests data entry issues or a very severe clinical case. This explains why Obesity (%) has very low communality in EFA (h² = 0.044).
  • Glucose: Skewness = 5.88 - reflects patients with severely uncontrolled hyperglycemia (max = 575 mg/dL), likely representing diabetic crises.
  • AST & ALT: Skewness = 6.93 and 7.21 respectively - extreme values consistent with acute hepatocellular injury in a subset of patients.
  • HDL: Skewness = 6.47 - a small number of patients have very high HDL (e.g., due to pharmacotherapy or genetic conditions).
  • CRP & Triglyceride: Skewness > 2 - reflect acute inflammatory episodes and metabolic syndrome cases in the tail of the distribution.

These non-normal biomarkers are clinically meaningful but analytically problematic. Their low communalities in EFA are a direct consequence of high uniqueness (varians unik) driven by extreme observations.


Stage 3: Correlation Matrix

The correlation matrix is the input for both PCA and EFA. For these techniques to be meaningful, there must be sufficient intercorrelation among variables - both techniques decompose this shared variance into latent dimensions. A minimum of ~25–30% of variable pairs with |r| > 0.30 is considered sufficient evidence of factorability (Watkins, 2021).

mat_corr  <- cor(data_num, use = "complete.obs")
high_corr <- which(abs(mat_corr) > 0.7 & upper.tri(mat_corr), arr.ind = TRUE)
cat(sprintf("Pairs with |r| > 0.70: %d out of %d total pairs (%.1f%%)\n",
            nrow(high_corr),
            ncol(mat_corr)*(ncol(mat_corr)-1)/2,
            nrow(high_corr) / (ncol(mat_corr)*(ncol(mat_corr)-1)/2) * 100))
## Pairs with |r| > 0.70: 39 out of 496 total pairs (7.9%)
cat("\nTop 10 strongest correlations:\n")
## 
## Top 10 strongest correlations:
for (i in seq_len(min(10, nrow(high_corr)))) {
  r1 <- rownames(mat_corr)[high_corr[i, 1]]
  r2 <- colnames(mat_corr)[high_corr[i, 2]]
  cat(sprintf("  %s  <-->  %s    r = %.3f\n", r1, r2,
              mat_corr[high_corr[i, 1], high_corr[i, 2]]))
}
##   Weight  <-->  Body Mass Index (BMI)    r = 0.791
##   Height  <-->  Total Body Water (TBW)    r = 0.713
##   Weight  <-->  Total Body Water (TBW)    r = 0.767
##   Weight  <-->  Extracellular Water (ECW)    r = 0.789
##   Total Body Water (TBW)  <-->  Extracellular Water (ECW)    r = 0.904
##   Height  <-->  Intracellular Water (ICW)    r = 0.739
##   Total Body Water (TBW)  <-->  Intracellular Water (ICW)    r = 0.817
##   Extracellular Water (ECW)  <-->  Intracellular Water (ICW)    r = 0.784
##   Body Mass Index (BMI)  <-->  Total Body Fat Ratio (TBFR) (%)    r = 0.753
##   Body Mass Index (BMI)  <-->  Lean Mass (LM) (%)    r = -0.752
cat(sprintf("  ... and %d more pairs with |r| > 0.70\n", nrow(high_corr) - 10))
##   ... and 29 more pairs with |r| > 0.70
corrplot(mat_corr,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         number.cex  = 0.38,
         tl.cex      = 0.52,
         tl.col      = "black",
         tl.srt      = 45,
         col         = colorRampPalette(
           c("#053061","#2166AC","#F7F7F7","#D6604D","#67001F"))(200),
         title       = "Correlation Matrix - Numerical Variables",
         mar         = c(0, 0, 3, 0))

Interpretation: There are 39 variable pairs with |r| > 0.70, representing strong multicollinearity that confirms rich latent structure suitable for PCA/EFA.

Three distinct correlation clusters are visible in the heatmap:

Cluster 1 - Body Fluid & Lean Mass (top-left block, deep red):
TBW ↔︎ ECW (r = 0.904), TBW ↔︎ ICW (r = 0.817), ECW ↔︎ ICW (r = 0.784), Weight ↔︎ TBW (r = 0.767), Weight ↔︎ ECW (r = 0.789). These variables all measure aspects of the same biological phenomenon - the distribution of body water and lean mass. Their strong intercorrelations are physiologically expected: larger bodies have more total water, and water is compartmentalized across intracellular and extracellular spaces proportionally.

Cluster 2 - Adiposity vs Lean Composition (middle block, high contrast):
BMI ↔︎ TBFR (r = 0.753), BMI ↔︎ LM (r = -0.752), Weight ↔︎ BMI (r = 0.791). The negative correlation between BMI and Lean Mass reflects the fundamental biological trade-off: as the proportion of body fat increases, lean mass proportion necessarily decreases. This contrast will manifest as a strong bipolar second component in PCA.

Cluster 3 - Biomarker variables (bottom-right, lighter colors):
Biomarkers generally show weaker intercorrelations (|r| mostly < 0.50), with the notable exception of AST ↔︎ ALT (both liver enzymes), Glucose ↔︎ Triglyceride (metabolic syndrome axis), and Age ↔︎ GFR (renal aging). The overall weaker correlations in this group partially explain why EFA extracts fewer factors from biomarkers than from body composition variables.


Stage 4: KMO Test - Iterative Variable Removal

The Kaiser-Meyer-Olkin (KMO) Test measures the Measure of Sampling Adequacy (MSA) - the proportion of variance among variables that is common variance (shared through latent factors) rather than unique variance. KMO evaluates whether the correlation patterns are compact enough to support factor analysis.

The MSA for each variable is computed as:

\[MSA_j = \frac{\sum_{k \neq j} r_{jk}^2}{\sum_{k \neq j} r_{jk}^2 + \sum_{k \neq j} a_{jk}^2}\]

where \(r_{jk}\) is the bivariate correlation and \(a_{jk}\) is the partial correlation between variables j and k controlling for all others. An MSA close to 1.0 means partial correlations are very small relative to bivariate correlations - indicating that variable j shares its variance with others through common factors. Variables with MSA < 0.50 are considered “unfit” and are removed iteratively.

KMO Classification (Kaiser, 1974; Watkins, 2021):

MSA Range Interpretation
≥ 0.90 Marvelous
0.80 – 0.89 Meritorious
0.70 – 0.79 Middling
0.60 – 0.69 Mediocre
0.50 – 0.59 Miserable
< 0.50 Unacceptable - remove variable
data_iter    <- data_num
dropped_vars <- c()
iter         <- 0

repeat {
  iter    <- iter + 1
  kmo_res <- KMO(data_iter)
  msa_vec <- kmo_res$MSAi
  min_msa <- min(msa_vec)
  min_var <- names(which.min(msa_vec))
  if (min_msa >= 0.50) break
  cat(sprintf("Iter %d: removing '%s'  (MSA = %.4f)\n", iter, min_var, min_msa))
  dropped_vars <- c(dropped_vars, min_var)
  data_iter    <- data_iter[, colnames(data_iter) != min_var]
}
## Iter 1: removing 'Total Cholesterol (TC)'  (MSA = 0.4001)
## Iter 2: removing 'Low Density Lipoprotein (LDL)'  (MSA = 0.3743)
## Iter 3: removing 'Extracellular Fluid/Total Body Water (ECF/TBW)'  (MSA = 0.4832)
kmo_final <- KMO(data_iter)
msa_final <- data.frame(
  Variable = names(kmo_final$MSAi),
  MSA      = round(as.numeric(kmo_final$MSAi), 4)
)
low_final <- msa_final[msa_final$MSA < 0.50, ]
mat_final <- cor(data_iter, use = "complete.obs")

kmo_cat <- switch(
  as.character(findInterval(kmo_final$MSA, c(0.5, 0.6, 0.7, 0.8, 0.9))),
  "0" = "Unacceptable", "1" = "Miserable", "2" = "Mediocre",
  "3" = "Middling", "4" = "Meritorious", "5" = "Marvelous"
)

cat(sprintf("\nFinal Overall KMO MSA  : %.4f  (%s)\n", kmo_final$MSA, kmo_cat))
## 
## Final Overall KMO MSA  : 0.8388  (Meritorious)
cat(sprintf("Variables dropped      : %d  {%s}\n",
            length(dropped_vars),
            paste(dropped_vars, collapse = ", ")))
## Variables dropped      : 3  {Total Cholesterol (TC), Low Density Lipoprotein (LDL), Extracellular Fluid/Total Body Water (ECF/TBW)}
cat(sprintf("Variables retained     : %d\n", ncol(data_iter)))
## Variables retained     : 29
ggplot(
  msa_final %>% mutate(
    Short    = abbreviate(Variable, minlength = 10),
    Category = case_when(
      MSA < 0.50 ~ "< 0.50",
      MSA < 0.70 ~ "0.50-0.70",
      TRUE       ~ ">= 0.70"
    )
  ),
  aes(x = reorder(Short, MSA), y = MSA, fill = Category)
) +
  geom_col(width = 0.75, alpha = 0.9) +
  geom_hline(yintercept = 0.50, linetype = "dashed", color = "#D73027", linewidth = 0.7) +
  geom_hline(yintercept = 0.70, linetype = "dashed", color = "#FC8D59", linewidth = 0.7) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "#1A9641", linewidth = 0.7) +
  annotate("text", x = 0.6, y = c(0.52, 0.72, 0.82),
           label = c("0.50 - Acceptable", "0.70 - Good", "0.80 - Great"),
           hjust = 0, size = 3, color = c("#D73027","#FC8D59","#1A9641")) +
  scale_fill_manual(values = c("< 0.50" = "#D73027",
                               "0.50-0.70" = "#FEE090",
                               ">= 0.70"   = "#4575B4")) +
  coord_flip() +
  scale_y_continuous(limits = c(0, 1.05), breaks = seq(0, 1, 0.1)) +
  labs(title = "KMO Per-Variable MSA (Post-Iteration)",
       x = NULL, y = "MSA Value", fill = "MSA Category") +
  theme_minimal(base_size = 11) +
  theme(legend.position  = "bottom",
        plot.title       = element_text(face = "bold", hjust = 0.5),
        panel.grid.minor = element_blank())

Interpretation - Why were these three variables removed?

Iteration 1 - Total Cholesterol (TC, MSA = 0.4001):
TC correlates strongly with LDL (which was also eventually removed) but only weakly with the body composition variables that dominate the dataset. The high partial correlations of TC relative to its bivariate correlations indicate that TC’s relationship with other variables is not mediated through a common latent factor - it behaves more like an “isolated island” in the correlation network.

Iteration 2 - LDL (MSA = 0.3743):
After TC was removed, LDL - which was heavily co-linear with TC - became the most isolated variable. LDL’s correlations with the remaining body composition variables and other biomarkers are weak and inconsistent, without a common factor structure tying them together.

Iteration 3 - ECF/TBW Ratio (MSA = 0.4832):
This is a derived variable computed as ECF divided by TBW. Since both ECF and TBW are already present in the dataset, the ratio creates a mathematical dependency that inflates partial correlations. Derived ratios often behave poorly in factor analysis precisely because their variance is algebraically constrained by their components.

Final result: KMO Overall = 0.8388 (Meritorious) - an excellent level. Virtually all 29 retained variables have MSA ≥ 0.70, confirming that each variable shares substantial variance with the others through common latent factors. The analysis is fully supported.


Stage 5: Bartlett’s Test of Sphericity

Bartlett’s Test of Sphericity tests the null hypothesis that the population correlation matrix is an identity matrix (i.e., all off-diagonal correlations = 0). If H₀ cannot be rejected, there are no significant correlations among variables and PCA/EFA is meaningless. The test statistic follows a chi-square distribution with df = p(p-1)/2.

Note: This test is sensitive to sample size - with n = 319, even weak correlations will be statistically significant. The KMO test is therefore a more substantively meaningful indicator of factorability.

bartlett_final <- cortest.bartlett(mat_final, n = n, diag = TRUE)
cat(sprintf("Chi-square : %.3f\n", bartlett_final$chisq))
## Chi-square : 11645.016
cat(sprintf("df         : %d\n",   bartlett_final$df))
## df         : 406
cat(sprintf("p-value    : %s  [%s]\n",
            format(bartlett_final$p.value, scientific = TRUE),
            ifelse(bartlett_final$p.value < 0.05, "PASSED", "FAILED")))
## p-value    : 0e+00  [PASSED]

Result: χ²(406) = 11,645.016, p ≈ 0. The null hypothesis is decisively rejected. The correlation matrix is definitively non-spherical, confirming that substantial shared variance exists among variables - a prerequisite for meaningful factor extraction.

Assumption Test Summary

p_final <- ncol(data_iter)
mat_abs <- abs(mat_final); diag(mat_abs) <- 0
n_pairs <- p_final * (p_final - 1) / 2
n_sig   <- sum(mat_abs > 0.3) / 2
pct_sig <- round(n_sig / n_pairs * 100, 1)

assumption_df <- data.frame(
  Assumption = c(
    "Sample size (n ≥ 50)",
    "Obs-to-variable ratio (≥ 5:1)",
    "Correlation pairs |r| > 0.30",
    "KMO Overall MSA (≥ 0.50)",
    "Per-variable MSA (all ≥ 0.50)",
    "Bartlett's Test (p < 0.05)"
  ),
  Result = c(
    as.character(n),
    paste0(round(n / p_final, 2), ":1"),
    sprintf("%d / %d  (%.1f%%)", as.integer(n_sig), n_pairs, pct_sig),
    sprintf("%.4f  (%s)", kmo_final$MSA, kmo_cat),
    ifelse(nrow(low_final) == 0, "All passed",
           paste0(nrow(low_final), " variable(s) below 0.50")),
    format(bartlett_final$p.value, digits = 3, scientific = TRUE)
  ),
  Status = c(
    ifelse(n >= 50,                       "PASSED", "FAILED"),
    ifelse((n / p_final) >= 5,            "PASSED", "FAILED"),
    ifelse(pct_sig >= 25,                 "PASSED", "FAILED"),
    ifelse(kmo_final$MSA >= 0.50,         "PASSED", "FAILED"),
    ifelse(nrow(low_final) == 0,          "PASSED", "FAILED"),
    ifelse(bartlett_final$p.value < 0.05, "PASSED", "FAILED")
  )
)

assumption_df %>%
  kbl(caption = "Assumption Test Summary - PCA/EFA Prerequisites") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  column_spec(3,
              color = ifelse(assumption_df$Status == "PASSED", "#27ae60", "#c0392b"),
              bold  = TRUE)
Assumption Test Summary - PCA/EFA Prerequisites
Assumption Result Status
Sample size (n ≥ 50) 319 PASSED
Obs-to-variable ratio (≥ 5:1) 11:1 PASSED
Correlation pairs |r| > 0.30 135 / 406 (33.3%) PASSED
KMO Overall MSA (≥ 0.50) 0.8388 (Meritorious) PASSED
Per-variable MSA (all ≥ 0.50) All passed PASSED
Bartlett’s Test (p < 0.05) 0e+00 PASSED

All 6 prerequisites are satisfied. The dataset demonstrates excellent factorability: a meritorious KMO (0.8388), 33.3% of variable pairs with |r| > 0.30, and an overwhelmingly significant Bartlett test. PCA and EFA are fully justified.


Stage 6: PCA - Principal Component Extraction

Principal Component Analysis (PCA) transforms the original correlated variables into a set of orthogonal (uncorrelated) linear combinations called principal components (PCs). Each PC captures a decreasing amount of total variance.

The mathematical transformation is:

\[PC_i = e_{i1}Z_1 + e_{i2}Z_2 + \cdots + e_{ip}Z_p\]

where \(e_{ij}\) are the eigenvectors of the correlation matrix and \(Z_j\) are standardized variables. Components are ordered so that PC1 captures the most variance, PC2 the most remaining variance, and so on.

Component retention criteria applied:

  • Kaiser Rule (primary): Retain components with eigenvalue λ > 1. A component with λ < 1 explains less variance than a single standardized variable.
  • Cumulative variance threshold (supporting): Target ≥ 70% cumulative variance (Hair et al., 2019).
  • Scree plot (visual support): Identify the “elbow” - the inflection point where the eigenvalue decline becomes gradual.
data_pca    <- data_iter
pca_result  <- prcomp(data_pca, scale. = TRUE, center = TRUE)
eigenvalues <- pca_result$sdev^2
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

n_kaiser <- sum(eigenvalues > 1)
n_var70  <- which(cum_var >= 0.70)[1]
n_var80  <- which(cum_var >= 0.80)[1]
n_comp   <- n_kaiser
eig_df <- data.frame(
  Component  = paste0("PC", seq_along(eigenvalues)),
  Eigenvalue = round(eigenvalues, 4),
  Variance   = paste0(round(prop_var * 100, 2), "%"),
  Cumulative = paste0(round(cum_var  * 100, 2), "%"),
  Decision   = ifelse(eigenvalues > 1, "Retain", "Drop")
)

head(eig_df, 15) %>%
  kbl(caption = "Eigenvalue Table - First 15 Principal Components") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  row_spec(which(head(eig_df,15)$Decision == "Retain"),
           background = "#eafaf1") %>%
  row_spec(which(head(eig_df,15)$Decision == "Drop"),
           color = "#95a5a6")
Eigenvalue Table - First 15 Principal Components
Component Eigenvalue Variance Cumulative Decision
PC1 8.7859 30.3% 30.3% Retain
PC2 6.1447 21.19% 51.48% Retain
PC3 2.0819 7.18% 58.66% Retain
PC4 1.6823 5.8% 64.46% Retain
PC5 1.3655 4.71% 69.17% Retain
PC6 1.1053 3.81% 72.98% Retain
PC7 1.0051 3.47% 76.45% Retain
PC8 0.9758 3.36% 79.82% Drop
PC9 0.8816 3.04% 82.86% Drop
PC10 0.7620 2.63% 85.48% Drop
PC11 0.7099 2.45% 87.93% Drop
PC12 0.6177 2.13% 90.06% Drop
PC13 0.5336 1.84% 91.9% Drop
PC14 0.4492 1.55% 93.45% Drop
PC15 0.3856 1.33% 94.78% Drop
data.frame(
  Criterion    = c("Kaiser Rule (eigenvalue > 1)",
                   "Cumulative Variance ≥ 70%",
                   "Cumulative Variance ≥ 80%"),
  N_Components = c(n_kaiser, n_var70, n_var80),
  Cumulative   = c(sprintf("%.2f%%", cum_var[n_kaiser]*100),
                   sprintf("%.2f%%", cum_var[n_var70]*100),
                   sprintf("%.2f%%", cum_var[n_var80]*100)),
  Role = c("PRIMARY CRITERION", "Supporting", "Supporting")
) %>%
  kbl(caption = "Component Retention Criteria Comparison") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#eaf4fb")
Component Retention Criteria Comparison
Criterion N_Components Cumulative Role
Kaiser Rule (eigenvalue > 1) 7 76.45% PRIMARY CRITERION
Cumulative Variance ≥ 70% 6 72.98% Supporting
Cumulative Variance ≥ 80% 9 82.86% Supporting
fviz_eig(pca_result,
         ncp       = p_final,
         addlabels = TRUE,
         barfill   = "#3498DB",
         barcolor  = "#2980B9",
         linecolor = "#E74C3C") +
  geom_hline(yintercept = 100 / p_final,
             linetype = "dashed", color = "#E74C3C", linewidth = 0.8) +
  annotate("text",
           x     = n_comp + 1.2,
           y     = eigenvalues[n_comp] / sum(eigenvalues) * 100 + 1.8,
           label = sprintf("Retain PC1–PC%d\n(%.2f%% variance)",
                           n_comp, cum_var[n_comp]*100),
           color = "#1A5276", size = 3.2, hjust = 0) +
  labs(title    = "Scree Plot - PCA",
       subtitle = sprintf("Kaiser Rule: %d components retained  |  Cumulative variance: %.2f%%",
                          n_comp, cum_var[n_comp]*100),
       x = "Principal Component", y = "Variance Explained (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title       = element_text(face = "bold", hjust = 0.5),
        plot.subtitle    = element_text(hjust = 0.5, color = "grey40"),
        panel.grid.minor = element_blank())

Interpretation - Eigenvalue Structure:

The PCA retains 7 components (Kaiser Rule) explaining 76.45% of total variance. This is an excellent result for a 29-variable dataset of mixed clinical data.

The eigenvalue pattern is highly informative:

  • PC1 (λ = 8.79, 30.30%): Nearly one-third of all variance is captured by a single general dimension. This is the General Body Mass & Hydration component - almost all body composition variables (muscle, water, bone, weight) load positively on it. Biologically, this reflects the strong correlation among all indicators of body size: bigger bodies have more muscle, more water, more bone, and more fat.

  • PC2 (λ = 6.14, 21.19%): The second component is unusually large, nearly as strong as PC1. This indicates a major contrast dimension between lean composition (positive: LM, Protein) and adipose composition (negative: TBFR, TFC, BMI). This bipolar component captures the fundamental biological trade-off between fat mass and lean mass across the sample.

  • PC3 (λ = 2.08, 7.18%): Captures the Renal Aging axis - Age loads positively, GFR loads negatively. This reflects the well-established physiological relationship between advancing age and declining kidney function (measured by GFR).

  • PC4 (λ = 1.68, 5.80%): A Hepatocellular Injury component dominated by AST and ALT - both transaminase enzymes that elevate during liver damage.

  • PC5 (λ = 1.37, 4.71%): A Metabolic Risk component with negative loadings for Glucose and Triglyceride, capturing insulin resistance and dyslipidemia.

  • PC6 (λ = 1.11, 3.81%): An Inflammation vs Vitamin D component - CRP (inflammatory marker) loads positively while Vitamin D loads negatively. Low Vitamin D and elevated CRP are both markers of poor metabolic health and are inversely correlated.

  • PC7 (λ = 1.01, 3.47%): An Obesity (%) singleton component. Note that Obesity (%) has extreme skewness (16.7) and appears to be a near-outlier variable that forms its own isolated component.

PC7’s eigenvalue of 1.0051 is only marginally above 1.0, suggesting it represents borderline-justifiable variance. The Parallel Analysis in EFA (Stage 10) will provide a more conservative estimate.


Stage 7: Unrotated Factor Loading Matrix

The factor loading matrix shows the correlation between each variable and each principal component. Loadings are calculated as:

\[\lambda_{jk} = e_{jk} \cdot \sqrt{\text{eigenvalue}_k}\]

Communality (h²) is the sum of squared loadings across all retained components for each variable - representing the proportion of that variable’s variance explained by the overall solution. A variable with h² < 0.50 means more than half its variance is unique and not shared with the extracted components.

loadings_unrot <- pca_result$rotation[, 1:n_comp] %*%
  diag(sqrt(eigenvalues[1:n_comp]), nrow = n_comp)
colnames(loadings_unrot) <- paste0("PC", 1:n_comp)
h2_unrot <- rowSums(loadings_unrot^2)

load_unrot_df    <- as.data.frame(round(loadings_unrot, 3))
load_unrot_df$h2 <- round(h2_unrot, 3)
load_unrot_df    <- load_unrot_df %>% arrange(desc(h2))

load_unrot_df %>%
  kbl(caption = "Unrotated Loading Matrix + Communality (h²) - Sorted by h²") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 11) %>%
  column_spec(ncol(load_unrot_df),
              color = ifelse(load_unrot_df$h2 < 0.50, "#c0392b", "#27ae60"),
              bold  = TRUE)
Unrotated Loading Matrix + Communality (h²) - Sorted by h²
PC1 PC2 PC3 PC4 PC5 PC6 PC7 h2
Weight 0.894 -0.382 -0.111 0.100 0.047 0.061 -0.003 0.973
Lean Mass (LM) (%) 0.022 0.965 0.066 0.013 0.019 -0.025 -0.028 0.939
Muscle Mass (MM) 0.930 0.236 -0.061 0.093 0.051 0.050 -0.030 0.939
Total Body Fat Ratio (TBFR) (%) -0.017 -0.964 -0.069 -0.010 -0.009 0.019 0.021 0.936
Total Fat Content (TFC) 0.383 -0.875 -0.105 0.042 0.020 0.046 0.014 0.929
Alanin Aminotransferaz (ALT) 0.374 0.156 -0.242 -0.830 0.086 0.061 -0.019 0.923
Body Mass Index (BMI) 0.523 -0.801 -0.034 0.008 0.032 -0.001 0.015 0.918
Visceral Fat Area (VFA) 0.673 -0.666 -0.073 0.071 0.059 0.039 0.012 0.912
Total Body Water (TBW) 0.923 0.196 -0.048 0.080 0.073 0.008 0.001 0.904
Aspartat Aminotransferaz (AST) 0.271 0.154 -0.268 -0.848 0.124 -0.024 -0.023 0.904
Extracellular Water (ECW) 0.919 0.054 -0.035 0.056 0.125 -0.097 0.046 0.879
Visceral Fat Rating (VFR) 0.711 -0.485 0.320 -0.074 0.130 -0.092 0.022 0.875
Intracellular Water (ICW) 0.860 0.302 -0.119 0.099 0.022 0.055 -0.034 0.859
Height 0.643 0.592 -0.137 0.155 0.015 0.129 -0.018 0.824
Visceral Muscle Area (VMA) (Kg) 0.860 0.220 -0.032 0.071 0.027 0.087 -0.021 0.803
Glomerular Filtration Rate (GFR) -0.084 0.116 -0.818 0.109 -0.219 -0.172 0.116 0.794
Bone Mass (BM) 0.848 0.213 -0.040 0.121 0.064 -0.032 -0.012 0.785
Age 0.019 -0.330 0.749 -0.208 0.169 -0.093 0.003 0.751
Obesity (%) 0.044 -0.208 0.005 -0.097 -0.223 -0.245 -0.756 0.736
Creatinine 0.448 0.454 0.508 -0.019 0.115 0.090 -0.111 0.698
Body Protein Content (Protein) (%) -0.021 0.722 0.271 -0.011 -0.024 0.155 -0.080 0.626
Vitamin D -0.010 0.144 0.151 0.017 0.397 -0.496 0.422 0.626
Triglyceride 0.431 0.026 0.180 -0.076 -0.588 -0.088 0.084 0.584
Glucose 0.201 -0.085 0.345 -0.049 -0.611 -0.168 0.074 0.575
Hemoglobin (HGB) 0.580 0.438 0.061 -0.054 -0.120 -0.109 -0.009 0.561
Alkaline Phosphatase (ALP) 0.065 -0.150 0.144 -0.301 -0.347 0.360 0.397 0.546
C-Reactive Protein (CRP) -0.033 -0.200 0.052 0.081 0.078 0.680 -0.066 0.524
Hepatic Fat Accumulation (HFA) 0.525 -0.386 0.059 -0.051 -0.087 -0.159 -0.046 0.466
High Density Lipoprotein (HDL) -0.410 -0.214 0.000 -0.057 0.354 0.041 -0.195 0.382
low_h2 <- rownames(load_unrot_df)[load_unrot_df$h2 < 0.50]
cat(sprintf("[!] Variables with h2 < 0.50: %s\n\n",
            ifelse(length(low_h2) == 0, "None", paste(low_h2, collapse = ", "))))
## [!] Variables with h2 < 0.50: Hepatic Fat Accumulation (HFA), High Density Lipoprotein (HDL)
for (k in 1:n_comp) {
  ld  <- loadings_unrot[, k]
  dom <- sort(ld[abs(ld) > 0.50], decreasing = TRUE)
  cat(sprintf("PC%d (%.2f%% variance):\n", k, prop_var[k] * 100))
  if (length(dom) > 0)
    for (nm in names(dom)) cat(sprintf("  %+.4f  %s\n", dom[nm], nm))
  else cat("  No loading > 0.50\n")
  cat("\n")
}
## PC1 (30.30% variance):
##   +0.9300  Muscle Mass (MM)
##   +0.9228  Total Body Water (TBW)
##   +0.9191  Extracellular Water (ECW)
##   +0.8942  Weight
##   +0.8599  Visceral Muscle Area (VMA) (Kg)
##   +0.8597  Intracellular Water (ICW)
##   +0.8477  Bone Mass (BM)
##   +0.7109  Visceral Fat Rating (VFR)
##   +0.6726  Visceral Fat Area (VFA)
##   +0.6430  Height
##   +0.5803  Hemoglobin (HGB)
##   +0.5251  Hepatic Fat Accumulation (HFA)
##   +0.5226  Body Mass Index (BMI)
## 
## PC2 (21.19% variance):
##   +0.9654  Lean Mass (LM) (%)
##   +0.7217  Body Protein Content (Protein) (%)
##   +0.5918  Height
##   -0.6660  Visceral Fat Area (VFA)
##   -0.8014  Body Mass Index (BMI)
##   -0.8753  Total Fat Content (TFC)
##   -0.9644  Total Body Fat Ratio (TBFR) (%)
## 
## PC3 (7.18% variance):
##   +0.7495  Age
##   +0.5075  Creatinine
##   -0.8185  Glomerular Filtration Rate (GFR)
## 
## PC4 (5.80% variance):
##   -0.8301  Alanin Aminotransferaz (ALT)
##   -0.8478  Aspartat Aminotransferaz (AST)
## 
## PC5 (4.71% variance):
##   -0.5875  Triglyceride
##   -0.6106  Glucose
## 
## PC6 (3.81% variance):
##   +0.6803  C-Reactive Protein (CRP)
## 
## PC7 (3.47% variance):
##   -0.7562  Obesity (%)
load_unrot_df %>%
  select(-h2) %>%
  rownames_to_column("Variable") %>%
  pivot_longer(-Variable, names_to = "Component", values_to = "Loading") %>%
  ggplot(aes(x = Component, y = reorder(Variable, Loading), fill = Loading)) +
  geom_tile(color = "white", linewidth = 0.4) +
  geom_text(aes(label = sprintf("%.2f", Loading)), size = 2.2, color = "black") +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
                       midpoint = 0, limits = c(-1, 1), name = "Loading") +
  labs(title    = "Unrotated Component Loadings",
       subtitle = "All loading values displayed",
       x = "Principal Component", y = NULL) +
  theme_minimal(base_size = 9) +
  theme(axis.text.y     = element_text(size = 7),
        legend.position = "right",
        plot.title      = element_text(face = "bold", hjust = 0.5),
        plot.subtitle   = element_text(hjust = 0.5, color = "grey50"),
        panel.grid      = element_blank())

Interpretation - Communality (h²) Analysis:

The communality values reveal how well each variable is represented by the 7-component solution:

Excellent representation (h² ≥ 0.90): Weight (0.973), Lean Mass (0.939), Muscle Mass (0.939), TBFR (0.936), TFC (0.929), ALT (0.923), BMI (0.918), VFA (0.912), TBW (0.904), AST (0.904). These variables are almost entirely explained by the extracted components - they are the “core” variables of this dataset.

Good representation (h² = 0.70–0.89): ECW, VFR, ICW, Height, VMA, GFR, Bone Mass, Age, Obesity (%), Creatinine. Well-represented with moderate uniqueness.

Weak representation (h² < 0.50): - Hepatic Fat Accumulation (HFA, h² = 0.466): HFA correlates moderately with multiple components but doesn’t dominate any one. Its unique variance is high (u² = 0.534), suggesting it represents a dimension partially captured but not fully explained by this solution. - HDL (h² = 0.382): HDL has consistently low MSA and low communality throughout the analysis. This reflects its complex physiological role - HDL is inversely protective for cardiovascular disease and correlates with multiple body composition dimensions simultaneously without clustering cleanly with any one group.

Key pattern in unrotated loadings: PC1 acts as a general factor with positive loadings across virtually all body composition variables - this “blurring” makes interpretation difficult. The rotated solution (Stage 8) will clarify the structure by creating more differentiated, interpretable components.


Stage 8: Rotated Component Matrix - VARIMAX

VARIMAX rotation (orthogonal) redistributes variance among components to maximize the variance of squared loadings within each column. The goal is simple structure - each variable loads highly on one component and near-zero on others, making each component substantively interpretable. Rotation does not change the total variance explained or communalities; it only redistributes the variance more evenly across components (Watkins, 2021).

pca_rotated <- principal(data_pca, nfactors = n_comp,
                         rotate = "varimax", scores = TRUE)

rot_load <- unclass(pca_rotated$loadings)[, 1:n_comp]
colnames(rot_load) <- paste0("RC", 1:n_comp)
h2_rot <- pca_rotated$communality

cat("Rotated Component Matrix (cutoff |loading| >= 0.30):\n\n")
## Rotated Component Matrix (cutoff |loading| >= 0.30):
print(pca_rotated$loadings, cutoff = 0.30, sort = TRUE)
## 
## Loadings:
##                                    RC1    RC2    RC3    RC4    RC5    RC6   
## Height                              0.825 -0.339                            
## Weight                              0.733  0.649                            
## Total Body Water (TBW)              0.934                                   
## Extracellular Water (ECW)           0.879                                   
## Intracellular Water (ICW)           0.916                                   
## Bone Mass (BM)                      0.876                                   
## Muscle Mass (MM)                    0.958                                   
## Visceral Muscle Area (VMA) (Kg)     0.881                                   
## Creatinine                          0.540 -0.368  0.512                     
## Hemoglobin (HGB)                    0.635                                   
## Body Mass Index (BMI)                      0.917                            
## Total Body Fat Ratio (TBFR) (%)    -0.319  0.903                            
## Lean Mass (LM) (%)                  0.326 -0.902                            
## Body Protein Content (Protein) (%)        -0.737                            
## Visceral Fat Rating (VFR)           0.471  0.640  0.459                     
## Total Fat Content (TFC)                    0.949                            
## Visceral Fat Area (VFA)             0.432  0.842                            
## Hepatic Fat Accumulation (HFA)      0.328  0.516                            
## Age                                               0.807                     
## Glomerular Filtration Rate (GFR)                 -0.881                     
## Aspartat Aminotransferaz (AST)                           0.934              
## Alanin Aminotransferaz (ALT)                             0.927              
## Glucose                                                         0.727       
## Triglyceride                                                    0.697       
## C-Reactive Protein (CRP)                                               0.649
## Vitamin D                                                             -0.692
## Obesity (%)                                                                 
## High Density Lipoprotein (HDL)     -0.401                      -0.433       
## Alkaline Phosphatase (ALP)                                      0.446       
##                                    RC7   
## Height                                   
## Weight                                   
## Total Body Water (TBW)                   
## Extracellular Water (ECW)                
## Intracellular Water (ICW)                
## Bone Mass (BM)                           
## Muscle Mass (MM)                         
## Visceral Muscle Area (VMA) (Kg)          
## Creatinine                               
## Hemoglobin (HGB)                         
## Body Mass Index (BMI)                    
## Total Body Fat Ratio (TBFR) (%)          
## Lean Mass (LM) (%)                       
## Body Protein Content (Protein) (%)       
## Visceral Fat Rating (VFR)                
## Total Fat Content (TFC)                  
## Visceral Fat Area (VFA)                  
## Hepatic Fat Accumulation (HFA)           
## Age                                      
## Glomerular Filtration Rate (GFR)         
## Aspartat Aminotransferaz (AST)           
## Alanin Aminotransferaz (ALT)             
## Glucose                                  
## Triglyceride                             
## C-Reactive Protein (CRP)                 
## Vitamin D                                
## Obesity (%)                         0.832
## High Density Lipoprotein (HDL)           
## Alkaline Phosphatase (ALP)         -0.419
## 
##                  RC1   RC2   RC3   RC4   RC5   RC6   RC7
## SS loadings    8.061 6.242 2.098 1.897 1.662 1.165 1.046
## Proportion Var 0.278 0.215 0.072 0.065 0.057 0.040 0.036
## Cumulative Var 0.278 0.493 0.566 0.631 0.688 0.728 0.765
data.frame(
  PC          = paste0("PC",  1:n_comp),
  Unrotated   = paste0(round(prop_var[1:n_comp] * 100, 2), "%"),
  RC          = paste0("RC",  1:n_comp),
  Rotated_Var = paste0(round(pca_rotated$Vaccounted[2, 1:n_comp] * 100, 2), "%"),
  Rotated_Cum = paste0(round(pca_rotated$Vaccounted[3, 1:n_comp] * 100, 2), "%")
) %>%
  kbl(caption = "Variance Comparison: Unrotated vs VARIMAX Rotated") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Variance Comparison: Unrotated vs VARIMAX Rotated
PC Unrotated RC Rotated_Var Rotated_Cum
PC1 30.3% RC1 27.8% 27.8%
PC2 21.19% RC2 21.52% 49.32%
PC3 7.18% RC3 7.24% 56.56%
PC4 5.8% RC4 6.54% 63.1%
PC5 4.71% RC5 5.73% 68.83%
PC6 3.81% RC6 4.02% 72.84%
PC7 3.47% RC7 3.61% 76.45%
rot_df <- as.data.frame(round(rot_load, 3))
rot_df$h2 <- round(h2_rot, 3)

rot_df %>%
  select(-h2) %>%
  rownames_to_column("Variable") %>%
  pivot_longer(-Variable, names_to = "RC", values_to = "Loading") %>%
  ggplot(aes(x = RC, y = reorder(Variable, Loading), fill = Loading)) +
  geom_tile(color = "white", linewidth = 0.4) +
  geom_text(aes(label = ifelse(abs(Loading) >= 0.30,
                               sprintf("%.2f", Loading), "")),
            size = 2.3, color = "black") +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
                       midpoint = 0, limits = c(-1, 1), name = "Loading") +
  labs(title    = "Rotated Component Loadings (VARIMAX)",
       subtitle = "Labels displayed for |loading| ≥ 0.30",
       x = "Rotated Component", y = NULL) +
  theme_minimal(base_size = 9) +
  theme(axis.text.y     = element_text(size = 7),
        legend.position = "right",
        plot.title      = element_text(face = "bold", hjust = 0.5),
        plot.subtitle   = element_text(hjust = 0.5, color = "grey50"),
        panel.grid      = element_blank())

Interpretation - Rotated Components (VARIMAX):

After rotation, each component becomes clearly interpretable:

RC1 - “Body Size & Lean Mass” (27.8% variance)
High positive loadings: MM (0.958), TBW (0.934), ICW (0.916), VMA (0.881), ECW (0.879), BM (0.876), Height (0.825), Weight (0.733), HGB (0.635), Creatinine (0.540). This component represents overall body size and its lean tissue composition. Larger, more muscular individuals score high. The positive loading of Hemoglobin reflects the higher blood volume in larger lean bodies. HDL loads negatively (-0.40), consistent with known inverse relationships between muscle mass and HDL in non-athletic populations.

RC2 - “Adiposity vs Lean Composition” (21.5% variance)
Strong contrast: TFC (0.949), BMI (0.917), TBFR (0.903), VFA (0.842) positive vs LM (-0.902), Protein (-0.737) negative. This component captures the fundamental fat-lean trade-off. High scorers have high fat mass, high BMI, and low lean mass percentage. This is the core “obesity dimension” of the dataset and is clinically the most directly relevant component for gallstone risk.

RC3 - “Renal Age” (7.2% variance)
Age (0.807) positive, GFR (-0.881) negative. This component cleanly captures the biological aging of kidney function. GFR naturally declines with age (approximately 1 mL/min/year after age 40), making this component a direct physiological age indicator. Creatinine (0.512) and VFR (0.459) also contribute, as both tend to increase with age.

RC4 - “Hepatocellular Enzyme Activity” (6.5% variance)
AST (0.934) and ALT (0.927) load almost exclusively on this component. Both are intracellular liver enzymes that leak into the bloodstream during hepatocyte injury. This component specifically captures hepatic inflammatory or metabolic stress, independent of body composition.

RC5 - “Glycemic-Lipid Metabolic Risk” (5.7% variance)
Glucose (0.727), Triglyceride (0.697), ALP (0.446). This component represents the metabolic syndrome cluster - elevated blood sugar and triglycerides co-occur in insulin-resistant patients. ALP’s moderate loading may reflect hepatic involvement in metabolic syndrome. HDL loads negatively (-0.433), consistent with its protective role and inverse relationship with triglycerides.

RC6 - “Inflammation-Vitamin D Axis” (4.0% variance)
CRP (0.649) positive, Vitamin D (-0.692) negative. This bipolar component captures the well-documented inverse relationship between chronic inflammation (high CRP) and Vitamin D deficiency. Low Vitamin D is associated with increased inflammatory cytokines, and this appears as a meaningful clinical dimension even in the PCA solution.

RC7 - “Extreme Obesity” (3.6% variance)
Obesity (%) loads exclusively here (0.832), with ALP loading negatively (-0.419). The extreme skewness of Obesity (%) (skewness = 16.7) effectively created a component that almost entirely represents the extreme-obesity tail of the distribution. This singleton component should be interpreted with caution.

Effect of rotation: PC1’s original dominance (30.3%) was partially redistributed to create a more balanced RC1 (27.8%) and RC2 (21.5%). The rotated solution achieves much cleaner simple structure, with most variables now loading clearly on just one component.


Stage 9: PCA Biplot & Variable Contributions

fviz_pca_biplot(pca_result,
                axes      = c(1, 2),
                geom.ind  = "point",
                col.ind   = "grey70",
                alpha.ind = 0.25,
                col.var   = "contrib",
                gradient.cols = c("#00AFBB","#E7B800","#FC4E07"),
                repel     = TRUE,
                label     = "var",
                arrowsize = 0.7,
                labelsize = 3.0) +
  scale_color_gradient(low = "#00AFBB", high = "#FC4E07",
                       name = "Contribution (%)") +
  labs(title    = "PCA Biplot - PC1 vs PC2",
       subtitle = sprintf("PC1 = %.1f%%  |  PC2 = %.1f%%  |  Combined = %.1f%%",
                          prop_var[1]*100, prop_var[2]*100,
                          (prop_var[1]+prop_var[2])*100)) +
  theme_minimal(base_size = 11) +
  theme(plot.title       = element_text(face = "bold", hjust = 0.5),
        plot.subtitle    = element_text(hjust = 0.5, color = "grey40"),
        legend.position  = "right",
        panel.grid.minor = element_blank())

fviz_pca_biplot(pca_result,
                axes      = c(1, 3),
                geom.ind  = "point",
                col.ind   = "grey70",
                alpha.ind = 0.25,
                col.var   = "contrib",
                gradient.cols = c("#00AFBB","#E7B800","#FC4E07"),
                repel     = TRUE,
                label     = "var",
                arrowsize = 0.7,
                labelsize = 3.0) +
  scale_color_gradient(low = "#00AFBB", high = "#FC4E07",
                       name = "Contribution (%)") +
  labs(title    = "PCA Biplot - PC1 vs PC3",
       subtitle = sprintf("PC1 = %.1f%%  |  PC3 = %.1f%%",
                          prop_var[1]*100, prop_var[3]*100)) +
  theme_minimal(base_size = 11) +
  theme(plot.title       = element_text(face = "bold", hjust = 0.5),
        plot.subtitle    = element_text(hjust = 0.5, color = "grey40"),
        legend.position  = "right",
        panel.grid.minor = element_blank())

fviz_contrib(pca_result, choice = "var",
             axes = 1:2, top = 20,
             fill = "#3498DB", color = "#2980B9") +
  labs(title    = "Top 20 Variable Contributions to PC1 & PC2",
       subtitle = "Red dashed line = expected contribution under uniform distribution") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x   = element_text(angle = 45, hjust = 1, size = 8),
        plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey40"))

Interpretation - Biplot:

The PC1 vs PC2 biplot shows a characteristic “fan” shape. Arrows pointing in the same direction indicate positively correlated variables that cluster on the same component. Key patterns:

  • All body fluid and lean mass arrows (TBW, ECW, ICW, MM, BM, VMA, Height) point toward the right (positive PC1), forming a tight cluster - confirming RC1’s interpretation as Body Size & Lean Mass.
  • Fat composition arrows (TBFR, TFC, BMI) point downward-right (positive RC2), while LM and Protein point upward - the classic fat-lean opposition on PC2.
  • The near-perpendicular angle between the lean mass cluster and the fat mass cluster confirms their orthogonality after rotation.
  • Biomarker arrows (AST, ALT, Glucose, Triglyceride, CRP) are shorter and point in diverse directions, reflecting their lower communalities and more diffuse contributions across components.

The PC1 vs PC3 biplot shows a clear vertical separation: Age arrows point upward while GFR points downward on PC3, visually confirming the Renal Age interpretation. Creatinine aligns between PC1 and PC3, consistent with its cross-loading.


Stage 10: Exploratory Factor Analysis (EFA)

Exploratory Factor Analysis (EFA) differs fundamentally from PCA. While PCA is a variance decomposition technique (no underlying model), EFA is a latent variable model - it posits that observed variables are caused by (or reflect) unobserved latent factors, plus unique error:

\[X_j = \lambda_{j1}F_1 + \lambda_{j2}F_2 + \cdots + \lambda_{jk}F_k + \varepsilon_j\]

EFA estimates factor loadings (λ), communalities, and unique variances while accounting for measurement error. This makes EFA more appropriate when the goal is theoretical construct identification rather than pure data reduction.

Method used: Maximum Likelihood (ML)
ML estimation is preferred because it provides model fit indices (RMSEA, TLI, BIC, p-value) that allow formal evaluation of whether the factor model adequately reproduces the observed correlation matrix (Shi & Maydeu-Olivares, 2020).

References: Watkins (2021); Goretzko & Bühner (2020); Shi & Maydeu-Olivares (2020); McNeish (2023); Goretzko (2022).

Stage 10a: Parallel Analysis - Determining Number of Factors

Parallel Analysis (PA) is the most accurate method for determining the number of factors (Goretzko & Bühner, 2020). PA compares the actual eigenvalues to eigenvalues generated from random data matrices of the same size (500 iterations at the 95th percentile). Only factors whose eigenvalue exceeds the corresponding random eigenvalue are retained. PA consistently outperforms Kaiser Rule, which tends to over-retain factors.

set.seed(42)
pa_result <- fa.parallel(data_pca,
                         fm          = "ml",
                         fa          = "fa",
                         n.iter      = 500,
                         quant       = 0.95,
                         main        = "Parallel Analysis Scree Plot - EFA",
                         show.legend = TRUE)

## Parallel analysis suggests that the number of factors =  5  and the number of components =  NA
n_fa_parallel <- pa_result$nfact
cat(sprintf("\nParallel Analysis suggests : %d factor(s)\n", n_fa_parallel))
## 
## Parallel Analysis suggests : 5 factor(s)
cat(sprintf("Kaiser Rule suggests       : %d factor(s)\n", n_kaiser))
## Kaiser Rule suggests       : 7 factor(s)
n_fa <- max(n_fa_parallel, 1)
cat(sprintf("=> Using %d factor(s) for EFA\n", n_fa))
## => Using 5 factor(s) for EFA

Interpretation: Parallel Analysis suggests 5 factors, while Kaiser Rule suggested 7. This discrepancy is common and expected - Kaiser Rule applied to factor analysis is known to over-retain factors, especially when variables have moderate-to-high intercorrelations (Goretzko & Bühner, 2020).

Parallel Analysis is the superior criterion here: the 6th and 7th eigenvalues from the actual data do not meaningfully exceed what random data would produce at the 95th percentile, suggesting they do not represent genuine latent constructs. The decision to use 5 factors is well-supported by this gold-standard criterion.

Stage 10b: EFA with Varimax Rotation - Model Fit Check

efa_varimax <- fa(data_pca,
                  nfactors = n_fa,
                  fm       = "ml",
                  rotate   = "varimax",
                  scores   = "regression",
                  use      = "complete.obs")

data.frame(
  Index       = c("RMSEA", "TLI", "BIC", "Chi-sq p-value"),
  Value       = c(round(efa_varimax$RMSEA[1], 4),
                  round(efa_varimax$TLI, 4),
                  round(efa_varimax$BIC, 4),
                  round(efa_varimax$PVAL, 4)),
  Threshold   = c("< 0.08 acceptable; < 0.05 good",
                  "> 0.90 acceptable",
                  "Lower = better fit",
                  "p > 0.05 = exact fit"),
  Status = c(ifelse(efa_varimax$RMSEA[1] < 0.08, "Acceptable", "Poor"),
             ifelse(efa_varimax$TLI > 0.90, "Acceptable", "Below threshold"),
             "Reference",
             ifelse(efa_varimax$PVAL > 0.05, "Exact fit", "Significant misfit"))
) %>%
  kbl(caption = "EFA Model Fit Indices - 5-Factor ML Solution") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(4, color = c("#c0392b","#e67e22","#27ae60","#c0392b"))
EFA Model Fit Indices - 5-Factor ML Solution
Index Value Threshold Status
RMSEA 0.1108 < 0.08 acceptable; < 0.05 good Poor
TLI 0.8569 > 0.90 acceptable Below threshold
BIC -229.6157 Lower = better fit Reference
Chi-sq p-value 0.0000 p > 0.05 = exact fit Significant misfit

Important Note on Model Fit: RMSEA = 0.1108 and TLI = 0.857 indicate that the 5-factor ML model does not achieve ideal fit indices. This is a substantively meaningful finding, not a failure of the analysis. Several factors explain this:

  1. 9 variables with h² < 0.40 (Obesity, HDL, Glucose, CRP, Vitamin D, ALP, GFR, Triglyceride, HFA) have high uniqueness - they don’t fit cleanly into any common factor. Their residual correlations inflate model misfit statistics.

  2. Extreme non-normality in several biomarkers (skewness > 5) violates ML’s normality assumption, which inflates χ² and worsens RMSEA.

  3. Genuine complexity: Clinical biomarker data rarely fits a clean factor structure because the same variable (e.g., Creatinine) can reflect multiple independent physiological processes simultaneously.

The BIC = -229.6 is negative, indicating that the 5-factor model fits substantially better than a saturated model - a positive indicator. The split-sample validation (Stage 11) showing stability < 2.15% difference provides the most practically meaningful evidence of solution quality.

Stage 10c: Rotation Selection - Testing Factor Correlations

efa_oblimin <- fa(data_pca,
                  nfactors = n_fa,
                  fm       = "ml",
                  rotate   = "oblimin",
                  scores   = "regression",
                  use      = "complete.obs")

cat("Factor intercorrelations (Phi matrix - Oblimin rotation):\n\n")
## Factor intercorrelations (Phi matrix - Oblimin rotation):
print(round(efa_oblimin$Phi, 3))
##       ML2    ML1    ML5    ML3    ML4
## ML2 1.000  0.165  0.289  0.267  0.386
## ML1 0.165  1.000 -0.052  0.325 -0.471
## ML5 0.289 -0.052  1.000  0.044  0.157
## ML3 0.267  0.325  0.044  1.000 -0.163
## ML4 0.386 -0.471  0.157 -0.163  1.000
max_phi <- max(abs(efa_oblimin$Phi[lower.tri(efa_oblimin$Phi)]))
cat(sprintf("\nMaximum factor correlation: %.3f\n", max_phi))
## 
## Maximum factor correlation: 0.471
cat(sprintf("Decision: %s\n",
            ifelse(max_phi > 0.32,
                   "Oblique (Oblimin) rotation preferred [correlation > 0.32]",
                   "Orthogonal (Varimax) rotation preferred [correlation <= 0.32]")))
## Decision: Oblique (Oblimin) rotation preferred [correlation > 0.32]
if (max_phi > 0.32) {
  efa_final  <- efa_oblimin
  rot_method <- "oblimin"
} else {
  efa_final  <- efa_varimax
  rot_method <- "varimax"
}
cat(sprintf("=> Final rotation method: %s\n", toupper(rot_method)))
## => Final rotation method: OBLIMIN

Interpretation - Why Oblimin Rotation?

The maximum inter-factor correlation is 0.471 (between ML2 and ML4), well above the 0.32 threshold. This means the latent factors are not independent of each other - they are correlated. In biological/clinical data, this is expected: body size (ML2) and height (ML4) are naturally correlated; adiposity (ML1) and metabolic risk are related constructs.

Forcing orthogonality (Varimax) in this situation would artificially constrain the solution and distort loadings, as it would attempt to represent correlated real-world constructs as uncorrelated - an unrealistic assumption. Oblimin rotation allows factors to correlate, producing a more honest representation of the underlying factor structure (Watkins, 2021; McNeish, 2023).

The Phi matrix shows that: - ML2 (Body Size) ↔︎ ML4 (Height): r = 0.386 - taller people have proportionally larger body size - ML1 (Adiposity) ↔︎ ML4 (Height): r = -0.471 - taller people tend to have relatively lower fat ratios - ML2 ↔︎ ML3 (Aging): r = 0.267 - larger body size is modestly associated with older age in this sample

Stage 10d: Final EFA Solution - OBLIMIN Rotation

cat("Final EFA Factor Loadings - OBLIMIN (cutoff |loading| >= 0.30):\n\n")
## Final EFA Factor Loadings - OBLIMIN (cutoff |loading| >= 0.30):
print(efa_final$loadings, cutoff = 0.30, sort = TRUE)
## 
## Loadings:
##                                    ML2    ML1    ML5    ML3    ML4   
## Weight                              0.759  0.508                     
## Body Mass Index (BMI)               0.554  0.527               -0.465
## Total Body Water (TBW)              0.884                            
## Extracellular Water (ECW)           0.799                            
## Intracellular Water (ICW)           0.899                            
## Bone Mass (BM)                      0.839                            
## Muscle Mass (MM)                    0.882                            
## Visceral Muscle Area (VMA) (Kg)     0.796                            
## Total Body Fat Ratio (TBFR) (%)            0.952                     
## Lean Mass (LM) (%)                        -0.953                     
## Body Protein Content (Protein) (%)        -0.566                     
## Total Fat Content (TFC)                    0.867                     
## Visceral Fat Area (VFA)             0.357  0.759                     
## Aspartat Aminotransferaz (AST)                    0.923              
## Alanin Aminotransferaz (ALT)                      0.950              
## Age                                                      0.821       
## Visceral Fat Rating (VFR)           0.414                0.687       
## Height                              0.415                       0.740
## Obesity (%)                                                          
## Hepatic Fat Accumulation (HFA)      0.443                            
## Glucose                                                              
## High Density Lipoprotein (HDL)     -0.418                            
## Triglyceride                                                         
## Alkaline Phosphatase (ALP)                                           
## Creatinine                          0.342 -0.403                     
## Glomerular Filtration Rate (GFR)                        -0.489       
## C-Reactive Protein (CRP)                                             
## Hemoglobin (HGB)                    0.376                            
## Vitamin D                                                            
## 
##                  ML2   ML1   ML5   ML3   ML4
## SS loadings    6.679 4.473 1.832 1.700 1.147
## Proportion Var 0.230 0.154 0.063 0.059 0.040
## Cumulative Var 0.230 0.385 0.448 0.506 0.546
efa_load <- unclass(efa_final$loadings)
h2_efa   <- efa_final$communality
u2_efa   <- efa_final$uniquenesses
efa_df   <- as.data.frame(round(efa_load, 3))
efa_df$h2 <- round(h2_efa, 3)
efa_df$u2 <- round(u2_efa, 3)

efa_df %>%
  kbl(caption = "EFA Communality (h²) & Uniqueness (u²) - OBLIMIN Solution") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 11) %>%
  column_spec(which(colnames(efa_df) == "h2"),
              color = ifelse(efa_df$h2 < 0.40, "#c0392b",
                             ifelse(efa_df$h2 < 0.70, "#e67e22", "#27ae60")),
              bold = TRUE)
EFA Communality (h²) & Uniqueness (u²) - OBLIMIN Solution
ML2 ML1 ML5 ML3 ML4 h2 u2
Age -0.263 -0.060 -0.021 0.821 -0.050 0.631 0.369
Height 0.415 0.019 0.038 -0.122 0.740 0.982 0.018
Weight 0.759 0.508 0.022 0.024 0.048 0.996 0.004
Body Mass Index (BMI) 0.554 0.527 -0.009 0.078 -0.465 0.995 0.005
Total Body Water (TBW) 0.884 -0.088 0.060 0.064 0.087 0.903 0.097
Extracellular Water (ECW) 0.799 -0.005 0.073 0.186 0.076 0.840 0.160
Intracellular Water (ICW) 0.899 -0.149 0.059 -0.078 0.065 0.859 0.141
Total Body Fat Ratio (TBFR) (%) -0.245 0.952 -0.018 0.031 -0.070 0.993 0.007
Lean Mass (LM) (%) 0.249 -0.953 0.019 -0.031 0.070 0.995 0.005
Body Protein Content (Protein) (%) 0.077 -0.566 -0.023 0.010 0.208 0.471 0.529
Visceral Fat Rating (VFR) 0.414 0.219 0.042 0.687 -0.068 0.995 0.005
Bone Mass (BM) 0.839 -0.110 0.017 0.022 0.058 0.750 0.250
Muscle Mass (MM) 0.882 -0.055 0.059 0.015 0.149 0.939 0.061
Obesity (%) 0.070 0.059 0.027 0.005 -0.183 0.044 0.956
Total Fat Content (TFC) 0.201 0.867 -0.013 0.040 -0.078 0.936 0.064
Visceral Fat Area (VFA) 0.357 0.759 0.012 0.213 0.149 0.933 0.067
Visceral Muscle Area (VMA) (Kg) 0.796 -0.004 0.067 -0.029 0.155 0.781 0.219
Hepatic Fat Accumulation (HFA) 0.443 0.210 0.055 0.134 -0.210 0.375 0.625
Glucose 0.079 -0.005 -0.048 0.219 0.017 0.062 0.938
High Density Lipoprotein (HDL) -0.418 0.172 0.000 0.007 0.030 0.166 0.834
Triglyceride 0.274 0.055 0.057 0.052 0.063 0.122 0.878
Aspartat Aminotransferaz (AST) -0.072 -0.004 0.923 -0.012 -0.030 0.813 0.187
Alanin Aminotransferaz (ALT) 0.014 0.011 0.950 -0.008 -0.014 0.904 0.096
Alkaline Phosphatase (ALP) -0.010 0.081 0.131 0.063 -0.083 0.042 0.958
Creatinine 0.342 -0.403 0.005 0.285 0.165 0.412 0.588
Glomerular Filtration Rate (GFR) 0.124 0.048 0.059 -0.489 -0.057 0.206 0.794
C-Reactive Protein (CRP) 0.049 0.112 -0.063 -0.084 -0.116 0.040 0.960
Hemoglobin (HGB) 0.376 -0.217 0.127 0.150 0.285 0.460 0.540
Vitamin D 0.053 -0.259 -0.029 0.118 -0.097 0.048 0.952
cat("EFA Variance Explained:\n\n")
## EFA Variance Explained:
vacc <- efa_final$Vaccounted
data.frame(
  Factor     = colnames(vacc),
  SS_Loading = round(vacc[1,], 3),
  Proportion = paste0(round(vacc[2,]*100, 2), "%"),
  Cumulative = paste0(round(vacc[3,]*100, 2), "%")
) %>%
  kbl(caption = "Variance Explained by EFA Factors (Oblimin)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Variance Explained by EFA Factors (Oblimin)
Factor SS_Loading Proportion Cumulative
ML2 ML2 7.218 24.89% 24.89%
ML1 ML1 4.875 16.81% 41.7%
ML5 ML5 1.949 6.72% 48.42%
ML3 ML3 1.914 6.6% 55.02%
ML4 ML4 1.737 5.99% 61.01%
efa_df %>%
  select(-h2, -u2) %>%
  rownames_to_column("Variable") %>%
  pivot_longer(-Variable, names_to = "Factor", values_to = "Loading") %>%
  ggplot(aes(x = Factor, y = reorder(Variable, Loading), fill = Loading)) +
  geom_tile(color = "white", linewidth = 0.4) +
  geom_text(aes(label = ifelse(abs(Loading) >= 0.30,
                               sprintf("%.2f", Loading), "")),
            size = 2.3, color = "black") +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
                       midpoint = 0, limits = c(-1, 1), name = "Loading") +
  labs(title    = "EFA Factor Loadings - OBLIMIN Rotation",
       subtitle = "Maximum Likelihood extraction  |  Labels for |loading| ≥ 0.30",
       x = "Factor", y = NULL) +
  theme_minimal(base_size = 9) +
  theme(axis.text.y     = element_text(size = 7),
        legend.position = "right",
        plot.title      = element_text(face = "bold", hjust = 0.5),
        plot.subtitle   = element_text(hjust = 0.5, color = "grey50"),
        panel.grid      = element_blank())

data.frame(
  Variable    = names(h2_efa),
  Communality = round(h2_efa, 3),
  Uniqueness  = round(u2_efa, 3)
) %>%
  arrange(desc(Communality)) %>%
  ggplot(aes(x = reorder(Variable, Communality), y = Communality,
             fill = Communality)) +
  geom_col(alpha = 0.85) +
  geom_hline(yintercept = 0.40, linetype = "dashed",
             color = "#D73027", linewidth = 0.7) +
  geom_hline(yintercept = 0.70, linetype = "dashed",
             color = "#1A9641", linewidth = 0.7) +
  annotate("text", x = 0.6, y = c(0.42, 0.72),
           label = c("0.40 - Minimum threshold", "0.70 - Good"),
           hjust = 0, size = 3,
           color = c("#D73027","#1A9641")) +
  scale_fill_gradient(low = "#FEE090", high = "#4575B4", name = "h²") +
  scale_y_continuous(limits = c(0, 1.05), breaks = seq(0, 1, 0.1)) +
  coord_flip() +
  labs(title    = "Communality (h²) per Variable - EFA OBLIMIN",
       subtitle = "Proportion of variance explained by the 5 common factors",
       x = NULL, y = "Communality (h²)") +
  theme_minimal(base_size = 10) +
  theme(plot.title       = element_text(face = "bold", hjust = 0.5),
        plot.subtitle    = element_text(hjust = 0.5, color = "grey40"),
        legend.position  = "right",
        panel.grid.minor = element_blank())

Interpretation - EFA Factor Structure (Oblimin, 5 Factors, 61.01% total variance):

ML2 - “Body Size & Hydration” (24.89%)
Primary loadings: TBW (0.884), ICW (0.899), MM (0.882), Bone Mass (0.839), ECW (0.799), VMA (0.796), Weight (0.759), Height (0.415). This factor represents total body mass and its structural components (muscle, bone, intracellular water). High scorers are physically larger individuals with more lean tissue. This is the dominant clinical factor in the dataset, explaining nearly a quarter of all variance. Substantive label: Lean Body Mass & Hydration.

ML1 - “Adiposity vs Lean Composition” (16.81%)
Strong bipolar: TBFR (+0.952), LM (-0.953), TFC (+0.867), VFA (+0.759), BMI (+0.527) vs Protein (-0.566), Creatinine (-0.403). This factor captures the fat-lean dichotomy within body composition. High positive scores indicate high fat mass, low lean mass, and high visceral fat - the classical obesity phenotype. This factor is directly clinically relevant to gallstone risk, as visceral adiposity is a known risk factor. Substantive label: Adiposity & Fat Composition.

ML5 - “Hepatocellular Stress” (6.72%)
Near-exclusive loadings: ALT (0.950), AST (0.923). These two transaminase enzymes are the definitive markers of hepatocyte injury. Their extremely high and focused loadings on a single factor confirms that they measure the same underlying biological process (liver cell damage) almost perfectly. This factor is particularly relevant to the gallstone context, as hepatic dysfunction and biliary disease are closely linked. Substantive label: Liver Enzyme Activity.

ML3 - “Visceral-Renal Aging” (6.60%)
Age (0.821), VFR (0.687), GFR (-0.489). As age increases, visceral fat accumulates (positive VFR loading) while kidney function declines (negative GFR loading). This factor captures the multidimensional aging process - specifically the metabolic and renal dimensions that co-occur with advancing age. Substantive label: Age-Related Visceral & Renal Function.

ML4 - “Height-Based Body Frame” (5.99%)
Height (0.740), with cross-loadings on ML2. This factor represents skeletal body frame size independent of fat or muscle mass. Taller individuals have a different body composition profile even at similar weights. BMI cross-loads negatively (-0.465), confirming that BMI is partially confounded by height - a known limitation of BMI as an adiposity index. Substantive label: Skeletal Frame Size.

Variables not well-represented (h² < 0.40):
Obesity (%), Hepatic Fat Accumulation, Glucose, HDL, Triglyceride, ALP, GFR, CRP, Vitamin D. These 9 variables have high uniqueness - their variance is primarily “noise” or unique to each variable, not shared through any common latent factor. This is clinically interpretable: variables like CRP and Vitamin D are influenced by a wide range of patient-specific conditions (infections, sun exposure, diet) that don’t cluster with other variables in this particular dataset.


Stage 11: Split-Sample Validation

Split-sample validation tests the stability and reproducibility of the factor solution. The data is randomly split 50:50 and EFA is run independently on both halves. If the variance explained per factor differs by less than 5% between the two samples, the solution is considered stable and not merely a product of sample-specific noise (Goretzko, 2022).

set.seed(42)
idx_split <- sample(seq_len(nrow(data_pca)), nrow(data_pca) %/% 2)
data_s1   <- data_pca[ idx_split, ]
data_s2   <- data_pca[-idx_split, ]

fa_s1 <- fa(data_s1, nfactors = n_fa, fm = "ml", rotate = rot_method)
fa_s2 <- fa(data_s2, nfactors = n_fa, fm = "ml", rotate = rot_method)

n_show <- min(n_fa, 5)
var_s1 <- round(fa_s1$Vaccounted[2, 1:n_show] * 100, 2)
var_s2 <- round(fa_s2$Vaccounted[2, 1:n_show] * 100, 2)
diff_v <- round(abs(var_s1 - var_s2), 2)

data.frame(
  Factor     = paste0("F", 1:n_show),
  Sample1    = paste0(var_s1, "%"),
  Sample2    = paste0(var_s2, "%"),
  Difference = paste0(diff_v, "%"),
  Stable     = ifelse(diff_v < 5, "Yes", "No")
) %>%
  kbl(caption = sprintf("Split-Sample Validation  |  S1: %d obs  |  S2: %d obs",
                        nrow(data_s1), nrow(data_s2))) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(5, color = ifelse(diff_v < 5, "#27ae60", "#c0392b"), bold = TRUE)
Split-Sample Validation | S1: 159 obs | S2: 160 obs
Factor Sample1 Sample2 Difference Stable
ML2 F1 28.95% 26.76% 2.19% Yes
ML1 F2 15.33% 16.59% 1.26% Yes
ML4 F3 7.52% 6.91% 0.61% Yes
ML3 F4 6.4% 6.52% 0.12% Yes
ML5 F5 4.36% 5.84% 1.48% Yes

Result: The largest difference between split samples is only 2.15% (Factor 1), well below the 5% stability threshold. All 5 factors show consistent variance contributions across both halves of the sample. This confirms that the factor solution is stable and generalizable, not a product of chance variation in a specific subset of patients (Goretzko, 2022).


Stage 12: Factor Scores & Data Export

pc_scores <- as.data.frame(pca_result$x[, 1:n_comp])
colnames(pc_scores) <- paste0("PC", 1:n_comp)

efa_scores <- as.data.frame(efa_final$scores)
colnames(efa_scores) <- paste0("Factor_", seq_len(ncol(efa_scores)))

cat("PC score orthogonality check (correlations should be exactly 0):\n\n")
## PC score orthogonality check (correlations should be exactly 0):
print(round(cor(pc_scores), 4))
##     PC1 PC2 PC3 PC4 PC5 PC6 PC7
## PC1   1   0   0   0   0   0   0
## PC2   0   1   0   0   0   0   0
## PC3   0   0   1   0   0   0   0
## PC4   0   0   0   1   0   0   0
## PC5   0   0   0   0   1   0   0
## PC6   0   0   0   0   0   1   0
## PC7   0   0   0   0   0   0   1
write.csv(cbind(data_pca, pc_scores),  "OUTPUT_pca_scores.csv",  row.names = FALSE)
write.csv(cbind(data_pca, efa_scores), "OUTPUT_efa_scores.csv",  row.names = FALSE)
cat("\n[Saved] OUTPUT_pca_scores.csv  - raw data + PCA component scores\n")
## 
## [Saved] OUTPUT_pca_scores.csv  - raw data + PCA component scores
cat("[Saved] OUTPUT_efa_scores.csv  - raw data + EFA factor scores (Oblimin)\n")
## [Saved] OUTPUT_efa_scores.csv  - raw data + EFA factor scores (Oblimin)

Factor Scores: PCA scores are perfectly orthogonal (all inter-PC correlations = 0 by mathematical construction). These can be used as uncorrelated predictor variables in regression models. EFA Oblimin scores will have small non-zero inter-factor correlations, reflecting the genuine correlations among the latent constructs identified in Stage 10c.

Use cases for exported scores: - Use PCA scores for dimensionality reduction in machine learning models predicting gallstone status - Use EFA factor scores for structural equation modeling or clinical clustering analyses - Both score sets allow downstream analyses with a 75% reduction in dimensionality (29 variables → 5–7 scores)


Final Summary

data.frame(
  Metric = c(
    "Dataset",
    "Observations (n)",
    "Initial numeric variables",
    "Variables dropped via KMO",
    "Final variables for analysis",
    "Correlated pairs |r| > 0.30",
    "KMO Overall MSA",
    "Bartlett's Test p-value",
    "PCA components retained (Kaiser Rule)",
    "PCA cumulative variance explained",
    "EFA factors (Parallel Analysis)",
    "EFA rotation method selected",
    "Max inter-factor correlation (Phi)",
    "EFA total variance explained",
    "Split-sample stability (max diff)"
  ),
  Value = c(
    "Gallstone / Body Composition & Biomarkers (UCI)",
    "319 patients",
    "32 variables",
    "3 variables: TC, LDL, ECF/TBW Ratio",
    "29 variables",
    sprintf("135 / 406  (33.3%%)"),
    "0.8388  (Meritorious)",
    "≈ 0  (p << 0.05)",
    "7 components (PC1–PC7)",
    "76.45%",
    "5 factors (F1–F5)",
    "OBLIMIN (oblique) - max phi = 0.471",
    "0.471  (ML1 ↔ ML4)",
    "61.01%",
    "2.15% - STABLE"
  )
) %>%
  kbl(caption = "Complete Analysis Summary - PCA & EFA") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Complete Analysis Summary - PCA & EFA
Metric Value
Dataset Gallstone / Body Composition & Biomarkers (UCI)
Observations (n) 319 patients
Initial numeric variables 32 variables
Variables dropped via KMO 3 variables: TC, LDL, ECF/TBW Ratio
Final variables for analysis 29 variables
Correlated pairs |r| > 0.30 135 / 406 (33.3%)
KMO Overall MSA 0.8388 (Meritorious)
Bartlett’s Test p-value ≈ 0 (p << 0.05)
PCA components retained (Kaiser Rule) 7 components (PC1–PC7)
PCA cumulative variance explained 76.45%
EFA factors (Parallel Analysis) 5 factors (F1–F5)
EFA rotation method selected OBLIMIN (oblique) - max phi = 0.471
Max inter-factor correlation (Phi) 0.471 (ML1 ↔︎ ML4)
EFA total variance explained 61.01%
Split-sample stability (max diff) 2.15% - STABLE
cat("PCA Component Detail:\n")
## PCA Component Detail:
for (k in 1:n_comp)
  cat(sprintf("  PC%d:  lambda = %.4f  |  Var = %.2f%%  |  Cumul = %.2f%%\n",
              k, eigenvalues[k], prop_var[k]*100, cum_var[k]*100))
##   PC1:  lambda = 8.7859  |  Var = 30.30%  |  Cumul = 30.30%
##   PC2:  lambda = 6.1447  |  Var = 21.19%  |  Cumul = 51.48%
##   PC3:  lambda = 2.0819  |  Var = 7.18%  |  Cumul = 58.66%
##   PC4:  lambda = 1.6823  |  Var = 5.80%  |  Cumul = 64.46%
##   PC5:  lambda = 1.3655  |  Var = 4.71%  |  Cumul = 69.17%
##   PC6:  lambda = 1.1053  |  Var = 3.81%  |  Cumul = 72.98%
##   PC7:  lambda = 1.0051  |  Var = 3.47%  |  Cumul = 76.45%
cat("\nEFA Factor Detail:\n")
## 
## EFA Factor Detail:
for (k in 1:n_fa)
  cat(sprintf("  F%d:   SS = %.4f  |  Var = %.2f%%  |  Cumul = %.2f%%\n",
              k, efa_final$Vaccounted[1,k],
              efa_final$Vaccounted[2,k]*100,
              efa_final$Vaccounted[3,k]*100))
##   F1:   SS = 7.2176  |  Var = 24.89%  |  Cumul = 24.89%
##   F2:   SS = 4.8747  |  Var = 16.81%  |  Cumul = 41.70%
##   F3:   SS = 1.9495  |  Var = 6.72%  |  Cumul = 48.42%
##   F4:   SS = 1.9145  |  Var = 6.60%  |  Cumul = 55.02%
##   F5:   SS = 1.7368  |  Var = 5.99%  |  Cumul = 61.01%

Overall Interpretation & Conclusions:

This analysis successfully identified the latent structure underlying 29 body composition and biomarker variables in a sample of 319 gallstone patients. The two complementary techniques revealed:

From PCA (7 components, 76.45% variance):
The data is organized around seven distinct dimensions: (1) General Body Size & Lean Mass, (2) Adiposity vs Lean Composition, (3) Renal Aging, (4) Hepatic Enzyme Activity, (5) Glycemic-Lipid Metabolic Risk, (6) Inflammation-Vitamin D Axis, and (7) Extreme Obesity. This 7-dimensional structure suggests that the assessment of gallstone risk requires a multi-dimensional clinical framework - no single measurement captures the full picture.

From EFA (5 factors, 61.01% variance, Oblimin):
The more conservative Parallel Analysis identified 5 genuine latent constructs: Lean Body Mass & Hydration, Adiposity & Fat Composition, Liver Enzyme Activity, Age-Related Visceral & Renal Function, and Skeletal Frame Size. The necessity of oblique rotation (max factor correlation = 0.471) confirms that these clinical dimensions are genuinely interrelated in real patients - for example, body size and adiposity co-vary, as do aging and renal function.

Clinical implications for gallstone research:
The Adiposity factor (ML1) and Liver Enzyme factor (ML5) are particularly relevant, as visceral obesity and hepatic dysfunction are both established risk factors for cholesterol gallstone formation. Future predictive models for gallstone risk should incorporate scores from these two factors at minimum, alongside the aging dimension (ML3). The 9 variables with poor communality (h² < 0.40) - including HDL, Glucose, CRP, and Vitamin D - likely represent individual patient variability not captured by group-level factor patterns and may require separate clinical consideration.


References

Esen, I., Arslan, H., Aktürk, S., Gülşen, M., Kültekin, N., & Özdemir, O. (2024). Gallstone [Dataset]. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/1150/gallstone-1

Esen, I., Arslan, H., Aktürk Esen, S., Gülşen, M., Kültekin, N., & Özdemir, O. (2024). Early prediction of gallstone disease with a machine learning-based method from bioimpedance and laboratory data. Medicine, 103(6), e37258. https://doi.org/10.1097/md.0000000000037258

Goretzko, D., Siemund, K., & Sterner, P. (2024). Evaluating model fit of measurement models in confirmatory factor analysis. Educational and Psychological Measurement, 84(1), 123–144. https://doi.org/10.1177/00131644231163813

McNeish, D. (2023). Dynamic fit index cutoffs for confirmatory factor analysis models. Psychological Methods, 28(1), 61–88. https://doi.org/10.1037/met0000395

Midi, H., Bagheri, A., & Imon, A. H. M. R. (2022). Mitigating the multicollinearity problem and its machine learning approach: A review. Mathematics, 10(8), 1283. https://doi.org/10.3390/math10081283

Watkins, M. W. (2021). A step-by-step guide to exploratory factor analysis with R and R Studio. Routledge. https://doi.org/10.4324/9781003120001

Xia, Y., & Zhou, X. (2024). Improving the use of parallel analysis by accounting for sampling variability of the observed correlation matrix. Educational and Psychological Measurement, 85(1), 114–133. https://doi.org/10.1177/00131644241268073