This dataset originates from the UCI Machine Learning Repository and contains clinical records of patients evaluated for gallstone disease. The data was collected to identify physiological and biochemical patterns potentially associated with gallstone formation, through a combination of body composition measurements (via Bioelectrical Impedance Analysis) and standard blood biomarker panels.
| Attribute | Details |
|---|---|
| Source | UCI Machine Learning Repository |
| Clinical context | Gallstone disease evaluation |
| Total observations | 319 patients |
| Total variables (raw) | 40 (numeric + categorical) |
| Numeric variables analyzed | 32 (after excluding categoricals and binary variables) |
| Final variables after KMO selection | 29 |
Seven categorical variables were excluded prior to PCA/EFA, as both techniques require continuous numeric data:
| Variable | Description |
|---|---|
| Gallstone Status | Diagnosis label (target variable) |
| Gender | Patient sex |
| Comorbidity | General comorbidity status |
| CAD | Coronary Artery Disease |
| Hypothyroidism | Hypothyroidism diagnosis |
| Hyperlipidemia | Hyperlipidemia diagnosis |
| DM | Diabetes Mellitus |
Variables fall into two functional groups:
Group 1 - Body Composition (BCA via BIA)
Measured using Bioelectrical Impedance Analysis (BIA), a non-invasive
method estimating fluid distribution, fat mass, and lean mass: Age,
Height, Weight, BMI, Total Body Water (TBW), Extracellular Water (ECW),
Intracellular Water (ICW), ECF/TBW Ratio, Total Body Fat Ratio (TBFR),
Lean Mass (LM), Body Protein Content, Visceral Fat Rating (VFR), Bone
Mass (BM), Muscle Mass (MM), Obesity (%), Total Fat Content (TFC),
Visceral Fat Area (VFA), Visceral Muscle Area (VMA), Hepatic Fat
Accumulation (HFA)
Group 2 - Blood Biomarkers
Standard laboratory panel reflecting metabolic, hepatic, renal, and
inflammatory function: Glucose, Total Cholesterol (TC), LDL, HDL,
Triglyceride, AST, ALT, ALP, Creatinine, GFR, CRP, Hemoglobin (HGB),
Vitamin D
Clinical Relevance: Body composition and blood biomarkers are closely interrelated. Visceral obesity influences lipid profiles, insulin resistance, and hepatic function. This complex, intercorrelated variable set makes PCA and EFA ideal tools for uncovering the underlying latent structure empirically.
library(readxl)
library(psych)
library(corrplot)
library(ggplot2)
library(dplyr)
library(factoextra)
library(tidyr)
library(tibble)
library(GPArotation)
library(ggrepel)
library(knitr)
library(kableExtra)
out_path <- getwd()data_raw <- read_excel("dataset-uci.xlsx")
categorical_vars <- c("Gallstone Status", "Gender", "Comorbidity",
"CAD", "Hypothyroidism", "Hyperlipidemia", "DM")
data_num <- data_raw %>%
select(-any_of(categorical_vars)) %>%
select(where(is.numeric)) %>%
select(where(~ n_distinct(.) > 2)) %>%
na.omit()
n <- nrow(data_num)
p <- ncol(data_num)Before running PCA or EFA, sample adequacy must be verified. Two criteria are evaluated: the minimum sample size and the observations-to-variables ratio. Both directly affect the stability and generalizability of extracted components or factors.
Criteria evaluated:
## Observations : 319
## Variables : 32
## Ratio : 9.97:1 [PASSED]
Result: With 319 observations and 32 variables, the ratio is 9.97:1 - nearly ideal at 10:1. This sample size is fully sufficient for stable factor extraction and generalization of results (Goretzko & Bühner, 2020).
Descriptive statistics provide an overview of the distribution of each variable, including skewness and kurtosis. These are particularly important when using Maximum Likelihood (ML) extraction in EFA, which formally assumes multivariate normality. Severe departures from normality can inflate factor loadings and distort communality estimates.
desc <- data.frame(
Variable = colnames(data_num),
Mean = round(sapply(data_num, mean, na.rm = TRUE), 3),
SD = round(sapply(data_num, sd, na.rm = TRUE), 3),
Min = round(sapply(data_num, min, na.rm = TRUE), 3),
Max = round(sapply(data_num, max, na.rm = TRUE), 3),
Skewness = round(sapply(data_num, psych::skew), 3),
Kurtosis = round(sapply(data_num, psych::kurtosi), 3)
)
desc %>%
kbl(caption = "Descriptive Statistics - All Numeric Variables") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 12) %>%
column_spec(6, color = ifelse(abs(desc$Skewness) > 2, "#c0392b", "#27ae60"),
bold = ifelse(abs(desc$Skewness) > 2, TRUE, FALSE)) %>%
column_spec(7, color = ifelse(abs(desc$Kurtosis) > 7, "#c0392b", "#27ae60"),
bold = ifelse(abs(desc$Kurtosis) > 7, TRUE, FALSE))| Variable | Mean | SD | Min | Max | Skewness | Kurtosis | |
|---|---|---|---|---|---|---|---|
| Age | Age | 48.069 | 12.115 | 20.00 | 96.00 | 0.133 | 0.349 |
| Height | Height | 167.157 | 10.053 | 145.00 | 191.00 | -0.080 | -0.701 |
| Weight | Weight | 80.565 | 15.709 | 42.90 | 143.50 | 0.428 | 0.315 |
| Body Mass Index (BMI) | Body Mass Index (BMI) | 28.877 | 5.314 | 17.40 | 49.70 | 0.663 | 1.133 |
| Total Body Water (TBW) | Total Body Water (TBW) | 40.588 | 7.930 | 13.00 | 66.20 | 0.206 | -0.378 |
| Extracellular Water (ECW) | Extracellular Water (ECW) | 17.071 | 3.162 | 9.00 | 27.80 | 0.020 | -0.371 |
| Intracellular Water (ICW) | Intracellular Water (ICW) | 23.634 | 5.349 | 13.80 | 57.10 | 0.943 | 3.468 |
| Extracellular Fluid/Total Body Water (ECF/TBW) | Extracellular Fluid/Total Body Water (ECF/TBW) | 42.212 | 3.244 | 29.23 | 52.00 | -0.509 | 1.374 |
| Total Body Fat Ratio (TBFR) (%) | Total Body Fat Ratio (TBFR) (%) | 28.275 | 8.444 | 6.30 | 50.92 | 0.133 | -0.525 |
| Lean Mass (LM) (%) | Lean Mass (LM) (%) | 71.638 | 8.438 | 48.99 | 93.67 | -0.124 | -0.507 |
| Body Protein Content (Protein) (%) | Body Protein Content (Protein) (%) | 15.939 | 2.335 | 5.56 | 24.81 | -0.047 | 1.820 |
| Visceral Fat Rating (VFR) | Visceral Fat Rating (VFR) | 9.078 | 4.333 | 1.00 | 31.00 | 0.790 | 2.134 |
| Bone Mass (BM) | Bone Mass (BM) | 2.803 | 0.509 | 1.40 | 4.00 | 0.205 | -0.825 |
| Muscle Mass (MM) | Muscle Mass (MM) | 54.273 | 10.604 | 4.70 | 78.80 | -0.102 | 0.359 |
| Obesity (%) | Obesity (%) | 35.850 | 109.800 | 0.40 | 1954.00 | 16.715 | 288.973 |
| Total Fat Content (TFC) | Total Fat Content (TFC) | 23.488 | 9.608 | 3.10 | 62.50 | 0.802 | 1.115 |
| Visceral Fat Area (VFA) | Visceral Fat Area (VFA) | 12.172 | 5.262 | 0.90 | 41.00 | 1.048 | 2.919 |
| Visceral Muscle Area (VMA) (Kg) | Visceral Muscle Area (VMA) (Kg) | 30.403 | 4.461 | 18.90 | 41.10 | -0.056 | -0.473 |
| Hepatic Fat Accumulation (HFA) | Hepatic Fat Accumulation (HFA) | 1.150 | 1.059 | 0.00 | 4.00 | 0.174 | -1.357 |
| Glucose | Glucose | 108.689 | 44.849 | 69.00 | 575.00 | 5.882 | 45.419 |
| Total Cholesterol (TC) | Total Cholesterol (TC) | 203.495 | 45.759 | 60.00 | 360.00 | 0.430 | 0.485 |
| Low Density Lipoprotein (LDL) | Low Density Lipoprotein (LDL) | 126.652 | 38.541 | 11.00 | 293.00 | 0.535 | 1.080 |
| High Density Lipoprotein (HDL) | High Density Lipoprotein (HDL) | 49.476 | 17.719 | 25.00 | 273.00 | 6.468 | 77.178 |
| Triglyceride | Triglyceride | 144.502 | 97.904 | 1.39 | 838.00 | 2.760 | 12.649 |
| Aspartat Aminotransferaz (AST) | Aspartat Aminotransferaz (AST) | 21.685 | 16.698 | 8.00 | 195.00 | 6.929 | 60.062 |
| Alanin Aminotransferaz (ALT) | Alanin Aminotransferaz (ALT) | 26.856 | 27.884 | 3.00 | 372.00 | 7.209 | 76.924 |
| Alkaline Phosphatase (ALP) | Alkaline Phosphatase (ALP) | 73.113 | 24.181 | 7.00 | 197.00 | 0.788 | 2.507 |
| Creatinine | Creatinine | 0.801 | 0.176 | 0.46 | 1.46 | 0.615 | 0.135 |
| Glomerular Filtration Rate (GFR) | Glomerular Filtration Rate (GFR) | 100.819 | 16.971 | 10.60 | 132.00 | -1.796 | 6.391 |
| C-Reactive Protein (CRP) | C-Reactive Protein (CRP) | 1.854 | 4.990 | 0.00 | 43.40 | 5.360 | 33.233 |
| Hemoglobin (HGB) | Hemoglobin (HGB) | 14.418 | 1.776 | 8.50 | 18.80 | -0.381 | 0.133 |
| Vitamin D | Vitamin D | 21.401 | 9.982 | 3.50 | 53.10 | 0.276 | -0.238 |
Interpretation: Variables are highlighted in red when |Skewness| > 2 or |Kurtosis| > 7, indicating substantial departure from normality.
Most body composition variables exhibit approximately symmetric distributions (skewness ≈ 0), which is expected in clinical cohort data. However, several blood biomarkers show extreme non-normality:
These non-normal biomarkers are clinically meaningful but analytically problematic. Their low communalities in EFA are a direct consequence of high uniqueness (varians unik) driven by extreme observations.
The correlation matrix is the input for both PCA and EFA. For these techniques to be meaningful, there must be sufficient intercorrelation among variables - both techniques decompose this shared variance into latent dimensions. A minimum of ~25–30% of variable pairs with |r| > 0.30 is considered sufficient evidence of factorability (Watkins, 2021).
mat_corr <- cor(data_num, use = "complete.obs")
high_corr <- which(abs(mat_corr) > 0.7 & upper.tri(mat_corr), arr.ind = TRUE)
cat(sprintf("Pairs with |r| > 0.70: %d out of %d total pairs (%.1f%%)\n",
nrow(high_corr),
ncol(mat_corr)*(ncol(mat_corr)-1)/2,
nrow(high_corr) / (ncol(mat_corr)*(ncol(mat_corr)-1)/2) * 100))## Pairs with |r| > 0.70: 39 out of 496 total pairs (7.9%)
##
## Top 10 strongest correlations:
for (i in seq_len(min(10, nrow(high_corr)))) {
r1 <- rownames(mat_corr)[high_corr[i, 1]]
r2 <- colnames(mat_corr)[high_corr[i, 2]]
cat(sprintf(" %s <--> %s r = %.3f\n", r1, r2,
mat_corr[high_corr[i, 1], high_corr[i, 2]]))
}## Weight <--> Body Mass Index (BMI) r = 0.791
## Height <--> Total Body Water (TBW) r = 0.713
## Weight <--> Total Body Water (TBW) r = 0.767
## Weight <--> Extracellular Water (ECW) r = 0.789
## Total Body Water (TBW) <--> Extracellular Water (ECW) r = 0.904
## Height <--> Intracellular Water (ICW) r = 0.739
## Total Body Water (TBW) <--> Intracellular Water (ICW) r = 0.817
## Extracellular Water (ECW) <--> Intracellular Water (ICW) r = 0.784
## Body Mass Index (BMI) <--> Total Body Fat Ratio (TBFR) (%) r = 0.753
## Body Mass Index (BMI) <--> Lean Mass (LM) (%) r = -0.752
## ... and 29 more pairs with |r| > 0.70
corrplot(mat_corr,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 0.38,
tl.cex = 0.52,
tl.col = "black",
tl.srt = 45,
col = colorRampPalette(
c("#053061","#2166AC","#F7F7F7","#D6604D","#67001F"))(200),
title = "Correlation Matrix - Numerical Variables",
mar = c(0, 0, 3, 0))Interpretation: There are 39 variable pairs with |r| > 0.70, representing strong multicollinearity that confirms rich latent structure suitable for PCA/EFA.
Three distinct correlation clusters are visible in the heatmap:
Cluster 1 - Body Fluid & Lean Mass (top-left block, deep
red):
TBW ↔︎ ECW (r = 0.904), TBW ↔︎ ICW (r = 0.817), ECW ↔︎ ICW (r = 0.784),
Weight ↔︎ TBW (r = 0.767), Weight ↔︎ ECW (r = 0.789). These variables all
measure aspects of the same biological phenomenon - the distribution of
body water and lean mass. Their strong intercorrelations are
physiologically expected: larger bodies have more total water, and water
is compartmentalized across intracellular and extracellular spaces
proportionally.
Cluster 2 - Adiposity vs Lean Composition (middle block, high
contrast):
BMI ↔︎ TBFR (r = 0.753), BMI ↔︎ LM (r = -0.752), Weight ↔︎ BMI (r = 0.791).
The negative correlation between BMI and Lean Mass reflects the
fundamental biological trade-off: as the proportion of body fat
increases, lean mass proportion necessarily decreases. This contrast
will manifest as a strong bipolar second component in PCA.
Cluster 3 - Biomarker variables (bottom-right, lighter
colors):
Biomarkers generally show weaker intercorrelations (|r| mostly <
0.50), with the notable exception of AST ↔︎ ALT (both liver enzymes),
Glucose ↔︎ Triglyceride (metabolic syndrome axis), and Age ↔︎ GFR (renal
aging). The overall weaker correlations in this group partially explain
why EFA extracts fewer factors from biomarkers than from body
composition variables.
The Kaiser-Meyer-Olkin (KMO) Test measures the Measure of Sampling Adequacy (MSA) - the proportion of variance among variables that is common variance (shared through latent factors) rather than unique variance. KMO evaluates whether the correlation patterns are compact enough to support factor analysis.
The MSA for each variable is computed as:
\[MSA_j = \frac{\sum_{k \neq j} r_{jk}^2}{\sum_{k \neq j} r_{jk}^2 + \sum_{k \neq j} a_{jk}^2}\]
where \(r_{jk}\) is the bivariate correlation and \(a_{jk}\) is the partial correlation between variables j and k controlling for all others. An MSA close to 1.0 means partial correlations are very small relative to bivariate correlations - indicating that variable j shares its variance with others through common factors. Variables with MSA < 0.50 are considered “unfit” and are removed iteratively.
KMO Classification (Kaiser, 1974; Watkins, 2021):
| MSA Range | Interpretation |
|---|---|
| ≥ 0.90 | Marvelous |
| 0.80 – 0.89 | Meritorious |
| 0.70 – 0.79 | Middling |
| 0.60 – 0.69 | Mediocre |
| 0.50 – 0.59 | Miserable |
| < 0.50 | Unacceptable - remove variable |
data_iter <- data_num
dropped_vars <- c()
iter <- 0
repeat {
iter <- iter + 1
kmo_res <- KMO(data_iter)
msa_vec <- kmo_res$MSAi
min_msa <- min(msa_vec)
min_var <- names(which.min(msa_vec))
if (min_msa >= 0.50) break
cat(sprintf("Iter %d: removing '%s' (MSA = %.4f)\n", iter, min_var, min_msa))
dropped_vars <- c(dropped_vars, min_var)
data_iter <- data_iter[, colnames(data_iter) != min_var]
}## Iter 1: removing 'Total Cholesterol (TC)' (MSA = 0.4001)
## Iter 2: removing 'Low Density Lipoprotein (LDL)' (MSA = 0.3743)
## Iter 3: removing 'Extracellular Fluid/Total Body Water (ECF/TBW)' (MSA = 0.4832)
kmo_final <- KMO(data_iter)
msa_final <- data.frame(
Variable = names(kmo_final$MSAi),
MSA = round(as.numeric(kmo_final$MSAi), 4)
)
low_final <- msa_final[msa_final$MSA < 0.50, ]
mat_final <- cor(data_iter, use = "complete.obs")
kmo_cat <- switch(
as.character(findInterval(kmo_final$MSA, c(0.5, 0.6, 0.7, 0.8, 0.9))),
"0" = "Unacceptable", "1" = "Miserable", "2" = "Mediocre",
"3" = "Middling", "4" = "Meritorious", "5" = "Marvelous"
)
cat(sprintf("\nFinal Overall KMO MSA : %.4f (%s)\n", kmo_final$MSA, kmo_cat))##
## Final Overall KMO MSA : 0.8388 (Meritorious)
cat(sprintf("Variables dropped : %d {%s}\n",
length(dropped_vars),
paste(dropped_vars, collapse = ", ")))## Variables dropped : 3 {Total Cholesterol (TC), Low Density Lipoprotein (LDL), Extracellular Fluid/Total Body Water (ECF/TBW)}
## Variables retained : 29
ggplot(
msa_final %>% mutate(
Short = abbreviate(Variable, minlength = 10),
Category = case_when(
MSA < 0.50 ~ "< 0.50",
MSA < 0.70 ~ "0.50-0.70",
TRUE ~ ">= 0.70"
)
),
aes(x = reorder(Short, MSA), y = MSA, fill = Category)
) +
geom_col(width = 0.75, alpha = 0.9) +
geom_hline(yintercept = 0.50, linetype = "dashed", color = "#D73027", linewidth = 0.7) +
geom_hline(yintercept = 0.70, linetype = "dashed", color = "#FC8D59", linewidth = 0.7) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "#1A9641", linewidth = 0.7) +
annotate("text", x = 0.6, y = c(0.52, 0.72, 0.82),
label = c("0.50 - Acceptable", "0.70 - Good", "0.80 - Great"),
hjust = 0, size = 3, color = c("#D73027","#FC8D59","#1A9641")) +
scale_fill_manual(values = c("< 0.50" = "#D73027",
"0.50-0.70" = "#FEE090",
">= 0.70" = "#4575B4")) +
coord_flip() +
scale_y_continuous(limits = c(0, 1.05), breaks = seq(0, 1, 0.1)) +
labs(title = "KMO Per-Variable MSA (Post-Iteration)",
x = NULL, y = "MSA Value", fill = "MSA Category") +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold", hjust = 0.5),
panel.grid.minor = element_blank())Interpretation - Why were these three variables removed?
Iteration 1 - Total Cholesterol (TC, MSA =
0.4001):
TC correlates strongly with LDL (which was also eventually removed) but
only weakly with the body composition variables that dominate the
dataset. The high partial correlations of TC relative to its bivariate
correlations indicate that TC’s relationship with other variables is not
mediated through a common latent factor - it behaves more like an
“isolated island” in the correlation network.
Iteration 2 - LDL (MSA = 0.3743):
After TC was removed, LDL - which was heavily co-linear with TC - became
the most isolated variable. LDL’s correlations with the remaining body
composition variables and other biomarkers are weak and inconsistent,
without a common factor structure tying them together.
Iteration 3 - ECF/TBW Ratio (MSA = 0.4832):
This is a derived variable computed as ECF divided by TBW.
Since both ECF and TBW are already present in the dataset, the ratio
creates a mathematical dependency that inflates partial correlations.
Derived ratios often behave poorly in factor analysis precisely because
their variance is algebraically constrained by their components.
Final result: KMO Overall = 0.8388 (Meritorious) - an excellent level. Virtually all 29 retained variables have MSA ≥ 0.70, confirming that each variable shares substantial variance with the others through common latent factors. The analysis is fully supported.
Bartlett’s Test of Sphericity tests the null hypothesis that the population correlation matrix is an identity matrix (i.e., all off-diagonal correlations = 0). If H₀ cannot be rejected, there are no significant correlations among variables and PCA/EFA is meaningless. The test statistic follows a chi-square distribution with df = p(p-1)/2.
Note: This test is sensitive to sample size - with n = 319, even weak correlations will be statistically significant. The KMO test is therefore a more substantively meaningful indicator of factorability.
bartlett_final <- cortest.bartlett(mat_final, n = n, diag = TRUE)
cat(sprintf("Chi-square : %.3f\n", bartlett_final$chisq))## Chi-square : 11645.016
## df : 406
cat(sprintf("p-value : %s [%s]\n",
format(bartlett_final$p.value, scientific = TRUE),
ifelse(bartlett_final$p.value < 0.05, "PASSED", "FAILED")))## p-value : 0e+00 [PASSED]
Result: χ²(406) = 11,645.016, p ≈ 0. The null hypothesis is decisively rejected. The correlation matrix is definitively non-spherical, confirming that substantial shared variance exists among variables - a prerequisite for meaningful factor extraction.
p_final <- ncol(data_iter)
mat_abs <- abs(mat_final); diag(mat_abs) <- 0
n_pairs <- p_final * (p_final - 1) / 2
n_sig <- sum(mat_abs > 0.3) / 2
pct_sig <- round(n_sig / n_pairs * 100, 1)
assumption_df <- data.frame(
Assumption = c(
"Sample size (n ≥ 50)",
"Obs-to-variable ratio (≥ 5:1)",
"Correlation pairs |r| > 0.30",
"KMO Overall MSA (≥ 0.50)",
"Per-variable MSA (all ≥ 0.50)",
"Bartlett's Test (p < 0.05)"
),
Result = c(
as.character(n),
paste0(round(n / p_final, 2), ":1"),
sprintf("%d / %d (%.1f%%)", as.integer(n_sig), n_pairs, pct_sig),
sprintf("%.4f (%s)", kmo_final$MSA, kmo_cat),
ifelse(nrow(low_final) == 0, "All passed",
paste0(nrow(low_final), " variable(s) below 0.50")),
format(bartlett_final$p.value, digits = 3, scientific = TRUE)
),
Status = c(
ifelse(n >= 50, "PASSED", "FAILED"),
ifelse((n / p_final) >= 5, "PASSED", "FAILED"),
ifelse(pct_sig >= 25, "PASSED", "FAILED"),
ifelse(kmo_final$MSA >= 0.50, "PASSED", "FAILED"),
ifelse(nrow(low_final) == 0, "PASSED", "FAILED"),
ifelse(bartlett_final$p.value < 0.05, "PASSED", "FAILED")
)
)
assumption_df %>%
kbl(caption = "Assumption Test Summary - PCA/EFA Prerequisites") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE) %>%
column_spec(3,
color = ifelse(assumption_df$Status == "PASSED", "#27ae60", "#c0392b"),
bold = TRUE)| Assumption | Result | Status |
|---|---|---|
| Sample size (n ≥ 50) | 319 | PASSED |
| Obs-to-variable ratio (≥ 5:1) | 11:1 | PASSED |
| Correlation pairs |r| > 0.30 | 135 / 406 (33.3%) | PASSED |
| KMO Overall MSA (≥ 0.50) | 0.8388 (Meritorious) | PASSED |
| Per-variable MSA (all ≥ 0.50) | All passed | PASSED |
| Bartlett’s Test (p < 0.05) | 0e+00 | PASSED |
All 6 prerequisites are satisfied. The dataset demonstrates excellent factorability: a meritorious KMO (0.8388), 33.3% of variable pairs with |r| > 0.30, and an overwhelmingly significant Bartlett test. PCA and EFA are fully justified.
Principal Component Analysis (PCA) transforms the original correlated variables into a set of orthogonal (uncorrelated) linear combinations called principal components (PCs). Each PC captures a decreasing amount of total variance.
The mathematical transformation is:
\[PC_i = e_{i1}Z_1 + e_{i2}Z_2 + \cdots + e_{ip}Z_p\]
where \(e_{ij}\) are the eigenvectors of the correlation matrix and \(Z_j\) are standardized variables. Components are ordered so that PC1 captures the most variance, PC2 the most remaining variance, and so on.
Component retention criteria applied:
data_pca <- data_iter
pca_result <- prcomp(data_pca, scale. = TRUE, center = TRUE)
eigenvalues <- pca_result$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)
n_kaiser <- sum(eigenvalues > 1)
n_var70 <- which(cum_var >= 0.70)[1]
n_var80 <- which(cum_var >= 0.80)[1]
n_comp <- n_kaisereig_df <- data.frame(
Component = paste0("PC", seq_along(eigenvalues)),
Eigenvalue = round(eigenvalues, 4),
Variance = paste0(round(prop_var * 100, 2), "%"),
Cumulative = paste0(round(cum_var * 100, 2), "%"),
Decision = ifelse(eigenvalues > 1, "Retain", "Drop")
)
head(eig_df, 15) %>%
kbl(caption = "Eigenvalue Table - First 15 Principal Components") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE) %>%
row_spec(which(head(eig_df,15)$Decision == "Retain"),
background = "#eafaf1") %>%
row_spec(which(head(eig_df,15)$Decision == "Drop"),
color = "#95a5a6")| Component | Eigenvalue | Variance | Cumulative | Decision |
|---|---|---|---|---|
| PC1 | 8.7859 | 30.3% | 30.3% | Retain |
| PC2 | 6.1447 | 21.19% | 51.48% | Retain |
| PC3 | 2.0819 | 7.18% | 58.66% | Retain |
| PC4 | 1.6823 | 5.8% | 64.46% | Retain |
| PC5 | 1.3655 | 4.71% | 69.17% | Retain |
| PC6 | 1.1053 | 3.81% | 72.98% | Retain |
| PC7 | 1.0051 | 3.47% | 76.45% | Retain |
| PC8 | 0.9758 | 3.36% | 79.82% | Drop |
| PC9 | 0.8816 | 3.04% | 82.86% | Drop |
| PC10 | 0.7620 | 2.63% | 85.48% | Drop |
| PC11 | 0.7099 | 2.45% | 87.93% | Drop |
| PC12 | 0.6177 | 2.13% | 90.06% | Drop |
| PC13 | 0.5336 | 1.84% | 91.9% | Drop |
| PC14 | 0.4492 | 1.55% | 93.45% | Drop |
| PC15 | 0.3856 | 1.33% | 94.78% | Drop |
data.frame(
Criterion = c("Kaiser Rule (eigenvalue > 1)",
"Cumulative Variance ≥ 70%",
"Cumulative Variance ≥ 80%"),
N_Components = c(n_kaiser, n_var70, n_var80),
Cumulative = c(sprintf("%.2f%%", cum_var[n_kaiser]*100),
sprintf("%.2f%%", cum_var[n_var70]*100),
sprintf("%.2f%%", cum_var[n_var80]*100)),
Role = c("PRIMARY CRITERION", "Supporting", "Supporting")
) %>%
kbl(caption = "Component Retention Criteria Comparison") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#eaf4fb")| Criterion | N_Components | Cumulative | Role |
|---|---|---|---|
| Kaiser Rule (eigenvalue > 1) | 7 | 76.45% | PRIMARY CRITERION |
| Cumulative Variance ≥ 70% | 6 | 72.98% | Supporting |
| Cumulative Variance ≥ 80% | 9 | 82.86% | Supporting |
fviz_eig(pca_result,
ncp = p_final,
addlabels = TRUE,
barfill = "#3498DB",
barcolor = "#2980B9",
linecolor = "#E74C3C") +
geom_hline(yintercept = 100 / p_final,
linetype = "dashed", color = "#E74C3C", linewidth = 0.8) +
annotate("text",
x = n_comp + 1.2,
y = eigenvalues[n_comp] / sum(eigenvalues) * 100 + 1.8,
label = sprintf("Retain PC1–PC%d\n(%.2f%% variance)",
n_comp, cum_var[n_comp]*100),
color = "#1A5276", size = 3.2, hjust = 0) +
labs(title = "Scree Plot - PCA",
subtitle = sprintf("Kaiser Rule: %d components retained | Cumulative variance: %.2f%%",
n_comp, cum_var[n_comp]*100),
x = "Principal Component", y = "Variance Explained (%)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey40"),
panel.grid.minor = element_blank())Interpretation - Eigenvalue Structure:
The PCA retains 7 components (Kaiser Rule) explaining 76.45% of total variance. This is an excellent result for a 29-variable dataset of mixed clinical data.
The eigenvalue pattern is highly informative:
PC1 (λ = 8.79, 30.30%): Nearly one-third of all variance is captured by a single general dimension. This is the General Body Mass & Hydration component - almost all body composition variables (muscle, water, bone, weight) load positively on it. Biologically, this reflects the strong correlation among all indicators of body size: bigger bodies have more muscle, more water, more bone, and more fat.
PC2 (λ = 6.14, 21.19%): The second component is unusually large, nearly as strong as PC1. This indicates a major contrast dimension between lean composition (positive: LM, Protein) and adipose composition (negative: TBFR, TFC, BMI). This bipolar component captures the fundamental biological trade-off between fat mass and lean mass across the sample.
PC3 (λ = 2.08, 7.18%): Captures the Renal Aging axis - Age loads positively, GFR loads negatively. This reflects the well-established physiological relationship between advancing age and declining kidney function (measured by GFR).
PC4 (λ = 1.68, 5.80%): A Hepatocellular Injury component dominated by AST and ALT - both transaminase enzymes that elevate during liver damage.
PC5 (λ = 1.37, 4.71%): A Metabolic Risk component with negative loadings for Glucose and Triglyceride, capturing insulin resistance and dyslipidemia.
PC6 (λ = 1.11, 3.81%): An Inflammation vs Vitamin D component - CRP (inflammatory marker) loads positively while Vitamin D loads negatively. Low Vitamin D and elevated CRP are both markers of poor metabolic health and are inversely correlated.
PC7 (λ = 1.01, 3.47%): An Obesity (%) singleton component. Note that Obesity (%) has extreme skewness (16.7) and appears to be a near-outlier variable that forms its own isolated component.
PC7’s eigenvalue of 1.0051 is only marginally above 1.0, suggesting it represents borderline-justifiable variance. The Parallel Analysis in EFA (Stage 10) will provide a more conservative estimate.
The factor loading matrix shows the correlation between each variable and each principal component. Loadings are calculated as:
\[\lambda_{jk} = e_{jk} \cdot \sqrt{\text{eigenvalue}_k}\]
Communality (h²) is the sum of squared loadings across all retained components for each variable - representing the proportion of that variable’s variance explained by the overall solution. A variable with h² < 0.50 means more than half its variance is unique and not shared with the extracted components.
loadings_unrot <- pca_result$rotation[, 1:n_comp] %*%
diag(sqrt(eigenvalues[1:n_comp]), nrow = n_comp)
colnames(loadings_unrot) <- paste0("PC", 1:n_comp)
h2_unrot <- rowSums(loadings_unrot^2)
load_unrot_df <- as.data.frame(round(loadings_unrot, 3))
load_unrot_df$h2 <- round(h2_unrot, 3)
load_unrot_df <- load_unrot_df %>% arrange(desc(h2))
load_unrot_df %>%
kbl(caption = "Unrotated Loading Matrix + Communality (h²) - Sorted by h²") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 11) %>%
column_spec(ncol(load_unrot_df),
color = ifelse(load_unrot_df$h2 < 0.50, "#c0392b", "#27ae60"),
bold = TRUE)| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | h2 | |
|---|---|---|---|---|---|---|---|---|
| Weight | 0.894 | -0.382 | -0.111 | 0.100 | 0.047 | 0.061 | -0.003 | 0.973 |
| Lean Mass (LM) (%) | 0.022 | 0.965 | 0.066 | 0.013 | 0.019 | -0.025 | -0.028 | 0.939 |
| Muscle Mass (MM) | 0.930 | 0.236 | -0.061 | 0.093 | 0.051 | 0.050 | -0.030 | 0.939 |
| Total Body Fat Ratio (TBFR) (%) | -0.017 | -0.964 | -0.069 | -0.010 | -0.009 | 0.019 | 0.021 | 0.936 |
| Total Fat Content (TFC) | 0.383 | -0.875 | -0.105 | 0.042 | 0.020 | 0.046 | 0.014 | 0.929 |
| Alanin Aminotransferaz (ALT) | 0.374 | 0.156 | -0.242 | -0.830 | 0.086 | 0.061 | -0.019 | 0.923 |
| Body Mass Index (BMI) | 0.523 | -0.801 | -0.034 | 0.008 | 0.032 | -0.001 | 0.015 | 0.918 |
| Visceral Fat Area (VFA) | 0.673 | -0.666 | -0.073 | 0.071 | 0.059 | 0.039 | 0.012 | 0.912 |
| Total Body Water (TBW) | 0.923 | 0.196 | -0.048 | 0.080 | 0.073 | 0.008 | 0.001 | 0.904 |
| Aspartat Aminotransferaz (AST) | 0.271 | 0.154 | -0.268 | -0.848 | 0.124 | -0.024 | -0.023 | 0.904 |
| Extracellular Water (ECW) | 0.919 | 0.054 | -0.035 | 0.056 | 0.125 | -0.097 | 0.046 | 0.879 |
| Visceral Fat Rating (VFR) | 0.711 | -0.485 | 0.320 | -0.074 | 0.130 | -0.092 | 0.022 | 0.875 |
| Intracellular Water (ICW) | 0.860 | 0.302 | -0.119 | 0.099 | 0.022 | 0.055 | -0.034 | 0.859 |
| Height | 0.643 | 0.592 | -0.137 | 0.155 | 0.015 | 0.129 | -0.018 | 0.824 |
| Visceral Muscle Area (VMA) (Kg) | 0.860 | 0.220 | -0.032 | 0.071 | 0.027 | 0.087 | -0.021 | 0.803 |
| Glomerular Filtration Rate (GFR) | -0.084 | 0.116 | -0.818 | 0.109 | -0.219 | -0.172 | 0.116 | 0.794 |
| Bone Mass (BM) | 0.848 | 0.213 | -0.040 | 0.121 | 0.064 | -0.032 | -0.012 | 0.785 |
| Age | 0.019 | -0.330 | 0.749 | -0.208 | 0.169 | -0.093 | 0.003 | 0.751 |
| Obesity (%) | 0.044 | -0.208 | 0.005 | -0.097 | -0.223 | -0.245 | -0.756 | 0.736 |
| Creatinine | 0.448 | 0.454 | 0.508 | -0.019 | 0.115 | 0.090 | -0.111 | 0.698 |
| Body Protein Content (Protein) (%) | -0.021 | 0.722 | 0.271 | -0.011 | -0.024 | 0.155 | -0.080 | 0.626 |
| Vitamin D | -0.010 | 0.144 | 0.151 | 0.017 | 0.397 | -0.496 | 0.422 | 0.626 |
| Triglyceride | 0.431 | 0.026 | 0.180 | -0.076 | -0.588 | -0.088 | 0.084 | 0.584 |
| Glucose | 0.201 | -0.085 | 0.345 | -0.049 | -0.611 | -0.168 | 0.074 | 0.575 |
| Hemoglobin (HGB) | 0.580 | 0.438 | 0.061 | -0.054 | -0.120 | -0.109 | -0.009 | 0.561 |
| Alkaline Phosphatase (ALP) | 0.065 | -0.150 | 0.144 | -0.301 | -0.347 | 0.360 | 0.397 | 0.546 |
| C-Reactive Protein (CRP) | -0.033 | -0.200 | 0.052 | 0.081 | 0.078 | 0.680 | -0.066 | 0.524 |
| Hepatic Fat Accumulation (HFA) | 0.525 | -0.386 | 0.059 | -0.051 | -0.087 | -0.159 | -0.046 | 0.466 |
| High Density Lipoprotein (HDL) | -0.410 | -0.214 | 0.000 | -0.057 | 0.354 | 0.041 | -0.195 | 0.382 |
low_h2 <- rownames(load_unrot_df)[load_unrot_df$h2 < 0.50]
cat(sprintf("[!] Variables with h2 < 0.50: %s\n\n",
ifelse(length(low_h2) == 0, "None", paste(low_h2, collapse = ", "))))## [!] Variables with h2 < 0.50: Hepatic Fat Accumulation (HFA), High Density Lipoprotein (HDL)
for (k in 1:n_comp) {
ld <- loadings_unrot[, k]
dom <- sort(ld[abs(ld) > 0.50], decreasing = TRUE)
cat(sprintf("PC%d (%.2f%% variance):\n", k, prop_var[k] * 100))
if (length(dom) > 0)
for (nm in names(dom)) cat(sprintf(" %+.4f %s\n", dom[nm], nm))
else cat(" No loading > 0.50\n")
cat("\n")
}## PC1 (30.30% variance):
## +0.9300 Muscle Mass (MM)
## +0.9228 Total Body Water (TBW)
## +0.9191 Extracellular Water (ECW)
## +0.8942 Weight
## +0.8599 Visceral Muscle Area (VMA) (Kg)
## +0.8597 Intracellular Water (ICW)
## +0.8477 Bone Mass (BM)
## +0.7109 Visceral Fat Rating (VFR)
## +0.6726 Visceral Fat Area (VFA)
## +0.6430 Height
## +0.5803 Hemoglobin (HGB)
## +0.5251 Hepatic Fat Accumulation (HFA)
## +0.5226 Body Mass Index (BMI)
##
## PC2 (21.19% variance):
## +0.9654 Lean Mass (LM) (%)
## +0.7217 Body Protein Content (Protein) (%)
## +0.5918 Height
## -0.6660 Visceral Fat Area (VFA)
## -0.8014 Body Mass Index (BMI)
## -0.8753 Total Fat Content (TFC)
## -0.9644 Total Body Fat Ratio (TBFR) (%)
##
## PC3 (7.18% variance):
## +0.7495 Age
## +0.5075 Creatinine
## -0.8185 Glomerular Filtration Rate (GFR)
##
## PC4 (5.80% variance):
## -0.8301 Alanin Aminotransferaz (ALT)
## -0.8478 Aspartat Aminotransferaz (AST)
##
## PC5 (4.71% variance):
## -0.5875 Triglyceride
## -0.6106 Glucose
##
## PC6 (3.81% variance):
## +0.6803 C-Reactive Protein (CRP)
##
## PC7 (3.47% variance):
## -0.7562 Obesity (%)
load_unrot_df %>%
select(-h2) %>%
rownames_to_column("Variable") %>%
pivot_longer(-Variable, names_to = "Component", values_to = "Loading") %>%
ggplot(aes(x = Component, y = reorder(Variable, Loading), fill = Loading)) +
geom_tile(color = "white", linewidth = 0.4) +
geom_text(aes(label = sprintf("%.2f", Loading)), size = 2.2, color = "black") +
scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
midpoint = 0, limits = c(-1, 1), name = "Loading") +
labs(title = "Unrotated Component Loadings",
subtitle = "All loading values displayed",
x = "Principal Component", y = NULL) +
theme_minimal(base_size = 9) +
theme(axis.text.y = element_text(size = 7),
legend.position = "right",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
panel.grid = element_blank())Interpretation - Communality (h²) Analysis:
The communality values reveal how well each variable is represented by the 7-component solution:
Excellent representation (h² ≥ 0.90): Weight (0.973), Lean Mass (0.939), Muscle Mass (0.939), TBFR (0.936), TFC (0.929), ALT (0.923), BMI (0.918), VFA (0.912), TBW (0.904), AST (0.904). These variables are almost entirely explained by the extracted components - they are the “core” variables of this dataset.
Good representation (h² = 0.70–0.89): ECW, VFR, ICW, Height, VMA, GFR, Bone Mass, Age, Obesity (%), Creatinine. Well-represented with moderate uniqueness.
Weak representation (h² < 0.50): - Hepatic Fat Accumulation (HFA, h² = 0.466): HFA correlates moderately with multiple components but doesn’t dominate any one. Its unique variance is high (u² = 0.534), suggesting it represents a dimension partially captured but not fully explained by this solution. - HDL (h² = 0.382): HDL has consistently low MSA and low communality throughout the analysis. This reflects its complex physiological role - HDL is inversely protective for cardiovascular disease and correlates with multiple body composition dimensions simultaneously without clustering cleanly with any one group.
Key pattern in unrotated loadings: PC1 acts as a general factor with positive loadings across virtually all body composition variables - this “blurring” makes interpretation difficult. The rotated solution (Stage 8) will clarify the structure by creating more differentiated, interpretable components.
VARIMAX rotation (orthogonal) redistributes variance among components to maximize the variance of squared loadings within each column. The goal is simple structure - each variable loads highly on one component and near-zero on others, making each component substantively interpretable. Rotation does not change the total variance explained or communalities; it only redistributes the variance more evenly across components (Watkins, 2021).
pca_rotated <- principal(data_pca, nfactors = n_comp,
rotate = "varimax", scores = TRUE)
rot_load <- unclass(pca_rotated$loadings)[, 1:n_comp]
colnames(rot_load) <- paste0("RC", 1:n_comp)
h2_rot <- pca_rotated$communality
cat("Rotated Component Matrix (cutoff |loading| >= 0.30):\n\n")## Rotated Component Matrix (cutoff |loading| >= 0.30):
##
## Loadings:
## RC1 RC2 RC3 RC4 RC5 RC6
## Height 0.825 -0.339
## Weight 0.733 0.649
## Total Body Water (TBW) 0.934
## Extracellular Water (ECW) 0.879
## Intracellular Water (ICW) 0.916
## Bone Mass (BM) 0.876
## Muscle Mass (MM) 0.958
## Visceral Muscle Area (VMA) (Kg) 0.881
## Creatinine 0.540 -0.368 0.512
## Hemoglobin (HGB) 0.635
## Body Mass Index (BMI) 0.917
## Total Body Fat Ratio (TBFR) (%) -0.319 0.903
## Lean Mass (LM) (%) 0.326 -0.902
## Body Protein Content (Protein) (%) -0.737
## Visceral Fat Rating (VFR) 0.471 0.640 0.459
## Total Fat Content (TFC) 0.949
## Visceral Fat Area (VFA) 0.432 0.842
## Hepatic Fat Accumulation (HFA) 0.328 0.516
## Age 0.807
## Glomerular Filtration Rate (GFR) -0.881
## Aspartat Aminotransferaz (AST) 0.934
## Alanin Aminotransferaz (ALT) 0.927
## Glucose 0.727
## Triglyceride 0.697
## C-Reactive Protein (CRP) 0.649
## Vitamin D -0.692
## Obesity (%)
## High Density Lipoprotein (HDL) -0.401 -0.433
## Alkaline Phosphatase (ALP) 0.446
## RC7
## Height
## Weight
## Total Body Water (TBW)
## Extracellular Water (ECW)
## Intracellular Water (ICW)
## Bone Mass (BM)
## Muscle Mass (MM)
## Visceral Muscle Area (VMA) (Kg)
## Creatinine
## Hemoglobin (HGB)
## Body Mass Index (BMI)
## Total Body Fat Ratio (TBFR) (%)
## Lean Mass (LM) (%)
## Body Protein Content (Protein) (%)
## Visceral Fat Rating (VFR)
## Total Fat Content (TFC)
## Visceral Fat Area (VFA)
## Hepatic Fat Accumulation (HFA)
## Age
## Glomerular Filtration Rate (GFR)
## Aspartat Aminotransferaz (AST)
## Alanin Aminotransferaz (ALT)
## Glucose
## Triglyceride
## C-Reactive Protein (CRP)
## Vitamin D
## Obesity (%) 0.832
## High Density Lipoprotein (HDL)
## Alkaline Phosphatase (ALP) -0.419
##
## RC1 RC2 RC3 RC4 RC5 RC6 RC7
## SS loadings 8.061 6.242 2.098 1.897 1.662 1.165 1.046
## Proportion Var 0.278 0.215 0.072 0.065 0.057 0.040 0.036
## Cumulative Var 0.278 0.493 0.566 0.631 0.688 0.728 0.765
data.frame(
PC = paste0("PC", 1:n_comp),
Unrotated = paste0(round(prop_var[1:n_comp] * 100, 2), "%"),
RC = paste0("RC", 1:n_comp),
Rotated_Var = paste0(round(pca_rotated$Vaccounted[2, 1:n_comp] * 100, 2), "%"),
Rotated_Cum = paste0(round(pca_rotated$Vaccounted[3, 1:n_comp] * 100, 2), "%")
) %>%
kbl(caption = "Variance Comparison: Unrotated vs VARIMAX Rotated") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| PC | Unrotated | RC | Rotated_Var | Rotated_Cum |
|---|---|---|---|---|
| PC1 | 30.3% | RC1 | 27.8% | 27.8% |
| PC2 | 21.19% | RC2 | 21.52% | 49.32% |
| PC3 | 7.18% | RC3 | 7.24% | 56.56% |
| PC4 | 5.8% | RC4 | 6.54% | 63.1% |
| PC5 | 4.71% | RC5 | 5.73% | 68.83% |
| PC6 | 3.81% | RC6 | 4.02% | 72.84% |
| PC7 | 3.47% | RC7 | 3.61% | 76.45% |
rot_df <- as.data.frame(round(rot_load, 3))
rot_df$h2 <- round(h2_rot, 3)
rot_df %>%
select(-h2) %>%
rownames_to_column("Variable") %>%
pivot_longer(-Variable, names_to = "RC", values_to = "Loading") %>%
ggplot(aes(x = RC, y = reorder(Variable, Loading), fill = Loading)) +
geom_tile(color = "white", linewidth = 0.4) +
geom_text(aes(label = ifelse(abs(Loading) >= 0.30,
sprintf("%.2f", Loading), "")),
size = 2.3, color = "black") +
scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
midpoint = 0, limits = c(-1, 1), name = "Loading") +
labs(title = "Rotated Component Loadings (VARIMAX)",
subtitle = "Labels displayed for |loading| ≥ 0.30",
x = "Rotated Component", y = NULL) +
theme_minimal(base_size = 9) +
theme(axis.text.y = element_text(size = 7),
legend.position = "right",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
panel.grid = element_blank())Interpretation - Rotated Components (VARIMAX):
After rotation, each component becomes clearly interpretable:
RC1 - “Body Size & Lean Mass” (27.8%
variance)
High positive loadings: MM (0.958), TBW (0.934), ICW (0.916), VMA
(0.881), ECW (0.879), BM (0.876), Height (0.825), Weight (0.733), HGB
(0.635), Creatinine (0.540). This component represents overall body size
and its lean tissue composition. Larger, more muscular individuals score
high. The positive loading of Hemoglobin reflects the higher blood
volume in larger lean bodies. HDL loads negatively (-0.40), consistent
with known inverse relationships between muscle mass and HDL in
non-athletic populations.
RC2 - “Adiposity vs Lean Composition” (21.5%
variance)
Strong contrast: TFC (0.949), BMI (0.917), TBFR (0.903), VFA (0.842)
positive vs LM (-0.902), Protein (-0.737) negative. This component
captures the fundamental fat-lean trade-off. High scorers have high fat
mass, high BMI, and low lean mass percentage. This is the core “obesity
dimension” of the dataset and is clinically the most directly relevant
component for gallstone risk.
RC3 - “Renal Age” (7.2% variance)
Age (0.807) positive, GFR (-0.881) negative. This component cleanly
captures the biological aging of kidney function. GFR naturally declines
with age (approximately 1 mL/min/year after age 40), making this
component a direct physiological age indicator. Creatinine (0.512) and
VFR (0.459) also contribute, as both tend to increase with age.
RC4 - “Hepatocellular Enzyme Activity” (6.5%
variance)
AST (0.934) and ALT (0.927) load almost exclusively on this component.
Both are intracellular liver enzymes that leak into the bloodstream
during hepatocyte injury. This component specifically captures hepatic
inflammatory or metabolic stress, independent of body composition.
RC5 - “Glycemic-Lipid Metabolic Risk” (5.7%
variance)
Glucose (0.727), Triglyceride (0.697), ALP (0.446). This component
represents the metabolic syndrome cluster - elevated blood sugar and
triglycerides co-occur in insulin-resistant patients. ALP’s moderate
loading may reflect hepatic involvement in metabolic syndrome. HDL loads
negatively (-0.433), consistent with its protective role and inverse
relationship with triglycerides.
RC6 - “Inflammation-Vitamin D Axis” (4.0%
variance)
CRP (0.649) positive, Vitamin D (-0.692) negative. This bipolar
component captures the well-documented inverse relationship between
chronic inflammation (high CRP) and Vitamin D deficiency. Low Vitamin D
is associated with increased inflammatory cytokines, and this appears as
a meaningful clinical dimension even in the PCA solution.
RC7 - “Extreme Obesity” (3.6% variance)
Obesity (%) loads exclusively here (0.832), with ALP loading negatively
(-0.419). The extreme skewness of Obesity (%) (skewness = 16.7)
effectively created a component that almost entirely represents the
extreme-obesity tail of the distribution. This singleton component
should be interpreted with caution.
Effect of rotation: PC1’s original dominance (30.3%) was partially redistributed to create a more balanced RC1 (27.8%) and RC2 (21.5%). The rotated solution achieves much cleaner simple structure, with most variables now loading clearly on just one component.
fviz_pca_biplot(pca_result,
axes = c(1, 2),
geom.ind = "point",
col.ind = "grey70",
alpha.ind = 0.25,
col.var = "contrib",
gradient.cols = c("#00AFBB","#E7B800","#FC4E07"),
repel = TRUE,
label = "var",
arrowsize = 0.7,
labelsize = 3.0) +
scale_color_gradient(low = "#00AFBB", high = "#FC4E07",
name = "Contribution (%)") +
labs(title = "PCA Biplot - PC1 vs PC2",
subtitle = sprintf("PC1 = %.1f%% | PC2 = %.1f%% | Combined = %.1f%%",
prop_var[1]*100, prop_var[2]*100,
(prop_var[1]+prop_var[2])*100)) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey40"),
legend.position = "right",
panel.grid.minor = element_blank())fviz_pca_biplot(pca_result,
axes = c(1, 3),
geom.ind = "point",
col.ind = "grey70",
alpha.ind = 0.25,
col.var = "contrib",
gradient.cols = c("#00AFBB","#E7B800","#FC4E07"),
repel = TRUE,
label = "var",
arrowsize = 0.7,
labelsize = 3.0) +
scale_color_gradient(low = "#00AFBB", high = "#FC4E07",
name = "Contribution (%)") +
labs(title = "PCA Biplot - PC1 vs PC3",
subtitle = sprintf("PC1 = %.1f%% | PC3 = %.1f%%",
prop_var[1]*100, prop_var[3]*100)) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey40"),
legend.position = "right",
panel.grid.minor = element_blank())fviz_contrib(pca_result, choice = "var",
axes = 1:2, top = 20,
fill = "#3498DB", color = "#2980B9") +
labs(title = "Top 20 Variable Contributions to PC1 & PC2",
subtitle = "Red dashed line = expected contribution under uniform distribution") +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey40"))Interpretation - Biplot:
The PC1 vs PC2 biplot shows a characteristic “fan” shape. Arrows pointing in the same direction indicate positively correlated variables that cluster on the same component. Key patterns:
The PC1 vs PC3 biplot shows a clear vertical separation: Age arrows point upward while GFR points downward on PC3, visually confirming the Renal Age interpretation. Creatinine aligns between PC1 and PC3, consistent with its cross-loading.
Exploratory Factor Analysis (EFA) differs fundamentally from PCA. While PCA is a variance decomposition technique (no underlying model), EFA is a latent variable model - it posits that observed variables are caused by (or reflect) unobserved latent factors, plus unique error:
\[X_j = \lambda_{j1}F_1 + \lambda_{j2}F_2 + \cdots + \lambda_{jk}F_k + \varepsilon_j\]
EFA estimates factor loadings (λ), communalities, and unique variances while accounting for measurement error. This makes EFA more appropriate when the goal is theoretical construct identification rather than pure data reduction.
Method used: Maximum Likelihood (ML)
ML estimation is preferred because it provides model fit indices (RMSEA,
TLI, BIC, p-value) that allow formal evaluation of whether the factor
model adequately reproduces the observed correlation matrix (Shi &
Maydeu-Olivares, 2020).
References: Watkins (2021); Goretzko & Bühner (2020); Shi & Maydeu-Olivares (2020); McNeish (2023); Goretzko (2022).
Parallel Analysis (PA) is the most accurate method for determining the number of factors (Goretzko & Bühner, 2020). PA compares the actual eigenvalues to eigenvalues generated from random data matrices of the same size (500 iterations at the 95th percentile). Only factors whose eigenvalue exceeds the corresponding random eigenvalue are retained. PA consistently outperforms Kaiser Rule, which tends to over-retain factors.
set.seed(42)
pa_result <- fa.parallel(data_pca,
fm = "ml",
fa = "fa",
n.iter = 500,
quant = 0.95,
main = "Parallel Analysis Scree Plot - EFA",
show.legend = TRUE)## Parallel analysis suggests that the number of factors = 5 and the number of components = NA
n_fa_parallel <- pa_result$nfact
cat(sprintf("\nParallel Analysis suggests : %d factor(s)\n", n_fa_parallel))##
## Parallel Analysis suggests : 5 factor(s)
## Kaiser Rule suggests : 7 factor(s)
## => Using 5 factor(s) for EFA
Interpretation: Parallel Analysis suggests 5 factors, while Kaiser Rule suggested 7. This discrepancy is common and expected - Kaiser Rule applied to factor analysis is known to over-retain factors, especially when variables have moderate-to-high intercorrelations (Goretzko & Bühner, 2020).
Parallel Analysis is the superior criterion here: the 6th and 7th eigenvalues from the actual data do not meaningfully exceed what random data would produce at the 95th percentile, suggesting they do not represent genuine latent constructs. The decision to use 5 factors is well-supported by this gold-standard criterion.
efa_varimax <- fa(data_pca,
nfactors = n_fa,
fm = "ml",
rotate = "varimax",
scores = "regression",
use = "complete.obs")
data.frame(
Index = c("RMSEA", "TLI", "BIC", "Chi-sq p-value"),
Value = c(round(efa_varimax$RMSEA[1], 4),
round(efa_varimax$TLI, 4),
round(efa_varimax$BIC, 4),
round(efa_varimax$PVAL, 4)),
Threshold = c("< 0.08 acceptable; < 0.05 good",
"> 0.90 acceptable",
"Lower = better fit",
"p > 0.05 = exact fit"),
Status = c(ifelse(efa_varimax$RMSEA[1] < 0.08, "Acceptable", "Poor"),
ifelse(efa_varimax$TLI > 0.90, "Acceptable", "Below threshold"),
"Reference",
ifelse(efa_varimax$PVAL > 0.05, "Exact fit", "Significant misfit"))
) %>%
kbl(caption = "EFA Model Fit Indices - 5-Factor ML Solution") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
column_spec(4, color = c("#c0392b","#e67e22","#27ae60","#c0392b"))| Index | Value | Threshold | Status |
|---|---|---|---|
| RMSEA | 0.1108 | < 0.08 acceptable; < 0.05 good | Poor |
| TLI | 0.8569 | > 0.90 acceptable | Below threshold |
| BIC | -229.6157 | Lower = better fit | Reference |
| Chi-sq p-value | 0.0000 | p > 0.05 = exact fit | Significant misfit |
Important Note on Model Fit: RMSEA = 0.1108 and TLI = 0.857 indicate that the 5-factor ML model does not achieve ideal fit indices. This is a substantively meaningful finding, not a failure of the analysis. Several factors explain this:
9 variables with h² < 0.40 (Obesity, HDL, Glucose, CRP, Vitamin D, ALP, GFR, Triglyceride, HFA) have high uniqueness - they don’t fit cleanly into any common factor. Their residual correlations inflate model misfit statistics.
Extreme non-normality in several biomarkers (skewness > 5) violates ML’s normality assumption, which inflates χ² and worsens RMSEA.
Genuine complexity: Clinical biomarker data rarely fits a clean factor structure because the same variable (e.g., Creatinine) can reflect multiple independent physiological processes simultaneously.
The BIC = -229.6 is negative, indicating that the 5-factor model fits substantially better than a saturated model - a positive indicator. The split-sample validation (Stage 11) showing stability < 2.15% difference provides the most practically meaningful evidence of solution quality.
efa_oblimin <- fa(data_pca,
nfactors = n_fa,
fm = "ml",
rotate = "oblimin",
scores = "regression",
use = "complete.obs")
cat("Factor intercorrelations (Phi matrix - Oblimin rotation):\n\n")## Factor intercorrelations (Phi matrix - Oblimin rotation):
## ML2 ML1 ML5 ML3 ML4
## ML2 1.000 0.165 0.289 0.267 0.386
## ML1 0.165 1.000 -0.052 0.325 -0.471
## ML5 0.289 -0.052 1.000 0.044 0.157
## ML3 0.267 0.325 0.044 1.000 -0.163
## ML4 0.386 -0.471 0.157 -0.163 1.000
max_phi <- max(abs(efa_oblimin$Phi[lower.tri(efa_oblimin$Phi)]))
cat(sprintf("\nMaximum factor correlation: %.3f\n", max_phi))##
## Maximum factor correlation: 0.471
cat(sprintf("Decision: %s\n",
ifelse(max_phi > 0.32,
"Oblique (Oblimin) rotation preferred [correlation > 0.32]",
"Orthogonal (Varimax) rotation preferred [correlation <= 0.32]")))## Decision: Oblique (Oblimin) rotation preferred [correlation > 0.32]
if (max_phi > 0.32) {
efa_final <- efa_oblimin
rot_method <- "oblimin"
} else {
efa_final <- efa_varimax
rot_method <- "varimax"
}
cat(sprintf("=> Final rotation method: %s\n", toupper(rot_method)))## => Final rotation method: OBLIMIN
Interpretation - Why Oblimin Rotation?
The maximum inter-factor correlation is 0.471 (between ML2 and ML4), well above the 0.32 threshold. This means the latent factors are not independent of each other - they are correlated. In biological/clinical data, this is expected: body size (ML2) and height (ML4) are naturally correlated; adiposity (ML1) and metabolic risk are related constructs.
Forcing orthogonality (Varimax) in this situation would artificially constrain the solution and distort loadings, as it would attempt to represent correlated real-world constructs as uncorrelated - an unrealistic assumption. Oblimin rotation allows factors to correlate, producing a more honest representation of the underlying factor structure (Watkins, 2021; McNeish, 2023).
The Phi matrix shows that: - ML2 (Body Size) ↔︎ ML4 (Height): r = 0.386 - taller people have proportionally larger body size - ML1 (Adiposity) ↔︎ ML4 (Height): r = -0.471 - taller people tend to have relatively lower fat ratios - ML2 ↔︎ ML3 (Aging): r = 0.267 - larger body size is modestly associated with older age in this sample
## Final EFA Factor Loadings - OBLIMIN (cutoff |loading| >= 0.30):
##
## Loadings:
## ML2 ML1 ML5 ML3 ML4
## Weight 0.759 0.508
## Body Mass Index (BMI) 0.554 0.527 -0.465
## Total Body Water (TBW) 0.884
## Extracellular Water (ECW) 0.799
## Intracellular Water (ICW) 0.899
## Bone Mass (BM) 0.839
## Muscle Mass (MM) 0.882
## Visceral Muscle Area (VMA) (Kg) 0.796
## Total Body Fat Ratio (TBFR) (%) 0.952
## Lean Mass (LM) (%) -0.953
## Body Protein Content (Protein) (%) -0.566
## Total Fat Content (TFC) 0.867
## Visceral Fat Area (VFA) 0.357 0.759
## Aspartat Aminotransferaz (AST) 0.923
## Alanin Aminotransferaz (ALT) 0.950
## Age 0.821
## Visceral Fat Rating (VFR) 0.414 0.687
## Height 0.415 0.740
## Obesity (%)
## Hepatic Fat Accumulation (HFA) 0.443
## Glucose
## High Density Lipoprotein (HDL) -0.418
## Triglyceride
## Alkaline Phosphatase (ALP)
## Creatinine 0.342 -0.403
## Glomerular Filtration Rate (GFR) -0.489
## C-Reactive Protein (CRP)
## Hemoglobin (HGB) 0.376
## Vitamin D
##
## ML2 ML1 ML5 ML3 ML4
## SS loadings 6.679 4.473 1.832 1.700 1.147
## Proportion Var 0.230 0.154 0.063 0.059 0.040
## Cumulative Var 0.230 0.385 0.448 0.506 0.546
efa_load <- unclass(efa_final$loadings)
h2_efa <- efa_final$communality
u2_efa <- efa_final$uniquenesses
efa_df <- as.data.frame(round(efa_load, 3))
efa_df$h2 <- round(h2_efa, 3)
efa_df$u2 <- round(u2_efa, 3)
efa_df %>%
kbl(caption = "EFA Communality (h²) & Uniqueness (u²) - OBLIMIN Solution") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 11) %>%
column_spec(which(colnames(efa_df) == "h2"),
color = ifelse(efa_df$h2 < 0.40, "#c0392b",
ifelse(efa_df$h2 < 0.70, "#e67e22", "#27ae60")),
bold = TRUE)| ML2 | ML1 | ML5 | ML3 | ML4 | h2 | u2 | |
|---|---|---|---|---|---|---|---|
| Age | -0.263 | -0.060 | -0.021 | 0.821 | -0.050 | 0.631 | 0.369 |
| Height | 0.415 | 0.019 | 0.038 | -0.122 | 0.740 | 0.982 | 0.018 |
| Weight | 0.759 | 0.508 | 0.022 | 0.024 | 0.048 | 0.996 | 0.004 |
| Body Mass Index (BMI) | 0.554 | 0.527 | -0.009 | 0.078 | -0.465 | 0.995 | 0.005 |
| Total Body Water (TBW) | 0.884 | -0.088 | 0.060 | 0.064 | 0.087 | 0.903 | 0.097 |
| Extracellular Water (ECW) | 0.799 | -0.005 | 0.073 | 0.186 | 0.076 | 0.840 | 0.160 |
| Intracellular Water (ICW) | 0.899 | -0.149 | 0.059 | -0.078 | 0.065 | 0.859 | 0.141 |
| Total Body Fat Ratio (TBFR) (%) | -0.245 | 0.952 | -0.018 | 0.031 | -0.070 | 0.993 | 0.007 |
| Lean Mass (LM) (%) | 0.249 | -0.953 | 0.019 | -0.031 | 0.070 | 0.995 | 0.005 |
| Body Protein Content (Protein) (%) | 0.077 | -0.566 | -0.023 | 0.010 | 0.208 | 0.471 | 0.529 |
| Visceral Fat Rating (VFR) | 0.414 | 0.219 | 0.042 | 0.687 | -0.068 | 0.995 | 0.005 |
| Bone Mass (BM) | 0.839 | -0.110 | 0.017 | 0.022 | 0.058 | 0.750 | 0.250 |
| Muscle Mass (MM) | 0.882 | -0.055 | 0.059 | 0.015 | 0.149 | 0.939 | 0.061 |
| Obesity (%) | 0.070 | 0.059 | 0.027 | 0.005 | -0.183 | 0.044 | 0.956 |
| Total Fat Content (TFC) | 0.201 | 0.867 | -0.013 | 0.040 | -0.078 | 0.936 | 0.064 |
| Visceral Fat Area (VFA) | 0.357 | 0.759 | 0.012 | 0.213 | 0.149 | 0.933 | 0.067 |
| Visceral Muscle Area (VMA) (Kg) | 0.796 | -0.004 | 0.067 | -0.029 | 0.155 | 0.781 | 0.219 |
| Hepatic Fat Accumulation (HFA) | 0.443 | 0.210 | 0.055 | 0.134 | -0.210 | 0.375 | 0.625 |
| Glucose | 0.079 | -0.005 | -0.048 | 0.219 | 0.017 | 0.062 | 0.938 |
| High Density Lipoprotein (HDL) | -0.418 | 0.172 | 0.000 | 0.007 | 0.030 | 0.166 | 0.834 |
| Triglyceride | 0.274 | 0.055 | 0.057 | 0.052 | 0.063 | 0.122 | 0.878 |
| Aspartat Aminotransferaz (AST) | -0.072 | -0.004 | 0.923 | -0.012 | -0.030 | 0.813 | 0.187 |
| Alanin Aminotransferaz (ALT) | 0.014 | 0.011 | 0.950 | -0.008 | -0.014 | 0.904 | 0.096 |
| Alkaline Phosphatase (ALP) | -0.010 | 0.081 | 0.131 | 0.063 | -0.083 | 0.042 | 0.958 |
| Creatinine | 0.342 | -0.403 | 0.005 | 0.285 | 0.165 | 0.412 | 0.588 |
| Glomerular Filtration Rate (GFR) | 0.124 | 0.048 | 0.059 | -0.489 | -0.057 | 0.206 | 0.794 |
| C-Reactive Protein (CRP) | 0.049 | 0.112 | -0.063 | -0.084 | -0.116 | 0.040 | 0.960 |
| Hemoglobin (HGB) | 0.376 | -0.217 | 0.127 | 0.150 | 0.285 | 0.460 | 0.540 |
| Vitamin D | 0.053 | -0.259 | -0.029 | 0.118 | -0.097 | 0.048 | 0.952 |
## EFA Variance Explained:
vacc <- efa_final$Vaccounted
data.frame(
Factor = colnames(vacc),
SS_Loading = round(vacc[1,], 3),
Proportion = paste0(round(vacc[2,]*100, 2), "%"),
Cumulative = paste0(round(vacc[3,]*100, 2), "%")
) %>%
kbl(caption = "Variance Explained by EFA Factors (Oblimin)") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Factor | SS_Loading | Proportion | Cumulative | |
|---|---|---|---|---|
| ML2 | ML2 | 7.218 | 24.89% | 24.89% |
| ML1 | ML1 | 4.875 | 16.81% | 41.7% |
| ML5 | ML5 | 1.949 | 6.72% | 48.42% |
| ML3 | ML3 | 1.914 | 6.6% | 55.02% |
| ML4 | ML4 | 1.737 | 5.99% | 61.01% |
efa_df %>%
select(-h2, -u2) %>%
rownames_to_column("Variable") %>%
pivot_longer(-Variable, names_to = "Factor", values_to = "Loading") %>%
ggplot(aes(x = Factor, y = reorder(Variable, Loading), fill = Loading)) +
geom_tile(color = "white", linewidth = 0.4) +
geom_text(aes(label = ifelse(abs(Loading) >= 0.30,
sprintf("%.2f", Loading), "")),
size = 2.3, color = "black") +
scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
midpoint = 0, limits = c(-1, 1), name = "Loading") +
labs(title = "EFA Factor Loadings - OBLIMIN Rotation",
subtitle = "Maximum Likelihood extraction | Labels for |loading| ≥ 0.30",
x = "Factor", y = NULL) +
theme_minimal(base_size = 9) +
theme(axis.text.y = element_text(size = 7),
legend.position = "right",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
panel.grid = element_blank())data.frame(
Variable = names(h2_efa),
Communality = round(h2_efa, 3),
Uniqueness = round(u2_efa, 3)
) %>%
arrange(desc(Communality)) %>%
ggplot(aes(x = reorder(Variable, Communality), y = Communality,
fill = Communality)) +
geom_col(alpha = 0.85) +
geom_hline(yintercept = 0.40, linetype = "dashed",
color = "#D73027", linewidth = 0.7) +
geom_hline(yintercept = 0.70, linetype = "dashed",
color = "#1A9641", linewidth = 0.7) +
annotate("text", x = 0.6, y = c(0.42, 0.72),
label = c("0.40 - Minimum threshold", "0.70 - Good"),
hjust = 0, size = 3,
color = c("#D73027","#1A9641")) +
scale_fill_gradient(low = "#FEE090", high = "#4575B4", name = "h²") +
scale_y_continuous(limits = c(0, 1.05), breaks = seq(0, 1, 0.1)) +
coord_flip() +
labs(title = "Communality (h²) per Variable - EFA OBLIMIN",
subtitle = "Proportion of variance explained by the 5 common factors",
x = NULL, y = "Communality (h²)") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey40"),
legend.position = "right",
panel.grid.minor = element_blank())Interpretation - EFA Factor Structure (Oblimin, 5 Factors, 61.01% total variance):
ML2 - “Body Size & Hydration” (24.89%)
Primary loadings: TBW (0.884), ICW (0.899), MM (0.882), Bone Mass
(0.839), ECW (0.799), VMA (0.796), Weight (0.759), Height (0.415). This
factor represents total body mass and its structural components (muscle,
bone, intracellular water). High scorers are physically larger
individuals with more lean tissue. This is the dominant clinical factor
in the dataset, explaining nearly a quarter of all variance.
Substantive label: Lean Body Mass & Hydration.
ML1 - “Adiposity vs Lean Composition” (16.81%)
Strong bipolar: TBFR (+0.952), LM (-0.953), TFC (+0.867), VFA (+0.759),
BMI (+0.527) vs Protein (-0.566), Creatinine (-0.403). This factor
captures the fat-lean dichotomy within body composition. High positive
scores indicate high fat mass, low lean mass, and high visceral fat -
the classical obesity phenotype. This factor is directly clinically
relevant to gallstone risk, as visceral adiposity is a known risk
factor. Substantive label: Adiposity & Fat
Composition.
ML5 - “Hepatocellular Stress” (6.72%)
Near-exclusive loadings: ALT (0.950), AST (0.923). These two
transaminase enzymes are the definitive markers of hepatocyte injury.
Their extremely high and focused loadings on a single factor confirms
that they measure the same underlying biological process (liver cell
damage) almost perfectly. This factor is particularly relevant to the
gallstone context, as hepatic dysfunction and biliary disease are
closely linked. Substantive label: Liver Enzyme
Activity.
ML3 - “Visceral-Renal Aging” (6.60%)
Age (0.821), VFR (0.687), GFR (-0.489). As age increases, visceral fat
accumulates (positive VFR loading) while kidney function declines
(negative GFR loading). This factor captures the multidimensional aging
process - specifically the metabolic and renal dimensions that co-occur
with advancing age. Substantive label: Age-Related Visceral
& Renal Function.
ML4 - “Height-Based Body Frame” (5.99%)
Height (0.740), with cross-loadings on ML2. This factor represents
skeletal body frame size independent of fat or muscle mass. Taller
individuals have a different body composition profile even at similar
weights. BMI cross-loads negatively (-0.465), confirming that BMI is
partially confounded by height - a known limitation of BMI as an
adiposity index. Substantive label: Skeletal Frame
Size.
Variables not well-represented (h² < 0.40):
Obesity (%), Hepatic Fat Accumulation, Glucose, HDL, Triglyceride, ALP,
GFR, CRP, Vitamin D. These 9 variables have high uniqueness - their
variance is primarily “noise” or unique to each variable, not shared
through any common latent factor. This is clinically interpretable:
variables like CRP and Vitamin D are influenced by a wide range of
patient-specific conditions (infections, sun exposure, diet) that don’t
cluster with other variables in this particular dataset.
Split-sample validation tests the stability and reproducibility of the factor solution. The data is randomly split 50:50 and EFA is run independently on both halves. If the variance explained per factor differs by less than 5% between the two samples, the solution is considered stable and not merely a product of sample-specific noise (Goretzko, 2022).
set.seed(42)
idx_split <- sample(seq_len(nrow(data_pca)), nrow(data_pca) %/% 2)
data_s1 <- data_pca[ idx_split, ]
data_s2 <- data_pca[-idx_split, ]
fa_s1 <- fa(data_s1, nfactors = n_fa, fm = "ml", rotate = rot_method)
fa_s2 <- fa(data_s2, nfactors = n_fa, fm = "ml", rotate = rot_method)
n_show <- min(n_fa, 5)
var_s1 <- round(fa_s1$Vaccounted[2, 1:n_show] * 100, 2)
var_s2 <- round(fa_s2$Vaccounted[2, 1:n_show] * 100, 2)
diff_v <- round(abs(var_s1 - var_s2), 2)
data.frame(
Factor = paste0("F", 1:n_show),
Sample1 = paste0(var_s1, "%"),
Sample2 = paste0(var_s2, "%"),
Difference = paste0(diff_v, "%"),
Stable = ifelse(diff_v < 5, "Yes", "No")
) %>%
kbl(caption = sprintf("Split-Sample Validation | S1: %d obs | S2: %d obs",
nrow(data_s1), nrow(data_s2))) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
column_spec(5, color = ifelse(diff_v < 5, "#27ae60", "#c0392b"), bold = TRUE)| Factor | Sample1 | Sample2 | Difference | Stable | |
|---|---|---|---|---|---|
| ML2 | F1 | 28.95% | 26.76% | 2.19% | Yes |
| ML1 | F2 | 15.33% | 16.59% | 1.26% | Yes |
| ML4 | F3 | 7.52% | 6.91% | 0.61% | Yes |
| ML3 | F4 | 6.4% | 6.52% | 0.12% | Yes |
| ML5 | F5 | 4.36% | 5.84% | 1.48% | Yes |
Result: The largest difference between split samples is only 2.15% (Factor 1), well below the 5% stability threshold. All 5 factors show consistent variance contributions across both halves of the sample. This confirms that the factor solution is stable and generalizable, not a product of chance variation in a specific subset of patients (Goretzko, 2022).
pc_scores <- as.data.frame(pca_result$x[, 1:n_comp])
colnames(pc_scores) <- paste0("PC", 1:n_comp)
efa_scores <- as.data.frame(efa_final$scores)
colnames(efa_scores) <- paste0("Factor_", seq_len(ncol(efa_scores)))
cat("PC score orthogonality check (correlations should be exactly 0):\n\n")## PC score orthogonality check (correlations should be exactly 0):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## PC1 1 0 0 0 0 0 0
## PC2 0 1 0 0 0 0 0
## PC3 0 0 1 0 0 0 0
## PC4 0 0 0 1 0 0 0
## PC5 0 0 0 0 1 0 0
## PC6 0 0 0 0 0 1 0
## PC7 0 0 0 0 0 0 1
write.csv(cbind(data_pca, pc_scores), "OUTPUT_pca_scores.csv", row.names = FALSE)
write.csv(cbind(data_pca, efa_scores), "OUTPUT_efa_scores.csv", row.names = FALSE)
cat("\n[Saved] OUTPUT_pca_scores.csv - raw data + PCA component scores\n")##
## [Saved] OUTPUT_pca_scores.csv - raw data + PCA component scores
## [Saved] OUTPUT_efa_scores.csv - raw data + EFA factor scores (Oblimin)
Factor Scores: PCA scores are perfectly orthogonal (all inter-PC correlations = 0 by mathematical construction). These can be used as uncorrelated predictor variables in regression models. EFA Oblimin scores will have small non-zero inter-factor correlations, reflecting the genuine correlations among the latent constructs identified in Stage 10c.
Use cases for exported scores: - Use PCA scores for dimensionality reduction in machine learning models predicting gallstone status - Use EFA factor scores for structural equation modeling or clinical clustering analyses - Both score sets allow downstream analyses with a 75% reduction in dimensionality (29 variables → 5–7 scores)
data.frame(
Metric = c(
"Dataset",
"Observations (n)",
"Initial numeric variables",
"Variables dropped via KMO",
"Final variables for analysis",
"Correlated pairs |r| > 0.30",
"KMO Overall MSA",
"Bartlett's Test p-value",
"PCA components retained (Kaiser Rule)",
"PCA cumulative variance explained",
"EFA factors (Parallel Analysis)",
"EFA rotation method selected",
"Max inter-factor correlation (Phi)",
"EFA total variance explained",
"Split-sample stability (max diff)"
),
Value = c(
"Gallstone / Body Composition & Biomarkers (UCI)",
"319 patients",
"32 variables",
"3 variables: TC, LDL, ECF/TBW Ratio",
"29 variables",
sprintf("135 / 406 (33.3%%)"),
"0.8388 (Meritorious)",
"≈ 0 (p << 0.05)",
"7 components (PC1–PC7)",
"76.45%",
"5 factors (F1–F5)",
"OBLIMIN (oblique) - max phi = 0.471",
"0.471 (ML1 ↔ ML4)",
"61.01%",
"2.15% - STABLE"
)
) %>%
kbl(caption = "Complete Analysis Summary - PCA & EFA") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Metric | Value |
|---|---|
| Dataset | Gallstone / Body Composition & Biomarkers (UCI) |
| Observations (n) | 319 patients |
| Initial numeric variables | 32 variables |
| Variables dropped via KMO | 3 variables: TC, LDL, ECF/TBW Ratio |
| Final variables for analysis | 29 variables |
| Correlated pairs |r| > 0.30 | 135 / 406 (33.3%) |
| KMO Overall MSA | 0.8388 (Meritorious) |
| Bartlett’s Test p-value | ≈ 0 (p << 0.05) |
| PCA components retained (Kaiser Rule) | 7 components (PC1–PC7) |
| PCA cumulative variance explained | 76.45% |
| EFA factors (Parallel Analysis) | 5 factors (F1–F5) |
| EFA rotation method selected | OBLIMIN (oblique) - max phi = 0.471 |
| Max inter-factor correlation (Phi) | 0.471 (ML1 ↔︎ ML4) |
| EFA total variance explained | 61.01% |
| Split-sample stability (max diff) | 2.15% - STABLE |
## PCA Component Detail:
for (k in 1:n_comp)
cat(sprintf(" PC%d: lambda = %.4f | Var = %.2f%% | Cumul = %.2f%%\n",
k, eigenvalues[k], prop_var[k]*100, cum_var[k]*100))## PC1: lambda = 8.7859 | Var = 30.30% | Cumul = 30.30%
## PC2: lambda = 6.1447 | Var = 21.19% | Cumul = 51.48%
## PC3: lambda = 2.0819 | Var = 7.18% | Cumul = 58.66%
## PC4: lambda = 1.6823 | Var = 5.80% | Cumul = 64.46%
## PC5: lambda = 1.3655 | Var = 4.71% | Cumul = 69.17%
## PC6: lambda = 1.1053 | Var = 3.81% | Cumul = 72.98%
## PC7: lambda = 1.0051 | Var = 3.47% | Cumul = 76.45%
##
## EFA Factor Detail:
for (k in 1:n_fa)
cat(sprintf(" F%d: SS = %.4f | Var = %.2f%% | Cumul = %.2f%%\n",
k, efa_final$Vaccounted[1,k],
efa_final$Vaccounted[2,k]*100,
efa_final$Vaccounted[3,k]*100))## F1: SS = 7.2176 | Var = 24.89% | Cumul = 24.89%
## F2: SS = 4.8747 | Var = 16.81% | Cumul = 41.70%
## F3: SS = 1.9495 | Var = 6.72% | Cumul = 48.42%
## F4: SS = 1.9145 | Var = 6.60% | Cumul = 55.02%
## F5: SS = 1.7368 | Var = 5.99% | Cumul = 61.01%
Overall Interpretation & Conclusions:
This analysis successfully identified the latent structure underlying 29 body composition and biomarker variables in a sample of 319 gallstone patients. The two complementary techniques revealed:
From PCA (7 components, 76.45% variance):
The data is organized around seven distinct dimensions: (1) General Body
Size & Lean Mass, (2) Adiposity vs Lean Composition, (3) Renal
Aging, (4) Hepatic Enzyme Activity, (5) Glycemic-Lipid Metabolic Risk,
(6) Inflammation-Vitamin D Axis, and (7) Extreme Obesity. This
7-dimensional structure suggests that the assessment of gallstone risk
requires a multi-dimensional clinical framework - no single measurement
captures the full picture.
From EFA (5 factors, 61.01% variance,
Oblimin):
The more conservative Parallel Analysis identified 5 genuine latent
constructs: Lean Body Mass & Hydration, Adiposity & Fat
Composition, Liver Enzyme Activity, Age-Related Visceral & Renal
Function, and Skeletal Frame Size. The necessity of oblique rotation
(max factor correlation = 0.471) confirms that these clinical dimensions
are genuinely interrelated in real patients - for example, body size and
adiposity co-vary, as do aging and renal function.
Clinical implications for gallstone research:
The Adiposity factor (ML1) and Liver Enzyme factor (ML5) are
particularly relevant, as visceral obesity and hepatic dysfunction are
both established risk factors for cholesterol gallstone formation.
Future predictive models for gallstone risk should incorporate scores
from these two factors at minimum, alongside the aging dimension (ML3).
The 9 variables with poor communality (h² < 0.40) - including HDL,
Glucose, CRP, and Vitamin D - likely represent individual patient
variability not captured by group-level factor patterns and may require
separate clinical consideration.
Esen, I., Arslan, H., Aktürk, S., Gülşen, M., Kültekin, N., & Özdemir, O. (2024). Gallstone [Dataset]. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/1150/gallstone-1
Esen, I., Arslan, H., Aktürk Esen, S., Gülşen, M., Kültekin, N., & Özdemir, O. (2024). Early prediction of gallstone disease with a machine learning-based method from bioimpedance and laboratory data. Medicine, 103(6), e37258. https://doi.org/10.1097/md.0000000000037258
Goretzko, D., Siemund, K., & Sterner, P. (2024). Evaluating model fit of measurement models in confirmatory factor analysis. Educational and Psychological Measurement, 84(1), 123–144. https://doi.org/10.1177/00131644231163813
McNeish, D. (2023). Dynamic fit index cutoffs for confirmatory factor analysis models. Psychological Methods, 28(1), 61–88. https://doi.org/10.1037/met0000395
Midi, H., Bagheri, A., & Imon, A. H. M. R. (2022). Mitigating the multicollinearity problem and its machine learning approach: A review. Mathematics, 10(8), 1283. https://doi.org/10.3390/math10081283
Watkins, M. W. (2021). A step-by-step guide to exploratory factor analysis with R and R Studio. Routledge. https://doi.org/10.4324/9781003120001
Xia, Y., & Zhou, X. (2024). Improving the use of parallel analysis by accounting for sampling variability of the observed correlation matrix. Educational and Psychological Measurement, 85(1), 114–133. https://doi.org/10.1177/00131644241268073