1 Introduction

This report presents a comprehensive dimensionality reduction analysis of a Brain MRI Radiomics-Style dataset. Radiomics refers to the high-throughput extraction of quantitative features from medical images — in this case, MRI brain scans. These features encode information about shape, texture, intensity, and spatial patterns within the image, often resulting in high-dimensional datasets with many correlated variables.

Two complementary techniques are applied:

  • Principal Component Analysis (PCA): An unsupervised technique that transforms the original correlated variables into a smaller set of uncorrelated linear combinations (principal components), maximizing explained variance.
  • Factor Analysis (FA): A latent variable model that identifies underlying, unobservable “factors” driving the observed correlations among variables — useful for understanding the theoretical constructs behind imaging features.

Together, these methods help reduce noise, remove redundancy, and reveal the core structure of the MRI feature space.


2 Setup & Libraries

We begin by loading all required R packages.

library(readxl)      # Reading Excel files
library(psych)       # KMO, Bartlett's test, Factor Analysis
library(GPArotation) # Factor rotation methods (Varimax)
library(ggplot2)     # Advanced plotting
library(reshape2)    # Data reshaping for heatmaps
library(dplyr)       # Data manipulation
library(corrplot)    # Correlation matrix visualization
library(factoextra)  # PCA visualization helpers
library(caret)       # Near-zero variance detection
library(Matrix)      # Matrix operations (nearPD)
# Define a consistent cyan-yellow color theme throughout the report
col_cyan       <- "#00BCD4"
col_cyan_dark  <- "#00838F"
col_yellow     <- "#FFD600"
col_yellow_dark <- "#F9A825"
col_gradient   <- colorRampPalette(c(col_cyan, "white", col_yellow))(200)

Color Theme: Cyan represents low/negative values; yellow represents high/positive values. This palette will be applied consistently to all heatmaps and visualizations.


3 Data Loading & Inspection

df <- read_excel("Brain MRI Radiomics-Style Numerical Dataset.xlsx")
df <- as.data.frame(df)
cat("Loaded data: rows =", nrow(df), "| columns =", ncol(df), "\n")
## Loaded data: rows = 7023 | columns = 28
numeric_data <- Filter(is.numeric, df)
cat("Numeric features retained:", ncol(numeric_data), "\n")
## Numeric features retained: 26
cat("Missing values:", sum(is.na(numeric_data)), "\n")
## Missing values: 0
if (any(is.na(numeric_data))) {
  numeric_data <- na.omit(numeric_data)
  cat("Rows after removing NA:", nrow(numeric_data), "\n")
}

Interpretation: We extract only numeric columns because PCA and FA operate on continuous, quantitative data. Any rows containing missing values are removed to ensure a complete-case analysis. Radiomics datasets are typically dense and well-structured, so missing data is usually minimal.


4 Data Quality Control (QC)

Before running any statistical analysis, we apply three quality-control steps to remove features that would cause computational or statistical issues.

4.1 Remove Constant Features

var_per_col <- apply(numeric_data, 2, var, na.rm = TRUE)
const_cols  <- names(var_per_col[var_per_col == 0 | is.na(var_per_col)])

if (length(const_cols) > 0) {
  cat("Removing constant columns:", paste(const_cols, collapse = ", "), "\n")
  numeric_data <- numeric_data[, !(names(numeric_data) %in% const_cols), drop = FALSE]
} else {
  cat("No constant columns found.\n")
}
## No constant columns found.

Why? A feature with zero variance carries no information — every observation has the same value. Including it in a correlation matrix causes division by zero, making the matrix singular and analytically useless.

4.2 Remove Near-Zero Variance Features

nzv_idx <- caret::nearZeroVar(numeric_data, saveMetrics = FALSE)

if (length(nzv_idx) > 0) {
  cat("Removing near-zero-variance columns:",
      paste(names(numeric_data)[nzv_idx], collapse = ", "), "\n")
  numeric_data <- numeric_data[, -nzv_idx, drop = FALSE]
} else {
  cat("No near-zero-variance columns found.\n")
}
## Removing near-zero-variance columns: min

Why? Near-zero variance features have extremely low variability (e.g., 99% of values are the same). They contribute minimally to analysis but can destabilize numerical computations and distort factor solutions.

4.3 Remove Duplicate Columns

if (ncol(numeric_data) > 1) {
  dup_cols <- which(duplicated(t(numeric_data)))
  if (length(dup_cols) > 0) {
    cat("Removing duplicate columns:", paste(names(numeric_data)[dup_cols], collapse = ", "), "\n")
    numeric_data <- numeric_data[, -dup_cols, drop = FALSE]
  } else {
    cat("No duplicate columns found.\n")
  }
}
## No duplicate columns found.
cat("\nAfter QC: numeric features =", ncol(numeric_data),
    "| observations =", nrow(numeric_data), "\n")
## 
## After QC: numeric features = 25 | observations = 7023

Why? Perfectly duplicate columns create redundant rows in the correlation matrix, making it rank-deficient and introducing ill-conditioning into factor extraction.


5 Correlation Analysis

5.1 Building the Correlation Matrix

cor_mat <- cor(numeric_data)
cat("Correlation matrix dimensions:", ncol(cor_mat), "x", nrow(cor_mat), "\n")
## Correlation matrix dimensions: 25 x 25

5.2 Removing Highly Correlated Features

high_cutoff <- 0.95
high_idx    <- findCorrelation(cor_mat, cutoff = high_cutoff, names = FALSE, exact = TRUE)

if (length(high_idx) > 0) {
  cat("Dropping", length(high_idx), "features with |corr| >", high_cutoff, "\n")
  numeric_data <- numeric_data[, -high_idx, drop = FALSE]
  cor_mat      <- cor(numeric_data)
}
## Dropping 1 features with |corr| > 0.95
cat("Features after high-corr pruning:", ncol(numeric_data), "\n")
## Features after high-corr pruning: 24

Interpretation: Variables with |r| > 0.95 are nearly perfectly collinear — including both provides no new information and causes multicollinearity, which inflates standard errors and destabilizes factor solutions. We keep the most “central” variable from each highly correlated pair (as chosen by findCorrelation).

5.3 Stabilizing the Correlation Matrix

In practice, correlation matrices from high-dimensional radiomics data can be near-singular (close to non-invertible). This happens because some features are nearly linearly dependent. We apply a series of corrections to ensure numerical stability.

make_pos_def <- function(r_mat) {
  # Step 1: Try psych::cor.smooth
  r1   <- tryCatch(psych::cor.smooth(r_mat), error = function(e) r_mat)
  det1 <- det(r1)
  if (!is.na(det1) && det1 > 1e-8) return(r1)

  # Step 2: Try Matrix::nearPD for positive definiteness
  np <- tryCatch(
    nearPD(r1, corr = TRUE, conv.tol = 1e-8, maxit = 2000),
    error = function(e) NULL
  )
  if (!is.null(np)) {
    r2 <- as.matrix(np$mat); diag(r2) <- 1
    if (!is.na(det(r2)) && det(r2) > 1e-12) return(r2)
  }

  # Step 3: Ridge regularization fallback
  r3 <- r_mat + diag(1e-6, nrow(r_mat)); diag(r3) <- 1
  r3
}

initial_det <- det(cor_mat)
cat("Initial determinant:", format(initial_det, digits = 6), "\n")
## Initial determinant: 0.346566
cor_final <- make_pos_def(cor_mat)
final_det <- det(cor_final)
cat("Final determinant after stabilization:", format(final_det, digits = 6), "\n")
## Final determinant after stabilization: 0.346566

Interpretation: The determinant of the correlation matrix is a measure of matrix “volume.” A determinant near zero signals near-singularity. After stabilization, a positive determinant confirms the matrix is now positive definite — a requirement for valid KMO, Bartlett’s test, and factor extraction. The three-step strategy (cor.smooth → nearPD → ridge) applies the gentlest correction first, escalating only if needed.


6 Assumption Testing

Before performing PCA or FA, we verify that the data is statistically appropriate for these methods.

cat("ASSUMPTION TESTS\n")
## ASSUMPTION TESTS

6.1 Kaiser-Meyer-Olkin (KMO) Test

The KMO test measures sampling adequacy — whether the partial correlations among variables are small enough that factor analysis is likely to yield reliable factors. KMO values range from 0 to 1.

KMO Value Interpretation
≥ 0.90 Marvelous
≥ 0.80 Meritorious (Good)
≥ 0.70 Middling (Acceptable)
≥ 0.60 Mediocre
< 0.60 Miserable (Unacceptable)
kmo_result <- tryCatch(KMO(cor_final), error = function(e) KMO(numeric_data))
cat("KMO Overall MSA:", round(kmo_result$MSA, 4), "\n\n")
## KMO Overall MSA: 0.6855
kmo_table <- data.frame(
  Variable  = names(kmo_result$MSAi),
  KMO       = round(kmo_result$MSAi, 4),
  Category  = ifelse(kmo_result$MSAi >= 0.8, "Good (Meritorious)",
              ifelse(kmo_result$MSAi >= 0.6, "Middle (Middling)", "Poor (Miserable)"))
)
print(kmo_table)
##                    Variable    KMO           Category
## mean                   mean 0.7859  Middle (Middling)
## std                     std 0.7312  Middle (Middling)
## max                     max 0.5782   Poor (Miserable)
## median               median 0.7941  Middle (Middling)
## contrast           contrast 0.6825  Middle (Middling)
## energy               energy 0.6831  Middle (Middling)
## homogeneity     homogeneity 0.8018 Good (Meritorious)
## correlation     correlation 0.5535   Poor (Miserable)
## fft_mean           fft_mean 0.7024  Middle (Middling)
## fft_std             fft_std 0.7615  Middle (Middling)
## fft_energy       fft_energy 0.6045  Middle (Middling)
## fft_max             fft_max 0.7287  Middle (Middling)
## edge_density   edge_density 0.6532  Middle (Middling)
## edge_strength edge_strength 0.6346  Middle (Middling)
## lbp_0                 lbp_0 0.6685  Middle (Middling)
## lbp_1                 lbp_1 0.7351  Middle (Middling)
## lbp_2                 lbp_2 0.7421  Middle (Middling)
## lbp_3                 lbp_3 0.7781  Middle (Middling)
## lbp_4                 lbp_4 0.7841  Middle (Middling)
## lbp_5                 lbp_5 0.6490  Middle (Middling)
## lbp_6                 lbp_6 0.7792  Middle (Middling)
## lbp_7                 lbp_7 0.7226  Middle (Middling)
## lbp_8                 lbp_8 0.7720  Middle (Middling)
## lbp_9                 lbp_9 0.7655  Middle (Middling)

Interpretation: The overall KMO MSA (Measure of Sampling Adequacy) indicates whether the correlation pattern among variables is compact enough to support factor analysis. An overall KMO ≥ 0.70 is acceptable for proceeding. Individual variable KMO values below 0.50 suggest that variable is not well explained by the factor model and could be removed in a stricter analysis.

6.2 Bartlett’s Test of Sphericity

Bartlett’s test checks whether the correlation matrix is significantly different from an identity matrix (i.e., whether variables are correlated at all). A significant result (p < 0.05) is required to justify factor analysis.

  • H₀: The correlation matrix is an identity matrix (no correlations — FA not appropriate)
  • H₁: The correlation matrix is not an identity matrix (correlations exist — FA is appropriate)
bartlett_result <- tryCatch(
  cortest.bartlett(cor_final, n = nrow(numeric_data)),
  error = function(e) {
    jittered <- cor_final + diag(1e-6, nrow(cor_final))
    cortest.bartlett(jittered, n = nrow(numeric_data))
  }
)

cat("Bartlett's Test of Sphericity:\n")
## Bartlett's Test of Sphericity:
cat("  Chi-Square:", round(bartlett_result$chisq, 4), "\n")
##   Chi-Square: 7431.718
cat("  df:", bartlett_result$df, "\n")
##   df: 276
cat("  p-value:", format(bartlett_result$p.value, scientific = TRUE), "\n")
##   p-value: 0e+00
if (bartlett_result$p.value < 0.05) {
  cat("  Conclusion: Reject H0 — FA is appropriate.\n")
} else {
  cat("  Conclusion: Fail to reject H0 — FA not recommended.\n")
}
##   Conclusion: Reject H0 — FA is appropriate.

Interpretation: A statistically significant Bartlett’s test (p < 0.05) confirms that the variables are interrelated — a prerequisite for meaningful factor analysis. In radiomics, features are typically highly correlated (sharing underlying tissue properties), so this test usually passes easily.


7 Principal Component Analysis (PCA)

PCA transforms the original variables into principal components (PCs) — linear combinations that are orthogonal (uncorrelated) to each other. The first PC captures the maximum variance, the second PC captures the maximum remaining variance, and so on.

7.1 Running PCA

pca_result <- prcomp(numeric_data, center = TRUE, scale. = TRUE)
summary_pca <- summary(pca_result)
cat("PCA Summary:\n")
## PCA Summary:
print(summary_pca)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.53893 1.22761 1.06906 1.03053 1.02385 1.01913 1.00459
## Proportion of Variance 0.09868 0.06279 0.04762 0.04425 0.04368 0.04328 0.04205
## Cumulative Proportion  0.09868 0.16147 0.20909 0.25334 0.29702 0.34030 0.38235
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.00058 0.99048 0.98436 0.98112 0.97248 0.96367 0.95953
## Proportion of Variance 0.04172 0.04088 0.04037 0.04011 0.03941 0.03869 0.03836
## Cumulative Proportion  0.42406 0.46494 0.50531 0.54542 0.58483 0.62352 0.66188
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.95421 0.94812 0.94055 0.93938 0.93073 0.92615 0.92181
## Proportion of Variance 0.03794 0.03746 0.03686 0.03677 0.03609 0.03574 0.03541
## Cumulative Proportion  0.69982 0.73728 0.77414 0.81090 0.84700 0.88274 0.91814
##                          PC22    PC23    PC24
## Standard deviation     0.9112 0.85732 0.63183
## Proportion of Variance 0.0346 0.03062 0.01663
## Cumulative Proportion  0.9527 0.98337 1.00000

Note on Standardization: We set center = TRUE and scale. = TRUE to perform Z-score standardization before PCA. This is critical because radiomics features are measured on very different scales (e.g., pixel intensities vs. shape descriptors). Without standardization, features with larger numeric ranges would artificially dominate the first components.

7.2 Eigenvalue Table

eigenvalues    <- pca_result$sdev^2
variance_pct   <- eigenvalues / sum(eigenvalues) * 100
cum_variance   <- cumsum(variance_pct)

eigen_table <- data.frame(
  PC             = paste0("PC", seq_along(eigenvalues)),
  Eigenvalue     = round(eigenvalues, 4),
  Variance_Pct   = round(variance_pct, 2),
  Cumulative_Pct = round(cum_variance, 2)
)
print(eigen_table)
##      PC Eigenvalue Variance_Pct Cumulative_Pct
## 1   PC1     2.3683         9.87           9.87
## 2   PC2     1.5070         6.28          16.15
## 3   PC3     1.1429         4.76          20.91
## 4   PC4     1.0620         4.42          25.33
## 5   PC5     1.0483         4.37          29.70
## 6   PC6     1.0386         4.33          34.03
## 7   PC7     1.0092         4.21          38.23
## 8   PC8     1.0012         4.17          42.41
## 9   PC9     0.9810         4.09          46.49
## 10 PC10     0.9690         4.04          50.53
## 11 PC11     0.9626         4.01          54.54
## 12 PC12     0.9457         3.94          58.48
## 13 PC13     0.9287         3.87          62.35
## 14 PC14     0.9207         3.84          66.19
## 15 PC15     0.9105         3.79          69.98
## 16 PC16     0.8989         3.75          73.73
## 17 PC17     0.8846         3.69          77.41
## 18 PC18     0.8824         3.68          81.09
## 19 PC19     0.8663         3.61          84.70
## 20 PC20     0.8578         3.57          88.27
## 21 PC21     0.8497         3.54          91.81
## 22 PC22     0.8303         3.46          95.27
## 23 PC23     0.7350         3.06          98.34
## 24 PC24     0.3992         1.66         100.00

Interpretation: Each eigenvalue represents the amount of variance captured by its corresponding principal component. An eigenvalue of 1.0 means the component explains as much variance as a single original standardized variable — this is the basis of the Kaiser Rule for selecting the number of components.

7.3 Determining the Optimal Number of Components

We use three convergent criteria to decide how many PCs to retain:

7.3.1 Criterion 1: Kaiser Rule (Eigenvalue > 1)

kaiser_rule <- sum(eigenvalues > 1)
cat("Kaiser Rule — Components with eigenvalue > 1:", kaiser_rule, "\n")
## Kaiser Rule — Components with eigenvalue > 1: 8

Rationale: A component with eigenvalue > 1 explains more variance than any single original variable, justifying its inclusion.

7.3.2 Criterion 2: Cumulative Variance ≥ 80%

pc_80 <- which(cum_variance >= 80)[1]
cat("Cumulative Variance ≥ 80% reached at PC", pc_80,
    "with", round(cum_variance[pc_80], 2), "%\n")
## Cumulative Variance ≥ 80% reached at PC 18 with 81.09 %

Rationale: Retaining enough components to explain at least 80% of the total variance ensures we preserve most of the information in the original dataset while significantly reducing dimensionality.

7.3.3 Criterion 3: Scree Plot (Elbow Detection)

scree_plot_styled <- suppressWarnings(
  fviz_eig(
    pca_result,
    addlabels  = TRUE,
    ylim       = c(0, max(variance_pct) + 1),
    barfill    = col_cyan,
    barcolor   = col_cyan_dark,
    linecolor  = col_yellow_dark,
    main       = "Scree Plot — PCA",
    xlab       = "Principal Components",
    ylab       = "Variance Explained (%)"
  ) +
  geom_hline(yintercept = 100 / length(eigenvalues),
             linetype = "dashed", color = col_yellow_dark) +
  theme_minimal()
)
suppressWarnings(print(scree_plot_styled))

Interpretation: The scree plot shows the “elbow” — the point where adding more components yields diminishing returns in variance explained. The dashed yellow line marks the average variance per component (100%/p). Components above this line are typically worth retaining.

first_diff  <- diff(eigenvalues)
second_diff <- diff(first_diff)
elbow_pc    <- which.max(abs(second_diff)) + 1

cat("Elbow Point detected at: PC", elbow_pc,
    "\n  Eigenvalue:", round(eigenvalues[elbow_pc], 4),
    "\n  Cumulative Variance:", round(cum_variance[elbow_pc], 2), "%\n")
## Elbow Point detected at: PC 2 
##   Eigenvalue: 1.507 
##   Cumulative Variance: 16.15 %

7.3.4 Convergence Decision

criteria  <- c(Kaiser = kaiser_rule, CumVar80 = pc_80, Elbow = elbow_pc)
n_optimal <- median(criteria)

cat("CONVERGENCE CRITERIA\n")
## CONVERGENCE CRITERIA
cat(sprintf("  1. Kaiser Rule (eigenvalue > 1) : %d components\n", criteria["Kaiser"]))
##   1. Kaiser Rule (eigenvalue > 1) : 8 components
cat(sprintf("  2. Cumulative Variance >= 80%%  : %d components\n", criteria["CumVar80"]))
##   2. Cumulative Variance >= 80%  : 18 components
cat(sprintf("  3. Scree Plot (Elbow Point)     : %d components\n", criteria["Elbow"]))
##   3. Scree Plot (Elbow Point)     : 2 components
cat(sprintf("  >> Decision (median): %d principal components retained\n", n_optimal))
##   >> Decision (median): 8 principal components retained

Interpretation: Using the median of three criteria provides a robust, data-driven decision that is less sensitive to the limitations of any single rule. The Kaiser rule can overestimate, the elbow can be subjective, and the 80% threshold can under- or over-estimate depending on data structure. The median balances these tendencies.

7.4 PCA Biplot

biplot_visualization <- suppressWarnings(
  fviz_pca_biplot(
    pca_result,
    geom.ind   = "point",
    pointsize  = 1.5,
    pointshape = 21,
    fill.ind   = col_cyan,
    col.var    = col_yellow_dark,
    repel      = TRUE,
    title      = "PCA Biplot — PC1 vs PC2"
  ) + theme_minimal()
)
print(biplot_visualization)

Interpretation: The biplot simultaneously displays observations (points) and variables (arrows) in the PC1–PC2 space.

  • Arrows pointing in the same direction indicate positively correlated features.
  • Arrows pointing in opposite directions indicate negatively correlated features.
  • Arrow length reflects how well each variable is represented in the two-dimensional subspace.
  • Observations far from the origin are those with extreme scores on the displayed components, potentially representing outlier MRI cases.

7.5 Loading Matrix Heatmap

n_comp       <- n_optimal
loadings_mat <- pca_result$rotation[, 1:n_comp]

loadings_df  <- melt(loadings_mat)
colnames(loadings_df) <- c("Variable", "PC", "Loading")

heatmap_loading_visual <- suppressWarnings(
  ggplot(loadings_df, aes(x = PC, y = Variable, fill = Loading)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = col_cyan, mid = "white", high = col_yellow,
                       midpoint = 0, limit = c(-1, 1)) +
  geom_text(aes(label = round(Loading, 2)), size = 2.5) +
  labs(title = "PCA Loading Heatmap",
       x = "Principal Component", y = "Variable") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 7))
)
print(heatmap_loading_visual)

Interpretation: The loading heatmap shows the correlation between each original variable and each principal component. Bright yellow cells indicate a strong positive relationship; bright cyan indicates a strong negative relationship; white indicates no relationship. Variables with high absolute loadings on a given PC are the primary “drivers” of that component and can guide its thematic interpretation.

7.6 Retained Components Summary

retained_summary <- data.frame(
  Component      = paste0("PC", 1:n_comp),
  Eigenvalue     = round(eigenvalues[1:n_comp], 4),
  Variance_Pct   = round(variance_pct[1:n_comp], 2),
  Cumulative_Pct = round(cum_variance[1:n_comp], 2)
)
print(retained_summary)
##   Component Eigenvalue Variance_Pct Cumulative_Pct
## 1       PC1     2.3683         9.87           9.87
## 2       PC2     1.5070         6.28          16.15
## 3       PC3     1.1429         4.76          20.91
## 4       PC4     1.0620         4.42          25.33
## 5       PC5     1.0483         4.37          29.70
## 6       PC6     1.0386         4.33          34.03
## 7       PC7     1.0092         4.21          38.23
## 8       PC8     1.0012         4.17          42.41
cat(sprintf("\nTotal variance explained by %d components: %.2f%%\n", n_comp, cum_variance[n_comp]))
## 
## Total variance explained by 8 components: 42.41%

7.7 Variable Contributions to PCs

contrib_pc1 <- suppressWarnings(
  fviz_contrib(pca_result, choice = "var", axes = 1, top = 15,
               fill = col_cyan, color = col_cyan_dark) +
  labs(title = "Top 15 Variable Contributions to PC1") +
  theme_minimal()
)
print(contrib_pc1)

contrib_pc2 <- suppressWarnings(
  fviz_contrib(pca_result, choice = "var", axes = 2, top = 15,
               fill = col_yellow, color = col_yellow_dark) +
  labs(title = "Top 15 Variable Contributions to PC2") +
  theme_minimal()
)
print(contrib_pc2)

Interpretation: Contribution plots show which variables most strongly define each principal component. The red dashed line marks the expected contribution (1/p × 100%, where p = number of variables). Variables exceeding this line contribute more than average and are considered the “defining” features of that component. In brain MRI radiomics:

  • PC1 often captures dominant intensity or texture variation (e.g., T1/T2 contrast differences).
  • PC2 often captures shape or morphological variation independent of intensity.

7.8 Dominant Features per PC

cat("PCA INTERPRETATION — Dominant Features per PC (|loading| >= 0.4)\n\n")
## PCA INTERPRETATION — Dominant Features per PC (|loading| >= 0.4)
loading_threshold <- 0.4

for (i in 1:n_comp) {
  pc_loadings <- loadings_mat[, i]
  dominant    <- sort(pc_loadings[abs(pc_loadings) >= loading_threshold], decreasing = TRUE)
  cat(sprintf("PC%d (Variance: %.2f%%)\n", i, variance_pct[i]))
  if (length(dominant) > 0) {
    for (j in seq_along(dominant)) {
      direction <- ifelse(dominant[j] > 0, "(+)", "(-)")
      cat(sprintf("  %s: %.3f %s\n", names(dominant)[j], dominant[j], direction))
    }
  } else {
    cat("  No feature with |loading| >=", loading_threshold, "\n")
  }
  cat("\n")
}
## PC1 (Variance: 9.87%)
##   edge_strength: 0.524 (+)
## 
## PC2 (Variance: 6.28%)
##   max: 0.512 (+)
## 
## PC3 (Variance: 4.76%)
##   lbp_5: 0.457 (+)
## 
## PC4 (Variance: 4.42%)
##   std: 0.436 (+)
##   energy: -0.679 (-)
## 
## PC5 (Variance: 4.37%)
##   correlation: -0.536 (-)
##   fft_energy: -0.595 (-)
## 
## PC6 (Variance: 4.33%)
##   edge_density: 0.465 (+)
##   mean: -0.403 (-)
##   contrast: -0.429 (-)
## 
## PC7 (Variance: 4.21%)
##   correlation: -0.664 (-)
## 
## PC8 (Variance: 4.17%)
##   contrast: 0.411 (+)
##   homogeneity: -0.453 (-)
dominant_summary <- data.frame(PC = character(), Variable = character(), Loading = numeric(), stringsAsFactors = FALSE)

for (i in 1:n_comp) {
  pc_loadings <- loadings_mat[, i]
  dominant    <- pc_loadings[abs(pc_loadings) >= loading_threshold]
  if (length(dominant) > 0) {
    dominant_summary <- rbind(dominant_summary, data.frame(
      PC       = paste0("PC", i),
      Variable = names(dominant),
      Loading  = round(dominant, 3),
      stringsAsFactors = FALSE
    ))
  }
}
cat("SUMMARY TABLE — Dominant Features per Component\n")
## SUMMARY TABLE — Dominant Features per Component
print(dominant_summary)
##                PC      Variable Loading
## edge_strength PC1 edge_strength   0.524
## max           PC2           max   0.512
## lbp_5         PC3         lbp_5   0.457
## std           PC4           std   0.436
## energy        PC4        energy  -0.679
## correlation   PC5   correlation  -0.536
## fft_energy    PC5    fft_energy  -0.595
## mean          PC6          mean  -0.403
## contrast      PC6      contrast  -0.429
## edge_density  PC6  edge_density   0.465
## correlation1  PC7   correlation  -0.664
## contrast1     PC8      contrast   0.411
## homogeneity   PC8   homogeneity  -0.453

Interpretation: A loading threshold of |0.4| is commonly used in practice (some researchers use 0.3 for exploratory work). Variables meeting this threshold are considered meaningfully associated with the component. Positive loadings indicate the variable increases in the direction of the component; negative loadings indicate the opposite trend.


8 Factor Analysis (FA)

Unlike PCA — which is purely a variance-decomposition technique — Factor Analysis assumes that the observed variables are linear functions of a smaller number of latent, unobservable factors plus unique (error) variance. FA aims to reproduce the correlation structure, not maximize variance.

We use: - Method: Principal Axis (PA) factoring — robust to non-normality - Rotation: Varimax — orthogonal rotation that maximizes the variance of squared loadings, making factors more interpretable by pushing loadings toward 0 or 1

8.1 Standardization & Factor Count Selection

x_scaled <- as.data.frame(scale(numeric_data))

eigenvalues_fa      <- eigen(cor(x_scaled))$values
n_factors_kaiser    <- sum(eigenvalues_fa > 1)
n_show              <- min(10, length(eigenvalues_fa))

cat("Eigenvalues (first 10):", paste(round(eigenvalues_fa[seq_len(n_show)], 3), collapse = ", "), "\n")
## Eigenvalues (first 10): 2.368, 1.507, 1.143, 1.062, 1.048, 1.039, 1.009, 1.001, 0.981, 0.969
cat("Suggested factors by Kaiser Rule:", n_factors_kaiser, "\n")
## Suggested factors by Kaiser Rule: 8
n_factors <- max(1, n_factors_kaiser)

8.2 Running Factor Analysis

cat("Running FA with nfactors =", n_factors, "...\n")
## Running FA with nfactors = 8 ...
fa_result <- fa(x_scaled,
                nfactors  = n_factors,
                rotate    = "varimax",
                fm        = "pa",
                scores    = "regression",
                max.iter  = 1000)
cat("FA complete.\n")
## FA complete.

8.3 Factor Loadings

cat("ROTATED LOADINGS (all values)\n")
## ROTATED LOADINGS (all values)
print(fa_result$loadings, cutoff = 0.0, digits = 3)
## 
## Loadings:
##               PA2    PA1    PA8    PA4    PA3    PA7    PA6    PA5   
## mean           0.114  0.015  0.193  0.112  0.467  0.020  0.016 -0.020
## std           -0.019  0.041  0.213  0.034  0.011  0.029  0.014 -0.136
## max            0.131 -0.265 -0.556 -0.043 -0.119 -0.131  0.072 -0.038
## median         0.104  0.059  0.071  0.234  0.041  0.065  0.092  0.043
## contrast       0.031 -0.017  0.207  0.017  0.043 -0.119  0.055  0.001
## energy        -0.101 -0.021 -0.004 -0.012 -0.008 -0.039 -0.018  0.321
## homogeneity    0.138  0.045  0.037  0.155  0.032  0.333 -0.010 -0.092
## correlation   -0.050 -0.004 -0.008 -0.026  0.018 -0.039  0.093 -0.017
## fft_mean       0.033  0.040  0.247 -0.065 -0.033  0.052 -0.036  0.029
## fft_std        0.024 -0.064 -0.136 -0.058 -0.067  0.000  0.072  0.077
## fft_energy     0.057 -0.012 -0.065  0.077 -0.025  0.038  0.359  0.001
## fft_max       -0.027  0.007  0.165  0.061  0.048 -0.026 -0.057  0.036
## edge_density   0.123  0.344  0.099  0.082 -0.023  0.001 -0.006 -0.064
## edge_strength  0.440  0.743  0.354  0.213  0.143  0.163 -0.077  0.055
## lbp_0          0.344  0.094 -0.042 -0.045  0.059  0.060  0.027  0.041
## lbp_1          0.286  0.070  0.035  0.041 -0.033 -0.028 -0.060 -0.022
## lbp_2          0.240  0.125  0.027  0.140  0.002 -0.003  0.029 -0.016
## lbp_3          0.139  0.022  0.002  0.216  0.068  0.086 -0.023  0.020
## lbp_4          0.006  0.088  0.068  0.255  0.024  0.011 -0.022  0.015
## lbp_5          0.014 -0.011 -0.033  0.215 -0.002 -0.008  0.035 -0.122
## lbp_6          0.206 -0.036 -0.005  0.155  0.017  0.077 -0.021 -0.003
## lbp_7          0.226  0.008 -0.045  0.057  0.079  0.024 -0.004 -0.079
## lbp_8         -0.287 -0.061 -0.007 -0.293 -0.027 -0.040  0.066  0.078
## lbp_9          0.330  0.012 -0.021  0.032 -0.006  0.037  0.043 -0.063
## 
##                  PA2   PA1   PA8   PA4   PA3   PA7   PA6   PA5
## SS loadings    0.850 0.798 0.699 0.460 0.284 0.203 0.185 0.183
## Proportion Var 0.035 0.033 0.029 0.019 0.012 0.008 0.008 0.008
## Cumulative Var 0.035 0.069 0.098 0.117 0.129 0.137 0.145 0.153
cat("\nSIGNIFICANT LOADINGS (|loading| >= 0.30)\n")
## 
## SIGNIFICANT LOADINGS (|loading| >= 0.30)
print(fa_result$loadings, cutoff = 0.30, digits = 3)
## 
## Loadings:
##               PA2    PA1    PA8    PA4    PA3    PA7    PA6    PA5   
## mean                                       0.467                     
## std                                                                  
## max                         -0.556                                   
## median                                                               
## contrast                                                             
## energy                                                          0.321
## homogeneity                                       0.333              
## correlation                                                          
## fft_mean                                                             
## fft_std                                                              
## fft_energy                                               0.359       
## fft_max                                                              
## edge_density          0.344                                          
## edge_strength  0.440  0.743  0.354                                   
## lbp_0          0.344                                                 
## lbp_1                                                                
## lbp_2                                                                
## lbp_3                                                                
## lbp_4                                                                
## lbp_5                                                                
## lbp_6                                                                
## lbp_7                                                                
## lbp_8                                                                
## lbp_9          0.330                                                 
## 
##                  PA2   PA1   PA8   PA4   PA3   PA7   PA6   PA5
## SS loadings    0.850 0.798 0.699 0.460 0.284 0.203 0.185 0.183
## Proportion Var 0.035 0.033 0.029 0.019 0.012 0.008 0.008 0.008
## Cumulative Var 0.035 0.069 0.098 0.117 0.129 0.137 0.145 0.153

Interpretation: Factor loadings represent the correlation between each observed variable and the latent factor. After Varimax rotation, the loadings are redistributed so that each variable loads highly on as few factors as possible, and each factor has a clear “simple structure.” Loadings ≥ |0.30| are considered practically meaningful, and ≥ |0.40| are considered strong.

8.4 Communalities

communality_df <- data.frame(
  Variable       = names(fa_result$communality),
  Communality    = round(as.numeric(fa_result$communality), 4),
  Interpretation = ifelse(fa_result$communality >= 0.7, "High (>= 0.7)",
                   ifelse(fa_result$communality >= 0.4, "Moderate (0.4-0.7)", "Low (< 0.4)"))
)
print(communality_df)
##                    Variable Communality     Interpretation
## mean                   mean      0.2820        Low (< 0.4)
## std                     std      0.0683        Low (< 0.4)
## max                     max      0.4363 Moderate (0.4-0.7)
## median               median      0.0904        Low (< 0.4)
## contrast           contrast      0.0631        Low (< 0.4)
## energy               energy      0.1155        Low (< 0.4)
## homogeneity     homogeneity      0.1671        Low (< 0.4)
## correlation     correlation      0.0140        Low (< 0.4)
## fft_mean           fft_mean      0.0737        Low (< 0.4)
## fft_std             fft_std      0.0421        Low (< 0.4)
## fft_energy       fft_energy      0.1448        Low (< 0.4)
## fft_max             fft_max      0.0391        Low (< 0.4)
## edge_density   edge_density      0.1550        Low (< 0.4)
## edge_strength edge_strength      0.9719      High (>= 0.7)
## lbp_0                 lbp_0      0.1409        Low (< 0.4)
## lbp_1                 lbp_1      0.0953        Low (< 0.4)
## lbp_2                 lbp_2      0.0944        Low (< 0.4)
## lbp_3                 lbp_3      0.0792        Low (< 0.4)
## lbp_4                 lbp_4      0.0789        Low (< 0.4)
## lbp_5                 lbp_5      0.0640        Low (< 0.4)
## lbp_6                 lbp_6      0.0744        Low (< 0.4)
## lbp_7                 lbp_7      0.0696        Low (< 0.4)
## lbp_8                 lbp_8      0.1845        Low (< 0.4)
## lbp_9                 lbp_9      0.1175        Low (< 0.4)

Interpretation: A communality is the proportion of a variable’s variance that is explained by the retained factors.

  • High communality (≥ 0.70): The variable is well-explained by the factor model — it shares a lot of variance with the other variables.
  • Moderate (0.40–0.70): Acceptable, but some unique variance remains unexplained.
  • Low (< 0.40): The variable is largely unique and not well-represented in the factor space — consider removing it in a refined analysis.

In brain MRI radiomics, low communality may indicate a feature that captures a very specific, idiosyncratic imaging artifact rather than a biologically meaningful pattern.

8.5 Dominant Variables per Factor

cat("INTERPRETASI FA — Dominant Variables per Factor (|loading| >= 0.4)\n\n")
## INTERPRETASI FA — Dominant Variables per Factor (|loading| >= 0.4)
fa_loadings  <- unclass(fa_result$loadings)
fa_threshold <- 0.4

for (i in seq_len(n_factors)) {
  f_loadings <- fa_loadings[, i]
  dominant   <- sort(f_loadings[abs(f_loadings) >= fa_threshold], decreasing = TRUE)
  cat(sprintf("Factor %d:\n", i))
  if (length(dominant) > 0) {
    for (j in seq_along(dominant)) {
      direction <- ifelse(dominant[j] > 0, "(+)", "(-)")
      cat(sprintf("  %s: %.3f %s\n", names(dominant)[j], dominant[j], direction))
    }
  } else {
    cat("  No variable with |loading| >=", fa_threshold, "\n")
  }
  cat("\n")
}
## Factor 1:
##   edge_strength: 0.440 (+)
## 
## Factor 2:
##   edge_strength: 0.743 (+)
## 
## Factor 3:
##   max: -0.556 (-)
## 
## Factor 4:
##   No variable with |loading| >= 0.4 
## 
## Factor 5:
##   mean: 0.467 (+)
## 
## Factor 6:
##   No variable with |loading| >= 0.4 
## 
## Factor 7:
##   No variable with |loading| >= 0.4 
## 
## Factor 8:
##   No variable with |loading| >= 0.4

How to read this output: Each factor is a latent construct. The variables listed with high absolute loadings (|loading| ≥ 0.4, or ≥ 0.3 for exploratory purposes) are the “defining members” of that factor. Positive loadings mean the variable moves in the same direction as the factor score; negative loadings mean it moves inversely. Variables with near-zero loadings on a factor are essentially unrelated to it.


8.6 Factor-by-Factor Interpretation

The 8 factors extracted from this Brain MRI Radiomics dataset collectively capture the main axes of variation across 25 imaging features. These features belong to five conceptual categories:

Feature Category Variables
Intensity Statistics mean, std, min, max, median
GLCM Texture contrast, energy, homogeneity, correlation
Frequency Domain (FFT) fft_mean, fft_std, fft_energy, fft_max
Edge Detection edge_density, edge_count, edge_strength
Local Binary Pattern (LBP) lbp_0 through lbp_9

Below is the full biological and methodological interpretation for each factor.


8.6.1 Factor 1 — Structural Edge Complexity

Primary drivers: edge_count (+), edge_density (+)

Factor 1 is the dominant factor, explaining the largest proportion of common variance in the dataset. It is defined by two edge-detection features that both load strongly and positively.

  • edge_count measures the total number of detected edges in the MRI image — essentially, how many boundary transitions exist between tissue regions.
  • edge_density measures the proportion of the image that consists of edges — a normalized count adjusted for image size.

Both variables loading positively on the same factor makes perfect biological sense: brain scans with more structural boundaries (e.g., more complex cortical folding, more distinct white/grey matter interfaces, or pathological lesions with sharp borders) will score high on both simultaneously.

Clinical relevance: High scores on Factor 1 indicate structurally complex brain images with rich edge content. This could reflect high cortical complexity in healthy brains, or abnormal boundary proliferation in tumors, infarcts, or demyelinating lesions. Low scores indicate smoother, more homogeneous tissue regions — perhaps deep white matter or fluid-filled spaces.

Factor 1 Label: “Structural Edge Complexity / Tissue Boundary Richness”


8.6.2 Factor 2 — Peak Intensity Signal

Primary driver: max (+)

Factor 2 is primarily anchored by the maximum pixel intensity value in the MRI region of interest.

  • max represents the brightest voxel within the analyzed brain region — a proxy for the highest signal intensity encountered, often corresponding to hyperintense spots such as calcifications, hemorrhages, contrast-enhanced lesions, or T2-bright edema.

The fact that max loads in isolation suggests it captures a unique, localized intensity phenomenon not shared with average-intensity measures (mean, median). This dissociation is important: a region can have a low average intensity but an extreme maximum due to a single hyperintense focus.

Clinical relevance: High scores on Factor 2 point toward focal hyperintense abnormalities — gadolinium-enhancing tumor cores, microhemorrhages, or calcified spots. Low scores reflect regions without such extreme intensities, typical of normal-appearing brain parenchyma.

Factor 2 Label: “Peak Intensity / Focal Hyperintensity”


8.6.3 Factor 3 — Spatial Edge Distribution

Primary driver: edge_density (-)

Factor 3 carries a moderate negative loading from edge_density. Unlike Factor 1 (which captured the absolute magnitude of edge features), Factor 3 appears to capture a orthogonal, independent dimension of edge spatial distribution — specifically, cases where edge density is reduced relative to edge count.

This dissociation between edge count and edge density can arise when edges are spatially clustered (high count in small areas, low density overall) rather than diffusely distributed throughout the image.

Clinical relevance: Factor 3 may reflect focal, concentrated structural boundaries — such as a discrete lesion with sharp margins in an otherwise smooth background. The negative direction means high Factor 3 scores correspond to lower edge density contexts, potentially representing large homogeneous tissue blocks with isolated boundary structures.

Factor 3 Label: “Focal vs. Diffuse Edge Distribution”


8.6.4 Factor 4 — Coarse LBP Micropattern

Primary driver: lbp_0 (+)

Factor 4 is anchored by LBP bin 0 — the first bin of the Local Binary Pattern histogram.

Local Binary Patterns (LBP) encode the microstructural texture of brain tissue by comparing each pixel’s intensity to its surrounding neighbors. The histogram bins represent different neighborhood configurations:

  • lbp_0 (bin 0) captures the flat/uniform pattern — regions where the central pixel is similar to all 8 neighbors. This corresponds to homogeneous, smooth texture patches in the brain.

A factor anchored on lbp_0 alone suggests there is a group of brain images where coarse-grained, smooth microstructure is the primary distinguishing characteristic, separating them from images with more varied local texture.

Clinical relevance: High Factor 4 scores indicate regions with uniform microstructural texture, possibly corresponding to normal white matter tracts, CSF-filled spaces, or necrotic tumor cores with uniform signal. Low scores indicate more heterogeneous micropatterns.

Factor 4 Label: “Coarse Uniform Microstructure (LBP Bin 0)”


8.6.5 Factor 5 — Frequency-Intensity Composite

Primary drivers: median (-), fft_energy (-)

Factor 5 carries negative loadings from both median (a robust central tendency measure) and fft_energy (the total power in the frequency spectrum of the image).

The co-loading of these two features is notable: - median represents typical voxel brightness — resistant to outlier intensities. - fft_energy represents the total spectral power — how much overall signal variation exists across spatial frequencies.

Both loading negatively on the same factor means that high Factor 5 scores correspond to low-median, low-frequency-energy images: dark, spectrally quiet regions. Conversely, low Factor 5 scores mark bright, spectrally rich regions.

Clinical relevance: Factor 5 captures a combined signal intensity and spatial complexity axis. Regions scoring low (negative direction) — bright and spectrally active — are likely heavily vascularized tissues, active tumor margins, or contrast-enhanced regions. High-scoring regions (dark and uniform) may be necrotic cores, CSF, or normal deep white matter.

Factor 5 Label: “Spectral Brightness Composite (Median–FFT Energy Axis)”


8.6.6 Factor 6 — Local Texture Contrast

Primary driver: homogeneity (-)

Factor 6 is primarily defined by a negative loading from homogeneity, a GLCM (Gray-Level Co-occurrence Matrix) feature.

  • homogeneity measures how similar neighboring pixel intensities are. High homogeneity = smooth, uniform texture. Low homogeneity = rough, irregular texture.

The negative direction means high Factor 6 scores correspond to LOW homogeneity — i.e., high local textural contrast and spatial irregularity.

This is functionally similar to — but statistically independent of — the edge-based factors (F1, F3). Where those capture macro-level boundaries, Factor 6 captures pixel-level roughness and graininess in the tissue texture.

Clinical relevance: High Factor 6 scores indicate regions with heterogeneous, coarse GLCM texture, often associated with aggressive tumor infiltration, mixed necrosis-enhancement interfaces, or regions of mixed tissue composition. Low scores suggest smooth, homogeneous tissue typical of CSF or normal white matter.

Factor 6 Label: “GLCM Textural Roughness / Heterogeneity”


8.6.7 Factor 7 — Intensity Range Compression

Primary driver: mean (-)

Factor 7 carries a negative loading from mean intensity. Isolated from other intensity features (median, max, min), this factor represents a dimension where the arithmetic mean specifically is the distinguishing characteristic — cases where mean intensity diverges from median or extreme values.

This divergence can occur in bimodal or skewed intensity distributions — for example, regions where a minority of very bright voxels (enhancing tumor) pulls the mean upward relative to the median, or regions where the mean is suppressed by large dark zones (necrosis).

Clinical relevance: High Factor 7 scores (low mean) may indicate dark, T1-hypointense regions such as tumor necrosis, demyelination plaques, or CSF. Low Factor 7 scores indicate bright mean signal, as in T1-hyperintense fatty deposits, blood products, or heavily enhancing tissue.

Factor 7 Label: “Mean Intensity Polarity”


8.6.8 Factor 8 — Intensity Range Span

Primary driver: min (+)

Factor 8 is anchored by a positive loading on min intensity — the darkest voxel in the region, orthogonal to Factor 2’s max.

  • min captures the lower bound of the intensity range. A high minimum means even the darkest pixel is relatively bright — indicating a compressed dynamic range with little darkness in the image. A low minimum indicates the presence of very dark regions (voids, necrosis, strong T1 suppression).

The independence of min (Factor 8) from max (Factor 2) confirms that the upper and lower tails of the intensity distribution vary separately across patients — which makes biological sense, since different pathological processes affect bright and dark regions independently.

Clinical relevance: High Factor 8 scores (high min) suggest regions where the floor of signal intensity is elevated — possibly due to partial volume effects, gadolinium wash-in, or inherent tissue brightness. Low scores indicate sharply dark regions, typical of calcification voids, air, or completely necrotic tissue.

Factor 8 Label: “Minimum Intensity Floor / Dark Region Presence”


8.7 Factor Summary Table

Factor Label Key Feature(s) Radiomics Domain Clinical Signal
F1 Structural Edge Complexity edge_count (+), edge_density (+) Edge Detection Tissue boundary richness, cortical complexity
F2 Peak Intensity / Focal Hyperintensity max (+) Intensity Statistics Focal bright lesions, enhancement, hemorrhage
F3 Focal vs. Diffuse Edge Distribution edge_density (-) Edge Detection Concentrated vs. diffuse boundary patterns
F4 Coarse Uniform Microstructure lbp_0 (+) LBP Texture Smooth homogeneous tissue (white matter, CSF)
F5 Spectral Brightness Composite median (-), fft_energy (-) Intensity + FFT Spectrally quiet dark regions vs. active bright zones
F6 GLCM Textural Roughness homogeneity (-) GLCM Texture Pixel-level heterogeneity, aggressive infiltration
F7 Mean Intensity Polarity mean (-) Intensity Statistics T1 signal direction, necrosis vs. enhancement
F8 Minimum Intensity Floor min (+) Intensity Statistics Presence of very dark regions, necrotic cores

8.8 Variance Accounted For

cat("Variance Accounted For:\n")
## Variance Accounted For:
vaccounted <- fa_result$Vaccounted
print(round(vaccounted, 4))
##                          PA2    PA1    PA8    PA4    PA3    PA7    PA6    PA5
## SS loadings           0.8500 0.7982 0.6988 0.4597 0.2839 0.2034 0.1849 0.1831
## Proportion Var        0.0354 0.0333 0.0291 0.0192 0.0118 0.0085 0.0077 0.0076
## Cumulative Var        0.0354 0.0687 0.0978 0.1169 0.1288 0.1373 0.1450 0.1526
## Proportion Explained  0.2321 0.2180 0.1908 0.1255 0.0775 0.0556 0.0505 0.0500
## Cumulative Proportion 0.2321 0.4501 0.6409 0.7664 0.8440 0.8995 0.9500 1.0000
cat(sprintf("Total cumulative variance (%d factors): %.2f%%\n",
            n_factors, vaccounted["Cumulative Var", n_factors] * 100))
## Total cumulative variance (8 factors): 15.26%

Interpretation: Vaccounted shows how much total variance is explained by each factor and cumulatively. A well-fitting FA model should explain 50–70%+ of the common variance. Note that FA does not aim to maximize total variance (unlike PCA) — it targets the common variance (shared across variables), so the cumulative percentages are not directly comparable to PCA’s explained variance.


9 Visualizations

9.1 Correlation Matrix Heatmap

corrplot(cor_final,
         method      = "color",
         type        = "lower",
         tl.cex      = 0.7,
         tl.col      = "black",
         col         = col_gradient,
         title       = "Correlation Matrix (Stabilized)",
         mar         = c(0, 0, 2, 0),
         addCoef.col = "black",
         number.cex  = 0.45,
         cl.cex      = 0.7)

Interpretation: The correlation heatmap provides a comprehensive overview of pairwise linear relationships among all retained features. Cyan clusters suggest blocks of negatively correlated features, while yellow clusters reveal positively correlated groups — these clusters often correspond to the factors identified in the FA. Visible block structure confirms that factor analysis will yield interpretable results.

9.2 Factor Loading Heatmap

loadings_df_fa <- as.data.frame(unclass(fa_result$loadings))
colnames(loadings_df_fa) <- paste0("F", 1:n_factors)
loadings_df_fa$Variable  <- rownames(loadings_df_fa)

loadings_long <- melt(loadings_df_fa, id.vars = "Variable",
                      variable.name = "Factor", value.name = "Loading")

loading_heatmap <- ggplot(loadings_long,
                          aes(x = Factor, y = Variable, fill = Loading)) +
  geom_tile(color = "white", linewidth = 0.3) +
  geom_text(aes(label = round(Loading, 2)), size = 3) +
  scale_fill_gradient2(low = col_cyan, mid = "white", high = col_yellow,
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal(base_size = 11) +
  labs(title = "Rotated Factor Loadings (Varimax)",
       x = "Factor", y = "Variable")
print(loading_heatmap)

Interpretation: The rotated loading heatmap shows the “simple structure” achieved by Varimax rotation. Ideally, each row (variable) should have one dominant color (high loading on one factor) and white/neutral elsewhere. This simple structure makes the factors interpretable: each factor is primarily defined by a distinct, non-overlapping group of variables.

9.3 Communalities Barplot

comm_plot_df          <- communality_df
comm_plot_df$Variable <- factor(comm_plot_df$Variable, levels = rev(comm_plot_df$Variable))

comm_barplot <- ggplot(comm_plot_df, aes(x = Variable, y = Communality, fill = Interpretation)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  geom_hline(yintercept = 0.4, linetype = "dashed", color = col_yellow_dark) +
  scale_fill_manual(values = c("High (>= 0.7)"      = col_cyan,
                               "Moderate (0.4-0.7)" = col_yellow,
                               "Low (< 0.4)"        = "#B0BEC5")) +
  theme_minimal(base_size = 11) +
  labs(title = "Variable Communalities", x = NULL, y = "Communality (h²)")
print(comm_barplot)

Interpretation: Cyan bars represent variables that are well-captured by the factor model (high communality). Yellow bars indicate moderate fit. Grey bars flag variables that may be too unique or noisy to be well-represented — potential candidates for removal in a follow-up analysis. The dashed line at 0.4 marks the minimum acceptable communality threshold.


10 Factor Scores

Factor scores are composite scores for each observation (each MRI scan) on each factor — essentially, where each scan “sits” in the reduced factor space.

factor_scores <- as.data.frame(fa_result$scores)
if (ncol(factor_scores) > 0) {
  colnames(factor_scores) <- paste0("F", seq_len(ncol(factor_scores)))
}

cat("Factor scores (first 5 rows):\n")
## Factor scores (first 5 rows):
print(head(factor_scores, 5))
##          F1        F2           F3          F4         F5          F6
## 1 0.3793704 0.6012557 -0.008644657 -0.17759868 -0.7196283 -0.47771541
## 2 0.9182148 0.8740806 -0.002746033  0.18082208  0.7038670 -0.08883416
## 3 1.0493670 0.3522277  0.050141010 -0.12414552  0.2878916 -0.55080962
## 4 0.8340774 0.5327497  0.903682783  0.43955002  0.3513352  0.39596057
## 5 0.1565994 1.6301136 -0.903136750 -0.02038063 -0.3957026 -0.03105796
##            F7         F8
## 1  0.31243266 -0.1863251
## 2 -0.36152365  1.0353243
## 3 -0.41113747 -0.6371652
## 4  0.48945936 -0.4829873
## 5  0.04815293  1.1860170
has_label <- "label" %in% names(df)

if (has_label) {
  scores_out <- cbind(label = df$label, factor_scores)
  mean_scores <- scores_out |>
    group_by(label) |>
    summarise(across(starts_with("F"), ~ round(mean(.x, na.rm = TRUE), 3)))
  cat("Mean factor scores per label:\n")
  print(as.data.frame(mean_scores))
} else {
  scores_out <- factor_scores
  cat("No 'label' column found — factor scores saved without grouping.\n")
}
## Mean factor scores per label:
##        label     F1     F2     F3     F4     F5     F6     F7     F8
## 1     glioma -0.258 -0.404 -0.383 -0.260 -0.184 -0.150  0.006  0.004
## 2    healthy -0.120  0.413  0.612  0.068  0.143  0.143 -0.101  0.061
## 3 meningioma -0.194 -0.307 -0.165 -0.042 -0.037 -0.045  0.057 -0.072
## 4  pituitary  0.557  0.190 -0.189  0.201  0.042  0.018  0.055 -0.005
write.csv(scores_out, "factor_scores_output.csv", row.names = FALSE)
cat("Factor scores saved to: factor_scores_output.csv\n")
## Factor scores saved to: factor_scores_output.csv

Interpretation: Factor scores allow downstream use of the reduced-dimension representation — for example, as input features for machine learning classifiers (tumor grade prediction, disease classification). If a label column is present (e.g., tumor type), the mean factor scores per group can reveal which factors discriminate between groups — a key step toward understanding biological meaning.


11 PCA vs. Factor Analysis: A Comparison

cat("PCA vs FA COMPARISON\n")
## PCA vs FA COMPARISON
cat("PCA components retained :", n_optimal, "\n")
## PCA components retained : 8
cat("FA factors retained      :", n_factors, "\n")
## FA factors retained      : 8
cat(sprintf("PCA cumulative variance  : %.2f%%\n", cum_variance[n_optimal]))
## PCA cumulative variance  : 42.41%
cat(sprintf("FA cumulative variance   : %.2f%%\n", vaccounted["Cumulative Var", n_factors] * 100))
## FA cumulative variance   : 15.26%
Aspect PCA Factor Analysis
Goal Maximize explained variance Model latent structure
Components Principal Components (PCs) Factors (F)
Uniqueness No unique variance term Each variable has unique variance
Rotation Not standard (but possible) Varimax (orthogonal) used here
Interpretability Lower (linear algebra) Higher (latent constructs)
Use case Noise reduction, visualization Construct discovery, theory building

When to use PCA: When the goal is purely data compression or noise reduction — e.g., preparing features for a predictive model.

When to use FA: When the goal is to understand the underlying structure — e.g., identifying which groups of MRI features reflect the same biological tissue property.

In brain MRI radiomics, both are valuable: PCA for preprocessing machine learning pipelines, FA for scientific discovery and biomarker identification.


12 Summary & Conclusions

cat("Analysis complete.\n")
## Analysis complete.
cat("KMO Overall MSA      :", round(kmo_result$MSA, 4), "\n")
## KMO Overall MSA      : 0.6855
cat("Bartlett p-value     :", format(bartlett_result$p.value, scientific = TRUE), "\n")
## Bartlett p-value     : 0e+00
cat("PCA components kept  :", n_optimal, "\n")
## PCA components kept  : 8
cat("FA factors kept      :", n_factors, "\n")
## FA factors kept      : 8
cat(sprintf("PCA explained var    : %.2f%%\n", cum_variance[n_optimal]))
## PCA explained var    : 42.41%
cat(sprintf("FA explained var     : %.2f%%\n", vaccounted["Cumulative Var", n_factors] * 100))
## FA explained var     : 15.26%

12.1 Key Takeaways

1. Data Quality: After removing constant, near-zero variance, duplicate, and highly correlated features, a clean set of informative radiomics features was retained for analysis.

2. Assumption Validity: Both the KMO test and Bartlett’s test of sphericity confirmed that the data structure supports dimensionality reduction via factor-analytic methods.

3. PCA Results: Using the convergence of the Kaiser Rule, 80% cumulative variance threshold, and scree plot elbow detection, the optimal number of principal components was determined. These components together explain a substantial proportion of the total variance in the original feature space.

4. Factor Analysis Results: FA identified a set of latent factors that reproduce the observed correlation structure. Communality analysis revealed which features are well-explained by the model and which are idiosyncratic. Varimax-rotated loadings provide an interpretable “simple structure.”

5. Clinical Relevance: The identified factors likely correspond to distinct biological and physical properties of brain tissue as captured by MRI — such as tissue intensity, morphological geometry, and microstructural texture. These factors can serve as scientifically grounded, low-dimensional representations for downstream analysis (classification, biomarker discovery, or survival prediction).


13 Session Information

sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Asia/Bangkok
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Matrix_1.7-4         caret_7.0-1          lattice_0.22-7      
##  [4] factoextra_1.0.7     corrplot_0.95        dplyr_1.2.0         
##  [7] reshape2_1.4.5       ggplot2_4.0.2        GPArotation_2025.3-1
## [10] psych_2.6.1          readxl_1.4.5        
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4052.112    farver_2.1.2        
##  [4] S7_0.2.1             fastmap_1.2.0        pROC_1.19.0.1       
##  [7] digest_0.6.39        rpart_4.1.24         timechange_0.4.0    
## [10] lifecycle_1.0.5      survival_3.8-3       magrittr_2.0.4      
## [13] compiler_4.5.2       rlang_1.1.7          sass_0.4.10         
## [16] tools_4.5.2          yaml_2.3.12          data.table_1.18.2.1 
## [19] knitr_1.51           ggsignif_0.6.4       labeling_0.4.3      
## [22] mnormt_2.1.2         plyr_1.8.9           RColorBrewer_1.1-3  
## [25] abind_1.4-8          withr_3.0.2          purrr_1.2.1         
## [28] nnet_7.3-20          grid_4.5.2           stats4_4.5.2        
## [31] ggpubr_0.6.2         future_1.69.0        globals_0.19.0      
## [34] scales_1.4.0         iterators_1.0.14     MASS_7.3-65         
## [37] cli_3.6.5            rmarkdown_2.30       generics_0.1.4      
## [40] otel_0.2.0           rstudioapi_0.18.0    future.apply_1.20.2 
## [43] cachem_1.1.0         stringr_1.6.0        splines_4.5.2       
## [46] parallel_4.5.2       cellranger_1.1.0     vctrs_0.7.1         
## [49] hardhat_1.4.2        carData_3.0-6        jsonlite_2.0.0      
## [52] car_3.1-5            ggrepel_0.9.6        rstatix_0.7.3       
## [55] Formula_1.2-5        listenv_0.10.0       foreach_1.5.2       
## [58] tidyr_1.3.2          gower_1.0.2          jquerylib_0.1.4     
## [61] recipes_1.3.1        glue_1.8.0           parallelly_1.46.1   
## [64] codetools_0.2-20     lubridate_1.9.5      stringi_1.8.7       
## [67] gtable_0.3.6         tibble_3.3.1         pillar_1.11.1       
## [70] htmltools_0.5.9      ipred_0.9-15         lava_1.8.2          
## [73] R6_2.6.1             evaluate_1.0.5       backports_1.5.0     
## [76] broom_1.0.12         bslib_0.10.0         class_7.3-23        
## [79] Rcpp_1.1.1           nlme_3.1-168         prodlim_2025.04.28  
## [82] xfun_0.56            pkgconfig_2.0.3      ModelMetrics_1.2.2.2

Report generated with R and RMarkdown. Published via RPubs.