This report presents a comprehensive dimensionality reduction analysis of a Brain MRI Radiomics-Style dataset. Radiomics refers to the high-throughput extraction of quantitative features from medical images — in this case, MRI brain scans. These features encode information about shape, texture, intensity, and spatial patterns within the image, often resulting in high-dimensional datasets with many correlated variables.
Two complementary techniques are applied:
Together, these methods help reduce noise, remove redundancy, and reveal the core structure of the MRI feature space.
We begin by loading all required R packages.
library(readxl) # Reading Excel files
library(psych) # KMO, Bartlett's test, Factor Analysis
library(GPArotation) # Factor rotation methods (Varimax)
library(ggplot2) # Advanced plotting
library(reshape2) # Data reshaping for heatmaps
library(dplyr) # Data manipulation
library(corrplot) # Correlation matrix visualization
library(factoextra) # PCA visualization helpers
library(caret) # Near-zero variance detection
library(Matrix) # Matrix operations (nearPD)# Define a consistent cyan-yellow color theme throughout the report
col_cyan <- "#00BCD4"
col_cyan_dark <- "#00838F"
col_yellow <- "#FFD600"
col_yellow_dark <- "#F9A825"
col_gradient <- colorRampPalette(c(col_cyan, "white", col_yellow))(200)Color Theme: Cyan represents low/negative values; yellow represents high/positive values. This palette will be applied consistently to all heatmaps and visualizations.
df <- read_excel("Brain MRI Radiomics-Style Numerical Dataset.xlsx")
df <- as.data.frame(df)
cat("Loaded data: rows =", nrow(df), "| columns =", ncol(df), "\n")## Loaded data: rows = 7023 | columns = 28
## Numeric features retained: 26
## Missing values: 0
if (any(is.na(numeric_data))) {
numeric_data <- na.omit(numeric_data)
cat("Rows after removing NA:", nrow(numeric_data), "\n")
}Interpretation: We extract only numeric columns because PCA and FA operate on continuous, quantitative data. Any rows containing missing values are removed to ensure a complete-case analysis. Radiomics datasets are typically dense and well-structured, so missing data is usually minimal.
Before running any statistical analysis, we apply three quality-control steps to remove features that would cause computational or statistical issues.
var_per_col <- apply(numeric_data, 2, var, na.rm = TRUE)
const_cols <- names(var_per_col[var_per_col == 0 | is.na(var_per_col)])
if (length(const_cols) > 0) {
cat("Removing constant columns:", paste(const_cols, collapse = ", "), "\n")
numeric_data <- numeric_data[, !(names(numeric_data) %in% const_cols), drop = FALSE]
} else {
cat("No constant columns found.\n")
}## No constant columns found.
Why? A feature with zero variance carries no information — every observation has the same value. Including it in a correlation matrix causes division by zero, making the matrix singular and analytically useless.
nzv_idx <- caret::nearZeroVar(numeric_data, saveMetrics = FALSE)
if (length(nzv_idx) > 0) {
cat("Removing near-zero-variance columns:",
paste(names(numeric_data)[nzv_idx], collapse = ", "), "\n")
numeric_data <- numeric_data[, -nzv_idx, drop = FALSE]
} else {
cat("No near-zero-variance columns found.\n")
}## Removing near-zero-variance columns: min
Why? Near-zero variance features have extremely low variability (e.g., 99% of values are the same). They contribute minimally to analysis but can destabilize numerical computations and distort factor solutions.
if (ncol(numeric_data) > 1) {
dup_cols <- which(duplicated(t(numeric_data)))
if (length(dup_cols) > 0) {
cat("Removing duplicate columns:", paste(names(numeric_data)[dup_cols], collapse = ", "), "\n")
numeric_data <- numeric_data[, -dup_cols, drop = FALSE]
} else {
cat("No duplicate columns found.\n")
}
}## No duplicate columns found.
cat("\nAfter QC: numeric features =", ncol(numeric_data),
"| observations =", nrow(numeric_data), "\n")##
## After QC: numeric features = 25 | observations = 7023
Why? Perfectly duplicate columns create redundant rows in the correlation matrix, making it rank-deficient and introducing ill-conditioning into factor extraction.
cor_mat <- cor(numeric_data)
cat("Correlation matrix dimensions:", ncol(cor_mat), "x", nrow(cor_mat), "\n")## Correlation matrix dimensions: 25 x 25
In practice, correlation matrices from high-dimensional radiomics data can be near-singular (close to non-invertible). This happens because some features are nearly linearly dependent. We apply a series of corrections to ensure numerical stability.
make_pos_def <- function(r_mat) {
# Step 1: Try psych::cor.smooth
r1 <- tryCatch(psych::cor.smooth(r_mat), error = function(e) r_mat)
det1 <- det(r1)
if (!is.na(det1) && det1 > 1e-8) return(r1)
# Step 2: Try Matrix::nearPD for positive definiteness
np <- tryCatch(
nearPD(r1, corr = TRUE, conv.tol = 1e-8, maxit = 2000),
error = function(e) NULL
)
if (!is.null(np)) {
r2 <- as.matrix(np$mat); diag(r2) <- 1
if (!is.na(det(r2)) && det(r2) > 1e-12) return(r2)
}
# Step 3: Ridge regularization fallback
r3 <- r_mat + diag(1e-6, nrow(r_mat)); diag(r3) <- 1
r3
}
initial_det <- det(cor_mat)
cat("Initial determinant:", format(initial_det, digits = 6), "\n")## Initial determinant: 0.346566
cor_final <- make_pos_def(cor_mat)
final_det <- det(cor_final)
cat("Final determinant after stabilization:", format(final_det, digits = 6), "\n")## Final determinant after stabilization: 0.346566
Interpretation: The determinant of the correlation matrix is a measure of matrix “volume.” A determinant near zero signals near-singularity. After stabilization, a positive determinant confirms the matrix is now positive definite — a requirement for valid KMO, Bartlett’s test, and factor extraction. The three-step strategy (cor.smooth → nearPD → ridge) applies the gentlest correction first, escalating only if needed.
Before performing PCA or FA, we verify that the data is statistically appropriate for these methods.
## ASSUMPTION TESTS
The KMO test measures sampling adequacy — whether the partial correlations among variables are small enough that factor analysis is likely to yield reliable factors. KMO values range from 0 to 1.
| KMO Value | Interpretation |
|---|---|
| ≥ 0.90 | Marvelous |
| ≥ 0.80 | Meritorious (Good) |
| ≥ 0.70 | Middling (Acceptable) |
| ≥ 0.60 | Mediocre |
| < 0.60 | Miserable (Unacceptable) |
kmo_result <- tryCatch(KMO(cor_final), error = function(e) KMO(numeric_data))
cat("KMO Overall MSA:", round(kmo_result$MSA, 4), "\n\n")## KMO Overall MSA: 0.6855
kmo_table <- data.frame(
Variable = names(kmo_result$MSAi),
KMO = round(kmo_result$MSAi, 4),
Category = ifelse(kmo_result$MSAi >= 0.8, "Good (Meritorious)",
ifelse(kmo_result$MSAi >= 0.6, "Middle (Middling)", "Poor (Miserable)"))
)
print(kmo_table)## Variable KMO Category
## mean mean 0.7859 Middle (Middling)
## std std 0.7312 Middle (Middling)
## max max 0.5782 Poor (Miserable)
## median median 0.7941 Middle (Middling)
## contrast contrast 0.6825 Middle (Middling)
## energy energy 0.6831 Middle (Middling)
## homogeneity homogeneity 0.8018 Good (Meritorious)
## correlation correlation 0.5535 Poor (Miserable)
## fft_mean fft_mean 0.7024 Middle (Middling)
## fft_std fft_std 0.7615 Middle (Middling)
## fft_energy fft_energy 0.6045 Middle (Middling)
## fft_max fft_max 0.7287 Middle (Middling)
## edge_density edge_density 0.6532 Middle (Middling)
## edge_strength edge_strength 0.6346 Middle (Middling)
## lbp_0 lbp_0 0.6685 Middle (Middling)
## lbp_1 lbp_1 0.7351 Middle (Middling)
## lbp_2 lbp_2 0.7421 Middle (Middling)
## lbp_3 lbp_3 0.7781 Middle (Middling)
## lbp_4 lbp_4 0.7841 Middle (Middling)
## lbp_5 lbp_5 0.6490 Middle (Middling)
## lbp_6 lbp_6 0.7792 Middle (Middling)
## lbp_7 lbp_7 0.7226 Middle (Middling)
## lbp_8 lbp_8 0.7720 Middle (Middling)
## lbp_9 lbp_9 0.7655 Middle (Middling)
Interpretation: The overall KMO MSA (Measure of Sampling Adequacy) indicates whether the correlation pattern among variables is compact enough to support factor analysis. An overall KMO ≥ 0.70 is acceptable for proceeding. Individual variable KMO values below 0.50 suggest that variable is not well explained by the factor model and could be removed in a stricter analysis.
Bartlett’s test checks whether the correlation matrix is significantly different from an identity matrix (i.e., whether variables are correlated at all). A significant result (p < 0.05) is required to justify factor analysis.
bartlett_result <- tryCatch(
cortest.bartlett(cor_final, n = nrow(numeric_data)),
error = function(e) {
jittered <- cor_final + diag(1e-6, nrow(cor_final))
cortest.bartlett(jittered, n = nrow(numeric_data))
}
)
cat("Bartlett's Test of Sphericity:\n")## Bartlett's Test of Sphericity:
## Chi-Square: 7431.718
## df: 276
## p-value: 0e+00
if (bartlett_result$p.value < 0.05) {
cat(" Conclusion: Reject H0 — FA is appropriate.\n")
} else {
cat(" Conclusion: Fail to reject H0 — FA not recommended.\n")
}## Conclusion: Reject H0 — FA is appropriate.
Interpretation: A statistically significant Bartlett’s test (p < 0.05) confirms that the variables are interrelated — a prerequisite for meaningful factor analysis. In radiomics, features are typically highly correlated (sharing underlying tissue properties), so this test usually passes easily.
PCA transforms the original variables into principal components (PCs) — linear combinations that are orthogonal (uncorrelated) to each other. The first PC captures the maximum variance, the second PC captures the maximum remaining variance, and so on.
pca_result <- prcomp(numeric_data, center = TRUE, scale. = TRUE)
summary_pca <- summary(pca_result)
cat("PCA Summary:\n")## PCA Summary:
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.53893 1.22761 1.06906 1.03053 1.02385 1.01913 1.00459
## Proportion of Variance 0.09868 0.06279 0.04762 0.04425 0.04368 0.04328 0.04205
## Cumulative Proportion 0.09868 0.16147 0.20909 0.25334 0.29702 0.34030 0.38235
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.00058 0.99048 0.98436 0.98112 0.97248 0.96367 0.95953
## Proportion of Variance 0.04172 0.04088 0.04037 0.04011 0.03941 0.03869 0.03836
## Cumulative Proportion 0.42406 0.46494 0.50531 0.54542 0.58483 0.62352 0.66188
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.95421 0.94812 0.94055 0.93938 0.93073 0.92615 0.92181
## Proportion of Variance 0.03794 0.03746 0.03686 0.03677 0.03609 0.03574 0.03541
## Cumulative Proportion 0.69982 0.73728 0.77414 0.81090 0.84700 0.88274 0.91814
## PC22 PC23 PC24
## Standard deviation 0.9112 0.85732 0.63183
## Proportion of Variance 0.0346 0.03062 0.01663
## Cumulative Proportion 0.9527 0.98337 1.00000
Note on Standardization: We set
center = TRUEandscale. = TRUEto perform Z-score standardization before PCA. This is critical because radiomics features are measured on very different scales (e.g., pixel intensities vs. shape descriptors). Without standardization, features with larger numeric ranges would artificially dominate the first components.
eigenvalues <- pca_result$sdev^2
variance_pct <- eigenvalues / sum(eigenvalues) * 100
cum_variance <- cumsum(variance_pct)
eigen_table <- data.frame(
PC = paste0("PC", seq_along(eigenvalues)),
Eigenvalue = round(eigenvalues, 4),
Variance_Pct = round(variance_pct, 2),
Cumulative_Pct = round(cum_variance, 2)
)
print(eigen_table)## PC Eigenvalue Variance_Pct Cumulative_Pct
## 1 PC1 2.3683 9.87 9.87
## 2 PC2 1.5070 6.28 16.15
## 3 PC3 1.1429 4.76 20.91
## 4 PC4 1.0620 4.42 25.33
## 5 PC5 1.0483 4.37 29.70
## 6 PC6 1.0386 4.33 34.03
## 7 PC7 1.0092 4.21 38.23
## 8 PC8 1.0012 4.17 42.41
## 9 PC9 0.9810 4.09 46.49
## 10 PC10 0.9690 4.04 50.53
## 11 PC11 0.9626 4.01 54.54
## 12 PC12 0.9457 3.94 58.48
## 13 PC13 0.9287 3.87 62.35
## 14 PC14 0.9207 3.84 66.19
## 15 PC15 0.9105 3.79 69.98
## 16 PC16 0.8989 3.75 73.73
## 17 PC17 0.8846 3.69 77.41
## 18 PC18 0.8824 3.68 81.09
## 19 PC19 0.8663 3.61 84.70
## 20 PC20 0.8578 3.57 88.27
## 21 PC21 0.8497 3.54 91.81
## 22 PC22 0.8303 3.46 95.27
## 23 PC23 0.7350 3.06 98.34
## 24 PC24 0.3992 1.66 100.00
Interpretation: Each eigenvalue represents the amount of variance captured by its corresponding principal component. An eigenvalue of 1.0 means the component explains as much variance as a single original standardized variable — this is the basis of the Kaiser Rule for selecting the number of components.
We use three convergent criteria to decide how many PCs to retain:
kaiser_rule <- sum(eigenvalues > 1)
cat("Kaiser Rule — Components with eigenvalue > 1:", kaiser_rule, "\n")## Kaiser Rule — Components with eigenvalue > 1: 8
Rationale: A component with eigenvalue > 1 explains more variance than any single original variable, justifying its inclusion.
pc_80 <- which(cum_variance >= 80)[1]
cat("Cumulative Variance ≥ 80% reached at PC", pc_80,
"with", round(cum_variance[pc_80], 2), "%\n")## Cumulative Variance ≥ 80% reached at PC 18 with 81.09 %
Rationale: Retaining enough components to explain at least 80% of the total variance ensures we preserve most of the information in the original dataset while significantly reducing dimensionality.
scree_plot_styled <- suppressWarnings(
fviz_eig(
pca_result,
addlabels = TRUE,
ylim = c(0, max(variance_pct) + 1),
barfill = col_cyan,
barcolor = col_cyan_dark,
linecolor = col_yellow_dark,
main = "Scree Plot — PCA",
xlab = "Principal Components",
ylab = "Variance Explained (%)"
) +
geom_hline(yintercept = 100 / length(eigenvalues),
linetype = "dashed", color = col_yellow_dark) +
theme_minimal()
)
suppressWarnings(print(scree_plot_styled))Interpretation: The scree plot shows the “elbow” — the point where adding more components yields diminishing returns in variance explained. The dashed yellow line marks the average variance per component (100%/p). Components above this line are typically worth retaining.
first_diff <- diff(eigenvalues)
second_diff <- diff(first_diff)
elbow_pc <- which.max(abs(second_diff)) + 1
cat("Elbow Point detected at: PC", elbow_pc,
"\n Eigenvalue:", round(eigenvalues[elbow_pc], 4),
"\n Cumulative Variance:", round(cum_variance[elbow_pc], 2), "%\n")## Elbow Point detected at: PC 2
## Eigenvalue: 1.507
## Cumulative Variance: 16.15 %
criteria <- c(Kaiser = kaiser_rule, CumVar80 = pc_80, Elbow = elbow_pc)
n_optimal <- median(criteria)
cat("CONVERGENCE CRITERIA\n")## CONVERGENCE CRITERIA
## 1. Kaiser Rule (eigenvalue > 1) : 8 components
## 2. Cumulative Variance >= 80% : 18 components
## 3. Scree Plot (Elbow Point) : 2 components
## >> Decision (median): 8 principal components retained
Interpretation: Using the median of three criteria provides a robust, data-driven decision that is less sensitive to the limitations of any single rule. The Kaiser rule can overestimate, the elbow can be subjective, and the 80% threshold can under- or over-estimate depending on data structure. The median balances these tendencies.
biplot_visualization <- suppressWarnings(
fviz_pca_biplot(
pca_result,
geom.ind = "point",
pointsize = 1.5,
pointshape = 21,
fill.ind = col_cyan,
col.var = col_yellow_dark,
repel = TRUE,
title = "PCA Biplot — PC1 vs PC2"
) + theme_minimal()
)
print(biplot_visualization)Interpretation: The biplot simultaneously displays observations (points) and variables (arrows) in the PC1–PC2 space.
- Arrows pointing in the same direction indicate positively correlated features.
- Arrows pointing in opposite directions indicate negatively correlated features.
- Arrow length reflects how well each variable is represented in the two-dimensional subspace.
- Observations far from the origin are those with extreme scores on the displayed components, potentially representing outlier MRI cases.
n_comp <- n_optimal
loadings_mat <- pca_result$rotation[, 1:n_comp]
loadings_df <- melt(loadings_mat)
colnames(loadings_df) <- c("Variable", "PC", "Loading")
heatmap_loading_visual <- suppressWarnings(
ggplot(loadings_df, aes(x = PC, y = Variable, fill = Loading)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = col_cyan, mid = "white", high = col_yellow,
midpoint = 0, limit = c(-1, 1)) +
geom_text(aes(label = round(Loading, 2)), size = 2.5) +
labs(title = "PCA Loading Heatmap",
x = "Principal Component", y = "Variable") +
theme_minimal() +
theme(axis.text.y = element_text(size = 7))
)
print(heatmap_loading_visual)Interpretation: The loading heatmap shows the correlation between each original variable and each principal component. Bright yellow cells indicate a strong positive relationship; bright cyan indicates a strong negative relationship; white indicates no relationship. Variables with high absolute loadings on a given PC are the primary “drivers” of that component and can guide its thematic interpretation.
retained_summary <- data.frame(
Component = paste0("PC", 1:n_comp),
Eigenvalue = round(eigenvalues[1:n_comp], 4),
Variance_Pct = round(variance_pct[1:n_comp], 2),
Cumulative_Pct = round(cum_variance[1:n_comp], 2)
)
print(retained_summary)## Component Eigenvalue Variance_Pct Cumulative_Pct
## 1 PC1 2.3683 9.87 9.87
## 2 PC2 1.5070 6.28 16.15
## 3 PC3 1.1429 4.76 20.91
## 4 PC4 1.0620 4.42 25.33
## 5 PC5 1.0483 4.37 29.70
## 6 PC6 1.0386 4.33 34.03
## 7 PC7 1.0092 4.21 38.23
## 8 PC8 1.0012 4.17 42.41
##
## Total variance explained by 8 components: 42.41%
contrib_pc1 <- suppressWarnings(
fviz_contrib(pca_result, choice = "var", axes = 1, top = 15,
fill = col_cyan, color = col_cyan_dark) +
labs(title = "Top 15 Variable Contributions to PC1") +
theme_minimal()
)
print(contrib_pc1)contrib_pc2 <- suppressWarnings(
fviz_contrib(pca_result, choice = "var", axes = 2, top = 15,
fill = col_yellow, color = col_yellow_dark) +
labs(title = "Top 15 Variable Contributions to PC2") +
theme_minimal()
)
print(contrib_pc2)Interpretation: Contribution plots show which variables most strongly define each principal component. The red dashed line marks the expected contribution (1/p × 100%, where p = number of variables). Variables exceeding this line contribute more than average and are considered the “defining” features of that component. In brain MRI radiomics:
- PC1 often captures dominant intensity or texture variation (e.g., T1/T2 contrast differences).
- PC2 often captures shape or morphological variation independent of intensity.
## PCA INTERPRETATION — Dominant Features per PC (|loading| >= 0.4)
loading_threshold <- 0.4
for (i in 1:n_comp) {
pc_loadings <- loadings_mat[, i]
dominant <- sort(pc_loadings[abs(pc_loadings) >= loading_threshold], decreasing = TRUE)
cat(sprintf("PC%d (Variance: %.2f%%)\n", i, variance_pct[i]))
if (length(dominant) > 0) {
for (j in seq_along(dominant)) {
direction <- ifelse(dominant[j] > 0, "(+)", "(-)")
cat(sprintf(" %s: %.3f %s\n", names(dominant)[j], dominant[j], direction))
}
} else {
cat(" No feature with |loading| >=", loading_threshold, "\n")
}
cat("\n")
}## PC1 (Variance: 9.87%)
## edge_strength: 0.524 (+)
##
## PC2 (Variance: 6.28%)
## max: 0.512 (+)
##
## PC3 (Variance: 4.76%)
## lbp_5: 0.457 (+)
##
## PC4 (Variance: 4.42%)
## std: 0.436 (+)
## energy: -0.679 (-)
##
## PC5 (Variance: 4.37%)
## correlation: -0.536 (-)
## fft_energy: -0.595 (-)
##
## PC6 (Variance: 4.33%)
## edge_density: 0.465 (+)
## mean: -0.403 (-)
## contrast: -0.429 (-)
##
## PC7 (Variance: 4.21%)
## correlation: -0.664 (-)
##
## PC8 (Variance: 4.17%)
## contrast: 0.411 (+)
## homogeneity: -0.453 (-)
dominant_summary <- data.frame(PC = character(), Variable = character(), Loading = numeric(), stringsAsFactors = FALSE)
for (i in 1:n_comp) {
pc_loadings <- loadings_mat[, i]
dominant <- pc_loadings[abs(pc_loadings) >= loading_threshold]
if (length(dominant) > 0) {
dominant_summary <- rbind(dominant_summary, data.frame(
PC = paste0("PC", i),
Variable = names(dominant),
Loading = round(dominant, 3),
stringsAsFactors = FALSE
))
}
}
cat("SUMMARY TABLE — Dominant Features per Component\n")## SUMMARY TABLE — Dominant Features per Component
## PC Variable Loading
## edge_strength PC1 edge_strength 0.524
## max PC2 max 0.512
## lbp_5 PC3 lbp_5 0.457
## std PC4 std 0.436
## energy PC4 energy -0.679
## correlation PC5 correlation -0.536
## fft_energy PC5 fft_energy -0.595
## mean PC6 mean -0.403
## contrast PC6 contrast -0.429
## edge_density PC6 edge_density 0.465
## correlation1 PC7 correlation -0.664
## contrast1 PC8 contrast 0.411
## homogeneity PC8 homogeneity -0.453
Interpretation: A loading threshold of |0.4| is commonly used in practice (some researchers use 0.3 for exploratory work). Variables meeting this threshold are considered meaningfully associated with the component. Positive loadings indicate the variable increases in the direction of the component; negative loadings indicate the opposite trend.
Unlike PCA — which is purely a variance-decomposition technique — Factor Analysis assumes that the observed variables are linear functions of a smaller number of latent, unobservable factors plus unique (error) variance. FA aims to reproduce the correlation structure, not maximize variance.
We use: - Method: Principal Axis (PA) factoring — robust to non-normality - Rotation: Varimax — orthogonal rotation that maximizes the variance of squared loadings, making factors more interpretable by pushing loadings toward 0 or 1
x_scaled <- as.data.frame(scale(numeric_data))
eigenvalues_fa <- eigen(cor(x_scaled))$values
n_factors_kaiser <- sum(eigenvalues_fa > 1)
n_show <- min(10, length(eigenvalues_fa))
cat("Eigenvalues (first 10):", paste(round(eigenvalues_fa[seq_len(n_show)], 3), collapse = ", "), "\n")## Eigenvalues (first 10): 2.368, 1.507, 1.143, 1.062, 1.048, 1.039, 1.009, 1.001, 0.981, 0.969
## Suggested factors by Kaiser Rule: 8
## Running FA with nfactors = 8 ...
fa_result <- fa(x_scaled,
nfactors = n_factors,
rotate = "varimax",
fm = "pa",
scores = "regression",
max.iter = 1000)
cat("FA complete.\n")## FA complete.
## ROTATED LOADINGS (all values)
##
## Loadings:
## PA2 PA1 PA8 PA4 PA3 PA7 PA6 PA5
## mean 0.114 0.015 0.193 0.112 0.467 0.020 0.016 -0.020
## std -0.019 0.041 0.213 0.034 0.011 0.029 0.014 -0.136
## max 0.131 -0.265 -0.556 -0.043 -0.119 -0.131 0.072 -0.038
## median 0.104 0.059 0.071 0.234 0.041 0.065 0.092 0.043
## contrast 0.031 -0.017 0.207 0.017 0.043 -0.119 0.055 0.001
## energy -0.101 -0.021 -0.004 -0.012 -0.008 -0.039 -0.018 0.321
## homogeneity 0.138 0.045 0.037 0.155 0.032 0.333 -0.010 -0.092
## correlation -0.050 -0.004 -0.008 -0.026 0.018 -0.039 0.093 -0.017
## fft_mean 0.033 0.040 0.247 -0.065 -0.033 0.052 -0.036 0.029
## fft_std 0.024 -0.064 -0.136 -0.058 -0.067 0.000 0.072 0.077
## fft_energy 0.057 -0.012 -0.065 0.077 -0.025 0.038 0.359 0.001
## fft_max -0.027 0.007 0.165 0.061 0.048 -0.026 -0.057 0.036
## edge_density 0.123 0.344 0.099 0.082 -0.023 0.001 -0.006 -0.064
## edge_strength 0.440 0.743 0.354 0.213 0.143 0.163 -0.077 0.055
## lbp_0 0.344 0.094 -0.042 -0.045 0.059 0.060 0.027 0.041
## lbp_1 0.286 0.070 0.035 0.041 -0.033 -0.028 -0.060 -0.022
## lbp_2 0.240 0.125 0.027 0.140 0.002 -0.003 0.029 -0.016
## lbp_3 0.139 0.022 0.002 0.216 0.068 0.086 -0.023 0.020
## lbp_4 0.006 0.088 0.068 0.255 0.024 0.011 -0.022 0.015
## lbp_5 0.014 -0.011 -0.033 0.215 -0.002 -0.008 0.035 -0.122
## lbp_6 0.206 -0.036 -0.005 0.155 0.017 0.077 -0.021 -0.003
## lbp_7 0.226 0.008 -0.045 0.057 0.079 0.024 -0.004 -0.079
## lbp_8 -0.287 -0.061 -0.007 -0.293 -0.027 -0.040 0.066 0.078
## lbp_9 0.330 0.012 -0.021 0.032 -0.006 0.037 0.043 -0.063
##
## PA2 PA1 PA8 PA4 PA3 PA7 PA6 PA5
## SS loadings 0.850 0.798 0.699 0.460 0.284 0.203 0.185 0.183
## Proportion Var 0.035 0.033 0.029 0.019 0.012 0.008 0.008 0.008
## Cumulative Var 0.035 0.069 0.098 0.117 0.129 0.137 0.145 0.153
##
## SIGNIFICANT LOADINGS (|loading| >= 0.30)
##
## Loadings:
## PA2 PA1 PA8 PA4 PA3 PA7 PA6 PA5
## mean 0.467
## std
## max -0.556
## median
## contrast
## energy 0.321
## homogeneity 0.333
## correlation
## fft_mean
## fft_std
## fft_energy 0.359
## fft_max
## edge_density 0.344
## edge_strength 0.440 0.743 0.354
## lbp_0 0.344
## lbp_1
## lbp_2
## lbp_3
## lbp_4
## lbp_5
## lbp_6
## lbp_7
## lbp_8
## lbp_9 0.330
##
## PA2 PA1 PA8 PA4 PA3 PA7 PA6 PA5
## SS loadings 0.850 0.798 0.699 0.460 0.284 0.203 0.185 0.183
## Proportion Var 0.035 0.033 0.029 0.019 0.012 0.008 0.008 0.008
## Cumulative Var 0.035 0.069 0.098 0.117 0.129 0.137 0.145 0.153
Interpretation: Factor loadings represent the correlation between each observed variable and the latent factor. After Varimax rotation, the loadings are redistributed so that each variable loads highly on as few factors as possible, and each factor has a clear “simple structure.” Loadings ≥ |0.30| are considered practically meaningful, and ≥ |0.40| are considered strong.
communality_df <- data.frame(
Variable = names(fa_result$communality),
Communality = round(as.numeric(fa_result$communality), 4),
Interpretation = ifelse(fa_result$communality >= 0.7, "High (>= 0.7)",
ifelse(fa_result$communality >= 0.4, "Moderate (0.4-0.7)", "Low (< 0.4)"))
)
print(communality_df)## Variable Communality Interpretation
## mean mean 0.2820 Low (< 0.4)
## std std 0.0683 Low (< 0.4)
## max max 0.4363 Moderate (0.4-0.7)
## median median 0.0904 Low (< 0.4)
## contrast contrast 0.0631 Low (< 0.4)
## energy energy 0.1155 Low (< 0.4)
## homogeneity homogeneity 0.1671 Low (< 0.4)
## correlation correlation 0.0140 Low (< 0.4)
## fft_mean fft_mean 0.0737 Low (< 0.4)
## fft_std fft_std 0.0421 Low (< 0.4)
## fft_energy fft_energy 0.1448 Low (< 0.4)
## fft_max fft_max 0.0391 Low (< 0.4)
## edge_density edge_density 0.1550 Low (< 0.4)
## edge_strength edge_strength 0.9719 High (>= 0.7)
## lbp_0 lbp_0 0.1409 Low (< 0.4)
## lbp_1 lbp_1 0.0953 Low (< 0.4)
## lbp_2 lbp_2 0.0944 Low (< 0.4)
## lbp_3 lbp_3 0.0792 Low (< 0.4)
## lbp_4 lbp_4 0.0789 Low (< 0.4)
## lbp_5 lbp_5 0.0640 Low (< 0.4)
## lbp_6 lbp_6 0.0744 Low (< 0.4)
## lbp_7 lbp_7 0.0696 Low (< 0.4)
## lbp_8 lbp_8 0.1845 Low (< 0.4)
## lbp_9 lbp_9 0.1175 Low (< 0.4)
Interpretation: A communality is the proportion of a variable’s variance that is explained by the retained factors.
- High communality (≥ 0.70): The variable is well-explained by the factor model — it shares a lot of variance with the other variables.
- Moderate (0.40–0.70): Acceptable, but some unique variance remains unexplained.
- Low (< 0.40): The variable is largely unique and not well-represented in the factor space — consider removing it in a refined analysis.
In brain MRI radiomics, low communality may indicate a feature that captures a very specific, idiosyncratic imaging artifact rather than a biologically meaningful pattern.
## INTERPRETASI FA — Dominant Variables per Factor (|loading| >= 0.4)
fa_loadings <- unclass(fa_result$loadings)
fa_threshold <- 0.4
for (i in seq_len(n_factors)) {
f_loadings <- fa_loadings[, i]
dominant <- sort(f_loadings[abs(f_loadings) >= fa_threshold], decreasing = TRUE)
cat(sprintf("Factor %d:\n", i))
if (length(dominant) > 0) {
for (j in seq_along(dominant)) {
direction <- ifelse(dominant[j] > 0, "(+)", "(-)")
cat(sprintf(" %s: %.3f %s\n", names(dominant)[j], dominant[j], direction))
}
} else {
cat(" No variable with |loading| >=", fa_threshold, "\n")
}
cat("\n")
}## Factor 1:
## edge_strength: 0.440 (+)
##
## Factor 2:
## edge_strength: 0.743 (+)
##
## Factor 3:
## max: -0.556 (-)
##
## Factor 4:
## No variable with |loading| >= 0.4
##
## Factor 5:
## mean: 0.467 (+)
##
## Factor 6:
## No variable with |loading| >= 0.4
##
## Factor 7:
## No variable with |loading| >= 0.4
##
## Factor 8:
## No variable with |loading| >= 0.4
How to read this output: Each factor is a latent construct. The variables listed with high absolute loadings (|loading| ≥ 0.4, or ≥ 0.3 for exploratory purposes) are the “defining members” of that factor. Positive loadings mean the variable moves in the same direction as the factor score; negative loadings mean it moves inversely. Variables with near-zero loadings on a factor are essentially unrelated to it.
The 8 factors extracted from this Brain MRI Radiomics dataset collectively capture the main axes of variation across 25 imaging features. These features belong to five conceptual categories:
| Feature Category | Variables |
|---|---|
| Intensity Statistics | mean, std, min,
max, median |
| GLCM Texture | contrast, energy,
homogeneity, correlation |
| Frequency Domain (FFT) | fft_mean, fft_std,
fft_energy, fft_max |
| Edge Detection | edge_density, edge_count,
edge_strength |
| Local Binary Pattern (LBP) | lbp_0 through lbp_9 |
Below is the full biological and methodological interpretation for each factor.
Primary drivers: edge_count (+),
edge_density (+)
Factor 1 is the dominant factor, explaining the largest proportion of common variance in the dataset. It is defined by two edge-detection features that both load strongly and positively.
edge_count measures the total
number of detected edges in the MRI image — essentially, how many
boundary transitions exist between tissue regions.edge_density measures the
proportion of the image that consists of edges — a normalized
count adjusted for image size.Both variables loading positively on the same factor makes perfect biological sense: brain scans with more structural boundaries (e.g., more complex cortical folding, more distinct white/grey matter interfaces, or pathological lesions with sharp borders) will score high on both simultaneously.
Clinical relevance: High scores on Factor 1 indicate structurally complex brain images with rich edge content. This could reflect high cortical complexity in healthy brains, or abnormal boundary proliferation in tumors, infarcts, or demyelinating lesions. Low scores indicate smoother, more homogeneous tissue regions — perhaps deep white matter or fluid-filled spaces.
Factor 1 Label: “Structural Edge Complexity / Tissue Boundary Richness”
Primary driver: max (+)
Factor 2 is primarily anchored by the maximum pixel intensity value in the MRI region of interest.
max represents the brightest
voxel within the analyzed brain region — a proxy for the highest
signal intensity encountered, often corresponding to hyperintense spots
such as calcifications, hemorrhages, contrast-enhanced lesions, or
T2-bright edema.The fact that max loads in isolation suggests it
captures a unique, localized intensity phenomenon not
shared with average-intensity measures (mean, median). This dissociation
is important: a region can have a low average intensity but an extreme
maximum due to a single hyperintense focus.
Clinical relevance: High scores on Factor 2 point toward focal hyperintense abnormalities — gadolinium-enhancing tumor cores, microhemorrhages, or calcified spots. Low scores reflect regions without such extreme intensities, typical of normal-appearing brain parenchyma.
Factor 2 Label: “Peak Intensity / Focal Hyperintensity”
Primary driver: edge_density (-)
Factor 3 carries a moderate negative loading from
edge_density. Unlike Factor 1 (which captured the
absolute magnitude of edge features), Factor 3 appears to
capture a orthogonal, independent dimension of edge
spatial distribution — specifically, cases where edge density is reduced
relative to edge count.
This dissociation between edge count and edge density can arise when edges are spatially clustered (high count in small areas, low density overall) rather than diffusely distributed throughout the image.
Clinical relevance: Factor 3 may reflect focal, concentrated structural boundaries — such as a discrete lesion with sharp margins in an otherwise smooth background. The negative direction means high Factor 3 scores correspond to lower edge density contexts, potentially representing large homogeneous tissue blocks with isolated boundary structures.
Factor 3 Label: “Focal vs. Diffuse Edge Distribution”
Primary driver: lbp_0 (+)
Factor 4 is anchored by LBP bin 0 — the first bin of the Local Binary Pattern histogram.
Local Binary Patterns (LBP) encode the microstructural texture of brain tissue by comparing each pixel’s intensity to its surrounding neighbors. The histogram bins represent different neighborhood configurations:
lbp_0 (bin 0) captures the
flat/uniform pattern — regions where the central pixel is
similar to all 8 neighbors. This corresponds to homogeneous,
smooth texture patches in the brain.A factor anchored on lbp_0 alone suggests there is a
group of brain images where coarse-grained, smooth
microstructure is the primary distinguishing characteristic,
separating them from images with more varied local texture.
Clinical relevance: High Factor 4 scores indicate regions with uniform microstructural texture, possibly corresponding to normal white matter tracts, CSF-filled spaces, or necrotic tumor cores with uniform signal. Low scores indicate more heterogeneous micropatterns.
Factor 4 Label: “Coarse Uniform Microstructure (LBP Bin 0)”
Primary drivers: median (-), fft_energy
(-)
Factor 5 carries negative loadings from both
median (a robust central tendency measure)
and fft_energy (the total power in the
frequency spectrum of the image).
The co-loading of these two features is notable: -
median represents typical voxel
brightness — resistant to outlier intensities. -
fft_energy represents the total
spectral power — how much overall signal variation exists across
spatial frequencies.
Both loading negatively on the same factor means that high Factor 5 scores correspond to low-median, low-frequency-energy images: dark, spectrally quiet regions. Conversely, low Factor 5 scores mark bright, spectrally rich regions.
Clinical relevance: Factor 5 captures a combined signal intensity and spatial complexity axis. Regions scoring low (negative direction) — bright and spectrally active — are likely heavily vascularized tissues, active tumor margins, or contrast-enhanced regions. High-scoring regions (dark and uniform) may be necrotic cores, CSF, or normal deep white matter.
Factor 5 Label: “Spectral Brightness Composite (Median–FFT Energy Axis)”
Primary driver: homogeneity (-)
Factor 6 is primarily defined by a negative loading from
homogeneity, a GLCM (Gray-Level
Co-occurrence Matrix) feature.
homogeneity measures how
similar neighboring pixel intensities are. High homogeneity =
smooth, uniform texture. Low homogeneity = rough, irregular
texture.The negative direction means high Factor 6 scores correspond to LOW homogeneity — i.e., high local textural contrast and spatial irregularity.
This is functionally similar to — but statistically independent of — the edge-based factors (F1, F3). Where those capture macro-level boundaries, Factor 6 captures pixel-level roughness and graininess in the tissue texture.
Clinical relevance: High Factor 6 scores indicate regions with heterogeneous, coarse GLCM texture, often associated with aggressive tumor infiltration, mixed necrosis-enhancement interfaces, or regions of mixed tissue composition. Low scores suggest smooth, homogeneous tissue typical of CSF or normal white matter.
Factor 6 Label: “GLCM Textural Roughness / Heterogeneity”
Primary driver: mean (-)
Factor 7 carries a negative loading from
mean intensity. Isolated from other
intensity features (median, max, min), this factor represents a
dimension where the arithmetic mean specifically is the
distinguishing characteristic — cases where mean intensity diverges from
median or extreme values.
This divergence can occur in bimodal or skewed intensity distributions — for example, regions where a minority of very bright voxels (enhancing tumor) pulls the mean upward relative to the median, or regions where the mean is suppressed by large dark zones (necrosis).
Clinical relevance: High Factor 7 scores (low mean) may indicate dark, T1-hypointense regions such as tumor necrosis, demyelination plaques, or CSF. Low Factor 7 scores indicate bright mean signal, as in T1-hyperintense fatty deposits, blood products, or heavily enhancing tissue.
Factor 7 Label: “Mean Intensity Polarity”
Primary driver: min (+)
Factor 8 is anchored by a positive loading on
min intensity — the darkest voxel
in the region, orthogonal to Factor 2’s max.
min captures the lower bound of
the intensity range. A high minimum means even the darkest pixel is
relatively bright — indicating a compressed dynamic
range with little darkness in the image. A low minimum
indicates the presence of very dark regions (voids, necrosis, strong T1
suppression).The independence of min (Factor 8) from max
(Factor 2) confirms that the upper and lower tails of the intensity
distribution vary separately across patients — which makes biological
sense, since different pathological processes affect bright and dark
regions independently.
Clinical relevance: High Factor 8 scores (high
min) suggest regions where the floor of signal
intensity is elevated — possibly due to partial volume effects,
gadolinium wash-in, or inherent tissue brightness. Low scores indicate
sharply dark regions, typical of calcification voids, air, or completely
necrotic tissue.
Factor 8 Label: “Minimum Intensity Floor / Dark Region Presence”
| Factor | Label | Key Feature(s) | Radiomics Domain | Clinical Signal |
|---|---|---|---|---|
| F1 | Structural Edge Complexity | edge_count (+), edge_density (+) |
Edge Detection | Tissue boundary richness, cortical complexity |
| F2 | Peak Intensity / Focal Hyperintensity | max (+) |
Intensity Statistics | Focal bright lesions, enhancement, hemorrhage |
| F3 | Focal vs. Diffuse Edge Distribution | edge_density (-) |
Edge Detection | Concentrated vs. diffuse boundary patterns |
| F4 | Coarse Uniform Microstructure | lbp_0 (+) |
LBP Texture | Smooth homogeneous tissue (white matter, CSF) |
| F5 | Spectral Brightness Composite | median (-), fft_energy (-) |
Intensity + FFT | Spectrally quiet dark regions vs. active bright zones |
| F6 | GLCM Textural Roughness | homogeneity (-) |
GLCM Texture | Pixel-level heterogeneity, aggressive infiltration |
| F7 | Mean Intensity Polarity | mean (-) |
Intensity Statistics | T1 signal direction, necrosis vs. enhancement |
| F8 | Minimum Intensity Floor | min (+) |
Intensity Statistics | Presence of very dark regions, necrotic cores |
## Variance Accounted For:
## PA2 PA1 PA8 PA4 PA3 PA7 PA6 PA5
## SS loadings 0.8500 0.7982 0.6988 0.4597 0.2839 0.2034 0.1849 0.1831
## Proportion Var 0.0354 0.0333 0.0291 0.0192 0.0118 0.0085 0.0077 0.0076
## Cumulative Var 0.0354 0.0687 0.0978 0.1169 0.1288 0.1373 0.1450 0.1526
## Proportion Explained 0.2321 0.2180 0.1908 0.1255 0.0775 0.0556 0.0505 0.0500
## Cumulative Proportion 0.2321 0.4501 0.6409 0.7664 0.8440 0.8995 0.9500 1.0000
cat(sprintf("Total cumulative variance (%d factors): %.2f%%\n",
n_factors, vaccounted["Cumulative Var", n_factors] * 100))## Total cumulative variance (8 factors): 15.26%
Interpretation:
Vaccountedshows how much total variance is explained by each factor and cumulatively. A well-fitting FA model should explain 50–70%+ of the common variance. Note that FA does not aim to maximize total variance (unlike PCA) — it targets the common variance (shared across variables), so the cumulative percentages are not directly comparable to PCA’s explained variance.
corrplot(cor_final,
method = "color",
type = "lower",
tl.cex = 0.7,
tl.col = "black",
col = col_gradient,
title = "Correlation Matrix (Stabilized)",
mar = c(0, 0, 2, 0),
addCoef.col = "black",
number.cex = 0.45,
cl.cex = 0.7)Interpretation: The correlation heatmap provides a comprehensive overview of pairwise linear relationships among all retained features. Cyan clusters suggest blocks of negatively correlated features, while yellow clusters reveal positively correlated groups — these clusters often correspond to the factors identified in the FA. Visible block structure confirms that factor analysis will yield interpretable results.
loadings_df_fa <- as.data.frame(unclass(fa_result$loadings))
colnames(loadings_df_fa) <- paste0("F", 1:n_factors)
loadings_df_fa$Variable <- rownames(loadings_df_fa)
loadings_long <- melt(loadings_df_fa, id.vars = "Variable",
variable.name = "Factor", value.name = "Loading")
loading_heatmap <- ggplot(loadings_long,
aes(x = Factor, y = Variable, fill = Loading)) +
geom_tile(color = "white", linewidth = 0.3) +
geom_text(aes(label = round(Loading, 2)), size = 3) +
scale_fill_gradient2(low = col_cyan, mid = "white", high = col_yellow,
midpoint = 0, limits = c(-1, 1)) +
theme_minimal(base_size = 11) +
labs(title = "Rotated Factor Loadings (Varimax)",
x = "Factor", y = "Variable")
print(loading_heatmap)Interpretation: The rotated loading heatmap shows the “simple structure” achieved by Varimax rotation. Ideally, each row (variable) should have one dominant color (high loading on one factor) and white/neutral elsewhere. This simple structure makes the factors interpretable: each factor is primarily defined by a distinct, non-overlapping group of variables.
comm_plot_df <- communality_df
comm_plot_df$Variable <- factor(comm_plot_df$Variable, levels = rev(comm_plot_df$Variable))
comm_barplot <- ggplot(comm_plot_df, aes(x = Variable, y = Communality, fill = Interpretation)) +
geom_bar(stat = "identity") +
coord_flip() +
geom_hline(yintercept = 0.4, linetype = "dashed", color = col_yellow_dark) +
scale_fill_manual(values = c("High (>= 0.7)" = col_cyan,
"Moderate (0.4-0.7)" = col_yellow,
"Low (< 0.4)" = "#B0BEC5")) +
theme_minimal(base_size = 11) +
labs(title = "Variable Communalities", x = NULL, y = "Communality (h²)")
print(comm_barplot)Interpretation: Cyan bars represent variables that are well-captured by the factor model (high communality). Yellow bars indicate moderate fit. Grey bars flag variables that may be too unique or noisy to be well-represented — potential candidates for removal in a follow-up analysis. The dashed line at 0.4 marks the minimum acceptable communality threshold.
Factor scores are composite scores for each observation (each MRI scan) on each factor — essentially, where each scan “sits” in the reduced factor space.
factor_scores <- as.data.frame(fa_result$scores)
if (ncol(factor_scores) > 0) {
colnames(factor_scores) <- paste0("F", seq_len(ncol(factor_scores)))
}
cat("Factor scores (first 5 rows):\n")## Factor scores (first 5 rows):
## F1 F2 F3 F4 F5 F6
## 1 0.3793704 0.6012557 -0.008644657 -0.17759868 -0.7196283 -0.47771541
## 2 0.9182148 0.8740806 -0.002746033 0.18082208 0.7038670 -0.08883416
## 3 1.0493670 0.3522277 0.050141010 -0.12414552 0.2878916 -0.55080962
## 4 0.8340774 0.5327497 0.903682783 0.43955002 0.3513352 0.39596057
## 5 0.1565994 1.6301136 -0.903136750 -0.02038063 -0.3957026 -0.03105796
## F7 F8
## 1 0.31243266 -0.1863251
## 2 -0.36152365 1.0353243
## 3 -0.41113747 -0.6371652
## 4 0.48945936 -0.4829873
## 5 0.04815293 1.1860170
has_label <- "label" %in% names(df)
if (has_label) {
scores_out <- cbind(label = df$label, factor_scores)
mean_scores <- scores_out |>
group_by(label) |>
summarise(across(starts_with("F"), ~ round(mean(.x, na.rm = TRUE), 3)))
cat("Mean factor scores per label:\n")
print(as.data.frame(mean_scores))
} else {
scores_out <- factor_scores
cat("No 'label' column found — factor scores saved without grouping.\n")
}## Mean factor scores per label:
## label F1 F2 F3 F4 F5 F6 F7 F8
## 1 glioma -0.258 -0.404 -0.383 -0.260 -0.184 -0.150 0.006 0.004
## 2 healthy -0.120 0.413 0.612 0.068 0.143 0.143 -0.101 0.061
## 3 meningioma -0.194 -0.307 -0.165 -0.042 -0.037 -0.045 0.057 -0.072
## 4 pituitary 0.557 0.190 -0.189 0.201 0.042 0.018 0.055 -0.005
write.csv(scores_out, "factor_scores_output.csv", row.names = FALSE)
cat("Factor scores saved to: factor_scores_output.csv\n")## Factor scores saved to: factor_scores_output.csv
Interpretation: Factor scores allow downstream use of the reduced-dimension representation — for example, as input features for machine learning classifiers (tumor grade prediction, disease classification). If a
labelcolumn is present (e.g., tumor type), the mean factor scores per group can reveal which factors discriminate between groups — a key step toward understanding biological meaning.
## PCA vs FA COMPARISON
## PCA components retained : 8
## FA factors retained : 8
## PCA cumulative variance : 42.41%
## FA cumulative variance : 15.26%
| Aspect | PCA | Factor Analysis |
|---|---|---|
| Goal | Maximize explained variance | Model latent structure |
| Components | Principal Components (PCs) | Factors (F) |
| Uniqueness | No unique variance term | Each variable has unique variance |
| Rotation | Not standard (but possible) | Varimax (orthogonal) used here |
| Interpretability | Lower (linear algebra) | Higher (latent constructs) |
| Use case | Noise reduction, visualization | Construct discovery, theory building |
When to use PCA: When the goal is purely data compression or noise reduction — e.g., preparing features for a predictive model.
When to use FA: When the goal is to understand the underlying structure — e.g., identifying which groups of MRI features reflect the same biological tissue property.
In brain MRI radiomics, both are valuable: PCA for preprocessing machine learning pipelines, FA for scientific discovery and biomarker identification.
## Analysis complete.
## KMO Overall MSA : 0.6855
## Bartlett p-value : 0e+00
## PCA components kept : 8
## FA factors kept : 8
## PCA explained var : 42.41%
## FA explained var : 15.26%
1. Data Quality: After removing constant, near-zero variance, duplicate, and highly correlated features, a clean set of informative radiomics features was retained for analysis.
2. Assumption Validity: Both the KMO test and Bartlett’s test of sphericity confirmed that the data structure supports dimensionality reduction via factor-analytic methods.
3. PCA Results: Using the convergence of the Kaiser Rule, 80% cumulative variance threshold, and scree plot elbow detection, the optimal number of principal components was determined. These components together explain a substantial proportion of the total variance in the original feature space.
4. Factor Analysis Results: FA identified a set of latent factors that reproduce the observed correlation structure. Communality analysis revealed which features are well-explained by the model and which are idiosyncratic. Varimax-rotated loadings provide an interpretable “simple structure.”
5. Clinical Relevance: The identified factors likely correspond to distinct biological and physical properties of brain tissue as captured by MRI — such as tissue intensity, morphological geometry, and microstructural texture. These factors can serve as scientifically grounded, low-dimensional representations for downstream analysis (classification, biomarker discovery, or survival prediction).
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Bangkok
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Matrix_1.7-4 caret_7.0-1 lattice_0.22-7
## [4] factoextra_1.0.7 corrplot_0.95 dplyr_1.2.0
## [7] reshape2_1.4.5 ggplot2_4.0.2 GPArotation_2025.3-1
## [10] psych_2.6.1 readxl_1.4.5
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4052.112 farver_2.1.2
## [4] S7_0.2.1 fastmap_1.2.0 pROC_1.19.0.1
## [7] digest_0.6.39 rpart_4.1.24 timechange_0.4.0
## [10] lifecycle_1.0.5 survival_3.8-3 magrittr_2.0.4
## [13] compiler_4.5.2 rlang_1.1.7 sass_0.4.10
## [16] tools_4.5.2 yaml_2.3.12 data.table_1.18.2.1
## [19] knitr_1.51 ggsignif_0.6.4 labeling_0.4.3
## [22] mnormt_2.1.2 plyr_1.8.9 RColorBrewer_1.1-3
## [25] abind_1.4-8 withr_3.0.2 purrr_1.2.1
## [28] nnet_7.3-20 grid_4.5.2 stats4_4.5.2
## [31] ggpubr_0.6.2 future_1.69.0 globals_0.19.0
## [34] scales_1.4.0 iterators_1.0.14 MASS_7.3-65
## [37] cli_3.6.5 rmarkdown_2.30 generics_0.1.4
## [40] otel_0.2.0 rstudioapi_0.18.0 future.apply_1.20.2
## [43] cachem_1.1.0 stringr_1.6.0 splines_4.5.2
## [46] parallel_4.5.2 cellranger_1.1.0 vctrs_0.7.1
## [49] hardhat_1.4.2 carData_3.0-6 jsonlite_2.0.0
## [52] car_3.1-5 ggrepel_0.9.6 rstatix_0.7.3
## [55] Formula_1.2-5 listenv_0.10.0 foreach_1.5.2
## [58] tidyr_1.3.2 gower_1.0.2 jquerylib_0.1.4
## [61] recipes_1.3.1 glue_1.8.0 parallelly_1.46.1
## [64] codetools_0.2-20 lubridate_1.9.5 stringi_1.8.7
## [67] gtable_0.3.6 tibble_3.3.1 pillar_1.11.1
## [70] htmltools_0.5.9 ipred_0.9-15 lava_1.8.2
## [73] R6_2.6.1 evaluate_1.0.5 backports_1.5.0
## [76] broom_1.0.12 bslib_0.10.0 class_7.3-23
## [79] Rcpp_1.1.1 nlme_3.1-168 prodlim_2025.04.28
## [82] xfun_0.56 pkgconfig_2.0.3 ModelMetrics_1.2.2.2
Report generated with R and RMarkdown. Published via RPubs.