Dimensionality reduction represents high-dimensional data in fewer dimensions while preserving key structure. In practice, it is used to:

- visualize complex data,
- reduce noise and redundancy,
- create compact features for clustering or downstream modeling,
- interpret dominant directions of variation (e.g., PCA loadings).
How do different dimensionality reduction methods compare in terms of (i) interpretability and (ii) neighborhood preservation on a real multivariate dataset?
We compare:

- PCA (linear; interpretable loadings),
- Classical MDS (distance-preserving embedding),
- t-SNE (nonlinear; neighborhood-focused embedding).
We do not use labels to learn embeddings. Labels are used only for visualization/interpretation.
We use the base R dataset state.x77:

- 50 observations (US states),
- 8 numeric variables (population, income, education-related variables, etc.).

For interpretation, we attach a categorical label:

- Region from state.region (Northeast / South / North Central / West).
required_pkgs <- c("dplyr", "tidyr", "ggplot2")
missing <- required_pkgs[!vapply(required_pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1))]
if (length(missing) > 0) {
  stop(
    "Missing R packages: ", paste(missing, collapse = ", "),
    "\n\nInstall them first, e.g.:\n",
    "install.packages(c(", paste0('"', missing, '"', collapse = ", "), "), repos='https://cloud.r-project.org')\n"
  )
}

# t-SNE package
if (!requireNamespace("Rtsne", quietly = TRUE)) {
  stop(
    "Missing R package: Rtsne\n\nPlease install it first:\n",
    "install.packages('Rtsne', repos='https://cloud.r-project.org')\n"
  )
}
library(dplyr)
library(tidyr)
library(ggplot2)
library(Rtsne)
state_df <- as.data.frame(state.x77)
state_df$State <- rownames(state.x77)
state_df$Region <- factor(state.region)
dplyr::glimpse(state_df)
## Rows: 50
## Columns: 10
## $ Population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …
## $ Income <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…
## $ Illiteracy <dbl> 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1.9, 0.6,…
## $ `Life Exp` <dbl> 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70.06, 70.…
## $ Murder <dbl> 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 13.9, 6.2…
## $ `HS Grad` <dbl> 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52.6, 40.6,…
## $ Frost <dbl> 20, 152, 15, 65, 20, 166, 139, 103, 11, 60, 0, 126, 127, 12…
## $ Area <dbl> 50708, 566432, 113417, 51945, 156361, 103766, 4862, 1982, 5…
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "…
## $ Region <fct> South, West, West, South, West, West, Northeast, South, Sou…
We separate the label (Region) from the numeric features.
y <- state_df$Region
num_vars <- setdiff(colnames(state_df), c("State", "Region"))
X <- state_df %>% dplyr::select(all_of(num_vars)) %>% as.data.frame()
X_scaled <- scale(X)
X_scaled <- as.matrix(X_scaled)
cat("n =", nrow(X_scaled), " | p =", ncol(X_scaled), "\n")
## n = 50 | p = 8
summary(X)
## Population Income Illiteracy Life Exp
## Min. : 365 Min. :3098 Min. :0.500 Min. :67.96
## 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12
## Median : 2838 Median :4519 Median :0.950 Median :70.67
## Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88
## 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89
## Max. :21198 Max. :6315 Max. :2.800 Max. :73.60
## Murder HS Grad Frost Area
## Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
## 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
## Median : 6.850 Median :53.25 Median :114.50 Median : 54277
## Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
## 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81163
## Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
corr <- cor(X_scaled)
corr_df <- as.data.frame(as.table(corr))
colnames(corr_df) <- c("Var1", "Var2", "Corr")
ggplot(corr_df, aes(x = Var1, y = Var2, fill = Corr)) +
  geom_tile() +
  coord_equal() +
  labs(title = "Correlation heatmap of standardized features", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Different methods optimize different objectives. To compare them with a simple internal criterion, we compute a k-nearest-neighbor overlap score. For each observation, take its k nearest neighbors in

- the original standardized feature space, and
- the low-dimensional embedding,

then compute the average overlap ratio. Higher values mean better local neighborhood preservation.
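Formally, writing $N_k^{X}(i)$ for the set of the $k$ nearest neighbors of observation $i$ in the standardized feature space and $N_k^{Y}(i)$ for the corresponding set in the embedding, the score is

$$
\text{overlap}(k) = \frac{1}{n} \sum_{i=1}^{n} \frac{\bigl|\, N_k^{X}(i) \cap N_k^{Y}(i) \,\bigr|}{k}.
$$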
# For each row of a distance matrix D, return the indices of its k nearest
# neighbors (excluding the point itself).
knn_indices <- function(D, k) {
  n <- nrow(D)
  idx <- vector("list", n)
  for (i in seq_len(n)) {
    ord <- order(D[i, ])
    ord <- ord[ord != i]   # drop the point itself
    idx[[i]] <- ord[1:k]
  }
  idx
}

# Mean proportion of shared k-nearest neighbors between the original space X
# and the embedding Y (1 = perfect local preservation).
neighborhood_overlap <- function(X, Y, k = 10) {
  Dx <- as.matrix(dist(X))
  Dy <- as.matrix(dist(Y))
  Nx <- knn_indices(Dx, k)
  Ny <- knn_indices(Dy, k)
  overlaps <- sapply(seq_along(Nx), function(i) {
    length(intersect(Nx[[i]], Ny[[i]])) / k
  })
  mean(overlaps)
}
We will compute this score for k = 10.
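As a quick sanity check of the score itself (illustrative only, not part of the comparison): mapping the data onto itself should give a perfect score of 1, while an unrelated random 2D embedding should land near the chance level k / (n - 1) = 10 / 49, roughly 0.2.

# Sanity check 1: identical spaces -> perfect overlap of 1
neighborhood_overlap(X_scaled, X_scaled, k = 10)

# Sanity check 2: a random, unrelated 2D "embedding" should score close to
# chance level, k / (n - 1) = 10 / 49 (roughly 0.2).
set.seed(1)  # arbitrary seed, only to make this illustration reproducible
random_embedding <- matrix(rnorm(nrow(X_scaled) * 2), ncol = 2)
neighborhood_overlap(X_scaled, random_embedding, k = 10)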
PCA finds orthogonal directions that maximize variance and provides interpretable loadings.
pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pca_var_df <- data.frame(
  PC = paste0("PC", seq_along(var_explained)),
  var_explained = var_explained,
  cum_var = cumsum(var_explained)
)
pca_var_df
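Because X_scaled is already centered and scaled, calling prcomp() with center = FALSE, scale. = FALSE is equivalent to running prcomp(X, center = TRUE, scale. = TRUE), and the scores are simply the standardized data projected onto the loadings. A quick check (illustrative):

# Scores should equal the standardized data times the loading matrix
all.equal(unname(pca$x), unname(X_scaled %*% pca$rotation))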
ggplot(pca_var_df, aes(x = PC, y = var_explained)) +
  geom_col() +
  labs(title = "Scree plot (variance explained by each PC)", x = "", y = "Proportion of variance explained") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(pca_var_df, aes(x = PC, y = cum_var, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Cumulative variance explained", x = "", y = "Cumulative proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
pca2 <- as.data.frame(pca$x[, 1:2])
colnames(pca2) <- c("Dim1", "Dim2")
pca2$Region <- y
pca2$State <- state_df$State
ggplot(pca2, aes(x = Dim1, y = Dim2, color = Region)) +
  geom_point(alpha = 0.9) +
  labs(title = "PCA embedding (2D)", color = "Region (for interpretation)")
Which original variables drive PC1 and PC2?
loadings <- pca$rotation[, 1:2, drop = FALSE]
load_df <- as.data.frame(loadings)
load_df$feature <- rownames(load_df)
load_long <- tidyr::pivot_longer(load_df, cols = c("PC1", "PC2"),
                                 names_to = "PC", values_to = "loading")

ggplot(load_long, aes(x = reorder(feature, loading), y = loading)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~PC, scales = "free_y") +
  labs(title = "PCA loadings for PC1 and PC2", x = "", y = "Loading")
Classical MDS embeds points by attempting to preserve pairwise Euclidean distances.
D <- dist(X_scaled, method = "euclidean")
mds <- cmdscale(D, k = 2, eig = TRUE)
mds2 <- as.data.frame(mds$points)
colnames(mds2) <- c("Dim1", "Dim2")
mds2$Region <- y
ggplot(mds2, aes(x = Dim1, y = Dim2, color = Region)) +
  geom_point(alpha = 0.9) +
  labs(title = "Classical MDS embedding (2D)", color = "Region (for interpretation)")
t-SNE emphasizes local neighborhood structure and can reveal nonlinear patterns. Perplexity must satisfy perplexity < (n - 1) / 3, so we choose it automatically.
n <- nrow(X_scaled)

# Largest admissible perplexity (with a safety margin of 1),
# capped at the common default of 30 and floored at 5.
perp_max <- floor((n - 1) / 3) - 1
perp <- min(30, perp_max)
perp <- max(5, perp)

# Note: t-SNE is stochastic; results vary between runs unless a seed is set.
ts <- Rtsne::Rtsne(
  X_scaled,
  dims = 2,
  perplexity = perp,
  verbose = FALSE,
  check_duplicates = FALSE
)
tsne2 <- as.data.frame(ts$Y)
colnames(tsne2) <- c("Dim1", "Dim2")
tsne2$Region <- y
ggplot(tsne2, aes(x = Dim1, y = Dim2, color = Region)) +
  geom_point(alpha = 0.9) +
  labs(title = paste0("t-SNE embedding (2D), perplexity = ", perp), color = "Region (for interpretation)")
k <- 10
score_pca <- neighborhood_overlap(X_scaled, pca$x[, 1:2], k = k)
score_mds <- neighborhood_overlap(X_scaled, mds$points, k = k)
score_tsne <- neighborhood_overlap(X_scaled, ts$Y, k = k)
comparison <- data.frame(
  method = c("PCA (2D)", "Classical MDS (2D)", "t-SNE (2D)"),
  neighborhood_overlap_k10 = c(score_pca, score_mds, score_tsne),
  notes = c(
    "Linear + interpretable loadings; preserves major variance directions.",
    "Distance-preserving embedding in Euclidean geometry.",
    paste0("Nonlinear; emphasizes local neighborhoods (perplexity=", perp, ").")
  )
)
comparison
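The overlap score depends on the choice of k, so it is worth checking whether the ranking of the three methods is stable across a few neighborhood sizes (a quick sketch, not part of the main table):

# Recompute the neighborhood overlap for several k values
sapply(c(5, 10, 15), function(k) {
  c(k    = k,
    PCA  = neighborhood_overlap(X_scaled, pca$x[, 1:2], k = k),
    MDS  = neighborhood_overlap(X_scaled, mds$points, k = k),
    tSNE = neighborhood_overlap(X_scaled, ts$Y, k = k))
})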
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rtsne_0.17 ggplot2_4.0.0 tidyr_1.3.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.1 tidyselect_1.2.1
## [5] Rcpp_1.1.0 jquerylib_0.1.4 scales_1.4.0 yaml_2.3.10
## [9] fastmap_1.2.0 R6_2.6.1 labeling_0.4.3 generics_0.1.4
## [13] knitr_1.50 tibble_3.3.0 bslib_0.9.0 pillar_1.11.1
## [17] RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0 xfun_0.54
## [21] sass_0.4.10 S7_0.2.0 cli_3.6.5 withr_3.0.2
## [25] magrittr_2.0.4 digest_0.6.37 grid_4.5.1 rstudioapi_0.17.1
## [29] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0
## [33] farver_2.1.2 codetools_0.2-20 rmarkdown_2.30 purrr_1.1.0
## [37] tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.8.1