Dimensionality reduction represents high-dimensional data in fewer dimensions while preserving key structure. In practice, it is used to:

- visualize complex data,
- reduce noise and redundancy,
- create compact features for clustering or downstream modeling,
- interpret dominant directions of variation (e.g., PCA loadings).
How do different dimensionality reduction methods compare in terms of (i) interpretability and (ii) neighborhood preservation on a real multivariate dataset?
We compare:

- PCA (linear; interpretable loadings),
- Classical MDS (distance-preserving embedding),
- t-SNE (nonlinear; neighborhood-focused embedding).
We do not use labels to learn embeddings. Labels are used only for visualization/interpretation.
We use the base R dataset state.x77:

- 50 observations (US states),
- 8 numeric variables (population, income, education-related variables, etc.).

For interpretation, we attach a categorical label:

- Region from state.region (Northeast / South / North Central / West).
required_pkgs <- c("dplyr", "tidyr", "ggplot2")
missing <- required_pkgs[!vapply(required_pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1))]
if (length(missing) > 0) {
  stop(
    "Missing R packages: ", paste(missing, collapse = ", "),
    "\n\nInstall them first, e.g.:\n",
    "install.packages(c(", paste0('"', missing, '"', collapse = ", "), "), repos='https://cloud.r-project.org')\n"
  )
}

# t-SNE package
if (!requireNamespace("Rtsne", quietly = TRUE)) {
  stop(
    "Missing R package: Rtsne\n\nPlease install it first:\n",
    "install.packages('Rtsne', repos='https://cloud.r-project.org')\n"
  )
}
library(dplyr)
library(tidyr)
library(ggplot2)
library(Rtsne)
state_df <- as.data.frame(state.x77)
state_df$State <- rownames(state.x77)
state_df$Region <- factor(state.region)
dplyr::glimpse(state_df)
## Rows: 50
## Columns: 10
## $ Population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …
## $ Income <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…
## $ Illiteracy <dbl> 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1.9, 0.6,…
## $ `Life Exp` <dbl> 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70.06, 70.…
## $ Murder <dbl> 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 13.9, 6.2…
## $ `HS Grad` <dbl> 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52.6, 40.6,…
## $ Frost <dbl> 20, 152, 15, 65, 20, 166, 139, 103, 11, 60, 0, 126, 127, 12…
## $ Area <dbl> 50708, 566432, 113417, 51945, 156361, 103766, 4862, 1982, 5…
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "…
## $ Region <fct> South, West, West, South, West, West, Northeast, South, Sou…
We separate the label (Region) from the numeric features.
y <- state_df$Region
num_vars <- setdiff(colnames(state_df), c("State", "Region"))
X <- state_df %>% dplyr::select(all_of(num_vars)) %>% as.data.frame()
X_scaled <- scale(X)
X_scaled <- as.matrix(X_scaled)
cat("n =", nrow(X_scaled), " | p =", ncol(X_scaled), "\n")
## n = 50 | p = 8
summary(X)
## Population Income Illiteracy Life Exp
## Min. : 365 Min. :3098 Min. :0.500 Min. :67.96
## 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12
## Median : 2838 Median :4519 Median :0.950 Median :70.67
## Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88
## 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89
## Max. :21198 Max. :6315 Max. :2.800 Max. :73.60
## Murder HS Grad Frost Area
## Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
## 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
## Median : 6.850 Median :53.25 Median :114.50 Median : 54277
## Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
## 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81163
## Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
corr <- cor(X_scaled)
corr_df <- as.data.frame(as.table(corr))
colnames(corr_df) <- c("Var1", "Var2", "Corr")
ggplot(corr_df, aes(x = Var1, y = Var2, fill = Corr)) +
  geom_tile() +
  coord_equal() +
  labs(title = "Correlation heatmap of standardized features", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Different methods optimize different objectives. To compare them with a simple internal criterion, we compute a k-nearest-neighbor overlap score. For each observation, take its k nearest neighbors in

- the original standardized feature space, and
- the low-dimensional embedding,

then compute the average overlap ratio. Higher values mean better local neighborhood preservation.
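Formally, writing $N_k^{X}(i)$ for the set of the $k$ nearest neighbors of observation $i$ in the standardized feature space and $N_k^{Y}(i)$ for the corresponding set in the embedding, the score is

$$
\text{overlap}(k) = \frac{1}{n} \sum_{i=1}^{n} \frac{\bigl|\, N_k^{X}(i) \cap N_k^{Y}(i) \,\bigr|}{k}.
$$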
# For each row of a distance matrix D, return the indices of its k nearest
# neighbors (excluding the point itself).
knn_indices <- function(D, k) {
  n <- nrow(D)
  idx <- vector("list", n)
  for (i in seq_len(n)) {
    ord <- order(D[i, ])
    ord <- ord[ord != i]   # drop the point itself
    idx[[i]] <- ord[1:k]
  }
  idx
}

# Mean proportion of shared k-nearest neighbors between the original space X
# and the embedding Y (1 = perfect local preservation).
neighborhood_overlap <- function(X, Y, k = 10) {
  Dx <- as.matrix(dist(X))
  Dy <- as.matrix(dist(Y))
  Nx <- knn_indices(Dx, k)
  Ny <- knn_indices(Dy, k)
  overlaps <- sapply(seq_along(Nx), function(i) {
    length(intersect(Nx[[i]], Ny[[i]])) / k
  })
  mean(overlaps)
}
We will compute this score for k = 10.
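As a quick sanity check of the score itself (illustrative only, not part of the comparison): mapping the data onto itself should give a perfect score of 1, while an unrelated random 2D embedding should land near the chance level k / (n - 1) = 10 / 49, roughly 0.2.

# Sanity check 1: identical spaces -> perfect overlap of 1
neighborhood_overlap(X_scaled, X_scaled, k = 10)

# Sanity check 2: a random, unrelated 2D "embedding" should score close to
# chance level, k / (n - 1) = 10 / 49 (roughly 0.2).
set.seed(1)  # arbitrary seed, only to make this illustration reproducible
random_embedding <- matrix(rnorm(nrow(X_scaled) * 2), ncol = 2)
neighborhood_overlap(X_scaled, random_embedding, k = 10)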
PCA finds orthogonal directions that maximize variance and provides interpretable loadings.
pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pca_var_df <- data.frame(
  PC = paste0("PC", seq_along(var_explained)),
  var_explained = var_explained,
  cum_var = cumsum(var_explained)
)
pca_var_df
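Because X_scaled is already centered and scaled, calling prcomp() with center = FALSE, scale. = FALSE is equivalent to running prcomp(X, center = TRUE, scale. = TRUE), and the scores are simply the standardized data projected onto the loadings. A quick check (illustrative):

# Scores should equal the standardized data times the loading matrix
all.equal(unname(pca$x), unname(X_scaled %*% pca$rotation))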
ggplot(pca_var_df, aes(x = PC, y = var_explained)) +
  geom_col() +
  labs(title = "Scree plot (variance explained by each PC)", x = "", y = "Proportion of variance explained") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(pca_var_df, aes(x = PC, y = cum_var, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Cumulative variance explained", x = "", y = "Cumulative proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
pca2 <- as.data.frame(pca$x[, 1:2])
colnames(pca2) <- c("Dim1", "Dim2")
pca2$Region <- y
pca2$State <- state_df$State
ggplot(pca2, aes(x = Dim1, y = Dim2, color = Region)) +
  geom_point(alpha = 0.9) +
  labs(title = "PCA embedding (2D)", color = "Region (for interpretation)")
Which original variables drive PC1 and PC2?
loadings <- pca$rotation[, 1:2, drop = FALSE]
load_df <- as.data.frame(loadings)
load_df$feature <- rownames(load_df)
load_long <- tidyr::pivot_longer(load_df, cols = c("PC1", "PC2"),
                                 names_to = "PC", values_to = "loading")

ggplot(load_long, aes(x = reorder(feature, loading), y = loading)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~PC, scales = "free_y") +
  labs(title = "PCA loadings for PC1 and PC2", x = "", y = "Loading")
Classical MDS embeds points by attempting to preserve pairwise Euclidean distances.
D <- dist(X_scaled, method = "euclidean")
mds <- cmdscale(D, k = 2, eig = TRUE)
mds2 <- as.data.frame(mds$points)
colnames(mds2) <- c("Dim1", "Dim2")
mds2$Region <- y
ggplot(mds2, aes(x = Dim1, y = Dim2, color = Region)) +
  geom_point(alpha = 0.9) +
  labs(title = "Classical MDS embedding (2D)", color = "Region (for interpretation)")
t-SNE emphasizes local neighborhood structure and can reveal nonlinear patterns. Perplexity must satisfy perplexity < (n - 1) / 3, so we choose it automatically.
n <- nrow(X_scaled)

# Largest admissible perplexity (with a safety margin of 1),
# capped at the common default of 30 and floored at 5.
perp_max <- floor((n - 1) / 3) - 1
perp <- min(30, perp_max)
perp <- max(5, perp)

# Note: t-SNE is stochastic; results vary between runs unless a seed is set.
ts <- Rtsne::Rtsne(
  X_scaled,
  dims = 2,
  perplexity = perp,
  verbose = FALSE,
  check_duplicates = FALSE
)
tsne2 <- as.data.frame(ts$Y)
colnames(tsne2) <- c("Dim1", "Dim2")
tsne2$Region <- y
ggplot(tsne2, aes(x = Dim1, y = Dim2, color = Region)) +
  geom_point(alpha = 0.9) +
  labs(title = paste0("t-SNE embedding (2D), perplexity = ", perp), color = "Region (for interpretation)")
k <- 10
score_pca <- neighborhood_overlap(X_scaled, pca$x[, 1:2], k = k)
score_mds <- neighborhood_overlap(X_scaled, mds$points, k = k)
score_tsne <- neighborhood_overlap(X_scaled, ts$Y, k = k)
comparison <- data.frame(
  method = c("PCA (2D)", "Classical MDS (2D)", "t-SNE (2D)"),
  neighborhood_overlap_k10 = c(score_pca, score_mds, score_tsne),
  notes = c(
    "Linear + interpretable loadings; preserves major variance directions.",
    "Distance-preserving embedding in Euclidean geometry.",
    paste0("Nonlinear; emphasizes local neighborhoods (perplexity=", perp, ").")
  )
)
comparison
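The overlap score depends on the choice of k, so it is worth checking whether the ranking of the three methods is stable across a few neighborhood sizes (a quick sketch, not part of the main table):

# Recompute the neighborhood overlap for several k values
sapply(c(5, 10, 15), function(k) {
  c(k    = k,
    PCA  = neighborhood_overlap(X_scaled, pca$x[, 1:2], k = k),
    MDS  = neighborhood_overlap(X_scaled, mds$points, k = k),
    tSNE = neighborhood_overlap(X_scaled, ts$Y, k = k))
})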
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rtsne_0.17 ggplot2_4.0.0 tidyr_1.3.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.1 tidyselect_1.2.1
## [5] Rcpp_1.1.0 jquerylib_0.1.4 scales_1.4.0 yaml_2.3.10
## [9] fastmap_1.2.0 R6_2.6.1 labeling_0.4.3 generics_0.1.4
## [13] knitr_1.50 tibble_3.3.0 bslib_0.9.0 pillar_1.11.1
## [17] RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0 xfun_0.54
## [21] sass_0.4.10 S7_0.2.0 cli_3.6.5 withr_3.0.2
## [25] magrittr_2.0.4 digest_0.6.37 grid_4.5.1 rstudioapi_0.17.1
## [29] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0
## [33] farver_2.1.2 codetools_0.2-20 rmarkdown_2.30 purrr_1.1.0
## [37] tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.8.1