Introduction to Dimensionality Reduction

In machine learning and data analysis, datasets often contain a large number of features or variables. While having many features can provide detailed information, it can also lead to challenges such as increased computational complexity, overfitting, and difficulty visualizing the data. Dimensionality reduction addresses these challenges by transforming high-dimensional data into a lower-dimensional form that retains the most important information.

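Benefits of Dimensionality Reduction:

  • Lower computational cost, because fewer features need to be stored and processed.
  • Reduced risk of overfitting.
  • Easier visualization, since the data can be plotted in two or three dimensions.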

Common Dimensionality Reduction Techniques:

1. Principal Component Analysis (PCA): A linear technique that transforms data into a new coordinate system in which the direction of greatest variance lies along the first coordinate (the first principal component), the direction of second greatest variance along the second coordinate, and so on.

2. t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique primarily used for data visualization, which models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points.

3. Uniform Manifold Approximation and Projection (UMAP): A non-linear technique that is computationally efficient and aims to preserve local structure while retaining more of the global structure than t-SNE typically does, making it suitable for visualization.

Each of these techniques has its strengths and is chosen based on the specific requirements of the analysis, such as the need for interpretability, computational efficiency, or the preservation of data structure.
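To make the PCA description above more concrete, the following minimal sketch computes principal components by eigendecomposition of the correlation matrix, using base R and the built-in iris data. The variable names are illustrative, and this is one textbook route to PCA rather than the exact routine any particular package uses.

```r
# Minimal PCA sketch via eigendecomposition (base R only).
X <- scale(iris[, 1:4])          # standardize the four numeric features
R <- cor(iris[, 1:4])            # correlation matrix = covariance of standardized data
eig <- eigen(R, symmetric = TRUE)

# Eigenvalues are the variances along each principal component,
# returned in decreasing order.
variances <- eig$values

# Project the standardized data onto the eigenvectors to get the scores.
scores <- X %*% eig$vectors
colnames(scores) <- paste0("PC", seq_len(ncol(scores)))

# The variance of each score column matches the corresponding eigenvalue.
print(round(variances, 4))
print(round(apply(scores, 2, var), 4))
```

Because the eigenvalues come back sorted from largest to smallest, the first score column carries the most variance, the second the next most, and so on.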

Research Questions

  1. How do PCA, t-SNE, and UMAP compare in terms of preserving variance and clustering patterns?

  2. Which method is best suited for visualization in 2D?

  3. How does dimensionality reduction affect clustering performance?

Dataset

Iris Dataset: This classic dataset contains 150 samples of iris flowers, each described by four features: sepal length, sepal width, petal length, and petal width. The samples are classified into three species: Iris setosa, Iris versicolor, and Iris virginica.
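The description above can be checked directly against the copy of the dataset that ships with base R. This is only a quick sanity check and assumes nothing beyond the built-in iris data frame.

```r
# Inspect the built-in iris data frame (base R only).
data(iris)

print(dim(iris))              # number of rows (samples) and columns
print(sapply(iris, class))    # four numeric features plus one factor
print(table(iris$Species))    # number of samples per species
```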

Load Required Libraries

library(ggplot2)
library(FactoMineR)  # For PCA
library(factoextra)  # For PCA visualization
library(Rtsne)       # For t-SNE
library(umap)        # For UMAP

Load and Preprocess Data

# Load the Iris dataset
data(iris)

# Remove the categorical column (Species) before dimensionality reduction
iris_features <- iris[, -5]  # Exclude Species column

# Standardize the data (important for PCA and t-SNE)
iris_scaled <- scale(iris_features)

# Display the first few rows
head(iris_scaled)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667
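As a rough illustration of what the standardization step does, the sketch below reproduces scale() by hand: each column has its mean subtracted and is then divided by its sample standard deviation. The name manual_scaled is introduced here for illustration only.

```r
# What scale() does, written out by hand (base R only).
features <- as.matrix(iris[, 1:4])

col_means <- colMeans(features)
col_sds <- apply(features, 2, sd)   # sample standard deviation (n - 1 denominator)

# Subtract each column's mean, then divide by its standard deviation.
manual_scaled <- sweep(sweep(features, 2, col_means, "-"), 2, col_sds, "/")

# After standardization every column has mean ~0 and standard deviation 1.
print(round(colMeans(manual_scaled), 10))
print(apply(manual_scaled, 2, sd))

# Compare with scale(), ignoring the attributes that scale() attaches.
print(all.equal(manual_scaled, scale(features), check.attributes = FALSE))
```

Putting all features on the same scale keeps a feature with large raw values from dominating the distance and variance calculations that follow.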

Principal Component Analysis (PCA)

  • Step 1: Apply PCA
# Perform PCA (the data are already standardized, so center and scale. are redundant here but harmless)
pca_result <- prcomp(iris_scaled, center = TRUE, scale. = TRUE)

# Check summary to see variance explained
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
  • Step 2: Scree Plot (Explained Variance)
# Upper y-limit is set above PC1's share (about 73%) so the first bar is not clipped
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 80))

  • Step 3: PCA Individuals Plot
fviz_pca_ind(pca_result, col.ind = iris$Species, 
             palette = "jco", addEllipses = TRUE,
             title = "PCA: Projection of Iris Data")

Interpretation:

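  • The first two principal components explain about 95.8% of the variance (72.96% for PC1 and 22.85% for PC2), so a 2D projection loses little information.
  • The plot shows how the species cluster along these two components. Iris setosa typically separates cleanly, while Iris versicolor and Iris virginica overlap.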
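The proportions reported by summary() can be recomputed from the component standard deviations, which may help when reading the scree plot. A small sketch, assuming only base R (the object name pca_check is illustrative):

```r
pca_check <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation.
var_explained <- pca_check$sdev^2 / sum(pca_check$sdev^2)
cum_explained <- cumsum(var_explained)

print(round(var_explained, 4))
print(round(cum_explained, 4))

# Smallest number of components that retains at least 95% of the variance.
n_keep <- which(cum_explained >= 0.95)[1]
print(n_keep)
```

The 95% threshold is an arbitrary choice for this example; any other cutoff can be substituted.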

t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Step 1: Apply t-SNE

By default, the Rtsne() function stops with an error if the dataset contains duplicate rows, so duplicates are removed first.
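Before removing anything, it can be useful to see how many duplicates there are and to confirm that the species labels stay aligned with the rows that remain. The sketch below uses base R only; the names scaled, keep, scaled_unique, and labels_unique are illustrative.

```r
scaled <- scale(iris[, 1:4])

# duplicated() on a matrix flags rows that repeat an earlier row.
n_dup <- sum(duplicated(scaled))
print(n_dup)

# Keep the first occurrence of each row and the matching species labels.
keep <- !duplicated(scaled)
scaled_unique <- scaled[keep, , drop = FALSE]
labels_unique <- iris$Species[keep]

# unique() selects the same rows, so features and labels stay aligned.
print(isTRUE(all.equal(unique(scaled), scaled_unique, check.attributes = FALSE)))
print(c(rows = nrow(scaled_unique), labels = length(labels_unique)))
```

Rtsne() also has a check_duplicates argument that can switch the check off, but removing duplicates explicitly keeps the label vector and the embedding the same length.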

# Remove duplicate rows (important for t-SNE)
iris_scaled_unique <- unique(iris_scaled)
set.seed(26)  # For reproducibility
tsne_result <- Rtsne(iris_scaled_unique, dims = 2, perplexity = 30, verbose = TRUE)
## Performing PCA
## Read the 149 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.01 seconds (sparsity = 0.713752)!
## Learning embedding...
## Iteration 50: error is 44.961679 (50 iterations in 0.01 seconds)
## Iteration 100: error is 46.899944 (50 iterations in 0.01 seconds)
## Iteration 150: error is 45.562550 (50 iterations in 0.01 seconds)
## Iteration 200: error is 45.257420 (50 iterations in 0.01 seconds)
## Iteration 250: error is 44.318333 (50 iterations in 0.01 seconds)
## Iteration 300: error is 0.319061 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.154203 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.151281 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.149260 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.148146 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.145943 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.147368 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.146480 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.147560 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.145561 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.147101 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.147464 (50 iterations in 0.01 seconds)
## Iteration 900: error is 0.146714 (50 iterations in 0.01 seconds)
## Iteration 950: error is 0.147126 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.145092 (50 iterations in 0.01 seconds)
## Fitting performed in 0.19 seconds.
# Convert to a data frame, keeping the species labels of the non-duplicated rows
tsne_data <- data.frame(tsne_result$Y, Species = iris$Species[!duplicated(iris_scaled)])

# Rename columns
colnames(tsne_data) <- c("Dim1", "Dim2", "Species")
  • Step 2: Visualize t-SNE Results
ggplot(tsne_data, aes(x = Dim1, y = Dim2, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "t-SNE Visualization of Iris Dataset") +
  theme_minimal()

Interpretation:

  • t-SNE forms well-separated clusters but may not maintain global structure.
  • Because the method is non-linear, it can reveal cluster structure that a linear projection may miss, which makes it well suited to visualization.
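The perplexity setting interacts with the number of rows. The sketch below assumes that Rtsne() rejects a perplexity for which 3 * perplexity exceeds the number of rows minus one; this constraint is stated here as an assumption about the package, and max_perplexity is a hypothetical helper, not an Rtsne function.

```r
# Largest perplexity allowed for n_rows, assuming the constraint
# 3 * perplexity <= n_rows - 1 (hypothetical helper, not part of Rtsne).
max_perplexity <- function(n_rows) {
  floor((n_rows - 1) / 3)
}

# Number of unique standardized rows in iris.
n_unique <- nrow(unique(scale(iris[, 1:4])))
print(n_unique)
print(max_perplexity(n_unique))

# Check that a perplexity of 30 is within the assumed limit.
print(30 <= max_perplexity(n_unique))
```

On small datasets this limit is easy to reach, so the perplexity may need to be lowered when the same code is reused on fewer samples.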

Uniform Manifold Approximation and Projection (UMAP)

  • Step 1: Apply UMAP
set.seed(26)  # For reproducibility
umap_result <- umap(iris_scaled)

# Convert to a data frame and attach the species labels
umap_data <- data.frame(umap_result$layout, Species = iris$Species)
colnames(umap_data) <- c("Dim1", "Dim2", "Species")
  • Step 2: Visualize UMAP Results
ggplot(umap_data, aes(x = Dim1, y = Dim2, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "UMAP Visualization of Iris Dataset") +
  theme_minimal()
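UMAP results depend on a few settings, and the umap package exposes them through a configuration object. The following sketch starts from umap.defaults and overrides three fields; the specific values are arbitrary choices for illustration, not recommendations, and custom_config and umap_custom are names introduced here.

```r
library(umap)

# Start from the package defaults and override selected settings.
custom_config <- umap.defaults
custom_config$n_neighbors <- 10     # smaller values emphasize local structure
custom_config$min_dist <- 0.3       # larger values spread points further apart
custom_config$random_state <- 26    # fixed seed for a repeatable layout

umap_custom <- umap(scale(iris[, 1:4]), config = custom_config)

# The layout has one row per sample and n_components columns (2 by default).
print(dim(umap_custom$layout))
```

Rerunning the plot with a few different n_neighbors and min_dist values is a simple way to see how sensitive the embedding is to these settings.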

Interpretation:

Conclusion

**Pros**

  • PCA: Preserves variance; linear and interpretable.
  • t-SNE: Strong cluster separation.
  • UMAP: Fast; good for visualization and clustering.

**Cons**

  • PCA: Not well suited to non-linear data.
  • t-SNE: Computationally expensive; does not preserve global structure.
  • UMAP: May require parameter tuning.

Key Takeaways: PCA works well when the goal is to reduce dimensions while retaining as much variance as possible. t-SNE is well suited to visualizing clusters, though it does not reliably preserve global structure. UMAP is fast and flexible, and is often used as a faster alternative to t-SNE.

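Answers to the Research Questions

  1. How do PCA, t-SNE, and UMAP compare in terms of preserving variance and clustering patterns? The plots above address this question visually. Only PCA quantifies retained variance: its first two components account for about 95.8% of it. t-SNE and UMAP have no equivalent measure and are compared by how the species group in their 2D embeddings.

  2. Which method is best suited for visualization in 2D? It depends on the goal. For variance retention and feature importance, PCA is the best fit. For strong cluster separation, t-SNE is well suited. For cluster separation together with better preservation of global structure, UMAP is usually the better choice.

  3. How does dimensionality reduction affect clustering performance? Dimensionality reduction can affect clustering in several ways: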

✅ Advantages:

⚠️ Challenges:

Effect on the Iris dataset:
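One way to probe the effect on the Iris dataset is to run the same clustering algorithm before and after reduction and compare the clusters with the known species. The sketch below uses k-means and a simple purity score; both are choices made here for illustration and are not taken from the analysis above, and the names km_full, km_pca, and purity are hypothetical.

```r
set.seed(26)  # k-means uses random starting centers

scaled <- scale(iris[, 1:4])
pcs <- prcomp(scaled)$x[, 1:2]   # first two principal components

# Cluster in the original 4D space and in the reduced 2D space.
km_full <- kmeans(scaled, centers = 3, nstart = 25)
km_pca <- kmeans(pcs, centers = 3, nstart = 25)

# Purity: share of samples that fall in their cluster's majority species.
purity <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / length(labels)
}

# Cross-tabulate clusters against species, then compare purity scores.
print(table(km_full$cluster, iris$Species))
print(table(km_pca$cluster, iris$Species))
print(c(full = purity(km_full$cluster, iris$Species),
        pca = purity(km_pca$cluster, iris$Species)))
```

Similar purity in both spaces would suggest that little cluster-relevant information was lost in the reduction. The same comparison can be repeated with t-SNE or UMAP coordinates in place of pcs.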
