Introduction to Dimensionality Reduction

In machine learning and data analysis, datasets often contain a large number of features or variables. While many features can provide detailed information, they can also increase computational cost, encourage overfitting, and make the data hard to visualize. Dimensionality reduction addresses these challenges by transforming high-dimensional data into a lower-dimensional form that retains the most important information.

Benefits of Dimensionality Reduction:

Simplification: Reduces the complexity of the data, making it easier to analyze and interpret.
Visualization: Allows for the representation of data in 2D or 3D spaces, facilitating the identification of patterns and relationships.
Noise Reduction: Discards low-information directions in the data, which can improve the signal-to-noise ratio.
Improved Performance: Can enhance the performance of machine learning models by mitigating the curse of dimensionality.

Common Dimensionality Reduction Techniques:

1. Principal Component Analysis (PCA): A linear technique that rotates the data into a new coordinate system whose axes (the principal components) are ordered by the variance they capture. The first component lies along the direction of greatest variance, the second along the direction of greatest remaining variance orthogonal to the first, and so on.

2. t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique primarily used for data visualization, which models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points.

3. Uniform Manifold Approximation and Projection (UMAP): A non-linear technique that is computationally efficient and aims to preserve local structure while retaining more of the global structure than t-SNE typically does, which makes it suitable for visualization.

Each of these techniques has its strengths and is chosen based on the specific requirements of the analysis, such as the need for interpretability, computational efficiency, or the preservation of data structure.
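
To make the PCA description above concrete, here is a minimal base-R sketch. It assumes the built-in iris data with standardized features, and the variable names (X, eig, scores) are illustrative. It derives the components from an eigen decomposition of the covariance matrix and compares the result with prcomp().

```r
# Standardize the four numeric iris features (mean 0, sd 1 per column)
X <- scale(iris[, 1:4])

# Eigen decomposition of the covariance matrix of the standardized data
# (equivalent to the correlation matrix of the raw features)
eig <- eigen(cov(X))

# Project the data onto the eigenvectors to obtain component scores
scores <- X %*% eig$vectors

# The eigenvalues are the variances of the components
manual_sdev <- sqrt(eig$values)

# Compare with prcomp(), which uses a singular value decomposition internally
reference_sdev <- prcomp(X)$sdev
print(round(manual_sdev, 4))
print(round(reference_sdev, 4))
```

The two printed vectors should agree up to numerical precision. The signs of individual eigenvectors may differ between the two approaches, which does not change the variance captured.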

Research Questions

How do PCA, t-SNE, and UMAP compare in terms of preserving variance and clustering patterns?
Which method is best suited for visualization in 2D?
How does dimensionality reduction affect clustering performance?

Dataset

Iris Dataset: This classic dataset contains 150 samples of iris flowers, each described by four features: sepal length, sepal width, petal length, and petal width. The samples are classified into three species: Iris setosa, Iris versicolor, and Iris virginica.

Load Required Libraries

library(ggplot2)
library(FactoMineR)  # For PCA
library(factoextra)  # For PCA visualization
library(Rtsne)       # For t-SNE
library(umap)        # For UMAP

Load and Preprocess Data

# Load the Iris dataset
data(iris)

# Remove the categorical column (Species) before dimensionality reduction
iris_features <- iris[, -5]  # Exclude Species column

# Standardize the data (important for PCA and t-SNE)
iris_scaled <- scale(iris_features)

# Display the first few rows
head(iris_scaled)

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667
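
Standardization matters because features measured on larger scales would otherwise dominate distance- and variance-based methods. The following sketch, which assumes only base R and the built-in iris data, checks that each scaled column has mean 0 and standard deviation 1.

```r
iris_scaled <- scale(iris[, -5])

# Column means should be numerically zero and standard deviations one
col_means <- colMeans(iris_scaled)
col_sds <- apply(iris_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# scale() stores the original centers and scales as attributes,
# which allows new observations to be transformed consistently
print(attr(iris_scaled, "scaled:center"))
print(attr(iris_scaled, "scaled:scale"))
```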

Principal Component Analysis (PCA)

Step 1: Apply PCA

# Perform PCA (iris_scaled is already standardized, so center and scale. leave it unchanged)
pca_result <- prcomp(iris_scaled, center = TRUE, scale. = TRUE)

# Check summary to see variance explained
summary(pca_result)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
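
The proportions in this summary can be reproduced from the component standard deviations. This is a minimal sketch; the 95% threshold is an illustrative assumption, not a value taken from the analysis.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation
variances <- pca$sdev^2
prop_var <- variances / sum(variances)
cum_var <- cumsum(prop_var)
print(round(rbind(prop_var, cum_var), 4))

# Smallest number of components whose cumulative proportion reaches the threshold
threshold <- 0.95  # illustrative choice
n_keep <- which(cum_var >= threshold)[1]
print(n_keep)
```

With the proportions reported in the summary above, two components already exceed the 95% threshold.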

Step 2: Scree Plot (Explained Variance)

# PC1 explains about 73% of the variance, so the y-axis must extend beyond 60
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 80))

Step 3: PCA Individuals Plot

fviz_pca_ind(pca_result, col.ind = iris$Species, 
             palette = "jco", addEllipses = TRUE,
             title = "PCA: Projection of Iris Data")

Interpretation:

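The first two principal components together explain about 95.8% of the variance (73.0% and 22.9%, respectively).
The plot shows how the species cluster in the PC1-PC2 plane: setosa forms a distinct group, while versicolor and virginica partly overlap.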
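
PCA is interpretable because each component is a weighted combination of the original features. The sketch below, assuming base R only, prints these weights (the loadings in the rotation element of the prcomp() result) and names the feature with the largest absolute weight on each of the first two components.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Loadings: each column holds the weights of the original features
# in one principal component
loadings <- pca$rotation
print(round(loadings[, 1:2], 3))

# Feature with the largest absolute weight on each of the first two components
top_feature <- sapply(1:2, function(j) {
  rownames(loadings)[which.max(abs(loadings[, j]))]
})
names(top_feature) <- colnames(loadings)[1:2]
print(top_feature)
```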

t-distributed Stochastic Neighbor Embedding (t-SNE)

Step 1: Apply t-SNE

By default, Rtsne() checks for duplicate rows (check_duplicates = TRUE) and stops with an error if it finds any, so duplicates are removed first.
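
Before removing anything, it helps to see how many rows are affected and to keep the species labels aligned with the retained rows. The following is a base-R sketch; the variable names (dup_flags, features_unique, species_unique) are illustrative.

```r
iris_scaled <- scale(iris[, -5])

# duplicated() flags every row that repeats an earlier row
dup_flags <- duplicated(iris_scaled)
print(sum(dup_flags))    # number of rows that would be dropped
print(which(dup_flags))  # their row indices

# Use the same logical index for features and labels so they stay aligned
features_unique <- iris_scaled[!dup_flags, , drop = FALSE]
species_unique <- iris$Species[!dup_flags]
stopifnot(nrow(features_unique) == length(species_unique))
```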

# Remove duplicate rows (important for t-SNE)
iris_scaled_unique <- unique(iris_scaled)

set.seed(26)  # For reproducibility
tsne_result <- Rtsne(iris_scaled_unique, dims = 2, perplexity = 30, verbose = TRUE)

## Performing PCA
## Read the 149 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.01 seconds (sparsity = 0.713752)!
## Learning embedding...
## Iteration 50: error is 44.961679 (50 iterations in 0.01 seconds)
## Iteration 100: error is 46.899944 (50 iterations in 0.01 seconds)
## Iteration 150: error is 45.562550 (50 iterations in 0.01 seconds)
## Iteration 200: error is 45.257420 (50 iterations in 0.01 seconds)
## Iteration 250: error is 44.318333 (50 iterations in 0.01 seconds)
## Iteration 300: error is 0.319061 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.154203 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.151281 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.149260 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.148146 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.145943 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.147368 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.146480 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.147560 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.145561 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.147101 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.147464 (50 iterations in 0.01 seconds)
## Iteration 900: error is 0.146714 (50 iterations in 0.01 seconds)
## Iteration 950: error is 0.147126 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.145092 (50 iterations in 0.01 seconds)
## Fitting performed in 0.19 seconds.

# Convert to a data frame, keeping only the Species labels of the retained rows
tsne_data <- data.frame(tsne_result$Y, Species = iris$Species[!duplicated(iris_scaled)])

# Rename columns
colnames(tsne_data) <- c("Dim1", "Dim2", "Species")

Step 2: Visualize t-SNE Results

ggplot(tsne_data, aes(x = Dim1, y = Dim2, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "t-SNE Visualization of Iris Dataset") +
  theme_minimal()

Interpretation:

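t-SNE forms well-separated clusters, but distances between clusters and the apparent cluster sizes in the embedding are not reliable indicators of global structure.
Because it is non-linear and emphasizes local neighborhoods, it is well suited to visualization tasks.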
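
One rough way to quantify how much global structure an embedding keeps is to compare pairwise distances before and after the reduction. The sketch below is an assumption of this write-up rather than part of the original analysis: distance_preservation() is a hypothetical helper, and it is demonstrated on a 2D PCA projection so that it runs with base R alone.

```r
# Spearman correlation between pairwise distances before and after embedding.
# Values near 1 indicate that the rank order of distances is largely preserved.
distance_preservation <- function(original, embedding) {
  stopifnot(nrow(original) == nrow(embedding))
  d_orig <- as.vector(dist(original))
  d_emb <- as.vector(dist(embedding))
  cor(d_orig, d_emb, method = "spearman")
}

X <- unique(scale(iris[, -5]))

# Illustrated with a 2D PCA projection, which needs only base R;
# a t-SNE layout such as tsne_result$Y could be passed in the same way
pca_2d <- prcomp(X)$x[, 1:2]
score <- distance_preservation(X, pca_2d)
print(score)
```

Applying the same helper to the t-SNE and UMAP layouts would give a single comparable number per method, which complements the visual comparison.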

Uniform Manifold Approximation and Projection (UMAP)

Step 1: Apply UMAP

set.seed(26)  # For reproducibility
umap_result <- umap(iris_scaled)

# Convert to DataFrame
umap_data <- data.frame(umap_result$layout, Species = iris$Species)
colnames(umap_data) <- c("Dim1", "Dim2", "Species")

Step 2: Visualize UMAP Results

ggplot(umap_data, aes(x = Dim1, y = Dim2, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "UMAP Visualization of Iris Dataset") +
  theme_minimal()

Interpretation:

UMAP is often faster than t-SNE, especially on larger datasets, and tends to preserve more of the global structure.
It is well suited to visualization and can serve as a preprocessing step for clustering.
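
UMAP results depend on its settings, which the umap package exposes through a configuration object. The sketch below assumes the umap package is installed; the values chosen for n_neighbors and min_dist are purely illustrative and not recommendations.

```r
X <- scale(iris[, -5])

if (requireNamespace("umap", quietly = TRUE)) {
  # Start from the package defaults and override selected settings
  config <- umap::umap.defaults
  config$n_neighbors <- 10   # illustrative value; smaller favors local detail
  config$min_dist <- 0.3     # illustrative value; larger spreads points out
  config$random_state <- 26  # fixes the layout across runs

  umap_tuned <- umap::umap(X, config = config)
  print(dim(umap_tuned$layout))
} else {
  message("Package 'umap' is not installed; skipping the example.")
}
```

Rerunning the plot with a few different settings is a simple way to check whether the apparent cluster structure is stable.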

Conclusion

**Pros**

PCA: Preserves variance, interpretable, linear
t-SNE: Strong cluster separation
UMAP: Fast, good for visualization and clustering

**Cons**

PCA: Not great for non-linear data
t-SNE: Computationally expensive, no global structure
UMAP: May require parameter tuning

Key Takeaways: PCA works well when the goal is to reduce dimensions while retaining as much variance as possible. t-SNE is great for visualizing clusters, though it does not reliably preserve global structure. UMAP is a fast and powerful method that often matches or outperforms t-SNE for visualization.

Research Questions Answers

How do PCA, t-SNE, and UMAP compare in terms of preserving variance and clustering patterns? This question is answered by the plots above: PCA retains about 95.8% of the variance in two components, while t-SNE and UMAP emphasize cluster separation.
Which method is best suited for visualization in 2D? If the goal is variance retention and feature importance, PCA is best. If the goal is strong cluster separation, t-SNE is ideal. If the goal is both cluster separation and global structure preservation, UMAP is the best choice.
How does dimensionality reduction affect clustering performance? Dimensionality reduction can impact clustering in several ways:

✅ Advantages:

Reduces noise, making clustering algorithms more effective.
Improves efficiency by reducing computational complexity.
Helps remove irrelevant features, enhancing cluster separability.

⚠️ Challenges:

PCA may cause loss of non-linear relationships.
t-SNE and UMAP can distort distances, affecting clustering accuracy.
Some clustering algorithms (e.g., k-means) rely on Euclidean distance, which may be misleading after dimensionality reduction.

Effect on the Iris dataset:

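PCA: K-means can struggle to separate versicolor and virginica, which overlap in the first two components.
t-SNE & UMAP: K-means may benefit from the tighter visual separation, although this should be confirmed with a clustering metric because these embeddings distort distances.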
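
The effect on k-means can be checked directly rather than judged from the plots. The sketch below is an illustration under stated assumptions: cluster_purity() is a hypothetical helper, three clusters are requested to match the three species, and only the scaled features and a 2D PCA projection are compared so that the example runs with base R.

```r
# Cluster purity: share of samples that belong to the majority species
# of their assigned cluster (1 means every cluster holds a single species)
cluster_purity <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / length(labels)
}

X <- scale(iris[, -5])
pca_2d <- prcomp(X)$x[, 1:2]

set.seed(26)  # For reproducibility
km_full <- kmeans(X, centers = 3, nstart = 25)
km_pca <- kmeans(pca_2d, centers = 3, nstart = 25)

purity <- c(
  scaled_features = cluster_purity(km_full$cluster, iris$Species),
  pca_2d = cluster_purity(km_pca$cluster, iris$Species)
)
print(round(purity, 3))
```

The t-SNE and UMAP layouts can be passed to kmeans() in the same way, which would put all three methods on a common scale.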

References

[1] Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine, 2(11), 559-572.

[2] van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.

[3] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.

Dimensional Reduction

Paula Fredrick Gwanchele

30-01-2025
