Datasets can start with a single feature and grow to include many features, most of which may have minimal importance to the overall dataset.
Dimensionality reduction is a technique in data science and machine learning used to reduce the number of features (or dimensions) in a dataset while preserving as much relevant information as possible.
High-dimensional datasets can be hard to visualize, computationally expensive to work with, and prone to noise and redundancy among features. Dimensionality reduction helps to simplify such datasets so that they are easier to visualize, model, and interpret, while keeping the structure that matters.
To explore dimensionality reduction, we can use the classic Iris dataset, which contains measurements of 150 iris flowers from three different species: setosa, versicolor, and virginica.
We would like to see whether there is information in the data that enables us to distinguish between the three species based on the measurements. We can start by visualizing some of the features against each other.
# Magrittr Library for piping operations
library(magrittr)
# Visualization
library(ggplot2)
# Manipulating data frames
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris %>% head
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We can see that the iris dataset has four numeric features. Let’s try plotting some of them against each other, for example Sepal.Length vs Sepal.Width:
iris %>% ggplot() + geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species))
Let’s plot Petal.Length vs Petal.Width:
iris %>% ggplot() + geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species))
This scatter plot already shows some clear separation between species based on just two features. Dimensionality reduction techniques help us find such informative combinations automatically when we have many features.
It does look as if we should be able to distinguish the species: setosa stands out in both plots, while versicolor and virginica overlap in the first plot but are much easier to tell apart in the second.
Principal Component Analysis (PCA) is a widely used unsupervised dimensionality reduction technique. It transforms a high-dimensional dataset into a lower-dimensional space by identifying new axes—called principal components—that capture the maximum variance in the data.
PCA maps your data from one vector space to another of the same dimensionality. While it doesn’t explicitly reduce the number of dimensions, it reorients the coordinate system so that the first axis (PC1) points in the direction of greatest variance in the data, the second axis (PC2) captures the next-largest variance while being orthogonal to the first, and so on.
By selecting the top principal components, we can project the original data into a lower-dimensional space while preserving as much meaningful structure as possible.
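To make the idea concrete, here is a small illustrative sketch, separate from the iris analysis, using made-up toy data (the names toy and toy_pca are only for demonstration). For two strongly correlated variables, the first principal component lines up with their shared direction of variation and captures almost all of the variance:

# Toy example: two strongly correlated variables
set.seed(1)
x <- rnorm(200)
toy <- data.frame(x = x, y = 2 * x + rnorm(200, sd = 0.5))
toy_pca <- prcomp(toy)
# Proportion of variance captured by each component;
# PC1 should account for nearly all of it
toy_pca$sdev^2 / sum(toy_pca$sdev^2)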
PCA only works on numerical data, so we need to remove the Species column before applying the transformation. We can use the prcomp() function in R to perform PCA:
pca <- iris %>% select(-Species) %>% prcomp
pca
## Standard deviations (1, .., p=4):
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
The output of prcomp() provides two key pieces of information:
Standard deviations:
[1] 2.0562689 0.4926162 0.2796596 0.1543862
These standard deviations correspond to the principal components (PC1, PC2, PC3, PC4) and are listed in descending order of variance explained.
Interpretation:
PC1: 2.056 – captures the most variance
PC2: 0.493 – captures less variance than PC1
PC3: 0.280 – captures even less
PC4: 0.154 – captures the least variance
Variance explained:
To calculate the variance for each component, we square the standard deviations.
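Using the pca object fitted above, we can do this directly in R; summary(pca) reports the same proportions:

# Variance of each component = squared standard deviation
pca$sdev^2
# Proportion of the total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
prop_var
# Cumulative proportion of variance explained by the first k components
cumsum(prop_var)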
Insight: PC1 captures around 92.4% of the variance. PC1 and PC2 together account for ~97.7%, meaning most of the information in the data can be represented using just the first two principal components.
Rotation:
PC1 PC2 PC3 PC4
Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
The rotation matrix shows the loadings (or weights) of each original variable on each principal component.
PC1: dominated by Petal.Length (loading 0.86), with smaller positive contributions from Sepal.Length and Petal.Width. Meaning: PC1 mostly reflects overall flower size, especially petal features.
PC2: loads mainly on Sepal.Width (-0.73) and Sepal.Length (-0.66). Meaning: PC2 is mainly about sepal dimensions, contrasting with PC1.
PC3: contrasts Sepal.Length (0.58) against Sepal.Width (-0.60) and Petal.Width (-0.55). Meaning: captures an interaction involving sepal width and petal width.
PC4: contrasts Petal.Width (0.75) against Petal.Length (-0.48). Meaning: highlights a detailed contrast between petal length and width, though it contributes very little variance (~0.5%).
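If we want to check this reading of the loadings programmatically, one quick sketch is to order each component's loadings by absolute size using the rotation matrix stored in the pca object:

# Loadings of the original variables on PC1, ordered by absolute size
sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)
# The same view for PC2 confirms that the sepal measurements dominate there
sort(abs(pca$rotation[, "PC2"]), decreasing = TRUE)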
To map the data to the new space spanned by the principal components, we use the predict() function. It projects data onto the principal components defined by the PCA model: specifically, it transforms a dataset (which may be the original data or new data) into the lower-dimensional space defined by the principal components (PCs) learned during the PCA computation.
mapped_iris <- pca %>% predict(iris)
mapped_iris %>% head
## PC1 PC2 PC3 PC4
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
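To see what predict() is doing here, we can reproduce the projection by hand. This is a small check assuming the default prcomp() settings used above (centering on, no scaling); the manual object below exists only for this comparison. predict() subtracts the column means stored in pca$center and multiplies by the rotation matrix:

# Centre the numeric columns and project them onto the principal components
manual <- scale(as.matrix(iris[, 1:4]), center = pca$center, scale = FALSE) %*% pca$rotation
# Should match predict(pca, iris) up to row and column names
all.equal(unname(manual), unname(mapped_iris))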
Remember, we removed the Species column before performing PCA. We can now add it back and plot PC1 against PC2.
Plotting PC1 against PC2 visualizes the Iris dataset in a 2D space, with each point representing an iris sample projected onto the two principal components that capture 97.7% of the variance.
mapped_iris %>% as.data.frame %>% cbind(Species = iris$Species) %>%
ggplot() +
geom_point(aes(x = PC1, y = PC2, colour = Species))
Principal Component Analysis (PCA) is applied to the numerical features of the Iris dataset (excluding the Species column) to reduce dimensionality while retaining most of the data’s variance.
The rotation matrix (also known as the loading matrix) shows how much each original feature contributes to each principal component.
Using the predict() function, we project the original data into the new principal component space.
By plotting PC1 vs PC2, we can visualize the Iris dataset in 2D, where each point represents a sample, and natural clustering by species often becomes apparent.
We can now see a clear distinction between the species.