Datasets can start with a single feature and grow to include many features, most of which may have minimal importance to the overall dataset.
Dimensionality reduction is a technique in data science and machine learning used to reduce the number of features (or dimensions) in a dataset while preserving as much relevant information as possible.
High-dimensional datasets can be hard to visualize, computationally expensive to work with, and prone to noise and redundancy among features. Dimensionality reduction helps to simplify such datasets so that they are easier to visualize, model, and interpret, while keeping the structure that matters.
To explore dimensionality reduction, we can use the classic Iris dataset, which contains measurements of 150 iris flowers from three different species: setosa, versicolor, and virginica.
We would like to see whether there is information in the data that enables us to distinguish between the three species based on the measurements. We can start by visualizing some of the features against each other.
# Magrittr Library for piping operations
library(magrittr)
# Visualization
library(ggplot2)
# Manipulating data frames
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris %>% head
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We can see that the iris dataset has four numeric features. Let’s try plotting some of them against each other, for example Sepal.Length vs Sepal.Width:
iris %>% ggplot() + geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species))
Let’s plot Petal.Length vs Petal.Width:
iris %>% ggplot() + geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species))
This scatter plot already shows some clear separation between species based on just two features. Dimensionality reduction techniques help us find such informative combinations automatically when we have many features.
It does look as if we should be able to distinguish the species: setosa stands out in both plots, while versicolor and virginica overlap in the first plot but are much easier to tell apart in the second.
Principal Component Analysis (PCA) is a widely used unsupervised dimensionality reduction technique. It transforms a high-dimensional dataset into a lower-dimensional space by identifying new axes—called principal components—that capture the maximum variance in the data.
PCA maps your data from one vector space to another of the same dimensionality. While it doesn’t explicitly reduce the number of dimensions, it reorients the coordinate system so that the first axis (PC1) points in the direction of greatest variance in the data, the second axis (PC2) captures the next-largest variance while being orthogonal to the first, and so on.
By selecting the top principal components, we can project the original data into a lower-dimensional space while preserving as much meaningful structure as possible.
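To make the idea concrete, here is a small illustrative sketch, separate from the iris analysis, using made-up toy data (the names toy and toy_pca are only for demonstration). For two strongly correlated variables, the first principal component lines up with their shared direction of variation and captures almost all of the variance:

# Toy example: two strongly correlated variables
set.seed(1)
x <- rnorm(200)
toy <- data.frame(x = x, y = 2 * x + rnorm(200, sd = 0.5))
toy_pca <- prcomp(toy)
# Proportion of variance captured by each component;
# PC1 should account for nearly all of it
toy_pca$sdev^2 / sum(toy_pca$sdev^2)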
PCA only works on numerical data, so we need to remove the Species column before applying the transformation. We can use the prcomp() function in R to perform PCA:
pca <- iris %>% select(-Species) %>% prcomp
pca
## Standard deviations (1, .., p=4):
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
The output of prcomp() provides two key pieces of information:
Standard deviations:
[1] 2.0562689 0.4926162 0.2796596 0.1543862
These standard deviations correspond to the principal components (PC1, PC2, PC3, PC4) and are listed in descending order of variance explained.
Interpretation:
PC1: 2.056 – captures the most variance
PC2: 0.493 – captures less variance than PC1
PC3: 0.280 – captures even less
PC4: 0.154 – captures the least variance
Variance explained:
To calculate the variance for each component, we square the standard deviations.
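Using the pca object fitted above, we can do this directly in R; summary(pca) reports the same proportions:

# Variance of each component = squared standard deviation
pca$sdev^2
# Proportion of the total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
prop_var
# Cumulative proportion of variance explained by the first k components
cumsum(prop_var)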
Insight: PC1 captures around 92.4% of the variance. PC1 and PC2 together account for ~97.7%, meaning most of the information in the data can be represented using just the first two principal components.
Rotation:
PC1 PC2 PC3 PC4
Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
The rotation matrix shows the loadings (or weights) of each original variable on each principal component.
PC1: dominated by Petal.Length (loading 0.86), with smaller positive contributions from Sepal.Length and Petal.Width. Meaning: PC1 mostly reflects overall flower size, especially petal features.
PC2: loads mainly on Sepal.Width (-0.73) and Sepal.Length (-0.66). Meaning: PC2 is mainly about sepal dimensions, contrasting with PC1.
PC3: contrasts Sepal.Length (0.58) against Sepal.Width (-0.60) and Petal.Width (-0.55). Meaning: captures an interaction involving sepal width and petal width.
PC4: contrasts Petal.Width (0.75) against Petal.Length (-0.48). Meaning: highlights a detailed contrast between petal length and width, though it contributes very little variance (~0.5%).
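If we want to check this reading of the loadings programmatically, one quick sketch is to order each component's loadings by absolute size using the rotation matrix stored in the pca object:

# Loadings of the original variables on PC1, ordered by absolute size
sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)
# The same view for PC2 confirms that the sepal measurements dominate there
sort(abs(pca$rotation[, "PC2"]), decreasing = TRUE)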
To map the data to the new space spanned by the principal components, we use the predict() function. It projects data onto the principal components defined by the PCA model: specifically, it transforms a dataset (which may be the original data or new data) into the lower-dimensional space defined by the principal components (PCs) learned during the PCA computation.
mapped_iris <- pca %>% predict(iris)
mapped_iris %>% head
## PC1 PC2 PC3 PC4
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
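To see what predict() is doing here, we can reproduce the projection by hand. This is a small check assuming the default prcomp() settings used above (centering on, no scaling); the manual object below exists only for this comparison. predict() subtracts the column means stored in pca$center and multiplies by the rotation matrix:

# Centre the numeric columns and project them onto the principal components
manual <- scale(as.matrix(iris[, 1:4]), center = pca$center, scale = FALSE) %*% pca$rotation
# Should match predict(pca, iris) up to row and column names
all.equal(unname(manual), unname(mapped_iris))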
Remember, we removed the Species column before performing PCA. We can now add it back and plot PC1 against PC2.
Plotting PC1 against PC2 visualizes the Iris dataset in a 2D space, with each point representing an iris sample projected onto the two principal components that capture 97.7% of the variance.
mapped_iris %>% as.data.frame %>% cbind(Species = iris$Species) %>%
ggplot() +
geom_point(aes(x = PC1, y = PC2, colour = Species))
Principal Component Analysis (PCA) is applied to the numerical features of the Iris dataset (excluding the Species column) to reduce dimensionality while retaining most of the data’s variance.
The rotation matrix (also known as the loading matrix) shows how much each original feature contributes to each principal component.
Using the predict() function, we project the original data into the new principal component space.
By plotting PC1 vs PC2, we can visualize the Iris dataset in 2D, where each point represents a sample, and natural clustering by species often becomes apparent.
We can now see a clear distinction between the species.