Introduction

Dimension reduction is a key technique in data analysis, and it is particularly useful for visualizing high-dimensional data. In this post, we explore one of the most popular dimension reduction techniques, Principal Component Analysis (PCA), applied to the MNIST dataset.

Data Preparation

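Demonstrating dimension reduction on the MNIST dataset in R takes a bit more preparation because of the size and complexity of the data. The full MNIST dataset consists of 70,000 images of handwritten digits (0-9). Each image is a 28 x 28 pixel grayscale picture, so every image is described by 784 pixel values. We’ll use PCA to reduce these 784 dimensions and visualize the data. The code below assumes that train.csv holds the digit label in its first column and the pixel values in the remaining columns: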

library(caret) # provides nearZeroVar()

mnist_data <- read.csv('train.csv')
x_train <- mnist_data[, -1] # pixel columns
y_train <- mnist_data[, 1]  # digit labels
x_train <- as.matrix(x_train) # To be used with prcomp function
nzv <- nearZeroVar(x_train, saveMetrics = TRUE)
x_train <- x_train[, !nzv$zeroVar] # remove zero-variance features (use !nzv$nzv to also drop near-zero-variance ones)
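
To make the filtering step concrete, here is a minimal base-R sketch of the same idea on a toy matrix. The matrix toy and its column names are made up for illustration, and the check uses plain column variances rather than caret, so it is a stand-in for the zeroVar filter above and not a reproduction of nearZeroVar:

```r
# Toy stand-in for the pixel matrix (hypothetical data):
# 5 "images" with 4 "pixels", two of which never change.
toy <- cbind(
  p1 = c(0, 0, 0, 0, 0),      # constant, carries no information
  p2 = c(0, 12, 255, 3, 40),
  p3 = c(7, 7, 7, 7, 7),      # constant, carries no information
  p4 = c(1, 0, 0, 9, 2)
)

# Variance of every column; constant columns have variance exactly 0
col_var <- apply(toy, 2, var)

# Keep only the columns that actually vary
keep <- col_var > 0
toy_reduced <- toy[, keep, drop = FALSE]

print(colnames(toy_reduced))
print(dim(toy_reduced))
```

Constant columns contribute nothing to any principal component, and they make prcomp stop with an error if scale. = TRUE is requested, so dropping them first is a cheap safeguard.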

Dimension Reduction Techniques

Now, let’s see the technique applied to the dataset:

Principal Component Analysis

PCA replaces the original variables with a smaller number of uncorrelated components while preserving as much of the variance in the data as possible. With fewer dimensions, the data become easier to plot, so patterns and clusters can be identified more easily.

pca_result <- prcomp(x_train, center = TRUE)
#summary(pca_result)
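
Once a PCA is fitted, two things are usually worth checking: how much variance the leading components retain, and whether the first two components separate the classes. The sketch below shows both on R's built-in iris data, used here only as a small stand-in for MNIST. The names pca_demo, var_explained, cum_var and scores are illustrative and are not part of the code above:

```r
# Fit PCA on the four numeric iris columns (stand-in for the pixel matrix)
pca_demo <- prcomp(iris[, 1:4], center = TRUE)

# Proportion of variance explained by each component:
# squared standard deviations divided by their total
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Scores of every observation on the first two components
scores <- pca_demo$x[, 1:2]

# Scatter plot of PC1 vs PC2, coloured by class label
plot(scores,
     col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Observations projected onto the first two PCs")
```

With the objects from this post, pca_result would take the place of pca_demo and y_train would supply the colours.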

There are many PCs, so let’s look at how much each variable (pixel) contributes to just the first three of them:

library(factoextra) # provides fviz_contrib()
library(gridExtra)  # provides grid.arrange()

PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1, fill = "#990066", color = "#990066")
PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2, fill = "#990066", color = "#990066")
PC3 <- fviz_contrib(pca_result, choice = "var", axes = 3, fill = "#990066", color = "#990066")
grid.arrange(PC1, PC2, PC3, ncol = 3)
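
For readers who want to see what these contribution bars represent, here is a hedged base-R sketch. It assumes that, for a prcomp fit, the percentage contribution of a variable to a single component reduces to its squared loading times 100. That is my reading of how factoextra derives the values, not something stated in this post. The built-in USArrests data and the names pca_demo, loadings and contrib_pct are used only for illustration:

```r
# Small built-in dataset as a stand-in for the pixel matrix
pca_demo <- prcomp(USArrests, center = TRUE)

# Each column of the rotation matrix is a unit-length loading vector
loadings <- pca_demo$rotation

# Assumed contribution formula: squared loading, expressed in percent
contrib_pct <- 100 * loadings^2

# Contributions to any one component add up to 100 percent
print(colSums(contrib_pct))

# Variables ranked by their contribution to the first component
print(sort(contrib_pct[, "PC1"], decreasing = TRUE))
```

Because every loading vector has length one, the contributions to a component always sum to 100, which is why the bars can be read as shares of that component.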