About Fashion MNIST:
* 70,000 grayscale images
* 10 distinct image categories
* each image is 28x28 pixels (784 features)
libraries used:
library(caret)
library(tidyr)
library(dplyr)
library(ggplot2)
library(gridExtra)

helper functions for visualizing images:
# display a single 28x28 image stored as a row of 784 pixel values
# (a label in a 785th position, if present, is dropped)
show_digit = function(arr784, col = gray(12:1 / 12), ...) {
  image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1],
        col = col, xaxt = "n", yaxt = "n", ...)
}
# load image files
# read an IDX image file into a data frame (one image per row, 784 pixel columns)
load_image_file = function(filename) {
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')          # magic number (discarded)
  n    = readBin(f, 'integer', n = 1, size = 4, endian = 'big')   # number of images
  nrow = readBin(f, 'integer', n = 1, size = 4, endian = 'big')   # rows per image
  ncol = readBin(f, 'integer', n = 1, size = 4, endian = 'big')   # columns per image
  x = readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE)
  close(f)
  data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE))
}
# load label files
# read an IDX label file into an integer vector
load_label_file = function(filename) {
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')   # magic number (discarded)
  n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')  # number of labels
  y = readBin(f, 'integer', n = n, size = 1, signed = FALSE)
  close(f)
  y
}

loading the data
# load images
train = load_image_file("/home/bonzilla/Documents/MSDS/fashion-mnist/data/fashion/train-images-idx3-ubyte")
test = load_image_file("/home/bonzilla/Documents/MSDS/fashion-mnist/data/fashion/t10k-images-idx3-ubyte")
# load labels
train_y = as.factor(load_label_file("/home/bonzilla/Documents/MSDS/fashion-mnist/data/fashion/train-labels-idx1-ubyte"))
test_y = as.factor(load_label_file("/home/bonzilla/Documents/MSDS/fashion-mnist/data/fashion/t10k-labels-idx1-ubyte"))

Visualize a single fashion MNIST image:
# view a single training image
show_digit(train[20000, ])
title('Example Image')

Visualize a sampling of Fashion MNIST images:
# plot the first 100 images from the training set in a 10 x 10 grid
num = 10
par(mfrow=c(num, num), mar=c(0, 0.2, 1, 0.2))
for (i in 1:(num*num)) {
show_digit(train[i, ])
}

Visualize each pixel’s mean value and standard deviation across all Fashion MNIST images in the training dataset. Additionally, we will visualize which pixels carry little information:
par(mfrow = c(1,2))
numcols = dim(train)[2]
train.ave = data.frame(pixel = apply(train, 2, mean))
show_digit( train.ave )
title( 'Pixel Mean Value' )
train.sd = data.frame(pixel = apply(train, 2, sd))
show_digit( train.sd )
title( 'Pixel Standard Deviation' )
par(mfrow = c(1,1))

The left figure above visualizes the mean pixel values across the train dataset; the right figure shows the standard deviation of each pixel. In both plots, high values are rendered dark and low values light. For many of the pixels, the mean value is intermediate (gray) while the standard deviation is relatively high (dark); these pixels account for most of the variance in the dataset. However, there are two regions of pixels towards the image center where the mean pixel value is high (dark) while the standard deviation is low (light); these pixels have consistently high values. Towards the periphery of the image there are pixels with both low mean values (light) and low standard deviations (light); these pixels have consistently low values.
Pixels with consistently low or high values carry little information and contribute little to image classification models; they add redundant features to the dataset. We can use PCA as a tool to reduce the dimensionality of the data so that we capture the maximum amount of variance with fewer features.
The following figure highlights which pixels have a mean value less than 5% of the data range:
train.avescale <- train.ave %>%
mutate( pixscale = pixel/255,
lowval = pixscale < 0.05 )
show_digit(train.avescale$lowval)

We have shown above that there is redundancy in the fashion MNIST dataset. Here we will use PCA to reduce the number of features while retaining as much of the variance as possible. PCA does this by finding a new set of axes aligned with the directions of maximum variance in the data. At heart, PCA is an eigendecomposition of the data's covariance matrix, which returns a set of eigenvectors and eigenvalues. The eigenvectors and eigenvalues describe the transformation from the original axes to the new feature space.
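To make the eigendecomposition view concrete, here is a minimal sketch on a small random matrix (an illustration only, not the Fashion MNIST data): prcomp() should recover the same variances and directions as an explicit eigendecomposition of the covariance matrix, up to the sign of the eigenvectors.

# Illustration: prcomp() vs. an explicit eigendecomposition of the covariance matrix
set.seed(1)
X   <- matrix(rnorm(50 * 4), ncol = 4)
pca <- prcomp(X)        # centers the data, then uses SVD internally
eig <- eigen(cov(X))    # eigendecomposition of the covariance matrix
all.equal(pca$sdev^2, eig$values)                       # PC variances equal the eigenvalues
all.equal(abs(unname(pca$rotation)), abs(eig$vectors))  # loadings equal the eigenvectors (up to sign)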
We can use the results of PCA to perform a type of information compression on the original data by keeping only a subset of the principal components used to describe it. For this analysis, we will use a criterion of 95% variance explained: from the 784 components that PCA yields, we will keep the minimum number of components needed such that the sum of the proportions of explained variance is greater than or equal to 95%. Such a manipulation is favorable because it reduces data redundancy, the chances of overfitting, and the time necessary to train models.
# use the prcomp function
train.pca <- prcomp( train )
# pca summary
sumTrain <- summary( train.pca )
#calculate total variance explained by each principal component
df_pca <- data.frame( t( sumTrain$importance ) )
df_pca$compnum <- 1:dim(train)[2]

How many components account for 95% of the variance in the data?
comp95 <- min(which(df_pca$Cumulative.Proportion>=0.95))
comp95

## [1] 187
Visualizing the cumulative explained variance described by the principal components with a scree plot:
p1 <- ggplot( df_pca, aes( x = compnum, y = Proportion.of.Variance ) ) +
geom_line() +
ylim( c(0,0.3) ) +
xlim( c(0,20)) +
geom_hline( yintercept = 0.01, linetype = 'dotted', col = 'red') +
annotate("text", x = 2, y = 0.01,
label = expression( "1%" ~ sigma), vjust = -0.5) +
theme_minimal() +
xlab( 'Principal Component Number' ) +
ylab( 'Proportion Explained Variance' ) +
ggtitle( 'Scree plot' )
p2 <- ggplot( df_pca, aes( x = compnum, y = Cumulative.Proportion ) ) +
geom_line() +
ylim( c(0,1.1) ) +
xlim( c(0,200)) +
geom_hline( yintercept = 0.95, linetype = 'dotted', col = 'red') +
annotate("text", x = 2, y = 0.98,
label = expression( "95%" ~ sigma), vjust = -0.5) +
theme_minimal() +
xlab( 'Principal Component Number' ) +
ylab( 'Cumulative Explained Variance' ) +
ggtitle( 'Cumulative Explained Variance' )
grid.arrange( p1, p2, ncol=2)

From the scree plot, we can see a very sharp drop-off in the proportion of explained variance. Principal components beyond the 12th each account for less than 1% of the dataset's variance, and the first 12 components account for a cumulative variance of only 0.74. It therefore takes the combined contribution of many more components (187) to explain 95% of the total variance of the original images.
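The figures quoted above (and the 0.47 used later for PC1 + PC2) can be read directly from df_pca; a quick sanity check, using the objects already defined:

# Sanity checks against df_pca (defined above)
max(which(df_pca$Proportion.of.Variance >= 0.01))  # last component explaining at least 1% of the variance
df_pca$Cumulative.Proportion[c(2, 12, comp95)]     # cumulative variance after PC2, PC12 and PC187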
Visualizing several PCA components:
par(mfrow = c(2,3))
show_digit(sumTrain$rotation[,1])
title(bquote('PC1: ' ~ .(round( df_pca$Proportion.of.Variance[1],3)*100) ~ '% explained variance'))
show_digit(sumTrain$rotation[,2])
title(bquote('PC2: ' ~ .(round( df_pca$Proportion.of.Variance[2],3)*100) ~ '% explained variance'))
show_digit(sumTrain$rotation[,3])
title(bquote('PC3: ' ~ .(round( df_pca$Proportion.of.Variance[3],4)*100) ~ '% explained variance'))
show_digit(sumTrain$rotation[,4])
title(bquote('PC4: ' ~ .(round( df_pca$Proportion.of.Variance[4],4)*100) ~ '% explained variance'))
show_digit(sumTrain$rotation[,392])
title(bquote('PC392: ' ~ .(round( df_pca$Proportion.of.Variance[392],6)*100) ~ '% explained variance'))
show_digit(sumTrain$rotation[,784])
title(bquote('PC784: ~' ~ .(df_pca$Proportion.of.Variance[784]) ~ '% explained variance'))
par(mfrow = c(1,1))

PCA returns a component for every dimension of the data, in descending order of the amount of variance accounted for. The figure above shows the first 4 components, component PC392 (middle), and the last component, PC784. As already depicted in the scree and cumulative explained variance plots, the first four components together explain roughly 58% of the variance, with the per-component contribution falling steeply from PC1 down to about 5% by PC4. The visualizations show that the first several PCs clearly discriminate between clothing categories. For instance, PC1 distinguishes between T-shirt/top & Pullover (dark/high values) and the shoe categories (light/low values), while PC2 appears to distinguish Trousers from the shoe categories. The representations become less clear as the explained variance decreases: PC392 and PC784 explain only 0.01% and 0.001% of the variance respectively, and it is not clear from the visualization what information these components represent.
PC1 and PC2 account for roughly half (0.47) of the train dataset’s variance. We can visualize the projections of the train data onto the feature space of these two components:
# category names, in label order (0-9)
label = c('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
train_pca_x <- data.frame(train.pca$x) %>%
select( c( 'PC1', 'PC2' ) ) %>%
mutate( labels = train_y,
labels_text = case_when( labels == 0 ~ 'T-shirt/top',
labels == 1 ~ 'Trouser',
labels == 2 ~ 'Pullover',
labels == 3 ~ 'Dress',
labels == 4 ~ 'Coat',
labels == 5 ~ 'Sandal',
labels == 6 ~ 'Shirt',
labels == 7 ~ 'Sneaker',
labels == 8 ~ 'Bag',
labels == 9 ~ 'Ankle boot') )
cat_mean <- train_pca_x %>%
group_by( labels_text ) %>%
summarise( PC1_mean = mean( PC1 ),
PC2_mean = mean( PC2 ) )
cat_mean

## # A tibble: 10 × 3
## labels_text PC1_mean PC2_mean
## <chr> <dbl> <dbl>
## 1 Ankle boot -558. -1101.
## 2 Bag 43.7 -880.
## 3 Coat 1059. -211.
## 4 Dress 283. 1043.
## 5 Pullover 862. -331.
## 6 Sandal -1499. -120.
## 7 Shirt 584. 39.7
## 8 Sneaker -1475. -404.
## 9 T-shirt/top 686. 610.
## 10 Trouser 15.7 1355.
ggplot( train_pca_x, aes( x = PC1, y = PC2, color = labels_text)) +
geom_point( size = 1 ) +
theme_classic() +
geom_text(data=cat_mean, aes( x = PC1_mean, y = PC2_mean, label = labels_text),
color = 'black', size = 5 ) +
guides(colour = guide_legend(override.aes = list(size=10))) +
ggtitle( 'Data Projections onto PC1 & PC2 feature space' )

The figure above shows the representations of the train images in the feature space of the first two components. Each clothing category is represented by a different color. For clarity, a black text label has been placed at the categorical mean of PC1 & PC2 for each class. We can see that there is noticeable separation across the categories. Additionally, some clustering of category means meets expectations. For example, the shoe categories (Sandal, Sneaker and Ankle boot) group together towards the lower left hand corner of the figure, and clothing items that could all be described as tops with sleeves (Pullover, Coat, Shirt & T-shirt/top) group together in the middle of the distribution. Trousers, on the other hand, sit a noticeable distance from the tops with sleeves but are contiguous with the Dress category, which shares a roughly vertical, rectangular profile.
Let us now visualize the image representations in the 2-dimensional PC1 and PC2 feature space:
# project onto the first 2 PCs, then map back to pixel space
trunc <- train.pca$x[,1:2] %*% t(train.pca$rotation[,1:2])
# and add the center (and re-scale) back to the data
if(!isFALSE(train.pca$scale)){
  trunc <- scale(trunc, center = FALSE, scale = 1/train.pca$scale)
}
if(!isFALSE(train.pca$center)){
  trunc <- scale(trunc, center = -1 * train.pca$center, scale = FALSE)
}
npics = 6
mpics = 2
par(mfrow=c(mpics, npics), mar=c(0, 0.2, 1, 0.2))
for (i in 1:(npics*mpics)) {
show_digit(train[i, ])
title(bquote('original image #' ~ .(i) ))
show_digit(trunc[i, ])
title(bquote('compressed image #' ~ .(i) ))
}

The figures above show the original images next to their projections into the compressed PC1+PC2 feature space. With only 2 components, the representation is not completely lost for all images. For instance, the images of pullovers (images #6 and #8) are both still recognizable as pullovers. However, the representations of the Dress images are confounded by PC2, which is tuned to discriminate Trousers, and many of the representations of shoes are quite ambiguous.
Clearly more features are necessary to represent the variance in the data.
Here we compute the projection onto the first 187 components, which were shown earlier to account for 95% of the variance in the image data:
# project onto the first comp95 (187) PCs, then map back to pixel space
trunc <- train.pca$x[,1:comp95] %*% t(train.pca$rotation[,1:comp95])
# and add the center (and re-scale) back to the data
if(!isFALSE(train.pca$scale)){
  trunc <- scale(trunc, center = FALSE, scale = 1/train.pca$scale)
}
if(!isFALSE(train.pca$center)){
  trunc <- scale(trunc, center = -1 * train.pca$center, scale = FALSE)
}
npics = 6
mpics = 2
par(mfrow=c(mpics, npics), mar=c(0, 0.2, 1, 0.2))
for (i in 1:(npics*mpics)) {
show_digit(train[i, ])
title(bquote('original image #' ~ .(i) ))
show_digit(trunc[i, ])
title(bquote('compressed image #' ~ .(i) ))
}

Truncating the data to 187 components does result in information loss; however, as we can see from the visualizations above, the images retain much of the detail of the originals while using a feature space only 23.85% (187/784) the size of the original.
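One rough way to quantify this information loss (a sketch, using the train and trunc objects defined above) is the mean squared difference per pixel between the original images and their 187-component reconstructions:

# Mean squared reconstruction error per pixel for the 187-component compression
recon_err <- mean((as.matrix(train) - trunc)^2)
recon_err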
We used PCA to reduce the dimensionality of the Fashion MNIST train dataset. Here we will evaluate the speed and performance of an SVM model fit to either the original 784-feature image set or the PCA-compressed dataset reconstructed from 187 components. The compressed dataset will undoubtedly train faster, but at what cost to prediction accuracy?
The following code was used to fit a Support Vector Machine classifier to a subset (10000 images) of the original data. To save time here, the model was previously fit and saved, and will simply be loaded for further analysis.
# use a subset of train to expedite SVM
sub_train <- train[1:10000,] %>%
mutate(labels = train_y[1:10000] )
start_time <- Sys.time()
svm_mdl <- train(labels~.,data=sub_train,
method="svmRadial",
trControl=trainControl(method="cv", number=3),
tuneGrid=data.frame(sigma = 0.01, C = 3.5),
verbose=TRUE)
end_time <- Sys.time()
print(svm_mdl)
run_time <- end_time - start_time
paste( 'run time: ', run_time )

The runtime using the original feature space was found to be roughly 11 minutes 10 seconds.
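The saved object loaded below was presumably produced with base R's save(); something along these lines, where the object and file names are assumptions inferred from the load() call that follows:

# Presumed saving step (assumption): store the fitted caret model under the name
# svm_mdl_full so that load("svm_mdl_full") later restores it under that name
svm_mdl_full <- svm_mdl
save(svm_mdl_full, file = "svm_mdl_full")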
Here we load the previously saved model:
svm_mdl_full <- load("svm_mdl_full")
print( get(svm_mdl_full) )

## Support Vector Machines with Radial Basis Function Kernel
##
## 10000 samples
## 784 predictor
## 10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 6667, 6666, 6667
## Resampling results:
##
## Accuracy Kappa
## 0.7334001 0.7038107
##
## Tuning parameter 'sigma' was held constant at a value of 0.01
## Tuning
## parameter 'C' was held constant at a value of 3.5
The following code was used to fit a Support Vector Machine classifier to a subset (10000 images) of the PCA-compressed data. Again, to save time, the model was previously fit and saved, and will be loaded for further analysis.
# use a subset of train to expedite SVM
sub_train_trunc <- data.frame( trunc[1:10000,] ) %>%
mutate(labels = train_y[1:10000] )
start_time2 <- Sys.time()
svm_mdl <- train(labels~.,data=sub_train_trunc,
method="svmRadial",
trControl=trainControl(method="cv", number=3),
tuneGrid=data.frame(sigma = 0.01, C = 3.5),
verbose=TRUE)
end_time2 <- Sys.time()
print(svm_mdl)
run_time2 <- end_time2 - start_time2
paste( 'run time: ', run_time2 )

The runtime using the PCA-compressed feature space was found to be roughly 9 minutes 54 seconds.
svm_mdl_trunc <- load("svm_mdl_trunc")
print( get(svm_mdl_trunc) )

## Support Vector Machines with Radial Basis Function Kernel
##
## 10000 samples
## 784 predictor
## 10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 6667, 6668, 6665
## Resampling results:
##
## Accuracy Kappa
## 0.7608989 0.7343526
##
## Tuning parameter 'sigma' was held constant at a value of 0.01
## Tuning
## parameter 'C' was held constant at a value of 3.5
test data

We can now evaluate the performance of the models trained on the full feature space and on the PCA-compressed feature space using the test data. First, we need to preprocess the test data for PCA in the same way that the train data was processed.
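For reference, a common alternative (not used here) would be to keep the PCA basis fit on the training data and simply project the test images onto it; a minimal sketch, assuming train.pca and comp95 from above:

# Alternative sketch (assumption, not what the analysis below does): rotate the test
# images into the training PC space, truncate to comp95 components, and reconstruct.
test_scores <- predict(train.pca, newdata = test)
test_recon  <- test_scores[, 1:comp95] %*% t(train.pca$rotation[, 1:comp95])
test_recon  <- scale(test_recon, center = -1 * train.pca$center, scale = FALSE)  # undo centering

The code below instead refits prcomp on the test set, mirroring the steps applied to train: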
test.pca <- prcomp( test )
test_trunc <- test.pca$x[,1:comp95] %*% t(test.pca$rotation[,1:comp95])
# and add the center (and re-scale) back to the data
if(!isFALSE(test.pca$scale)){
  test_trunc <- scale(test_trunc, center = FALSE, scale = 1/test.pca$scale)
}
if(!isFALSE(test.pca$center)){
  test_trunc <- scale(test_trunc, center = -1 * test.pca$center, scale = FALSE)
}

pred <- predict( get(svm_mdl_full), test)
#test$prediction <- pred
pred_trunc <- predict( get(svm_mdl_trunc), test_trunc)
#test_trunc$prediction <- pred_trunc

Confusion Matrix for SVM Classification with the full feature space:
tbl_full <- table(Label = test_y[1:10000], Prediction = pred)
print(tbl_full)

## Prediction
## Label 0 1 2 3 4 5 6 7 8 9
## 0 562 0 8 26 2 0 98 0 304 0
## 1 8 877 1 17 1 0 2 0 94 0
## 2 1 0 629 8 64 0 61 0 237 0
## 3 13 0 11 767 34 0 37 0 138 0
## 4 0 0 89 29 643 0 83 0 156 0
## 5 0 0 0 0 0 574 0 11 412 3
## 6 66 0 62 22 55 0 520 0 275 0
## 7 0 0 0 0 0 53 0 735 141 71
## 8 0 0 0 1 2 3 3 0 991 0
## 9 0 0 0 0 0 4 0 14 507 475
cat("The model is ", 100 * sum(diag(tbl_full)) / nrow(test), "% accurate.", sep = "")## The model is 67.73% accurate.
Confusion Matrix for SVM Classification with the PCA-compressed feature space:
tbl_trunc <- table(Label = test_y[1:10000], Prediction = pred_trunc)
print(tbl_trunc)

## Prediction
## Label 0 1 2 3 4 5 6 7 8 9
## 0 591 0 6 31 2 0 87 0 283 0
## 1 7 867 3 20 0 0 1 0 102 0
## 2 2 0 627 8 55 0 59 0 249 0
## 3 15 0 8 755 28 1 43 0 150 0
## 4 0 0 98 27 612 0 75 0 188 0
## 5 0 0 0 0 0 622 0 20 356 2
## 6 75 0 72 24 51 0 509 0 269 0
## 7 0 0 0 0 0 42 0 804 85 69
## 8 0 0 1 2 2 2 1 2 990 0
## 9 0 0 0 0 0 7 0 15 383 595
cat("The model is ", 100 * sum(diag(tbl_trunc)) / nrow(test), "% accurate.", sep = "")## The model is 69.72% accurate.
PCA is a dimensionality reduction method that can be applied to high dimensional datasets: it reduces the number of features while preserving as much of the data's variance as possible. Here we used PCA to reduce the number of features of the Fashion MNIST dataset from 784 to 187. We showed through a series of visualizations that transforming the images to the reduced feature space incurs a noticeable loss of image quality, although the gist of the images is still present.

Furthermore, we used SVM model fits to evaluate performance with the reduced-dimension PCA version of Fashion MNIST. A radial SVM classifier was trained on a subset of 10,000 images for both the full-featured set and the PCA-compressed set. The PCA-compressed model trained approximately 1 minute and 20 seconds faster than the model fit to the full-featured data. On its face, this difference does not sound appreciable. However, since training time for SVMs scales roughly quadratically with dataset size, one would expect the gap to grow considerably had the entire train set been used for model training (scaling from 10,000 to the full 60,000 images would be expected to increase training time by roughly a factor of 36). Furthermore, we evaluated classification accuracy and found that the PCA-compressed SVM model had a higher prediction accuracy on the test data. In conclusion, use of the PCA-compressed dataset would be beneficial in downstream modeling analysis. As a drawback, additional steps would be necessary to transform the data between the full and compressed feature spaces to facilitate model interpretation.