STAT 218 Final: MNIST Image Classification

Author

Sarah Capute and Dan Urchuk

Code

# Load Packages
library(tidyverse)
library(caret)
library(janitor)
library(dslabs)
library(e1071)
library(nnet)
library(data.table)
library(patchwork)
library(irr)
library(stats)

Can we use supervised learning to accurately classify handwritten single-digit numbers?

Introduction

Our project attempts to classify handwritten numbers using two different trained statistical learning algorithms: a support vector machine (SVM) and a random forest. Our data comes from the famous MNIST dataset, a collection of 70,000 handwritten single-digit numbers split into 60,000 training images and 10,000 testing images. The MNIST dataset is a re-shuffled version of the original NIST dataset, whose train and test data were inherently different; train data was the handwriting of U.S government employees while test data was the handwriting of American high school students—the former was probably more legible than the latter. The MNIST dataset shuffles these images so the train and test data both have samples from each group, and also introduces gray-scale values to better train the model to recognize patterns in the digits (LeCun, https://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf).

We will be evaluating the efficacy of our models using error rate, kappa values, and confusion matrices. Error rate is a percentage of how many predictions a model gets wrong. LeCun used an SVM to achieve an error rate of 0.8—this will be our target (https://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf). Kappa values measure how well a model’s predictions are compared to random guessing. A kappa value of 0.4 means the model is performing 40% better than random guessing. We believe kappa values in the range of 0.5-0.7 will be a solid indication our models are doing a good job at classifying handwritten numbers. Confusion matrices compare model predictions against actual results. For our purposes, we will be using confusion matrices to check our models’ predictions on the MNIST test data versus the actual labels assigned to the data we fed our models. Confusion matrices will show us where our models are particularly struggling. For instance, we can see if our model has a really hard time distinguishing between fours and nines.

MNIST data is in the form of 28x28 pixel images. Images are stored as vectors with 784 variables, each representing the grayscale value in that given pixel. We also mutated the test and train data to include the number label of each image.

Code

set.seed(1)

mnist <- read_mnist(
  path = NULL,
  download = FALSE,
  destdir = tempdir(),
  url = "https://www2.harvardx.harvard.edu/courses/IDS_08_v2_03/",
  keep.files = TRUE
)

mnist_train <- cbind(mnist$train$images, label=mnist$train$labels)
mnist_test <- cbind(mnist$test$images, label=mnist$test$labels)

subset_mnist_train <- as.data.frame(mnist_train) |>
  sample_n(2000)

SVM Model

A support vector machine (SVM) is a supervised learning model used to classify observations into classes/groups based on their variables, or feature values. SVMs are one of the most studied models and are very effective at classification—we believe it will do a good job classifying handwritten numbers. At its most basic level, an SVM is used to classify observations in a two-dimensional space. In two dimensions, an SVM finds a line that best separates data into groups based on their feature values. The “best” fit of this line is found by maximizing the distance between the line and the nearest data points, also known as maximizing the “margin” of the line.

Example two-dimensional SVM here: https://drive.google.com/file/d/14ii3Rh-rm_a42-AVccQ71X5bAJamAE9F/view?usp=sharing

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: With Applications in R. New York, NY: Springer US Springer, 2021.

An SVM can operate in n dimensions, dividing classifications with a hyperplane of dimension n-1. Observations that exist in a three-dimensional space (i.e. have three feature values) will be divided into groups using a plane, observations with four feature values will be divided by a three-dimensional hyperplane, and so on.

A linear support vector classifier may not always be the best way to separate observations. In fact, this would often be too simple for models attempting to fit real world data. When this is the case, the “kernel trick” is implemented to create more complex separations of data, like a polynomial curve, for example. This works by performing a nonlinear transformation on the data to change it into a form in which an SVM can be more successful. If a polynomial is the best fit for the separation of data, the model will find the “best” polynomial fit, then orthogonally project each point onto that curve. This transformation is then used to re-graph the points so that a linear SVM can be implemented in separation.

Since our dataset contains 10 classification groups (one for each digit) and 784 predictor variables (one for each pixel in a 28x28 image), it’s harder to visualize how our SVM will work compared to the two-dimensional example provided above. Imagine, for the sake of this project, certain numbers will generally have clusters around the same pixels represented by this 784-dimensional space. For instance, a one will almost always occupy a vertical strip of pixels in the middle of an image. While a lot of numbers will overlap in the areas of the image in which they occupy, each has distinct patterns that the SVM can pick up on in order to make accurate classifications. A three and an eight occupy a lot of the same space, the space an eight occupies that a three doesn’t will be the key factor for our SVM.

To construct our SVM, we will use the “svm” function from the “e1071” library. This function is very effective and requires little tuning, although we have decided to use a polynomial kernel in our model because it will fit our data the best.

Code

set.seed(1)

#| warning: false
svm_fit <- svm(factor(label) ~., data = as.data.frame(mnist_train), 
               kernel = "polynomial")

Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
Variable(s) 'V1' and 'V2' and 'V3' and 'V4' and 'V5' and 'V6' and 'V7' and 'V8'
and 'V9' and 'V10' and 'V11' and 'V12' and 'V17' and 'V18' and 'V19' and 'V20'
and 'V21' and 'V22' and 'V23' and 'V24' and 'V25' and 'V26' and 'V27' and 'V28'
and 'V29' and 'V30' and 'V31' and 'V32' and 'V53' and 'V54' and 'V55' and 'V56'
and 'V57' and 'V58' and 'V83' and 'V84' and 'V85' and 'V86' and 'V112' and
'V113' and 'V141' and 'V142' and 'V169' and 'V477' and 'V561' and 'V645' and
'V646' and 'V672' and 'V673' and 'V674' and 'V700' and 'V701' and 'V702' and
'V728' and 'V729' and 'V730' and 'V731' and 'V755' and 'V756' and 'V757' and
'V758' and 'V759' and 'V760' and 'V781' and 'V782' and 'V783' and 'V784'
constant. Cannot scale data.

Code

svm_fit


Call:
svm(formula = factor(label) ~ ., data = as.data.frame(mnist_train), 
    kernel = "polynomial")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  polynomial 
       cost:  1 
     degree:  3 
     coef.0:  0 

Number of Support Vectors:  7750

Code

label_pred = predict(svm_fit, mnist$test$images)

#Predictions
mnist_pred <- as.data.frame(mnist_test) |>
  mutate(pred=label_pred) |>
  mutate(correct = if_else(label==pred, 1, 0)) |>
  mutate(id = row_number())

#Accuracy of SVM Model
svm_accuracy <- mnist_pred |>
  summarize(accuracy = mean(correct))

print(svm_accuracy)

  accuracy
1   0.9791

Code

svm_error <- 100*(1-svm_accuracy)

print(svm_error)

  accuracy
1     2.09

Code

#Kappa Value of SVM
svm_kappa <- kappa2(mnist_pred |>
                      select(label, pred))
print(svm_kappa)

 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 10000 
   Raters = 2 
    Kappa = 0.977 

        z = 293 
  p-value = 0

Based on error rate and kappa value, our SVM was very effective. When training our model on all 60,000 observations in the train data, we were able to produce an error rate of 2.1 ((1-0.9791)*100) and a kappa value of 0.977. While our error rate is marginally worse than that achieved by LeCun, our kappa value vastly exceeded our expectations. Our model does significantly better than random guessing, which shows an SVM is effective in this type of image classification.

Code

#See which numbers are wrong more
svm_wrong <- mnist_pred |>
  filter(correct ==0) |>
  group_by(label) |>
  summarize(count=n()) |>
  arrange(count)

#SVM Confusion Matrix
table(Real = mnist_pred$label, Pred = mnist_pred$pred)

    Pred
Real    0    1    2    3    4    5    6    7    8    9
   0  972    0    1    0    0    3    1    0    2    1
   1    0 1124    2    1    1    2    3    0    2    0
   2    7    0 1006    0    2    1    5    7    3    1
   3    0    2    1  987    0    6    0    5    6    3
   4    2    0    2    0  965    0    3    1    0    9
   5    2    0    0    7    1  870    3    0    4    5
   6    4    5    1    0    3    6  937    0    2    0
   7    0    9    9    2    2    0    0 1000    0    6
   8    5    0    1    2    4    4    1    4  950    3
   9    3    5    0    3    9    4    1    1    3  980

Our confusion matrix is pretty symmetrical as well, once again indicating our SVM was pretty accurate in classifying the images we fed it. We created the confusion matrix to see whether there were any classification mistakes our model consistently made, for instance thinking fours are nines. However, there really isn’t anything egregious here—that’s a really good sign for our model.

Image Visualization for SVM Mistakes

Code

plots <- list()
image_data <- mnist_pred

set.seed(1)

id_pred  <- mnist_pred |>
  mutate(id = row_number()) |>
  filter(pred != label) |>
  group_by(pred) |>
  slice_sample(n = 5) |>
  select(id, pred)

row <- 1

for (k in id_pred$id) {
  vals <- image_data |>
    dplyr::slice(k) |>
    dplyr::select(-id, -correct, -pred, -label) |>
    pivot_longer(everything()) |>
    pull(value)
  
  my_data <- data.frame(pixel_val = vals,
                        pixel_num = 1:784,
                        col = rep(1:28, times = 28),
                        row = -rep(1:28, each = 28))
  
  p <- my_data |>
    mutate(pred = id_pred |> filter(id == k) |> pull(pred),
           id = k)
  
  plots[[row]] <- p
  
  row <- row + 1
}

final_data <- rbindlist(plots)

final_data |>
  arrange(pred) |>
  mutate(row_order = rep(1:5, each = 784, times = 10)) |>
  ggplot() +
  geom_tile(aes(x = col, y = row, fill = pixel_val)) +
  facet_grid(row_order~pred) +
  theme(legend.position = "none") +
  labs(x = "", y = "", title = "SVM Incorrect Predictions") +
  theme(axis.text.x = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(strip.background.y = element_blank()) +
  theme(strip.text.y = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(strip.text.x = element_text(size = 15))

We decided to include some samples of images our SVM got wrong. Each column heading displays the number our SVM guessed for the five images shown below it. There are some obvious mistakes, but for the most part, an onlooker can understand why our model made mistakes on most of these images. For instance, take the images that were predicted to be fours. The bottom three, which we assume are sevens, have convex top lines that make them eerily similar to—albeit poorly written—fours.

Random Forest

Random forests are another learning model that can be used effectively for classification. Random forests build a number of decision trees based on bootstrapped samples of training data. Each time a split is considered in an individual tree, that tree can only choose from a random sample of predictors rather than all available predictors. Random forests reduce variance; forests that don’t randomly sample predictors can end up with a lot of trees that are highly correlated if they consistently split on only the strongest predictors available.

To construct our random forest, we will use the “ranger” method within the “caret” library. Ranger works particularly quickly, which is advantageous to us considering our training data contains sixty thousand observations.

Code

set.seed(1)

rf_model_full <- train(factor(label)~., data = subset_mnist_train, 
                  method = "ranger",
                  importance = "impurity")

rf_model_full

Random Forest 

2000 samples
 784 predictor
  10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 2000, 2000, 2000, 2000, 2000, 2000, ... 
Resampling results across tuning parameters:

  mtry  splitrule   Accuracy   Kappa    
    2   gini        0.7430316  0.7136130
    2   extratrees  0.6466512  0.6057524
   39   gini        0.9124660  0.9026140
   39   extratrees  0.9153897  0.9058656
  783   gini        0.8782679  0.8645674
  783   extratrees  0.8991269  0.8877677

Tuning parameter 'min.node.size' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 39, splitrule = extratrees
 and min.node.size = 1.

Four our purposes, we are looking at the combination of mtry and splitrule that produce the most accurate model. In this case that would be 39 and extratrees, respectively. Both error rate and kappa value were worse for our random forest compared to our SVM. Our error rate was 8.5 ((1-0.9154)*100) and our kappa value was 0.9059. Error rate was significantly worse than our target of 0.8, but our kappa value outperformed our expectations.

Code

set.seed(1)

#Make predictions using RF
label_pred_rf = predict(rf_model_full, mnist$train$images)

#Predictions Data Set
mnist_pred_rf <- as.data.frame(mnist_train) |>
  mutate(pred=label_pred_rf) |>
  mutate(correct = if_else(label==pred, 1, 0)) |>
  mutate(id = row_number())

#Confusion Matrix for Full RF
table(Real = mnist_pred_rf$label, Pred = mnist_pred_rf$pred)

    Pred
Real    0    1    2    3    4    5    6    7    8    9
   0 5793    2    4    6    5   12   45    5   48    3
   1    0 6601   51    8    6   27   13   13   14    9
   2   80   43 5458   35   53   20   78   71  108   12
   3   22   46  141 5305   12  263   25   55  164   98
   4   10   26   26    2 5335    4   82   35   37  285
   5   53   57   18   49   44 5020   76   18   42   44
   6   56   26   11    2   22   42 5734    1   24    0
   7   16   60  113    6   60    0    1 5785   22  202
   8   20   83   51   83   31  102   57   17 5288  119
   9   52   23   40   78  148   26    8  187   68 5319

The confusion matrix for our random forest is less impressive than what we got from our SVM, which shows this model was less accurate. A glaring mistake our forest made was mistaking fours for nines; there were almost three hundred instances of this mistake. Interestingly, the forest didn’t mistake nines for fours as much; there were only about hundred and fifty of these mistakes.

This confusion matrix was based on a larger sample than that of our SVM. However, it is important to note this matrix was markedly worse regardless of size.

PCA-Utilized Random Forest

Principal component analysis (PCA) is a tool used to suppress a large number of variables into fewer variables to explain the relationships between predictors and outputs more simply. It also can significantly decrease computation time while retaining a vast majority of the variance within the data. We ran PCA on the training dataset to create principal components (PCs) which we plugged into our model to decrease computation time. 60,000 observations each with 784 variables can take a long time for a random forest to process.

Code

set.seed(1)

#Make subset of training data to run PCA on
some_train <- as.data.frame(mnist_train) |> sample_n(10000)
pca_digits <- prcomp(some_train, scale = FALSE)

variance_explained <- (pca_digits$sdev)^2 / sum((pca_digits$sdev)^2)
cumulative_variance <- cumsum(variance_explained)

variance_data <- data.frame(
  PC = seq_along(variance_explained),
  Variance = variance_explained,
  CumulativeVariance = cumulative_variance)

# Elbow plot for variance explained
ggplot(variance_data[1:50,], aes(x = PC, y = Variance)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot for PCA",
    x = "Principal Component",
    y = "Proportion of Variance Explained") +
  theme_minimal()

To decide which PCs were worthwhile to feed the model, we made an elbow plot to visualize how much of the variance was being captured by each PC. Once the graph starts to flatten out, including more PCs is less useful because of the extra computation time it would take for minimal improvement. In our case, the curve flattened out around 20, so we included 20 PCs in our model. This is a dramatic difference from the 784 variables on which we would have otherwise had to train our model.

PCA-Utilized Random Forest

Code

set.seed(1)

mnist_PC <- as.data.frame(some_train) |>
  mutate(PC1 = pca_digits$x[,1],
         PC2 = pca_digits$x[,2],
         PC3 = pca_digits$x[,3],
         PC4 = pca_digits$x[,4],
         PC5 = pca_digits$x[,5],
         PC6 = pca_digits$x[,6],
         PC7 = pca_digits$x[,7],
         PC8 = pca_digits$x[,8],
         PC9 = pca_digits$x[,9],
         PC10 = pca_digits$x[,10],
         PC11 = pca_digits$x[,11],
         PC12 = pca_digits$x[,12],
         PC13 = pca_digits$x[,13],
         PC14 = pca_digits$x[,14],
         PC15 = pca_digits$x[,15],
         PC16 = pca_digits$x[,16],
         PC17 = pca_digits$x[,17],
         PC18 = pca_digits$x[,18],
         PC19 = pca_digits$x[,19],
         PC20 = pca_digits$x[,20])

#Split into train and test data
sample_indices <- sample(1:nrow(some_train), size = 5000, replace = FALSE)
rf_train_data <- mnist_PC[sample_indices, ]
rf_test_data <- mnist_PC[-sample_indices, ]

#Train RF on PCs
rf_model_pca <- train(factor(label)~PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+
                    PC10+PC11+PC12+PC13+PC14+PC15+PC16+PC17+PC18+PC19+PC20,
                  data = rf_train_data, 
                  method = "ranger",
                  importance = "impurity")

rf_model_pca

Random Forest 

5000 samples
  20 predictor
  10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 5000, 5000, 5000, 5000, 5000, 5000, ... 
Resampling results across tuning parameters:

  mtry  splitrule   Accuracy   Kappa    
   2    gini        0.9149088  0.9053824
   2    extratrees  0.9237439  0.9152033
  11    gini        0.8840353  0.8710543
  11    extratrees  0.9136362  0.9039676
  20    gini        0.8607759  0.8451937
  20    extratrees  0.9062868  0.8957962

Tuning parameter 'min.node.size' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = extratrees
 and min.node.size = 1.

We ran a random forest on our first 20 principal components. Compared to our full random forest, this model performed pretty well. For this output we are looking at an mtry of 2 and a splitrule of extratrees. Our error rate was 7.6 ((1-0.9237)*100) and our kappa value was 0.915. While this is also not as accurate as our SVM, interestingly, our PCA-based random forest was more accurate than our full random forest. The PC model is beneficial for saving time and energy while still getting an impressive accuracy.

Code

set.seed(1)

#Predictions
label_pred_pca_rf = predict(rf_model_pca, rf_test_data)

mnist_pred_pca_rf <- as.data.frame(rf_test_data) |>
  mutate(pred=label_pred_pca_rf) |>
  mutate(correct = if_else(label==pred, 1, 0)) |>
  mutate(id = row_number())

#PCA RF Confusion Matrix
table(Real = mnist_pred_pca_rf$label, Pred = mnist_pred_pca_rf$pred)

    Pred
Real   0   1   2   3   4   5   6   7   8   9
   0 439   0   0   0   1   1   6   1   4   0
   1   0 581   1   2   0   3   0   1   1   0
   2   2   2 465   9   1   1   5  10  14   2
   3   1   2   3 458   0  11   4   6  12   5
   4   2   7   1   0 426   1   2   2   1  23
   5   4   2   1  21   2 430   9   3   2   2
   6   2   2   0   2   0   3 475   0   1   0
   7   5   5   4   0   4   0   0 487   3  10
   8   4   5   3  28   5   6   5   0 435   2
   9   2   2   3   7  24   3   1  15   6 446

Our confusion matrix for the PCA-based random forest was much smaller than that of the full random forest but once again illustrates that the PCA-based approach was at least slightly more accurate. However, both matrices were worse than that of our SVM.

Image Visualization for PCA RF Mistakes

Code

plots_2 <- list()
image_data_2 <- mnist_pred_pca_rf |>
  select(-contains("PC"))

set.seed(1)

id_pred_2  <- mnist_pred_pca_rf |>
  mutate(id = row_number()) |>
  filter(pred != label) |>
  group_by(pred) |>
  slice_sample(n = 5) |>
  select(id, pred)

row <- 1

for (k in id_pred_2$id) {
  vals <- image_data_2 |>
    dplyr::slice(k) |>
    dplyr::select(-id, -correct, -pred, -label) |>
    pivot_longer(everything()) |>
    pull(value)
  
  my_data_2 <- data.frame(pixel_val = vals,
                        pixel_num = 1:784,
                        col = rep(1:28, times = 28),
                        row = -rep(1:28, each = 28))
  
  p_2 <- my_data_2 |>
    mutate(pred = id_pred_2 |> filter(id == k) |> pull(pred),
           id = k)
  
  plots_2[[row]] <- p_2
  
  row <- row + 1
}

final_data_2 <- rbindlist(plots_2)

final_data_2 |>
  arrange(pred) |>
  mutate(row_order = rep(1:5, each = 784, times = 10)) |>
  ggplot() +
  geom_tile(aes(x = col, y = row, fill = pixel_val)) +
  facet_grid(row_order~pred) +
  theme(legend.position = "none") +
  labs(x = "", y = "", title = "PCA RF Incorrect Predictions") +
  theme(axis.text.x = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(strip.background.y = element_blank()) +
  theme(strip.text.y = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(strip.text.x = element_text(size = 15))

As with the SVM, we wanted to provide a visualization to show where our PCA-based random forest was making mistakes. While these mistakes are more glaring than that of the SVM, it is interesting to see where the forest went wrong. Some glaring mistakes were classifying eights as threes, eights as fives, and twos as eights. We suspect our forest was making splits based off characteristics these digits shared, leading to incorrect classifications.

Conclusion

We used an SVM and a random forest to classify handwritten numbers from the MNIST dataset. Our SVM was very effective at making classifications, and while our random forest was less successful, it showed promise as well.

Support vector machines have a lot of use in image classification without the meticulous tuning required of some more complex machine learning models like neural networks. We are very pleased with our decision to use an SVM in this project and are excited to explore additional applications of this type of model.