Setup

library(skimr)
library(tidyverse)
library(gridExtra)
library(readr)
library(dplyr)
library(caret)
library(naivebayes)
library(factoextra) # For PCA plots
library(e1071)
library(Rtsne)
library(RColorBrewer)
library(gbm)
library(randomForest)

Data Description

mnist_raw <- read_csv("https://pjreddie.com/media/files/mnist_train.csv", col_names = FALSE)
mnist_raw_test <- read_csv("https://pjreddie.com/media/files/mnist_test.csv", col_names = FALSE)
# Reduce the dataset down from 60,000 observations
mnist_subset <- mnist_raw %>%  head(5000)
# Relabel X1 and add instance number
mnist_subset <- mnist_subset  %>%  rename(label = X1) %>%  mutate(instance = row_number())
# Split dataset into X and y
X <- mnist_subset %>% select(contains('X'))
y <- mnist_subset$label 

The MNIST dataset is a compilation of handwritten digits which have been digitized for use in supervised machine learning classification applications. The dataset is fairly large, with 60,000 observations, where each observation is a handwritten digit produced by one of many different writers. As a result of this diversity, handwritten digits of the same number can differ due to penmanship style and to variation within the same writer's penmanship.

Each handwritten digit is captured in a 28x28 pixel image. The digits range from 0 to 9, and each observation/digit has been size-normalized and centered on the image canvas.

Below is a sample of observations from the dataset. Note that the images are grayscale: every pixel on the canvas is represented by an integer ranging from 0 to 255, where 0 means the pixel is completely white, 255 means the pixel is completely black, and values from 1 to 254 are the various shades of gray. Since each image is 28x28 pixels in size, each image can be represented by a 28x28 matrix.

# Display example instances
pixels_gathered <- mnist_subset %>%
  gather(pixel, value, -label, -instance) %>%
  tidyr::extract(pixel, "pixel", "(\\d+)", convert = TRUE) %>%
  mutate(pixel = pixel - 2, x = pixel %% 28, y = 28 - pixel %/% 28)
theme_set(theme_light())
pixels_gathered %>%
  filter(instance <= 12) %>%
  ggplot(aes(x, y, fill = value)) +
  geom_tile() +
  facet_wrap(~ instance + label) +
  scale_fill_gradient(low = "white", high = "black")

To represent each matrix as an observation, each digit matrix has been flattened (converted from a multidimensional array to a one-dimensional array) such that each observation is a list of \(28 \times 28 = 784\) integers, where each value is a pixel intensity ranging from 0 to 255. One additional value holds the label, a number from 0 to 9 identifying the digit shown in the image, bringing the length per observation to 785.

Below is the matrix representation of observation #8 from the sample above, the number 3.
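
As a minimal sketch, this matrix can be reconstructed from the flattened pixel columns (assuming the values in X2 through X785 are stored in row-wise order; obs8 is a name introduced only for this illustration):

# Reshape the 784 pixel values of observation #8 back into a 28x28 matrix (assumed row-wise order)
obs8 <- mnist_subset %>% filter(instance == 8) %>% select(starts_with("X"))
matrix(as.integer(unlist(obs8)), nrow = 28, ncol = 28, byrow = TRUE)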

Bringing this all together, the dataset has 60,000 observations, 784 features, and 1 classification label. The dataset of digital images is now represented in a format suitable for analysis and modeling. To reduce computation time, a subset of 5,000 observations is used in this modeling exercise.

Data Exploration

The distribution of the 5,000 samples drawn from the dataset is displayed below. The distribution is fairly even, though the number 2 is noticeably more represented than the number 5. The vast majority of pixel values are either 0 (white) or 255 (black), so most pixels carry little information, which suggests that dimensionality reduction methods might be useful.

plot1 <- ggplot(mnist_subset, aes(label)) + geom_bar(fill="steelblue")
plot2 <- ggplot(pixels_gathered, aes(value)) +  geom_histogram(bins=256, fill="steelblue")
grid.arrange(plot1, plot2, ncol=2)

The figure below shows the average pixel values for the 10 digits in the dataset. The fuzzy, grey areas are zones of high variability. The most variability appears in the tails and long strokes of some digits. On the other hand, digits such as 0 and 1 have higher pixel concentrations.

pixel_summary <- pixels_gathered %>%
  group_by(x, y, label) %>%
  summarize(mean_value = mean(value)) %>%
  ungroup()
pixel_summary %>%
  ggplot(aes(x, y, fill = mean_value)) +
  geom_tile() +
  scale_fill_gradient2(low = "white", high = "black", mid = "gray", midpoint = 127.5) +
  facet_wrap(~ label, nrow = 2) +
  labs(title = "Average value of each pixel in 10 MNIST digits", fill = "Average value") +
  theme_void()

Data Processing

Dimensionality Reduction

Two dimensionality reduction methods are explored: Near-Zero Variance (NZV) and Principal Component Analysis (PCA). A third technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), is also applied, though it is mainly used for visual exploration.

Near Zero Variance

The near-zero variance method looks at the distribution of variables across the dataset and identifies features with a low percentage of unique values. These variables are essentially constant and contain no information. As previously shown, the dataset is sparse and the majority of the pixels are nearly always zero. The resulting dataset retains only 250 variables. The figure below shows which pixels on the image canvas are variable enough to be retained.
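
As an optional check (a sketch only; the nzv_metrics object is introduced here and not used downstream), caret's nearZeroVar can also return the underlying metrics with saveMetrics = TRUE, which shows the frequency ratio and percentage of unique values behind each flag:

# Inspect the metrics behind the near-zero-variance flags (illustrative only)
nzv_metrics <- nearZeroVar(X, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])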

nonzero_var <- nearZeroVar(X, saveMetrics = FALSE)
# Visualize which pixels are retained (TRUE = not near-zero variance) on the 28x28 canvas
image(matrix(!(1:784 %in% nonzero_var), 28, 28))

# Drop the near-zero-variance pixel columns; the indices from nearZeroVar refer to columns of X
X <- X[ , -nonzero_var]

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a dimensionality reduction technique that projects high-dimensional data onto a two-dimensional plane. It is an unsupervised technique that helps identify patterns or clusters in the dataset. While the results of the t-SNE reduction cannot be used directly in models, t-SNE helps in understanding the underlying data and its relationships; if t-SNE shows good separation, models are more likely to achieve better accuracy on unseen data.

From the t-SNE visualization below we see that some digits cluster better than others. 0 is well clustered and separated, as are 1 and 6. The other digits are scattered in various smaller groups.

# Drop the label and the instance counter (a row index would distort the distances)
tsne_v <- Rtsne(as.matrix(select(mnist_subset, -label, -instance)),
                dims = 2, check_duplicates = FALSE, pca = TRUE, pca_scale = FALSE,
                theta = 0.1, perplexity = 45)
tsne_coords <- data.frame(X = tsne_v$Y[,1], Y = tsne_v$Y[,2], Label = as.factor(mnist_subset$label))
ggplot(tsne_coords, aes(x = X, y = Y, color = Label)) +
  geom_point(size = 1) +
  labs(title = "t-SNE Cluster Visualization of MNIST Digits") +
  scale_color_brewer(palette = "Set3") +
  theme_dark()

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique in which a dataset is transformed to use p eigenvectors of the covariance matrix instead of the original n predictors, where p < n. The number of eigenvectors p is selected by sorting the eigenvalues, choosing a threshold on the percentage of variance explained, and keeping the resulting p components.

The method seeks to project the data into a lower dimensional space where each axis (or principal component) captures the most variability in the data subject to the condition of being uncorrelated to the other axes. This last condition is important for dimensionality reduction in the sense that large datasets can contain many correlated variables which hold no additional information.
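
To make this concrete, a sketch in standard notation (the symbols \(X_c\), \(S\), \(w_j\), \(\lambda_j\), and \(Z\) are introduced here for illustration): with \(X_c\) the centered data matrix and \(N\) the number of observations, PCA diagonalizes the covariance matrix, and the proportion of variance explained (PVE) by each component follows directly from the eigenvalues.

\[
S = \frac{1}{N-1} X_c^\top X_c, \qquad S w_j = \lambda_j w_j, \qquad
\mathrm{PVE}_j = \frac{\lambda_j}{\sum_k \lambda_k}, \qquad
Z = X_c \left[ w_1 \; \cdots \; w_p \right].
\]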

pca <- prcomp(select(mnist_raw, -X1), center = TRUE, scale = FALSE)
qqnorm(pca[["x"]][,1])

Compared to t-SNE, the PCA projection is much less useful for visualization and harder to interpret. This is because the first two Principal Components only explain approximately 18% of the variance; to capture a higher percentage of the variance, we would have to retain more dimensions.

fviz_pca_ind(pca,
             geom.ind = "point", 
             col.ind = as.factor(mnist_raw$X1),
             addEllipses = TRUE,
             legend.title = "Digits")

After running PCA, there are 3 methods for choosing how many dimensions to retain. These 3 methods are:

  • The Kaiser Rule
  • The Scree Method
  • Cumulative Proportion of Variance Explained Method

The Kaiser Rule states that the optimal reduction is achieved by keeping only the Principal Components with an eigenvalue greater than 1; a component with an eigenvalue of 1 contains as much information as a single original variable. Below are all dimensions/Principal Components with an eigenvalue greater than 1.

This results in far too many dimensions, 650 to be exact. Such a reduction would not decrease computational complexity significantly.

get_eigenvalue(pca) %>% filter(eigenvalue > 1)

The Scree Method states that the best number of components is the one at which the scree plot forms an "elbow". The scree plot shows that after the 3rd Principal Component, additional components explain successively less of the variance.

The results of the scree plot show, however, that there is no sharp elbow to be found.

pca %>% fviz_eig(addlabels = TRUE)

The Cumulative Proportion of Variance Explained Method states that at least two-thirds of the variance should be explained by the reduced feature set. In other words, keep the Principal Components that cumulatively explain two-thirds of the variation and discard the Principal Components beyond the two-thirds mark.

In this case we opt for a stricter threshold of at least 95% of cumulative variance explained, which results in a dataset reduced to 154 dimensions instead of 784.

reduced_dim_95 <- get_eigenvalue(pca) %>% filter(cumulative.variance.percent < 95.02)
reduced_dim_95
mnist_pca_reduced <- as.data.frame(pca$x[,c(1:nrow(reduced_dim_95))])
head(mnist_pca_reduced)

Modeling

# Data Partitioning
set.seed(622)
trainIndex <- createDataPartition(y, p = .8, list = FALSE, times = 1)
X_train <- X[trainIndex,]
y_train <- y[trainIndex]
X_test <- X[-trainIndex,]
y_test <- y[-trainIndex]
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)
performance_df <- data.frame(Model = NULL, Accuracy = NULL, Kappa = NULL)

Multinomial Naive Bayes

Naïve Bayes is a family of probabilistic algorithms that use Bayes' Theorem to make predictions: the algorithm determines the probability of an event A given that another event B has already occurred, under the "naïve" assumption that the features are conditionally independent given the class.
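
As a sketch of the multinomial variant used below (notation introduced here for illustration): the posterior probability of class \(c\) given pixel intensities \(x = (x_1, \dots, x_d)\), treated as counts, is proportional to the prior times a product over features, with Laplace smoothing \(\alpha\) (set to 5 in the model below) applied to the estimated feature probabilities,

\[
P(y = c \mid x) \propto P(y = c) \prod_{j=1}^{d} \hat{\theta}_{cj}^{\,x_j},
\qquad
\hat{\theta}_{cj} = \frac{N_{cj} + \alpha}{N_c + \alpha d},
\]

where \(N_{cj}\) is the total intensity of pixel \(j\) across the training images of class \(c\) and \(N_c = \sum_j N_{cj}\).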

The Naïve Bayes model achieved an overall accuracy of 83.6% on the test dataset and finished within seconds. The speed at which the Naïve Bayes model trains gives it a unique advantage when computational complexity and speed are driving factors. The digits it misclassified most often were 5, 4, and 8.

nb <- multinomial_naive_bayes(select(mnist_raw, -X1), as.factor(mnist_raw$X1), laplace=5)
summary(nb)
## 
## ============================ Multinomial Naive Bayes ============================ 
##  
## - Call: multinomial_naive_bayes(x = select(mnist_raw, -X1), y = as.factor(mnist_raw$X1),      laplace = 5) 
## - Laplace: 5 
## - Classes: 10 
## - Samples: 60000 
## - Features: 784 
## - Prior probabilities: 
##     - 0: 0.0987
##     - 1: 0.1124
##     - 2: 0.0993
##     - 3: 0.1022
##     - 4: 0.0974
##     - 5: 0.0904
##     - 6: 0.0986
##     - 7: 0.1044
##     - 8: 0.0975
##     - 9: 0.0992
## 
## ---------------------------------------------------------------------------------
nb_pred <- predict(nb, newdata = data.matrix(select(mnist_raw_test, -X1)), type = "class")
nb_cm <- confusionMatrix(nb_pred, as.factor(mnist_raw_test$X1))
perf_nb <- data.frame(Model = "Naive Bayes", Accuracy = nb_cm$overall[1], Kappa = nb_cm$overall[2])
performance_df <- rbind(performance_df, perf_nb)
nb_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0  912    0   15    4    2   23   18    1    6    6
##          1    0 1061   11   11    2   12   13   21   26    7
##          2    2    5  858   34    6    6   17   11   13    3
##          3    6    9   24  851    0  107    1    5   54   10
##          4    1    0   10    1  732   18    7   19   14   66
##          5    8    2    3   21    0  589   25    0   27    9
##          6   14    6   33    7   25   17  859    1    8    0
##          7    1    0   11   14    1    6    0  861    9   17
##          8   36   51   66   40   38   78   18   40  777   28
##          9    0    1    1   27  176   36    0   69   40  863
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8363          
##                  95% CI : (0.8289, 0.8435)
##     No Information Rate : 0.1135          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.818           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9306   0.9348   0.8314   0.8426   0.7454   0.6603
## Specificity            0.9917   0.9884   0.9892   0.9760   0.9849   0.9896
## Pos Pred Value         0.9240   0.9115   0.8984   0.7976   0.8433   0.8611
## Neg Pred Value         0.9925   0.9916   0.9808   0.9822   0.9726   0.9675
## Prevalence             0.0980   0.1135   0.1032   0.1010   0.0982   0.0892
## Detection Rate         0.0912   0.1061   0.0858   0.0851   0.0732   0.0589
## Detection Prevalence   0.0987   0.1164   0.0955   0.1067   0.0868   0.0684
## Balanced Accuracy      0.9611   0.9616   0.9103   0.9093   0.8652   0.8249
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.8967   0.8375   0.7977   0.8553
## Specificity            0.9877   0.9934   0.9562   0.9611
## Pos Pred Value         0.8856   0.9359   0.6630   0.7115
## Neg Pred Value         0.9890   0.9816   0.9777   0.9834
## Prevalence             0.0958   0.1028   0.0974   0.1009
## Detection Rate         0.0859   0.0861   0.0777   0.0863
## Detection Prevalence   0.0970   0.0920   0.1172   0.1213
## Balanced Accuracy      0.9422   0.9155   0.8770   0.9082

KNN

The K-Nearest Neighbors (KNN) algorithm is non-parametric in the sense that no coefficients are estimated. Instead, KNN uses distances between observations and identifies the k closest data points to determine the classification. It is typical to scale the variables in order to even out the influence of variables with large values; in our case, all variables are already on the same 0-255 scale, so no additional processing is required.
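
As a minimal illustration of the idea (not the caret implementation used below; the object d is introduced here only for the sketch), the 1-NN prediction for a single test image can be computed by hand with Euclidean distances:

# Illustrative only: 1-NN prediction for the first test image using Euclidean distance
d <- sqrt(rowSums(sweep(as.matrix(X_train), 2, as.matrix(X_test)[1, ])^2))
y_train[which.min(d)]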

Cross-validation over 3 folds with various values of k is used to determine the optimal parameter. The highest accuracy on the training set is reached with a single neighbor (k = 1). The classification results on the test set are displayed below. This 1-NN model reaches an accuracy of 94.9%.

trControl <- trainControl(method  = "cv", number  = 3)
knn.fit <- train(X_train, y_train,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:5),
             trControl  = trControl)
plot(knn.fit)

knnPredict <- predict(knn.fit, newdata = X_test) 
knn_cm <- confusionMatrix(knnPredict, y_test)
perf_knn <- data.frame(Model = "KNN", Accuracy = knn_cm$overall[1], Kappa = knn_cm$overall[2])
performance_df <- rbind(performance_df, perf_knn)
knn_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 105   0   0   0   0   0   0   0   0   2
##          1   0  99   2   0   1   1   1   0   1   1
##          2   0   1  91   0   0   0   1   0   2   0
##          3   0   0   1  87   0   1   0   0   0   0
##          4   0   0   0   0 108   0   1   1   0   5
##          5   0   0   0   2   0  80   0   0   2   2
##          6   0   0   0   1   1   2 101   0   1   0
##          7   0   2   5   2   0   1   0 104   1   1
##          8   0   0   0   0   0   0   0   0  79   0
##          9   0   0   0   1   2   0   0   3   0  94
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9489          
##                  95% CI : (0.9334, 0.9618)
##     No Information Rate : 0.1121          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9432          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   0.9706  0.91919  0.93548   0.9643  0.94118
## Specificity            0.9978   0.9922  0.99556  0.99779   0.9921  0.99344
## Pos Pred Value         0.9813   0.9340  0.95789  0.97753   0.9391  0.93023
## Neg Pred Value         1.0000   0.9966  0.99115  0.99341   0.9955  0.99452
## Prevalence             0.1051   0.1021  0.09910  0.09309   0.1121  0.08509
## Detection Rate         0.1051   0.0991  0.09109  0.08709   0.1081  0.08008
## Detection Prevalence   0.1071   0.1061  0.09510  0.08909   0.1151  0.08609
## Balanced Accuracy      0.9989   0.9814  0.95737  0.96664   0.9782  0.96731
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9712   0.9630  0.91860  0.89524
## Specificity            0.9944   0.9865  1.00000  0.99329
## Pos Pred Value         0.9528   0.8966  1.00000  0.94000
## Neg Pred Value         0.9966   0.9955  0.99239  0.98776
## Prevalence             0.1041   0.1081  0.08609  0.10511
## Detection Rate         0.1011   0.1041  0.07908  0.09409
## Detection Prevalence   0.1061   0.1161  0.07908  0.10010
## Balanced Accuracy      0.9828   0.9747  0.95930  0.94426

Gradient Boosting

With gradient boosting, trees are grown sequentially, using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set. Unlike fitting a single large decision tree (fitting the data hard), which can lead to over-fitting, the boosting approach learns slowly. Given the current model, a decision tree is fit to its residuals. Each of these trees can be rather small, with just a few terminal nodes (determined by the interaction depth parameter). By fitting small trees to the residuals, we slowly improve the estimate in areas where it does not perform well, and the shrinkage parameter slows the process down further. Cross-validation over 3 folds was used to determine the optimal number of trees and interaction depth. This model reached 94.3% accuracy.
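
A sketch of this sequential update for the regression setting (with shrinkage \(\lambda\), \(B\) trees, and residuals \(r_i\); for classification, gbm applies the same idea to gradients of the loss rather than raw residuals):

\[
\hat f(x) \leftarrow \hat f(x) + \lambda \hat f^{\,b}(x),
\qquad
r_i \leftarrow r_i - \lambda \hat f^{\,b}(x_i),
\qquad
\hat f(x) = \sum_{b=1}^{B} \lambda \hat f^{\,b}(x),
\]

where each \(\hat f^{\,b}\) is a small tree (a few terminal nodes) fit to the current residuals.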

tune <- FALSE
trControl <- trainControl(method = "cv", number = 3)
if (tune == TRUE) {
  gbmGrid <-  expand.grid(interaction.depth = c(5,9), 
                        n.trees = (1:10)*25, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)
   gbm.fit <- train(X_train, y_train,
               method     = "gbm",
               tuneGrid   = gbmGrid,
               trControl  = trControl,
               verbose    = FALSE)
} else {
  gbmGrid <-  expand.grid(interaction.depth = 9, 
                        n.trees = 250, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)
  gbm.fit <- train(X_train, y_train,
                method     = "gbm",
                tuneGrid   = gbmGrid,
                trControl  = trControl,
                verbose    = FALSE)
}

Gradient Boosting Training

gbm_pred <- predict(gbm.fit, newdata = X_test)
gbm_cm <- confusionMatrix(gbm_pred, as.factor(y_test))
perf_gbm <- data.frame(Model = "Gradient Boosting", Accuracy = gbm_cm$overall[1], Kappa = gbm_cm$overall[2])
performance_df <- rbind(performance_df, perf_gbm)
gbm_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 104   0   0   0   0   1   0   0   0   2
##          1   0  96   1   0   1   1   1   1   0   1
##          2   0   4  93   2   0   0   1   0   0   0
##          3   0   0   0  87   0   1   0   0   1   0
##          4   0   0   1   0 106   0   0   2   0   4
##          5   1   1   0   1   0  76   2   0   2   0
##          6   0   0   0   0   2   3 100   0   0   0
##          7   0   0   2   2   0   1   0 101   0   1
##          8   0   0   2   0   0   1   0   2  82   0
##          9   0   1   0   1   3   1   0   2   1  97
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9429          
##                  95% CI : (0.9267, 0.9565)
##     No Information Rate : 0.1121          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9365          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9905   0.9412  0.93939  0.93548   0.9464  0.89412
## Specificity            0.9966   0.9933  0.99222  0.99779   0.9921  0.99234
## Pos Pred Value         0.9720   0.9412  0.93000  0.97753   0.9381  0.91566
## Neg Pred Value         0.9989   0.9933  0.99333  0.99341   0.9932  0.99017
## Prevalence             0.1051   0.1021  0.09910  0.09309   0.1121  0.08509
## Detection Rate         0.1041   0.0961  0.09309  0.08709   0.1061  0.07608
## Detection Prevalence   0.1071   0.1021  0.10010  0.08909   0.1131  0.08308
## Balanced Accuracy      0.9936   0.9672  0.96581  0.96664   0.9693  0.94323
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9615   0.9352  0.95349   0.9238
## Specificity            0.9944   0.9933  0.99452   0.9899
## Pos Pred Value         0.9524   0.9439  0.94253   0.9151
## Neg Pred Value         0.9955   0.9922  0.99561   0.9910
## Prevalence             0.1041   0.1081  0.08609   0.1051
## Detection Rate         0.1001   0.1011  0.08208   0.0971
## Detection Prevalence   0.1051   0.1071  0.08709   0.1061
## Balanced Accuracy      0.9780   0.9642  0.97401   0.9569

Random Forest

Random forest is a machine learning algorithm that uses a collection of decision trees to provide more flexibility and accuracy while reducing variance and the effect of multicollinearity. For each tree and at each split, a random sample of the predictors is considered from the full set. This has the effect of making the individual trees more dissimilar.
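
For reference, a common rule of thumb for classification (not necessarily what the tuning below selects) is to sample roughly the square root of the number of predictors at each split:

# Rule-of-thumb mtry for classification: roughly sqrt(p) predictors sampled per split
floor(sqrt(ncol(X_train)))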

The model is tuned using five-fold cross-validation in order to find the optimal value of mtry, the number of predictors sampled at each split. For the number of trees, we usually select a cutoff where the error settles.

control <- trainControl(method="cv", number = 5)
grid <- data.frame(mtry = c(1, 5, 10, 25, 50, 100))
train_rf <-  train(X_train, y_train, 
                   method = "rf", 
                   ntree = 150,
                   trControl = control,
                   tuneGrid = grid,
                   nSamp = 5000)
print(train_rf)
## Random Forest 
## 
## 4001 samples
##  249 predictor
##   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3200, 3200, 3203, 3201, 3200 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     1   0.9035157  0.8926661
##     5   0.9300102  0.9221670
##    10   0.9317664  0.9241211
##    25   0.9277664  0.9196786
##    50   0.9242648  0.9157861
##   100   0.9235123  0.9149503
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
print(train_rf$bestTune$mtry)
## [1] 10

Cross-validation selected mtry = 10. The error plot below suggests the ensemble error settles after roughly 25 trees, and the refitted model achieves a high accuracy of 94.1% on the test set.

fit_rf <- randomForest(X_train, y_train, 
                       mtry = train_rf$bestTune$mtry)
plot(fit_rf)

y_pred_rf <- predict(fit_rf, X_test)
rf_cm <- confusionMatrix(y_pred_rf, y_test)
rf_cm$overall["Accuracy"]
##  Accuracy 
## 0.9409409
perf_rf <- data.frame(Model = "Random Forests", Accuracy = rf_cm$overall[1], Kappa = rf_cm$overall[2])
performance_df <- rbind(performance_df, perf_rf)
rf_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 104   0   0   0   0   1   1   0   0   2
##          1   0  97   1   0   1   0   0   1   0   1
##          2   0   3  94   2   0   0   2   0   0   1
##          3   0   2   0  85   0   1   0   0   1   1
##          4   0   0   0   0 107   0   0   1   0   2
##          5   0   0   0   1   0  76   1   0   2   1
##          6   1   0   1   0   3   3 100   0   1   0
##          7   0   0   2   2   0   0   0 101   0   1
##          8   0   0   1   0   0   2   0   1  80   0
##          9   0   0   0   3   1   2   0   4   2  96
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9409          
##                  95% CI : (0.9245, 0.9547)
##     No Information Rate : 0.1121          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9343          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9905   0.9510  0.94949  0.91398   0.9554  0.89412
## Specificity            0.9955   0.9955  0.99111  0.99448   0.9966  0.99453
## Pos Pred Value         0.9630   0.9604  0.92157  0.94444   0.9727  0.93827
## Neg Pred Value         0.9989   0.9944  0.99443  0.99120   0.9944  0.99020
## Prevalence             0.1051   0.1021  0.09910  0.09309   0.1121  0.08509
## Detection Rate         0.1041   0.0971  0.09409  0.08509   0.1071  0.07608
## Detection Prevalence   0.1081   0.1011  0.10210  0.09009   0.1101  0.08108
## Balanced Accuracy      0.9930   0.9733  0.97030  0.95423   0.9760  0.94432
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9615   0.9352  0.93023   0.9143
## Specificity            0.9899   0.9944  0.99562   0.9866
## Pos Pred Value         0.9174   0.9528  0.95238   0.8889
## Neg Pred Value         0.9955   0.9922  0.99344   0.9899
## Prevalence             0.1041   0.1081  0.08609   0.1051
## Detection Rate         0.1001   0.1011  0.08008   0.0961
## Detection Prevalence   0.1091   0.1061  0.08408   0.1081
## Balanced Accuracy      0.9757   0.9648  0.96293   0.9504

The plot below displays the most important variables in this model. Columns X380, X435, and X463, among others, are identified as the most important in the sense that they produce the largest mean decrease in Gini index.

varImpPlot(fit_rf)

Results

The summary of the modeling results is displayed below. KNN performed the best overall, with a test-set accuracy of 94.9%.

rownames(performance_df) <- NULL
performance_df 

We can take a look at some of the misclassified instances to evaluate where the model might have gone wrong. In the tile plot below, the predicted number is at the top of each cell and the actual digit just below it. Most of these are mistakes a human would not make: other than the middle 6 and 8 and the last 7, which are fairly distorted, the digits remain easily distinguishable to our eyes.

misclass <- data.frame(knn_pred=knnPredict,y_test=y_test)
misclass <- misclass %>% mutate(row=row_number()) %>% filter(knn_pred != y_test) 
# filter misclassified instances
sub_test <- mnist_subset[-trainIndex,]
sub_test <- sub_test[misclass$row,]
sub_test$pred <- misclass$knn_pred
# Display example instances
pixels_gathered <- sub_test %>%
  gather(pixel, value, -label, -instance, -pred) %>%
  tidyr::extract(pixel, "pixel", "(\\d+)", convert = TRUE) %>%
  mutate(pixel = pixel - 2, x = pixel %% 28, y = 28 - pixel %/% 28)
theme_set(theme_light())
pixels_gathered %>%
  filter(instance <= 745) %>%
  ggplot(aes(x, y, fill = value)) +
  geom_tile() +
  facet_wrap(~ pred + label) +
  scale_fill_gradient(low = "white", high = "black")

Conclusion

The intention of this project was to apply classification models to image data, in this case digitized handwritten digits. The dataset differs from those used in previous projects in that it contains only sparse numerical values with no categorical features. To deal with this sparsity we used dimensionality reduction techniques such as NZV, t-SNE, and PCA. The reduced dataset was used to train different classification models tuned to recognize handwritten digits. While Naive Bayes was inferior, most models achieved high accuracy. The KNN classification model had the best overall accuracy of 94.89%.

Future work could include exploring neural network methods on this dataset, which have demonstrated high performance in image recognition. Additionally, more powerful hardware would allow for finer tuning and training on the full 60,000 observations.