Machine Learning with Handwritten Digits

This exploration of model fitting and recommendation systems was completed as an assignment for the “HarvardX PH125.8x Data Science: Machine Learning” course.

Let’s load some libraries for this project.

libraries <- c("caret", "dslabs", "tidyverse", "e1071")
sapply(libraries, require, character.only=TRUE)

And use the MNIST database.

mnist <- read_mnist()
dim(mnist$train$images)
## [1] 60000   784
table(mnist$train$labels)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949

About MNIST: > Consists of a training set with 60,000 entries and a test set with 10,000 entries. Each entry consists of 784 integers ranging from 0-255 (grey scale values), representing a handwritten digit in a 28x28 pixel area.
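
To make the 784-integer representation concrete, here is a quick sketch (an illustrative addition, not part of the assignment; i is an arbitrary index) that renders one training image, reversing the columns so the digit appears upright.

i <- 1   # any training image index
image(matrix(mnist$train$images[i,], 28, 28)[, 28:1],
      main = paste("Label:", mnist$train$labels[i]),
      xaxt = "n", yaxt = "n")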


Preprocessing

To keep the run time manageable, let's sample only 10,000 observations from the training set and 1,000 from the test set.

# sample.kind = "Rounding" reproduces the sampling method of R versions <= 3.5
set.seed(123, sample.kind = "Rounding")
index <- sample(nrow(mnist$train$images), 10000)
X_train <- mnist$train$images[index,]
y_train <- factor(mnist$train$labels[index])

index <- sample(nrow(mnist$test$images), 1000)
X_test <- mnist$test$images[index,]
y_test <- factor(mnist$test$labels[index])

Images are 28x28 pixels, with the digit written near the center.

We would expect many pixels near the edges and corners to contain no writing.
Under this assumption, let's identify the pixels that are rarely used by computing the standard deviation of each pixel across images.

library(matrixStats)
sds <- colSds(X_train)
qplot(sds,bins=256, color=I("black"))

Features with near-zero variance can be removed; caret's nearZeroVar identifies them, and the image below shows where they sit in the 28x28 grid.

nzv <- nearZeroVar(X_train)
image(matrix(1:784 %in% nzv, 28, 28))
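
To quantify what share of pixels this flags, a simple check (a small addition using the nzv object just computed) is:

length(nzv) / ncol(X_train)   # share of pixels flagged as near-zero variance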

Remove these columns.

col_index <- setdiff(1:ncol(X_train), nzv)
length(col_index)
## [1] 252

Adding column names is a requirement of the caret package.

colnames(X_train) <- 1:ncol(mnist$train$images)
colnames(X_test) <- colnames(X_train)

Modelling

K Nearest Neighbors

Use 10-fold cross validation to optimize a k nearest neighbors model.

control <- trainControl(method = "cv", number = 10, p = .9)
train_knn <- train(X_train[,col_index], y_train,
                   method = "knn", 
                   tuneGrid = data.frame(k = c(1,3,5,7)),
                   trControl = control)
ggplot(train_knn)
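
The selected tuning parameter can also be read directly from the fitted object (a quick optional check):

train_knn$bestTune   # k chosen by 10-fold cross validation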

A k=3 model yields the highest accuracy.

fit_knn <- knn3(X_train[,col_index], y_train, k=3)

Compute our prediction.

y_hat_knn <- predict(fit_knn, 
                     X_test[,col_index],
                     type="class")
cm <- confusionMatrix(y_hat_knn, factor(y_test))
cm$overall["Accuracy"]
## Accuracy 
##    0.959

Let's take a closer look at the classification of each digit. In this case, 8 is the hardest digit to detect, with the lowest sensitivity.

cm$byClass[,1:2]
##          Sensitivity Specificity
## Class: 0   1.0000000   0.9977728
## Class: 1   1.0000000   0.9931741
## Class: 2   0.9528302   0.9988814
## Class: 3   0.9166667   0.9934498
## Class: 4   0.9357798   0.9977553
## Class: 5   0.9714286   0.9913978
## Class: 6   0.9901961   0.9977728
## Class: 7   0.9545455   0.9955056
## Class: 8   0.8681319   0.9977998
## Class: 9   0.9809524   0.9910615
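
The hardest class can also be identified programmatically (a small illustrative check):

which.min(cm$byClass[, "Sensitivity"])   # class with the lowest sensitivity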

Now let's look at some wrong classifications made by the knn model.

First, find the misclassified test images that were predicted with high confidence.

p_max <- predict(fit_knn, X_test[,col_index])
p_max <- apply(p_max, 1, max)
ind  <- which(y_hat_knn != y_test)
ind <- ind[order(p_max[ind], decreasing = TRUE)]

Plot them.

rafalib::mypar(3,4)
for(i in ind[1:12]){
  image(matrix(X_test[i,], 28, 28)[, 28:1],
        main = paste0("Pr(",y_hat_knn[i],")=",round(p_max[i], 2),
                      " but is a ",y_test[i]),
        xaxt="n", yaxt="n")
}

Random Forest

Before building our model, let's have a look at the pixels that have strong predictive value in a random forest.

library(randomForest)
# use the 10,000-image training sample drawn above (X_train, y_train);
# note that `index` now holds the test-sample rows, so it should not be reused here
rf <- randomForest(X_train, y_train, ntree = 1000)
imp <- importance(rf)

image(matrix(imp, 28, 28))
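
As an optional numeric companion to the image, the indices of the most influential pixels can be listed; they should fall near the center of the 28x28 grid.

head(order(imp, decreasing = TRUE), 10)   # positions of the 10 most important pixels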

Use 5-fold cross validation to optimize a random forest model.

library(Rborist)
control <- trainControl(method="cv", number = 5, p = 0.8)
grid <- expand.grid(minNode = c(1, 5), predFixed = c(10, 15, 25, 35, 50))
train_rf <-  train(X_train[, col_index], y_train,
                   method = "Rborist",
                   nTree = 50,
                   trControl = control,
                   tuneGrid = grid,
                   nSamp = 5000)
ggplot(train_rf)

train_rf$bestTune
##   predFixed minNode
## 3        25       1

Fit the model with optimized parameters.

fit_rf <- Rborist(X_train[, col_index], y_train,
                  nTree = 1000,
                  minNode = train_rf$bestTune$minNode,
                  predFixed = train_rf$bestTune$predFixed)

Make our predictions.

y_hat_rf <- factor(levels(y_train)[predict(fit_rf, X_test[ ,col_index])$yPred])
cm <- confusionMatrix(y_hat_rf, y_test)
cm$overall["Accuracy"]
## Accuracy 
##     0.96
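
For comparison with the knn results, the per-digit sensitivity and specificity of the random forest can be inspected the same way (using the confusion matrix just computed):

cm$byClass[, 1:2]   # sensitivity and specificity per digit for the random forest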

Let’s have a look at some of the predictions that the model made.

rafalib::mypar(3,4)
for(i in 1:12){
  image(matrix(X_test[i,], 28, 28)[, 28:1], 
        main = paste("Our prediction:", y_hat_rf[i]),
        xaxt="n", yaxt="n")
}

Now let's take another look at wrong classifications, this time by the random forest model.

Again, find the misclassified test images that were predicted with high confidence.

p_max <- predict(fit_rf, X_test[,col_index])$census  
p_max <- p_max / rowSums(p_max)
p_max <- apply(p_max, 1, max)
ind  <- which(y_hat_rf != y_test)
ind <- ind[order(p_max[ind], decreasing = TRUE)]

And plot them.

rafalib::mypar(3,4)
for(i in ind[1:12]){
  image(matrix(X_test[i,], 28, 28)[, 28:1], 
        main = paste0("Pr(",y_hat_rf[i],")=",round(p_max[i], 2),
                      " but is a ",y_test[i]),
        xaxt="n", yaxt="n")
}


Ensembles with mnist_27

About mnist_27: > “includes a randomly selected set of 2s and 7s from the mnist dataset along with the two predictors based on the proportion of dark pixels in the upper left and lower right quadrants respectively.”

Let us now build an ensemble of several models for the mnist_27 data.

Define our models.

# sample.kind = "Rounding" reproduces the sampling method of R versions <= 3.5
set.seed(1, sample.kind = "Rounding")

models <- c("glm", "lda", "naive_bayes", "svmLinear", "knn", "gamLoess", "multinom", "qda", "rf", "adaboost")

Load in the dataset.

data("mnist_27")
mnist_27$train %>% head()
##   y        x_1        x_2
## 1 2 0.03947368 0.18421053
## 2 7 0.16071429 0.08928571
## 3 2 0.02127660 0.27659574
## 4 2 0.13580247 0.22222222
## 5 7 0.39024390 0.36585366
## 6 2 0.04854369 0.28155340
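
A quick scatterplot of the two predictors, coloured by class, gives a feel for what the models below have to separate (an illustrative addition, not part of the assignment):

mnist_27$train %>%
  ggplot(aes(x_1, x_2, color = y)) +
  geom_point()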

Fit all the models.

fits <- lapply(models, function(model){
  train(y ~ ., method=model, data= mnist_27$train)
})

names(fits) <- models

Make predictions for each model.

predictions <- sapply(fits, function(fit){
  predict(fit, mnist_27$test)
})

colnames(predictions) <- models   # predictions is a matrix, so set column names
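
Note that sapply simplifies the list of factor predictions into a character matrix with one column per model, which is why column names (rather than names) are set above. A quick structural check:

dim(predictions)   # rows = test observations, columns = models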

Find out the accuracies of each model.

accuracies <- sapply(seq(1,length(models)), function(i){
  mean(predictions[,i] == mnist_27$test$y)
})
names(accuracies) <- models

accuracies %>% data.frame()
##                 .
## glm         0.750
## lda         0.750
## naive_bayes 0.795
## svmLinear   0.755
## knn         0.840
## gamLoess    0.845
## multinom    0.750
## qda         0.820
## rf          0.780
## adaboost    0.785
accuracies %>% mean()
## [1] 0.787

Now build our ensemble prediction: predict 7 if more than 50% of the models vote 7, otherwise predict 2.

ensemble_predictions <- ifelse(rowMeans(predictions == 7) > 0.5, 7, 2)

ens_pred_accuracy <- mean(ensemble_predictions == mnist_27$test$y)
print(c("ensemble model prediction accuracy: ", ens_pred_accuracy))
## [1] "ensemble model prediction accuracy: "
## [2] "0.81"

Models that do better than the ensemble.

accuracies[accuracies > ens_pred_accuracy]
##      knn gamLoess      qda 
##    0.840    0.845    0.820

Should we remove the methods that do not perform well and rebuild the ensemble?

No, because that would mean using the test data to make a decision!

We could instead use the accuracy estimates obtained from cross-validation on the training data, averaging fit$results$Accuracy over the tuning parameters for each model.

accuracy_estimates <- sapply(fits, function(fit){
  mean(fit$results$Accuracy)
})
accuracy_estimates
##         glm         lda naive_bayes   svmLinear         knn    gamLoess 
##   0.8010386   0.7987611   0.8188442   0.7969828   0.8141486   0.8455962 
##    multinom         qda          rf    adaboost 
##   0.7907668   0.8336128   0.8057573   0.8029408
accuracy_estimates %>% mean()
## [1] 0.8108449

Let us now define two ensembles, keeping only the models whose cross-validated accuracy estimates surpass an 80% or an 81% threshold, respectively.

ensemble80_models <- accuracy_estimates[accuracy_estimates>0.8] %>% names()
ensemble81_models <- accuracy_estimates[accuracy_estimates>0.81] %>% names()

ensemble80_preds <- ifelse(rowMeans(predictions[,ensemble80_models]==7)>0.5, 7, 2)
ensemble81_preds <- ifelse(rowMeans(predictions[,ensemble81_models]==7)>0.5, 7, 2)
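
The number of models in each ensemble can be checked directly:

length(ensemble80_models)   # models passing the 80% threshold
length(ensemble81_models)   # models passing the 81% threshold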

In this case, using only the 4 models that pass the 81% accuracy threshold actually performs worse than using the 7 models that pass the 80% threshold.

mean(ensemble80_preds == mnist_27$test$y)
## [1] 0.835
mean(ensemble81_preds == mnist_27$test$y)
## [1] 0.83

Here are our final ensemble constituents.

ensemble80_models
## [1] "glm"         "naive_bayes" "knn"         "gamLoess"    "qda"        
## [6] "rf"          "adaboost"