Overview and Intro

Computer vision, the field of training computers to recognize imagery much as a human would, is one of the biggest challenges in machine learning. The goal is always to approach human-level accuracy, and as you'll see, some handwritten numbers are hard to discern even by eye.

The computer vision challenge in the demonstration that follows is training the computer to recognize handwritten numbers. The training data comes from the widely known MNIST database, a good resource for anyone who wants to study deep learning techniques and pattern recognition methods on real-world data while spending minimal effort on pre-processing and formatting.

There are many ways to conduct training and testing on the MNIST data set; the deep learning methods here use the Keras interface to R, which is underpinned by Google's TensorFlow. Using Keras, we will define models with two separate approaches and compare both accuracy and runtime. Readers can consider this a "hello world" test of the Keras interface to R.

Setup

Libraries

Deep Learning Install

#library(keras)
#install_keras()

loadedNamespaces()
##   [1] "colorspace"   "ellipsis"     "class"        "rprojroot"    "base64enc"   
##   [6] "fs"           "rstudioapi"   "stats"        "remotes"      "prodlim"     
##  [11] "fansi"        "lubridate"    "xml2"         "codetools"    "splines"     
##  [16] "R.methodsS3"  "knitr"        "pkgload"      "zeallot"      "jsonlite"    
##  [21] "pROC"         "caret"        "broom"        "base"         "kernlab"     
##  [26] "dbplyr"       "R.oo"         "tfruns"       "readr"        "compiler"    
##  [31] "httr"         "backports"    "assertthat"   "Matrix"       "lazyeval"    
##  [36] "cli"          "htmltools"    "prettyunits"  "tools"        "gtable"      
##  [41] "glue"         "reshape2"     "dplyr"        "naivebayes"   "grDevices"   
##  [46] "Rcpp"         "cellranger"   "vctrs"        "nlme"         "iterators"   
##  [51] "timeDate"     "xfun"         "gower"        "stringr"      "ps"          
##  [56] "testthat"     "rvest"        "lifecycle"    "pacman"       "devtools"    
##  [61] "MASS"         "scales"       "ipred"        "graphics"     "hms"         
##  [66] "tidyverse"    "utils"        "yaml"         "memoise"      "reticulate"  
##  [71] "ggplot2"      "keras"        "datasets"     "rpart"        "stringi"     
##  [76] "tensorflow"   "desc"         "randomForest" "foreach"      "pkgbuild"    
##  [81] "lava"         "rlang"        "pkgconfig"    "evaluate"     "lattice"     
##  [86] "purrr"        "recipes"      "htmlwidgets"  "processx"     "tidyselect"  
##  [91] "plyr"         "magrittr"     "R6"           "generics"     "DBI"         
##  [96] "pillar"       "haven"        "whisker"      "withr"        "survival"    
## [101] "nnet"         "tibble"       "modelr"       "crayon"       "plotly"      
## [106] "rmarkdown"    "usethis"      "grid"         "readxl"       "data.table"  
## [111] "blob"         "callr"        "methods"      "forcats"      "ModelMetrics"
## [116] "webshot"      "reprex"       "digest"       "tidyr"        "R.utils"     
## [121] "stats4"       "munsell"      "MLeval"       "kableExtra"   "viridisLite" 
## [126] "sessioninfo"

Model Data Functions
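
modelDataC(), used below to split the data, is defined in a setup chunk that is not echoed here. As a rough guide, the following is a minimal sketch of an equivalent splitter, assuming an S4 container with trn/tst/hld slots, a stratified 80/20 train/test split via caret, and a small holdout drawn from the test rows. The proportions are inferred from the dimensions printed later; every name here is a reconstruction, not the original code.

# Hypothetical reconstruction; the real modelDataC() is defined off-page.
# The slot names (trn/tst/hld) match how the object is used below.
library(caret)
setClass("modelData", slots = c(trn = "data.frame",
                                tst = "data.frame",
                                hld = "data.frame"))

modelDataC <- function(df, p_train = 0.80, p_hold = 0.20) {
  set.seed(42)                                   # assumed seed
  in_train <- createDataPartition(df$label, p = p_train, list = FALSE)
  trn <- df[in_train, ]                          # ~80%, stratified by label
  tst <- df[-in_train, ]                         # ~20%
  in_hold <- createDataPartition(tst$label, p = p_hold, list = FALSE)
  hld <- tst[in_hold, ]                          # small holdout from the test rows
  new("modelData", trn = trn, tst = tst, hld = hld)
}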

Load Data

Load the data from the Kaggle digits data set, which ships as separate train and test files. For this example I use only the train file but run cross-validation. Also note: this data set is large, so I reduce it by a "percent" setting (see below).
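
The loading chunk itself is not echoed. Here is a minimal sketch of that step, assuming the train.csv file from Kaggle's "Digit Recognizer" competition and a hypothetical "percent" knob for shrinking the data:

# Hypothetical reconstruction of the hidden loading step
percent <- 1.00                      # fraction of rows to keep; lower to shrink
DigitTotalDF <- read.csv("train.csv")
DigitTotalDF$label <- factor(DigitTotalDF$label, levels = 0:9)
set.seed(42)                         # assumed seed
keep <- sample(nrow(DigitTotalDF), floor(percent * nrow(DigitTotalDF)))
DigitTotalDF <- DigitTotalDF[keep, ]
dim(DigitTotalDF)                    # 42000 x 785 (label + 784 pixels) at percent = 1.00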

## [1] 42000   785
label  pixel0  pixel1  pixel2  pixel3
    1       0       0       0       0
    0       0       0       0       0
    1       0       0       0       0
    4       0       0       0       0
    0       0       0       0       0
    0       0       0       0       0

Exploratory Data Analysis

Digit distributions

EDA <- DigitTotalDF

## Distribution of digits across all data sets


plot1 <- ggplot(EDA, aes(x = label, y = (..count..)/sum(..count..), fill = label)) + geom_bar() + theme_light() +
                labs(y = "Relative frequency", title = "mnist_train dataset") + 
                scale_y_continuous(labels = scales::percent, limits = c(0, 0.15)) +
                geom_text(stat = "count", 
                          aes(label = scales::percent((..count..)/sum(..count..)), vjust = -1))


plot1

Digit images

View the digits data images, and how they vary.

library(readr)
#split training data into list by label number
digit_groups <- split(EDA, EDA$label)

# Remove label column in each
dg2 <- lapply(digit_groups, function(x) { x["label"] <- NULL; x })

gs1 <- grey(0:128/128)

# Pixel-wise function: summarize each pixel column with `func`, reshape the
# 784 values into a 28 x 28 matrix, reverse each row so the digit renders
# upright, and draw it as a grayscale image
pixel_wise <- function(df, func) {
  output <- apply(t(df), 1, func)  # one summary value per pixel
  u <- matrix(output, ncol = 28)   # 784 values -> 28 x 28
  v <- u
  for (i in 1:28) {
    v[i, ] <- rev(u[i, ])          # fix the orientation for image()
  }
  image(v, col = gs1, axes = FALSE)
}
# Plot each digit's pixel-wise mean
#par(mfrow=c(4,3))
plots <- lapply(dg2, pixel_wise, func = mean)

title("Pixel-wise Mean", outer=TRUE, line = -2)

# Plot each digit's pixel-wise median
#par(mfrow=c(4,3))
plots <- lapply(dg2, pixel_wise, func = median)

title("Pixel-wise Median", outer=TRUE, line = -2)

#par(mfrow=c(4,3))
# Plot each digit's pixel-wise standard deviation
plots <- lapply(dg2, pixel_wise, func = sd)

title("Pixel-wise Standard Deviation", outer=TRUE, line = -2)

Keras R Modeling

This will provide you with default CPU-based installations of Keras and TensorFlow. If you want a more customized installation, e.g. if you want to take advantage of NVIDIA GPUs, see the documentation for install_keras().
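
For reference, a typical one-time setup looks like the following; treat the GPU line as an example only and check ?install_keras for the options your version supports:

# install.packages("keras")
# library(keras)
# install_keras()                    # default CPU-based TensorFlow backend
# install_keras(tensorflow = "gpu")  # example NVIDIA GPU install (version-dependent)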

Learning Keras

Below we walk through a simple example of using Keras to recognize handwritten digits from the MNIST dataset. After getting familiar with the basics, check out the tutorials and additional learning resources available on the Keras for R website.

The Deep Learning with R book by François Chollet (the creator of Keras) provides a more comprehensive introduction to both Keras and the concepts and practice of deep learning.

You may also find it convenient to download the Deep Learning with Keras cheat sheet, a quick high-level reference to all of the capabilities of Keras.

MNIST Example

We can learn the basics of Keras by walking through a simple example: recognizing handwritten digits from the MNIST dataset. MNIST consists of 28 x 28 grayscale images of handwritten digits, like the samples rendered in the EDA section above.

Keras Data Prep

Data Splits

require(kableExtra)
df <- DigitTotalDF # keep this to reset without reloading
head(df[,1:5],10)
##    label pixel0 pixel1 pixel2 pixel3
## 1      1      0      0      0      0
## 2      0      0      0      0      0
## 3      1      0      0      0      0
## 4      4      0      0      0      0
## 5      0      0      0      0      0
## 6      0      0      0      0      0
## 7      7      0      0      0      0
## 8      3      0      0      0      0
## 9      5      0      0      0      0
## 10     3      0      0      0      0
dim(df)
## [1] 42000   785
nDim <- dim(df)[2]-1
nDim
## [1] 784
###############

x <- modelDataC(df)
train <- x@trn
test <- x@tst
hold <- x@hld

trn <- as.data.frame(dim(train))
names(trn) <- c("dimension_training_train")
tst <- as.data.frame(dim(test))
names(tst) <- c("dimension_testing_test")
hld <- as.data.frame(dim(hold))
names(hld) <- c("dimension_testing_holdout")

dim(train)
## [1] 33604   785
dim(test)
## [1] 8396  785
dim(hold)
## [1] 1676  785
split <- cbind(trn,tst,hld)
split %>% kable() %>% kable_styling()
dimension_training_train  dimension_testing_test  dimension_testing_holdout
                   33604                    8396                       1676
                     785                     785                        785
#view the data sets
head(train[,1:5]) %>% kable() %>% kable_styling()
label  pixel0  pixel1  pixel2  pixel3
    1       0       0       0       0
    0       0       0       0       0
    1       0       0       0       0
    4       0       0       0       0
    0       0       0       0       0
    0       0       0       0       0
head(test[,1:5]) %>% kable() %>% kable_styling()
     label  pixel0  pixel1  pixel2  pixel3
10       3       0       0       0       0
11       8       0       0       0       0
15       3       0       0       0       0
18       0       0       0       0       0
20       5       0       0       0       0
23       2       0       0       0       0
head(hold[,1:5]) %>% kable() %>% kable_styling()
      label  pixel0  pixel1  pixel2  pixel3
11        8       0       0       0       0
24        0       0       0       0       0
81        5       0       0       0       0
100       5       0       0       0       0
104       7       0       0       0       0
111       0       0       0       0       0
#devtools::install_github("rstudio/keras")
library(keras)

x_train <- train[,2:785] #omit label
x_test <- test[,2:785] #omit label
x_hold <- hold[,2:785] #omit label

# convert the feature data frames to matrices for Keras
x_train <- as.matrix(x_train)
x_test <- as.matrix(x_test)
x_hold <- as.matrix(x_hold)

dim(x_train)
## [1] 33604   784
head(x_train[,1:5]) %>% kable() %>% kable_styling()
pixel0  pixel1  pixel2  pixel3  pixel4
     0       0       0       0       0
     0       0       0       0       0
     0       0       0       0       0
     0       0       0       0       0
     0       0       0       0       0
     0       0       0       0       0
dim(x_test)
## [1] 8396  784
head(x_test[,1:5])  %>% kable() %>% kable_styling()
    pixel0  pixel1  pixel2  pixel3  pixel4
10       0       0       0       0       0
11       0       0       0       0       0
15       0       0       0       0       0
18       0       0       0       0       0
20       0       0       0       0       0
23       0       0       0       0       0
dim(x_hold)
## [1] 1676  784
head(x_hold[,1:5])  %>% kable() %>% kable_styling()
     pixel0  pixel1  pixel2  pixel3  pixel4
11        0       0       0       0       0
24        0       0       0       0       0
81        0       0       0       0       0
100       0       0       0       0       0
104       0       0       0       0       0
111       0       0       0       0       0

Features

In the original MNIST arrays, the x data is a 3-d array (images, width, height) of grayscale values; here the Kaggle CSV already stores each 28 x 28 image flattened into a length-784 row, so array_reshape() simply fixes the matrix shape. We then convert the grayscale values from integers ranging from 0 to 255 into floating-point values ranging between 0 and 1:

x_train <- array_reshape(x_train, c(nrow(x_train), nDim))
x_test <- array_reshape(x_test, c(nrow(x_test), nDim))
x_hold <- array_reshape(x_hold, c(nrow(x_hold), nDim))


#check dimensions
dim(x_train)
## [1] 33604   784
dim(x_test)
## [1] 8396  784
dim(x_hold)
## [1] 1676  784
#rescale - don't do this if using PCA input, as it's already scaled
x_train <- x_train / 255
x_test <- x_test / 255
x_hold <- x_hold / 255

Labels

Note that we use the array_reshape() function rather than the dim<-() function to reshape the array. This is so that the data is re-interpreted using row-major semantics (as opposed to R’s default column-major semantics), which is in turn compatible with the way that the numerical libraries called by Keras interpret array dimensions.
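
A quick illustration of the difference (array_reshape() is re-exported by keras from reticulate):

x <- 1:6
array_reshape(x, c(2, 3))  # row-major fill:     1 2 3 / 4 5 6
matrix(x, nrow = 2)        # R's column-major:   1 3 5 / 2 4 6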

The y data is an integer vector with values ranging from 0 to 9. To prepare this data for training we one-hot encode the vectors into binary class matrices using the Keras to_categorical() function:
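
As a quick sanity check of the encoding, labels 0, 1, and 4 with 10 classes produce:

to_categorical(c(0, 1, 4), num_classes = 10)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    0    0    0    0    0    0    0    0     0
## [2,]    0    1    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    1    0    0    0    0     0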

#y_train <- as.numeric(train$label)
y_train <- train$label # these are already R factors
head(y_train)
## [1] 1 0 1 4 0 0
## Levels: 0 1 2 3 4 5 6 7 8 9
#y_test <- as.numeric(test$label)
y_test <- test$label # these are already R factors
head(y_test)
## [1] 3 8 3 0 5 2
## Levels: 0 1 2 3 4 5 6 7 8 9
y_hold <- hold$label # these are already R factors
head(y_hold)
## [1] 8 0 5 5 7 0
## Levels: 0 1 2 3 4 5 6 7 8 9
# categorical conversion
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
y_hold <- to_categorical(y_hold, 10)

Defining the tf.fit

The simplest type of model is the Sequential model, a linear stack of layers. We begin by creating a sequential model and then adding layers using the pipe (%>%) operator:

tf.fit <- keras_model_sequential() 
tf.fit %>% 
  layer_dense(units = 512, activation = 'relu', input_shape = c(nDim)) %>%
  #layer_dense(units = 256, activation = 'relu') %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 256, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = 'softmax')

Fit Summary

The input_shape argument to the first layer specifies the shape of the input data (a length 784 numeric vector representing a grayscale image). The final layer outputs a length 10 numeric vector (probabilities for each digit) using a softmax activation function.
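
The parameter counts in the summary below follow directly from these shapes: for example, the first dense layer learns a 784 x 512 weight matrix plus 512 biases, i.e. 784 x 512 + 512 = 401,920 parameters.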

Use the summary() function to print the details of the tf.fit:

summary(tf.fit)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 512)                     401920      
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 512)                     0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 256)                     131328      
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 256)                     0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 64)                      16448       
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 64)                      0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 10)                      650         
## ================================================================================
## Total params: 550,346
## Trainable params: 550,346
## Non-trainable params: 0
## ________________________________________________________________________________

Fit Compile

Next, compile the tf.fit with an appropriate loss function, optimizer, and metrics:

tf.fit %>% keras::compile(
  loss = 'categorical_crossentropy',
#  optimizer = optimizer_rmsprop(),
  optimizer = 'rmsprop',
  # Passing the optimizer as a string is a workaround given on Stack Overflow,
  # which made this run on Linux with TF 2.3.1; it limits the ability to change
  # the learning rate, though.
  metrics = c('accuracy')
)
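
If the optimizer object works in your environment, the learning rate can be tuned directly. Note this is an alternative to the compile call above, not an addition, and that the argument is named lr in older keras releases and learning_rate in newer ones:

tf.fit %>% keras::compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001), # rename to learning_rate on newer versions
  metrics = c('accuracy')
)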

Training and Evaluation

Use the fit() function to train the tf.fit for 30 epochs using batches of 128 images:

ptm <- proc.time() #timer

tf.train <- tf.fit %>% fit(
  x_train, y_train,
  #epochs = 10, batch_size = 10,
  epochs = 30, batch_size = 128,
  validation_split = 0.2
)
proc.time() - ptm
##    user  system elapsed 
##  211.43   61.64  109.69

Plot

The history object returned by fit() includes loss and accuracy metrics which we can plot:

plot(tf.train)
## `geom_smooth()` using formula 'y ~ x'

Results

# Alluvial plot of a caret confusion matrix: correct cells grey, errors red
plotCM <- function(cm){
  cmdf <- as.data.frame(cm[["table"]])
  cmdf[["color"]] <- ifelse(cmdf[[1]] == cmdf[[2]], "grey", "red")
  
  alluvial::alluvial(cmdf[,1:2]
                     , freq = cmdf$Freq
                     , col = cmdf[["color"]]
                     , alpha = 0.5
                     , hide  = cmdf$Freq == 0
                     )
}
 
require(plotly)
ggPlotCM <- function(cm){
  ## ggplotly version
  cm_d <- as.data.frame(cm$table)
  # confusion matrix statistics as data.frame
  cm_st <- data.frame(cm$overall)
  # round the values
  cm_st$cm.overall <- round(cm_st$cm.overall, 2)

  # here we also have the rounded percentage values
  cm_p <- as.data.frame(prop.table(cm$table))
  cm_d$Perc <- round(cm_p$Freq*100, 2)

  # plotting the matrix
  cm_d_p <- ggplot(data = cm_d, aes(x = Prediction, y = Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = paste("", Freq, ",", Perc, "%")), color = 'red', size = 3) +
    theme_light() +
    guides(fill = FALSE) + theme_1()

  # plotting the stats
  #cm_st_p <- tableGrob(cm_st)

  cm_d_ply <- ggplotly(cm_d_p)
  return(cm_d_ply)
}

Testing

Evaluate on test

Evaluate the tf.fit’s performance on the test data:

tf.fit %>% keras::evaluate(x_test, y_test)
##      loss  accuracy 
## 0.1750664 0.9767746

Confusion Matrix Test

# Generate predictions on new data:
pred <- tf.fit %>% predict_classes(x_test)
#table(Predicted = pred, Actual = test$label)
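# Note (not in the original post): predict_classes() was removed in newer
# Keras releases (TF >= 2.6). An equivalent on current versions would be:
# pred <- tf.fit %>% predict(x_test) %>% max.col() - 1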

cm <- confusionMatrix(factor(pred), factor(test$label))
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 815   0   1   3   2   2   4   3   3   3
##          1   0 925   0   0   0   0   1   2   5   0
##          2   0   3 817   5   1   2   0   4   3   0
##          3   3   0   3 847   0  10   0   0   4   5
##          4   1   5   2   0 804   0   3   1   2  19
##          5   0   0   0   3   1 730   1   0   6   0
##          6   2   0   1   0   1   7 814   0   2   1
##          7   1   1   7   7   0   0   0 865   1   5
##          8   3   1   4   0   0   4   3   1 782   2
##          9   1   1   0   5   5   4   1   4   4 802
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9768          
##                  95% CI : (0.9733, 0.9799)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9742          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98668   0.9882  0.97844   0.9736  0.98771  0.96179
## Specificity           0.99723   0.9989  0.99762   0.9967  0.99565  0.99856
## Pos Pred Value        0.97488   0.9914  0.97844   0.9713  0.96057  0.98516
## Neg Pred Value        0.99854   0.9985  0.99762   0.9969  0.99868  0.99621
## Prevalence            0.09838   0.1115  0.09945   0.1036  0.09695  0.09040
## Detection Rate        0.09707   0.1102  0.09731   0.1009  0.09576  0.08695
## Detection Prevalence  0.09957   0.1111  0.09945   0.1039  0.09969  0.08826
## Balanced Accuracy     0.99195   0.9936  0.98803   0.9851  0.99168  0.98018
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.98428   0.9830  0.96305  0.95818
## Specificity           0.99815   0.9971  0.99763  0.99669
## Pos Pred Value        0.98309   0.9752  0.97750  0.96977
## Neg Pred Value        0.99828   0.9980  0.99605  0.99538
## Prevalence            0.09850   0.1048  0.09671  0.09969
## Detection Rate        0.09695   0.1030  0.09314  0.09552
## Detection Prevalence  0.09862   0.1056  0.09528  0.09850
## Balanced Accuracy     0.99122   0.9900  0.98034  0.97744
#plotCM(cm)
#ggPlotCM(cm)

Evaluate on hold out

Evaluate the tf.fit’s performance on the hold out data:

tf.fit %>% keras::evaluate(x_hold, y_hold)
##      loss  accuracy 
## 0.1880576 0.9743437

Confusion Matrix Holdout

# Generate predictions on new data:
pred2 <- tf.fit %>% predict_classes(x_hold)
#table(Predicted = pred2, Actual = hold$label)

cm <- confusionMatrix(factor(pred2), factor(hold$label))
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 164   0   0   1   0   0   3   1   1   1
##          1   0 184   0   0   0   0   0   1   0   0
##          2   0   1 161   3   0   0   0   0   0   0
##          3   1   0   1 164   0   2   0   0   1   1
##          4   0   1   1   0 160   0   1   0   0   3
##          5   0   0   0   2   0 145   0   0   0   0
##          6   0   0   1   0   1   2 161   0   0   0
##          7   0   0   1   3   0   0   0 173   0   0
##          8   0   1   2   0   0   1   0   0 159   0
##          9   0   0   0   1   1   1   0   1   1 162
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9743          
##                  95% CI : (0.9656, 0.9814)
##     No Information Rate : 0.1116          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9715          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.99394   0.9840  0.96407  0.94253  0.98765  0.96026
## Specificity           0.99537   0.9993  0.99735  0.99601  0.99604  0.99869
## Pos Pred Value        0.95906   0.9946  0.97576  0.96471  0.96386  0.98639
## Neg Pred Value        0.99934   0.9980  0.99603  0.99336  0.99868  0.99608
## Prevalence            0.09845   0.1116  0.09964  0.10382  0.09666  0.09010
## Detection Rate        0.09785   0.1098  0.09606  0.09785  0.09547  0.08652
## Detection Prevalence  0.10203   0.1104  0.09845  0.10143  0.09905  0.08771
## Balanced Accuracy     0.99465   0.9916  0.98071  0.96927  0.99185  0.97948
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.97576   0.9830  0.98148  0.97006
## Specificity           0.99735   0.9973  0.99736  0.99669
## Pos Pred Value        0.97576   0.9774  0.97546  0.97006
## Neg Pred Value        0.99735   0.9980  0.99802  0.99669
## Prevalence            0.09845   0.1050  0.09666  0.09964
## Detection Rate        0.09606   0.1032  0.09487  0.09666
## Detection Prevalence  0.09845   0.1056  0.09726  0.09964
## Balanced Accuracy     0.98656   0.9901  0.98942  0.98337
#plotCM(cm)
#ggPlotCM(cm)

Keras CNN Model

The previous run attained 97.68% accuracy on the test set. The following code tries to improve on that by using a convolutional neural network (CNN). The MNIST data preparation is a little different with this approach.

batch_size <- 128
num_classes <- 10
epochs <- 12

# Input image dimensions
img_rows <- 28
img_cols <- 28

Data Prep

library(keras)

#The data, shuffled and split between train and test sets
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

Reshape Arrays

x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))
x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1))
input_shape <- c(img_rows, img_cols, 1)

Normalize Data

# Grayscale values converted from integers 0-255 to values ranging from 0 to 1
x_train <- x_train / 255
x_test <- x_test / 255

cat('x_train_shape:', dim(x_train), '\n')
## x_train_shape: 60000 28 28 1
cat(nrow(x_train), 'train samples\n')
## 60000 train samples
cat(nrow(x_test), 'test samples\n')
## 10000 test samples

Convert labels to binary matrices

#Convert class vectors to binary class matrices
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)

Define model

The key differences in the CNN model definition are the structure of the data inputs and the use of two-dimensional convolutional layers, implemented in the code as layer_conv_2d().

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = input_shape) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = num_classes, activation = 'softmax')

Fit Summary

summary(model)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv2d (Conv2D)                     (None, 26, 26, 32)              320         
## ________________________________________________________________________________
## conv2d_1 (Conv2D)                   (None, 24, 24, 64)              18496       
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D)        (None, 12, 12, 64)              0           
## ________________________________________________________________________________
## dropout_3 (Dropout)                 (None, 12, 12, 64)              0           
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 9216)                    0           
## ________________________________________________________________________________
## dense_4 (Dense)                     (None, 128)                     1179776     
## ________________________________________________________________________________
## dropout_4 (Dropout)                 (None, 128)                     0           
## ________________________________________________________________________________
## dense_5 (Dense)                     (None, 10)                      1290        
## ================================================================================
## Total params: 1,199,882
## Trainable params: 1,199,882
## Non-trainable params: 0
## ________________________________________________________________________________
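
The shapes and parameter counts above follow from the layer definitions: each 3 x 3 valid convolution trims two pixels per side (28 to 26 to 24), max pooling halves the maps to 12 x 12, and flattening 12 x 12 x 64 gives the 9,216 inputs to the dense layer. Likewise the first convolutional layer has 32 x (3 x 3 x 1) + 32 = 320 parameters.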

Compile CNN model

model %>% keras::compile(
  loss = loss_categorical_crossentropy,
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
)

Train CNN model

ptm <- proc.time() #timer

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = 0.2
)

proc.time() - ptm
##    user  system elapsed 
## 3031.71  612.77 1412.36

Test CNN Model

scores <- model %>% keras::evaluate(
  x_test, y_test, verbose = 0
)

# Output metrics
cat('Test loss:', scores[[1]], '\n')
## Test loss: 0.02950084
cat('Test accuracy:', scores[[2]], '\n')
## Test accuracy: 0.9909
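
For parity with the earlier sections, the same caret confusion matrix analysis could be run on the CNN's test predictions. A hedged sketch (not run in the original, and using predict() plus max.col() since predict_classes() is deprecated on newer versions):

# cnn_pred <- model %>% predict(x_test) %>% max.col() - 1
# confusionMatrix(factor(cnn_pred), factor(mnist$test$y))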

Conclusions

We set out to demonstrate that deep learning methods can be used to train a machine to recognize handwritten digits. After this demonstration, we can safely conclude that computer vision is very effective here, with accuracy of 97.68% and above. The second case, Keras with the CNN approach, yielded a slight increase to 99.09% accuracy. While the CNN approach did produce better accuracy, its training time was roughly 13x longer (about 1,412 seconds elapsed versus 110), indicating trade-offs to consider. In a larger-scale training initiative where one iteration took 1 hour with the first method, the second method would burn more than an entire workday. Is the 1.4% gain worth it? That depends on the use case, but either way you'll have to decide.