# import libraries
library(keras)      # deep learning interface to Keras/TensorFlow
library(tidyverse)  # data wrangling and plotting
library(caret)      # confusionMatrix()
library(tidymodels) # modelling framework (also loads rsample)
library(rsample)    # initial_split() for the train-validation split

1 Introduction

The MNIST dataset is one of the most popular datasets used to introduce people to image classification with deep learning or neural networks. It is a collection of 28 x 28 pixel images of handwritten digits from 0 to 9, drawn white on a black background. Several datasets are related to MNIST; one of them is the Handwritten Symbol database version 2 (HASYv2). HASYv2 is similar to MNIST in that it contains handwritten digits, but it goes beyond that: it also contains handwritten Latin letters (both uppercase and lowercase) and most common handwritten mathematical symbols. This makes it more versatile than MNIST, since it can be used to build a handwritten mathematical equation recognizer, or even a handwritten mathematical equation solver. Before building such an application, we will start by creating a recognizer for handwritten digits and arithmetic operation symbols, which should form a basis for the more advanced applications. Other differences from MNIST include the image size of 32 x 32 pixels and the fact that the writing is black on white instead of white on black.

2 Data Preparation

# read the train and test splits of the first classification-task fold
train <- read.csv("HASYv2/classification-task/fold-1/train.csv")
test <- read.csv("HASYv2/classification-task/fold-1/test.csv")
head(train)
nrow(train)
#> [1] 151241
nrow(test)
#> [1] 16992

As can be seen above, this dataset contains 151241 training images and 16992 testing images. There are also 369 label classes according to the [HASYv2 paper](https://arxiv.org/abs/1701.08380). We are going to limit the classes to 15 to shorten the training time and prevent crashes. The classes will be the 10 numerical digits from 0 to 9, the + and - symbols, and the multiplication, division and root symbols. These are the symbols most commonly used in mathematical equations. We would like to include the = symbol as well, but unfortunately it is not part of the dataset.
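If you want to check which symbol each symbol_id corresponds to, the HASYv2 archive ships with a symbols.csv lookup table; a quick sketch, assuming the default archive layout:

symbols <- read.csv("HASYv2/symbols.csv")
symbols %>% 
  filter(symbol_id %in% c(70:79, 195, 196, 513, 526, 960)) # the latex column names each symbol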

Before we subset the data, let's check for any missing values.

anyNA(train)
#> [1] FALSE
anyNA(test)
#> [1] FALSE

There are no missing values, so we can proceed directly to processing the data.

train2 <- train %>% 
  filter(symbol_id %in% c(70:79, 195, 196, 513, 526, 960)) %>% # subset the data to the 15 selected classes
  mutate(filepath = str_replace(path, "../../", "HASYv2/"), # rewrite the file paths so they point to the images relative to this Rmd
         label = case_when(symbol_id == 70 ~ "0",
                           symbol_id == 71 ~ "1",
                           symbol_id == 72 ~ "2",
                           symbol_id == 73 ~ "3",
                           symbol_id == 74 ~ "4",
                           symbol_id == 75 ~ "5",
                           symbol_id == 76 ~ "6",
                           symbol_id == 77 ~ "7",
                           symbol_id == 78 ~ "8",
                           symbol_id == 79 ~ "9",
                           symbol_id == 195 ~ "-",
                           symbol_id == 196 ~ "+",
                           symbol_id == 513 ~ "multiplication",
                           symbol_id == 526 ~ "division",
                           symbol_id == 960 ~ "root")) %>% 
  select(c(label,filepath)) 

test2 <- test %>% 
  filter(symbol_id %in% c(70:79, 195, 196, 513, 526, 960)) %>% 
  mutate(filepath = str_replace(path, "../../", "HASYv2/"),
         label = case_when(symbol_id == 70 ~ "0",
                           symbol_id == 71 ~ "1",
                           symbol_id == 72 ~ "2",
                           symbol_id == 73 ~ "3",
                           symbol_id == 74 ~ "4",
                           symbol_id == 75 ~ "5",
                           symbol_id == 76 ~ "6",
                           symbol_id == 77 ~ "7",
                           symbol_id == 78 ~ "8",
                           symbol_id == 79 ~ "9",
                           symbol_id == 195 ~ "-",
                           symbol_id == 196 ~ "+",
                           symbol_id == 513 ~ "multiplication",
                           symbol_id == 526 ~ "division",
                           symbol_id == 960 ~ "root")) %>% 
  select(c(label,filepath))
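Let us also look at how many images of each class remain after subsetting; a quick sanity check, not part of the original workflow:

# class distribution of the subset
count(train2, label)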

3 Train-Validation Split

Now let us create a train-validation split from the training dataset, stratified by label so that every class is represented in both parts.

set.seed(100)
splitter <- initial_split(data = train2, prop = 0.8, strata = label)
train3 <- training(splitter)
val <- testing(splitter)
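To confirm that the stratified split kept the class proportions similar, we can compare the label distributions of the two parts (an optional check):

round(prop.table(table(train3$label)), 3)
round(prop.table(table(val$label)), 3)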

4 Setting Up The Image Augmentation And Data Sample Generation

Data augmentation is a neural network training practice that improves the generalisation of the model, i.e. reduces overfitting. The image samples are randomly rotated, shifted, sheared and zoomed during training in order to force the model to learn which features are genuinely important and which are merely incidental to a particular sample.

image_size <- c(32,32)
batch_size <- 128

image_gen_train <- image_data_generator(rescale = 1/255,
                                        rotation_range = 15,
                                        width_shift_range = 0.1,
                                        height_shift_range = 0.1,
                                        shear_range = 0.1,
                                        zoom_range = 0.1,
                                        fill_mode = "nearest")

image_gen_val <- image_data_generator(rescale = 1/255)

train_gen <-  flow_images_from_dataframe(dataframe = train3,
                                         x_col = "filepath",
                                         y_col = "label",
                                         generator = image_gen_train,
                                         target_size = image_size,
                                         color_mode = "grayscale",
                                         class_mode = "categorical",
                                         batch_size = batch_size,
                                         shuffle = TRUE,
                                         seed = 100)

val_gen <-  flow_images_from_dataframe(dataframe = val,
                                         x_col = "filepath",
                                         y_col = "label",
                                         generator = image_gen_val,
                                         target_size = image_size,
                                         color_mode = "grayscale",
                                         class_mode = "categorical",
                                         batch_size = batch_size,
                                         shuffle = FALSE)
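Before training, it is worth pulling a single batch from the generator to verify the shapes and the one-hot labels; a small sketch using generator_next(), not part of the original workflow:

# fetch one augmented batch: a list of (images, labels)
batch <- generator_next(train_gen)
dim(batch[[1]]) # expected: 128 32 32 1, a batch of grayscale images
dim(batch[[2]]) # expected: 128 15, one-hot encoded labels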

5 Building The First Model

In this model we will use a train-validation-test split strategy. The validation dataset is used during training to measure how accurately the model predicts the classes of images it is not trained on. The test dataset is used afterwards to evaluate the performance of the model on truly unseen data.

This model is a deep learning model consisting of a flatten input layer, three hidden dense layers, and an output layer.

# build model

initializer <- initializer_random_normal(seed = 100)

model1 <- keras_model_sequential() %>% 
  layer_flatten(name = "dense_flatten", input_shape = c(image_size,1)) %>% 
  layer_dense(units = 512, activation = "relu",
              kernel_initializer = initializer, bias_initializer = initializer, name="hidden1") %>% 
  layer_dense(units = 512, activation = "relu",
              kernel_initializer = initializer, bias_initializer = initializer, name="hidden2") %>% 
  layer_dense(units = 512, activation = "relu", 
              kernel_initializer = initializer, bias_initializer = initializer, name = "hidden3") %>% 
  layer_dense(units = 15, activation = "softmax", 
              kernel_initializer = initializer, bias_initializer = initializer, name = "output") %>% 
  compile(loss = "categorical_crossentropy",
          optimizer = optimizer_adam(lr = 0.001),
          metrics = "accuracy")
summary(model1)
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> dense_flatten (Flatten)             (None, 1024)                    0           
#> ________________________________________________________________________________
#> hidden1 (Dense)                     (None, 512)                     524800      
#> ________________________________________________________________________________
#> hidden2 (Dense)                     (None, 512)                     262656      
#> ________________________________________________________________________________
#> hidden3 (Dense)                     (None, 512)                     262656      
#> ________________________________________________________________________________
#> output (Dense)                      (None, 15)                      7695        
#> ================================================================================
#> Total params: 1,057,807
#> Trainable params: 1,057,807
#> Non-trainable params: 0
#> ________________________________________________________________________________
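The parameter counts in the summary can be verified by hand: a dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases.

# dense layer parameters = inputs * units + units (biases)
1024 * 512 + 512 # hidden1: 524800 (flatten outputs 32 * 32 * 1 = 1024 values)
512 * 512 + 512  # hidden2 and hidden3: 262656 each
512 * 15 + 15    # output: 7695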

6 First Model Fitting

steps_per_epoch <- ceiling(nrow(train3) / batch_size)

validation_steps <- ceiling(nrow(val) / batch_size)

history1 <- model1 %>% fit_generator(generator = train_gen,
                                     steps_per_epoch = steps_per_epoch,
                                     epochs = 30,
                                     validation_data = val_gen,
                                     validation_steps = validation_steps)
# save model
#model1 %>% 
#  save_model_hdf5("model1_digsym_keras.hdf5")
# load model
model1 <- load_model_hdf5("model1_digsym_keras.hdf5")
# save history1
#saveRDS(history1, file = "history1.RDS")
# load history1
history1 <- readRDS(file = "history1.RDS")
plot(history1)

The plots show that, with regard to the validation dataset, our model did not overfit. The validation accuracy (around 95%) is consistently higher than the training accuracy (around 90%), which is expected here: augmentation is applied only to the training batches, making them harder to classify than the untouched validation images.
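If you prefer exact numbers over reading them off the plot, the final-epoch metrics can be pulled from the history object; a sketch, noting that the metric names may differ slightly across keras versions:

# last recorded training and validation accuracy
tail(history1$metrics$accuracy, 1)
tail(history1$metrics$val_accuracy, 1)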

7 Test Data Processing

image_prep <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = image_size, 
                      grayscale = TRUE # set to FALSE to load RGB images
                      )
    
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x))) # add a leading batch dimension
    x <- x/255 # rescale pixel values to [0, 1], matching the generators
  })
  do.call(abind::abind, c(arrays, list(along = 1))) # stack into one array
}

test_x <- image_prep(test2$filepath)

# Check dimension of testing data set
dim(test_x)
#> [1] 410  32  32   1
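As an extra sanity check, not in the original workflow, we can render one preprocessed test image to confirm the scaling and orientation look right:

# draw the first test image; pixel values are already in [0, 1]
plot(as.raster(test_x[1, , , 1]))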

8 Test Data Prediction With The First Model

pred_test <- predict_classes(model1, test_x)
head(pred_test,10)
#>  [1] 2 2 2 2 2 2 2 2 2 2
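Note that predict_classes() was removed from Keras in newer TensorFlow releases; on such setups the same 0-based class indices can be recovered from the predicted probabilities, as in this sketch:

# alternative for keras versions without predict_classes():
prob <- predict(model1, test_x)  # one row of class probabilities per image
pred_alt <- max.col(prob) - 1    # 0-based index of the most probable class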
# to convert the class predictions to human-readable labels, we read the class indices from train_gen
train_gen$class_indices
#> $`+`
#> [1] 0
#> 
#> $`-`
#> [1] 1
#> 
#> $`0`
#> [1] 2
#> 
#> $`1`
#> [1] 3
#> 
#> $`2`
#> [1] 4
#> 
#> $`3`
#> [1] 5
#> 
#> $`4`
#> [1] 6
#> 
#> $`5`
#> [1] 7
#> 
#> $`6`
#> [1] 8
#> 
#> $`7`
#> [1] 9
#> 
#> $`8`
#> [1] 10
#> 
#> $`9`
#> [1] 11
#> 
#> $division
#> [1] 12
#> 
#> $multiplication
#> [1] 13
#> 
#> $root
#> [1] 14

Based on the dictionary above, we decode the class indices into human-readable labels.

decode <- function(x){
                 case_when(x == 0 ~ "+",
                           x == 1 ~ "-",
                           x == 2 ~ "0",
                           x == 3 ~ "1",
                           x == 4 ~ "2",
                           x == 5 ~ "3",
                           x == 6 ~ "4",
                           x == 7 ~ "5",
                           x == 8 ~ "6",
                           x == 9 ~ "7",
                           x == 10 ~ "8",
                           x == 11 ~ "9",
                           x == 12 ~ "division",
                           x == 13 ~ "multiplication",
                           x == 14 ~ "root")
}

pred_test <- sapply(pred_test, decode) 

head(pred_test, 10)
#>  [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
confusionMatrix(as.factor(pred_test), 
                as.factor(test2$label)
                )
#> Confusion Matrix and Statistics
#> 
#>                 Reference
#> Prediction         -   +   0   1   2   3   4   5   6   7   8   9 division
#>   -               12   1   0   0   0   0   0   0   0   0   0   0        0
#>   +                0   6   0   0   0   0   0   0   0   0   0   0        0
#>   0                0   0  14   0   0   0   0   0   0   0   0   0        0
#>   1                0   0   0  10   0   1   0   0   0   0   0   0        0
#>   2                0   0   0   0  12   0   0   0   0   0   0   0        0
#>   3                0   0   0   0   0   9   0   1   0   0   1   3        0
#>   4                0   0   0   0   0   0   3   0   0   0   0   0        0
#>   5                0   0   0   0   0   0   0   7   1   0   0   0        1
#>   6                0   0   0   0   0   0   0   0   8   0   4   0        0
#>   7                0   0   0   1   0   1   0   0   0   8   0   0        0
#>   8                0   0   0   0   0   0   0   0   0   0   7   0        0
#>   9                0   0   0   0   0   1   0   0   0   0   0   6        1
#>   division         0   2   0   0   0   0   1   0   0   0   0   0       32
#>   multiplication   0   0   0   0   1   0   3   0   0   0   0   0        0
#>   root             0   0   0   1   0   0   0   0   1   0   1   0        0
#>                 Reference
#> Prediction       multiplication root
#>   -                           0    0
#>   +                           0    0
#>   0                           0    0
#>   1                           0    0
#>   2                           0    0
#>   3                           0    0
#>   4                           0    0
#>   5                           0    0
#>   6                           0    0
#>   7                           0    0
#>   8                           0    0
#>   9                           0    0
#>   division                    0    0
#>   multiplication            151    0
#>   root                        0   98
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.9341               
#>                  95% CI : (0.9056, 0.9562)     
#>     No Information Rate : 0.3683               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9163               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: - Class: + Class: 0 Class: 1 Class: 2 Class: 3
#> Sensitivity           1.00000  0.66667  1.00000  0.83333  0.92308  0.75000
#> Specificity           0.99749  1.00000  1.00000  0.99749  1.00000  0.98744
#> Pos Pred Value        0.92308  1.00000  1.00000  0.90909  1.00000  0.64286
#> Neg Pred Value        1.00000  0.99257  1.00000  0.99499  0.99749  0.99242
#> Prevalence            0.02927  0.02195  0.03415  0.02927  0.03171  0.02927
#> Detection Rate        0.02927  0.01463  0.03415  0.02439  0.02927  0.02195
#> Detection Prevalence  0.03171  0.01463  0.03415  0.02683  0.02927  0.03415
#> Balanced Accuracy     0.99874  0.83333  1.00000  0.91541  0.96154  0.86872
#>                      Class: 4 Class: 5 Class: 6 Class: 7 Class: 8 Class: 9
#> Sensitivity          0.428571  0.87500  0.80000  1.00000  0.53846  0.66667
#> Specificity          1.000000  0.99502  0.99000  0.99502  1.00000  0.99501
#> Pos Pred Value       1.000000  0.77778  0.66667  0.80000  1.00000  0.75000
#> Neg Pred Value       0.990172  0.99751  0.99497  1.00000  0.98511  0.99254
#> Prevalence           0.017073  0.01951  0.02439  0.01951  0.03171  0.02195
#> Detection Rate       0.007317  0.01707  0.01951  0.01951  0.01707  0.01463
#> Detection Prevalence 0.007317  0.02195  0.02927  0.02439  0.01707  0.01951
#> Balanced Accuracy    0.714286  0.93501  0.89500  0.99751  0.76923  0.83084
#>                      Class: division Class: multiplication Class: root
#> Sensitivity                  0.94118                1.0000      1.0000
#> Specificity                  0.99202                0.9846      0.9904
#> Pos Pred Value               0.91429                0.9742      0.9703
#> Neg Pred Value               0.99467                1.0000      1.0000
#> Prevalence                   0.08293                0.3683      0.2390
#> Detection Rate               0.07805                0.3683      0.2390
#> Detection Prevalence         0.08537                0.3780      0.2463
#> Balanced Accuracy            0.96660                0.9923      0.9952

The accuracy of the first model on the unseen test data is 93%, higher than its training accuracy.
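The same headline number can be computed directly, without the full confusionMatrix() output:

# overall test accuracy as the fraction of correct predictions
mean(pred_test == test2$label)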

9 Train The Same Model With The Whole Train Data

Now we will use the whole training dataset, without splitting it, to train our model. Although this model has the same architecture and hyperparameters and differs only in the amount of training data, we will call the model in this section the second model.

train_gen2 <-  flow_images_from_dataframe(dataframe = train2,
                                         x_col = "filepath",
                                         y_col = "label",
                                         generator = image_gen_train,
                                         target_size = image_size,
                                         color_mode = "grayscale",
                                         class_mode = "categorical",
                                         batch_size = batch_size,
                                         shuffle = TRUE,
                                         seed = 100)
# build model

initializer <- initializer_random_normal(seed = 100)

model2 <- keras_model_sequential() %>% 
  layer_flatten(name = "dense_flatten", input_shape = c(image_size,1)) %>% 
  layer_dense(units = 512, activation = "relu",
              kernel_initializer = initializer, bias_initializer = initializer, name="hidden1") %>% 
  layer_dense(units = 512, activation = "relu",
              kernel_initializer = initializer, bias_initializer = initializer, name="hidden2") %>% 
  layer_dense(units = 512, activation = "relu", 
              kernel_initializer = initializer, bias_initializer = initializer, name = "hidden3") %>% 
  layer_dense(units = 15, activation = "softmax", 
              kernel_initializer = initializer, bias_initializer = initializer, name = "output") %>% 
  compile(loss = "categorical_crossentropy",
          optimizer = optimizer_adam(lr = 0.001),
          metrics = "accuracy")
summary(model2)
#> Model: "sequential_1"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> dense_flatten (Flatten)             (None, 1024)                    0           
#> ________________________________________________________________________________
#> hidden1 (Dense)                     (None, 512)                     524800      
#> ________________________________________________________________________________
#> hidden2 (Dense)                     (None, 512)                     262656      
#> ________________________________________________________________________________
#> hidden3 (Dense)                     (None, 512)                     262656      
#> ________________________________________________________________________________
#> output (Dense)                      (None, 15)                      7695        
#> ================================================================================
#> Total params: 1,057,807
#> Trainable params: 1,057,807
#> Non-trainable params: 0
#> ________________________________________________________________________________
steps_per_epoch <- ceiling(nrow(train2) / batch_size)

history2 <- model2 %>% fit_generator(generator = train_gen2,
                                     steps_per_epoch = steps_per_epoch,
                                     epochs = 30)
# save model
#model2 %>% 
#  save_model_hdf5("model2_digsym_keras.hdf5")
#load model
model2 <- load_model_hdf5("model2_digsym_keras.hdf5")
# save history2
#saveRDS(history2, file = "history2.RDS")
# load history2
history2 <- readRDS(file = "history2.RDS")
plot(history2)

10 Test Data Prediction With The Second Model

pred_test2 <- predict_classes(model2, test_x)
head(pred_test2,10)
#>  [1] 2 2 2 2 2 2 2 2 2 2
pred_test2 <- sapply(pred_test2, decode) 

head(pred_test2, 10)
#>  [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
confusionMatrix(as.factor(pred_test2), 
                as.factor(test2$label)
                )
#> Confusion Matrix and Statistics
#> 
#>                 Reference
#> Prediction         -   +   0   1   2   3   4   5   6   7   8   9 division
#>   -               11   0   0   0   0   0   0   0   0   0   0   0        0
#>   +                0   8   0   0   0   0   0   0   0   0   0   0        0
#>   0                0   0  14   0   0   0   0   0   0   0   0   0        0
#>   1                0   0   0   9   0   0   0   0   0   0   0   0        0
#>   2                0   0   0   1  13   1   0   0   0   2   0   0        0
#>   3                0   0   0   0   0   5   0   1   0   0   0   1        0
#>   4                0   0   0   0   0   0   6   0   0   0   0   0        0
#>   5                0   0   0   0   0   1   0   7   0   0   1   0        0
#>   6                0   0   0   0   0   0   0   0   6   0   1   0        0
#>   7                0   0   0   0   0   0   0   0   0   6   0   0        0
#>   8                0   0   0   0   0   0   0   0   4   0  11   0        0
#>   9                0   0   0   1   0   5   0   0   0   0   0   8        1
#>   division         1   1   0   0   0   0   1   0   0   0   0   0       33
#>   multiplication   0   0   0   1   0   0   0   0   0   0   0   0        0
#>   root             0   0   0   0   0   0   0   0   0   0   0   0        0
#>                 Reference
#> Prediction       multiplication root
#>   -                           0    0
#>   +                           0    0
#>   0                           0    0
#>   1                           0    0
#>   2                           0    3
#>   3                           0    0
#>   4                           0    0
#>   5                           0    0
#>   6                           0    0
#>   7                           0    0
#>   8                           0    0
#>   9                           0    0
#>   division                    0    0
#>   multiplication            151    0
#>   root                        0   95
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.9341               
#>                  95% CI : (0.9056, 0.9562)     
#>     No Information Rate : 0.3683               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9169               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: - Class: + Class: 0 Class: 1 Class: 2 Class: 3
#> Sensitivity           0.91667  0.88889  1.00000  0.75000  1.00000  0.41667
#> Specificity           1.00000  1.00000  1.00000  1.00000  0.98237  0.99497
#> Pos Pred Value        1.00000  1.00000  1.00000  1.00000  0.65000  0.71429
#> Neg Pred Value        0.99749  0.99751  1.00000  0.99252  1.00000  0.98263
#> Prevalence            0.02927  0.02195  0.03415  0.02927  0.03171  0.02927
#> Detection Rate        0.02683  0.01951  0.03415  0.02195  0.03171  0.01220
#> Detection Prevalence  0.02683  0.01951  0.03415  0.02195  0.04878  0.01707
#> Balanced Accuracy     0.95833  0.94444  1.00000  0.87500  0.99118  0.70582
#>                      Class: 4 Class: 5 Class: 6 Class: 7 Class: 8 Class: 9
#> Sensitivity           0.85714  0.87500  0.60000  0.75000  0.84615  0.88889
#> Specificity           1.00000  0.99502  0.99750  1.00000  0.98992  0.98254
#> Pos Pred Value        1.00000  0.77778  0.85714  1.00000  0.73333  0.53333
#> Neg Pred Value        0.99752  0.99751  0.99007  0.99505  0.99494  0.99747
#> Prevalence            0.01707  0.01951  0.02439  0.01951  0.03171  0.02195
#> Detection Rate        0.01463  0.01707  0.01463  0.01463  0.02683  0.01951
#> Detection Prevalence  0.01463  0.02195  0.01707  0.01463  0.03659  0.03659
#> Balanced Accuracy     0.92857  0.93501  0.79875  0.87500  0.91804  0.93572
#>                      Class: division Class: multiplication Class: root
#> Sensitivity                  0.97059                1.0000      0.9694
#> Specificity                  0.99202                0.9961      1.0000
#> Pos Pred Value               0.91667                0.9934      1.0000
#> Neg Pred Value               0.99733                1.0000      0.9905
#> Prevalence                   0.08293                0.3683      0.2390
#> Detection Rate               0.08049                0.3683      0.2317
#> Detection Prevalence         0.08780                0.3707      0.2317
#> Balanced Accuracy            0.98130                0.9981      0.9847

It turns out that the test accuracy is exactly the same, 93%. For our model, training on the whole training dataset does not improve accuracy on the test data.

11 Conclusion

We have trained a deep learning model consisting of three hidden dense layers to predict handwritten digits, the arithmetic operation symbols (+, -, multiplication, division) and the root symbol. Our model achieves a pretty good accuracy on the test dataset, 93%. We also found that, in this case, using a train-validation-test split rather than a simple train-test split made no difference to the test accuracy.