In this short article, we will build a neural network model that can identify handwritten digits, using the famous MNIST dataset.

Library and Setup

The following packages are required if you wish to replicate the code in R.

library(dplyr)      # data wrangling
library(keras)      # high-level interface for building neural networks
library(tensorflow) # backend used by keras
library(yardstick)  # evaluation metrics

options(scipen = 999) # turn off scientific notation in printed output

Import Data

The data consists of a train set and a test set. However, the test set does not have any labels, so the only way to measure performance on it is to submit predictions to the Kaggle competition. For now, we will only use the train set to train and evaluate the model.

train_mnist <- read.csv("data/train.csv")

cat("Data dimension:", dim(train_mnist))
## Data dimension: 42000 785

The data has 42,000 observations and 785 variables. The first variable (label) is the target variable, while the rest are the brightness values of each pixel of the image.
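
As a quick sanity check on the data, we can confirm that the labels span the digits 0-9 and that the pixel columns contain 8-bit brightness values (a minimal sketch in base R):

# Labels should be the digits 0-9; pixel values should lie in 0-255
range(train_mnist$label)
range(as.matrix(train_mnist[, -1]))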

Cross Validation

We will split the data into a training and a testing set, with 90% of the data used to train the model. We also check the proportion of the target variable in case there is an imbalance between classes.

set.seed(123)
index <- sample(x = nrow(train_mnist), 
                size = nrow(train_mnist) * 0.9 )

data_train <- train_mnist[ index, ]
data_test <- train_mnist[ -index, ]

table(data_train$label) %>% prop.table()
## 
##          0          1          2          3          4          5          6 
## 0.09896825 0.11158730 0.09936508 0.10325397 0.09674603 0.08986772 0.09833333 
##          7          8          9 
## 0.10529101 0.09722222 0.09936508

The next step is to separate the target from the features. We also scale the pixel values into the range [0, 1] by dividing them by the maximum possible pixel brightness (255) on a greyscale image.

train_x <- data_train %>% 
  select(-label) %>% 
  as.matrix() / 255

test_x <- data_test %>% 
  select(-label) %>% 
  as.matrix() / 255

train_x[1:10, 203:208]
##        pixel202  pixel203  pixel204  pixel205  pixel206  pixel207
## 2986  0.0000000 0.0000000 0.4588235 0.7686275 0.3137255 0.2196078
## 29925 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 29710 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.5960784
## 37529 0.2117647 0.9921569 0.8235294 0.7843137 0.7843137 0.7843137
## 2757  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.8392157
## 38938 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 9642  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 31313 0.0000000 0.5333333 0.9333333 0.0627451 0.0000000 0.0000000
## 14183 0.0000000 0.0000000 0.0000000 0.0000000 0.4745098 0.9843137
## 15180 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
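
As a quick visual check, we can plot one of the scaled digits. This is a minimal sketch using base R's image(); the transpose and reversal are only there so that the digit is displayed upright.

# Reshape the first row of pixels back into a 28 x 28 image and plot it
digit <- matrix(train_x[1, ], nrow = 28, ncol = 28, byrow = TRUE)
image(t(apply(digit, 2, rev)), col = gray.colors(256), axes = FALSE,
      main = paste("Label:", data_train$label[1]))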

Multi-Layer Perceptron

First we build a vanilla multi-layer perceptron (MLP) where we only use simple dense layers (layer_dense). We convert the target variable by one-hot encoding it into as many columns as there are classes (10).

# Convert the feature matrices into arrays in the shape keras expects
train_x_keras <- array_reshape(train_x, dim(train_x))
test_x_keras <- array_reshape(test_x, dim(test_x))

# One-hot encode the training labels; keep the raw labels for evaluation
train_y <- to_categorical(data_train$label, num_classes = 10)
test_y <- data_test$label

cat("Encoded Target \n \n")
## Encoded Target 
## 
head(train_y)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    0    0    1    0     0
## [2,]    0    0    0    0    0    0    1    0    0     0
## [3,]    0    0    0    1    0    0    0    0    0     0
## [4,]    0    0    0    1    0    0    0    0    0     0
## [5,]    0    1    0    0    0    0    0    0    0     0
## [6,]    0    0    0    0    1    0    0    0    0     0

Next, we need to build the model architecture. Here we build an MLP model with 4 hidden layers.

tf$random$set_seed(123)

model <- keras_model_sequential(name = "MLP_Model") %>% 
  layer_dense(units = 512, 
              activation = "relu",
              input_shape = ncol(train_x),
              name = "Hidden_1"
              ) %>% 
  layer_dense(units = 128, 
              activation = "relu",
              name = "Hidden_2"
              ) %>% 
  layer_dense(units = 32,
              activation = "relu",
              name = "Hidden_3"
              ) %>% 
  layer_dense(units = 16,
              activation = "relu",
              name = "Hidden_4"
              ) %>% 
  layer_dense(units = 10,
              activation = "softmax"
              )

model
## Model
## Model: "MLP_Model"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## Hidden_1 (Dense)                    (None, 512)                     401920      
## ________________________________________________________________________________
## Hidden_2 (Dense)                    (None, 128)                     65664       
## ________________________________________________________________________________
## Hidden_3 (Dense)                    (None, 32)                      4128        
## ________________________________________________________________________________
## Hidden_4 (Dense)                    (None, 16)                      528         
## ________________________________________________________________________________
## dense (Dense)                       (None, 10)                      170         
## ================================================================================
## Total params: 472,410
## Trainable params: 472,410
## Non-trainable params: 0
## ________________________________________________________________________________
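
The parameter counts in the summary follow directly from the layer sizes. As a worked example for the first hidden layer, each of the 784 input pixels connects to each of the 512 units, and each unit has one bias:

# Parameters in Hidden_1: weights (784 * 512) plus biases (512) = 401,920
784 * 512 + 512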

The model uses categorical cross-entropy as the loss function and the Adam optimizer with a learning rate of 0.001 to update the weights during training. The model will be trained for 15 epochs with a batch size of 32. To monitor the training process and check for any sign of overfitting, 10% of the data will be used as the validation set.

model %>% 
  compile(loss = "categorical_crossentropy",
          optimizer = optimizer_adam(lr = 0.001),
          metrics = "accuracy"
          )

train_history <- model %>% 
  fit(x = train_x_keras,
      y = train_y,
      epochs = 15,
      batch_size = 32,
      validation_split = 0.1,
      verbose = 1
      )

plot(train_history)
## `geom_smooth()` using formula 'y ~ x'
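
If the validation curves do show signs of overfitting, one common option (not used in the run above) is to add an early-stopping callback to fit(). A minimal sketch:

# Stop training once the validation loss has not improved for 3 epochs
# and roll back to the best weights seen so far
train_history <- model %>% 
  fit(x = train_x_keras,
      y = train_y,
      epochs = 15,
      batch_size = 32,
      validation_split = 0.1,
      callbacks = list(
        callback_early_stopping(monitor = "val_loss",
                                patience = 3,
                                restore_best_weights = TRUE)
      ),
      verbose = 1
      )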

The final step is to evaluate our model on the testing dataset. Here we obtain a model with accuracy, recall, and precision of 97%.

pred_test <- predict_classes(model, test_x_keras)

data.frame(
  Accuracy = accuracy_vec(truth = as.factor(data_test$label), 
                          estimate = as.factor(pred_test)
                          ),
  Recall = sens_vec(truth = as.factor(data_test$label),
                    estimate = as.factor(pred_test)
                          ),
  Precision = precision_vec(truth = as.factor(data_test$label), 
                          estimate = as.factor(pred_test)
                          ),
  F1 = f_meas_vec(truth = as.factor(data_test$label), 
                          estimate = as.factor(pred_test)
                          )
) %>% 
  mutate_all(scales::percent, accuracy = 0.01)
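
Note that predict_classes() comes from older versions of the keras R package and has since been removed. If it is not available in your installation, a minimal equivalent is to take the class with the highest predicted probability yourself:

# predict() returns one softmax probability per class (an n x 10 matrix);
# which.max() is 1-based while the labels start at 0, hence the - 1
pred_prob <- predict(model, test_x_keras)
pred_test <- apply(pred_prob, 1, which.max) - 1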

Convolutional Neural Network

We may want to improve the model a little by using different layers on top of the dense layers. Since the data are images, we can use a Convolutional Neural Network (CNN). If you want to learn more about how this type of model behaves, you can watch this interesting animation from Israel Vicars.

To make the data work with a CNN, we reshape it by changing its dimensions from 2 (rows, columns) to 4 (rows, height, width, color channels). Each image is 28 x 28 pixels in greyscale (only 1 color channel), so the dimensions become (number of rows, 28, 28, 1).

train_x_keras <- array_reshape(train_x, 
                               dim = c(nrow(train_x), 28, 28, 1)
                               )

test_x_keras <- array_reshape(test_x, 
                               dim = c(nrow(test_x), 28, 28, 1)
                               )

Next, we build the refined model with convolutional layers in it. Each convolutional layer uses a 4 x 4 kernel with 32 filters, followed by 3 x 3 max pooling to reduce the spatial dimensions for the subsequent layers.

tensorflow::tf$random$set_seed(123)

model <- keras_model_sequential(name = "CNN_Model") %>% 
  
  layer_conv_2d(filters = 32, 
                kernel_size = c(4,4), 
                padding = "same", activation = "relu",
                input_shape = c(28, 28, 1)
                ) %>% 
  
  layer_max_pooling_2d(pool_size = c(3,3)) %>% 
  
  layer_conv_2d(filters = 32, 
                kernel_size = c(4,4), 
                padding = "same", activation = "relu"
                ) %>% 
  
  layer_max_pooling_2d(pool_size = c(3,3)) %>% 
  
  layer_conv_2d(filters = 32, 
                kernel_size = c(4,4), 
                padding = "same", activation = "relu"
                ) %>% 
  
  layer_max_pooling_2d(pool_size = c(3,3)) %>% 
  
  layer_flatten() %>% 
  
  layer_dense(units = 16, 
              activation = "relu") %>% 
 
  layer_dense(units = 10, 
              activation = "softmax",
              name = "Output"
              )

model
## Model
## Model: "CNN_Model"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv2d (Conv2D)                     (None, 28, 28, 32)              544         
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D)        (None, 9, 9, 32)                0           
## ________________________________________________________________________________
## conv2d_1 (Conv2D)                   (None, 9, 9, 32)                16416       
## ________________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D)      (None, 3, 3, 32)                0           
## ________________________________________________________________________________
## conv2d_2 (Conv2D)                   (None, 3, 3, 32)                16416       
## ________________________________________________________________________________
## max_pooling2d_2 (MaxPooling2D)      (None, 1, 1, 32)                0           
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 32)                      0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 16)                      528         
## ________________________________________________________________________________
## Output (Dense)                      (None, 10)                      170         
## ================================================================================
## Total params: 34,074
## Trainable params: 34,074
## Non-trainable params: 0
## ________________________________________________________________________________
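
As with the MLP, the parameter counts can be verified by hand. For the first convolutional layer, each of the 32 filters has 4 x 4 x 1 weights plus one bias:

# Parameters in the first Conv2D layer: (4 * 4 * 1) weights per filter
# times 32 filters, plus 32 biases = 544
(4 * 4 * 1) * 32 + 32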

The configuration for the loss function and the optimizer is still the same as in the previous model.

# Compile Model
model %>% 
  compile(loss = "categorical_crossentropy",
          optimizer = optimizer_adam(lr = 0.001), 
          metrics = "accuracy"
          )

train_history <- model %>% 
  fit(x = train_x_keras, 
      y = train_y,
      epochs = 15, 
      batch_size = 32,
      validation_split = 0.1, 
      verbose = 1
      )

plot(train_history)
## `geom_smooth()` using formula 'y ~ x'

Let’s evaluate the result. Here the CNN gives a better result, with an increase in all evaluation metrics.

pred_test <- predict_classes(model, test_x_keras)

data.frame(
  Accuracy = accuracy_vec(truth = as.factor(data_test$label), 
                          estimate = as.factor(pred_test)
                          ),
  Recall = sens_vec(truth = as.factor(data_test$label),
                    estimate = as.factor(pred_test)
                          ),
  Precision = precision_vec(truth = as.factor(data_test$label), 
                          estimate = as.factor(pred_test)
                          ),
  F1 = f_meas_vec(truth = as.factor(data_test$label), 
                          estimate = as.factor(pred_test)
                          )
) %>% 
  mutate_all(scales::percent, accuracy = 0.01)
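
For a per-class view of where the model still makes mistakes, a confusion matrix is useful. A minimal sketch using yardstick's conf_mat() on the same predictions:

# Cross-tabulate true labels against predicted labels
data.frame(truth = as.factor(data_test$label),
           estimate = as.factor(pred_test)) %>% 
  conf_mat(truth = truth, estimate = estimate)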

Kaggle Submission

If you are interested in seeing how you compare with other people, you can submit your predictions on test.csv to the official Kaggle competition. Use the following code to generate the submission file.

# Load the unlabeled Kaggle test set and prepare it the same way as the training data
submit_mnist <- read.csv("data/mnist/test.csv")

test_x <- submit_mnist %>% 
  as.matrix()/255

test_x <- array_reshape(test_x, 
                        dim = c(nrow(test_x), 28,28, 1) 
                        )

pred_test <- predict_classes(model, test_x)

submission <- data.frame(
  ImageId = 1:nrow(submit_mnist),
  Label = pred_test
)

write.csv(submission, "submission kaggle.csv", row.names = FALSE)