Library

library(keras)
library(dplyr)
library(caret)
library(ggplot2)

Problem Statement

American Sign Language (ASL) is a complete, natural language that has the same linguistic properties as spoken languages, with grammar that differs from English. ASL is expressed by movements of the hands and face. It is the primary language of many North Americans who are deaf and hard of hearing, and is used by many hearing people as well. We are given pictures of ASL hand gestures, each representing a letter of the Roman alphabet. Can we classify each picture as the letter it shows? This is a multiclass classification problem, which we will solve by developing a Neural Network (NN) model.

Dataset

Let’s read the dataset.

train <- read.csv('sign_mnist_train.csv')
test <- read.csv('sign_mnist_test.csv')

The dataset format is patterned to match closely with the classic MNIST. The training data (27,455 cases) and test data (7,172 cases) are approximately half the size of the standard MNIST but otherwise similar, with a header row of label, pixel1, pixel2, …, pixel784. Each row represents a single \(28 \times 28\) pixel image with grayscale values between 0 and 255.

dim(train)

#> [1] 27455   785

dim(test)

#> [1] 7172  785

Each training and test case carries a label (0-25) that maps one-to-one to an alphabetic letter A-Z. There are no cases for 9=J or 25=Z because those two letters involve gesture motions.

sort(unique(train$label))

#>  [1]  0  1  2  3  4  5  6  7  8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

sort(unique(test$label))

#>  [1]  0  1  2  3  4  5  6  7  8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

EDA & Preprocessing

First, we need to correct the labels. Since labels 9 and 25 are missing, we can subtract 1 from every label greater than 9. This way, the labels become consecutive integers from 0 to 23, one for each of the 24 letters present.

train[train$label > 9, 'label'] <- train[train$label > 9, 'label'] - 1
test[test$label > 9, 'label'] <- test[test$label > 9, 'label'] - 1
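To keep track of which letter each shifted label stands for, a small lookup can help. The sketch below is only an illustration; letters_used and label_to_letter() are hypothetical names that are not used elsewhere in this analysis.

```r
# Letters that appear in the dataset: A-Z without J (10th) and Z (26th)
letters_used <- LETTERS[-c(10, 26)]

# Hypothetical helper: map a shifted label (0-23) to its letter
label_to_letter <- function(label) letters_used[label + 1]

# The same shift as above, applied to a few original labels
original <- c(0, 8, 10, 11, 24)
shifted <- ifelse(original > 9, original - 1, original)

label_to_letter(shifted)
```

With this mapping, shifted label 2 is C and shifted label 10 is L.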

It doesn’t hurt to see what our dataset looks like. Let’s take a look at the first 32 pictures from train.

vizTrain <- function(input) {
  # Side length of the square image (28 for 784 pixel columns)
  dimmax <- sqrt(ncol(input) - 1)
  cols <- 8
  rows <- floor((nrow(input) - 1) / cols) + 1
  par(mfrow = c(rows, cols), mar = c(0.1, 0.1, 0.1, 0.1))
  for (i in seq_len(nrow(input))) {
    # Pixels are stored row by row; the first column is the label
    m1 <- matrix(as.numeric(unlist(input[i, -1])), nrow = dimmax, byrow = TRUE)
    # Rotate so that image() draws the picture upright
    m1 <- t(apply(m1, 2, rev))
    image(1:dimmax, 1:dimmax, m1, col = grey.colors(255), xaxt = "n", yaxt = "n")
    # Use the label of the row being drawn, not the global train data
    text(3, 26, col = "black", cex = 1.2, input[i, 1])
  }
}

vizTrain(train[1:32, ])

We perform a grayscale normalization to reduce the effect of illumination differences. Moreover, the NN models we will use converge faster on data in [0, 1] than on data in [0, 255]. To do this, simply divide each pixel value by 255. We also separate the predictors and the target in both the train and test datasets, resulting in train_x, test_x, train_y, and test_y.

train_x <- train %>% 
  select(-label) %>%
  data.matrix()/255
  
test_x <- test %>% 
  select(-label) %>%
  data.matrix()/255

train_y <- train %>% 
  select(label)

test_y <- test %>% 
  select(label)

NN models can’t work with a categorical target directly. For that reason, we need to one-hot encode the labels train_y and test_y. One-hot encoding generates a column of ones and zeros for each category. So in our case, the result is a matrix with 24 columns in which every row is all zeros except for a single cell with value 1. The column in which this 1 occurs corresponds to the label of that row; since labels start at 0, label k sits in column k + 1. For example, the first six observations in train_y have the labels 3, 6, 2, 2, 12, and 15, as can be seen in the following table.

train_y_keras <- train_y %>% 
  data.matrix() %>% 
  to_categorical(num_classes = 24)

head(train_y_keras)

#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]    0    0    0    1    0    0    0    0    0     0     0     0     0     0
#> [2,]    0    0    0    0    0    0    1    0    0     0     0     0     0     0
#> [3,]    0    0    1    0    0    0    0    0    0     0     0     0     0     0
#> [4,]    0    0    1    0    0    0    0    0    0     0     0     0     0     0
#> [5,]    0    0    0    0    0    0    0    0    0     0     0     0     1     0
#> [6,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
#> [1,]     0     0     0     0     0     0     0     0     0     0
#> [2,]     0     0     0     0     0     0     0     0     0     0
#> [3,]     0     0     0     0     0     0     0     0     0     0
#> [4,]     0     0     0     0     0     0     0     0     0     0
#> [5,]     0     0     0     0     0     0     0     0     0     0
#> [6,]     0     1     0     0     0     0     0     0     0     0
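To make the encoding concrete, here is a minimal base R sketch of what one-hot encoding does to the six labels above. The helper one_hot() is hypothetical and only mimics the result; it is not how to_categorical() is implemented.

```r
# Hypothetical helper: one-hot encode zero-based integer labels
one_hot <- function(labels, num_classes) {
  out <- matrix(0, nrow = length(labels), ncol = num_classes)
  # Label k sets column k + 1, because R indexes from 1
  out[cbind(seq_along(labels), labels + 1)] <- 1
  out
}

encoded <- one_hot(c(3, 6, 2, 2, 12, 15), num_classes = 24)

# Column holding the 1 in each row
max.col(encoded)
```

Each row sums to 1, and the column index of the 1 is always the label plus one.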

We do the same to test_y.

test_y_keras <- test_y %>% 
  data.matrix() %>% 
  to_categorical(num_classes = 24)

head(test_y_keras)

#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]    0    0    0    0    0    0    1    0    0     0     0     0     0     0
#> [2,]    0    0    0    0    0    1    0    0    0     0     0     0     0     0
#> [3,]    0    0    0    0    0    0    0    0    0     1     0     0     0     0
#> [4,]    1    0    0    0    0    0    0    0    0     0     0     0     0     0
#> [5,]    0    0    0    1    0    0    0    0    0     0     0     0     0     0
#> [6,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
#> [1,]     0     0     0     0     0     0     0     0     0     0
#> [2,]     0     0     0     0     0     0     0     0     0     0
#> [3,]     0     0     0     0     0     0     0     0     0     0
#> [4,]     0     0     0     0     0     0     0     0     0     0
#> [5,]     0     0     0     0     0     0     0     0     0     0
#> [6,]     0     0     0     0     0     0     1     0     0     0

Metric and Validation

Since we don’t prefer one class over the others, and the classes are roughly balanced as shown below, we will simply use accuracy as our metric. Also, since training an NN for image classification takes time, we want model validation to be as simple as possible. We will therefore validate the model on the separate test dataset.

ggplot(train %>% 
         group_by(label) %>% 
         count(name = 'observation_count'), 
       aes(x = label, y = observation_count)) + 
  geom_bar(stat = 'identity') + 
  ggtitle("Number of Observations among Labels")

Modeling

Dense with 2 Hidden Layers

Before going on to modeling, we need to transform our dataset from matrices into arrays with the same dimension.

train_x_keras <- train_x %>% 
  array_reshape(dim = dim(train_x))

test_x_keras <-  test_x %>% 
  array_reshape(dim = dim(test_x))

Then, build the model. For now, we will create the architecture as follows.

Input Layer: 784 nodes
Hidden Layer 1: 128 nodes, relu activation function
Hidden Layer 2: 64 nodes, relu activation function
Output Layer: 24 nodes, softmax activation function

We use 784 nodes in the input layer since each picture in the dataset has 784 pixels in total. The relu activation function is chosen for the hidden layers since it is a common default: it is cheap to compute and its outputs are non-negative (though not bounded above by 1). The softmax activation function is chosen for the output layer since this is a multiclass classification problem and softmax turns the 24 outputs into probabilities that sum to 1.
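For intuition, the two activation functions can be written in a few lines of base R. This is a sketch on a made-up vector z, not the keras implementation.

```r
# relu: negative values become 0, positive values pass through unchanged
relu <- function(z) pmax(z, 0)

# softmax: turn arbitrary scores into probabilities that sum to 1
softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the max for numerical stability
  e / sum(e)
}

z <- c(-1.5, 0, 2)  # made-up scores for illustration
relu(z)
softmax(z)
sum(softmax(z))
```

The largest score always receives the largest probability, which is why the predicted class is the output node with the highest softmax value.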

tensorflow::tf$random$set_seed(42)

model_2hidden <- keras_model_sequential()
model_2hidden %>% 
  layer_dense(input_shape = ncol(train_x_keras),
              units = 128,
              activation = "relu",
              name = "hidden1") %>% 
  layer_dense(units = 64,
              activation = "relu",
              name = "hidden2") %>%
  layer_dense(units = 24,
              activation = "softmax",
              name = "output")
  
summary(model_2hidden)

#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> hidden1 (Dense)                     (None, 128)                     100480      
#> ________________________________________________________________________________
#> hidden2 (Dense)                     (None, 64)                      8256        
#> ________________________________________________________________________________
#> output (Dense)                      (None, 24)                      1560        
#> ================================================================================
#> Total params: 110,296
#> Trainable params: 110,296
#> Non-trainable params: 0
#> ________________________________________________________________________________
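The parameter counts in the summary can be checked by hand: a dense layer has one weight for each input-output pair plus one bias per unit. A small sketch, where dense_params() is a hypothetical helper:

```r
# Parameters of a dense layer: weights (n_in * n_out) plus biases (n_out)
dense_params <- function(n_in, n_out) n_in * n_out + n_out

layer_sizes <- c(784, 128, 64, 24)

# Pair each layer size with the size of the layer that follows it
params <- dense_params(head(layer_sizes, -1), tail(layer_sizes, -1))
params
sum(params)
```

The three values match the Param # column (100480, 8256, 1560), and their sum matches the total of 110,296.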

Compile the model. Use optimizer_adam as the optimizer with the default learning rate of 0.001, categorical_crossentropy as the loss function since this is a multiclass classification problem, and accuracy as the metric, as discussed earlier.
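To see what the loss measures, here is a simplified base R version of categorical cross-entropy for a single observation. The function cross_entropy() and the clipping constant eps are assumptions made for this illustration; the values are made up.

```r
# Categorical cross-entropy for one observation with a one-hot target
cross_entropy <- function(y_true, y_pred, eps = 1e-7) {
  # Clip predictions away from 0 and 1 so that log() stays finite
  y_pred <- pmin(pmax(y_pred, eps), 1 - eps)
  -sum(y_true * log(y_pred))
}

y_true <- c(0, 1, 0)               # one-hot target: second of 3 classes
confident <- c(0.05, 0.90, 0.05)   # high probability on the true class
unsure <- c(0.40, 0.30, 0.30)      # low probability on the true class

cross_entropy(y_true, confident)
cross_entropy(y_true, unsure)
```

Only the probability assigned to the true class enters the loss, so a confident correct prediction gives a loss near 0 while an unsure one is penalized more.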

model_2hidden %>% 
  compile(optimizer = optimizer_adam(lr=0.001),
          loss = "categorical_crossentropy",
          metrics = "accuracy")

Then, train the model. We will train the model in batches for 10 epochs with 32 observations in each batch.

history <- model_2hidden %>% 
  fit(train_x_keras, 
      train_y_keras, 
      batch_size = 32, 
      epochs = 10,
      validation_data = list(test_x_keras, test_y_keras))

plot(history)

As we can see, there is an indication of overfitting (train accuracy is much higher than test accuracy).

pred_2hidden <- predict_classes(object = model_2hidden, x = test_x_keras)
confusionMatrix(as.factor(pred_2hidden), as.factor(test_y$label))

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
#>         0  310   0   0   0   0   0   0   0   6   0   0   3  42   0   0   0   0
#>         1    0 391   0  12  11   0   0   0   0   0   0   0   0   0   0   0   0
#>         2    0   0 307   0   0   3   0   8   0   0   0   0  20  20  21   0   0
#>         3    0   0   0 168   0   7   0   0   3   0   0   0   0   0   0   0   0
#>         4    0   0   0   0 464   0   0  21   0   0   0   7  41  21   0   0   0
#>         5    0   0   2   0   0 178   0   0   0  21   0   0   0   1   4   0   0
#>         6    0   0   0   0   0   0 274  34   0   0   0   0   3  38   0   0   0
#>         7    0   0   0   0   0   0  33 304   0   0   0   0   0   0   0   0   0
#>         8    0   0   0   0   0   0   0  21 206   0   0   0   0   0  35   0   0
#>         9    0   0   0   0   0  21   0   0   0 132   0   0   0   0   0   0   0
#>         10   0   0   0   3   0   8   0   0   0   0 209   0   0   0   0   0  21
#>         11   0   0   0   0   1   0   0   0   0  19   0 286  42   0   0  21   0
#>         12  21   0   0   0   0   0  18   0   0   0   0  80 124   7   0   0   0
#>         13   0   0   0   0   0   0   0   0   0   0   0   1   0 155   0   0   0
#>         14   0   0   0   0   0   0   0   0   0   0   0   0   0   0 282  21   0
#>         15   0   0   0   0   0   0  20   0   0   0   0   0  18   3   5 122   0
#>         16   0   0   0  21   0   0   0   0  31  55   0   0   0   0   0   0  57
#>         17   0   0   0   0  22   0   0   0   0   0   0  16   0   0   0   0   0
#>         18   0   0   1   0   0  17   0  21   0   0   0   1   1   1   0   0   0
#>         19   0  41   0  10   0  13   0  27   0  43   0   0   0   0   0   0  66
#>         20   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
#>         21   0   0   0   0   0   0   0   0   0  21   0   0   0   0   0   0   0
#>         22   0   0   0  31   0   0   3   0   0   0   0   0   0   0   0   0   0
#>         23   0   0   0   0   0   0   0   0  42  40   0   0   0   0   0   0   0
#>           Reference
#> Prediction  17  18  19  20  21  22  23
#>         0    0   0   0   0   0   0   0
#>         1    0   0   0   0   0   0  10
#>         2    0   0   0   0   0   0   0
#>         3   18   0   2   0   0   0   0
#>         4   21   0   0   0   0   0   0
#>         5    0   0   0  25   0   0   0
#>         6    0  18   0  20   0   0   0
#>         7    0   0   0   0   0   0   0
#>         8   21   0   0   0   0   0   0
#>         9    0   0   2  15   0   0   0
#>         10   0  22  20   1   0   6  41
#>         11  62   0   0   0   0   0   0
#>         12  21   0   0   0   0   0   0
#>         13   0   0   0   0   0   0   0
#>         14   0   0   0  14   0   0   0
#>         15   1   0   0   0   0   0   0
#>         16   0   0  16  14  13  19   7
#>         17  82   0   0   0   0   1  20
#>         18   0  99   0   6  14   0  21
#>         19   0  20 205  67  36   0   0
#>         20   0   5   0 136   0   0  19
#>         21   0   0   0  12 143  40   0
#>         22   0  81   0   0   0 201   0
#>         23  20   3  21  36   0   0 214
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.704                
#>                  95% CI : (0.6933, 0.7145)     
#>     No Information Rate : 0.0694               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6902               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity           0.93656  0.90509  0.99032  0.68571  0.93173  0.72065
#> Specificity           0.99254  0.99510  0.98951  0.99567  0.98337  0.99235
#> Pos Pred Value        0.85873  0.92217  0.81003  0.84848  0.80696  0.77056
#> Neg Pred Value        0.99692  0.99392  0.99956  0.98896  0.99485  0.99006
#> Prevalence            0.04615  0.06023  0.04322  0.03416  0.06944  0.03444
#> Detection Rate        0.04322  0.05452  0.04281  0.02342  0.06470  0.02482
#> Detection Prevalence  0.05033  0.05912  0.05284  0.02761  0.08017  0.03221
#> Balanced Accuracy     0.96455  0.95010  0.98992  0.84069  0.95755  0.85650
#>                      Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
#> Sensitivity           0.78736  0.69725  0.71528  0.39879   1.00000   0.72589
#> Specificity           0.98344  0.99510  0.98881  0.99445   0.98248   0.97861
#> Pos Pred Value        0.70801  0.90208  0.72792  0.77647   0.63142   0.66357
#> Neg Pred Value        0.98909  0.98069  0.98810  0.97158   1.00000   0.98398
#> Prevalence            0.04852  0.06079  0.04016  0.04615   0.02914   0.05494
#> Detection Rate        0.03820  0.04239  0.02872  0.01840   0.02914   0.03988
#> Detection Prevalence  0.05396  0.04699  0.03946  0.02370   0.04615   0.06009
#> Balanced Accuracy     0.88540  0.84617  0.85205  0.69662   0.99124   0.85225
#>                      Class: 12 Class: 13 Class: 14 Class: 15 Class: 16
#> Sensitivity            0.42612   0.63008   0.81268   0.74390  0.395833
#> Specificity            0.97864   0.99986   0.99487   0.99329  0.974957
#> Pos Pred Value         0.45756   0.99359   0.88959   0.72189  0.244635
#> Neg Pred Value         0.97580   0.98703   0.99052   0.99400  0.987462
#> Prevalence             0.04057   0.03430   0.04838   0.02287  0.020078
#> Detection Rate         0.01729   0.02161   0.03932   0.01701  0.007948
#> Detection Prevalence   0.03779   0.02175   0.04420   0.02356  0.032487
#> Balanced Accuracy      0.70238   0.81497   0.90378   0.86860  0.685395
#>                      Class: 17 Class: 18 Class: 19 Class: 20 Class: 21
#> Sensitivity            0.33333   0.39919   0.77068   0.39306   0.69417
#> Specificity            0.99148   0.98801   0.95323   0.99648   0.98952
#> Pos Pred Value         0.58156   0.54396   0.38826   0.85000   0.66204
#> Neg Pred Value         0.97667   0.97868   0.99082   0.97005   0.99094
#> Prevalence             0.03430   0.03458   0.03709   0.04824   0.02872
#> Detection Rate         0.01143   0.01380   0.02858   0.01896   0.01994
#> Detection Prevalence   0.01966   0.02538   0.07362   0.02231   0.03012
#> Balanced Accuracy      0.66241   0.69360   0.86195   0.69477   0.84185
#>                      Class: 22 Class: 23
#> Sensitivity            0.75281   0.64458
#> Specificity            0.98335   0.97632
#> Pos Pred Value         0.63608   0.56915
#> Neg Pred Value         0.99037   0.98264
#> Prevalence             0.03723   0.04629
#> Detection Rate         0.02803   0.02984
#> Detection Prevalence   0.04406   0.05243
#> Balanced Accuracy      0.86808   0.81045

As the confusion matrix shows, this model still confuses many labels with one another. The easiest labels to predict are 2=C (sensitivity 0.99) and 10=L (sensitivity 1.00). This comes as no surprise since the hand gestures of these two letters are intuitive and stand out among the other letters. In summary, the accuracy on the train dataset is 96% while the accuracy on the test dataset is 70%.
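The per-class statistics can be derived from the confusion matrix itself. The sketch below uses a made-up 3-class matrix (rows are predictions, columns are reference labels, as in the caret output) to show how sensitivity, positive predictive value, and accuracy are obtained.

```r
# Hypothetical 3-class confusion matrix: rows = predicted, columns = reference
cm <- matrix(c(8, 1, 0,
               2, 9, 1,
               0, 0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = 0:2, Reference = 0:2))

# Share of each true class that the model finds
sensitivity <- diag(cm) / colSums(cm)
# Share of each predicted class that is actually correct
pos_pred_value <- diag(cm) / rowSums(cm)
# Share of all observations on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

round(sensitivity, 3)
round(pos_pred_value, 3)
accuracy
```

A class can have a sensitivity of 1 and still a lower positive predictive value, because other classes may be wrongly predicted as that class.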

result <- data.frame(
  'train_acc' = tail(history$metrics$accuracy, n=1),
  'test_acc' = tail(history$metrics$val_accuracy, n=1), 
  row.names = 'Dense with 2 hidden layers')

result

#>                            train_acc  test_acc
#> Dense with 2 hidden layers 0.9635403 0.7039877

Dense with 3 Hidden Layers

Despite the indication of overfitting in the previous model, let’s expand the model and see to what extent the overfitting occurs. This time, we will use a bigger model with the following architecture.

Input Layer: 784 nodes
Hidden Layer 1: 512 nodes, relu activation function
Hidden Layer 2: 256 nodes, relu activation function
Hidden Layer 3: 128 nodes, relu activation function
Output Layer: 24 nodes, softmax activation function

Not only do we add another hidden layer, but we also increase the number of nodes in each of them.

tensorflow::tf$random$set_seed(42)

model_3hidden <- keras_model_sequential()
model_3hidden %>% 
  layer_dense(input_shape = ncol(train_x_keras),
              units = 512,
              activation = "relu",
              name = "hidden1") %>% 
  layer_dense(units = 256,
              activation = "relu",
              name = "hidden2") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden3") %>%
  layer_dense(units = 24,
              activation = "softmax",
              name = "output")
  
summary(model_3hidden)

#> Model: "sequential_1"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> hidden1 (Dense)                     (None, 512)                     401920      
#> ________________________________________________________________________________
#> hidden2 (Dense)                     (None, 256)                     131328      
#> ________________________________________________________________________________
#> hidden3 (Dense)                     (None, 128)                     32896       
#> ________________________________________________________________________________
#> output (Dense)                      (None, 24)                      3096        
#> ================================================================================
#> Total params: 569,240
#> Trainable params: 569,240
#> Non-trainable params: 0
#> ________________________________________________________________________________

As before, compile and train.

model_3hidden %>% 
  compile(optimizer = optimizer_adam(lr=0.001),
          loss = "categorical_crossentropy",
          metrics = "accuracy")

history <- model_3hidden %>% 
  fit(train_x_keras, 
      train_y_keras, 
      batch_size = 32, 
      epochs = 10,
      validation_data = list(test_x_keras, test_y_keras))

plot(history)

Of course, the model still overfits the train dataset. But look, accuracy on the test dataset increases as well. Let’s look at the details with a confusion matrix.
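The confusion matrix compares predicted class labels with the reference labels. A predicted label is simply the position of the largest softmax output, minus 1 because our labels start at 0. The sketch below shows this on a made-up probability matrix; it may also be useful if predict_classes() is not available in your keras version, in which case the same idea could be applied to the output of predict().

```r
# Hypothetical softmax outputs for 3 observations and 4 classes
probs <- matrix(c(0.10, 0.70, 0.10, 0.10,
                  0.25, 0.25, 0.40, 0.10,
                  0.05, 0.05, 0.10, 0.80),
                nrow = 3, byrow = TRUE)

# Column with the largest probability, minus 1 for zero-based labels
pred <- apply(probs, 1, which.max) - 1
pred
```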

pred_3hidden <- predict_classes(object = model_3hidden, x = test_x_keras)
confusionMatrix(as.factor(pred_3hidden), as.factor(test_y$label))

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
#>         0  331   0   0   0   0   0   0   0   0   0   0   1  21   0   0  21   0
#>         1    0 391   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
#>         2    0   0 310   0   0  21   0   0   0   0   4   0   0  16   0   0   0
#>         3    0   0   0 208   0   0   0   0   0   0   0   0   0   0   0   0   0
#>         4    0   0   0   0 498   0   0   0   0   0   0   9  17   0   0   0   0
#>         5    0   0   0   0   0 226   0   0   0   0   0   0   0   0   0   0   0
#>         6    0   0   0   0   0   0 301  37   0   0   0   0   0   0   0   0   0
#>         7    0   0   0   0   0   0   5 399   0   0   0   0   0   0   0   0   0
#>         8    0   0   0   0   0   0   0   0 230  21   0   0   0   0   0   0   0
#>         9    0   0   0   0   0   0   0   0   0 176   0   0   0   0   0   0   0
#>         10   0   0   0   0   0   0   0   0   0   0 205   0   0   0   0   0  21
#>         11   0   0   0   0   0   0   0   0   0   0   0 263   6   0   0   0   0
#>         12   0   0   0   0   0   0  21   0   0   0   0  50 197   0   0   0   0
#>         13   0   0   0   0   0   0   0   0   0   0   0   0   8 182   0   0   0
#>         14   0   0   0   0   0   0   0   0   0   0   0   0   0   0 347   0   0
#>         15   0   0   0   0   0   0  20   0   0   0   0   0  21  19   0 143   0
#>         16   0   0   0  15   0   0   0   0  20  27   0   0   0   0   0   0 104
#>         17   0   0   0   0   0   0   0   0   0   0   0  71   0   0   0   0   0
#>         18   0   0   0   0   0   0   1   0   0   0   0   0  21  29   0   0   0
#>         19   0  41   0   0   0   0   0   0   0  68   0   0   0   0   0   0  19
#>         20   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
#>         21   0   0   0   0   0   0   0   0   0  14   0   0   0   0   0   0   0
#>         22   0   0   0  22   0   0   0   0  17   7   0   0   0   0   0   0   0
#>         23   0   0   0   0   0   0   0   0  21  17   0   0   0   0   0   0   0
#>           Reference
#> Prediction  17  18  19  20  21  22  23
#>         0    0   0   0   0   0   0   0
#>         1    0   0   0   0   0   0   0
#>         2    0   0   0   0   0   0   0
#>         3    0   0  19   0   0   0   0
#>         4   41   0   0   0   0   0   0
#>         5    0   0   0  20   0   0   0
#>         6    0   0   0  11   0   0   0
#>         7    0   0   0   0   0   0   0
#>         8   17   0   0   0   0   0  21
#>         9    0   0   5   1   0   0   0
#>         10   0  20   0   0   0   0  21
#>         11  41   0   0   0   0   0   0
#>         12   0   0   0   0   0   0   0
#>         13   0   0   0   0   0   0   0
#>         14   0   0   0  20   0   0   0
#>         15   1   0   0   0   0   0   0
#>         16   0   0  21   0  34   0   0
#>         17 142   0   0   0   0   0  19
#>         18   0 166   0   8   0   0  21
#>         19   0   0 192   1  20   0   0
#>         20   0   0  23 262  21   0  19
#>         21   0   0   0  23 131  37   0
#>         22   0  62   0   0   0 230   0
#>         23   4   0   6   0   0   0 231
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.8178               
#>                  95% CI : (0.8086, 0.8266)     
#>     No Information Rate : 0.0694               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.8093               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity           1.00000  0.90509  1.00000  0.84898  1.00000  0.91498
#> Specificity           0.99371  1.00000  0.99403  0.99726  0.98996  0.99711
#> Pos Pred Value        0.88503  1.00000  0.88319  0.91630  0.88142  0.91870
#> Neg Pred Value        1.00000  0.99395  1.00000  0.99467  1.00000  0.99697
#> Prevalence            0.04615  0.06023  0.04322  0.03416  0.06944  0.03444
#> Detection Rate        0.04615  0.05452  0.04322  0.02900  0.06944  0.03151
#> Detection Prevalence  0.05215  0.05452  0.04894  0.03165  0.07878  0.03430
#> Balanced Accuracy     0.99686  0.95255  0.99701  0.92312  0.99498  0.95605
#>                      Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
#> Sensitivity           0.86494  0.91514  0.79861  0.53172   0.98086   0.66751
#> Specificity           0.99297  0.99926  0.99143  0.99912   0.99110   0.99307
#> Pos Pred Value        0.86246  0.98762  0.79585  0.96703   0.76779   0.84839
#> Neg Pred Value        0.99311  0.99453  0.99157  0.97783   0.99942   0.98091
#> Prevalence            0.04852  0.06079  0.04016  0.04615   0.02914   0.05494
#> Detection Rate        0.04197  0.05563  0.03207  0.02454   0.02858   0.03667
#> Detection Prevalence  0.04866  0.05633  0.04030  0.02538   0.03723   0.04322
#> Balanced Accuracy     0.92895  0.95720  0.89502  0.76542   0.98598   0.83029
#>                      Class: 12 Class: 13 Class: 14 Class: 15 Class: 16
#> Sensitivity            0.67698   0.73984   1.00000   0.87195   0.72222
#> Specificity            0.98968   0.99884   0.99707   0.99130   0.98335
#> Pos Pred Value         0.73507   0.95789   0.94550   0.70098   0.47059
#> Neg Pred Value         0.98638   0.99083   1.00000   0.99699   0.99425
#> Prevalence             0.04057   0.03430   0.04838   0.02287   0.02008
#> Detection Rate         0.02747   0.02538   0.04838   0.01994   0.01450
#> Detection Prevalence   0.03737   0.02649   0.05117   0.02844   0.03081
#> Balanced Accuracy      0.83333   0.86934   0.99853   0.93162   0.85279
#>                      Class: 17 Class: 18 Class: 19 Class: 20 Class: 21
#> Sensitivity            0.57724   0.66935   0.72180   0.75723   0.63592
#> Specificity            0.98701   0.98845   0.97842   0.99062   0.98938
#> Pos Pred Value         0.61207   0.67480   0.56305   0.80368   0.63902
#> Neg Pred Value         0.98501   0.98816   0.98917   0.98773   0.98923
#> Prevalence             0.03430   0.03458   0.03709   0.04824   0.02872
#> Detection Rate         0.01980   0.02315   0.02677   0.03653   0.01827
#> Detection Prevalence   0.03235   0.03430   0.04755   0.04545   0.02858
#> Balanced Accuracy      0.78212   0.82890   0.85011   0.87392   0.81265
#>                      Class: 22 Class: 23
#> Sensitivity            0.86142   0.69578
#> Specificity            0.98436   0.99298
#> Pos Pred Value         0.68047   0.82796
#> Neg Pred Value         0.99459   0.98535
#> Prevalence             0.03723   0.04629
#> Detection Rate         0.03207   0.03221
#> Detection Prevalence   0.04713   0.03890
#> Balanced Accuracy      0.92289   0.84438

This model is definitely better than the previous one. Every test image of class 0=A, 2=C, 4=E, and 14=P is recognized (sensitivity 1.00). Class 10=L is also predicted well (sensitivity 0.98).

temp <- data.frame(
  'train_acc' = tail(history$metrics$accuracy, n=1),
  'test_acc' = tail(history$metrics$val_accuracy, n=1), 
  row.names = 'Dense with 3 hidden layers')

result <- rbind(result, temp)
result

#>                            train_acc  test_acc
#> Dense with 2 hidden layers 0.9635403 0.7039877
#> Dense with 3 hidden layers 1.0000000 0.8177635

The accuracy scores above show that this model overfits heavily. We got a perfect prediction on the train dataset! But again, this model is still better than the previous one since the accuracy on the test dataset also increased significantly (from 70% to 82%), so we will stick with this model. Now, what can we do to combat overfitting without making our model smaller?

Dense with Data Augmentation

The problem with the previous models is that they tend to memorize the pictures in the train dataset, so that they fail to recognize new pictures from the test dataset. Data augmentation is one of many techniques to address this problem. Given a picture, data augmentation transforms it slightly to create a new picture. These new pictures are then fed to the model. This way, the model sees many versions of the original picture and hopefully learns what the picture means instead of memorizing it. We will only use some simple transformations:

Randomly rotate by up to 10 degrees
Randomly zoom by a factor of up to 0.1
Randomly shift horizontally by up to 0.1 of the total width
Randomly shift vertically by up to 0.1 of the total height

We don’t use horizontal flip or vertical flip since in our case they can change the meaning of the image. This data augmentation can be done using image_data_generator() function. Save the generator to an object named datagen.

datagen <- image_data_generator(
  rotation_range = 10,
  zoom_range = 0.1,
  width_shift_range = 0.1,
  height_shift_range = 0.1
)
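To get a feel for what one of these transformations does, here is a simplified base R sketch of a random horizontal shift on a single 28 × 28 image. The helper shift_horizontal() is hypothetical, the image is random noise, and filling the vacated columns with zeros is an assumption made for simplicity; the generator’s own fill behavior may differ.

```r
# Hypothetical helper: shift an image matrix horizontally, filling with zeros
shift_horizontal <- function(img, shift) {
  out <- matrix(0, nrow = nrow(img), ncol = ncol(img))
  src <- seq_len(ncol(img))
  dst <- src + shift
  # Keep only the columns that stay inside the image after shifting
  keep <- dst >= 1 & dst <= ncol(img)
  out[, dst[keep]] <- img[, src[keep]]
  out
}

set.seed(42)
img <- matrix(runif(28 * 28), nrow = 28)  # stand-in for a normalized image

# A shift range of 0.1 on a 28-pixel-wide image is at most about 3 pixels
max_shift <- round(0.1 * ncol(img))
shift <- sample(-max_shift:max_shift, 1)
augmented <- shift_horizontal(img, shift)
dim(augmented)
```

The shifted image keeps its size and its label, so the model sees the same letter in a slightly different position.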

This time we will do the modeling in a slightly different way. Instead of feeding the model 32 rows of 784 pixel values, we will feed it 32 images of size \(28 \times 28\) pixels at a time. We can use the flow_images_from_data() function, passing datagen as the generator. This completes the generator for the train dataset; let’s call it train_generator. For validation, as before, we will use all test dataset observations at once for each epoch, without augmentation and reshaped to the same image format.

Now, since train_generator takes images as inputs, we need to reshape each input from 784 values to (28, 28, 1). The 1 at the end is the number of channels and indicates that we use grayscale images. If the input images were colored, the number of channels would usually be 3 (for red, green, and blue).

train_x_keras <- train_x_keras %>% 
  array_reshape(dim = c(nrow(train_x), 28, 28, 1))

test_x_keras <-  test_x_keras %>% 
  array_reshape(dim = c(nrow(test_x), 28, 28, 1))
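One detail worth knowing: array_reshape() fills the new array in row-major order by default, which is different from how base R fills a matrix. Assuming the pixels are stored row by row (as in the plotting function earlier), row-major filling keeps each image intact. The toy example below illustrates the difference on six values.

```r
x <- 1:6  # stand-in for one row of pixel values

# Row-major fill: the first 3 values form the first row
row_major <- matrix(x, nrow = 2, ncol = 3, byrow = TRUE)

# R's default column-major fill: values run down the columns
col_major <- matrix(x, nrow = 2, ncol = 3)

row_major
col_major
```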

train_generator <- flow_images_from_data(
  x = train_x_keras,
  y = train_y_keras,
  generator = datagen,
  batch_size = 32,
  seed = 42
)

Build the model by the following architecture:

Input Layer: (28, 28, 1) nodes
Flatten Layer: used for flattening (28, 28, 1) nodes into 784 nodes
Hidden Layer 1: 512 nodes, relu activation function
Hidden Layer 2: 256 nodes, relu activation function
Hidden Layer 3: 128 nodes, relu activation function
Output Layer: 24 nodes, softmax activation function

Please note that we have the same hidden layers and output layer as model_3hidden.

tensorflow::tf$random$set_seed(42)

model_3hidden_aug <- keras_model_sequential()
model_3hidden_aug %>% 
  layer_flatten(input_shape = c(28, 28, 1)) %>% 
  layer_dense(units = 512,
              activation = "relu",
              name = "hidden1") %>% 
  layer_dense(units = 256,
              activation = "relu",
              name = "hidden2") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden3") %>%
  layer_dense(units = 24,
              activation = "softmax",
              name = "output")
  
summary(model_3hidden_aug)

#> Model: "sequential_2"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> flatten (Flatten)                   (None, 784)                     0           
#> ________________________________________________________________________________
#> hidden1 (Dense)                     (None, 512)                     401920      
#> ________________________________________________________________________________
#> hidden2 (Dense)                     (None, 256)                     131328      
#> ________________________________________________________________________________
#> hidden3 (Dense)                     (None, 128)                     32896       
#> ________________________________________________________________________________
#> output (Dense)                      (None, 24)                      3096        
#> ================================================================================
#> Total params: 569,240
#> Trainable params: 569,240
#> Non-trainable params: 0
#> ________________________________________________________________________________
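The parameter counts in the summary can be checked by hand. The following is a minimal base R sketch, assuming only the layer widths listed in the architecture above; the names layer_sizes and dense_params are made up for illustration.

```r
# Layer widths from the architecture above: 784 flattened inputs,
# then 512, 256, 128 hidden nodes and 24 output nodes.
layer_sizes <- c(784, 512, 256, 128, 24)

# A dense layer has (inputs * units) weights plus one bias per unit.
n_in  <- head(layer_sizes, -1)
n_out <- tail(layer_sizes, -1)
dense_params <- n_in * n_out + n_out
names(dense_params) <- c("hidden1", "hidden2", "hidden3", "output")

dense_params       # parameters per dense layer
sum(dense_params)  # total number of trainable parameters
```

The per-layer values agree with the Param # column (401920, 131328, 32896, 3096) and the total agrees with the 569,240 reported by summary(). The flatten layer only reshapes its input, so it contributes no parameters.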

Compile the model as before.

model_3hidden_aug %>% 
  compile(optimizer = optimizer_adam(lr=0.001),
          loss = "categorical_crossentropy",
          metrics = "accuracy")

To train the model, we won’t use the usual fit() function. Instead, we’ll use the fit_generator() function and pass train_generator as the generator. We also need to specify the steps_per_epoch parameter, which is the number of batches drawn in one epoch, that is, the number of training observations divided by the batch size. Lastly, we will train the model for 70 epochs to squeeze out as much information as possible. Please note, however, that too many epochs may also lead to overfitting.
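As a quick check of the steps_per_epoch arithmetic, here is a small base R sketch using the training set size reported by dim(train) and the batch size of 32. The division does not give a whole number. How a fractional value is treated may depend on the keras version, so rounding it explicitly is one way to make the intent clear; the variable names are only illustrative.

```r
n_train    <- 27455  # number of training images, from dim(train)
batch_size <- 32     # batch size given to flow_images_from_data()

n_train / batch_size           # about 857.97, not a whole number
floor(n_train / batch_size)    # number of full batches per epoch
ceiling(n_train / batch_size)  # number of batches if a partial one is included
n_train %% batch_size          # images left over after the full batches
```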

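history <- model_3hidden_aug %>% 
  fit_generator(
    generator = train_generator,
    # number of batches drawn from the generator in each epoch
    steps_per_epoch = nrow(train_x_keras) / 32,
    epochs = 70,
    # the test set is not augmented and is used for validation only
    validation_data = list(test_x_keras, test_y_keras))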

plot(history)

Now we’re talking! No more overfitting!

pred_3hidden_aug <- predict_classes(object = model_3hidden_aug, x = test_x_keras)
confusionMatrix(as.factor(pred_3hidden_aug), as.factor(test_y$label))

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
#>         0  331   0   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0
#>         1    0 423   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
#>         2    0   0 310   0   0   0   0   0   0   0   0   0   0   0   0   0   0
#>         3    0   0   0 223   0   0   0   1   0   1   0   0   0   0   0   0   0
#>         4    0   0   0   0 474   0  19   0   0   0   0   0   0   0   0   0   0
#>         5    0   0   0   0   0 247   0   0   0   0   0   0   0   0   0   0   0
#>         6    0   0   0  21   0   0 329   1   0   0   0   0   0   0   0   0   0
#>         7    0   0   0   1   0   0   0 391   0   0   0   0   0   0   0   0   0
#>         8    0   9   0   0   0   0   0   0 283   0   0   0   0   0   0   0   5
#>         9    0   0   0   0   0   0   0   0   0 330   0   0   0   0   0   0   0
#>         10   0   0   0   0   0   0   0   0   0   0 209   0   0   0   0   0   0
#>         11   0   0   0   0   0   0   0   0   0   0   0 389   0   0   0   0   0
#>         12   0   0   0   0   3   0   0   0   0   0   0   0 265   0   0   0   0
#>         13   0   0   0   0   0   0   0   0   0   0   0   0   0 246   0   0   0
#>         14   0   0   0   0   0   0   0   0   0   0   0   0   0   0 319   0   0
#>         15   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 164   0
#>         16   0   0   0   0   0   0   0   0   0   0   0   0   3   0   0   0  99
#>         17   0   0   0   0  21   0   0   0   0   0   0   5  21   0   0   0   8
#>         18   0   0   0   0   0   0   0   5   0   0   0   0   0   0   0   0   0
#>         19   0   0   0   0   0   0   0  18   0   0   0   0   0   0   0   0  31
#>         20   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
#>         21   0   0   0   0   0   0   0   0   0   0   0   0   0   0  28   0   0
#>         22   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
#>         23   0   0   0   0   0   0   0  20   5   0   0   0   0   0   0   0   0
#>           Reference
#> Prediction  17  18  19  20  21  22  23
#>         0    0   0   0   0   0   0   0
#>         1    0   0   0   0   0   0   0
#>         2    0   0   0   0   0   0   0
#>         3    0  21  19   5   0   0   0
#>         4   32   0   0   0   0   0   0
#>         5    0   0   0   0   0   0   0
#>         6    0   0   0   0   0   0   0
#>         7    0   0   0   0   0   0   0
#>         8    0   0   0   0   0   0  26
#>         9    0   0   0   1   0   0   0
#>         10   0   1   0   0   0   0   0
#>         11   0   0   0   0   0   0   0
#>         12   0   0   0   0   0   0   0
#>         13   0   0   0   0   0   0   0
#>         14   0   0   0   0   0   0   0
#>         15   0   0   0   0   0   0   0
#>         16   0   0  20  21   0   0   0
#>         17 214   0   0   0   0   0   0
#>         18   0 226   0   0   0   0   0
#>         19   0   0 219   0  32   3   0
#>         20   0   0   0 297   0   0   0
#>         21   0   0   8  22 174   0   0
#>         22   0   0   0   0   0 264   0
#>         23   0   0   0   0   0   0 306
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.9387               
#>                  95% CI : (0.9328, 0.9441)     
#>     No Information Rate : 0.0694               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9358               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity           1.00000  0.97917  1.00000  0.91020  0.95181  1.00000
#> Specificity           0.99971  0.99985  1.00000  0.99321  0.99236  1.00000
#> Pos Pred Value        0.99399  0.99764  1.00000  0.82593  0.90286  1.00000
#> Neg Pred Value        1.00000  0.99867  1.00000  0.99681  0.99639  1.00000
#> Prevalence            0.04615  0.06023  0.04322  0.03416  0.06944  0.03444
#> Detection Rate        0.04615  0.05898  0.04322  0.03109  0.06609  0.03444
#> Detection Prevalence  0.04643  0.05912  0.04322  0.03765  0.07320  0.03444
#> Balanced Accuracy     0.99985  0.98951  1.00000  0.95171  0.97208  1.00000
#>                      Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
#> Sensitivity           0.94540  0.89679  0.98264  0.99698   1.00000   0.98731
#> Specificity           0.99678  0.99985  0.99419  0.99985   0.99986   1.00000
#> Pos Pred Value        0.93732  0.99745  0.87616  0.99698   0.99524   1.00000
#> Neg Pred Value        0.99721  0.99336  0.99927  0.99985   1.00000   0.99926
#> Prevalence            0.04852  0.06079  0.04016  0.04615   0.02914   0.05494
#> Detection Rate        0.04587  0.05452  0.03946  0.04601   0.02914   0.05424
#> Detection Prevalence  0.04894  0.05466  0.04504  0.04615   0.02928   0.05424
#> Balanced Accuracy     0.97109  0.94832  0.98841  0.99842   0.99993   0.99365
#>                      Class: 12 Class: 13 Class: 14 Class: 15 Class: 16
#> Sensitivity            0.91065    1.0000   0.91931   1.00000   0.68750
#> Specificity            0.99956    1.0000   1.00000   1.00000   0.99374
#> Pos Pred Value         0.98881    1.0000   1.00000   1.00000   0.69231
#> Neg Pred Value         0.99623    1.0000   0.99591   1.00000   0.99360
#> Prevalence             0.04057    0.0343   0.04838   0.02287   0.02008
#> Detection Rate         0.03695    0.0343   0.04448   0.02287   0.01380
#> Detection Prevalence   0.03737    0.0343   0.04448   0.02287   0.01994
#> Balanced Accuracy      0.95511    1.0000   0.95965   1.00000   0.84062
#>                      Class: 17 Class: 18 Class: 19 Class: 20 Class: 21
#> Sensitivity            0.86992   0.91129   0.82331   0.85838   0.84466
#> Specificity            0.99206   0.99928   0.98784   1.00000   0.99167
#> Pos Pred Value         0.79554   0.97835   0.72277   1.00000   0.75000
#> Neg Pred Value         0.99536   0.99683   0.99316   0.99287   0.99539
#> Prevalence             0.03430   0.03458   0.03709   0.04824   0.02872
#> Detection Rate         0.02984   0.03151   0.03054   0.04141   0.02426
#> Detection Prevalence   0.03751   0.03221   0.04225   0.04141   0.03235
#> Balanced Accuracy      0.93099   0.95528   0.90557   0.92919   0.91817
#>                      Class: 22 Class: 23
#> Sensitivity            0.98876   0.92169
#> Specificity            1.00000   0.99635
#> Pos Pred Value         1.00000   0.92447
#> Neg Pred Value         0.99957   0.99620
#> Prevalence             0.03723   0.04629
#> Detection Rate         0.03681   0.04267
#> Detection Prevalence   0.03681   0.04615
#> Balanced Accuracy      0.99438   0.95902

We can observe that many classes are predicted perfectly or almost perfectly. Some classes are still hard to differentiate, such as 4=E and 17=S: 32 S images are predicted as E and 21 E images are predicted as S. This is likely due to the similar hand gestures for the letters E and S.
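The sensitivity values for E and S can be traced back to the confusion matrix by hand. The sketch below copies the non-zero counts from reference columns 4 and 17 of the matrix above; the vector names are made up for illustration.

```r
# Non-zero counts copied from the confusion matrix above.
# Names are the predicted classes.
ref_E <- c("4" = 474, "12" = 3, "17" = 21)  # reference column 4 (true E)
ref_S <- c("4" = 32, "17" = 214)            # reference column 17 (true S)

sens_E <- ref_E[["4"]] / sum(ref_E)    # share of E images predicted as E
sens_S <- ref_S[["17"]] / sum(ref_S)   # share of S images predicted as S
S_as_E <- ref_S[["4"]] / sum(ref_S)    # share of S images mistaken for E
E_as_S <- ref_E[["17"]] / sum(ref_E)   # share of E images mistaken for S

round(c(sens_E = sens_E, sens_S = sens_S, S_as_E = S_as_E, E_as_S = E_as_S), 5)
```

The first two values reproduce the Sensitivity entries for Class 4 and Class 17 (0.95181 and 0.86992). Almost all of the missed S images are predicted as E.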

temp <- data.frame(
  'train_acc' = tail(history$metrics$accuracy, n=1),
  'test_acc' = tail(history$metrics$val_accuracy, n=1), 
  row.names = 'Dense with 3 hidden layers and data augmentation')

result <- rbind(result, temp)
result

#>                                                  train_acc  test_acc
#> Dense with 2 hidden layers                       0.9635403 0.7039877
#> Dense with 3 hidden layers                       1.0000000 0.8177635
#> Dense with 3 hidden layers and data augmentation 0.9593772 0.9386503

We obtain the best model so far, with similar train and test accuracy: about 96% and 94%, respectively.
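One simple way to compare the three models is the gap between train and test accuracy, since a large gap suggests overfitting. The base R sketch below copies the numbers from the table above into a new data frame (result_copy is a hypothetical name, used so the original result object stays untouched).

```r
# Final-epoch accuracies copied from the result table above.
result_copy <- data.frame(
  train_acc = c(0.9635403, 1.0000000, 0.9593772),
  test_acc  = c(0.7039877, 0.8177635, 0.9386503),
  row.names = c("Dense with 2 hidden layers",
                "Dense with 3 hidden layers",
                "Dense with 3 hidden layers and data augmentation")
)

# Gap between train and test accuracy for each model.
result_copy$gap <- result_copy$train_acc - result_copy$test_acc
round(result_copy, 4)
```

The gap shrinks from roughly 26 and 18 percentage points for the two models without augmentation to about 2 percentage points for the augmented model.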

Conclusion

Neural Networks (NN) are well suited to image classification problems. This is because it’s hard to extract features from images manually, and an NN can do this internally without us worrying about which features to extract. For our problem, we see that an NN model alone may overfit. Hence, data augmentation is introduced; it lifts the test accuracy significantly (from about 82% to 94% for the model with 3 hidden layers) and reduces overfitting. However, many things could still be improved:

Tune the learning rate, and use a learning rate scheduler if necessary
Try other approaches to combat overfitting, such as dropout layers or regularization
Switch to a Convolutional Neural Network, which will very likely perform better on image data
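As an illustration of the first suggestion, here is a minimal sketch of a step-decay learning rate schedule in base R. The function step_decay and its settings (start at 0.001, halve every 20 epochs) are hypothetical choices, not something tuned in this article. The commented-out line shows how such a function could be handed to callback_learning_rate_scheduler() from the keras package, which is assumed here to call it with a zero-based epoch index and the current learning rate.

```r
# Hypothetical step-decay schedule: start at 0.001 and halve every 20 epochs.
# `epoch` is a zero-based epoch index; `lr` is the current learning rate,
# which this simple schedule ignores.
step_decay <- function(epoch, lr) {
  initial_lr <- 0.001
  drop       <- 0.5
  every      <- 20
  initial_lr * drop^(epoch %/% every)
}

# Learning rate at a few selected epochs of a 70-epoch run.
sapply(c(0, 19, 20, 40, 69), step_decay, lr = 0.001)

# Assumed usage with the keras R package (not run here):
# lr_callback <- callback_learning_rate_scheduler(step_decay)
```

With these settings the learning rate stays at 0.001 for the first 20 epochs and ends at 0.000125 in the last epochs. The callback would then be passed through the callbacks argument of the training call.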

Hand Gesture Recognition

Albers Uzila

February 14, 2021