American Sign Language (ASL) is a complete, natural language that has the same linguistic properties as spoken languages, with grammar that differs from English. ASL is expressed by movements of the hands and face. It is the primary language of many North Americans who are deaf or hard of hearing, and is used by many hearing people as well. We are given pictures of ASL hand gestures, each representing a letter of the Roman alphabet. Can we classify each picture as a letter? This is a multiclass classification problem, which we will solve by developing a Neural Network (NN) model.
Let’s read the dataset.
The dataset format is patterned to match closely with the classic MNIST. The training data (27,455 cases) and test data (7,172 cases) are approximately half the size of the standard MNIST but otherwise similar, with a header row of label, pixel1, pixel2, …, pixel784. Each row represents a single \(28 \times 28\) pixel image with grayscale values between 0 and 255.
#> [1] 27455 785
#> [1] 7172 785
Each training and test case carries a label (0-25) that maps one-to-one to a letter A-Z. There are no cases for 9=J or 25=Z because signing those letters requires motion.
#> [1] 0 1 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#> [1] 0 1 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
First, we need to correct the labels. Since labels 9 and 25 are missing, we can subtract 1 from every label bigger than 9. This way, the labels become consecutive integers from 0 to 23, one for each of the 24 letters.
train[train$label > 9, 'label'] <- train[train$label > 9, 'label'] - 1
test[test$label > 9, 'label'] <- test[test$label > 9, 'label'] - 1
It doesn’t hurt to see what our dataset looks like. Let’s take a look at the first 32 pictures from train.
vizTrain <- function(input) {
  # Image side length: 784 pixel columns give 28
  dimmax <- sqrt(ncol(input) - 1)
  cols <- 8
  rows <- floor((nrow(input) - 1) / cols) + 1
  par(mfrow = c(rows, cols), mar = c(0.1, 0.1, 0.1, 0.1))
  for (i in 1:nrow(input)) {
    # Pixels are stored row by row, so fill the matrix by row
    m1 <- matrix(input[i, 2:ncol(input)], nrow = dimmax, byrow = TRUE)
    m1 <- apply(m1, 2, as.numeric)
    # Rotate so that image() draws the picture upright
    m1 <- t(apply(m1, 2, rev))
    image(1:dimmax, 1:dimmax, m1, col = grey.colors(255), xaxt = "n", yaxt = "n")
    # Print the label of the plotted row in the top-left corner
    text(3, 26, col = "black", cex = 1.2, input[i, 1])
  }
}
vizTrain(train[1:32, ])
We perform a grayscale normalization to reduce the effect of illumination differences. Moreover, the NN models we will use converge faster on [0..1] data than on [0..255]. To do this, simply divide each pixel value by 255. We also separate the predictors and the target in both the train and test datasets, resulting in train_x, test_x, train_y, and test_y.
train_x <- train %>%
  select(-label) %>%
  data.matrix() / 255
test_x <- test %>%
  select(-label) %>%
  data.matrix() / 255
train_y <- train %>%
  select(label)
test_y <- test %>%
  select(label)
NN models don’t recognize categorical features. For that reason, we need to one-hot encode the labels train_y and test_y. One-hot encoding generates a column of ones and zeros for each category. So in our case, the result is a matrix with 24 columns in which each row is all zeros except for a single cell with the value 1. The column in which this 1 occurs corresponds to the label of that row; since labels start at 0, label k sits in column k + 1. For example, the first six observations in train_y have the labels 3, 6, 2, 2, 12, and 15, as can be seen in the following table.
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] 0 0 0 1 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [3,] 0 0 1 0 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 1 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
#> [1,] 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 1 0 0 0 0 0 0 0 0
We do the same to test_y.
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 1 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 1 0 0 0 0
#> [4,] 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 1 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
#> [1,] 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 1 0 0 0
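The encoding shown in the two tables above can be reproduced with a few lines of base R. This is only a minimal sketch: the helper name one_hot() is hypothetical, and the source does not show which function produced the tables (in the keras R package, to_categorical() is the usual helper for this step, but that is an assumption here).

```r
# One-hot encode 0-based integer labels into an n x num_classes matrix.
# one_hot() is an illustrative helper, not a function from the original post.
one_hot <- function(labels, num_classes) {
  out <- matrix(0, nrow = length(labels), ncol = num_classes)
  # Matrix indexing: row i, column labels[i] + 1 (R columns start at 1).
  out[cbind(seq_along(labels), labels + 1)] <- 1
  out
}

first_six <- c(3, 6, 2, 2, 12, 15)  # labels quoted in the text
encoded <- one_hot(first_six, num_classes = 24)

dim(encoded)                         # 6 rows, 24 columns
apply(encoded, 1, which.max) - 1     # recovers the original labels
```

Each row sums to 1, and the position of the 1 is the label plus one, which matches the layout printed for train_y.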
Since we don’t prefer one class over the others, and all the classes are balanced as shown below, we will simply use accuracy as our metric. Also, since NN models for image classification take time to train, we want model validation to be as simple as possible. We will therefore use the separate test dataset to validate the model.
ggplot(train %>%
         group_by(label) %>%
         count(name = 'observation_count'),
       aes(x = label, y = observation_count)) +
  geom_bar(stat = 'identity') +
  ggtitle("Number of Observations among Labels")
The problem with the previous models is that they tend to memorize the pictures in the train dataset, so when a new test dataset comes in they can’t recognize it. Data augmentation is one of many techniques to address this problem. Given a picture, data augmentation transforms it slightly to create a new picture. These new pictures are then fed to the model. This way, the model sees many versions of the original picture and hopefully learns what the picture means instead of memorizing it. We will only use some simple transformations: random rotation of up to 10 degrees, random zoom of up to 10%, and random horizontal and vertical shifts of up to 10% of the image width and height.
We don’t use horizontal or vertical flips since in our case they can change the meaning of the image. This data augmentation can be done using the image_data_generator() function. Save the generator to an object named datagen.
datagen <- image_data_generator(
  rotation_range = 10,
  zoom_range = 0.1,
  width_shift_range = 0.1,
  height_shift_range = 0.1
)
This time we will do the modeling in a slightly different way. Instead of feeding the model batches of 32 rows of 784 pixel values, we will feed it 32 images of size \(28 \times 28\) pixels at a time. We can use the flow_images_from_data() function, passing datagen as the generator. That completes the generator for the train dataset; let’s call it train_generator. For validation, as before, we will use all test dataset observations at once for each epoch, without augmentation.
Now, since train_generator takes images as inputs, we need to reshape the inputs from rows of 784 values to arrays of shape (28, 28, 1). The 1 at the end is the number of channels, which indicates that we use grayscale images. If the input images were in color, the number of channels would usually be 3 (for red, green, and blue).
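To see what this reshape does to the pixel order, here is a small base-R sketch on toy data. The helper rows_to_images() is hypothetical and only mimics a row-major fill, which is, as far as we know, the default behavior of array_reshape(); the toy images are 2 x 2 instead of 28 x 28.

```r
# Reshape an n x (h*w) matrix into an n x h x w x 1 array, filling row by row.
# rows_to_images() is an illustrative helper, not part of keras.
rows_to_images <- function(x, h, w) {
  stopifnot(ncol(x) == h * w)
  # t(x) lays the pixels of each image out contiguously; aperm() then puts the
  # dimensions in (sample, row, column, channel) order.
  aperm(array(t(x), dim = c(1, w, h, nrow(x))), c(4, 3, 2, 1))
}

# Two toy "images" of 2 x 2 pixels, stored as rows of 4 values.
toy <- rbind(1:4, 5:8)
imgs <- rows_to_images(toy, h = 2, w = 2)

dim(imgs)       # 2 2 2 1
imgs[1, , , 1]  # first image: pixel rows (1, 2) and (3, 4)
```

The first two values of each stored row become the first pixel row of the image, the next two the second pixel row, and so on, which is the same row-by-row convention used when plotting the images earlier.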
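train_x_keras <- train_x_keras %>%
  array_reshape(dim = c(nrow(train_x), 28, 28, 1))
test_x_keras <- test_x_keras %>%
  array_reshape(dim = c(nrow(test_x), 28, 28, 1))
train_generator <- flow_images_from_data(
  x = train_x_keras,
  y = train_y_keras,
  generator = datagen,
  batch_size = 32,
  seed = 42
)
Build the model with the following architecture: a flatten layer that turns each (28, 28, 1) image back into 784 values, three dense hidden layers with 512, 256, and 128 units and ReLU activation, and a dense output layer with 24 units and softmax activation.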
Please note that we have the same hidden layers and output layer as model_3hidden.
tensorflow::tf$random$set_seed(42)
model_3hidden_aug <- keras_model_sequential()
model_3hidden_aug %>%
  layer_flatten(input_shape = c(28, 28, 1)) %>%
  layer_dense(units = 512,
              activation = "relu",
              name = "hidden1") %>%
  layer_dense(units = 256,
              activation = "relu",
              name = "hidden2") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden3") %>%
  layer_dense(units = 24,
              activation = "softmax",
              name = "output")
summary(model_3hidden_aug)
#> Model: "sequential_2"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> flatten (Flatten) (None, 784) 0
#> ________________________________________________________________________________
#> hidden1 (Dense) (None, 512) 401920
#> ________________________________________________________________________________
#> hidden2 (Dense) (None, 256) 131328
#> ________________________________________________________________________________
#> hidden3 (Dense) (None, 128) 32896
#> ________________________________________________________________________________
#> output (Dense) (None, 24) 3096
#> ================================================================================
#> Total params: 569,240
#> Trainable params: 569,240
#> Non-trainable params: 0
#> ________________________________________________________________________________
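The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. A short sketch of that arithmetic, using only the layer sizes from the model definition:

```r
# Units per layer, from the flattened input to the output layer.
units <- c(784, 512, 256, 128, 24)

# A dense layer has (inputs * outputs) weights plus one bias per output unit.
n_in <- head(units, -1)
n_out <- tail(units, -1)
params <- n_in * n_out + n_out
names(params) <- c("hidden1", "hidden2", "hidden3", "output")

params       # 401920 131328 32896 3096
sum(params)  # 569240
```

The flatten layer contributes no parameters, so the total matches the 569,240 trainable parameters reported above.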
Compile the model as before.
model_3hidden_aug %>%
  compile(optimizer = optimizer_adam(lr = 0.001),
          loss = "categorical_crossentropy",
          metrics = "accuracy")
To train the model, we won’t use the usual fit() function. Instead, we’ll use the fit_generator() function and pass train_generator as the generator. We also need to specify the steps_per_epoch parameter, which is the number of batches drawn in one epoch, that is, the number of train observations divided by the batch size. Lastly, we will train the model for 70 epochs to squeeze out as much information as possible. But please note that too many epochs may also lead to overfitting.
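As a quick check of the steps_per_epoch arithmetic, here is the division for the training-set size reported at the top of the post. Rounding up to a whole number of batches is an assumption made for this sketch; the training call itself passes the plain quotient.

```r
n_train <- 27455   # training rows reported by dim(train)
batch_size <- 32

n_train / batch_size                    # 857.96875
steps <- ceiling(n_train / batch_size)  # whole batches needed to see every image
steps                                   # 858
n_train - (steps - 1) * batch_size      # 31 images left for the final batch
```

So one pass over the training data takes roughly 858 batches of 32 augmented images.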
history <- model_3hidden_aug %>%
  fit_generator(
    generator = train_generator,
    steps_per_epoch = nrow(train_x_keras) / 32,
    epochs = 70,
    validation_data = list(test_x_keras, test_y_keras))
Now we’re talking! No more overfitting!
pred_3hidden_aug <- predict_classes(object = model_3hidden_aug, x = test_x_keras)
confusionMatrix(as.factor(pred_3hidden_aug), as.factor(test_y$label))
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 0 331 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
#> 1 0 423 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> 2 0 0 310 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 3 0 0 0 223 0 0 0 1 0 1 0 0 0 0 0 0 0
#> 4 0 0 0 0 474 0 19 0 0 0 0 0 0 0 0 0 0
#> 5 0 0 0 0 0 247 0 0 0 0 0 0 0 0 0 0 0
#> 6 0 0 0 21 0 0 329 1 0 0 0 0 0 0 0 0 0
#> 7 0 0 0 1 0 0 0 391 0 0 0 0 0 0 0 0 0
#> 8 0 9 0 0 0 0 0 0 283 0 0 0 0 0 0 0 5
#> 9 0 0 0 0 0 0 0 0 0 330 0 0 0 0 0 0 0
#> 10 0 0 0 0 0 0 0 0 0 0 209 0 0 0 0 0 0
#> 11 0 0 0 0 0 0 0 0 0 0 0 389 0 0 0 0 0
#> 12 0 0 0 0 3 0 0 0 0 0 0 0 265 0 0 0 0
#> 13 0 0 0 0 0 0 0 0 0 0 0 0 0 246 0 0 0
#> 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 319 0 0
#> 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 164 0
#> 16 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 99
#> 17 0 0 0 0 21 0 0 0 0 0 0 5 21 0 0 0 8
#> 18 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0
#> 19 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 31
#> 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 0 0
#> 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 23 0 0 0 0 0 0 0 20 5 0 0 0 0 0 0 0 0
#> Reference
#> Prediction 17 18 19 20 21 22 23
#> 0 0 0 0 0 0 0 0
#> 1 0 0 0 0 0 0 0
#> 2 0 0 0 0 0 0 0
#> 3 0 21 19 5 0 0 0
#> 4 32 0 0 0 0 0 0
#> 5 0 0 0 0 0 0 0
#> 6 0 0 0 0 0 0 0
#> 7 0 0 0 0 0 0 0
#> 8 0 0 0 0 0 0 26
#> 9 0 0 0 1 0 0 0
#> 10 0 1 0 0 0 0 0
#> 11 0 0 0 0 0 0 0
#> 12 0 0 0 0 0 0 0
#> 13 0 0 0 0 0 0 0
#> 14 0 0 0 0 0 0 0
#> 15 0 0 0 0 0 0 0
#> 16 0 0 20 21 0 0 0
#> 17 214 0 0 0 0 0 0
#> 18 0 226 0 0 0 0 0
#> 19 0 0 219 0 32 3 0
#> 20 0 0 0 297 0 0 0
#> 21 0 0 8 22 174 0 0
#> 22 0 0 0 0 0 264 0
#> 23 0 0 0 0 0 0 306
#>
#> Overall Statistics
#>
#> Accuracy : 0.9387
#> 95% CI : (0.9328, 0.9441)
#> No Information Rate : 0.0694
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9358
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity 1.00000 0.97917 1.00000 0.91020 0.95181 1.00000
#> Specificity 0.99971 0.99985 1.00000 0.99321 0.99236 1.00000
#> Pos Pred Value 0.99399 0.99764 1.00000 0.82593 0.90286 1.00000
#> Neg Pred Value 1.00000 0.99867 1.00000 0.99681 0.99639 1.00000
#> Prevalence 0.04615 0.06023 0.04322 0.03416 0.06944 0.03444
#> Detection Rate 0.04615 0.05898 0.04322 0.03109 0.06609 0.03444
#> Detection Prevalence 0.04643 0.05912 0.04322 0.03765 0.07320 0.03444
#> Balanced Accuracy 0.99985 0.98951 1.00000 0.95171 0.97208 1.00000
#> Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
#> Sensitivity 0.94540 0.89679 0.98264 0.99698 1.00000 0.98731
#> Specificity 0.99678 0.99985 0.99419 0.99985 0.99986 1.00000
#> Pos Pred Value 0.93732 0.99745 0.87616 0.99698 0.99524 1.00000
#> Neg Pred Value 0.99721 0.99336 0.99927 0.99985 1.00000 0.99926
#> Prevalence 0.04852 0.06079 0.04016 0.04615 0.02914 0.05494
#> Detection Rate 0.04587 0.05452 0.03946 0.04601 0.02914 0.05424
#> Detection Prevalence 0.04894 0.05466 0.04504 0.04615 0.02928 0.05424
#> Balanced Accuracy 0.97109 0.94832 0.98841 0.99842 0.99993 0.99365
#> Class: 12 Class: 13 Class: 14 Class: 15 Class: 16
#> Sensitivity 0.91065 1.0000 0.91931 1.00000 0.68750
#> Specificity 0.99956 1.0000 1.00000 1.00000 0.99374
#> Pos Pred Value 0.98881 1.0000 1.00000 1.00000 0.69231
#> Neg Pred Value 0.99623 1.0000 0.99591 1.00000 0.99360
#> Prevalence 0.04057 0.0343 0.04838 0.02287 0.02008
#> Detection Rate 0.03695 0.0343 0.04448 0.02287 0.01380
#> Detection Prevalence 0.03737 0.0343 0.04448 0.02287 0.01994
#> Balanced Accuracy 0.95511 1.0000 0.95965 1.00000 0.84062
#> Class: 17 Class: 18 Class: 19 Class: 20 Class: 21
#> Sensitivity 0.86992 0.91129 0.82331 0.85838 0.84466
#> Specificity 0.99206 0.99928 0.98784 1.00000 0.99167
#> Pos Pred Value 0.79554 0.97835 0.72277 1.00000 0.75000
#> Neg Pred Value 0.99536 0.99683 0.99316 0.99287 0.99539
#> Prevalence 0.03430 0.03458 0.03709 0.04824 0.02872
#> Detection Rate 0.02984 0.03151 0.03054 0.04141 0.02426
#> Detection Prevalence 0.03751 0.03221 0.04225 0.04141 0.03235
#> Balanced Accuracy 0.93099 0.95528 0.90557 0.92919 0.91817
#> Class: 22 Class: 23
#> Sensitivity 0.98876 0.92169
#> Specificity 1.00000 0.99635
#> Pos Pred Value 1.00000 0.92447
#> Neg Pred Value 0.99957 0.99620
#> Prevalence 0.03723 0.04629
#> Detection Rate 0.03681 0.04267
#> Detection Prevalence 0.03681 0.04615
#> Balanced Accuracy 0.99438 0.95902
We can observe that many classes are predicted perfectly or almost perfectly. Some classes are still hard to differentiate, such as 4=E and 17=S: 21 images of E are predicted as S and 32 images of S are predicted as E. This is due to the similar hand gestures for the letters E and S.
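To tie the class numbers back to letters and to the per-class statistics, the sketch below maps the shifted labels to letters and recomputes the sensitivity of E and S from the counts in the confusion matrix above. The helper label_to_letter() is hypothetical and assumes the 0 to 23 labels follow alphabetical order with J and Z left out.

```r
# Labels 0..23 cover A-Z without J and Z (the 10th and 26th letters).
letters_used <- LETTERS[-c(10, 26)]
label_to_letter <- function(label) letters_used[label + 1]

label_to_letter(c(4, 17))  # "E" "S"

# Counts copied from reference columns 4 and 17 of the confusion matrix.
e_col <- c(correct = 474, as_12 = 3, as_17 = 21)
s_col <- c(correct = 214, as_4 = 32)

# Sensitivity (recall) = correctly predicted / all true cases of the class.
sens_e <- unname(e_col["correct"] / sum(e_col))
sens_s <- unname(s_col["correct"] / sum(s_col))
round(c(E = sens_e, S = sens_s), 5)  # 0.95181 0.86992
```

These values agree with the Sensitivity rows for Class: 4 and Class: 17 in the statistics by class.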
temp <- data.frame(
  'train_acc' = tail(history$metrics$accuracy, n = 1),
  'test_acc' = tail(history$metrics$val_accuracy, n = 1),
  row.names = 'Dense with 3 hidden layers and data augmentation')
result <- rbind(result, temp)
result
#> train_acc test_acc
#> Dense with 2 hidden layers 0.9635403 0.7039877
#> Dense with 3 hidden layers 1.0000000 0.8177635
#> Dense with 3 hidden layers and data augmentation 0.9593772 0.9386503
The last model is the best one, with train and test accuracy close to each other, at about 96% and 94% respectively.
Neural Networks (NN) are well suited to image classification problems. It’s hard to extract features from images manually, and an NN can do this internally without us having to decide which features to extract. For our problem, we saw that an NN model alone may overfit. Data augmentation was therefore introduced, and it lifted the test accuracy significantly (from about 82% to about 94%) and reduced overfitting. However, many things could still be improved: