Image Recognition with Keras in R - An Exercise in Transfer Learning

Author: Qasim Ahmed

Introduction

The purpose of this exercise is to gain familiarity with implementing a neural net and to see whether we can train it for image recognition. The secondary objective is to get exposure to tools such as Keras and to establish a way of getting images off a hard drive or folder and breaking them down into a machine-readable format.

Problem Statement

To implement this learning, we focus on three main questions:

1. Can we train a model to accurately detect types of road signs?
2. Can we train this model in a timely manner, from a data-to-deployment perspective?
3. Can our trained model carry out multi-label classification of different image types?

Data

With the above questions in mind, we chose to look at traffic signs that appear on the road.

Importance of Traffic Signs

Traffic signs present several unique challenges that humans can take for granted. The signs convey information in a variety of different ways:

- They come in various colors: red, white, yellow, blue.
- They can have numeric information on them, as is the case with speed limits.
- They can have verbal / alphabetic information, like "stop" or "go".
- They can have figures describing vital information, like "slippery road" or "dead end".

All of these combined present an interesting problem to try to solve with the help of a neural net and machine learning.

Data Source

To get started, we found and used the "German Traffic Sign Recognition Benchmark" (GTSRB). This is a large database that:

- has 42 classes;
- has more than 100,000 images in total;
- was captured while driving around, giving us a real-world image database, with issues that are generally encountered in the real world, like varying degrees of exposure, signs captured at different angles, blurry images, etc.

More information on the data can be found on the Institut für Neuroinformatik's website.

Self-Imposed Limitations on the Data

As mentioned in the previous section, this is a huge dataset spanning many classes. We are restricted by hardware limitations, running our selected models on a CPU instead of a GPU. We therefore prune the dataset and select 12 classes from amongst the 42 present in it.

This still leaves us with roughly 400 images per class, or 400 * 12 = 4,800 images, to train our model.

Data Prep

Before starting in R, we carried out some data organization at the operating system level. Namely, we took the GTSRB extract and:

a) kept only 12 train folders;
b) renamed the folders with labels, so "Stop Sign" instead of "6";
c) randomly selected 10 images from each folder to create a test directory, giving us a sample of 120 images for testing (a sketch of scripting this step is shown below);
d) stored a list of test files in a "submission.csv" to document our predictions.
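
We did step (c) by hand, but it could also be scripted for future runs. Below is a minimal sketch, not the code we actually ran; it assumes the train_directory and test_directory paths that are defined a few chunks further down.

## Hypothetical sketch of step (c): move 10 random images per class folder
## into a flat test directory.
set.seed(123)  # make the random pick reproducible
dir.create(test_directory, showWarnings = FALSE, recursive = TRUE)
for (cls in list.dirs(train_directory, recursive = FALSE)) {
  imgs   <- list.files(cls, full.names = TRUE)
  picked <- sample(imgs, 10)
  file.rename(picked, file.path(test_directory, basename(picked)))
}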

## Loading Libraries

library(keras)
library(stringr)
library(tidyverse)
library(data.table)

## Setting working directory
setwd("C:/Users/Qasim/OneDrive/Desktop/York U/5 - Advanced Methods of Data Analytics/Assignment 3")

## Setting image attributes

img_width <- 64
img_height <- 64
batch_size <- 16

## Pointing towards Data location

train_directory <- "Assignment_Dataset/GTSRB_Classification/train"
test_directory <- "Assignment_Dataset/GTSRB_Classification/test"
submission <- read.csv("Assignment_Dataset/GTSRB_Classification/sample_submission.csv")
colnames(submission) <- c("file", "sign_type")

Creating the Training Set

To automate the process and create code that can be reused and improved in the future, we transform our data as follows after loading the files:

1. We start by converting all images to the same dimensions, 64 pixels by 64 pixels in RGB color; this gives a [64 x 64 x 3] array per image.
2. Next, we pull the corresponding label for each image into a vector; this is taken from the folder names we used to organize our training data.
3. Next, we scale all pixel values to a 0-1 range.
4. We make the labels numerical and ensure they are in the range 0-11 (and not 1-12); Keras runs Python code, which starts vectors, arrays, etc. at 0.
5. Finally, we one-hot encode the labels; we need a separate column for each label.

## Loading data and transforming. We keep only the first 12 labels.

files <- list.files(train_directory, recursive = TRUE)

train.label <- rep(0, times = length(files))
train.array <- array(NA, dim = c(length(files), img_height, img_width, 3))
for (i in 1:length(files)) {
  
  # Load each image, resized to 64 x 64 in RGB.
  temp <- image_load(paste0(train_directory, "/", files[i]), 
                     target_size = c(img_height, img_width), 
                     grayscale = FALSE)
  
  temp.array <- image_to_array(temp, data_format = "channels_last")
  train.array[i,,,] <- temp.array
  # The label is the folder name, i.e. everything before the "/".
  train.label[i] <- sub("/(.+)", "", files[i])
  
}

## Final training data and label
train.array <- train.array/255   # scale pixel values to the 0-1 range
# Map labels to integers 0-11 (Python-style indexing), then one-hot encode.
label <- to_categorical(as.numeric(as.factor(train.label)) - 1)
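
A quick sanity check on the resulting shapes helps catch loading mistakes early. The commented values are what we expect, assuming roughly 400 images per class.

## Optional sanity checks
dim(train.array)    # 4800 64 64 3
table(train.label)  # roughly 400 images per class
dim(label)          # 4800 rows, 12 one-hot columns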

Creating a Validation Set

Next, we create a small validation set from our training data and remove those images from the training array and label.

## Creating small validation set.

set.seed(123)  # seed the RNG so the split is reproducible
val.size = 0.05
val.sample = sample(nrow(train.array), val.size*nrow(train.array))

val.array = train.array[val.sample,,,]
train.array = train.array[-c(val.sample),,,]

## Validation set and label
val.label = label[val.sample,]
label = label[-c(val.sample),]

Creating the Test Set

Similarly, we create our test set for final prediction.

## Test Set

files <- list.files(test_directory, recursive = TRUE)
test.array <- array(NA, dim = c(length(files), img_height, img_width, 3))

for (i in 1:length(files)) {
  
  temp <- image_load(paste0(test_directory,"/",files[i]), 
                     target_size = c(img_height, img_width), 
                     grayscale = FALSE)
  
  temp.array <- image_to_array(temp, data_format = "channels_last")
  test.array[i,,,] <- temp.array
  
}

## Test Data - remember no label
test.array <- test.array/255

Creating a Data Generator

Another useful practice is to augment our dataset by creating more data points that are slightly shifted. In simple terms, the function adds deliberate noise to the data by rotating images by up to +/- 10 degrees, shifting them horizontally and vertically, flipping them on their axes, etc. This makes the data appear even more diverse to the neural net.

## image_data_generator

datagen <- image_data_generator(
  rotation_range = 10,
  width_shift_range = 0.1,
  height_shift_range = 0.1,
  horizontal_flip = TRUE,
  vertical_flip = TRUE)

train_generator <- flow_images_from_data(
  x = train.array,
  y = label,
  generator = datagen,
  batch_size = batch_size, 
  shuffle = TRUE,
  seed = 123)

validation_generator <- flow_images_from_data(
  x = val.array,
  y = val.label,
  generator = datagen,
  batch_size = batch_size, 
  shuffle = TRUE,
  seed = 123)

## Set batch size to 1, since we want to predict 1 image at a time
test_generator <- flow_images_from_data(
  x = test.array,
  generator = image_data_generator(),
  batch_size = 1,
  shuffle = FALSE)

Model Selection

We have a ton of options here, from pre-trained models available in Keras to those in other packages. Training a model from scratch is not considered, as we are limited by the number of images we have. Based on our compute, time, and image-size constraints, we select VGG16.

CNN

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks most commonly applied to analyzing visual imagery. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. A convolution is a specialized kind of linear operation; convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
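
To make the operation concrete, here is a toy sketch of a single 2D convolution in base R (an illustration only, not part of the pipeline): a 3 x 3 kernel slides over a 5 x 5 image, and each output value is the element-wise product-sum of the kernel with the patch beneath it.

## Toy example: convolve a 5x5 "image" with a 3x3 kernel
## (no padding, stride 1), giving a 3x3 output.
image  <- matrix(1:25, nrow = 5)         # toy single-channel image
kernel <- matrix(1, nrow = 3, ncol = 3)  # box filter: sums each 3x3 neighborhood

out <- matrix(NA, 3, 3)
for (i in 1:3) {
  for (j in 1:3) {
    patch     <- image[i:(i + 2), j:(j + 2)]
    out[i, j] <- sum(patch * kernel)
  }
}
out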

VGG16

VGG16 is a convolutional neural network trained on ImageNet (some 15 million images across 1,000 classes) that achieves 92.7% top-5 test accuracy. More information and details can be found here: https://neurohive.io/en/popular-networks/vgg16/

Cons

Due to its depth and number of fully-connected nodes, VGG16 is over 533 MB. This makes deploying VGG a tiresome task. VGG16 is used in many deep learning image classification problems; however, smaller network architectures (such as SqueezeNet, GoogLeNet, etc.) are often more desirable.

Pros

It is a great building block for learning purposes, as it is easy to implement in Keras, PyTorch, and TensorFlow.

Transfer Learning

Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.

Basically, we are "standing on the shoulders of giants": we re-use and re-purpose a model trained on a large dataset, using the weights and biases it has already learned in its earlier layers to help solve our particular problem.

Model Deployment and Training

We select the model from Keras and choose "imagenet" as the weights.

# include_top = FALSE drops VGG16's original 1000-class classifier head,
# keeping only the convolutional base.
base_model <- application_vgg16(include_top = FALSE, 
                                weights = 'imagenet')

Creating Connected Layers

This is the most important part of transfer learning. We want to create new layers on top and, most importantly, freeze all the layers that we do not want to re-train; otherwise we would just end up training a new network. We create the layers and assemble the model.

# New head: pool the convolutional features, then a small dense classifier
# ending in a 12-way softmax (one unit per sign class).
predictions <- base_model$output %>% 
  layer_global_average_pooling_2d(trainable = T) %>% 
  layer_dense(128, activation = "relu", trainable = T) %>%
  layer_dropout(0.2, trainable = T) %>%
  layer_dense(12, trainable = T) %>%   
  layer_activation("softmax", trainable = T)

model <- keras_model(inputs = base_model$input, outputs = predictions)

Freezing layers.

#This is important:
for (layer in base_model$layers) layer$trainable = FALSE
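
The keras package also provides freeze_weights(), which does the same thing in one call (freeze_weights(base_model)). Either way, a quick check confirms that only the new head's weights remain trainable:

## Only the two new dense layers should still have trainable weights.
length(model$trainable_weights)  # 4 tensors: two kernels and two biases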

Compiling the final model.

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_adam(lr = 0.01, decay = 1e-6),
  metrics = "accuracy"
)

Finally, we look at the summary of our model.

summary(model)
## Model: "model_6"
## __________________________________________________________________________________________________________________________________________________
## Layer (type)                                                     Output Shape                                               Param #               
## ==================================================================================================================================================
## input_8 (InputLayer)                                             [(None, None, None, 3)]                                    0                     
## __________________________________________________________________________________________________________________________________________________
## block1_conv1 (Conv2D)                                            (None, None, None, 64)                                     1792                  
## __________________________________________________________________________________________________________________________________________________
## block1_conv2 (Conv2D)                                            (None, None, None, 64)                                     36928                 
## __________________________________________________________________________________________________________________________________________________
## block1_pool (MaxPooling2D)                                       (None, None, None, 64)                                     0                     
## __________________________________________________________________________________________________________________________________________________
## block2_conv1 (Conv2D)                                            (None, None, None, 128)                                    73856                 
## __________________________________________________________________________________________________________________________________________________
## block2_conv2 (Conv2D)                                            (None, None, None, 128)                                    147584                
## __________________________________________________________________________________________________________________________________________________
## block2_pool (MaxPooling2D)                                       (None, None, None, 128)                                    0                     
## __________________________________________________________________________________________________________________________________________________
## block3_conv1 (Conv2D)                                            (None, None, None, 256)                                    295168                
## __________________________________________________________________________________________________________________________________________________
## block3_conv2 (Conv2D)                                            (None, None, None, 256)                                    590080                
## __________________________________________________________________________________________________________________________________________________
## block3_conv3 (Conv2D)                                            (None, None, None, 256)                                    590080                
## __________________________________________________________________________________________________________________________________________________
## block3_pool (MaxPooling2D)                                       (None, None, None, 256)                                    0                     
## __________________________________________________________________________________________________________________________________________________
## block4_conv1 (Conv2D)                                            (None, None, None, 512)                                    1180160               
## __________________________________________________________________________________________________________________________________________________
## block4_conv2 (Conv2D)                                            (None, None, None, 512)                                    2359808               
## __________________________________________________________________________________________________________________________________________________
## block4_conv3 (Conv2D)                                            (None, None, None, 512)                                    2359808               
## __________________________________________________________________________________________________________________________________________________
## block4_pool (MaxPooling2D)                                       (None, None, None, 512)                                    0                     
## __________________________________________________________________________________________________________________________________________________
## block5_conv1 (Conv2D)                                            (None, None, None, 512)                                    2359808               
## __________________________________________________________________________________________________________________________________________________
## block5_conv2 (Conv2D)                                            (None, None, None, 512)                                    2359808               
## __________________________________________________________________________________________________________________________________________________
## block5_conv3 (Conv2D)                                            (None, None, None, 512)                                    2359808               
## __________________________________________________________________________________________________________________________________________________
## block5_pool (MaxPooling2D)                                       (None, None, None, 512)                                    0                     
## __________________________________________________________________________________________________________________________________________________
## global_average_pooling2d_6 (GlobalAveragePooling2D)              (None, 512)                                                0                     
## __________________________________________________________________________________________________________________________________________________
## dense_14 (Dense)                                                 (None, 128)                                                65664                 
## __________________________________________________________________________________________________________________________________________________
## dropout_6 (Dropout)                                              (None, 128)                                                0                     
## __________________________________________________________________________________________________________________________________________________
## dense_15 (Dense)                                                 (None, 12)                                                 1548                  
## __________________________________________________________________________________________________________________________________________________
## activation_6 (Activation)                                        (None, 12)                                                 0                     
## ==================================================================================================================================================
## Total params: 14,781,900
## Trainable params: 67,212
## Non-trainable params: 14,714,688
## __________________________________________________________________________________________________________________________________________________

We can see that our model has a total of 14,781,900 parameters, and we are only training the new top layers, which amount to 67,212 parameters.
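
The trainable count is easy to verify by hand from the summary above:

## Where the 67,212 trainable parameters come from:
512 * 128 + 128   # dense_14: 512 inputs x 128 units, plus 128 biases = 65,664
128 * 12 + 12     # dense_15: 128 inputs x 12 units, plus 12 biases = 1,548
65664 + 1548      # total = 67,212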

Training Our Model

We finally train our model.

## Code continuity because of how RMarkdown handles code chunks.
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_adam(lr = 0.01, decay = 1e-6),
  metrics = "accuracy"
)

model %>% fit_generator(
  train_generator,
  steps_per_epoch = nrow(train.array)/batch_size, 
  epochs = 10,
  validation_data = validation_generator,
  validation_steps = nrow(val.array)/batch_size,
  verbose = 1)

After 10 epochs our model has the following values against various evaluation metrics:

loss: 0.5524 - accuracy: 0.7887 - val_loss: 0.4342 - val_accuracy: 0.8250

So our off-the-shelf model is able to achieve a good accuracy rate of greater than 75% - 78.9% on the training batches and 82.5% on validation. Not too bad for a few hours of work on average hardware.

Note: We can speed up the training using a smaller number of epochs. The model we trained earlier during a test phase with only 1 epoch had an accuracy of 0.6301 and took about 50 minutes to train.

Prediction and Output

We classify the test data.

y <- predict_generator(model, 
                       test_generator, 
                       steps = nrow(test.array),
                       verbose = 1)
y <- as.data.frame(y)
# Column names must follow to_categorical's encoding, which used the sorted
# factor levels, so we take levels() rather than relying on file order.
colnames(y) <- levels(as.factor(train.label))

# For each row, max.col picks the column with the highest predicted probability.
answers <- data.frame(file = submission$file, sign_type = colnames(y)[max.col(y)])

write.table(answers, "solution.csv", sep = ",", row.names = FALSE)

as_tibble(head(answers, 5))
## # A tibble: 5 x 2
##   file                  sign_type
##   <fct>                 <fct>    
## 1 00000_00001_00014.png Speed_20 
## 2 00000_00001_00019.png Speed_30 
## 3 00000_00003_00016.png Speed_30 
## 4 00000_00005_00017.png Speed_30 
## 5 00000_00005_00019.png Speed_30

The results are also written to solution.csv so that they are stored in an easily accessible fashion.

Summary

We see that, with the help of the body of knowledge that exists out there in academia and the enthusiast community, we can easily develop, train, and deploy an image recognition model using the myriad tools available to us. Our model is trainable within an hour on low compute resources, still provides a high degree of accuracy, and is easily scalable to increase the number of classes we are testing against.

Suggestions

To improve the model, it probably makes sense to look at the following to get better results:

- Increase the image size.
- Increase the number of epochs.
- Try another pretrained model instead of VGG16.
- Experiment with changing the architecture of the trainable layers, for example by fine-tuning some of the frozen convolutional layers (a sketch follows after the next paragraph).

Based on the problem one is trying to solve, tuning the above should result in better predictions.
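
As a concrete example of the last suggestion, here is a hedged sketch (not something we ran) of fine-tuning: once the new head has converged, unfreeze the last VGG16 convolutional block and continue training with a much smaller learning rate.

## Hypothetical fine-tuning step, run after the initial training above.
unfreeze_weights(base_model, from = "block5_conv1")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_adam(lr = 1e-5),  # low rate so we don't wreck the pretrained weights
  metrics = "accuracy"
)

model %>% fit_generator(
  train_generator,
  steps_per_epoch = nrow(train.array)/batch_size,
  epochs = 5,
  validation_data = validation_generator,
  validation_steps = nrow(val.array)/batch_size)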

Shiny App

The above model is presented in a simple app we deployed using the Shiny framework. The app is live at: https://qasimahmed.shinyapps.io/RTrafficSignRecognition
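
For reference, the core of such an app can be quite small. The following is a minimal sketch, not the deployed app's actual code; it assumes the trained model was saved to "model.h5" (e.g. via save_model_hdf5(model, "model.h5")) and that the 12 class names are available in training order.

library(shiny)
library(keras)

model   <- load_model_hdf5("model.h5")      # assumed saved model file
classes <- levels(as.factor(train.label))   # the 12 sign names, in training order

ui <- fluidPage(
  titlePanel("Traffic Sign Recognition"),
  fileInput("img", "Upload a sign image"),
  textOutput("pred")
)

server <- function(input, output) {
  output$pred <- renderText({
    req(input$img)
    # Same preprocessing as training: 64 x 64 RGB, scaled to 0-1.
    x <- image_load(input$img$datapath, target_size = c(64, 64)) %>%
      image_to_array() %>%
      array_reshape(c(1, 64, 64, 3))
    probs <- predict(model, x / 255)
    paste("Predicted sign:", classes[which.max(probs)])
  })
}

shinyApp(ui, server)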