In this project we use a Convolutional Neural Network to solve the following problem:
Image classification has become one of the most influential innovations in Computer Vision since the first digital image scanner. Models that can classify images have driven tremendous advances in the way people interact (social media, search engines & image processing), retail (both in person and online), marketing, theatre & the performing arts, government, surveillance, law enforcement, etc. Thanks to image classification algorithms we are able to receive notifications on social media when someone has posted a picture that may look like us, and self-driving cars are able to recognize objects! The idea of a program being able to identify meaningful objects in an image and make a judgement as to what it is, what it’s connected with and where it belongs based only on the information found in an image has endless applications.
In this project we explore image classification via a Convolutional Neural Network (CNN), which has become the “gold standard” for solving image classification problems. A CNN is a class of deep learning neural network that uses a series of filters to extract features from a data set while keeping the number of parameters relatively low.
Neural networks often outperform approaches built on hand-crafted feature detectors and descriptors such as SIFT, FAST, SURF, BRIEF, etc. because they solve the problems associated with feature detectors being too simple or too complex to generalize categories. Using a neural network allows models to learn features for particular objects (regardless of abstraction), so a system for feature learning can be developed to classify images in a way that bypasses explicitly defined features. Traditional neural networks such as the Multilayer Perceptron (MLP) are not a robust solution for three reasons: each input node takes a single value, so the input layer becomes very large (the number of pixels in an image multiplied by 3 for color); the number of weights becomes unmanageable and the chance of overfitting increases; and they are not translation invariant (you lose spatial information, i.e. where an object is located in an image).
Convolutional Neural Networks (CNNs), on the other hand, address these issues. CNNs analyze pixels in groups with their neighbors by sliding filters (or convolving filters) across the pixels of an image. Each filter’s purpose can be to detect various patterns within images. For example, one filter can contribute to detecting eyes in a facial recognition model; another may be responsible for detecting a nose or a mouth. Each filter essentially executes an operation on pixel data and indicates how strongly a particular feature appears in an image, where it is located and its frequency. This process reduces the number of parameters the CNN must learn as compared to an MLP, and does not lose spatial information.
Filters change in response to training and therefore begin with arbitrary values. Essentially, what is being trained are these filters, which are responsible for identifying unique features for each image or image category. A feature map is generated for each image and each filter and passed to an activation function at the node, which determines if a feature is present in a given location. This process is repeated across multiple layers throughout the CNN. You can view the following resources for more in-depth information on CNNs: Lecture describing concept behind CNN layers & filters and CNN article on Medium.
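To make the idea of sliding a filter across an image concrete, here is a minimal base R sketch. It is not how keras implements convolution internally; the function name conv2d_valid, the toy 6 x 6 image and the hand-written vertical edge filter are all assumptions made for illustration. Like most CNN libraries, it does not flip the filter before multiplying.

```r
# Slide a filter over every position where it fully fits inside the image
# ("valid" padding) and record the sum of element-wise products.
conv2d_valid <- function(img, kernel) {
  kh <- nrow(kernel)
  kw <- ncol(kernel)
  out <- matrix(0, nrow(img) - kh + 1, ncol(img) - kw + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- img[i:(i + kh - 1), j:(j + kw - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# Toy grayscale image: dark left half (0), bright right half (1)
img <- matrix(0, 6, 6)
img[, 4:6] <- 1

# Hand-written filter that responds to dark-to-bright vertical edges
edge_filter <- matrix(c(-1, 0, 1), nrow = 3, ncol = 3, byrow = TRUE)

feature_map <- conv2d_valid(img, edge_filter)
print(feature_map)
```

The feature map is largest in the columns where the filter straddles the edge and zero elsewhere, which is the sense in which a filter reports both how strongly a feature appears and where. In a trained CNN the filter values are learned rather than written by hand.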
In this project, the keras package is used to construct the model. Keras is a high-level deep learning library that lets us build and train a CNN to recognize images that fall into one of three categories. Specifically, the functions that I will be using rely on the tensorflow library. TensorFlow represents data as tensors (multi-dimensional arrays), which makes it easy to define the structure of the neural network that I will be using. TensorFlow is actually a Python package; the keras package for R uses Miniconda to access these powerful Python tools from R. For more information refer to the Keras Documentation and Guide to Sequential Models in R.
In this project, seven sample images in each of three categories are imported from the internet and used to create the training and testing data.
We can import images located in the current directory using the readImage() function from the EBImage library. Since we have multiple images, we can save them all in a list.
library(EBImage) #Load Library
#Save Image names in a vector
pics <- c("moto1.jpg","moto2.jpg","moto3.jpg","moto4.jpg","moto5.jpg", "moto6.jpg","moto7.jpg", "car1.jpg", "car2.jpg", "car3.jpg", "car4.jpg", "car5.jpg", "car6.jpg","car7.jpg", "bike1.jpg", "bike2.jpg", "bike3.jpg", "bike4.jpg", "bike5.jpg", "bike6.jpg", "bike7.jpg")
#Create list to save each image
mypic <- list()
#Load files into list using a for loop
for(x in seq_along(pics)){
  mypic[[x]] <- readImage(pics[x])
}
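After the images are loaded into a list, we can split our training and testing data. We know that our images are organized in the list mypic in the same order as the file names above: images 1 to 7 are motorcycles, images 8 to 14 are cars, and images 15 to 21 are bikes.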
Given this organization, we are going to select the first 5 images in each category as our training data set and the last 2 as our testing set. This gives a train/test split of roughly 71/29, which falls between the 70/30 and 80/20 splits that are commonly used when training CNN models. To do this we simply create separate lists by iterating across the list of images.
#training set of 15 images (5 per category)
trainX <- vector("list", 15)
#select first 5 images of each group
for(x in 1:5){
  trainX[[x]] <- mypic[[x]] #motorcycles
  trainX[[x+5]] <- mypic[[x+7]] #cars
  trainX[[x+10]] <- mypic[[x+14]] #bikes
}
#test set of 6 images (2 per category)
testX <- vector("list", 6)
#select last two images in each group
for(x in 1:2){
  testX[[x]] <- mypic[[x+5]] #motorcycles
  testX[[x+2]] <- mypic[[x+12]] #cars
  testX[[x+4]] <- mypic[[x+19]] #bikes
}
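The index arithmetic in the loops above is easy to get wrong, so it can help to compute the positions separately and check them. The following base R sketch does not load any images; the variable names (n_per_class, train_idx, test_idx and so on) are hypothetical and only mirror the layout of 7 images per category described earlier.

```r
# Assumed layout: 3 categories stored back to back, 7 images each
n_per_class <- 7
n_train_per_class <- 5
n_classes <- 3

# Offset of each category inside the combined list: 0, 7, 14
offsets <- (seq_len(n_classes) - 1) * n_per_class

# The first 5 positions of every category go to the training set
train_idx <- as.vector(sapply(offsets, function(o) o + seq_len(n_train_per_class)))

# Every remaining position goes to the test set
test_idx <- setdiff(seq_len(n_per_class * n_classes), train_idx)

print(train_idx)
print(test_idx)

# Share of the data used for training
print(length(train_idx) / (n_per_class * n_classes))
```

With the images loaded, mypic[train_idx] and mypic[test_idx] should select the same images as the two loops, in the same order.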
Let's take a look at the structure of our training and testing data. For demonstration purposes I have only included the structure of the test data set.
#Inspect the structure for resizing
#str(trainX)
str(testX)
## List of 6
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:728, 1:485, 1:3] 0.796 0.796 0.796 0.796 0.796 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:800, 1:533, 1:3] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:480, 1:300, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:2000, 1:1421, 1:3] 0.455 0.435 0.427 0.431 0.435 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:1000, 1:663, 1:3] 0.702 0.706 0.714 0.718 0.722 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:800, 1:533, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..@ colormode: int 2
We can see that these images are 3-dimensional but have varying widths and heights. Looking at the .Data slot in the structure of our list of images, we can see that the first two numbers (width and height) are different for each image, and that all images are in color (the “3” is the number of channels in RGB format, which indicates a color image).
To prepare the data for training and testing we want to keep all image dimensions consistent. To do this we resize all images so that they are 166 by 166 pixels using resize() from the EBImage library. I chose 166 by 166 because it was smaller than the smallest height of all the images in the imported data. Images are also given equal width and height (square dimensions) because we want to end up with square matrices.
#Resizing Images (Train)
for(x in 1:length(trainX)){
trainX[[x]] <- resize(trainX[[x]], 166, 166)
}
#Resizing Images (test)
for(x in 1:length(testX)){
testX[[x]] <- resize(testX[[x]], 166, 166)
}
#Display new dimensions
#str(trainX)
str(testX)
## List of 6
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:166, 1:166, 1:3] 0.796 0.796 0.796 0.796 0.796 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:166, 1:166, 1:3] 0 0 0 0 0 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:166, 1:166, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:166, 1:166, 1:3] 0.415 0.43 0.43 0.431 0.426 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:166, 1:166, 1:3] 0.731 0.715 0.724 0.72 0.731 ...
## .. ..@ colormode: int 2
## $ :Formal class 'Image' [package "EBImage"] with 2 slots
## .. ..@ .Data : num [1:166, 1:166, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..@ colormode: int 2
#Display Resized Images (train)
trainX <- combine(trainX)
dis <- tile(trainX, 5)
display(dis, title = "Pictures")
#Display Resized Images (test)
testX <- combine(testX)
dis2 <- tile(testX, 2)
display(dis2, title = "Pictures")
We can see from the display above that we have 15 images in our training set and 6 images in our testing set, all with the same dimensions.
Before we can train and test our model, we need to rearrange the image data into the dimension order needed for input into the CNN. Looking at the structure, we see that instead of a list of images we now have one array with dimensions (166 x 166 x 3 x 6). The additional 6 comes from the 6 images we combined above in our test set (for the training set the 4th dimension is 15, which corresponds to the number of images in the training set). We can reorder the dimensions to (number of images, width, height, color) using the aperm() base R function.
#Display Before permutation
#str(testX)
str(testX)
## Formal class 'Image' [package "EBImage"] with 2 slots
## ..@ .Data : num [1:166, 1:166, 1:3, 1:6] 0.796 0.796 0.796 0.796 0.796 ...
## ..@ colormode: int 2
#permute the dimensions
testX <- aperm(testX, c(4,1,2,3))
trainX <- aperm(trainX, c(4, 1, 2, 3))
#Display the change
#str(testX)
str(testX)
## num [1:6, 1:166, 1:166, 1:3] 0.796 0 1 0.415 0.731 ...
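To see what the permutation c(4, 1, 2, 3) does without loading any images, here is a small base R sketch on a toy array. The array sizes (4 x 5 x 3 x 6) and the names a and b are made up for illustration; only aperm() itself is taken from the code above.

```r
# Toy stand-in for the combined image array:
# width 4, height 5, 3 color channels, 6 images
a <- array(seq_len(4 * 5 * 3 * 6), dim = c(4, 5, 3, 6))

# Move the image index (the 4th dimension) to the front
b <- aperm(a, c(4, 1, 2, 3))

print(dim(a))
print(dim(b))

# The same pixel of the same image, addressed in both layouts
print(a[2, 3, 1, 6] == b[6, 2, 3, 1])
```

The values themselves are untouched; only the order in which the indices are written changes, so image i is now found at b[i, , , ].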
To create the labels for our data we gave each category a number between 0 and 2:
#Response Variable for the three categories
trainY <- c(rep(0, 5), rep(1, 5), rep(2,5))
testY <- c(0,0,1,1,2,2)
One-hot encoding is a method used to label data sets that have multiple categories where order does not matter. The labels above indicate that a motorcycle is category 0, a car is 1, and a bike is 2; however, the model we are using may interpret these categories in an ordinal way. The fact that 0 comes before 1, which comes before 2 on a number line, could affect the model’s ability to predict categories. One-hot encoding transforms the labels into a binary matrix where 0 means that the particular image is not in a category and 1 means that it is. To do this we use the to_categorical() function from the keras library.
library(keras)
library(kableExtra) #library for displaying tables
#One Hot Encoding
trainLabels <- to_categorical(trainY)
testLabels <- to_categorical(testY)
#Display matrix
kable(testLabels) %>%
kable_styling() %>%
scroll_box(width = "100%", height = "200px")
1 | 0 | 0 |
1 | 0 | 0 |
0 | 1 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
0 | 0 | 1 |
We can see that in the matrix above of test labels, the first two images are of motorcycles, the next two are of cars and the last two are of bikes. The same encoding was also done for training labels.
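For readers who want to see what the encoding does without installing keras, the following base R sketch builds the same kind of binary matrix by hand. The helper name one_hot is hypothetical and is not part of keras; it assumes the labels are integers starting at 0, as in trainY and testY.

```r
# Hand-rolled one-hot encoding for integer labels 0, 1, 2, ...
one_hot <- function(y, num_classes = max(y) + 1) {
  m <- matrix(0, nrow = length(y), ncol = num_classes)
  # Row i gets a 1 in column y[i] + 1 (R indexes from 1)
  m[cbind(seq_along(y), y + 1)] <- 1
  m
}

testY <- c(0, 0, 1, 1, 2, 2)
print(one_hot(testY))
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.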
In 2014, a CNN architecture called VGG16 was among the top entries in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), placing first in localization and second in classification, and it is considered to be one of the top neural network architectures for image classification. The model implemented here is based on the VGG16 architecture as seen in the figure below.
In this model, layers are abbreviated; there are five different types of layers used: convolutional, max pooling, dropout, flattening and fully connected (dense).
Activation functions
A Rectified Linear Unit (ReLU) is used as the activation function in the convolutional layers and in the first fully connected layer. A softmax activation function is used for the final fully connected layer to produce a probability distribution as output, as per the VGG16 architecture.
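As a small illustration of the two activation functions named above, here is a base R sketch. The function names relu and softmax and the example scores are assumptions for this sketch; keras applies its own implementations inside the layers.

```r
# ReLU keeps positive values and replaces negative values with 0
relu <- function(x) pmax(x, 0)

# Softmax turns a vector of raw scores into probabilities that sum to 1.
# Subtracting max(z) first avoids overflow and does not change the result.
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

scores <- c(2.0, -1.0, 0.5)  # made-up raw outputs for three categories

print(relu(scores))

probs <- softmax(scores)
print(probs)
print(sum(probs))
```

The category with the largest raw score also gets the largest probability, which is why the softmax output can be read directly as the model's confidence in each category.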
Layer One - Input Layer
Convolutional layer with 32 3x3 filters, ReLU activation function and input dimensions of 166 x 166 x 3.
Layer Two
Convolutional layer with 32 3x3 filters and ReLU activation function.
Layer Three
Max pooling layer with a 2 x 2 pool size
Layer Four
Dropout layer with rate of 25%
Layers Five & Six
Two convolutional layers, each with 64 3x3 filters and a ReLU activation function.
Layer Seven
Max pooling layer with a 2 x 2 pool size
Layer Eight
Dropout layer with rate of 25%
Layer Nine
Flattening layer transforms matrix into vector for fully connected layer
Layer Ten
Fully connected layer with 256 neurons and ReLU activation function
Layer Eleven
Dropout layer with rate of 25%
Layer Twelve - Output Layer
Fully connected layer with 3 neurons (because we have 3 categories) and a softmax activation function for probability output.
Using the compile() function we can configure the CNN and specify the following parameters:
Loss Function - categorical_crossentropy is used because each image can only belong to one category. See this source for more information.
Optimizer - optimizer_sgd() is a stochastic gradient descent optimizer, which is a common choice for computer vision problems (see this source for more information). This function takes four hyperparameters that can be changed to optimize the model:
lr indicates the learning rate, which is set to 0.001. This was found to work best for this data (a learning rate of 0.01 caused a wide range of accuracies from run to run). See this source for more information.
decay indicates the adjustment made to the learning rate after each iteration. This is set to \(10^{-6}\) (written as 1e-6 in the code). See this source for more information.
momentum indicates how much of the moving average of our gradients is carried into each update; it is set to 0.9 (a standard setting). See this source for more information.
nesterov indicates the type of momentum function, that is, whether Nesterov momentum is used. See this source for more information.
Metrics - here we specify that we want the model evaluated for accuracy of categorization.
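To show how the four optimizer hyperparameters interact, here is a minimal base R sketch of one common formulation of stochastic gradient descent with momentum, Nesterov momentum and learning rate decay. It is an illustration under assumptions, not the keras source code: the function name sgd_minimize, the toy objective f(w) = (w - 3)^2 and the number of steps are all invented for this example, and the exact update used by a given keras version may differ.

```r
# Gradient descent with momentum on a single weight w.
# lr, momentum, decay and nesterov play the roles described above.
sgd_minimize <- function(grad, w0, lr = 0.001, momentum = 0.9,
                         decay = 1e-6, nesterov = TRUE, steps = 2000) {
  w <- w0
  v <- 0  # velocity: a decaying sum of past gradient steps
  for (t in seq_len(steps)) {
    lr_t <- lr / (1 + decay * (t - 1))  # learning rate shrinks slowly over time
    g <- grad(w)
    v <- momentum * v - lr_t * g
    if (nesterov) {
      # Nesterov variant: apply the velocity once more, as a look-ahead
      w <- w + momentum * v - lr_t * g
    } else {
      w <- w + v
    }
  }
  w
}

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
grad_f <- function(w) 2 * (w - 3)

w_final <- sgd_minimize(grad_f, w0 = 0)
print(w_final)
```

Calling sgd_minimize(grad_f, w0 = 0, nesterov = FALSE) runs the plain momentum variant, which makes it easy to compare the two on the same problem.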
#Model with a linear stack of layers
model <- keras_model_sequential()
#Layers within the model(as listed above)
model %>%
layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu', input_shape = c(166, 166, 3)) %>%
layer_conv_2d(filters = 32, kernel_size = c(3,3) , activation = 'relu') %>%
layer_max_pooling_2d(pool_size = c(2,2)) %>%
layer_dropout(rate = 0.25) %>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu')%>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
layer_max_pooling_2d(pool_size = c(2,2)) %>%
layer_dropout(rate = 0.25) %>%
layer_flatten() %>%
layer_dense(units = 256, activation = 'relu') %>%
layer_dropout(rate = 0.25) %>%
layer_dense(units = 3, activation = "softmax") %>%
compile(loss = "categorical_crossentropy",
#newer keras releases name the lr argument learning_rate
optimizer = optimizer_sgd(lr = 0.001, momentum = 0.9, decay = 1e-6, nesterov = TRUE),
metrics = c('accuracy'))
#View the model
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## conv2d (Conv2D) (None, 164, 164, 32) 896
## ________________________________________________________________________________
## conv2d_1 (Conv2D) (None, 162, 162, 32) 9248
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D) (None, 81, 81, 32) 0
## ________________________________________________________________________________
## dropout (Dropout) (None, 81, 81, 32) 0
## ________________________________________________________________________________
## conv2d_2 (Conv2D) (None, 79, 79, 64) 18496
## ________________________________________________________________________________
## conv2d_3 (Conv2D) (None, 77, 77, 64) 36928
## ________________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D) (None, 38, 38, 64) 0
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 38, 38, 64) 0
## ________________________________________________________________________________
## flatten (Flatten) (None, 92416) 0
## ________________________________________________________________________________
## dense (Dense) (None, 256) 23658752
## ________________________________________________________________________________
## dropout_2 (Dropout) (None, 256) 0
## ________________________________________________________________________________
## dense_1 (Dense) (None, 3) 771
## ================================================================================
## Total params: 23,725,091
## Trainable params: 23,725,091
## Non-trainable params: 0
## ________________________________________________________________________________
From the above summary we can see each layer in our model and the number of parameters that each layer introduces. We can see that with each layer the shape of our output changes until we have an output with 3 units. In total we see that we have 23,725,091 parameters.
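The output shapes and parameter counts in the summary can be reproduced by hand, which is a useful check on how the layers fit together. The base R sketch below assumes what the model code implies: 3 x 3 filters with no padding and a stride of 1, 2 x 2 pooling that drops any leftover row or column, and one bias per filter or neuron. The helper names are hypothetical.

```r
# Spatial size after a k x k convolution with no padding and stride 1
conv_out <- function(n, k = 3) n - k + 1
# Spatial size after p x p pooling (integer division drops the remainder)
pool_out <- function(n, p = 2) n %/% p
# Weights plus one bias per filter
conv_params <- function(k, c_in, c_out) (k * k * c_in + 1) * c_out
# Weights plus one bias per neuron
dense_params <- function(n_in, n_out) (n_in + 1) * n_out

n <- 166
n <- conv_out(n)    # 164
n <- conv_out(n)    # 162
n <- pool_out(n)    # 81
n <- conv_out(n)    # 79
n <- conv_out(n)    # 77
n <- pool_out(n)    # 38
flat <- n * n * 64  # length of the flattened vector

params <- c(
  conv2d   = conv_params(3, 3, 32),
  conv2d_1 = conv_params(3, 32, 32),
  conv2d_2 = conv_params(3, 32, 64),
  conv2d_3 = conv_params(3, 64, 64),
  dense    = dense_params(flat, 256),
  dense_1  = dense_params(256, 3)
)

print(flat)
print(params)
print(sum(params))
```

Almost all of the parameters sit in the first fully connected layer, because it connects every one of the 92,416 flattened values to each of its 256 neurons.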
We train the model using the fit() function from keras with the training data and training labels. Some hyperparameters that are defined are:
epochs = 60 indicates the number of times the data gets passed through the CNN for optimization (this can be changed to optimize the model)
batch_size = 32 indicates the size of the subset of the data set that will be passed through the CNN at one time
validation_split = 0.2 indicates the fraction of the training data used for validation (source)
validation_data = list(testX, testLabels) uses the test data as validation data (when this argument is added, the validation split will be ignored)
For more information about the significance of these hyperparameters view this source.
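One detail of validation_split is worth checking on a data set this small. According to the Keras documentation, the validation samples are taken from the end of the training data, before any shuffling. Because the training labels here are ordered by category, the base R sketch below works out which labels would be held out. It only does the index arithmetic and does not call keras; the variable names split_at, fit_idx and val_idx are hypothetical.

```r
# Training labels in the order used in this walkthrough
trainY <- c(rep(0, 5), rep(1, 5), rep(2, 5))
validation_split <- 0.2

n <- length(trainY)
# Number of samples kept for fitting (small tolerance guards against rounding)
split_at <- floor(n * (1 - validation_split) + 1e-9)

fit_idx <- seq_len(split_at)    # samples used to update the weights
val_idx <- seq(split_at + 1, n) # samples held out for validation

print(table(trainY[fit_idx]))
print(table(trainY[val_idx]))
```

Under this assumption the three held-out images are all bikes, so the model would only see two bike images while fitting. Shuffling the images and labels together before calling fit(), or passing validation_data explicitly, are two ways to avoid this.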
#Fit the model to the training set
history <- model %>%
fit(trainX, trainLabels, epochs = 60, batch_size = 32, validation_split = 0.2)
#Plot the epochs
plot(history)
From the plot above we can see that with each epoch the loss decreases and the accuracy increases.
Now that we have trained our model we can adjust hyperparameters and other factors to optimize the accuracy.
We can use evaluate() to calculate loss and accuracy. This particular model had to be run more than once to get the highest accuracy. On the third try, this model reached 93.333% accuracy on the training set with a loss of 0.3016.
# Loss/Accuracy
evTrain <- model %>%
evaluate(trainX, trainLabels)
We can call predict_classes()
to use the model to make a prediction of the training set and create a confusion matrix.
#make a prediction of the classes
pred <- model %>%
predict_classes(trainX)
#Create the confusion matrix
table(Predicted = pred, Actual = trainY)
## Actual
## Predicted 0 1 2
## 0 5 0 0
## 1 0 5 1
## 2 0 0 4
We can further evaluate the model by looking at the probability the model assigned to each image for each category. The first column indicates the probability for motorcycle, the second for car, and the third for bicycle. The last two columns compare the predicted and actual categories for each image.
#calculate the probabilities of each category (train)
prob <- model %>%
predict_proba(trainX)
cbind(prob, Predicted_class = pred, Actual = trainY)
## Predicted_class Actual
## [1,] 9.999985e-01 5.645261e-07 9.784324e-07 0 0
## [2,] 9.999856e-01 1.294885e-05 1.470450e-06 0 0
## [3,] 9.999607e-01 2.929269e-05 1.003435e-05 0 0
## [4,] 9.964097e-01 3.076105e-03 5.141911e-04 0 0
## [5,] 9.994032e-01 1.490755e-04 4.477720e-04 0 0
## [6,] 3.117176e-04 9.996881e-01 1.587317e-07 1 1
## [7,] 1.058510e-07 9.999214e-01 7.847859e-05 1 1
## [8,] 2.270291e-04 9.996426e-01 1.303163e-04 1 1
## [9,] 1.607486e-06 9.999963e-01 2.103938e-06 1 1
## [10,] 4.785108e-04 9.994412e-01 8.026355e-05 1 1
## [11,] 3.016197e-03 8.402288e-04 9.961436e-01 2 2
## [12,] 4.595066e-03 5.544942e-04 9.948505e-01 2 2
## [13,] 1.502816e-02 4.483731e-02 9.401345e-01 2 2
## [14,] 2.514178e-01 1.940434e-01 5.545388e-01 2 2
## [15,] 1.141389e-01 7.238862e-01 1.619749e-01 1 2
We can see from the above probability matrix that the model is having issues predicting images with bicycles correctly, and we can use this to inform adjustments made to the model.
Repeating the steps above for the test data, we can perform a similar analysis.
evTest <- model %>%
evaluate(testX, testLabels)
evTest
## $loss
## [1] 1.47326
##
## $accuracy
## [1] 0.8333333
predTest <- model %>%
predict_classes(testX)
#Create the confusion matrix
table(Predicted = predTest, Actual = testY)
## Actual
## Predicted 0 1 2
## 0 2 1 0
## 1 0 1 0
## 2 0 0 2
probTest <- model %>%
predict_proba(testX)
cbind(probTest, Predicted_class = predTest, Actual = testY)
## Predicted_class Actual
## [1,] 0.9985579 1.050861e-03 0.0003912439 0 0
## [2,] 0.7808689 1.174439e-02 0.2073866576 0 0
## [3,] 0.9991227 6.932112e-04 0.0001841404 0 1
## [4,] 0.3921849 5.946720e-01 0.0131430719 1 1
## [5,] 0.2320748 2.602178e-01 0.5077074170 2 2
## [6,] 0.1121178 7.501818e-05 0.8878071904 2 2
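The predicted class is simply the column with the highest probability, so the class predictions, confusion matrix and accuracy can be recomputed from the probability matrix alone. The base R sketch below copies the test probabilities printed above into a matrix by hand. This is also a possible fallback if predict_classes() and predict_proba() are not available, which to my knowledge is the case in newer keras releases.

```r
# Test probabilities copied from the output above
# (columns: motorcycle, car, bicycle)
probTest <- rbind(
  c(0.9985579, 1.050861e-03, 0.0003912439),
  c(0.7808689, 1.174439e-02, 0.2073866576),
  c(0.9991227, 6.932112e-04, 0.0001841404),
  c(0.3921849, 5.946720e-01, 0.0131430719),
  c(0.2320748, 2.602178e-01, 0.5077074170),
  c(0.1121178, 7.501818e-05, 0.8878071904)
)
testY <- c(0, 0, 1, 1, 2, 2)

# Column with the highest probability, shifted so that classes start at 0
predTest <- apply(probTest, 1, which.max) - 1

print(table(Predicted = predTest, Actual = testY))
print(mean(predTest == testY))
```

The third image is a car that the model assigns to the motorcycle category with high confidence, which accounts for the single error in the test set.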
For the test data our accuracy decreased, possibly due to variations in images and backgrounds and the fact that 5 training images and 2 test images per group is far too small a sample. In addition, accuracies fluctuate every time the model runs.
library(ggplot2)
ggplot() +
geom_col(aes(x = c("Training", "Testing"), y = c(evTrain$accuracy, evTest$accuracy)), fill = c("pink", "purple")) +
geom_text(aes(x = c("Training", "Testing"), y = c(evTrain$accuracy + 0.1 , evTest$accuracy + 0.1), label = c(round(evTrain$accuracy, 2), round(evTest$accuracy, 2)))) +
labs(y = "Accuracy", x ="Data", title = "Accuracy of Train & Test Data ") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "top")
As shown in this project, implementing a CNN for image classification of arbitrary images of cars, bikes and motorcycles is possible with a few lines of code. Keras provides a simple way to implement multiple types of CNN architectures and facilitates easy fine-tuning of hyperparameters so that models can be easily optimized.
As we can see from the difference in accuracy between the training and test data sets, this model does well at predicting the training set but is not as accurate on the test data set. Further research can be conducted to optimize accuracy and minimize loss.
To build on this model and achieve a higher accuracy we can do a few things:
Use a pre-trained model (keras has a vgg16 function for this reason) to increase accuracy, which may also take care of the issue of the small image sample size
During the creation of this R Markdown file the accuracy of my model fluctuated every time I ran the model. Accuracies ranged from 30% to 93% when the learning rate was set to 0.01. When I decreased the learning rate to 0.001 the loss decreased drastically and the model improved as well. This is because, at the higher learning rate, the model was trading optimization for a faster training time. When the learning rate was decreased, the model took longer to train but the accuracy did not fluctuate every time I ran the program. According to this source, “At extremes, a learning rate that is too large will result in weight updates that will be too large and the performance of the model (such as its loss on the training dataset) will oscillate over training epochs. Oscillating performance is said to be caused by weights that diverge (are divergent). A learning rate that is too small may never converge or may get stuck on a suboptimal solution.” Click for more information about improving accuracy in CNN models. This would explain why the accuracy had such a wide range when the learning rate was set to 0.01 as opposed to 0.001.
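The oscillation described in the quote can be reproduced on a much simpler problem than a CNN. The base R sketch below runs plain gradient descent on the toy function f(w) = w^2 with two learning rates. The function name gd_path and both learning rates are chosen only for illustration and are not the values used for the model.

```r
# Plain gradient descent on f(w) = w^2, whose gradient is 2 * w.
# Returns every value of w visited, starting from w0.
gd_path <- function(lr, w0 = 1, steps = 10) {
  w <- numeric(steps + 1)
  w[1] <- w0
  for (t in seq_len(steps)) {
    w[t + 1] <- w[t] - lr * 2 * w[t]
  }
  w
}

small <- gd_path(lr = 0.1)  # each step shrinks w towards the minimum at 0
large <- gd_path(lr = 1.1)  # each step overshoots the minimum

print(round(small, 4))
print(round(large, 4))
```

With the smaller rate, w moves steadily towards 0. With the larger rate, w flips sign on every step and grows in size, which is the oscillating, divergent behaviour the quote describes.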
Although using neural networks to classify images and objects within images is an effective way of abstracting features for thousands of categories, it is far from perfect. As we can already see from this simple walkthrough, the features “discovered” and identified by the model are only as good as the data that is provided to it. For example, if the next model of car looks more like an airplane than a car (let's say with wings, jet engines or something else that is super ridiculous), then this model will not be able to identify it correctly. This brings up the bigger issue of implicit bias in algorithms associated with image classification (or object classification for that matter) in the field of Computer Vision.
Joy Buolamwini talks about algorithmic bias in this talk. She talks about how some algorithms are not able to detect certain objects due to the data used to train the models and the algorithm architecture itself. Understanding how these algorithms work and how the data used to train these models is chosen is the first step towards decreasing algorithmic bias.