Improving Computer Vision Accuracy using Convolutions: An R Version (co-authors: Ian, Sam, Eric)
A high-level overview of Convolution and Pooling
A convolution is a filter that passes over an image, processing it and extracting features that show a commonality in the image, such that if an image has certain features, it belongs to a particular class. At its heart, convolution is really simple. It involves scanning every pixel in the image, looking at its neighboring pixels, multiplying these pixels by their corresponding weights in a filter, and then summing it all up to obtain a new pixel value. This can be shown as below:

## [1] Image source: https://colab.research.google.com/github/lmoroney/mlday-tokyo/blob/master/Lab3-What-Are-Convolutions.ipynb#scrollTo=xF0FPplsgHNh
The previous Dense Neural Network that we created simply learned from the raw pixels what made up a sweater or what made up a boot. This in itself is quite a limitation.
Ultimately, understanding what an item is involves more than matching raw pixels to labels, as we did in the previous exercises.
What if we could instead extract features from the image, so that when an image has certain features, it belongs to a particular class? This is at the heart of what Convolutional Neural Networks do.
This key characteristic gives convnets two interesting properties:
The patterns they learn are translation-invariant: This means that after learning a certain pattern, a convnet can recognize it anywhere else in the image, as opposed to a DNN, which would have to learn the pattern anew if it appeared at a new location. For this reason, convnets require fewer training samples.
They can learn spatial hierarchies of patterns: A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layer, and so on. For this reason, convnets can learn increasingly complex and abstract features.
If you’ve ever done image processing using a filter (like this: https://en.wikipedia.org/wiki/Kernel_(image_processing)) then convolutions will look very familiar.
In short, you take an array (usually 3x3 or 5x5) and pass it over the image. By changing the underlying pixels based on the formula within that matrix, you can do things like edge detection. So, for example, if you look at the above link, you’ll see a 3x3 that is defined for edge detection where the middle cell is 8, and all of its neighbors are -1. In this case, for each pixel, you would multiply its value by 8, then subtract the value of each neighbor. Do this for every pixel, and you’ll end up with a new image that has the edges enhanced.
That’s the concept of Convolutional Neural Networks. Add some layers to do convolution before you have the dense layers, and then the information going to the dense layers is more focussed, and possibly more accurate.
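The edge-detection filter described above can be sketched in a few lines of base R. This is only an illustration: the function name `convolve_valid` and the toy 6x6 image are made up for this example, and no padding is applied, so the output is 2 pixels smaller on each axis. As in deep learning libraries, the kernel is not flipped, which makes no difference for this symmetric kernel.

```r
# Slide a square kernel over a matrix (no padding), multiply each
# neighbourhood element-wise by the kernel and sum the result.
convolve_valid <- function(img, kernel) {
  k <- nrow(kernel)
  out <- matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- img[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# Edge-detection kernel: 8 in the middle, -1 for every neighbour
edge_kernel <- matrix(-1, 3, 3)
edge_kernel[2, 2] <- 8

# Toy 6x6 "image" (hypothetical): dark left half (0), bright right half (1)
toy_image <- cbind(matrix(0, 6, 3), matrix(1, 6, 3))

edges <- convolve_valid(toy_image, edge_kernel)
print(edges)
# every row is 0 -3 3 0: non-zero only around the vertical edge
```

Flat regions sum to zero because the kernel weights add up to 0, so only the boundary between the dark and bright halves produces a response.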
Pooling reduces the amount of irrelevant information in an image while maintaining the features that are detected. It does so by looking at a pixel and its immediate neighbours to the right, beneath and right-beneath, taking the largest of the four (hence the name max pooling), and loading it into a new image. It thus reduces the amount of information that a model has to process while still maintaining the prominent features.

## [1] Image source: https://colab.research.google.com/github/lmoroney/mlday-tokyo/blob/master/Lab3-What-Are-Convolutions.ipynb#scrollTo=xF0FPplsgHNh
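To make the pooling step concrete, here is a minimal base R sketch of 2x2 max pooling. The function name `max_pool_2x2` and the 4x4 matrix are invented for illustration; a trailing odd row or column is simply dropped.

```r
# 2x2 max pooling: keep the largest value of each non-overlapping 2x2 block
max_pool_2x2 <- function(img) {
  out_rows <- nrow(img) %/% 2
  out_cols <- ncol(img) %/% 2
  out <- matrix(0, out_rows, out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      rows <- (2 * i - 1):(2 * i)
      cols <- (2 * j - 1):(2 * j)
      out[i, j] <- max(img[rows, cols])
    }
  }
  out
}

# Hypothetical 4x4 input
toy <- matrix(c(1, 5, 2, 0,
                3, 4, 1, 1,
                0, 2, 9, 6,
                1, 1, 7, 8), nrow = 4, byrow = TRUE)

pooled <- max_pool_2x2(toy)
print(pooled)
# 5 2
# 2 9
```

Sixteen values go in and four come out, but the largest response in each block survives.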
Building Convolutional Neural Networks
Gathering the Data
Let’s start by loading the libraries required for this session.
We’ll need some packages from the Tidyverse and Keras (a framework for defining a neural network as a set of sequential layers). You can install them as follows:
suppressMessages(install.packages("tidyverse"))
suppressMessages(install.packages("keras"))
# install_keras() lives in the keras package, so load it first
library(keras)
suppressMessages(install_keras())
PS: this could take a while.
Once installed, let’s get rolling:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Reshaping our image arrays and normalizing them
training_images <- array_reshape(training_images, c(60000, 28, 28, 1))
training_images <- training_images/255
test_images <- array_reshape(test_images, c(10000, 28, 28, 1))
test_images <- test_images/255
# the values 60,000 and 10,000 are not arbitrary, we obtained them using
dim(training_images)
## [1] 60000 28 28 1
Why are we adding one more dimension? That’s an important question. Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has 3 color channels: red, green, and blue. For a grayscale picture, like those in the Fashion MNIST dataset, the depth is 1.
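The effect of adding the depth axis can be checked on a small made-up array in base R. The sketch below uses `dim<-` rather than `array_reshape()`. For this particular case (appending a trailing axis of size 1) the pixel values end up in the same positions either way. `fake_images` is a hypothetical stand-in for the real data.

```r
set.seed(42)

# Hypothetical stand-in for a batch of 5 grayscale 28x28 images
fake_images <- array(sample(0:255, 5 * 28 * 28, replace = TRUE),
                     dim = c(5, 28, 28))

# Append a depth axis of size 1; the pixel values are untouched
with_channel <- fake_images
dim(with_channel) <- c(5, 28, 28, 1)

# Scale pixel values from [0, 255] to [0, 1]
with_channel <- with_channel / 255

print(dim(with_channel))
# [1]  5 28 28  1

# The second image is the same before and after (up to the scaling)
same_pixels <- all(with_channel[2, , , 1] == fake_images[2, , ] / 255)
print(same_pixels)
```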
Instantiating a Convolution
Next, define your model. Now, instead of the input layer at the top, we’re going to add a Convolution. The parameters are:
The number of convolutions you want to generate. Purely arbitrary, but good to start with something in the order of 32 (the small model below uses 6 and 16)
The size of the Convolution, in this case a 3x3 grid
The activation function to use, in this case relu, which you might recall is the equivalent of returning x when x>0, else returning 0
We’ll follow the Convolution with a MaxPooling layer, which is designed to compress the image while maintaining the content of the features that were highlighted by the convolution. By specifying (2,2) for the MaxPooling, the effect is to quarter the size of the image. Without going into too much detail here, the idea is that it takes each 2x2 array of pixels and picks the biggest one, thus turning 4 pixels into 1. It repeats this across the image, and in so doing halves the number of horizontal pixels and halves the number of vertical pixels, effectively reducing the image to 25% of its original size. These concepts are clearly explained in Episode 3.
model <- keras_model_sequential() %>%
  # first convolution layer: 6 filters of size 3x3
  # the input has a depth of 1 since convolutions operate over 3D tensors
  layer_conv_2d(input_shape = c(28, 28, 1), filters = 6, kernel_size = c(3, 3), activation = 'relu') %>%
  # max pooling layer, which halves each spatial dimension
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  # second convolution layer, which filters the results
  # from the previous layer
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = 'relu') %>%
  # second max pooling layer
  layer_max_pooling_2d(pool_size = c(2, 2))
Adding a classifier to the convnet
Convolutional layers learn the features and pass these to the dense layers which map the learned features to the given labels. Therefore, the next step is to feed the last output tensor into a densely connected classifier network like those we’re already familiar with: a stack of dense layers. These classifiers process vectors, which are 1D, whereas the current output is a 3D tensor. First we have to flatten the 3D outputs to 1D, and then add a few dense layers on top.
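What flattening and a dense layer do to the data can be sketched in base R without Keras. This is a sketch under stated assumptions, not the model’s actual code. `feature_maps` is a random stand-in for the 5 x 5 x 16 output of the last pooling layer, the weights are random rather than learned, and ReLU is used only as an example activation. The sizes 400 and 128 match the model summary. Keras flattens in row-major order while `as.vector()` is column-major, so the element order differs but the length does not.

```r
set.seed(1)

# Hypothetical stand-in for the 5 x 5 x 16 output of the last pooling layer
feature_maps <- array(runif(5 * 5 * 16), dim = c(5, 5, 16))

# Flatten: 3D tensor -> 1D vector of length 5 * 5 * 16 = 400
flat <- as.vector(feature_maps)

# A dense layer is a matrix multiply plus a bias, followed by an activation.
# Random weights here; a trained model would have learned them.
weights <- matrix(rnorm(400 * 128, sd = 0.05), nrow = 400, ncol = 128)
bias <- rep(0, 128)
hidden <- pmax(as.vector(flat %*% weights) + bias, 0)  # ReLU

print(length(flat))    # 400
print(length(hidden))  # 128
```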
Compile: Configuring a Keras model for training
## Model: "sequential"
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## conv2d (Conv2D)                  (None, 26, 26, 6)             60
## ___________________________________________________________________________
## max_pooling2d (MaxPooling2D)     (None, 13, 13, 6)             0
## ___________________________________________________________________________
## conv2d_1 (Conv2D)                (None, 11, 11, 16)            880
## ___________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D)   (None, 5, 5, 16)              0
## ___________________________________________________________________________
## flatten (Flatten)                (None, 400)                   0
## ___________________________________________________________________________
## dense (Dense)                    (None, 128)                   51328
## ___________________________________________________________________________
## dense_1 (Dense)                  (None, 10)                    1290
## ===========================================================================
## Total params: 53,558
## Trainable params: 53,558
## Non-trainable params: 0
## ___________________________________________________________________________
Looking at the summary, we might expect the output of the first convolution to be 28x28, but we obtain 26x26. If you have watched the episode, you probably know why. A 3 by 3 filter requires a neighbour on every side, so it can’t be centred on the pixels along the edges of the picture. You effectively have to remove one pixel from the top, bottom, left and right, and this reduces each axis by 2. So a 28 by 28 becomes a 26 by 26. Also, the pooling layer halves each axis, rounding down when the size is odd (11 becomes 5).
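The shapes and parameter counts in the summary can be reproduced with a little arithmetic. The helper names below are made up for this sketch. The formulas assume 3x3 kernels with no padding, a stride of 1, 2x2 pooling, and one bias term per filter or unit.

```r
# Output size along one axis
conv_out <- function(size, kernel = 3) size - kernel + 1
pool_out <- function(size, pool = 2) size %/% pool

# Parameter counts: weights plus one bias per filter / unit
conv_params <- function(kernel, in_channels, filters) {
  (kernel * kernel * in_channels + 1) * filters
}
dense_params <- function(inputs, units) (inputs + 1) * units

s1 <- conv_out(28)        # 26
s2 <- pool_out(s1)        # 13
s3 <- conv_out(s2)        # 11
s4 <- pool_out(s3)        # 5 (11 / 2 rounded down)
flat_len <- s4 * s4 * 16  # 400

params <- c(
  conv2d   = conv_params(3, 1, 6),         # 60
  conv2d_1 = conv_params(3, 6, 16),        # 880
  dense    = dense_params(flat_len, 128),  # 51328
  dense_1  = dense_params(128, 10)         # 1290
)

print(c(s1, s2, s3, s4, flat_len))
print(params)
print(sum(params))  # 53558
```

The total matches the 53,558 trainable parameters reported in the summary.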
Displaying the first 25 images from the testing set:
par(mfrow = c(5, 5))
par(mar = c(0, 0, 1.5, 0), xaxs = 'i', yaxs = 'i')
# a 5 x 5 grid holds exactly 25 images
for (i in 1:25) {
  img <- test_images[i, , , 1]
  # rotate so the image is drawn upright
  img <- t(apply(img, 2, rev))
  image(1:28, 1:28, img, col = gray((0:255) / 255), xaxt = 'n',
        yaxt = 'n')
}


Visualizing every filter output in each convolution and pooling layer
For this step, I took a little detour from what was done in the Python Notebook but it illustrates the same concept: how the convolutions apply different filters to extract features from our input images.
## [1] 1 18 27 35 37 42 61 65 71 76

Let’s extract the outputs of the first 4 layers (the convolution and pooling layers):
## [1] 1 26 26 6
Next we define a function that will help us visualise the result of each filter in each of the layer activations above.
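The definition of that helper is not shown here. One plausible version, written in base R graphics, is sketched below. The name `plot_channel` matches the calls in the loops that follow, but the body (the rotation and the `terrain.colors` palette) is an assumption, not necessarily what was used to produce the saved images.

```r
# Draw one channel (a 2D matrix of activations) as an image.
# Assumed implementation; colours and layout are illustrative choices.
plot_channel <- function(channel) {
  # image() draws from the bottom-left corner, so rotate to show it upright
  rotate <- function(x) t(apply(x, 2, rev))
  image(rotate(channel), axes = FALSE, asp = 1, col = terrain.colors(12))
}

# Quick check on a random 26 x 26 matrix standing in for a real activation
demo_channel <- matrix(runif(26 * 26), nrow = 26, ncol = 26)
plot_channel(demo_channel)
```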
Visualizing the convolutions and pooling on our first test image
The code below saves the images in your current directory.
images_per_row <- 3
# layers 1-2: first convolution and pooling layers (6 filters each)
for (i in 1:2) {
  layer_activation <- activations[[i]]
  layer_name <- model$layers[[i]]$name
  n_features <- dim(layer_activation)[[4]]
  n_cols <- n_features %/% images_per_row
  png(paste0("mnist_seven_activations_", i, "_", layer_name, ".png"),
      width = 500,
      height = 500)
  op <- par(mfrow = c(n_cols, images_per_row), mai = rep_len(0.02, 4))
  for (col in 0:(n_cols-1)) {
    for (row in 0:(images_per_row-1)) {
      channel_image <- layer_activation[1,,,(col*images_per_row) + row + 1]
      plot_channel(channel_image)
    }
  }
  par(op)
  dev.off()
}
images_per_row <- 4
# layers 3-4: second convolution and pooling layers (16 filters each)
for (i in 3:4) {
  layer_activation <- activations[[i]]
  layer_name <- model$layers[[i]]$name
  n_features <- dim(layer_activation)[[4]]
  n_cols <- n_features %/% images_per_row
  png(paste0("mnist_seven_activations_", i, "_", layer_name, ".png"),
      width = 500,
      height = 500)
  op <- par(mfrow = c(n_cols, images_per_row), mai = rep_len(0.02, 4))
  for (col in 0:(n_cols-1)) {
    for (row in 0:(images_per_row-1)) {
      channel_image <- layer_activation[1,,,(col*images_per_row) + row + 1]
      plot_channel(channel_image)
    }
  }
  par(op)
  dev.off()
}
The output of each filter in each layer is as shown row-wise.
How to make a raster plot of the pixel values we just saved before

Truly, Eric, Ian, Sam
