Improving Computer Vision Accuracy using Convolutions: An R Version (co-authors: Ian, Sam, Eric)

A high-level overview of Convolution and Pooling

A convolution is a filter that passes over an image, processes it, and extracts features that show a commonality in the image, such that if an image has certain features, it belongs to a particular class. At its heart, convolution is really simple. It involves scanning every pixel in the image, looking at its neighboring pixels, multiplying these pixels by their corresponding weights in a filter, and then summing the results to obtain a new pixel value. This can be shown as below:

## [1] Image source: https://colab.research.google.com/github/lmoroney/mlday-tokyo/blob/master/Lab3-What-Are-Convolutions.ipynb#scrollTo=xF0FPplsgHNh

The previous Dense Neural Network that we created simply learned from the raw pixels what made up a sweater or what made up a boot. This in itself is quite a limitation.

Ultimately, the goal of understanding what an item is isn’t just to match raw pixels to labels like we did in the previous exercises.

What if we could extract features from the image instead, so that when an image has some specific features, it belongs to a particular class? This is at the heart of what Convolutional Neural Networks do.

This key characteristic gives convnets two interesting properties:

  • The patterns they learn are translation-invariant: This means that after learning a certain pattern, a convnet can recognize it anywhere else in the image, as opposed to a DNN, which would have to learn the pattern anew if it appeared at a new location. For this reason, convnets require fewer training samples.

  • They can learn spatial hierarchies of patterns: A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layer, and so on. For this reason, convnets can learn increasingly complex and abstract features.

If you’ve ever done image processing using a filter (like this: https://en.wikipedia.org/wiki/Kernel_(image_processing)) then convolutions will look very familiar.

In short, you take an array (usually 3x3 or 5x5) and pass it over the image. By changing the underlying pixels based on the formula within that matrix, you can do things like edge detection. So, for example, if you look at the above link, you’ll see a 3x3 kernel defined for edge detection where the middle cell is 8, and all of its neighbors are -1. In this case, for each pixel, you would multiply its value by 8, then subtract the value of each neighbor. Do this for every pixel, and you’ll end up with a new image that has the edges enhanced.
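The edge-detection idea above can be tried in a few lines of base R. This is only a minimal sketch: the function name `convolve_2d` and the toy image are made up for illustration, the kernel is assumed to be square, and pixels on the border are simply skipped.

```r
# Slide a square kernel over a grayscale image (a numeric matrix) and, at each
# position, sum the element-wise products of the kernel and the pixels under it.
convolve_2d <- function(image, kernel) {
  k <- nrow(kernel)                      # kernel is assumed to be k x k
  out_rows <- nrow(image) - k + 1
  out_cols <- ncol(image) - k + 1
  out <- matrix(0, nrow = out_rows, ncol = out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      patch <- image[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# Edge-detection kernel: 8 in the middle, -1 for every neighbor
edge_kernel <- matrix(-1, nrow = 3, ncol = 3)
edge_kernel[2, 2] <- 8

# Toy 5x6 image: dark left half (0), bright right half (1)
toy_image <- cbind(matrix(0, 5, 3), matrix(1, 5, 3))

edges <- convolve_2d(toy_image, edge_kernel)
edges[1, ]
# 0 -3 3 0
```

Flat regions come out as 0, while the two columns on either side of the dark-to-bright boundary get non-zero values, which is the "edges enhanced" effect described above.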

That’s the concept of Convolutional Neural Networks. Add some layers to do convolution before you have the dense layers, and then the information going to the dense layers is more focussed, and possibly more accurate.

Pooling reduces the amount of irrelevant information in an image while maintaining the features that are detected. It does so by looking at a pixel and its immediate neighbours to the right, beneath and right-beneath, taking the largest of the four values (hence the name Max pooling), and loading it into a new image. It thus reduces the amount of information that a model has to process while still maintaining the prominent features.

## [1] Image source: https://colab.research.google.com/github/lmoroney/mlday-tokyo/blob/master/Lab3-What-Are-Convolutions.ipynb#scrollTo=xF0FPplsgHNh
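Max pooling as described above can also be sketched in base R. The helper name `max_pool_2x2` and the toy matrix are illustrative only, and the sketch assumes non-overlapping 2x2 blocks, with any odd trailing row or column dropped.

```r
# Keep the largest value from every non-overlapping 2x2 block of pixels.
max_pool_2x2 <- function(image) {
  out_rows <- nrow(image) %/% 2
  out_cols <- ncol(image) %/% 2
  out <- matrix(0, nrow = out_rows, ncol = out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      rows <- (2 * i - 1):(2 * i)
      cols <- (2 * j - 1):(2 * j)
      out[i, j] <- max(image[rows, cols])
    }
  }
  out
}

# Toy 4x4 image holding the values 1..16, filled row by row
toy_image <- matrix(1:16, nrow = 4, byrow = TRUE)

pooled <- max_pool_2x2(toy_image)
pooled
# a 2x2 matrix with rows (6, 8) and (14, 16)
```

Sixteen pixels go in and four come out, each one the maximum of its block.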

Building Convolutional Neural Networks

Gathering the Data

Let’s start by loading the libraries required for this session.

We’ll need some packages from the Tidyverse, along with Keras (a framework for defining a neural network as a set of sequential layers). You can install them as follows:

suppressMessages(install.packages("tidyverse"))
suppressMessages(install.packages("keras"))
suppressMessages(keras::install_keras())

P.S. It could take a while.

Once installed, let’s get rolling:

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Reshaping our image arrays and normalizing them

## [1] 60000    28    28     1

Why are we adding one more dimension? That’s an important question. Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has 3 color channels: red, green, and blue. For a black-and-white picture, like those in the Fashion MNIST dataset, the depth is 1.
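To make the extra axis concrete, here is a small base R sketch. It does not load Fashion MNIST: `fake_images` is a made-up stand-in of 5 random 28x28 grayscale images, and dividing by 255 assumes pixel values in the range 0 to 255.

```r
# Stand-in for the training images: 5 grayscale images of 28x28 pixels
set.seed(42)
fake_images <- array(sample(0:255, 5 * 28 * 28, replace = TRUE),
                     dim = c(5, 28, 28))

# Append a depth (channel) axis of size 1 and scale the pixels to [0, 1]
reshaped <- array(fake_images, dim = c(dim(fake_images), 1)) / 255

dim(reshaped)
# 5 28 28 1
```

The pixel values themselves are unchanged apart from the scaling; only the shape gains a trailing axis of size 1.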

Instantiating a Convolution

Next is to define your model. Now instead of the input layer at the top, we’re going to add a Convolution. The parameters are:

  1. The number of convolutions you want to generate. Purely arbitrary, but good to start with something in the order of 32

  2. The size of the Convolution, in this case a 3x3 grid

  3. The activation function to use – in this case we’ll use relu, which you might recall is the equivalent of returning x when x>0, else returning 0
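As a quick reminder of what relu does, here is a one-line base R version. The name `relu` is just a local helper defined for this illustration, not a function taken from Keras.

```r
# relu: return x when x > 0, else 0 (applied element-wise)
relu <- function(x) pmax(x, 0)

relu(c(-2, -0.5, 0, 1.5, 3))
# 0.0 0.0 0.0 1.5 3.0
```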

We’ll follow the Convolution with a MaxPooling layer, which is designed to compress the image while maintaining the content of the features that were highlighted by the convolution. By specifying (2,2) for the MaxPooling, the effect is to quarter the size of the image. Without going into too much detail here, the idea is that it takes a 2x2 array of pixels and picks the biggest one, thus turning 4 pixels into 1. It repeats this across the image, and in so doing halves the number of horizontal pixels and halves the number of vertical pixels, effectively reducing the image to 25% of its original size. These concepts are clearly explained in Episode 3.

Adding a classifier to the convnet

Convolutional layers learn the features and pass these to the dense layers which map the learned features to the given labels. Therefore, the next step is to feed the last output tensor into a densely connected classifier network like those we’re already familiar with: a stack of dense layers. These classifiers process vectors, which are 1D, whereas the current output is a 3D tensor. First we have to flatten the 3D outputs to 1D, and then add a few dense layers on top.
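Putting the pieces together, a model definition along these lines might look like the sketch below. The filter counts (6 and 16) and dense sizes (128 and 10) are read off the model summary printed later in this post; the relu and softmax activations on the dense layers and the function name `build_convnet` are assumptions, not something the post confirms.

```r
# Sketch of a convnet consistent with the layer summary in this post.
# Calls are namespaced with keras:: so the function can be defined without
# attaching the package; calling it needs keras and a TensorFlow backend.
build_convnet <- function() {
  model <- keras::keras_model_sequential()
  # First convolution + pooling: 28x28x1 -> 26x26x6 -> 13x13x6
  model <- keras::layer_conv_2d(
    model, filters = 6, kernel_size = c(3, 3),
    activation = "relu", input_shape = c(28, 28, 1)
  )
  model <- keras::layer_max_pooling_2d(model, pool_size = c(2, 2))
  # Second convolution + pooling: 13x13x6 -> 11x11x16 -> 5x5x16
  model <- keras::layer_conv_2d(
    model, filters = 16, kernel_size = c(3, 3), activation = "relu"
  )
  model <- keras::layer_max_pooling_2d(model, pool_size = c(2, 2))
  # Flatten the 3D feature maps to a 1D vector, then classify
  model <- keras::layer_flatten(model)
  model <- keras::layer_dense(model, units = 128, activation = "relu")   # activation assumed
  model <- keras::layer_dense(model, units = 10, activation = "softmax") # activation assumed
  model
}
```

With Keras installed, `model <- build_convnet()` followed by `summary(model)` should list the same layer types in the same order.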

Compile: Configuring a Keras model for training

## Model: "sequential"
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## conv2d (Conv2D)                  (None, 26, 26, 6)             60          
## ___________________________________________________________________________
## max_pooling2d (MaxPooling2D)     (None, 13, 13, 6)             0           
## ___________________________________________________________________________
## conv2d_1 (Conv2D)                (None, 11, 11, 16)            880         
## ___________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D)   (None, 5, 5, 16)              0           
## ___________________________________________________________________________
## flatten (Flatten)                (None, 400)                   0           
## ___________________________________________________________________________
## dense (Dense)                    (None, 128)                   51328       
## ___________________________________________________________________________
## dense_1 (Dense)                  (None, 10)                    1290        
## ===========================================================================
## Total params: 53,558
## Trainable params: 53,558
## Non-trainable params: 0
## ___________________________________________________________________________

Looking at the summary, we might expect the output of the first convolution to be 28x28, but we obtain 26x26. If you have watched the episode, you probably know why. A 3 by 3 filter requiring a neighbour on all sides can’t work on the pixels around the edges of the picture. You effectively have to remove one pixel from the top, bottom, left and right, and this reduces your dimension by 2 on each axis. So a 28 by 28 becomes a 26 by 26. Also, the pooling layer clearly halves the dimensions of each axis (rounding down, so 11 becomes 5).
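The numbers in the summary can be reproduced with a little arithmetic. The helper names below are made up for this sketch. It assumes no padding, a stride of 1 for the convolutions, and one bias term per filter or dense unit.

```r
# Output size along one axis
conv_output_size <- function(input_size, kernel_size) input_size - kernel_size + 1
pool_output_size <- function(input_size, pool_size) input_size %/% pool_size

# Parameter counts: weights plus one bias per filter / unit
conv_params  <- function(kernel_size, in_channels, filters) {
  (kernel_size^2 * in_channels + 1) * filters
}
dense_params <- function(inputs, units) (inputs + 1) * units

after_conv1 <- conv_output_size(28, 3)           # 26
after_pool1 <- pool_output_size(after_conv1, 2)  # 13
after_conv2 <- conv_output_size(after_pool1, 3)  # 11
after_pool2 <- pool_output_size(after_conv2, 2)  # 5
flattened   <- after_pool2^2 * 16                # 400

total_params <- conv_params(3, 1, 6) +           # 60
  conv_params(3, 6, 16) +                        # 880
  dense_params(flattened, 128) +                 # 51328
  dense_params(128, 10)                          # 1290
total_params
# 53558
```

Each intermediate value lines up with a row of the summary, and the total matches the reported 53,558 parameters.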

Displaying the first 25 images from the testing set:

Visualizing every filter output in each convolution and pooling layer

For this step, I took a little detour from what was done in the Python Notebook but it illustrates the same concept: how the convolutions apply different filters to extract features from our input images.

##  [1]  1 18 27 35 37 42 61 65 71 76

Let’s extract the outputs of the first 4 layers (the convolution and pooling layers):

## [1]  1 26 26  6
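One common way to get at these intermediate outputs is to build a second model that shares the trained model's input but returns the outputs of its first few layers. The sketch below is an assumption about how this could be done, not necessarily the code used for this post; `extract_activations`, `model` and `image` are illustrative names, and `image` is expected to have shape 1 x 28 x 28 x 1.

```r
# Return a list with one activation array per layer, for a single input image.
# Calling this needs keras and a TensorFlow backend.
extract_activations <- function(model, image, n_layers = 4) {
  # Collect the symbolic outputs of the first n_layers layers
  layer_outputs <- lapply(model$layers[seq_len(n_layers)],
                          function(layer) layer$output)
  # A model with the same input that returns every one of those outputs
  activation_model <- keras::keras_model(inputs = model$input,
                                         outputs = layer_outputs)
  predict(activation_model, image)
}
```

For the network summarised above, the first element of the returned list would have dimensions 1 x 26 x 26 x 6: one image, a 26x26 feature map, and 6 filters.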

Next we define a function that will help us visualise the result of each filter in each of the layer activations above.
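A plotting helper of this kind could be sketched in base R as follows. The name `plot_channels` and the random `fake_activation` array are illustrative; the array only mimics the 1 x 26 x 26 x 6 shape of a first-layer activation. This sketch writes a PDF to a temporary directory so that it can run anywhere.

```r
# Draw every channel (filter output) of one layer activation side by side
# and save the result to a PDF file.
plot_channels <- function(activation, file) {
  n_channels <- dim(activation)[4]
  pdf(file, width = 2 * n_channels, height = 2)
  on.exit(dev.off())
  par(mfrow = c(1, n_channels), mar = rep(0.5, 4))
  for (ch in seq_len(n_channels)) {
    channel <- activation[1, , , ch]
    # Reverse the rows and transpose so the matrix is drawn upright
    image(t(apply(channel, 2, rev)), axes = FALSE, col = gray.colors(256))
  }
  invisible(n_channels)
}

# Stand-in activation with the shape of the first convolution's output
set.seed(1)
fake_activation <- array(runif(26 * 26 * 6), dim = c(1, 26, 26, 6))

out_file <- file.path(tempdir(), "conv1_channels.pdf")
n_plotted <- plot_channels(fake_activation, out_file)
```

Swapping `fake_activation` for a real layer activation gives one panel per filter, laid out in a row.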

Visualizing the convolutions and pooling on our first test image

The code below saves the images in your current directory.

The output of each filter in each layer is as shown row-wise.

Transformation by second pooling layer

Great! Now there are a few things to note:

  • The first layer acts as a collection of various filters. At that stage, the output of the layer seems to retain almost all of the information present in the initial picture.

  • As you go higher, the outputs of the layers become increasingly abstract and less visually interpretable. They begin to encode higher-level concepts such as the “heel” and “vamp” of the shoe. Higher representations carry increasingly less information about the visual contents of the image, and increasingly more information related to the class of the image.

Yes! We did it!! 🤩 We successfully implemented convolutions and pooling to improve the accuracy of computer vision using R. Indeed, R is at its core a beautiful and elegant language, well designed for Data Science 💖.

For the guys at the back:

Reference Material

