Deep Learning Computer Vision Cheat Sheet

Author


require(reticulate)
require(knitr)

1 Convolutional Neural Networks (CNN)

1.1 CNN Intuition

Convolutional Neural Networks (CNNs) are a class of neural network inspired by the structure and function of the human visual system. They are capable of learning from large amounts of data and are used in computer vision tasks such as image classification, object detection, and image segmentation.

CNN Flow

Usage of CNN

Computers read images as pixel values. A black-and-white (B/W) image has a single channel of pixel values from 0 to 255, while an RGB image has three channels of pixel values from 0 to 255. In either case, the computer reads the image as a matrix of pixel values.

Images Read by Computer

Here, we assume that 0 is white and 1 is black.

Example of Computer Read Image

1.2 Step 1 - Convolution Operation

The first step in a CNN is the convolution operation. The convolution operation is a mathematical operation that is used to extract features from an image.

The convolution formula is: \[ (f \ast g)(t):=\int_{-\infty}^{\infty} f(\tau) g(t-\tau) d \tau \] Where:

  • \(f\) is the input image
  • \(g\) is the filter
  • \(t\) is the position in the output feature map

The convolution operation works by:

  1. Taking an input image.
  2. Applying a feature detector (also called a filter or kernel) to the image. The filter is a small matrix of weights that is applied to the image to extract features such as edges, corners, and textures.
  3. Producing a feature map or convolved feature or activation map.
  4. The filter slides across the image by a certain number of pixels, called the stride; in this case the stride is 1.
  5. The calculation in the feature map works as follows: wherever a 1 in the filter lines up with a 1 in the image we count a match, and everything else contributes nothing. We sum the matches at each position to fill in the feature map. This is the essence of the convolution operation.
  6. The resulting feature map is a reduced version of the input image. When we look at an image, we don't look at the pixel level; we look at features (edges, corners, textures). The filter is used to extract these features from the image.

0 match

1 match

4 matches

Convolution Result
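A minimal sketch of this sliding-window operation, assuming numpy is available; the image and filter values below are illustrative, not the exact ones from the figures.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# 0 = white, 1 = black, as assumed above; the pattern is made up for illustration.
image = np.array([[1, 0, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0, 0],
                  [0, 0, 1, 1, 0, 0, 0],
                  [0, 0, 1, 1, 0, 0, 0],
                  [0, 1, 0, 0, 1, 0, 0],
                  [1, 0, 0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0, 1]])
kernel = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])                 # a diagonal-line feature detector
print(convolve2d(image, kernel, stride=1))     # 5x5 feature map of match counts
```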

Examples of different feature detectors:

Original Image

Sharpen

Blur

Edge Enhance

Edge Detect (The Most Important for CNN)

Emboss

The network decides which combination of feature detectors to use, and the resulting features might not be visible or meaningful to the human eye. The filters are learned by the network during the training process.

In summary, the primary purpose of the convolution step is to find features in the image using feature detectors and put them into feature maps, which preserve the spatial relationships between the pixels of the image. Most of the time, the features a CNN detects and uses to recognize certain images and classes will mean nothing to humans, but they nevertheless work for the computer.

So far, the process has reached this point: After Convolution

1.3 Step 1(b) - ReLU Layer

ReLU is an additional step on top of the convolution. The purpose of applying ReLU is to increase the non-linearity in the network. An image itself is highly non-linear, since it packs varied information (elements, borders, colors, etc.) into one frame, and when we run the convolution to create feature maps we risk creating something linear. That is why we apply ReLU: to increase the non-linearity, or break up the linearity, of the feature maps. For example:

Original Image

This is the original image

Image After Feature Detector

A feature detector can itself contain negative values, so after it strides over an image the resulting feature map can contain both negative and positive values. The black areas are negative and the white areas are positive. This is a common outcome after applying a feature detector.

Image After ReLU Function

The ReLU function will convert the negative area to 0 and keep the positive area as it is.

Image After Feature Detector Explained

After the feature detector is applied, we can see that the highlighted area has a "linear" color gradation from white through several shades of grey to black. This is a linear gradation, and we need to break it up.

Image After ReLU Function Explained

After the ReLU function is applied, we can see that the highlighted area has a "non-linear" color gradation, jumping from grey to black directly without intermediate shades, only abrupt changes. This non-linear gradation is what we want.

\[ \phi(x) = \max(0, x) \] Where:

  • \(x\) is the sum of the input and weights
  • \(\phi(x)\) is the output of the activation function
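A minimal sketch of applying ReLU element-wise to a feature map, assuming numpy; the values are illustrative.

```python
import numpy as np

feature_map = np.array([[-2.0, 1.5, -0.4],
                        [ 0.5, -0.3, 2.0]])   # negatives appear where the filter's negative weights dominate
relu_map = np.maximum(0, feature_map)          # phi(x) = max(0, x), applied element-wise
print(relu_map)                                # negatives become 0, positives are kept
```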

So far, the process has reached this point:

After ReLU

1.4 Step 2 - Max Pooling

There are several types of pooling: max pooling, average pooling, sum pooling, etc. The most common is max pooling. The purpose of pooling is to reduce the size of the feature map and to extract the most important features from it. The pooling operation works by:

  1. Taking a feature map.
  2. Applying a pooling operation to the feature map.
  3. Producing a pooled feature map.

Imagine these images of a cheetah

Image of One Cheetah

This image shows one and the same cheetah three times: normal, rotated, and squeezed. The network would read these as three different images, and we use them because we want the network to be able to recognize the cheetah in any form, i.e. to learn that the cheetah is the same in all three images.

Image of 2 Cheetah

This image shows six different cheetahs. All the photos are normal, but the cheetahs are in different positions: looking at different angles, posed differently, at different sizes, and with different textures. If the network is looking for a distinctive feature of a cheetah, in this case the black "tear" pattern running from the eye to the mouth, it will have a hard time recognizing the other cheetahs, because that feature may sit in a different position in each of the six images.

To solve this, we need to ensure the neural network has a property called spatial invariance, so that the network has some flexibility and does not care whether the feature is slightly tilted, rotated, squeezed, closer, or further away.

Example of how pooling works:

Max Pooling Step 1

Here we have a feature map that we are going to apply pooling to. In this example we are using 2x2 max pooling, where we select the maximum value found inside each 2x2 window.

Max Pooling Step 2

Using a stride of 2, we scan the feature map to create the pooled feature map.

Max Pooling Step 3

It does not matter if the window passes the edge; just ignore the missing pixels and continue to the next stride.

Max Pooling Step 4

With max pooling, this area will result in the maximum value of 4.

Max Pooling Step 5

This is the result after max pooling. We can see that the size has been reduced to only 25% (1 out of 4 pixels), but the most important features of the feature map are still retained.

Original Cheetah Tears Position

Now it does not matter where the cheetah's "tears" are positioned. Say the "tears" are originally in this area of the image.

Tilted Cheetah Tears Position

The image is then slightly rotated, so the tears move to this area of the image. However, since we are using max pooling, the resulting value in the pooled feature map would still be 4. This way we account for possible spatial or textural changes in the image.

After max pooling is applied, we are:

  1. Preserving the features
  2. Introducing spatial invariance to the network
  3. Reducing the size of the original feature map by 75% (keeping 1 out of 4 pixels)
  4. By reducing the size, reducing the number of parameters that go to the next / final layers
  5. Preventing overfitting, because we remove unnecessary information from the feature map
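A minimal sketch of 2x2 max pooling with a stride of 2, assuming numpy; partial windows at the edge are simply pooled over whatever pixels remain, as described above. The feature map values are illustrative.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum over each size x size window, moving by `stride`."""
    h, w = feature_map.shape
    out_h = -(-h // stride)                   # ceiling division: edge windows are still pooled
    out_w = -(-w // stride)
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

feature_map = np.array([[0, 1, 0, 0, 2],
                        [1, 4, 2, 1, 0],
                        [0, 0, 1, 2, 1],
                        [1, 0, 2, 4, 1],
                        [0, 1, 0, 0, 0]])
print(max_pool(feature_map))   # 3x3 pooled map; e.g. the top-left 2x2 window keeps the 4
```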

So far, the process has reached this point:

After Max Pooling

1.5 Step 3 - Flattening

After the convolution and max pooling steps, the feature map is flattened into a single column/vector.

Flattening to Single Vector/Column

The purpose of flattening is to convert the feature map into a format that can be used as input to a fully connected layer. So far, the process has reached this point: After Flattening
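A one-line sketch of this step, assuming numpy and an illustrative 2x2 pooled feature map:

```python
import numpy as np

pooled = np.array([[4, 2],
                   [1, 3]])          # illustrative pooled feature map
flattened = pooled.flatten()          # row by row into a single vector: [4 2 1 3]
print(flattened)
```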

1.6 Step 4 - Full Connection

A fully connected layer is a layer of neurons in which each neuron is connected to every neuron in the previous layer. The fully connected layers are used to classify the input image into different classes. They work by:

  1. Taking the flattened feature map.
  2. Applying one or more fully connected layers to the flattened feature map.
  3. Producing an output.

Fully Connected Layer

Fully Connected Layer with Target Example

Imagine the example of a CNN network trying to predict between dog or cat

Network Minimizes the Loss

As with an ANN, the network will try to minimize the loss between the predicted output and the actual output. The network adjusts the weights of the neurons in the fully connected layers to minimize the loss through backpropagation. The only difference is that a CNN will go all the way back to the convolution layers to adjust the weights, whereas an ANN goes back to the input layer.

Process of how CNN learn to differentiate between dog and cat:

Target Example of Dog

Target Example of Cat

Example of a New Dog

Example of a New Cat
  1. The network will learn that the dog has certain features.
  2. The last fully connected layer gets to vote on the importance of the features, or on whether it found them (e.g. 0.9 for eyebrows, 1 for nose, 1 for pointy ears), and fires that to the output layer.
  3. The output layer will compare this with the label and decide that the image is a dog.
  4. The dog output layer now knows that the input from those neurons (3 neurons) represents the features of a dog.
  5. Through multiple iterations (samples and epochs), the output layer learns that the input from those neurons really contributes to the features of a dog and will trust those neurons more. Eventually, the final layer of the fully connected layers is likely to hold lots of features, or combinations of features, that are indeed representative and descriptive of the output class.
  6. On the other hand, the cat output layer will learn that those neurons are not the features of a cat and will trust those neurons less.
  7. That is how features are propagated through the network and conveyed to the output layer.
  8. If the last fully connected layers are not contributing distinctive features to the output layer, the network will backpropagate to adjust the weights of the neurons, starting in the convolution layers.
  9. After we have trained the model, a new input can come in.
  10. The output layer has no idea whether it is a dog or a cat, but it has learned to listen to the neurons that fire up the most for a dog or a cat.
  11. The dog output layer will look at the input from the 3 neurons that it trusts, read that the values from those neurons are high, and decide there is, say, a 95% chance that this is a dog. Similarly, the cat output layer will look at the input from the 3 neurons that it trusts, read that the values from those neurons are low, and decide there is a 5% chance that this is a cat. The output layer concludes that this is a dog.
  12. The same process happens for the cat output layer.

1.7 Softmax & Cross-Entropy

1.7.1 Softmax

Softmax Function

In practice, the output layer won't directly calculate that the probability of dog is 0.95 and the probability of cat is 0.05, which sum to 1. Instead, the output layer will produce something like 0.80 for dog and 0.40 for cat, which does not add up to 1. This is where the softmax function comes in.

The output layer of a CNN is typically a softmax layer: a layer of neurons used to produce a probability distribution over the classes. The softmax layer works by:

  1. Taking the output of the fully connected layer.
  2. Applying the softmax function to that output.
  3. Producing a probability distribution over the classes.

The softmax function is defined as: \[ f_j(z) = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}} \] Where:

  • \(z\) is the output of the fully connected layer
  • \(f_j(z)\) is the output of the softmax function for class \(j\)
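A minimal sketch of the softmax function, assuming numpy; the raw scores 0.80 and 0.40 are the illustrative dog/cat outputs mentioned above.

```python
import numpy as np

def softmax(z):
    """Exponentiate the raw scores and normalize so the results sum to 1."""
    e = np.exp(z - np.max(z))     # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

raw_scores = np.array([0.80, 0.40])   # raw outputs for dog and cat that do not sum to 1
print(softmax(raw_scores))            # approximately [0.599, 0.401], which now sums to 1
```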

1.7.2 Cross-Entropy

In an ANN we speak of a cost function when backpropagating, whereas in a CNN, because we also use the softmax function, it is called a loss function (basically the same thing, just different terminology). The cross-entropy loss function is used to measure the difference between the predicted output and the actual output. It is defined as:

\[ H(p, q) = -\sum_{x} p(x) \log q(x) \] Where:

  • \(p\) is the actual output
  • \(q\) is the predicted output
  • \(H(p, q)\) is the output of the cross-entropy loss function

Assume that we have two NNs predicting dog vs. cat with the following softmax results: Softmax Result

We then calculate the classification error, mean squared error, and cross-entropy of the predictions: Cross Entropy Calculation

We can calculate the MSE as follows:

MSE NN1

Row 1:
- Dog: (0.9 - 1)² = 0.01
- Cat: (0.1 - 0)² = 0.01

Row 2:
- Dog: (0.1 - 0)² = 0.01
- Cat: (0.9 - 1)² = 0.01

Row 3:
- Dog: (0.4 - 1)² = 0.36
- Cat: (0.6 - 0)² = 0.36

Total squared error = 0.01 + 0.01 + 0.01 + 0.01 + 0.36 + 0.36 = 0.76

Considering each row as a single sample with two outputs:
MSE = 0.76 ÷ 3 = 0.2533 ≈ 0.25

We can calculate the cross-entropy as follows:

Cross-Entropy NN1

Row 1:
- Dog: -(1) * log(0.9) = 0.1054
- Cat: -(0) * log(0.1) = 0

Row 2:
- Dog: -(0) * log(0.1) = 0
- Cat: -(1) * log(0.9) = 0.1054

Row 3:
- Dog: -(1) * log(0.4) = 0.9163
- Cat: -(0) * log(0.6) = 0

Total cross-entropy = 0.1054 + 0 + 0 + 0.1054 + 0.9163 + 0 = 1.1271

Considering each row as a single sample with two outputs:
Cross-entropy = 1.1271 / 3 = 0.3757 ≈ 0.38
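A short sketch reproducing the calculations above, assuming numpy; NN1's predictions and the one-hot labels are taken from the table.

```python
import numpy as np

q = np.array([[0.9, 0.1],
              [0.1, 0.9],
              [0.4, 0.6]])     # NN1's predicted probabilities (dog, cat) per row
p = np.array([[1, 0],
              [0, 1],
              [1, 0]])          # actual one-hot labels

mse = ((q - p) ** 2).sum() / len(q)                 # 0.76 / 3, roughly 0.25
cross_entropy = -(p * np.log(q)).sum() / len(q)      # 1.1271 / 3, roughly 0.38
print(round(mse, 4), round(cross_entropy, 4))
```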

In this case, it is better to use the cross-entropy value instead of the mean squared error or classification error. In the first results of forward propagation, the voting of the neurons is not accurate and the correct class might get a very small probability such as 0.00001 (or 0.99999 for the other class); if the next forward propagation produces 0.001 (or 0.999), the mean squared error or classification error barely changes and it looks like the network is not improving much. The cross-entropy value, however, registers that even this small absolute improvement is a large relative improvement in the right direction, so gradient descent can be adjusted accordingly.

Moreover, cross-entropy is the preferred measure for classification problems because it measures the difference between probability distributions, which is what a CNN classifier outputs, while mean squared error is better suited to regression problems.

1.8 Summary

CNN Summary
  1. We started with an input image.
  2. Apply multiple feature detectors to create feature maps (convolution).
  3. On top of the convolution layer, apply ReLU to increase the non-linearity.
  4. Apply max pooling to the feature maps (producing one pooled map per feature map). Max pooling gives spatial invariance, reduces the image size, extracts the most important features, and prevents overfitting on unnecessary features.
  5. Flatten the pooled layers into a single column/vector.
  6. Feed them into the ANN by applying fully connected layers to classify the image into different classes.
  7. The final layer of the fully connected layers votes on the importance of the features and sends the result to the output layer to decide.
  8. The output layer decides whether the image is a dog or a cat.
  9. After one cycle of forward propagation finishes, the network backpropagates to adjust the weights of the neurons, going all the way back to the convolution layers.
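A minimal sketch of this whole pipeline, assuming TensorFlow/Keras is available from Python; the input size, layer widths, and the two-class dog/cat output are illustrative choices, not a prescribed architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),              # input image (RGB)
    layers.Conv2D(32, (3, 3), activation="relu"), # Step 1 + 1(b): convolution with ReLU
    layers.MaxPooling2D(pool_size=(2, 2)),        # Step 2: max pooling
    layers.Conv2D(64, (3, 3), activation="relu"), # a second convolution/ReLU block
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),                             # Step 3: flattening
    layers.Dense(128, activation="relu"),         # Step 4: full connection
    layers.Dense(2, activation="softmax"),        # output layer: dog vs. cat probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",    # cross-entropy loss (see 1.7)
              metrics=["accuracy"])
model.summary()
```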

2 Face Detection with OpenCV

Like all other ML algorithms, the first step is to preprocess the data. In this case, we need to detect the face before we can do anything else. The process consists of a training phase and a detection phase. We will start with the detection phase; although this is counterintuitive, it is easier to first understand what the algorithm does, and then understand how it is trained.

2.1 Detection

2.1.1 Viola-Jones Algorithm

The algorithm at the foundation of the OpenCV face detector was proposed by Paul Viola and Michael Jones in 2001. It is a machine learning algorithm capable of detecting objects in images and videos, and it is trained on a large number of positive and negative images.

The algorithm first converts the image to greyscale in the background, so in this example we use the greyscale image directly as the input. The algorithm creates a small box inside the image and scans it from left to right and top to bottom. It searches for the features of a face, such as 2 eyebrows, 2 eyes, 1 nose, and 1 mouth, all present within a single box. If they are found, the box turns green; if not, the box keeps moving. The algorithm tries different box sizes and step sizes to find the face. The area where the most boxes overlap is the area with the highest likelihood of being a face.

Original Image

Image in Greyscale

Image with First Box

Image with Second Box

Box Found Features of a Face

Multiple Boxes Found Features of Face

Conclude All the Boxes as a Face

Take the Box Location and Put it Back Into Colored Image

2.1.2 Haar-Like Features

Haar-like features are digital image features used in object recognition. They are named after the Hungarian mathematician Alfred Haar. Haar-like features are used in the Viola-Jones object detection framework (inside the box that scans through the image). The features are simple rectangular filters that are applied to the image as follows:

We use these filters to detect edges, lines, and other simple shapes to achieve the following example:

Original Image

Image Overlayed with Haar Features

Calculation of Haar-like Features

When this filter runs over the image, it calculates the average pixel intensity under the white (W) and black (B) rectangles, where 0 is white and 1 is black, e.g. the W average intensity is 0.166 and the B average intensity is 0.568. We then take the difference of means, B - W = 0.402; this is the value of the feature. The features are used to detect edges, lines, and other simple shapes. A threshold determines whether the value found by each filter can be categorized as a nose, lips, eyes, etc., e.g. if the feature value is > 0.5 then it is a nose.
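A small sketch of this difference-of-means calculation, assuming numpy; the patch values and the rectangle layout (a darker centre strip with lighter sides, like a nose-bridge filter) are made up for illustration.

```python
import numpy as np

# Illustrative 4x6 patch, where 0 is white and 1 is black (so higher values are darker).
patch = np.array([[0.1, 0.2, 0.6, 0.7, 0.1, 0.2],
                  [0.2, 0.1, 0.5, 0.6, 0.2, 0.1],
                  [0.1, 0.3, 0.6, 0.5, 0.2, 0.2],
                  [0.2, 0.1, 0.5, 0.6, 0.1, 0.2]])

white_mean = patch[:, [0, 1, 4, 5]].mean()   # mean intensity under the white rectangles (the sides)
black_mean = patch[:, 2:4].mean()            # mean intensity under the black rectangle (the centre)
feature_value = black_mean - white_mean      # B - W, to be compared against a learned threshold
print(round(feature_value, 3))               # roughly 0.41 for these values
```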

2.1.3 Integral Image

However, calculating with Haar-like features is computationally expensive. The integral image is a technique or hack to speed up the calculation of Haar-like features. The integral image is a 2D array where each element is the sum of all the pixels above and to the left of it.

  1. Imagine we have an image with these intensity values (everything is multiplied by 10 to avoid decimals). Original Image

  2. Normally, if we want to calculate the features, we would need to sum up all the values in this box/area, e.g. 10+4+9+8+0+0… etc. This is computationally expensive, and in real-time computer vision we want the result to be instant. Haar-like Feature Calculation

  3. The integral image works by storing, at each cell, the sum of all the pixels above and to the left of it. For example, 25 is the calculation of 1+2+5+9+9+0, and 134 is the calculation of 1+2+5+7+2+9+ etc.

Sum 1 Part

Sum Another Part

Sum All Part
  1. Then let’s compare the original image with integral image result. Instead of summing up all the pixels in the box,we can use the integral image to calculate it. Haar-like Feature Calculation with Integral Image:
    1. exact bottom right corner
    2. (-) 1 step above top right corner
    3. (+) 1 step above top left corner
    4. (-) **1 step left* bottom left corner

Original with Integral Image

Exact Bottom Right Corner

1 Step Above Top Right Corner

1 Step Above Top Left Corner

1 Step Left Bottom Left Corner

The Remaining is Our Feature Sum

Hence, we got 235 - 83 + 47 - 134 = 65. This is the sum of the box with only 4 calculations. The calculation is much faster than summing up all the pixels in the box.

  5. This is only possible with Haar-like features, since they consist only of rectangles. Although Haar-like features might not be the best choice (rectangles will fail on circles and other shapes), they make up for it with the speed of calculation.
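A short sketch of the 4-lookup box sum described above, assuming numpy; the image values and box coordinates are placeholders, not the ones in the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 10, size=(6, 6))            # illustrative intensity values

# Integral image: each cell holds the sum of everything above and to the left of it (inclusive).
ii = img.cumsum(axis=0).cumsum(axis=1)

# Sum of the box with top-left (r0, c0) and bottom-right (r1, c1) using only 4 lookups.
r0, c0, r1, c1 = 1, 1, 3, 4
box_sum = ii[r1, c1] - ii[r0 - 1, c1] - ii[r1, c0 - 1] + ii[r0 - 1, c0 - 1]
print(box_sum, img[r0:r1 + 1, c0:c1 + 1].sum())   # both values match
```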

2.2 Training Classifiers

The training phase is where the algorithm learns to detect the object. What we will be training are these 5 feature types, so that they learn the descriptive, common features of a face. 5 Features to Train

Training Steps:

  1. Identify the features. We know the bridge of a nose will have a darker center and brighter sides, but the algorithm does not. We need to train the algorithm to learn that.
  2. Set the threshold. Training helps the algorithm learn the thresholds on the range of B and W intensities for a region to be considered a nose, e.g. if the feature value is > 0.5 then it is a nose.

Training Requirement:

  1. Image scaling. The algorithm scales the images to a fixed size of 24x24 pixels (a proxy image) to make the training process faster. This reduces the number of feature sizes to be calculated. Once a feature is found at the 24x24 scale, it is scaled back up to the original image size.
  2. Face and non-face images. The algorithm is trained on a large number of positive images, which contain the object we want to detect (faces), and negative images, which do not contain it (non-face images). The non-face images do not have to be 24x24 pixels; they can be any size, and we can take subwindows (partial crops) from these images and treat each one as an individual image.
    • Positive (face) images help the algorithm understand which features are important. This initially finds features that are good for faces, without a threshold.
    • Negative (non-face) images help the algorithm understand which of the features it found to be good for faces also lead to false positives in non-face images, e.g. a feature found on a face (an eye) that is also found in non-face images (dogs' eyes). Because these images are labelled as negative, the algorithm learns that this feature is not a good feature for detecting faces. This makes the threshold stricter.
  3. Calculate the error. The algorithm calculates the error of the features on the positive and negative images. The error is the difference between the actual value of the feature and the threshold value. The algorithm tries to minimize the error by adjusting the threshold value.
  4. Select the best features. The algorithm selects the features that have the lowest error and uses them to detect the object.
  5. Combine the features. The algorithm combines the features to create a classifier that can detect the object in the image.

2.2.1 Adaptive Boosting (AdaBoost)

We know that we are using the Viola-Jones scanning algorithm with Haar-like features to detect the face, with the help of the integral image, but the number of operations one 24x24-pixel image requires with 5 feature types is still around 180,000+, because the features are scalable.

Adaboost

AdaBoost is one of the hacks. It works by combining several weak classifiers, as an ensemble method, into one strong classifier. A weak classifier is a simple classifier that on its own classifies the data only slightly better than chance. Each weak classifier here is one of the 5 Haar-like feature types at a particular size. Each weak classifier is trained on the training data and its error rate is calculated. The algorithm then selects the weak classifier with the lowest error rate and uses it to classify the data. The process is repeated several times until the error rate is minimized, and the final classifier is a combination of all the weak classifiers. The classifier formula: \[ F(x) = \sum_{t=1}^{T} \alpha_t f_t(x) \] where:

  • \(F(x)\) is the final classifier
  • \(T\) is the number of weak classifiers
  • \(\alpha_t\) is the weight of the weak classifier
  • \(f_t(x)\) is the weak classifier

The formula works as follows: 1. \(f_1\) is the single strongest feature, and it gets the highest weight. 2. \(f_2\) complements the areas where \(f_1\) falls behind/makes false predictions. 3. \(f_3\) complements the areas where \(f_1\) and \(f_2\) fall behind/make false predictions, and so on.

Let’s say we have 5 faces and 5 non-faces, the steps is as follow:

  1. We try one feature. The feature results in 3 correct faces, 2 wrong faces, 3 correct non-faces, and 2 wrong non-faces. The error rate is \(\frac{4}{10} = 0.4\). The classifier weight formula is: \[ \alpha_t = 0.5 \times \ln\left(\frac{1-Error}{Error}\right) \] Hence, the weight is: \[ 0.5 \times \ln\left(\frac{1-0.4}{0.4}\right) \approx 0.2027 \] End of 1st Round

  2. For the 2nd round, the incorrect predictions, i.e. where the feature could not detect the face correctly (the 2 wrong faces and 2 wrong non-faces), are given more weight. The algorithm will try to find a feature that can detect these samples. (Images are enlarged for illustrative purposes.)

Start of 2nd round

End of 2nd round
  3. For the 3rd round, the incorrect predictions, i.e. where the features could not detect the face correctly (the 1 wrong face and 1 wrong non-face), are given more weight. The algorithm will try to find a feature that can detect these samples. (Images are enlarged for illustrative purposes.) Start of 3rd round

  4. In the end, the algorithm combines all the features to create a strong classifier that can detect the object in the image. It will never be ideal, but the point is to keep building this classifier until the error rate is minimized. End of Rounds

    After our classifier reaches a satisfactory error rate, we can skip the rest of the training set; the strong classifier is ready to detect faces in images.
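A generic sketch of one AdaBoost round, assuming numpy; the labels and predictions are invented to reproduce the 0.4 error rate from the example, and the weight-update rule is the standard AdaBoost one rather than anything specific to OpenCV.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # 5 faces (1) and 5 non-faces (0)
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])   # a weak classifier that gets 4 wrong

weights = np.full(len(y_true), 1 / len(y_true))       # equal sample weights to start
error = weights[y_pred != y_true].sum()               # weighted error rate = 0.4
alpha = 0.5 * np.log((1 - error) / error)             # classifier weight, roughly 0.2027

# Misclassified samples are up-weighted for the next round and correct ones down-weighted,
# so the next weak classifier focuses on the faces/non-faces this one got wrong.
weights *= np.exp(np.where(y_pred != y_true, alpha, -alpha))
weights /= weights.sum()
print(round(alpha, 4), weights.round(3))              # wrong samples now weigh 0.125, correct ones 0.083
```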

2.2.2 Cascading Classifiers

Cascading is the second hack in the Viola-Jones algorithm. It works like a chain of if-else conditions: if a subwindow contains the \(f_1\) feature, the algorithm continues to the \(f_2\) feature; if it finds \(f_2\), it continues to \(f_3\); and so on. If any feature is missing, the algorithm stops and moves on to the next subwindow. This reduces the number of calculations required to decide about each subwindow.

Steps:

  1. Divide the image into subwindows. The algorithm will divide the image into subwindows of different sizes and scan each subwindow to detect the object.

  2. Use the \(f_1\) classifier, which usually fits the nose. If there is no nose, reject the subwindow; if there is a nose, move on to the next classifier. In reality, \(f_1\) might not be a single classifier; it could be a batch of 5 or 12 classifiers. 1st Round Cascading

  3. Use the \(f_2\) classifier; if it fails, move to the next subwindow. If it passes, move on to the next classifier. 2nd Round Cascading

  4. Use the \(f_3\) classifier; if it fails, move to the next subwindow. If it passes, move on to the next classifier. 3rd Round Cascading

  5. The example is using this picture and 5 classifiers.

    Example in an Image

    In this subwindow, \(f_1\) could be found in an eyebrow, \(f_2\) could be found in the eyes, and \(f_3\) could be found in an eyebrow, but \(f_4\) cannot be found anywhere, so the subwindow is rejected. The algorithm moves on to the next subwindow.
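A minimal sketch of running OpenCV's pretrained Haar-cascade face detector, which packages the Viola-Jones pipeline described above; it assumes opencv-python is installed, and "my_photo.jpg" is a placeholder path.

```python
import cv2

# Load the pretrained frontal-face Haar cascade that ships with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("my_photo.jpg")                      # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)          # Viola-Jones works on greyscale

# scaleFactor controls how the scanning box grows between passes; minNeighbors is how many
# overlapping detections are needed before a region is concluded to be a face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)   # draw the green box on the colour image
cv2.imwrite("faces_detected.jpg", img)
```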

3 Object Detection with SSD

3.1 Single Shot MultiBox Detector (SSD) Works

3.2 MultiBox Concept

3.3 Predicting the Location of the Object

3.4 The Scale Problem

4 Generative Adversarial Networks (GANs)

4.1 Idea Behind GANs

GANs are a type of neural network used to generate new data. GANs are inspired by the structure and function of the human brain and are capable of learning from large amounts of data. GANs are used in generative tasks such as image generation, text generation, and music generation. They can even create things that have never existed before: GANs learn our objects and then create something new. GANs consist of two components, a Generator and a Discriminator.

The generator generates images, and the discriminator assesses those images and tells the generator whether they look similar to the real images. The generator then adjusts its images to make them more similar to the real images. The discriminator assesses the images again, and the process repeats until the generator can generate images that are indistinguishable from the real images.

When we are creating a GAN, we are creating the generator and discriminator from scratch and both will learn together. Example:

  1. The generator creates a picture of a table
  2. The discriminator assesses the picture by comparing it with real pictures of a table (labelled data)
  3. The discriminator learns for itself what is and is not a table
  4. The discriminator feeds back to the generator that the picture is not a table because of this, this, and this, which make it easily distinguishable

4.1.1 Generative / Generators

The generator generates images. The generator is a neural network that takes random noise as input and produces an image as output. This part is also called a deconvolutional neural network because it works in reverse of a convolutional neural network. Generator

4.1.2 Adversarial / Discriminators

The discriminator assesses the images generated by the generator by comparing them with real images of an object it has learned before. The discriminator is a neural network that takes an image as input and produces a probability value as output. The probability value indicates the likelihood that the image is real or fake.

Discriminator Learn

Discriminator Predict 0

Discriminator Predict 1

4.1.3 Network

The generator and discriminator are trained together in one network. The generator generates images and the discriminator assesses them. The generator then adjusts its images to make them more similar to the real images. The discriminator assesses the images again, and the process is repeated until the generator can generate images that are indistinguishable from the real images.

Generator and Discriminator

Both Visualized as Network

4.2 How GANs work

4.2.1 Step 1: Initial Training of Discriminator and Generator

  1. The generator generates images from random noise. Training 1

We start by feeding random noise to the generator.

  2. The generator creates an image out of the random noise Training 2

  3. Now we are first going to train the discriminator Training 3

  4. In addition to the random-noise images, we also give the discriminator real images of dogs Training 4

  5. The discriminator will assess the images and produce a probability value Training 5

At this initial stage, the probability values will be essentially random because the discriminator has not learned anything yet. So the random-noise images might get 0.3, 0.8, and 0.5, while the real dog images might get 0.9, 0.1, and 0.2.

  6. We know that, in an ideal scenario, the images from the random noise should be valued 0 while the real dog images should be valued 1. Our model is not ideal at first, so it needs to learn from this. Training 6

The error is then calculated by subtracting the ideal value from the output value, e.g. \(0.3 - 0 = 0.3, 0.8 - 0 = 0.8, 0.5 - 0 = 0.5\) and \(0.9 - 1 = -0.1, 0.1 - 1 = -0.9, 0.2 - 1 = -0.8\).

  7. The calculated error is then backpropagated through the discriminator network and the weights of the neurons are adjusted accordingly. Training 7

This is basically the end of the learning process for the discriminator in this pass.

  8. Next we train the generator Training 8

There are two versions of how we train the generator:

1) Ian Goodfellow's paper recommends creating a new noise signal and feeding it to the generator, and

2) this tutorial reuses the same generated images that were used to train the discriminator.

  9. Feed the noise images to the discriminator Training 9

We feed the previously generated random images to the discriminator again but this time without the dog images.

  10. The discriminator will assess the images and produce a probability value Training 10

The discriminator will produce an output, and this time the output will be better than before. The random-noise images will get lower probabilities of 0.1, 0.2, and 0.1, because the discriminator has been trained.

The values here should not be 0, 0, 0, because that would indicate the discriminator was already fully trained, and it would not be fair for the generator to face a "more powerful" discriminator. The values should be low but not 0.

  11. We calculate the error by subtracting the ideal value from the output value, e.g. \(0.1 - 1 = -0.9, 0.2 - 1 = -0.8, 0.1 - 1 = -0.9\). Training 11

This time we subtract 1 because we are now training the generator, and the generator's target is an output of 1.

  12. The calculated error is then backpropagated through the generator network and the weights of the neurons are adjusted accordingly. Training 12

4.2.2 Step 2: Reiteration of the Training Process

  1. The generator creates an image out of the random noise Reiteration 1

Noise gets into the generator and the generator will create images that are less random and clearer than before.

  2. We will train the discriminator again Reiteration 2

  3. We will feed the images from the generator and a different batch of real dog images to the discriminator Reiteration 3

  4. The discriminator will assess the images and produce a probability value Reiteration 4

  5. We calculate the error by subtracting the ideal value from the output value, e.g. \(0.4 - 0 = 0.4, 0.9 - 0 = 0.9, 0.2 - 0 = 0.2\) and \(0.3 - 1 = -0.7, 0.9 - 1 = -0.1, 0.7 - 1 = -0.3\) Reiteration 5

  6. The calculated error is then backpropagated through the discriminator network and the weights of the neurons are adjusted accordingly. Reiteration 6

  7. Next we train the generator Reiteration 7

  8. We feed the previously generated images to the discriminator again, but this time without the dog images. Reiteration 8

  9. The discriminator will assess the images and produce a probability value Reiteration 9

  10. We calculate the error by subtracting the ideal value from the output value, e.g. \(0.5 - 1 = -0.5, 0.2 - 1 = -0.8, 0.1 - 1 = -0.9\) Reiteration 10

  11. The calculated error is then backpropagated through the generator network and the weights of the neurons are adjusted accordingly. Reiteration 11

An example here: the backpropagation process informs the generator that a dog usually has eyes.

4.2.3 Step 3: Final Training of Discriminator and Generator

  1. The generator creates an image out of the random noise Final Training 1

This time the generator generates dogs with eyes, following the update from the discriminator in Step 2, which shows that it is learning from its mistakes.

  2. We will train the discriminator again Final Training 2

  3. We will feed the images from the generator and another different batch of real dog images to the discriminator Final Training 3

  4. The discriminator will assess the images and produce a probability value Final Training 4

It is clear that the probabilities for the real dogs and the fake dogs are now getting closer and closer to converging. This is a sign that the generator is learning to generate images that are gradually becoming indistinguishable from the real images.

  5. We calculate the error by subtracting the ideal value from the output value, e.g. \(0.6 - 0 = 0.6, 0.4 - 0 = 0.4, 0.3 - 0 = 0.3\) and \(0.4 - 1 = -0.6, 0.7 - 1 = -0.3, 0.5 - 1 = -0.5\) Final Training 5

  6. The calculated error is then backpropagated through the discriminator network and the weights of the neurons are adjusted accordingly. Final Training 6

  7. Next we train the generator Final Training 7

  8. We feed the previously generated images to the discriminator again, but this time without the dog images. Final Training 8

  9. The discriminator will assess the images and produce a probability value Final Training 9

We can see that the values are no longer as low as in the previous steps and are moving toward the generator's target of 1.

  10. We calculate the error by subtracting the ideal value from the output value, e.g. \(0.4 - 1 = -0.6, 0.2 - 1 = -0.8, 0.2 - 1 = -0.8\) Final Training 10

  11. The calculated error is then backpropagated through the generator network and the weights of the neurons are adjusted accordingly. Final Training 11
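A minimal sketch of this alternating training loop, assuming PyTorch and a DataLoader named real_loader that yields batches of real images flattened to 784 values; the layer sizes and learning rates are illustrative, not the course's exact setup.

```python
import torch
from torch import nn

latent_dim = 100

generator = nn.Sequential(                        # turns random noise into a flattened 28x28 image
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)
discriminator = nn.Sequential(                    # outputs the probability that an image is real
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for real_images in real_loader:                   # real_loader is an assumed DataLoader of real images
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)            # ideal value 1 for real images
    fake_labels = torch.zeros(batch, 1)           # ideal value 0 for generated images

    # Train the discriminator: real images target 1, generated images target 0.
    noise = torch.randn(batch, latent_dim)
    fake_images = generator(noise).detach()       # don't update the generator in this step
    d_loss = loss_fn(discriminator(real_images), real_labels) + \
             loss_fn(discriminator(fake_images), fake_labels)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator: its images are scored by the discriminator against target 1,
    # so the backpropagated error pushes the generator toward images that look real.
    noise = torch.randn(batch, latent_dim)
    g_loss = loss_fn(discriminator(generator(noise)), real_labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```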

4.3 Application

GANs can be used for:

  1. Generating images GAN Application 1 - Bedroom after 1 epoch

GAN Application 1 - Bedroom after 15 epoch
  2. Image modification GAN Application 2 - Image Modification 1

Subtracting 'neutral woman' from 'smiling woman' and adding 'neutral man' results in 'smiling man'.

GAN Application 2 - Image Modification 2

Subtracting 'man without glasses' from 'man with glasses' and adding 'woman without glasses' results in 'woman with glasses'.

GAN Application 2 - Image Modification 3

This would not work by simply doing normal arithmetic on the pixel values.

  3. Super resolution GAN Application 3 - Super Resolution

  4. Photo-realistic images GAN Application 4 - Assisting Artists

Convert hand pencil drawing to actual image

  5. Face ageing GAN Application 5 - Face Ageing