Deep Learning Cheat Sheet

Author

require(reticulate)
require(knitr)

1 Neural Network

2 Supervised Neural Networks

2.1 Artificial Neural Networks

2.1.1 ANN Intuition

2.1.1.1 ANN in General

Input value: A neuron receives inputs X1, X2, and X3 from different independent variables. It is recommended to scale the inputs, either normalized (to the range 0 to 1) or standardized (to mean 0 and standard deviation 1).

Output value: can be continuous, binary, or categorical

Note that the inputs from one row produce the output for that row only.

Weights are the components that are adjusted during training.

The steps are as follows:

Values from each input neuron are multiplied by the weight and then summed up \[ \sum_{i=1}^{m} W_i \cdot X_i \]
Apply activation function \[ \phi\left( \sum_{i=1}^{m} W_i \cdot X_i \right) \]

The activation function is assigned to the neuron and determines whether the neuron passes the signal on as it is or modifies it.

The neuron passes the signal to the next neuron down the line based on the result of the activation.
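The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the input values, the weights, and the choice of `math.tanh` as the activation are assumptions made for the example.

```python
import math

def neuron_output(inputs, weights, activation):
    """Multiply each input by its weight, sum the products, then apply the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Hypothetical scaled inputs X1, X2, X3 and hypothetical weights W1, W2, W3
x = [0.5, 0.2, 0.9]
w = [0.4, -0.6, 0.1]

print(neuron_output(x, w, math.tanh))
```

Passing the activation as an argument keeps the weighted sum separate from the choice of activation function.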

2.1.1.2 The Activation Function

Four major types of activation function:

Threshold Function

If the value is less than 0, the output is 0; if the value is 0 or greater, the output is 1.

\[ \phi(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \]

Sigmoid Function

Sigmoid is useful in the last layer before the output if we are trying to predict probabilities.

\[ \phi(x) = \frac{1}{1 + e^{-x}} \] Where:

\({e^{-x}}\) is e raised to the power of minus the weighted sum

Rectifier Function

Anything below 0 becomes 0, and anything above 0 is passed through unchanged.

\[ \phi(x) = \max(x, 0) \]

Hyperbolic Tangent (tanh)

It ranges from approximately -1 to 1

\[ \phi(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} \]

Generally speaking, we apply the rectifier function in the hidden layers and the sigmoid function in the output layer.
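The four formulas above translate directly into code. The sketch below uses only the standard library; the function names and sample values are chosen for illustration.

```python
import math

def threshold(x):
    """1 if x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """1 / (1 + e^-x), output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """max(x, 0)."""
    return max(x, 0.0)

def tanh(x):
    """(1 - e^-2x) / (1 + e^-2x), output between -1 and 1."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

for value in (-2.0, 0.0, 2.0):
    print(value, threshold(value), sigmoid(value), rectifier(value), tanh(value))
```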

2.1.1.3 How Neural Network Works

Assume we have a fully trained neural network predicting property prices. The input layer has 4 neurons, one per feature:

\(X_1\): Area (\(\text{feet}^2\))
\(X_2\): Number of Bedrooms
\(X_3\): Distance to city (Miles)
\(X_4\): Age of the house (Years)

We have 1 hidden layer with 5 neurons:

1st neuron focuses on \(X_1\) and \(X_3\): it might look only for properties that have a larger area than average but are not far from the city. This is informative because larger properties are usually farther from the city. It does not care about the other features.

1st Neuron
2nd neuron focuses on \(X_2\) and \(X_3\): it might look only for properties with more bedrooms in close proximity to the city. This is informative because properties with more bedrooms are usually farther from the city. It does not care about the other features.

2nd Neuron
3rd neuron focuses on \(X_1\), \(X_2\), and \(X_4\): it might look only for properties that have a larger area and multiple bedrooms but are newer than average. Larger properties are usually older, so this neuron is looking for the uncommon combination. It does not care about the other features.

3rd Neuron
4th neuron focuses on all four features.

4th Neuron
5th neuron focuses on \(X_4\) only: it might look for properties that are old (e.g. 100 years) but have a high value, possibly because of their historical value. Once a property reaches a certain age, it is deemed to have historical value. It does not care about the other features. This is a perfect example of the rectifier (ReLU) function: the neuron stays at 0 below that age and activates above it.

5th Neuron

The combination of neurons in the hidden layer increases the flexibility of the neural network and allows it to look for specific patterns and learn from them. The output layer then combines the outputs of the hidden layer to produce the final output.
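As a rough sketch of the idea above, a neuron "not caring" about a feature corresponds to a weight of zero for that input. Every number below (the scaled inputs, the hidden weights, the output weights) is made up for illustration; a trained network would learn its own values.

```python
def relu(z):
    return max(z, 0.0)

# Hypothetical, already-scaled inputs: area, bedrooms, distance to city, age
x = [0.8, 0.5, 0.3, 0.1]

# One row of weights per hidden neuron; a zero weight means the neuron ignores that input
hidden_weights = [
    [0.6, 0.0, -0.4, 0.0],   # 1st neuron: area and distance
    [0.0, 0.5, -0.3, 0.0],   # 2nd neuron: bedrooms and distance
    [0.4, 0.3, 0.0, -0.5],   # 3rd neuron: area, bedrooms, and age
    [0.2, 0.2, -0.2, -0.2],  # 4th neuron: all four features
    [0.0, 0.0, 0.0, 0.9],    # 5th neuron: age only
]
output_weights = [0.5, 0.4, 0.3, 0.2, 0.1]

# Hidden layer: weighted sum per neuron followed by the rectifier
hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]

# Output layer: combines the five hidden outputs into one value
price_score = sum(w * h for w, h in zip(output_weights, hidden))
print(hidden, price_score)
```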

2.1.1.4 How do Neural Networks Learn

There are two approaches to get a program to do something:

Rule-based / hard-coded. We tell the program what to do. e.g. cats have these features and dogs have these features.
Neural network. We create a facility for the program to learn from the data. We give the program a lot of input and output and let the program figure out the rule. The program will learn the rule by adjusting the weights of the neurons. e.g. we give the program a lot of pictures of cats and dogs and let the program figure out the difference between them.

Example with 1 row:

We have a dataframe of exam scores:

data.frame(
  'Row ID' = c(1),
  'Study Hrs' = c(12),
  'Sleep Hrs' = c(6),
  'Quiz' = c(78),
  'Exam' = c(90)
) |> 
  kable()

Row.ID	Study.Hrs	Sleep.Hrs	Quiz	Exam
1	12	6	78	90

We feed this data into the neural network, which produces \(\hat{y}\); this is compared with the actual output \(y\) of 90. The error between the predicted output and the actual output is measured by the cost function \(C = \frac{1}{2}(\hat{y} - y)^2\).
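As a small worked example of the cost function, assume the network predicts 84 for this row (the prediction is a hypothetical value; the actual score of 90 comes from the table).

```python
def cost(y_hat, y):
    """Squared-error cost C = 1/2 * (y_hat - y)^2 for a single row."""
    return 0.5 * (y_hat - y) ** 2

actual_exam = 90        # actual output from the table
predicted_exam = 84.0   # hypothetical network output

print(cost(predicted_exam, actual_exam))  # 18.0
```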

The error is then fed back (backpropagated) through the neural network to adjust the weights of the neurons. The weights are adjusted to minimize the error between the predicted output and the actual output. When the whole training set has passed through the ANN, the 1st epoch is complete.

These two processes are repeated many times (multiple epochs) until the cost function is minimized, ideally reaching 0. The neural network learns the relationship between the input and output and is then able to predict the output for new input.

Repeated Process until C Minimized

Example with 8 rows:

We have a dataframe of exam scores:

data.frame(
  'Row ID' = c(seq(1, 8)),
  'Study Hrs' = c(12, 22, 115, 31, 0, 5, 92, 57),
  'Sleep Hrs' = c(6, 6.5, 4, 9, 10, 8, 6, 8),
  'Quiz' = c(78, 24, 100, 67, 58, 78, 82, 91),
  'Exam' = c(93, 68, 95, 75, 51, 60, 89, 97)
) |> 
  kable()

Row.ID	Study.Hrs	Sleep.Hrs	Quiz	Exam
1	12	6.0	78	93
2	22	6.5	24	68
3	115	4.0	100	95
4	31	9.0	67	75
5	0	10.0	58	51
6	5	8.0	78	60
7	92	6.0	82	89
8	57	8.0	91	97

1st Backpropagation and Complete 1 Epoch

The weights are adjusted using the gradient descent algorithm, an optimization algorithm that minimizes the error between the predicted output and the actual output. It works by calculating the gradient of the error with respect to the weights and adjusting the weights in the opposite direction of the gradient.

2.1.1.5 Gradient Descent

Previously, we saw that for a neural network to learn, it needs to backpropagate the error/cost to the neurons so that their weights are adjusted accordingly. The weights are adjusted using the gradient descent algorithm, an optimization algorithm that minimizes the error between the predicted output and the actual output. The goal is to find the weights that give the smallest error.

The first red circle looks at which way the slope goes down and adjusts the weight accordingly. The second red circle does the same from its new position. The process is repeated until the error is minimized. Each red circle is one epoch.

The gradient descent algorithm works by calculating the gradient of the error with respect to the weights and adjusting the weights in the opposite direction of the gradient. Defined as:

\[ w = w - \alpha \frac{\partial C}{\partial w} \] Where:

\(w\) is the weight of the neuron
\(\alpha\) is the learning rate
\(C\) is the cost function
\(\frac{\partial C}{\partial w}\) is the gradient of the cost function with respect to the weight
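The update rule can be sketched for the simplest possible model, a single weight with \(\hat{y} = w \cdot x\) and the cost \(C = \frac{1}{2}(\hat{y} - y)^2\), for which \(\frac{\partial C}{\partial w} = (\hat{y} - y) \cdot x\). The model, the data point, and the learning rate are assumptions made for this illustration.

```python
def gradient_descent(x, y, w=0.0, alpha=0.1, steps=50):
    """Repeat w = w - alpha * dC/dw for the one-weight model y_hat = w * x."""
    for _ in range(steps):
        y_hat = w * x
        grad = (y_hat - y) * x   # dC/dw for C = 1/2 * (y_hat - y)^2
        w = w - alpha * grad     # step against the gradient
    return w

# Hypothetical data point: x = 2, y = 6, so the ideal weight is 3
w = gradient_descent(x=2.0, y=6.0)
print(w)
```

A learning rate that is too large makes the updates overshoot, while one that is too small makes convergence slow.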

2.1.1.6 Stochastic Gradient Descent

Gradient descent works best if the cost function is convex, such as \(C = \frac{1}{2}(\hat{y} - {y})^2\). But what if the cost function looks like the one below?

Stochastic Gradient Descent

Then we might end up in a local minimum rather than the global minimum. The solution is to use stochastic gradient descent.

Difference of Stochastic Gradient Descent

While normal (batch) gradient descent adjusts the weights after processing all the rows, stochastic gradient descent adjusts them after each row. Each update is cheaper, and the noisier updates make it more likely to escape a local minimum and reach the global one.

A middle ground between the two is mini-batch gradient descent, where we choose the number of rows to process before each update (e.g. 5, 10, or 25). This balances the stability of batch gradient descent with the speed of stochastic gradient descent.
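The three variants differ only in how many rows feed each weight update, which the sketch below exposes as `batch_size`. It assumes a one-weight model \(\hat{y} = w \cdot x\) and a tiny made-up dataset; the function name and parameters are illustrative.

```python
import random

def train(data, batch_size, alpha=0.5, epochs=200, seed=0):
    """Fit y_hat = w * x; batch_size sets how many rows are used per weight update."""
    rng = random.Random(seed)
    rows = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the rows in a new order every epoch
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            # Average gradient of C = 1/2 * (y_hat - y)^2 over the batch
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= alpha * grad
    return w

data = [(0.1, 0.2), (0.4, 0.8), (0.5, 1.0), (0.9, 1.8)]  # hypothetical rows with y = 2x

print(train(data, batch_size=len(data)))  # batch: one update per epoch
print(train(data, batch_size=1))          # stochastic: one update per row
print(train(data, batch_size=2))          # mini-batch: one update per 2 rows
```

On this convex toy problem all three settle on the same weight; the difference in behavior matters on cost surfaces with local minima.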

2.1.1.7 Backpropagation

Backpropagation is the process of propagating the error between the predicted output and the actual output backward through the network so that the weights of the neurons can be adjusted to minimize it. It calculates the gradient of the error with respect to every weight, and the weights are then adjusted in the opposite direction of the gradient. Backpropagation is used in the training phase of a neural network to learn the relationship between the input and output.

In summary, the whole ANN training process is as follows:

Complete ANN Training Flow
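For a single sigmoid neuron, the backward pass is just the chain rule. The sketch below assumes \(\hat{y} = \text{sigmoid}(w \cdot x)\) and the cost \(C = \frac{1}{2}(\hat{y} - y)^2\); the starting weight, data point, and learning rate are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    """Forward propagation: weighted input through the activation."""
    return sigmoid(w * x)

def backward(w, x, y):
    """dC/dw by the chain rule: dC/dy_hat * dy_hat/dz * dz/dw."""
    y_hat = forward(w, x)
    dC_dyhat = y_hat - y
    dyhat_dz = y_hat * (1.0 - y_hat)   # derivative of the sigmoid
    dz_dw = x
    return dC_dyhat * dyhat_dz * dz_dw

w, x, y, alpha = 0.5, 1.5, 1.0, 0.5    # hypothetical values
for epoch in range(3):
    w -= alpha * backward(w, x, y)     # gradient descent step
    print(epoch, w, 0.5 * (forward(w, x) - y) ** 2)
```

In a multi-layer network the same chain rule is applied layer by layer, from the output back toward the input.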

2.1.2 Building an ANN

2.1.2.1 ANN Data Preprocessing

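import numpy as np
import pandas as pd

dataset = pd.read_csv('Churn_Modelling.csv')
# Features: columns from index 3 up to (not including) the last column
X = dataset.iloc[:, 3:-1].values
# Target: the last column
y = dataset.iloc[:, -1].values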

# Label Encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])

# One Hot Encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Dataset Splitting
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

2.1.2.2 ANN Model Building

import tensorflow as tf

ann = tf.keras.models.Sequential()
# Two hidden layers with the rectifier activation
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
# Output layer: one neuron with sigmoid for a binary probability
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
ann.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['accuracy',
                     'Precision',
                     'Recall',
                     'AUC',
                     'mae'])
ann.fit(x=X_train,
        y=y_train,
        batch_size=32,
        epochs=100)

2.2 Convolutional Neural Networks

2.2.1 CNN Intuition

Convolutional Neural Networks (CNN) are a type of neural network designed for grid-like data such as images. CNNs are inspired by the structure and function of the human visual system and are capable of learning from large amounts of data. CNNs are used in computer vision tasks such as image classification, object detection, and image segmentation.

Computers read images as pixels of color. A black-and-white image has only 1 layer of pixel values from 0 to 255, while an RGB image has 3 layers of pixel values from 0 to 255. The computer reads the image as a matrix of pixel values.

Here, for simplicity, we assume that 0 is white and 1 is black.

2.2.2 Step 1 - Convolution Operation

The first step in a CNN is the convolution operation. The convolution operation is a mathematical operation that is used to extract features from an image.

The convolution formula is: \[ (f \ast g)(t):=\int_{-\infty}^{\infty} f(\tau) g(t-\tau) d \tau \] Where:

\(f\) is the input image
\(g\) is the filter
\(t\) is the position at which the output feature map is evaluated

The convolution operation works by:

Taking an input image.
Applying a feature detector (also called a filter or kernel) to the image. The filter is a small matrix of weights that is applied to the image to extract features such as edges, corners, and textures.
Producing a feature map (also called a convolved feature or activation map).
The filter slides along the image by a certain number of pixels, called the stride; in this case the stride is 1.
The calculation of each feature map value is as follows: each filter value is multiplied by the image pixel it covers (1 and 1 give 1, 1 and 0 give 0, 0 and 0 give 0), and the products are summed. This is the essence of the convolution operation.
The resulting feature map is a reduced version of the input image. When looking at an image, we do not look at the pixel level but at the features (edges, corners, textures) in the image. The filter is used to extract these features from the image.
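The sliding-window calculation described above can be sketched in plain Python. The 5x5 image and 3x3 filter are made-up 0/1 matrices, and the function assumes a square image and a square filter with no padding. As in most deep learning libraries, the filter is applied without flipping it.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; each output cell is the sum of element-wise products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical input image and feature detector
image = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

for row in convolve2d(image, kernel):
    print(row)
```

With a 5x5 image, a 3x3 filter, and a stride of 1, the feature map is 3x3, which is the size reduction mentioned above.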

Example of different feature detector:

Edge Detect (The Most Important for CNN)

The NN decides which feature detectors to use, and they might not be meaningful to the human eye. The filters are learned by the NN during the training process.

In summary, the primary purpose of a CNN is to find features in your image using feature detectors and put them into feature maps, which preserve the spatial relationship between the pixels in the image. Most of the time, the features a CNN detects and uses to recognize certain images and classes will mean nothing to humans, but they nevertheless work for the computer.

So far, our process has reached:

After Convolution

2.2.3 Step 1(b) - ReLU Layer

ReLU is an additional step on top of the convolution. The purpose of applying ReLU is to increase the non-linearity in the network. An image is itself highly non-linear because it contains varied information (elements, borders, colors, etc.) in one frame, and when we run the convolution to create the feature map, we risk creating something linear. That is why we apply ReLU to break up the linearity. Example given:

This is the original image

A feature detector can itself contain negative values, so after it strides over an image, the resulting feature map can have both negative and positive values. The black area is negative and the white area is positive. This will be a common

The ReLU function converts the negative area to 0 and keeps the positive area as it is.

After the feature detector is applied, we can see that the highlighted area has a “linear” color gradation from white through several shades of grey to black. This is a linear gradation and we need to break it up.

After the ReLU function is applied, we can see that the highlighted area has a “non-linear” color gradation that goes from grey to black directly, with an abrupt change and no shades in between. This is a non-linear gradation and this is what we want.

\[ \phi(x) = \max(0, x) \] Where:

\(x\) is the weighted sum of the inputs
\(\phi(x)\) is the output of the activation function
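Applied to a feature map, the rectifier simply replaces every negative value with 0. The values below are hypothetical.

```python
def relu_map(feature_map):
    """Apply max(0, x) to every value of a feature map."""
    return [[max(0, value) for value in row] for row in feature_map]

feature_map = [[-2, 0, 3], [1, -1, 4]]  # hypothetical values after convolution
print(relu_map(feature_map))  # [[0, 0, 3], [1, 0, 4]]
```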

So far, our process has reached:

2.2.4 Step 2 - Max Pooling

There are several types of pooling: max pooling, average pooling, sum pooling, etc. The most common is max pooling. The purpose of pooling is to reduce the size of the feature map while keeping its most important features. The pooling operation works by:

Taking a feature map.
Applying a pooling operation to the feature map, which summarizes each small window of values as a single value (for max pooling, the maximum).
Producing a pooled feature map.

Imagine these images of a cheetah

In this image, the same cheetah is shown normal, rotated, and squeezed. The network reads these as 3 different images, and we want the network to be able to recognize the cheetah in any of these forms and learn that it is the same cheetah in all 3 images.

In this image, there are 6 different cheetahs. All the images are normal, but the cheetahs are in different positions: they are looking at different angles, positioned differently, in different sizes, and with different textures. If the network is looking for a distinctive feature of a cheetah, which in this case is the black “tear” pattern running from the eye to the mouth, then it will have a hard time recognizing the other cheetahs, because that feature might be in a different position in each of these 6 images.

To solve this, we need to ensure the neural network has a property called spatial invariance, so that the network has some flexibility and does not care if the feature is slightly tilted, rotated, squeezed, closer, or further away.

Example of how pooling works:

Here we have a feature map that we are going to apply pooling to. In this example, we use max pooling with a 2x2 window, taking the maximum value found inside the window.

Using a stride of 2, we scan the feature map to create the pooled feature map.

It does not matter if the window passes the edge; just ignore the part outside and continue to the next stride.

With max pooling, this area will result in the maximum value of 4.

This is the result after max pooling. The size has been reduced to about 25% (1 out of every 4 pixels) while the most important features of the feature map are retained.
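A minimal sketch of 2x2 max pooling with a stride of 2 follows. The feature map values are made up, and the edge handling follows the description above: a window that passes the edge uses only the values that exist. Library layers may instead drop such partial windows, depending on their padding setting.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window; windows are clipped at the edges."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows, stride):
        pooled_row = []
        for j in range(0, cols, stride):
            window = [feature_map[a][b]
                      for a in range(i, min(i + size, rows))
                      for b in range(j, min(j + size, cols))]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

# Hypothetical 5x5 feature map
feature_map = [
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 2, 1],
    [1, 4, 2, 1, 0],
    [0, 0, 1, 2, 1],
]

for row in max_pool(feature_map):
    print(row)
```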

With this, it does not matter exactly where the “tears” of the cheetah are positioned. Say the “tears” are originally in this area of the image.

The image is then slightly rotated, so the tears move to this area of the image. However, since we are using max pooling, the resulting number is still 4 in the pooled feature map. This way we account for possible spatial or textural changes in the image.

After max pooling is applied, we are:

Preserving the features
Introducing spatial invariance to the network
Reducing the size of the original feature map by 75% (keeping 1 out of 4 pixels)
By reducing the size, we are reducing the number of parameters that go to the next / final layer.
Preventing overfitting because we remove unnecessary information from the feature map.

So far, our process has reached:

2.2.5 Step 3 - Flattening

After the convolution and max pooling steps, the pooled feature map is flattened into a single column/vector.

The purpose of flattening is to convert the feature map into a format that can be used as input to a fully connected layer. So far, our process has reached:

After Flattening
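Flattening just reads the pooled feature map row by row into one long vector, as in this small sketch with hypothetical values.

```python
pooled = [[1, 1, 0], [4, 2, 1], [0, 2, 1]]  # hypothetical pooled feature map

# Read the rows one after another into a single vector
flattened = [value for row in pooled for value in row]
print(flattened)  # [1, 1, 0, 4, 2, 1, 0, 2, 1]
```

With several pooled feature maps, each one is flattened the same way and the vectors are joined end to end.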

2.2.6 Step 4 - Full Connection

Fully connected layers are layers in which every neuron is connected to every neuron in the previous layer. They are used to classify the input image into different classes. The fully connected layer works by:

Taking the flattened feature map.
Passing it through one or more fully connected layers.
Producing an output.

Fully Connected Layer with Target Example

Imagine a CNN trying to predict whether an image shows a dog or a cat.

As with an ANN, the network tries to minimize the loss between the predicted output and the actual output, adjusting the weights of the neurons in the fully connected layers through backpropagation. The only difference is that in a CNN the adjustment goes all the way back to the convolution layer, so the feature detectors are adjusted as well, whereas in an ANN it goes back to the weights after the input layer.

How a CNN learns to differentiate between a dog and a cat:

The network learns that a dog has certain features.
The neurons in the last fully connected layer vote on whether they found their features (e.g. 0.9 for eyebrows, 1 for nose, 1 for pointy ears) and send (fire) that to the output layer.
The output layer compares its prediction with the label and learns that the image is a dog.
The dog output neuron now learns that the inputs from those 3 neurons are the features of a dog.
Through multiple iterations (samples and epochs), the dog output neuron learns that those neurons really do signal the features of a dog and trusts them more. In the end, the final fully connected layer is likely to hold many features, or combinations of features, that are representative and descriptive of each output class.
On the other hand, the cat output neuron learns that those neurons do not signal the features of a cat and trusts them less.
That is how features are propagated through the network and conveyed to the output layer.
If the last fully connected layers are not contributing distinctive features to the output layer, backpropagation adjusts the weights all the way back to the convolution layer.
After the model has been trained, a new input can come in.
The output layer has no idea whether it is a dog or a cat, but it has learned to listen to the neurons that fire most strongly for a dog or for a cat.
The dog output neuron looks at the inputs from the 3 neurons it trusts, reads that their values are high, and decides that there is a 95% chance that this is a dog. Similarly, the cat output neuron looks at the inputs from the 3 neurons it trusts, reads that their values are low, and decides that there is a 5% chance that this is a cat. The output layer decides that this is a dog.
The same process happens for the cat output neuron when the image is a cat.

2.2.7 Softmax & Cross-Entropy

2.2.7.1 Softmax

In practice, the output layer won’t directly calculate that the probability of dog is 0.95 and the probability of cat is 0.05 which summed up to 1. Instead, the output layer will produce something like 0.80 for dog and 0.40 for cat which won’t add up to 1. This is where the softmax function comes in.

The output layer of a CNN classifier is a softmax layer, a layer of neurons that produces a probability distribution over the classes. The softmax layer works by:

Taking the output of the fully connected layer.
Applying the softmax function to that output, which rescales the values so that each lies between 0 and 1 and together they sum to 1.
Producing a probability distribution over the classes.

The softmax function is defined as: \[ f_j(z) = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}} \] Where:

\(z\) is the vector of outputs of the fully connected layer, with one value \(z_k\) per class
\(f_j(z)\) is the output of the softmax function for class \(j\)
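As a sketch of the formula, the function below turns raw scores into probabilities that sum to 1. The scores 0.80 and 0.40 are the dog and cat values used in the text above; subtracting the largest score before exponentiating is a common numerical safeguard that does not change the result.

```python
import math

def softmax(z):
    """e^(z_j) divided by the sum of e^(z_k) over all classes."""
    largest = max(z)
    exps = [math.exp(value - largest) for value in z]  # shifted to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.80, 0.40])  # raw scores for dog and cat
print(probs, sum(probs))
```

The resulting probabilities depend on the gap between the raw scores, so they will not necessarily match the 0.95 and 0.05 used as an illustration earlier.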

2.2.7.2 Cross-Entropy

In an ANN, we used a cost function for backpropagation. In a CNN with a softmax output, the corresponding function is called the loss function (the same idea under a different name). The cross-entropy loss function measures the difference between the predicted output and the actual output. It is defined as:

\[ H(p, q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i) \] Where:

\(p\) is the actual output
\(q\) is the predicted output
\(H(p, q)\) is the output of the cross-entropy loss function

Assume that we have two NNs predicting dog and cat with the following softmax results:

We then calculate the classification error, mean squared error, and cross-entropy of the predictions.

Cross Entropy Calculation

We can calculate the MSE as follows:

MSE NN1

Row 1:
- Dog: (0.9 - 1)² = 0.01
- Cat: (0.1 - 0)² = 0.01

Row 2:
- Dog: (0.1 - 0)² = 0.01
- Cat: (0.9 - 1)² = 0.01

Row 3:
- Dog: (0.4 - 1)² = 0.36
- Cat: (0.6 - 0)² = 0.36

Total squared error = 0.01 + 0.01 + 0.01 + 0.01 + 0.36 + 0.36 = 0.76

Considering each row as a single sample with two outputs:
MSE = 0.76 ÷ 3 = 0.2533 ≈ 0.25

We can calculate the cross-entropy as follows (using the natural logarithm):

Cross-Entropy NN1

Row 1:
- Dog: -(1) * log(0.9) = 0.1054
- Cat: -(0) * log(0.1) = 0

Row 2:
- Dog: -(0) * log(0.1) = 0
- Cat: -(1) * log(0.9) = 0.1054

Row 3:
- Dog: -(1) * log(0.4) = 0.9163
- Cat: -(0) * log(0.6) = 0

Total cross-entropy = 0.1054 + 0 + 0 + 0.1054 + 0.9163 + 0 = 1.1271

Considering each row as a single sample with two outputs:
Cross-entropy = 1.1271 / 3 = 0.3757 ≈ 0.38
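The two calculations for NN1 can be reproduced with a few lines of Python. The predicted and actual values are the ones used in the rows above; the variable names are illustrative.

```python
import math

# ((predicted dog, predicted cat), (actual dog, actual cat)) for each row of NN1
rows = [
    ((0.9, 0.1), (1, 0)),
    ((0.1, 0.9), (0, 1)),
    ((0.4, 0.6), (1, 0)),
]

# Sum of squared differences over every output of every row
squared_error = sum((q - p) ** 2 for pred, actual in rows for q, p in zip(pred, actual))

# Sum of -p * log(q); only the true class (p = 1) contributes
cross_entropy = sum(-p * math.log(q) for pred, actual in rows for q, p in zip(pred, actual))

print(round(squared_error / len(rows), 4))   # 0.2533
print(round(cross_entropy / len(rows), 4))   # 0.3757
```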

In this case, cross-entropy is a better measure than mean squared error or classification error. At the start of training, the votes of the neurons are not accurate, and the output for the correct class can be tiny, such as 0.00001. If the next forward propagation improves it to 0.001, the mean squared error and classification error barely change, so it looks as if the network is not improving much. Because cross-entropy uses a logarithm, it registers this small change as a clear improvement in the right direction and guides gradient descent accordingly.

Moreover, cross-entropy is the preferred loss for classification problems (as in the CNN here) because it measures the difference between probability distributions, while mean squared error is better suited to regression problems in an ANN.

2.2.8 Summary

We start with an input image.
Apply multiple feature detectors to create feature maps (convolution).
On top of the convolution layer, apply ReLU to increase the non-linearity.
Apply max pooling to the feature maps (the number of pooled maps is the same as the number of feature maps). Max pooling gives us spatial invariance, reduces the image size, keeps the most important features, and helps prevent overfitting on unnecessary features.
Flatten the pooled layers into a single column/vector.
Feed the vector into an ANN, whose fully connected layers classify the image into different classes.
The final fully connected layer votes on the importance of the features and sends the votes to the output layer.
The output layer decides whether the image is a dog or a cat.
After this cycle of forward propagation has finished, the network backpropagates the error to adjust the weights all the way back to the convolution layer.

2.2.9 Building a CNN

2.2.9.1 CNN Data Preprocessing

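import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Step 1 - Preprocessing the Training Set
# Rescale pixel values to 0-1 and augment with shear, zoom, and horizontal flips
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)
training_set = train_datagen.flow_from_directory(
    'dataset/training_set',
    target_size = (64, 64),
    batch_size = 32,
    class_mode = 'binary'
)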

# Step 2 - Preprocessing the Test Set
test_datagen = ImageDataGenerator(rescale = 1./255)
test_set = test_datagen.flow_from_directory(
    'dataset/test_set',
    target_size = (64, 64),
    batch_size = 32,
    class_mode = 'binary'
)

2.2.9.2 CNN Model Building

# Step 1 - Initialising the CNN
cnn = tf.keras.models.Sequential()

# Step 2 - Convolution Layer
cnn.add(tf.keras.layers.Conv2D(
    filters=32,
    kernel_size=3,
    activation='relu',
    input_shape=[64, 64, 3],
))
# Output is 32 feature maps of 62x62 pixels [62, 62, 32]

# Step 3 - Pooling Layer
cnn.add(tf.keras.layers.MaxPool2D(
    pool_size=2,
    strides=2,
))
# Output is 32 feature maps of 31x31 pixels [31, 31, 32]

# Step 4 - Convolution Layer
cnn.add(tf.keras.layers.Conv2D(
    filters=32,
    kernel_size=3,
    activation='relu',
))
# Output is 32 feature maps of 29x29 pixels [29, 29, 32]

# Step 5 - Flattening
cnn.add(tf.keras.layers.Flatten())
# Output is 29*29*32 = 26912

# Step 6 - Full Connection
cnn.add(tf.keras.layers.Dense(
    units=128,
    activation='relu',
))
# Even though the result of flatten layers is 26912, each of the 128 neurons in the dense layer is connected to all 26912 neurons in the flatten layer

# Step 7 - Output Layer
cnn.add(tf.keras.layers.Dense(
    units=1,
    activation='sigmoid',
))

# Step 8 - Compiling the CNN
cnn.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy',
             'Precision',
             'Recall',
             'AUC',]
)

# Step 9 - Fitting the CNN to the images
cnn.fit(x=training_set,
        validation_data=test_set,
        epochs=100)
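The shapes noted in the comments above can be checked with the usual size formula for layers without padding, (input size - window size) // stride + 1. The helper name is hypothetical; the layer settings are the ones from the model above.

```python
def output_size(size, window, stride=1):
    """Output width/height of a convolution or pooling layer without padding."""
    return (size - window) // stride + 1

after_conv1 = output_size(64, window=3)                    # Conv2D, kernel_size=3
after_pool = output_size(after_conv1, window=2, stride=2)  # MaxPool2D, pool_size=2, strides=2
after_conv2 = output_size(after_pool, window=3)            # Conv2D, kernel_size=3
flattened = after_conv2 * after_conv2 * 32                 # 32 filters in the last Conv2D

print(after_conv1, after_pool, after_conv2, flattened)     # 62 31 29 26912
```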

2.3 Recurrent Neural Networks

2.3.1 RNN Intuition

As part of supervised NN, RNN is useful for sequential data such as time series.

To transform a normal ANN into an RNN, we squash the input and output layers and imagine looking at the network from below instead of from the side. The number of neurons in every layer is still the same.

It is represented along the time axis

Represented as a loop for the next hidden layer after getting an input.

Don’t forget that all the layers are still there; we only see one because we are looking from the bottom. Each of these circles is not one neuron but a layer of neurons.

The temporal loop means that the RNN does not only output a result but also feeds the output back to itself. The neurons are connected to themselves through time. This gives each neuron a short-term memory: it remembers what was in that neuron just previously.

Long-term memory is achieved through trained weights. The weights are trained to remember the important information and forget the unimportant information. Short-term memory, in contrast, is achieved through the temporal loop.

Examples cover:

One to Many where there is a single input and multiple outputs such as image captioning.
- The input is an image and the output is a sequence of words describing the image
- The model takes a fixed input and generates a variable-length sequence
Many to One where there are multiple inputs and a single output such as sentiment analysis
- The input is a sequence of words and the output is a single label
- The model processes a sequence and produces a fixed output classification
Many to Many (synchronized) where there are multiple inputs and multiple outputs with aligned sequences, such as video classification frame by frame
- The input is a sequence and the output is a sequence of the same length
- Each output element directly corresponds to an input element
Many to Many (delayed) where there are multiple inputs and multiple outputs such as language translation
- The input is a sequence of words in one language and the output is a sequence of words in another language
- The model processes the entire input sequence before generating output elements

2.3.2 Vanishing Gradient in RNN

2.3.2.1 Vanishing Gradient Problem

In a normal ANN, the weights are adjusted based on the error at the output neuron. The error is calculated by comparing the output of the neuron with the expected output, and each weight is adjusted in proportion to the gradient of that error with respect to the weight.

However, an RNN is not the same as an ANN, since each hidden layer sends its output both to the output layer and to the hidden layer at the next time point.

We can calculate the error / cost function at any given timepoint. The cost function compares the output which is in the right circle with the expected output which is in the left circle. The cost function is calculated at each timepoint.

Assume that we take Et as an example: we have calculated the cost or error at Et and now want to backpropagate the error to the previous time points. Every neuron that participated in the calculation of the output at Et needs its weights updated.

Not only the weights at that time point need to be updated, but also all the weights in the hidden and input layers at the previous time points.

The errors are backpropagated to all the previous nodes.

As we backpropagate, the error is multiplied by the weights. If the weights are small, the error becomes smaller and smaller as we go back in time. This is called the vanishing gradient problem. The weights at Xt are updated easily, but those at Xt-3 receive only a very small update. At the end of the epochs, Xt could be well trained while Xt-3 is hardly trained at all. This is the problem of RNN.

If Wrec is small then we have vanishing gradient problem. If Wrec is big then we have exploding gradient problem. The exploding gradient problem is when the weights are so big that the error becomes too big and the model diverges.
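The effect of repeatedly multiplying by Wrec can be seen with plain arithmetic. The sketch below is a deliberate simplification that ignores the activation function derivatives and treats Wrec as a single number; the three weight values are hypothetical.

```python
def backpropagated_factor(w_rec, steps):
    """Factor applied to an error after travelling back `steps` time points."""
    return w_rec ** steps

for w_rec in (0.5, 1.0, 1.5):  # hypothetical recurrent weights
    factors = [backpropagated_factor(w_rec, t) for t in (1, 5, 10, 20)]
    print(w_rec, factors)
```

With a weight below 1 the factor shrinks toward 0 (vanishing), with a weight above 1 it grows without bound (exploding), and only a weight of exactly 1 keeps it stable.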

2.3.2.2 Vanishing Gradient Solution

The solutions depend on which of the two conditions we face:

Exploding Gradient:
- Truncated backpropagation: stop the backpropagation after a certain number of time steps. It is not ideal since not all the weights are updated. However, the intuition is that rather than carrying a big and irrelevant gradient further back, it is better to stop.
- Penalties: add a regularization term to the cost function so that the weights are artificially reduced and kept from exploding.
- Gradient clipping: set a threshold for the gradient. If the gradient is bigger than the threshold, it is rescaled (normalized) so that it does not exceed the threshold.
Vanishing Gradient:
- Weight Initialization: initialize the weights carefully so that the potential for a vanishing gradient is minimized.
- Echo State Networks: a type of RNN in which the recurrent weights are set randomly and kept fixed, and only the output weights are trained, so the error does not have to be backpropagated through time.
- Long Short Term Memory (LSTM): a type of RNN that has a special architecture that allows it to remember past inputs. It is not a standard RNN, but it is the most popular solution to the vanishing gradient problem.
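Gradient clipping from the list above can be sketched as follows. This is a minimal illustration that assumes clipping by the overall norm of the gradient (clipping each value separately is another option); `clip_by_norm` is a hypothetical helper, not a library function.

```python
import math


def clip_by_norm(gradient, threshold):
    """Rescale `gradient` so that its Euclidean norm never exceeds `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)  # small enough, leave it untouched
    scale = threshold / norm   # shrink every component by the same factor
    return [g * scale for g in gradient]


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)
print(clipped)
```

The direction of the gradient is preserved; only its size is limited.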

2.3.3 LSTM Intuition

We will be talking about LSTM, a type of RNN that is used to solve the vanishing gradient problem.

Generally speaking, if Wrec is small or <1 then we have vanishing gradient problem. If Wrec is big or >1 then we have exploding gradient problem. The solution in LSTM is to set the Wrec to be 1.

The structure of an LSTM network comprises memory cells, input gates, forget gates, and output gates. Memory cells serve as the long-term storage, input gates control the flow of new information into the memory cells, forget gates regulate the removal of irrelevant information, and output gates determine the output based on the current state of the memory cells.

where:

\(c_{t-1}\) is the memory coming from the previous node
\(c_t\) is the memory going out of the current node
\(h_{t-1}\) is the output of the previous node
\(h_t\) is the output of the current node
\(X_t\) is the input to the current node
\(x\) is a pointwise multiplication that acts as a valve: it lets information through or removes it
\(+\) is a pointwise addition that adds new information to the memory

The step-by-step is as follows:

The output of the previous node and the input of the current node come together and go through a Sigmoid, which decides how much of the old memory is allowed to stay in the memory pipeline (the forget valve).

The same two values go through another Sigmoid and a Tanh, which decide what new value is written to the memory pipeline and how much of it (the memory valve).

The memory from the previous node then has components removed (forget) or added (input) according to these Sigmoid and Tanh results. If the \(X\) on the left is open and the \(X\) on the right is closed, the memory won’t be updated. If the \(X\) on the left is closed and the \(X\) on the right is open, the memory will be updated.

Finally, the output of the previous node and the input of the current node go through a Sigmoid that decides which part of the memory pipeline becomes the output of this node.
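The four steps above can be sketched for a single LSTM unit with scalar values. This is a simplified illustration of the standard LSTM equations; the parameter names in the dictionary `p` are hypothetical, and a real layer would use weight matrices instead of scalars.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of a single-unit LSTM; `p` holds scalar weights and biases."""
    forget = sigmoid(p["wf_x"] * x_t + p["wf_h"] * h_prev + p["bf"])       # left valve
    write = sigmoid(p["wi_x"] * x_t + p["wi_h"] * h_prev + p["bi"])        # right valve
    candidate = math.tanh(p["wg_x"] * x_t + p["wg_h"] * h_prev + p["bg"])  # new value
    output = sigmoid(p["wo_x"] * x_t + p["wo_h"] * h_prev + p["bo"])       # output valve
    c_t = forget * c_prev + write * candidate  # update the memory pipeline
    h_t = output * math.tanh(c_t)              # part of the memory that is output
    return h_t, c_t


names = ["wf_x", "wf_h", "bf", "wi_x", "wi_h", "bi",
         "wg_x", "wg_h", "bg", "wo_x", "wo_h", "bo"]
params = {name: 0.0 for name in names}
# Left valve wide open, right valve closed: the memory should pass through unchanged.
params["bf"], params["bi"] = 20.0, -20.0
h_t, c_t = lstm_step(x_t=0.7, h_prev=0.1, c_prev=2.0, p=params)
print(h_t, c_t)
```

Because the memory is only multiplied by the forget valve and not by a recurrent weight, an open valve carries it through time almost unchanged, which is the behaviour that Wrec = 1 is meant to achieve.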

2.3.4 LSTM Practical Intuition

In this example we look inside an LSTM at the output of the tanh function over a long text, where a tanh of +1 is shown in blue and a tanh of -1 in red. The RNN is trained to read text and predict what text will come next.

In the “sensitivity to position in line” test, the early words of a line are blue while the ones at the end are red. In “turns on inside quote”, the words inside the quotes are blue.

This example keeps track of the depth of the if/else conditions

In this example, the websites are highlighted in green, the top line is the current word, and the dark red cells are the next-word predictions.

2.3.5 LSTM Variations

This is the standard LSTM implementation.

In Peephole LSTM, we add peephole connections – the lines that feed additional input about the current state of the memory cell to the sigmoid activation functions.

In Combined Gates LSTM, we connect forget valve and memory valve. So, instead of having separate decisions about opening and closing the forget and memory valves, we have a combined decision here. Basically, whenever you close the memory off (forget valve = 0), you have to put something in (memory valve = 1 – 0 = 1), and vice versa.
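The combined decision can be sketched in one line of arithmetic. This is a minimal scalar illustration; `coupled_memory_update` is a hypothetical name.

```python
def coupled_memory_update(c_prev, forget_valve, candidate):
    """Memory update where the memory valve is tied to the forget valve."""
    memory_valve = 1.0 - forget_valve  # whatever is forgotten must be replaced
    return forget_valve * c_prev + memory_valve * candidate


print(coupled_memory_update(c_prev=2.0, forget_valve=0.0, candidate=-0.5))  # old memory dropped
print(coupled_memory_update(c_prev=2.0, forget_valve=1.0, candidate=-0.5))  # old memory kept
```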

The Gated Recurrent Unit (GRU) completely gets rid of the memory cell and replaces it with the hidden pipeline. So, instead of having two separate values (one for the memory and one for the hidden state), there is only one value.

2.3.6 Building an RNN

2.3.6.1 RNN Data Preprocessing

# Step 1 - Data Import
import numpy as np
# dataset_train is assumed to be a pandas DataFrame that was loaded beforehand
training_set = dataset_train.iloc[:, 1:2].values

# Step 2 - Feature Scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)

# Step 3 - Creating a data structure with 60 timesteps and 1 output
X_train = []
y_train = []
for i in range(60, 1258):
    X_train.append(training_set_scaled[i-60:i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

# Step 4 - Reshaping
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
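The windowing in Steps 3 and 4 can be checked on a toy series. The sketch below uses plain Python lists and 3 timesteps instead of 60 so that the result is easy to read; `make_windows` is a hypothetical helper and the numbers are made up.

```python
def make_windows(series, timesteps):
    """Pair every value with the `timesteps` values that precede it."""
    X, y = [], []
    for i in range(timesteps, len(series)):
        X.append(series[i - timesteps:i])  # the previous `timesteps` values
        y.append(series[i])                # the value to predict
    return X, y


series = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
X, y = make_windows(series, timesteps=3)

# Equivalent of the reshape: (samples, timesteps) -> (samples, timesteps, 1 feature)
X_3d = [[[value] for value in window] for window in X]

for window, target in zip(X, y):
    print(window, "->", target)
```

A series of 6 values with 3 timesteps yields 3 samples, in the same way that 1258 rows with 60 timesteps yield 1198 samples.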

2.3.6.2 RNN Model Building

# Step 1 - Initialising the RNN
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
regressor = Sequential()

# Step 2 - Adding the first LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.2))

# Step 3 - Adding a second LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

# Step 4 - Adding a third LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

# Step 5 - Adding a fourth LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))

# Step 6 - Adding the output layer
regressor.add(Dense(units = 1))

# Step 7 - Compiling the RNN
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Step 8 - Fitting the RNN to the Training set
regressor.fit(X_train, y_train, epochs = 100, batch_size = 32)

3 Unsupervised Neural Networks

3.1 Self-Organizing Maps

A Self-Organizing Map (SOM) is a type of unsupervised neural network that is used to cluster data. It is trained using unsupervised learning to produce a low-dimensional representation of the input space (reducing the data from many columns to just two dimensions). The SOM is trained using a competitive learning algorithm, where the neurons compete to be activated by the input data.

In the given picture, a large number of inputs is condensed into 2 dimensions. The input layer holds the columns of the data, and each output neuron is connected to the input data. The output layer is a 2D grid of neurons that serves as a low-dimensional representation of the input data.

Example:

This SOM depicts the different states of prosperity and poverty. Assume that the initial data covers 200 countries with 39 categories each, which would be difficult to visualize at once.

Furthermore, we can transfer the result onto a world map

3.1.1 Revisiting K-Means

The similarity between K-Means and SOM is that both are used to cluster data.

3.1.2 SOMs Intuition

Assume the following input and output. The input has three features (columns), while the output has multiple neurons. The output neurons are arranged in a 2D grid, and each neuron is connected to the input data. The output neurons are trained to represent the input data in a low-dimensional space.

3.1.3 How SOMs Learn

Let’s highlight the first neuron in the output layer, which is connected to the three input neurons. Unlike in an ANN, CNN, or RNN, the synapses here are not weights that are multiplied by the inputs and summed up.

There is no activation function; the weights are instead a characteristic of the output neuron itself and act as its coordinates. Initially, these weights are assigned at random, hence each node has its own imaginary place in the input space.

Then we can do the same for the remaining nodes: Node2, Node3, etc.

Now, assume we input the values of Row 1. We need to calculate the distance between the input and the weights of each output neuron using the Euclidean distance \(\text{Distance} = \sqrt{\sum_{i}(x_i - w_{1,i})^2}\). Assume that for Node1 we get 1.2 (we should get values close to 1 since the inputs are either standardized or normalized), for Node2 we get 0.8, for Node3 we get 0.4, and so on. We can see that Node3 is the closest, 3x closer than Node1. Hence, we can say that Node3 is the Best Matching Unit (BMU).
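Finding the BMU can be sketched in a few lines. The row and the node weights below are made-up numbers for three input features and three output nodes; `best_matching_unit` is a hypothetical helper.

```python
import math


def euclidean(row, node_weights):
    """Euclidean distance between one input row and the weights of one node."""
    return math.sqrt(sum((x - w) ** 2 for x, w in zip(row, node_weights)))


def best_matching_unit(row, all_weights):
    """Return the index of the closest node together with all distances."""
    distances = [euclidean(row, w) for w in all_weights]
    bmu = min(range(len(distances)), key=distances.__getitem__)
    return bmu, distances


row = [0.2, 0.6, 0.9]                 # one scaled input row (assumed values)
all_weights = [[0.9, 0.1, 0.1],       # Node1
               [0.5, 0.5, 0.5],       # Node2
               [0.3, 0.6, 0.8]]       # Node3
bmu, distances = best_matching_unit(row, all_weights)
print("BMU is node index", bmu, "with distances", distances)
```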

Then we can update the weights of the BMU and its neighbors. The update is done by moving the weights of the BMU and its neighbors closer to the input data. The amount of movement is determined by a learning rate, which decreases over time. The radius of affected neighbors is determined by a neighborhood function, which also decreases over time. The learning rate and neighborhood function are used to control the amount of change in the weights during training.
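The update step can be sketched as follows. The Gaussian neighborhood used here is an assumption (one common choice of neighborhood function), and `pull_towards` and the grid layout are hypothetical.

```python
import math


def pull_towards(weights, grid_positions, row, bmu, learning_rate, radius):
    """Move every node towards `row`; nodes near the BMU on the grid move most."""
    bmu_x, bmu_y = grid_positions[bmu]
    updated = []
    for node_weights, (x, y) in zip(weights, grid_positions):
        grid_distance_sq = (x - bmu_x) ** 2 + (y - bmu_y) ** 2
        # Gaussian neighborhood: 1 at the BMU, fading with distance on the grid
        influence = math.exp(-grid_distance_sq / (2 * radius ** 2))
        updated.append([w + learning_rate * influence * (value - w)
                        for w, value in zip(node_weights, row)])
    return updated


grid_positions = [(0, 0), (1, 0), (2, 0)]          # three nodes in a row
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
new_weights = pull_towards(weights, grid_positions, row=[1.0, 1.0],
                           bmu=0, learning_rate=0.5, radius=1.0)
print(new_weights)
```

The BMU moves half of the way towards the row (learning rate 0.5), and its neighbors move less the further they sit from it on the grid.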

Let’s assume Row 2 is the next input and it finds its own BMU. Then we update the weights of that BMU and its neighbors again.

Now we have both a green and a blue BMU, and they fight with each other over any point that is close to them. Assume that we also have a red BMU at this point. It might not even fall within the green BMU’s radius, hence it is pulled much harder by the blue BMU and therefore becomes more like the blue BMU.

Now if a new input has its BMU in the area in-between blue and green, that BMU will have weights that are in-between those of blue and green.

After all of this, the fight might look like this

In the first epoch, the radius will be bigger and the learning rate will be higher.

As the epochs go by, the radius will be smaller and the learning rate will be smaller. Hence, the BMU will be updated less than the first epoch. The same goes for the third epoch and so on. Hence, the process is more and more accurate as we go through the dataset again and again.
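The shrinking radius and learning rate can be sketched with a decay schedule. Exponential decay is assumed here as one possible schedule; the starting values and the time constant are made up.

```python
import math


def decayed(initial_value, epoch, time_constant):
    """Exponentially shrink a starting value as the epochs go by."""
    return initial_value * math.exp(-epoch / time_constant)


for epoch in range(0, 30, 10):
    learning_rate = decayed(0.5, epoch, time_constant=10.0)
    radius = decayed(3.0, epoch, time_constant=10.0)
    print(f"epoch {epoch}: learning rate {learning_rate:.3f}, radius {radius:.3f}")
```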

After repeated epochs, then our SOM might look like this

Some important notes to know:

- SOMs retain the topology of the input set: similar input rows are mapped to nearby neurons in the output layer, so the map follows the topology of the input space as closely as it can.
- SOMs reveal correlations that are not easily identified: rows that end up close to each other on the map are related, even when that relation is hard to spot across many columns.
- SOMs classify data without supervision: the map does not require any labeled data to group the input data.
- SOMs have no target feature (Y) to compare against, hence there is no error to backpropagate.
- There are no lateral connections between output nodes: the only interaction between output nodes is the BMU radius pulling neighboring nodes along, not an NN-type connection with activation functions etc.

3.1.4 Building a SOM

3.1.4.1 SOM Data Preprocessing

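# Step 1 - Data Import
# dataset is assumed to be a pandas DataFrame loaded beforehand,
# with the class label (0 or 1) in the last column
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values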

# Step 2 - Feature Scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
X = sc.fit_transform(X)

3.1.4.2 SOM Model Building

# Step 1 - Training the SOM
from minisom import MiniSom
som = MiniSom(x = 10, y = 10, input_len = 15, sigma = 1.0, learning_rate = 0.5)
som.random_weights_init(X)
som.train_random(data = X, num_iteration = 100)

# Step 2 - Visualizing the Results
from pylab import bone, pcolor, colorbar, plot, show
bone()
pcolor(som.distance_map().T)
colorbar()
markers = ['o', 's']
colors = ['r', 'g']
for i, x in enumerate(X):
    w = som.winner(x)
    plot(w[0] + 0.5,
         w[1] + 0.5,
         markers[y[i]],
         markeredgecolor = colors[y[i]],
         markerfacecolor = 'None',
         markersize = 10,
         markeredgewidth = 2)
show()

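# Step 3 - Finding the Frauds
import numpy as np
mappings = som.win_map(X)
# (8, 1) and (6, 9) are assumed to be the outlier cells read off the distance map above;
# the coordinates change from one training run to the next
map_8_1 = np.array(mappings[(8, 1)])
map_6_9 = np.array(mappings[(6, 9)])
print("map_8_1 shape:", map_8_1.shape)
print("map_6_9 shape:", map_6_9.shape)
frauds = np.concatenate((map_8_1, map_6_9), axis = 0)
frauds = sc.inverse_transform(frauds)
print(frauds)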

3.2 The Boltzmann Machine

3.2.1 The Boltzmann Machine Intuition

ANN, CNN, RNN, and SOM all have a direction: the data flows from the input nodes to the output (or output map). The Boltzmann machine differs from them in that it is an undirected model where the input and output are not separated.

A Boltzmann machine has these characteristics:

It has visible (input) nodes and hidden nodes, but there is no output layer
Every single node is connected to every other node
There is no direction in these connections; they are all bidirectional

If we look at it, we can notice that the input nodes are all connected to each other, which is unusual since the input data are given and connecting them to each other does not seem necessary. The reason is that the Boltzmann machine is a generative model, meaning that it can generate new data points that are similar to the training data. The input nodes are connected to each other to allow the model to learn the relationships between the input features.

Geoffrey Hinton once gave the example of a nuclear power plant.

In this example, we are measuring a lot of things such as:

The pressure of the pump
The temperature of the water
The temperature of the reactor

However, there are also things that we do not measure such as:

The wind speed
The soil moisture

All of these parameters work together to form one system: the nuclear power plant. The Boltzmann machine tries to learn the relationships between these parameters and how they work together, where the visible nodes are the things we can and do measure, and the hidden nodes are the things we cannot or do not measure.

Instead of waiting for us to input values into the visible nodes, the Boltzmann machine is able to generate values for all of the nodes on its own; it does not need any inputs. For instance, it generates a state where the wind is at a certain speed, then another state where the soil is at a certain humidity, etc. That is what makes the Boltzmann machine a probabilistic instead of a deterministic model.

3.2.2 How the Boltzmann Machine Learns

We input our thousands of rows of data into the visible nodes
The whole machine adjusts its weights with each input so that it comes to resemble “our system”.
The machine learns all the possible connections and constraints between all of these parameters to know what is normal
Once we have a Boltzmann machine that is aware of what is normal, we can input a new data point and the Boltzmann machine will be able to tell us whether this data point is normal or abnormal
This is useful because supervised learning would need data for both the normal and the abnormal case. Here, we only need data for the normal case, and the Boltzmann machine will still be able to tell us whether a new data point is normal or not. Data for the abnormal case, e.g. a nuclear meltdown, is unlikely to be available.

3.2.3 The Boltzmann Machine as an Energy-Based Model

The Boltzmann machine is an energy-based model, which is based on the Boltzmann distribution. The energy of a configuration is a measure of how well the configuration fits the data. The lower the energy, the better the fit. The Boltzmann machine learns to minimize the energy of the configuration by adjusting the weights of the connections between the nodes.

Assume that we are in a room with air molecules spread evenly in the room

In theory, the molecules have no obligation to be spread evenly in the room; they could also all be in one corner. There is nothing preventing that from happening. Statistically they could end up in the corner, but the probability of that happening is very low: a state with all the molecules concentrated in the corner has a very high energy, and in the Boltzmann distribution the higher the energy of a state, the lower its probability.

The Boltzmann machine is designed so that once all the weights have been trained, the energy of a configuration is low when the configuration is similar to the training data and high when it is dissimilar to the training data.
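The link between energy and probability can be sketched with the Boltzmann distribution, \(p_i = e^{-E_i / T} / \sum_j e^{-E_j / T}\). In this minimal sketch the Boltzmann constant is folded into the temperature `T`, and the three energies are made-up numbers.

```python
import math


def boltzmann_probabilities(energies, temperature=1.0):
    """Turn the energies of a set of states into probabilities."""
    unnormalized = [math.exp(-energy / temperature) for energy in energies]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


# Three made-up states: evenly spread, slightly clumped, everything in one corner
energies = [0.5, 1.0, 5.0]
probabilities = boltzmann_probabilities(energies)
print(probabilities)
```

The state with the highest energy receives the lowest probability, which is why the molecules are almost never found in the corner.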

3.2.4 The Boltzmann Machine Variations

3.2.4.1 The Restricted Boltzmann Machine (RBM) Intuition

In the standard or full Boltzmann machine, all the nodes are connected to each other. This means that the model is fully connected and can learn complex relationships between the input features. However, this also means that the model is computationally expensive and difficult to train. Therefore, a modified architecture was proposed under the name Restricted Boltzmann Machine (RBM), where every visible (input) node is connected to every hidden node, but there are no connections between visible nodes or between hidden nodes.

3.2.4.1.1 The Restricted Boltzmann Machine (RBM) Example

Assume we are doing movie recommendation based on the user’s viewing history

The user has watched the following movies and rated them accordingly

For the “Drama” hidden node, the watched movies are Forrest Gump and Titanic and the user liked both. Hence, it lights up green.

For the “Action” hidden node, the watched movies are The Matrix and Pulp Fiction and the user did not like either. Hence, it lights up red.

For the “Dicaprio” hidden node, the watched movie is Titanic and the user liked it. Hence, it lights up green.

For the “Oscar” hidden node, the watched movies are Forrest Gump and Titanic and the user liked both. Hence, it lights up green.

For the “Tarantino” hidden node, the watched movie is Pulp Fiction and the user did not like it. Hence, it lights up red.

Now, based on all these results, the Boltzmann machine will try to reconstruct our input. Assume we input a new movie, “Fight Club”, with a category of Drama only. The Drama node is currently red, so we can assume that the user will not like Fight Club.

Another example is the movie “The Departed”, which is connected to Drama, Action, Dicaprio, and Oscar. The weight to Tarantino is low or insignificant. With 3 yes and 1 no, it can be assumed that the user might like it.

3.2.4.1.2 How the Restricted Boltzmann Machine (RBM) Learns

We have discussed how to supply input to the Restricted Boltzmann Machine, how it looks at the input, finds features, and assigns certain nodes to them to improve the overall system. But we still do not know how the RBM adjusts its weights, since we do not have a target output to run gradient descent against as in supervised learning.

The RBM uses an algorithm called Contrastive Divergence to learn, i.e. to adjust its weights. The steps are as follows:

We assume the following initial values
- Visible nodes:
  - \(V_1\) = 0.5
  - \(V_2\) = 0.3
  - \(V_3\) = 0.7
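- Weight matrix (rows are visible nodes, columns are hidden nodes) \[ \begin{bmatrix} 0.1 & 0.2 & -0.1 & 0.3 \\ 0.3 & -0.2 & 0.4 & -0.1 \\ -0.3 & 0.1 & 0.5 & 0.2 \\ \end{bmatrix} \] It reads as follows (only some of the rows and columns are spelled out):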
  - Row 1 (Visible node 1) connects to:
    - Hidden node 1 with weight 0.1
    - Hidden node 2 with weight 0.2
    - Hidden node 3 with weight -0.1
    - Hidden node 4 with weight 0.3
  - Row 2 (Visible node 2) connects to:
    - Hidden node 1 with weight 0.3
    - Hidden node 2 with weight -0.2
    - Hidden node 3 with weight 0.4
    - Hidden node 4 with weight -0.1
  - Column 1 (Hidden node 1) receives connections from:
    - Visible node 1 with weight 0.1
    - Visible node 2 with weight 0.3
    - Visible node 3 with weight -0.3
  - Column 2 (Hidden node 2) receives connections from:
    - Visible node 1 with weight 0.2
    - Visible node 2 with weight -0.2
    - Visible node 3 with weight 0.1
- Hidden biases:
  - \(b_1\) = 0.1
  - \(b_2\) = -0.2
  - \(b_3\) = 0.3
  - \(b_4\) = -0.1
- Visible biases:
  - \(a_1\) = -0.1
  - \(a_2\) = 0.2
  - \(a_3\) = 0.0
Forward Pass (Visible -> Hidden)

For each hidden node, calculate input = weighted sum + bias, then apply sigmoid
- Hidden node 1:
  - Input = (0.5×0.1) + (0.3×0.3) + (0.7×-0.3) + 0.1
  - Input = 0.05 + 0.09 - 0.21 + 0.1 = 0.03
  - \(h_1\) = sigmoid(0.03) ≈ 0.5075
- Hidden node 2:
  - Input = (0.5×0.2) + (0.3×-0.2) + (0.7×0.1) + (-0.2)
  - Input = 0.1 - 0.06 + 0.07 - 0.2 = -0.09
  - \(h_2\) = sigmoid(-0.09) ≈ 0.4775
- Hidden node 3:
  - Input = (0.5×-0.1) + (0.3×0.4) + (0.7×0.5) + 0.3
  - Input = -0.05 + 0.12 + 0.35 + 0.3 = 0.72
  - \(h_3\) = sigmoid(0.72) ≈ 0.6726
- Hidden node 4:
  - Input = (0.5×0.3) + (0.3×-0.1) + (0.7×0.2) + (-0.1)
  - Input = 0.15 - 0.03 + 0.14 - 0.1 = 0.16
  - \(h_4\) = sigmoid(0.16) ≈ 0.5399

Backward Pass (Hidden -> Visible)

Now using the hidden values to reconstruct visible nodes:
- Visible node 1:
  - Input = (0.5075×0.1) + (0.4775×0.2) + (0.6726×-0.1) + (0.5399×0.3) + (-0.1)
  - Input = 0.05075 + 0.0955 - 0.06726 + 0.16197 - 0.1 = 0.14096
  - \(V_1'\) = sigmoid(0.14096) ≈ 0.5352
- Visible node 2:
  - Input = (0.5075×0.3) + (0.4775×-0.2) + (0.6726×0.4) + (0.5399×-0.1) + 0.2
  - Input = 0.15225 - 0.0955 + 0.26904 - 0.05399 + 0.2 = 0.4718
  - \(V_2'\) = sigmoid(0.4718) ≈ 0.6158
- Visible node 3:
  - Input = (0.5075×-0.3) + (0.4775×0.1) + (0.6726×0.5) + (0.5399×0.2) + 0.0
  - Input = -0.15225 + 0.04775 + 0.3363 + 0.10798 = 0.33978
  - \(V_3'\) = sigmoid(0.33978) ≈ 0.5841

Second Forward Pass (Reconstructed Visible -> Hidden)

Using the reconstructed visible values for another forward pass:
- Hidden node 1:
  - Input = (0.5352×0.1) + (0.6158×0.3) + (0.5841×-0.3) + 0.1
  - Input = 0.05352 + 0.18474 - 0.17523 + 0.1 = 0.16303
  - \(h_1'\) = sigmoid(0.16303) ≈ 0.5407
- Hidden node 2:
  - Input = (0.5352×0.2) + (0.6158×-0.2) + (0.5841×0.1) + (-0.2)
  - Input = 0.10704 - 0.12316 + 0.05841 - 0.2 = -0.15771
  - \(h_2'\) = sigmoid(-0.15771) ≈ 0.4607
- Hidden node 3:
  - Input = (0.5352×-0.1) + (0.6158×0.4) + (0.5841×0.5) + 0.3
  - Input = -0.05352 + 0.24632 + 0.29205 + 0.3 = 0.78485
  - \(h_3'\) = sigmoid(0.78485) ≈ 0.6867
- Hidden node 4:
  - Input = (0.5352×0.3) + (0.6158×-0.1) + (0.5841×0.2) + (-0.1)
  - Input = 0.16056 - 0.06158 + 0.11682 - 0.1 = 0.1158
  - \(h_4'\) = sigmoid(0.1158) ≈ 0.5289

Now, Geoffrey Hinton pointed out that we don’t actually have to go through many rounds of these passes. Taking just the first two passes is sufficient to tell how to adjust the curve. The first two passes can be called CD1.
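The passes above, followed by one weight update, can be sketched in plain Python. Like the worked example, the sketch passes the probabilities along directly instead of sampling binary states. The update rule shown (difference between the data-driven and the reconstruction-driven products) and the learning rate of 0.1 are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


V = [0.5, 0.3, 0.7]                       # visible nodes
W = [[0.1, 0.2, -0.1, 0.3],               # rows: visible nodes
     [0.3, -0.2, 0.4, -0.1],              # columns: hidden nodes
     [-0.3, 0.1, 0.5, 0.2]]
hidden_bias = [0.1, -0.2, 0.3, -0.1]
visible_bias = [-0.1, 0.2, 0.0]


def to_hidden(visible):
    """Visible -> hidden: weighted sum down each column plus the hidden bias."""
    return [sigmoid(sum(visible[i] * W[i][j] for i in range(len(visible))) + hidden_bias[j])
            for j in range(len(hidden_bias))]


def to_visible(hidden):
    """Hidden -> visible: weighted sum along each row plus the visible bias."""
    return [sigmoid(sum(hidden[j] * W[i][j] for j in range(len(hidden))) + visible_bias[i])
            for i in range(len(visible_bias))]


h0 = to_hidden(V)      # forward pass
v1 = to_visible(h0)    # backward pass (reconstruction)
h1 = to_hidden(v1)     # second forward pass

learning_rate = 0.1    # assumed value
delta_W = [[learning_rate * (V[i] * h0[j] - v1[i] * h1[j]) for j in range(4)]
           for i in range(3)]

print("h0:", [round(value, 4) for value in h0])
print("v1:", [round(value, 4) for value in v1])
print("h1:", [round(value, 4) for value in h1])
```

Each entry of `delta_W` would then be added to the matching entry of `W` before the next row of data is processed.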

Initially the curve looks like this, and we know that the ball is rolling down

Instead of letting the ball roll, we want to move the curve instead.

The system works by moving the curve to reach the minimum energy.

3.2.4.2 The Deep Belief Network (DBN) Intuition

The Deep Belief Network (DBN) is a type of deep learning model that is composed of multiple layers of RBMs.

There are two algorithms to train the DBN:

The greedy layer-wise algorithm, where each layer is trained as an RBM and then the weights are used to initialize the next layer. The process goes from the bottom to the top: the first RBM is trained on the input data, then the second RBM is trained on the output of the first RBM, and so on.
The wake-sleep algorithm, where the model is trained in two phases: the wake phase and the sleep phase.

In a DBN, the connections between the top two layers are undirected, while everything below is directed.

3.2.4.3 The Deep Boltzmann Machine (DBM) Intuition

The Deep Boltzmann Machine (DBM) is a type of deep learning model that is composed of multiple layers of Boltzmann machines. The DBM is similar to the DBN, but the two are not the same. Like the DBN, the DBM is a generative model that can generate new data points that are similar to the training data. The DBM is trained using the wake-sleep algorithm, where the model is trained in two phases: the wake phase and the sleep phase.

The difference is that in a DBM, all of the layers are undirected

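3.2.5 Building the Boltzmann Machine

# Step 1 - Create the Architecture
import torch
# training_set and test_set (used below) are assumed to be torch tensors with one row
# per user and one column per visible node; negative values mark missing ratings
class RBM():
    def __init__(self, nv, nh):
        self.W = torch.randn(nh, nv)   # weights, one row per hidden node
        self.a = torch.randn(1, nh)    # bias of the hidden nodes
        self.b = torch.randn(1, nv)    # bias of the visible nodes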
    def sample_h(self, x):
        wx = torch.mm(x, self.W.t())
        activation = wx + self.a.expand_as(wx)
        p_h_given_v = torch.sigmoid(activation)
        return p_h_given_v, torch.bernoulli(p_h_given_v)
    def sample_v(self, y):
        wy = torch.mm(y, self.W)
        activation = wy + self.b.expand_as(wy)
        p_v_given_h = torch.sigmoid(activation)
        return p_v_given_h, torch.bernoulli(p_v_given_h)
    def train(self, v0, vk, ph0, phk):
        self.W += torch.mm(v0.t(), ph0).t() - torch.mm(vk.t(), phk).t()
        self.b += torch.sum((v0 - vk), 0)
        self.a += torch.sum((ph0 - phk), 0)
nv = len(training_set[0])
nh = 100
batch_size = 100
rbm = RBM(nv, nh)

# Step 2 - Training the RBM
nb_epoch = 100
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for id_user in range(0, nb_users - batch_size, batch_size):
        vk = training_set[id_user:id_user+batch_size]
        v0 = training_set[id_user:id_user+batch_size]
        ph0,_ = rbm.sample_h(v0)
        for k in range(10):
            _,hk = rbm.sample_h(vk)
            _,vk = rbm.sample_v(hk)
            vk[v0<0] = v0[v0<0]
        phk,_ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        train_loss += torch.mean(torch.abs(v0[v0>=0] - vk[v0>=0]))
        s += 1.
    print('epoch: '+str(epoch)+' loss: '+str(train_loss/s))
    
# Step 3 - Testing the RBM
test_loss = 0
s = 0.
for id_user in range(nb_users):
    v = training_set[id_user:id_user+1]
    vt = test_set[id_user:id_user+1]
    if len(vt[vt>=0]) > 0:
        _,h = rbm.sample_h(v)
        _,v = rbm.sample_v(h)
        test_loss += torch.mean(torch.abs(vt[vt>=0] - v[vt>=0]))
        s += 1.
print('test loss: '+str(test_loss/s))

3.3 Auto Encoder

3.3.1 Auto Encoder Intuition

An Auto Encoder is a type of neural network that is used to learn efficient representations of data, typically for the purpose of dimensionality reduction or feature learning. The network consists of two main parts: an encoder and a decoder. The encoder maps the input data to a lower-dimensional representation, while the decoder maps the lower-dimensional representation back to the original input space, aiming for the output to be as close as possible to the input.

The Auto Encoder is not a purely unsupervised method; it is a type of self-supervised deep learning. With both the SOM and the Boltzmann machine, we do not have an output to compare against, but the Auto Encoder has a target output, which is the same as the input. Another difference is that the Auto Encoder is not trying to learn the topology of the input space, but rather a compressed representation of the input data.

Auto Encoder is useful for:

Feature detection: once we encode the data, the hidden nodes will represent important features.
Recommender system: we can use the hidden nodes to represent the user and item features, and then use these features to make recommendations.
Encoding: we can use the hidden nodes to represent the data in a lower-dimensional space, which can be useful for visualization or for training other models.

Assume the following input and output

The weights are +1 for the blue connections and -1 for the black connections

Assume we have the following input: the person likes only Movie 1 and dislikes all the others. Multiply the input by the weights and put the results into the hidden nodes.

Now we have to decode the hidden nodes to get the output. Multiply the hidden nodes by the weights and put the results into the output nodes. Those are only preliminary outputs; we need to apply the activation function (softmax) to get the final output.

After softmax, we can see that the output is the same as the input. It shows that the input can be compressed into just two hidden nodes and still be reconstructed, as long as we have the hidden nodes and the weights.
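The encode and decode steps can be sketched with four movies and two hidden nodes. The exact +1/-1 pattern below is an assumption chosen so that each movie gets a distinct code; it is not read off the figure. The final step takes the movie with the highest softmax probability as the reconstructed “liked” movie.

```python
import math

# Assumed +1 / -1 weights: one row per hidden node, one column per movie
ENCODER = [[1, 1, -1, -1],
           [1, -1, 1, -1]]
# Decoder weights: one row per output node, one column per hidden node
DECODER = [[1, 1], [1, -1], [-1, 1], [-1, -1]]


def softmax(values):
    shifted = [math.exp(v - max(values)) for v in values]  # shift for numerical stability
    total = sum(shifted)
    return [s / total for s in shifted]


def reconstruct(ratings):
    hidden = [sum(w * x for w, x in zip(row, ratings)) for row in ENCODER]
    preliminary = [sum(w * h for w, h in zip(row, hidden)) for row in DECODER]
    return hidden, preliminary, softmax(preliminary)


hidden, preliminary, probabilities = reconstruct([1, 0, 0, 0])  # likes only Movie 1
liked = probabilities.index(max(probabilities))
print("hidden:", hidden, "preliminary:", preliminary, "liked movie index:", liked)
```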

We might also have a bias represented as either the following:

3.3.2 Auto Encoder Learns

The Auto Encoder learns by minimizing the difference between the input and output. This is done by adjusting the weights and biases of the network using a loss function, typically mean squared error (MSE) or binary cross-entropy. The loss function measures the difference between the input and output, and the network adjusts its weights and biases to minimize this difference.
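The two loss functions mentioned above can be sketched as follows. The input and output vectors are made-up numbers, and the small `eps` clamp is an assumption added to avoid taking the logarithm of zero.

```python
import math


def mean_squared_error(inputs, outputs):
    return sum((x - y) ** 2 for x, y in zip(inputs, outputs)) / len(inputs)


def binary_cross_entropy(inputs, outputs, eps=1e-12):
    total = 0.0
    for x, y in zip(inputs, outputs):
        y = min(max(y, eps), 1.0 - eps)  # keep the logarithm finite
        total -= x * math.log(y) + (1.0 - x) * math.log(1.0 - y)
    return total / len(inputs)


inputs = [1.0, 0.0, 0.0, 0.0]
good_output = [0.9, 0.1, 0.0, 0.0]
bad_output = [0.2, 0.7, 0.6, 0.5]
print(mean_squared_error(inputs, good_output), mean_squared_error(inputs, bad_output))
print(binary_cross_entropy(inputs, good_output), binary_cross_entropy(inputs, bad_output))
```

Both losses are smaller for the output that is closer to the input, which is the quantity the network tries to minimize.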

AE representation of the inputs and outputs

1. We start with the input in a format where rows correspond to users, columns correspond to movies, and the values are the ratings.
2. The data is inserted user-by-user into the Auto Encoder
3. The input is encoded into a lower dimension with an activation function
4. Z, the hidden nodes, are decoded into the output vector
5. Calculate the error between the output and the input
6. The error is backpropagated from the output layer back through the hidden nodes, and the weights are updated
7. Repeat 1 to 6, updating after every observation or every batch of observations
8. Repeat for more epochs

3.3.3 Overcomplete Hidden Layers

The Auto Encoder can have more hidden nodes than input nodes. This is called an overcomplete hidden layer. It can be useful for feature learning or extraction, since the extra hidden nodes give the Auto Encoder room to extract more features from the input data.

But the problem is that, because hidden nodes > input nodes, the Auto Encoder can cheat: each input can simply be passed straight through its own hidden node to the output. The Auto Encoder then copies the input data instead of learning the underlying patterns, and the remaining hidden nodes are left unused. The following section discusses how to overcome this problem.

3.3.4 Auto Encoders Variants

3.3.4.1 Sparse Auto Encoder Intuition

The Sparse Auto Encoder is a type of Auto Encoder that uses a sparsity constraint on the hidden nodes. This means that only a small number of hidden nodes are activated at any given time, while the rest are inactive. This forces the Auto Encoder to learn a more compact representation of the input data and helps to prevent overfitting. The turned-off hidden nodes output a low value close to 0.
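One way to express the sparsity constraint is a penalty on the hidden activations that is added to the reconstruction loss. The sketch below assumes an L1 penalty and a weight of 0.1; both are illustrative choices, and `sparse_loss` is a hypothetical helper.

```python
def sparse_loss(inputs, outputs, hidden, sparsity_weight=0.1):
    """Reconstruction error plus a penalty for every active hidden node."""
    reconstruction = sum((x - y) ** 2 for x, y in zip(inputs, outputs)) / len(inputs)
    penalty = sum(abs(h) for h in hidden) / len(hidden)
    return reconstruction + sparsity_weight * penalty


inputs = [1.0, 0.0, 1.0]
outputs = [0.9, 0.1, 0.8]
few_active = [0.9, 0.0, 0.0, 0.0, 0.0]    # most hidden nodes close to 0
many_active = [0.9, 0.8, 0.7, 0.9, 0.6]   # same reconstruction, dense code
print(sparse_loss(inputs, outputs, few_active), sparse_loss(inputs, outputs, many_active))
```

For the same reconstruction quality, the code with fewer active hidden nodes has the lower loss, so training is pushed towards it.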

3.3.4.2 Denoising Auto Encoder Intuition

Denoising Auto Encoder is a type of Auto Encoder that is trained to reconstruct the input data from a corrupted version of the input where we turn some of them to 0. This means that the Auto Encoder is trained to ignore the noise in the input data and learn the underlying patterns. In the end, the AE compares the output nodes with the original input nodes and not the corrupted input nodes.
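The corruption step can be sketched as follows. The drop probability of 0.4 and the fixed random seed are assumptions for illustration; `corrupt` is a hypothetical helper.

```python
import random


def corrupt(inputs, drop_probability, rng):
    """Turn each input value to 0 with probability `drop_probability`."""
    return [0.0 if rng.random() < drop_probability else value for value in inputs]


rng = random.Random(0)                 # fixed seed so that the sketch is repeatable
original = [1.0, 0.5, 0.8, 0.3, 0.9]
noisy = corrupt(original, drop_probability=0.4, rng=rng)

# The network would receive `noisy` as input,
# but its output would be compared against `original`.
print(original, "->", noisy)
```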

3.3.4.3 Contractive Auto Encoder Intuition

The Contractive Auto Encoder is a type of Auto Encoder that is trained to learn a robust representation of the input data by adding a penalty term to the loss function during backpropagation. This penalty term encourages the Auto Encoder to learn a representation that is invariant to small changes in the input data. The penalty term is based on the Jacobian matrix of the encoder function, which measures how much the output of the encoder changes with respect to small changes in the input.
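For an encoder with a sigmoid activation, \(h = \text{sigmoid}(Wx + b)\), each entry of the Jacobian is \(h_j (1 - h_j) W_{j,i}\), so the squared penalty can be computed directly. The sketch below assumes exactly this one-layer sigmoid encoder; the weights and inputs are made-up numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, W, b):
    """Hidden values of a one-layer sigmoid encoder (W has one row per hidden node)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def contractive_penalty(x, W, b):
    """Sum of the squared Jacobian entries of the encoder evaluated at x."""
    hidden = encode(x, W, b)
    return sum((h * (1.0 - h)) ** 2 * sum(w * w for w in row)
               for h, row in zip(hidden, W))


W = [[1.0, 0.0]]
print(contractive_penalty([0.0, 0.0], W, b=[0.0]))   # sensitive hidden node
print(contractive_penalty([0.0, 0.0], W, b=[10.0]))  # saturated hidden node
```

A hidden node that barely reacts to changes in the input contributes almost nothing to the penalty, which is the behaviour the penalty rewards.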

3.3.4.4 Stacked Auto Encoder Intuition

The Stacked Auto Encoder is a type of Auto Encoder that consists of multiple hidden layers stacked on top of each other. Each layer is trained as an Auto Encoder, and the output of one layer is used as the input to the next layer. This allows the Stacked Auto Encoder to learn a hierarchical representation of the input data, where each layer learns to extract more complex features from the input data.

3.3.4.5 Deep Auto Encoder Intuition

The Deep Auto Encoder also consists of multiple hidden layers stacked on top of each other, which allows it to learn a hierarchical representation of the input data, where each layer learns to extract more complex features from the input data. However, a Stacked AE is not the same as a Deep AE: the Deep AE is related to the Deep Belief Network (DBN).

DBNs are used to pre-train Deep Auto Encoders:

Train a stack of RBMs layer by layer (forming a DBN)
Use the weights from this DBN to initialize both the encoder and decoder portions of a Deep Auto Encoder
Then “fine-tune” the entire Deep Auto Encoder using backpropagation

This pre-training approach allowed Deep Auto Encoders to learn much better representations than was previously possible with random initialization.