Stats 102B Discussion Section Week 5

Motivation & Problem Setup

  • The Data: We are working with a dataset defined as \(\{(x_i, y_i)\}_{i=1}^N\).

  • Features and Labels: The input features are vectors \(x_i \in \mathbb{R}^p\), and the labels indicate binary classes where \(y_i \in \{0,1\}\).

  • Our Goal: We need to train a model \(f(x; \theta)\) to accurately predict the probability \(\mathbb{P}(y_i = 1 | x_i)\).

  • Modeling Approaches: While logistic regression is a standard approach for this problem, the Multilayer Perceptron (MLP) provides a powerful alternative.

Visualizing a Single Layer MLP

```mermaid
graph LR
    subgraph Input Layer
    x1((x_1))
    x2((x_2))
    x3((x_3))
    end

    subgraph Hidden Layer
    h1((h_1))
    h2((h_2))
    h3((h_3))
    h4((h_4))
    end

    subgraph Output Layer
    y((y_hat))
    end

    x1 --> h1 & h2 & h3 & h4
    x2 --> h1 & h2 & h3 & h4
    x3 --> h1 & h2 & h3 & h4

    h1 & h2 & h3 & h4 --> y
```

Parts of an MLP

  • Input Layer: Takes in the features \(x_1, x_2, x_3\).
  • Hidden Layer: Intermediate nodes \(h_1, h_2, h_3, h_4\) process the inputs.
  • Output Layer: Produces the final prediction \(\hat{y}\).

MLP Architecture Details

  • Input: \(x \in \mathbb{R}^p\) (treated as a column vector in our derivations).
  • Hidden Layer: Calculated as \(h = \sigma(W_1x + b_1)\).
    • Weights: \(W_1 \in \mathbb{R}^{q \times p}\).
    • Biases: \(b_1 \in \mathbb{R}^q\). Output dimension: \(h \in \mathbb{R}^q\).
  • Output Layer: Calculated as \(\hat{y} = \sigma(W_2h + b_2)\).
    • Weights: \(W_2 \in \mathbb{R}^{1 \times q}\).
    • Bias: \(b_2 \in \mathbb{R}\).
    • Output dimension: \(\hat{y} \in \mathbb{R}\).
  • Activation \(\sigma(\cdot)\): We typically use ReLU for the hidden layer and a sigmoid function for the output layer.
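As a quick illustration, here is a minimal R sketch of these two activation functions (the function names `relu` and `sigmoid` are ours):

```r
# Minimal sketch of the two activation functions used in this MLP.
relu <- function(t) pmax(0, t)            # element-wise max(0, t)
sigmoid <- function(z) 1 / (1 + exp(-z))  # maps any real z into (0, 1)

relu(c(-2, 0, 3))   # returns 0 0 3
sigmoid(0)          # returns 0.5
```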

Objective Function: Logistic Regression to MLP

  • Recall Logistic Regression: The objective function we minimized was \(-l(\beta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i(x_i^\top \beta) - \log(1 + \exp(x_i^\top \beta))]\).
  • MLP Binary Classification: We use the Binary Cross-Entropy (BCE) loss function, defined as \(\mathcal{L}_{BCE} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\).
  • Variables: \(y \in \{0,1\}\) is the true label, and \(\hat{y} \in (0,1)\) is the predicted probability.

Intuition Behind BCE Loss

  • When \(y = 1\): We want our prediction \(\hat{y}\) to be close to 1, making the loss small because \(\log(\hat{y})\) approaches 0.
  • When \(y = 0\): We want our prediction \(\hat{y}\) to be close to 0, making the loss small because \(\log(1-\hat{y})\) approaches 0.
  • The Penalty: BCE heavily penalizes predictions that are highly confident but entirely incorrect (e.g., predicting \(\hat{y} = 0.01\) when the true label is \(y = 1\)).
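A small R sketch makes the penalty concrete (the helper `bce` is ours):

```r
# BCE loss for a single (label, predicted probability) pair.
bce <- function(y, y_hat) -(y * log(y_hat) + (1 - y) * log(1 - y_hat))

bce(1, 0.99)  # ~0.01: confident and correct, so the loss is small
bce(1, 0.01)  # ~4.61: confident but wrong, so the loss is large
```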

The Optimization Problem

  • Full BCE Loss: \(\mathcal{L}_{BCE}(\hat{y}(\theta), y) = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\).
  • Objective Function: We want to minimize the average loss over our sample of size \(m\). \[ \min_\theta \mathcal{L}_{BCE}(\theta) = \min_\theta -\frac{1}{m}\sum_{i=1}^{m}[y_i \log(\hat{y}_i(\theta)) + (1-y_i)\log(1-\hat{y}_i(\theta))] \]
  • Model Parameters: For a single layer MLP, we are optimizing \(\theta = (W_1, b_1, W_2, b_2)\).
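As a sketch, the averaged objective is a one-liner in R when \(y\) and \(\hat{y}\) are vectors of length \(m\) (the function name is ours):

```r
# Average BCE over the sample, matching the objective above.
bce_objective <- function(y, y_hat) {
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

bce_objective(c(1, 0, 1), c(0.9, 0.2, 0.7))  # ~0.228
```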

Training an MLP

Training a neural network requires two main phases:
1. Forward Pass - Computes the predictions from the inputs.
- Data flows sequentially: input \(\rightarrow\) hidden layers \(\rightarrow\) output.
- The loss is calculated by comparing these predictions against the actual targets.
2. Backward Pass - Computes the gradients of the loss function with respect to the parameters \(\theta\) using the chain rule.
- These gradients are then used to update the weights via an iterative optimization algorithm (like SGD or Adam).

Forward Pass

For a single observation, the computation flows as follows:

  • Hidden Layer Pre-activation: \(a = W_1x + b_1\).

  • Hidden Layer Activation: \(h = \sigma(a)\), where \(\sigma(\cdot)\) here denotes the ReLU function \(\sigma(t) = \max(0, t)\), applied element-wise.

  • Output Layer Pre-activation: \(z = W_2h + b_2\).

  • Output Layer Activation: \(\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\), where \(\sigma\) now denotes the sigmoid.

  • Loss Computation: \(\mathcal{L}_{BCE} = -(y \log\hat{y} + (1-y)\log(1-\hat{y}))\).
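Here is a toy R sketch of this forward pass for one observation, with hypothetical dimensions \(p = 3\) and \(q = 4\) and randomly initialized parameters:

```r
set.seed(1)
p <- 3; q <- 4
x  <- c(0.5, -1.2, 0.3)                     # input (p x 1)
W1 <- matrix(rnorm(q * p, sd = 0.1), q, p)  # q x p
b1 <- rep(0, q)                             # q x 1
W2 <- matrix(rnorm(q, sd = 0.1), 1, q)      # 1 x q
b2 <- 0                                     # scalar

a     <- as.vector(W1 %*% x + b1)  # hidden pre-activation (q x 1)
h     <- pmax(0, a)                # ReLU activation
z     <- as.vector(W2 %*% h + b2)  # output pre-activation (scalar)
y_hat <- 1 / (1 + exp(-z))         # sigmoid activation

y    <- 1                          # true label
loss <- -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
```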

Backward Pass

To apply Gradient Descent (or variants like SGD/Adam), we must calculate the gradient of the BCE loss with respect to our model parameters: \(\nabla_\theta \mathcal{L}_{BCE}\).

Specifically, we need to find:

  • \(\nabla_{W_1} \mathcal{L}_{BCE}\)

  • \(\nabla_{b_1} \mathcal{L}_{BCE}\)

  • \(\nabla_{W_2} \mathcal{L}_{BCE}\)

  • \(\nabla_{b_2} \mathcal{L}_{BCE}\)

Because these gradients depend on intermediate quantities computed during the forward pass, we calculate them starting from the output layer and moving backward to the input layer.

Backward Pass: Output Activation

  • Derivative w.r.t Output: \(\frac{d\mathcal{L}}{d\hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\).
  • Derivative of Sigmoid: Since \(\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}\), algebra shows \(\frac{d\hat{y}}{dz} = \hat{y}(1-\hat{y})\).
  • Applying the Chain Rule: \[ \frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \]

Backward Pass: Output Activation (cont.)

\[ \frac{d\mathcal{L}}{dz} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right) \cdot \hat{y}(1-\hat{y}) \]

  • Expanding the product: \(-y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y\).
  • Simplified Result: \(\frac{d\mathcal{L}}{dz} = \hat{y} - y\).

Backward Pass: Hidden Layer

We continue propagating the error backward to the hidden layer \(h\).

  • Recall: \(z = W_2h + b_2\), which implies \(\frac{\partial z}{\partial h} = W_2^\top\).

  • Chain Rule:\[ \frac{\partial\mathcal{L}}{\partial h} = \frac{d\mathcal{L}}{dz} \cdot \frac{\partial z}{\partial h} \]

  • Result:\[ \frac{\partial\mathcal{L}}{\partial h} = (\hat{y}-y) \cdot W_2^\top \]

Backward Pass: Pre-Activation

Recall: \(h = \text{ReLU}(a)\) and \(a = W_1x + b_1\).

  • The derivative of the ReLU function is applied element-wise: \[ \frac{\partial h}{\partial a} = 1_{a>0} \] (A vector of ones where \(a_j > 0\) and zeros otherwise).

  • Chain Rule: The gradient of the loss with respect to \(a\) uses the element-wise (Hadamard) product \(\odot\): \[ \frac{\partial\mathcal{L}}{\partial a} = \frac{\partial\mathcal{L}}{\partial h} \odot 1_{a>0} \]

Backward Pass: Input Parameters

  • Recall: \(a = W_1x + b_1\).

  • We use the chain rule one final time to get the gradients for our input weights and biases:

  • For \(W_1\):\[ \frac{\partial\mathcal{L}}{\partial W_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial W_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot x^\top \]

  • For \(b_1\):\[ \frac{\partial\mathcal{L}}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial a} \]

  • These calculated gradients are then used to update the network parameters via gradient descent.
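The whole chain fits in a few lines of R. This sketch repeats the toy forward pass from earlier (same hypothetical dimensions \(p = 3\), \(q = 4\)) and then computes all four gradients:

```r
# Toy setup, as in the forward-pass sketch.
set.seed(1)
x  <- c(0.5, -1.2, 0.3); y <- 1
W1 <- matrix(rnorm(12, sd = 0.1), 4, 3); b1 <- rep(0, 4)
W2 <- matrix(rnorm(4, sd = 0.1), 1, 4);  b2 <- 0
a <- as.vector(W1 %*% x + b1); h <- pmax(0, a)
z <- as.vector(W2 %*% h + b2); y_hat <- 1 / (1 + exp(-z))

# Backward pass, following the derivation above.
dz  <- y_hat - y             # dL/dz (scalar)
gW2 <- dz * t(h)             # dL/dW2 : 1 x q
gb2 <- dz                    # dL/db2 : scalar
dh  <- dz * as.vector(W2)    # dL/dh  : W2^T * dz, length q
da  <- dh * (a > 0)          # Hadamard product with the ReLU mask
gW1 <- da %*% t(x)           # dL/dW1 : q x p (outer product)
gb1 <- da                    # dL/db1 : length q
```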

Recap: Forward Pass Computations

For a single observation \(x \in \mathbb{R}^{p \times 1}\) with hidden units \(h \in \mathbb{R}^{q \times 1}\) and scalar output \(\hat{y}\):

| Step | Computation | Output Dimension |
|------|-------------|------------------|
| 1. Pre-activation (Hidden) | \(a = W_1 x + b_1\) | \((q \times 1)\) |
| 2. Activation (Hidden) | \(h = \text{ReLU}(a)\) | \((q \times 1)\) |

Recap: Forward Pass Computations (cont.)

| Step | Computation | Output Dimension |
|------|-------------|------------------|
| 3. Pre-activation (Output) | \(z = W_2 h + b_2\) | \((1 \times 1)\) |
| 4. Activation (Output) | \(\hat{y} = \sigma(z)\) | \((1 \times 1)\) |
| 5. BCE Loss | \(\mathcal{L} = -y \log(\hat{y}) - (1-y)\log(1-\hat{y})\) | \((1 \times 1)\) |

Recap: Backward Pass (Output Layer)

Starting from the loss and moving backward to the output layer parameters:

| Gradient | Computation | Dimension |
|----------|-------------|-----------|
| Output Pre-activation | \(\frac{d\mathcal{L}}{dz} = \hat{y} - y\) | \((1 \times 1)\) |
| Output Weights | \(\frac{\partial\mathcal{L}}{\partial W_2} = (\hat{y} - y)h^\top\) | \((1 \times q)\) |
| Output Bias | \(\frac{\partial\mathcal{L}}{\partial b_2} = \hat{y} - y\) | \((1 \times 1)\) |

Recap: Backward Pass (Hidden Layer)

Continuing backward to the hidden layer and input parameters:

| Gradient | Computation | Dimension |
|----------|-------------|-----------|
| Hidden Activation | \(\frac{\partial\mathcal{L}}{\partial h} = W_2^\top(\hat{y} - y)\) | \((q \times 1)\) |
| Hidden Pre-activation | \(\frac{\partial\mathcal{L}}{\partial a} = \frac{\partial\mathcal{L}}{\partial h} \odot 1_{a>0}\) | \((q \times 1)\) |
| Input Weights | \(\frac{\partial\mathcal{L}}{\partial W_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot x^\top\) | \((q \times p)\) |
| Input Bias | \(\frac{\partial\mathcal{L}}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial a}\) | \((q \times 1)\) |

The Parameter Update Step

Once all gradients are calculated via the backward pass, we apply Stochastic Gradient Descent (SGD) to update our model parameters.

For a given step size (or learning rate) \(\eta\), the parameters at iteration \(k+1\) are updated as follows:

  • Output Layer Updates: \[W_2^{k+1} = W_2^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial W_2}(W_2^k)\] \[b_2^{k+1} = b_2^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial b_2}(b_2^k)\]

The Parameter Update Step (cont.)

  • Input Layer Updates: \[W_1^{k+1} = W_1^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial W_1}(W_1^k)\] \[b_1^{k+1} = b_1^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial b_1}(b_1^k)\]
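In R, one SGD step is just four assignments. This sketch assumes the gradients `gW1`, `gb1`, `gW2`, `gb2` from the backward-pass sketch above and a hypothetical learning rate:

```r
eta <- 0.1  # step size (hypothetical value)

W1 <- W1 - eta * gW1
b1 <- b1 - eta * gb1
W2 <- W2 - eta * gW2
b2 <- b2 - eta * gb2
```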

Implementation Note: Matrix Operations in R

When programming this in R, calculating these gradients observation-by-observation using loops is highly inefficient.

For a mini-batch of size \(s\), we instead stack our inputs into a matrix \(X \in \mathbb{R}^{s \times p}\) and use matrix operations.

Forward Pass (Matrix Form):

  • \(A = X W_1^\top + 1_s b_1^\top\)

  • \(H = \text{ReLU}(A)\)

  • \(Z = H W_2^\top + b_2 1_s\)

  • \(\hat{Y} = \sigma(Z)\)

(Here \(1_s\) is a column vector of ones of length \(s\).)
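A minimal R sketch of this mini-batch forward pass, with hypothetical sizes \(s = 5\), \(p = 3\), \(q = 4\):

```r
set.seed(2)
s <- 5; p <- 3; q <- 4
X  <- matrix(rnorm(s * p), s, p)                      # s x p input batch
W1 <- matrix(rnorm(q * p, sd = 0.1), q, p); b1 <- rep(0, q)
W2 <- matrix(rnorm(q, sd = 0.1), 1, q);     b2 <- 0

A     <- X %*% t(W1) + matrix(b1, s, q, byrow = TRUE)  # s x q, adds b1 to each row
H     <- pmax(0, A)                                    # element-wise ReLU
Z     <- H %*% t(W2) + b2                              # s x 1
Y_hat <- 1 / (1 + exp(-Z))                             # s x 1 predicted probabilities
```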

Implementation Note: Matrix Backward Pass

To execute the backward pass over a mini-batch of size \(s\) without loops:

Error Terms:

  • \(\delta_2 = \hat{Y} - Y \quad \in \mathbb{R}^{s \times 1}\)

  • \(\delta_1 = (\delta_2 W_2) \odot 1_{H>0} \quad \in \mathbb{R}^{s \times q}\)

Implementation Note: Matrix Backward Pass (cont.)

Gradient Averaging:

  • \(\nabla_{W_2}\mathcal{L} = \frac{1}{s} \delta_2^\top H \quad \in \mathbb{R}^{1 \times q}\)

  • \(\nabla_{b_2}\mathcal{L} = \frac{1}{s} \sum_{i=1}^{s} \delta_2(i)\)

  • \(\nabla_{W_1}\mathcal{L} = \frac{1}{s} \delta_1^\top X \quad \in \mathbb{R}^{q \times p}\)

  • \(\nabla_{b_1}\mathcal{L} = \frac{1}{s} \sum_{i=1}^{s} \delta_1(i)\)

Here \(\delta(i)\) denotes the \(i\)-th row of \(\delta\), so the bias gradients are row averages; the transposes place each gradient in the same dimensions as the parameter it updates.
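Continuing the matrix forward-pass sketch above, the mini-batch backward pass in R (`Y` holds hypothetical labels):

```r
Y <- matrix(rbinom(s, 1, 0.5), s, 1)    # s x 1 labels (hypothetical)

delta2 <- Y_hat - Y                     # s x 1
delta1 <- (delta2 %*% W2) * (H > 0)     # s x q, Hadamard with the ReLU mask

gW2 <- t(delta2) %*% H / s              # 1 x q, matches W2
gb2 <- mean(delta2)                     # scalar
gW1 <- t(delta1) %*% X / s              # q x p, matches W1
gb1 <- colMeans(delta1)                 # length q
```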

Terminology: What is an Epoch?

  • In machine learning—especially when training neural networks—an epoch is a standard unit of measure.

  • It refers to one complete pass through the entire training data set by the optimization algorithm.

  • For Stochastic Gradient Descent (SGD), one epoch equals one full pass through all the mini-batches in your training set.

Epochs: Gradient Descent vs. SGD

  • Standard Gradient Descent (GD):

    • 1 epoch = 1 gradient iteration.

    • Because it uses the whole dataset at once to make a single update.

  • Stochastic Gradient Descent (SGD):

    • The training data (size \(m\)) is divided into \(r = m/s\) mini-batches, where \(s\) is the mini-batch size (assuming \(s\) divides \(m\) evenly).

    • 1 epoch = \(r\) mini-batch gradient updates.

    • Example: If you specify 100 epochs, SGD performs \(100r\) iterations; i.e., it updates the model parameters \(100r\) times.
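A quick worked example of the arithmetic (all sizes hypothetical):

```r
m <- 10000   # training set size
s <- 100     # mini-batch size
r <- m / s   # 100 mini-batches per epoch
epochs <- 100
epochs * r   # 10000 parameter updates in total
```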

Data Splits: Train, Val, and Test

When training our MLP, we must partition our data to ensure the model generalizes well to new, unseen data.

  • Training Set: Used to actually fit the model parameters (\(W\) and \(b\)). The size of this set is \(m\).

  • Validation Set: Used to tune hyperparameters (like our step size \(\eta\) and hidden layer dimension \(q\)) and to prevent overfitting.

  • Test Set: Used strictly at the end to evaluate the final model’s generalization performance.

  • A Typical Split: 70% Train, 15% Validation, 15% Test.
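One common way to create such a split in R, sketched on a hypothetical data frame `df`:

```r
set.seed(42)
df <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))  # toy data

n   <- nrow(df)
idx <- sample(n)  # shuffle the row indices

train <- df[idx[1:floor(0.70 * n)], ]
val   <- df[idx[(floor(0.70 * n) + 1):floor(0.85 * n)], ]
test  <- df[idx[(floor(0.85 * n) + 1):n], ]
```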