Stats 102B Discussion Section Week 5

Motivation & Problem Setup

  • The Data: We are working with a dataset defined as \(\{(x_i, y_i)\}_{i=1}^N\).

  • Features and Labels: The input features are vectors \(x_i \in \mathbb{R}^p\), and the labels indicate binary classes where \(y_i \in \{0,1\}\).

  • Our Goal: We need to train a model \(f(x; \theta)\) to accurately predict the probability \(\mathbb{P}(y_i = 1 | x_i)\).

  • Modeling Approaches: While logistic regression is a standard approach for this problem, the Multilayer Perceptron (MLP) provides a powerful alternative.

Visualizing a Single Layer MLP

```mermaid
graph LR
    subgraph Input Layer
    x1((x_1))
    x2((x_2))
    x3((x_3))
    end

    subgraph Hidden Layer
    h1((h_1))
    h2((h_2))
    h3((h_3))
    h4((h_4))
    end

    subgraph Output Layer
    y((y_hat))
    end

    x1 --> h1 & h2 & h3 & h4
    x2 --> h1 & h2 & h3 & h4
    x3 --> h1 & h2 & h3 & h4

    h1 & h2 & h3 & h4 --> y
```

Parts of an MLP

  • Input Layer: Takes in the features \(x_1, x_2, x_3\).
  • Hidden Layer: Intermediate nodes \(h_1, h_2, h_3, h_4\) process the inputs.
  • Output Layer: Produces the final prediction \(\hat{y}\).

MLP Architecture Details

  • Input: \(x \in \mathbb{R}^p\) (treated as a column vector in our derivations).
  • Hidden Layer: Calculated as \(h = \sigma(W_1x + b_1)\).
    • Weights: \(W_1 \in \mathbb{R}^{q \times p}\).
    • Biases: \(b_1 \in \mathbb{R}^q\). Output dimension: \(h \in \mathbb{R}^q\).
  • Output Layer: Calculated as \(\hat{y} = \sigma(W_2h + b_2)\).
    • Weights: \(W_2 \in \mathbb{R}^{1 \times q}\).
    • Bias: \(b_2 \in \mathbb{R}\).
    • Output dimension: \(\hat{y} \in \mathbb{R}\).
  • Activation \(\sigma(\cdot)\): We typically use ReLU for the hidden layer and a sigmoid function for the output layer.
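As a quick illustration, here is a minimal R sketch of these two activation functions (the function names `relu` and `sigmoid` are ours):

```r
# Minimal sketch of the two activation functions used in this MLP.
relu <- function(t) pmax(0, t)            # element-wise max(0, t)
sigmoid <- function(z) 1 / (1 + exp(-z))  # maps any real z into (0, 1)

relu(c(-2, 0, 3))   # returns 0 0 3
sigmoid(0)          # returns 0.5
```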

Objective Function: Logistic Regression to MLP

  • Recall Logistic Regression: The objective function we minimized was \(-l(\beta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i(x_i^\top \beta) - \log(1 + \exp(x_i^\top \beta))]\).
  • MLP Binary Classification: We use the Binary Cross-Entropy (BCE) loss function, defined as \(\mathcal{L}_{BCE} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\).
  • Variables: \(y \in \{0,1\}\) is the true label, and \(\hat{y} \in (0,1)\) is the predicted probability.

Intuition Behind BCE Loss

  • When \(y = 1\): We want our prediction \(\hat{y}\) to be close to 1, making the loss small because \(\log(\hat{y})\) approaches 0.
  • When \(y = 0\): We want our prediction \(\hat{y}\) to be close to 0, making the loss small because \(\log(1-\hat{y})\) approaches 0.
  • The Penalty: BCE heavily penalizes predictions that are highly confident but entirely incorrect (e.g., predicting \(\hat{y} = 0.01\) when the true label is \(y = 1\)).
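A small R sketch makes the penalty concrete (the helper `bce` is ours):

```r
# BCE loss for a single (label, predicted probability) pair.
bce <- function(y, y_hat) -(y * log(y_hat) + (1 - y) * log(1 - y_hat))

bce(1, 0.99)  # ~0.01: confident and correct, so the loss is small
bce(1, 0.01)  # ~4.61: confident but wrong, so the loss is large
```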

The Optimization Problem

  • Full BCE Loss: \(\mathcal{L}_{BCE}(\hat{y}(\theta), y) = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\).
  • Objective Function: We want to minimize the average loss over our sample of size \(m\). \[ \min_\theta \mathcal{L}_{BCE}(\theta) = \min_\theta -\frac{1}{m}\sum_{i=1}^{m}[y_i \log(\hat{y}_i(\theta)) + (1-y_i)\log(1-\hat{y}_i(\theta))] \]
  • Model Parameters: For a single layer MLP, we are optimizing \(\theta = (W_1, b_1, W_2, b_2)\).
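As a sketch, the averaged objective is a one-liner in R when \(y\) and \(\hat{y}\) are vectors of length \(m\) (the function name is ours):

```r
# Average BCE over the sample, matching the objective above.
bce_objective <- function(y, y_hat) {
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

bce_objective(c(1, 0, 1), c(0.9, 0.2, 0.7))  # ~0.228
```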

Training an MLP

Training a neural network requires two main phases:
1. Forward Pass - Computes the predictions from the inputs.
- Data flows sequentially: input \(\rightarrow\) hidden layers \(\rightarrow\) output.
- The loss is calculated by comparing these predictions against the actual targets.
2. Backward Pass - Computes the gradients of the loss function with respect to the parameters \(\theta\) using the chain rule.
- These gradients are then used to update the weights via an iterative optimization algorithm (like SGD or Adam).

Forward Pass

For a single observation, the computation flows as follows:

  • Hidden Layer Pre-activation: \(a = W_1x + b_1\).

  • Hidden Layer Activation: \(h = \sigma(a)\), where \(\sigma(\cdot)\) here denotes the ReLU function \(\sigma(t) = \max(0, t)\), applied element-wise.

  • Output Layer Pre-activation: \(z = W_2h + b_2\).

  • Output Layer Activation: \(\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\), where \(\sigma\) now denotes the sigmoid.

  • Loss Computation: \(\mathcal{L}_{BCE} = -(y \log\hat{y} + (1-y)\log(1-\hat{y}))\).
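Here is a toy R sketch of this forward pass for one observation, with hypothetical dimensions \(p = 3\) and \(q = 4\) and randomly initialized parameters:

```r
set.seed(1)
p <- 3; q <- 4
x  <- c(0.5, -1.2, 0.3)                     # input (p x 1)
W1 <- matrix(rnorm(q * p, sd = 0.1), q, p)  # q x p
b1 <- rep(0, q)                             # q x 1
W2 <- matrix(rnorm(q, sd = 0.1), 1, q)      # 1 x q
b2 <- 0                                     # scalar

a     <- as.vector(W1 %*% x + b1)  # hidden pre-activation (q x 1)
h     <- pmax(0, a)                # ReLU activation
z     <- as.vector(W2 %*% h + b2)  # output pre-activation (scalar)
y_hat <- 1 / (1 + exp(-z))         # sigmoid activation

y    <- 1                          # true label
loss <- -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
```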

Backward Pass

To apply Gradient Descent (or variants like SGD/Adam), we must calculate the gradient of the BCE loss with respect to our model parameters: \(\nabla_\theta \mathcal{L}_{BCE}\).

Specifically, we need to find:

  • \(\nabla_{W_1} \mathcal{L}_{BCE}\)

  • \(\nabla_{b_1} \mathcal{L}_{BCE}\)

  • \(\nabla_{W_2} \mathcal{L}_{BCE}\)

  • \(\nabla_{b_2} \mathcal{L}_{BCE}\)

Because these gradients depend on intermediate quantities computed during the forward pass, we calculate them starting from the output layer and moving backward to the input layer.

Backward Pass: Output Activation

  • Derivative w.r.t Output: \(\frac{d\mathcal{L}}{d\hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\).
  • Derivative of Sigmoid: Since \(\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}\), algebra shows \(\frac{d\hat{y}}{dz} = \hat{y}(1-\hat{y})\).
  • Applying the Chain Rule: \[ \frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \]

Backward Pass: Output Activation (cont.)

\[ \frac{d\mathcal{L}}{dz} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right) \cdot \hat{y}(1-\hat{y}) \]

  • Expanding the product: \(-y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y\).
  • Simplified Result: \(\frac{d\mathcal{L}}{dz} = \hat{y} - y\).

Backward Pass: Hidden Layer

We continue propagating the error backward to the hidden layer \(h\).

  • Recall: \(z = W_2h + b_2\), which implies \(\frac{\partial z}{\partial h} = W_2^\top\).

  • Chain Rule:\[ \frac{\partial\mathcal{L}}{\partial h} = \frac{d\mathcal{L}}{dz} \cdot \frac{\partial z}{\partial h} \]

  • Result:\[ \frac{\partial\mathcal{L}}{\partial h} = (\hat{y}-y) \cdot W_2^\top \]

Backward Pass: Pre-Activation

Recall: \(h = \text{ReLU}(a)\) and \(a = W_1x + b_1\).

  • The derivative of the ReLU function is applied element-wise: \[ \frac{\partial h}{\partial a} = 1_{a>0} \] (A vector of ones where \(a_j > 0\) and zeros otherwise).

  • Chain Rule: The gradient of the loss with respect to \(a\) uses the element-wise (Hadamard) product \(\odot\): \[ \frac{\partial\mathcal{L}}{\partial a} = \frac{\partial\mathcal{L}}{\partial h} \odot 1_{a>0} \]

Backward Pass: Input Parameters

  • Recall: \(a = W_1x + b_1\).

  • We use the chain rule one final time to get the gradients for our input weights and biases:

  • For \(W_1\):\[ \frac{\partial\mathcal{L}}{\partial W_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial W_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot x^\top \]

  • For \(b_1\):\[ \frac{\partial\mathcal{L}}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial a} \]

  • These calculated gradients are then used to update the network parameters via gradient descent.
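The whole chain fits in a few lines of R. This sketch repeats the toy forward pass from earlier (same hypothetical dimensions \(p = 3\), \(q = 4\)) and then computes all four gradients:

```r
# Toy setup, as in the forward-pass sketch.
set.seed(1)
x  <- c(0.5, -1.2, 0.3); y <- 1
W1 <- matrix(rnorm(12, sd = 0.1), 4, 3); b1 <- rep(0, 4)
W2 <- matrix(rnorm(4, sd = 0.1), 1, 4);  b2 <- 0
a <- as.vector(W1 %*% x + b1); h <- pmax(0, a)
z <- as.vector(W2 %*% h + b2); y_hat <- 1 / (1 + exp(-z))

# Backward pass, following the derivation above.
dz  <- y_hat - y             # dL/dz (scalar)
gW2 <- dz * t(h)             # dL/dW2 : 1 x q
gb2 <- dz                    # dL/db2 : scalar
dh  <- dz * as.vector(W2)    # dL/dh  : W2^T * dz, length q
da  <- dh * (a > 0)          # Hadamard product with the ReLU mask
gW1 <- da %*% t(x)           # dL/dW1 : q x p (outer product)
gb1 <- da                    # dL/db1 : length q
```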

Recap: Forward Pass Computations

For a single observation \(x \in \mathbb{R}^{p \times 1}\) with hidden units \(h \in \mathbb{R}^{q \times 1}\) and scalar output \(\hat{y}\):

| Step | Computation | Output Dimension |
|------|-------------|------------------|
| 1. Pre-activation (Hidden) | \(a = W_1 x + b_1\) | \((q \times 1)\) |
| 2. Activation (Hidden) | \(h = \text{ReLU}(a)\) | \((q \times 1)\) |

Recap: Forward Pass Computations (cont.)

| Step | Computation | Output Dimension |
|------|-------------|------------------|
| 3. Pre-activation (Output) | \(z = W_2 h + b_2\) | \((1 \times 1)\) |
| 4. Activation (Output) | \(\hat{y} = \sigma(z)\) | \((1 \times 1)\) |
| 5. BCE Loss | \(\mathcal{L} = -y \log(\hat{y}) - (1-y)\log(1-\hat{y})\) | \((1 \times 1)\) |

Recap: Backward Pass (Output Layer)

Starting from the loss and moving backward to the output layer parameters:

| Gradient | Computation | Dimension |
|----------|-------------|-----------|
| Output Pre-activation | \(\frac{d\mathcal{L}}{dz} = \hat{y} - y\) | \((1 \times 1)\) |
| Output Weights | \(\frac{\partial\mathcal{L}}{\partial W_2} = (\hat{y} - y)h^\top\) | \((1 \times q)\) |
| Output Bias | \(\frac{\partial\mathcal{L}}{\partial b_2} = \hat{y} - y\) | \((1 \times 1)\) |

Recap: Backward Pass (Hidden Layer)

Continuing backward to the hidden layer and input parameters:

| Gradient | Computation | Dimension |
|----------|-------------|-----------|
| Hidden Activation | \(\frac{\partial\mathcal{L}}{\partial h} = W_2^\top(\hat{y} - y)\) | \((q \times 1)\) |
| Hidden Pre-activation | \(\frac{\partial\mathcal{L}}{\partial a} = \frac{\partial\mathcal{L}}{\partial h} \odot 1_{a>0}\) | \((q \times 1)\) |
| Input Weights | \(\frac{\partial\mathcal{L}}{\partial W_1} = \frac{\partial\mathcal{L}}{\partial a} \cdot x^\top\) | \((q \times p)\) |
| Input Bias | \(\frac{\partial\mathcal{L}}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial a}\) | \((q \times 1)\) |

The Parameter Update Step

Once all gradients are calculated via the backward pass, we apply Stochastic Gradient Descent (SGD) to update our model parameters.

For a given step size (or learning rate) \(\eta\), the parameters at iteration \(k+1\) are updated as follows:

  • Output Layer Updates: \[W_2^{k+1} = W_2^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial W_2}(W_2^k)\] \[b_2^{k+1} = b_2^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial b_2}(b_2^k)\]

The Parameter Update Step (cont.)

  • Input Layer Updates: \[W_1^{k+1} = W_1^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial W_1}(W_1^k)\] \[b_1^{k+1} = b_1^k - \eta \cdot \frac{\partial\mathcal{L}}{\partial b_1}(b_1^k)\]
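In R, one SGD step is just four assignments. This sketch assumes the gradients `gW1`, `gb1`, `gW2`, `gb2` from the backward-pass sketch above and a hypothetical learning rate:

```r
eta <- 0.1  # step size (hypothetical value)

W1 <- W1 - eta * gW1
b1 <- b1 - eta * gb1
W2 <- W2 - eta * gW2
b2 <- b2 - eta * gb2
```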

Implementation Note: Matrix Operations in R

When programming this in R, calculating these gradients observation-by-observation using loops is highly inefficient.

For a mini-batch of size \(s\), we instead stack our inputs into a matrix \(X \in \mathbb{R}^{s \times p}\) and use matrix operations.

Forward Pass (Matrix Form):

  • \(A = X W_1^\top + 1_s b_1^\top\)

  • \(H = \text{ReLU}(A)\)

  • \(Z = H W_2^\top + b_2 1_s\)

  • \(\hat{Y} = \sigma(Z)\)

(Here \(1_s\) is a column vector of ones of length \(s\).)
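A minimal R sketch of this mini-batch forward pass, with hypothetical sizes \(s = 5\), \(p = 3\), \(q = 4\):

```r
set.seed(2)
s <- 5; p <- 3; q <- 4
X  <- matrix(rnorm(s * p), s, p)                      # s x p input batch
W1 <- matrix(rnorm(q * p, sd = 0.1), q, p); b1 <- rep(0, q)
W2 <- matrix(rnorm(q, sd = 0.1), 1, q);     b2 <- 0

A     <- X %*% t(W1) + matrix(b1, s, q, byrow = TRUE)  # s x q, adds b1 to each row
H     <- pmax(0, A)                                    # element-wise ReLU
Z     <- H %*% t(W2) + b2                              # s x 1
Y_hat <- 1 / (1 + exp(-Z))                             # s x 1 predicted probabilities
```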

Implementation Note: Matrix Backward Pass

To execute the backward pass over a mini-batch of size \(s\) without loops:

Error Terms:

  • \(\delta_2 = \hat{Y} - Y \quad \in \mathbb{R}^{s \times 1}\)

  • \(\delta_1 = (\delta_2 W_2) \odot 1_{H>0} \quad \in \mathbb{R}^{s \times q}\)

Implementation Note: Matrix Backward Pass (cont.)

Gradient Averaging:

  • \(\nabla_{W_2}\mathcal{L} = \frac{1}{s} \delta_2^\top H \quad \in \mathbb{R}^{1 \times q}\)

  • \(\nabla_{b_2}\mathcal{L} = \frac{1}{s} \sum_{i=1}^{s} \delta_2(i)\)

  • \(\nabla_{W_1}\mathcal{L} = \frac{1}{s} \delta_1^\top X \quad \in \mathbb{R}^{q \times p}\)

  • \(\nabla_{b_1}\mathcal{L} = \frac{1}{s} \sum_{i=1}^{s} \delta_1(i)\)

Here \(\delta(i)\) denotes the \(i\)-th row of \(\delta\), so the bias gradients are row averages; the transposes place each gradient in the same dimensions as the parameter it updates.
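Continuing the matrix forward-pass sketch above, the mini-batch backward pass in R (`Y` holds hypothetical labels):

```r
Y <- matrix(rbinom(s, 1, 0.5), s, 1)    # s x 1 labels (hypothetical)

delta2 <- Y_hat - Y                     # s x 1
delta1 <- (delta2 %*% W2) * (H > 0)     # s x q, Hadamard with the ReLU mask

gW2 <- t(delta2) %*% H / s              # 1 x q, matches W2
gb2 <- mean(delta2)                     # scalar
gW1 <- t(delta1) %*% X / s              # q x p, matches W1
gb1 <- colMeans(delta1)                 # length q
```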

Terminology: What is an Epoch?

  • In machine learning—especially when training neural networks—an epoch is a standard unit of measure.

  • It refers to one complete pass through the entire training data set by the optimization algorithm.

  • For Stochastic Gradient Descent (SGD), one epoch equals one full pass through all the mini-batches in your training set.

Epochs: Gradient Descent vs. SGD

  • Standard Gradient Descent (GD):

    • 1 epoch = 1 gradient iteration.

    • Because it uses the whole dataset at once to make a single update.

  • Stochastic Gradient Descent (SGD):

    • The training data (size \(m\)) is divided into \(r = m/s\) mini-batches, where \(s\) is the mini-batch size (assuming \(s\) divides \(m\) evenly).

    • 1 epoch = \(r\) mini-batch gradient updates.

    • Example: If you specify 100 epochs, SGD performs \(100r\) iterations; i.e., it updates the model parameters \(100r\) times.
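A quick worked example of the arithmetic (all sizes hypothetical):

```r
m <- 10000   # training set size
s <- 100     # mini-batch size
r <- m / s   # 100 mini-batches per epoch
epochs <- 100
epochs * r   # 10000 parameter updates in total
```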

Data Splits: Train, Val, and Test

When training our MLP, we must partition our data to ensure the model generalizes well to new, unseen data.

  • Training Set: Used to actually fit the model parameters (\(W\) and \(b\)). The size of this set is \(m\).

  • Validation Set: Used to tune hyperparameters (like our step size \(\eta\) and hidden layer dimension \(q\)) and to prevent overfitting.

  • Test Set: Used strictly at the end to evaluate the final model’s generalization performance.

  • A Typical Split: 70% Train, 15% Validation, 15% Test.
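One common way to create such a split in R, sketched on a hypothetical data frame `df`:

```r
set.seed(42)
df <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))  # toy data

n   <- nrow(df)
idx <- sample(n)  # shuffle the row indices

train <- df[idx[1:floor(0.70 * n)], ]
val   <- df[idx[(floor(0.70 * n) + 1):floor(0.85 * n)], ]
test  <- df[idx[(floor(0.85 * n) + 1):n], ]
```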