The Data: We are working with a dataset defined as \(\{(x_i, y_i)\}_{i=1}^N\).
Features and Labels: The input features are vectors \(x_i \in \mathbb{R}^p\), and the labels indicate binary classes where \(y_i \in \{0,1\}\).
Our Goal: We need to train a model \(f(x; \theta)\) to accurately predict the probability \(\mathbb{P}(y_i = 1 | x_i)\).
Modeling Approaches: While logistic regression is a standard approach for this problem, the Multilayer Perceptron (MLP) provides a powerful alternative.
Output Layer: Calculated as \(\hat{y} = \sigma(W_2h + b_2)\), where \(h \in \mathbb{R}^q\) is the hidden-layer activation.
Weights: \(W_2 \in \mathbb{R}^{1 \times q}\).
Bias: \(b_2 \in \mathbb{R}\).
Output dimension: \(\hat{y} \in \mathbb{R}\).
Activation \(\sigma(\cdot)\): We typically use ReLU for the hidden layer and a sigmoid function for the output layer.
Objective Function: Logistic Regression to MLP
Recall Logistic Regression: The objective function we minimized was \(-l(\beta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i(x_i^\top \beta) - \log(1 + \exp(x_i^\top \beta))]\).
MLP Binary Classification: We use the Binary Cross-Entropy (BCE) loss function, defined as \(\mathcal{L}_{BCE} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\).
Variables: \(y \in \{0,1\}\) is the true label, and \(\hat{y} \in (0,1)\) is the predicted probability.
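The link between the two objectives can be made explicit with a short derivation. It assumes the logistic regression prediction is written as \(\hat{y}_i = 1/(1 + \exp(-x_i^\top \beta))\), the usual sigmoid form, which the recalled formula implies but does not state. Under that assumption:
\[ \log(\hat{y}_i) = x_i^\top \beta - \log(1 + \exp(x_i^\top \beta)), \qquad \log(1-\hat{y}_i) = -\log(1 + \exp(x_i^\top \beta)) \]
Substituting both into the BCE loss for observation \(i\), the \(y_i \log(1 + \exp(x_i^\top \beta))\) terms cancel:
\[ -[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)] = -[y_i(x_i^\top \beta) - \log(1 + \exp(x_i^\top \beta))] \]
The right-hand side is exactly the summand of \(-l(\beta)\), so the logistic regression objective is the average BCE loss with a linear model inside the sigmoid. The MLP keeps the loss and replaces \(x_i^\top \beta\) with \(W_2h + b_2\).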
Intuition Behind BCE Loss
When \(y = 1\): We want our prediction \(\hat{y}\) to be close to 1, making the loss small because \(\log(\hat{y})\) approaches 0.
When \(y = 0\): We want our prediction \(\hat{y}\) to be close to 0, making the loss small because \(\log(1-\hat{y})\) approaches 0.
The Penalty: BCE heavily penalizes predictions that are highly confident but entirely incorrect (e.g., predicting \(\hat{y} = 0.01\) when the true label is \(y = 1\)).
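The size of this penalty is easy to check numerically. The sketch below evaluates the single-observation BCE formula for a true label \(y = 1\) at three predictions; the function name `bce` and the chosen values of \(\hat{y}\) are illustrative, not part of the original material.

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for one observation; y in {0, 1}, y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label y = 1, three predictions of decreasing quality.
confident_right = bce(1, 0.99)  # prediction close to the label
unsure = bce(1, 0.5)            # no information either way
confident_wrong = bce(1, 0.01)  # confident but incorrect

print(f"y_hat=0.99 -> loss {confident_right:.4f}")
print(f"y_hat=0.50 -> loss {unsure:.4f}")
print(f"y_hat=0.01 -> loss {confident_wrong:.4f}")
```

The loss at \(\hat{y} = 0.01\) is \(-\log(0.01) \approx 4.61\), several hundred times the loss at \(\hat{y} = 0.99\), and it grows without bound as \(\hat{y} \rightarrow 0\).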
The Optimization Problem
Full BCE Loss: \(\mathcal{L}_{BCE}(\hat{y}(\theta), y) = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\).
Objective Function: We want to minimize the average loss over the \(m\) training samples, where \(\hat{y}_i(\theta) = f(x_i; \theta)\) is the prediction for observation \(i\). \[ \min_\theta \mathcal{L}_{BCE}(\theta) = \min_\theta -\frac{1}{m}\sum_{i=1}^{m}[y_i \log(\hat{y}_i(\theta)) + (1-y_i)\log(1-\hat{y}_i(\theta))] \]
Model Parameters: For an MLP with a single hidden layer, we are optimizing \(\theta = (W_1, b_1, W_2, b_2)\).
Training an MLP
Training a neural network requires two main phases:
1. Forward Pass - Computes the predictions from the inputs.
- Data flows sequentially: input \(\rightarrow\) hidden layers \(\rightarrow\) output.
- The loss is calculated by comparing these predictions against the actual targets.
2. Backward Pass - Computes the gradients of the loss function with respect to the parameters \(\theta\) using the chain rule.
- These gradients are then used to update the weights via an iterative optimization algorithm (like SGD or Adam).
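As a rough illustration of the update that follows the backward pass, here is a minimal sketch of one plain gradient-descent step. The function name `sgd_step`, the dictionary layout, the learning rate, and the numbers are all assumptions made for this example; Adam additionally tracks running moment estimates and is not shown.

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    """One gradient-descent update: theta <- theta - lr * gradient, per parameter."""
    return {name: value - lr * grads[name] for name, value in params.items()}

# Hypothetical parameters and gradients, keyed by name.
params = {"W2": np.array([[1.0, 2.0]]), "b2": 0.5}
grads = {"W2": np.array([[0.5, -1.0]]), "b2": 2.0}

updated = sgd_step(params, grads, lr=0.1)
print(updated["W2"], updated["b2"])
```

In training, the forward pass, backward pass, and this update are repeated over many iterations, with the gradients recomputed each time.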
Forward Pass
For a single observation, the computation flows as follows:
Hidden Layer Pre-activation: \(a = W_1x + b_1\).
Hidden Layer Activation: \(h = \sigma(a)\), where \(\sigma(\cdot)\) is the ReLU function \(\max(0, t)\).
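Output Pre-activation: \(z = W_2h + b_2\).
Output: \(\hat{y} = \sigma(z)\), where \(\sigma(\cdot)\) is here the sigmoid function \(1/(1 + e^{-z})\).
Loss Computation: \(\mathcal{L}_{BCE} = -(y \log\hat{y} + (1-y)\log(1-\hat{y}))\).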
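The forward pass for one observation can be sketched in NumPy as follows. The output step uses \(\hat{y} = \sigma(W_2h + b_2)\) with a sigmoid, as in the output layer definition. The sizes \(p = 3\) and \(q = 4\), the random parameter values, and the function names are assumptions chosen only for illustration.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, y, W1, b1, W2, b2):
    """Forward pass for one observation x of shape (p,) with label y in {0, 1}."""
    a = W1 @ x + b1              # hidden pre-activation, shape (q,)
    h = relu(a)                  # hidden activation, shape (q,)
    z = (W2 @ h)[0] + b2         # output pre-activation, scalar
    y_hat = sigmoid(z)           # predicted probability in (0, 1)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return a, h, z, y_hat, loss

# Illustrative sizes and randomly drawn parameters (not from the original notes).
rng = np.random.default_rng(0)
p, q = 3, 4
W1, b1 = rng.normal(size=(q, p)), np.zeros(q)
W2, b2 = rng.normal(size=(1, q)), 0.0
x, y = rng.normal(size=p), 1.0

a, h, z, y_hat, loss = forward(x, y, W1, b1, W2, b2)
print("y_hat:", y_hat, "loss:", loss)
```

Returning the intermediate values \(a\), \(h\), and \(z\) alongside the loss is a deliberate choice: the backward pass reuses them.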
Backward Pass
To apply Gradient Descent (or variants like SGD/Adam), we must calculate the gradient of the BCE loss with respect to our model parameters: \(\nabla_\theta \mathcal{L}_{BCE}\).
Specifically, we need to find:
\(\nabla_{W_1} \mathcal{L}_{BCE}\)
\(\nabla_{b_1} \mathcal{L}_{BCE}\)
\(\nabla_{W_2} \mathcal{L}_{BCE}\)
\(\nabla_{b_2} \mathcal{L}_{BCE}\)
Because these gradients depend on intermediate quantities (\(z\), \(h\), and \(a\)), we calculate them starting from the output layer and working backwards to the input layer.
We continue propagating the error backward to the hidden layer \(h\).
Recall: \(z = W_2h + b_2\), which implies \(\frac{\partial z}{\partial h} = W_2^\top\), so \(\frac{\partial\mathcal{L}}{\partial h} = W_2^\top \frac{\partial\mathcal{L}}{\partial z}\).
Recall: \(h = \text{ReLU}(a)\) and \(a = W_1x + b_1\).
The derivative of the ReLU function is applied element-wise: \[ \frac{\partial h}{\partial a} = 1_{a>0} \] (A vector of ones where \(a_j > 0\) and zeros otherwise).
Chain Rule: The gradient of the loss with respect to \(a\) uses the element-wise (Hadamard) product \(\odot\) with the ReLU mask: \[ \frac{\partial\mathcal{L}}{\partial a} = \frac{\partial\mathcal{L}}{\partial h} \odot 1_{a>0} \]
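One way to sanity-check the chain-rule steps up to this point is to compare the analytic \(\frac{\partial\mathcal{L}}{\partial a}\) against central finite differences. The sketch below assumes the standard sigmoid-plus-BCE result \(\frac{\partial\mathcal{L}}{\partial z} = \hat{y} - y\), which is not derived in the text shown here; the function names and the numeric values of \(a\), \(W_2\), \(b_2\), and \(y\) are made up for the example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_from_a(a, y, W2, b2):
    """BCE loss as a function of the hidden pre-activation a."""
    h = np.maximum(0.0, a)
    y_hat = sigmoid((W2 @ h)[0] + b2)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def grad_a(a, y, W2, b2):
    """Analytic dL/da via the chain rule."""
    h = np.maximum(0.0, a)
    y_hat = sigmoid((W2 @ h)[0] + b2)
    dL_dz = y_hat - y           # assumed standard result for sigmoid + BCE
    dL_dh = W2[0] * dL_dz       # W2^T times dL/dz, shape (q,)
    return dL_dh * (a > 0)      # Hadamard product with the ReLU mask 1_{a>0}

# Illustrative values; entries of a are kept away from 0, where ReLU has a kink.
a = np.array([0.8, -1.2, 0.3, -0.5])
W2 = np.array([[0.5, -0.7, 1.1, 0.2]])
b2, y = 0.1, 1.0

analytic = grad_a(a, y, W2, b2)
eps = 1e-6
numeric = np.array([
    (loss_from_a(a + eps * e, y, W2, b2) - loss_from_a(a - eps * e, y, W2, b2)) / (2 * eps)
    for e in np.eye(a.size)
])
print("analytic:", analytic)
print("numeric: ", numeric)
```

The entries where \(a_j \le 0\) have zero gradient in both versions, which is the mask \(1_{a>0}\) blocking the error signal through inactive hidden units.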
Backward Pass: Input Parameters
Recall: \(a = W_1x + b_1\).
We use the chain rule one final time to get the gradients for our input weights and biases: