Multinomial Logistic Regression

Sára Mód & Jasper Ginn
26/09/2018

Table of Contents

  • Multinomial logistic regression: when to use?
  • Introduction to practical
  • Bias-variance trade-off
  • Understanding v. predicting
  • A 'naive' multinomial model
  • Cross-validation
  • Regularization
  • Appendices

  • Presentation: http://rpubs.com/jhginn/mvsuu

Multinomial logistic regression: when to use? (1)

Multinomial logistic regression: when to use? (2)

Multinomial logistic regression: when to use? (3)

Multinomial logistic regression: when to use? (4)

Introduction to practical

Execute the following in a terminal:

docker pull jhginn/multivariate_statistics_uu

Then:

docker run -e PASSWORD=stats -p 8787:8787 jhginn/multivariate_statistics_uu

Go to http://localhost:8787

OR

Bias-Variance Trade-Off

\[ \text{Total error} = \text{Bias}^2 + \text{Variance} + Var(\epsilon) \]
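
Spelled out for a single test point \( x_0 \), with \( f \) the true function and \( \hat{f} \) the fitted model, the expected squared prediction error decomposes as:

\[ E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \underbrace{\left(E[\hat{f}(x_0)] - f(x_0)\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right]}_{\text{Variance}} + Var(\epsilon) \]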

Understanding v. Prediction

A 'naive' model
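
As a minimal sketch (not the practical's actual code) of what a 'naive' fit looks like: all predictors, no validation or tuning. The data frame `dat` with factor outcome `y` is a hypothetical stand-in; the fit uses nnet::multinom.

## Fit a multinomial model on all predictors, no validation or tuning
## `dat` with factor outcome `y` is a hypothetical data frame
library(nnet)
fit_naive <- multinom(y ~ ., data = dat)
## In-sample predicted classes and accuracy (optimistic by construction)
pred_naive <- predict(fit_naive, newdata = dat, type = "class")
mean(pred_naive == dat$y)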

Simple Cross-validation
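
A minimal sketch of a simple hold-out validation under the same assumptions (hypothetical data frame `dat`, factor outcome `y`):

## 70/30 train-test split
set.seed(123)
train_idx <- sample(seq_len(nrow(dat)), size = round(0.7 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
## Fit on the training set, evaluate out-of-sample on the test set
fit_cv <- nnet::multinom(y ~ ., data = train)
mean(predict(fit_cv, newdata = test, type = "class") == test$y)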

Regularization

  • The log-likelihood function is what we optimize over successive iterations of the algorithm

    • Intuition: the observed outcomes should be as likely as possible given the data and the parameters

    \[ \mathcal{L}(\hat{y}, y) = y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \]

  • In practice we minimize the negative log-likelihood through successive iterations (see the sketch below)

  • Much as we do with Newton-Raphson
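
As a rough R sketch of that cost, with the L2 penalty (introduced below and in the appendix) already added; `y`, `yhat`, `w` and `lambda` are hypothetical inputs, with `yhat` holding predicted probabilities:

## Regularized negative log-likelihood (cross-entropy + L2 penalty); rough sketch
cost <- function(y, yhat, w, lambda) {
  m <- length(y)
  -(1 / m) * sum(y * log(yhat) + (1 - y) * log(1 - yhat)) +
    (lambda / (2 * m)) * sum(w^2)
}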

Intuition

  • How? Add a penalty term to the negative log-likelihood so we don't overfit!
    • Aim:
      • Introduce more bias / restrict variance
      • Provide more generalizable results
  • By increasing the regularization parameter \( \lambda \), we 'increase' the minimum cost (see the sketch below).
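
For the multinomial case, a minimal sketch of an L2-penalized fit with glmnet (not necessarily the package used in the practical), assuming a hypothetical numeric predictor matrix `x` and factor outcome `y`; larger values of \( \lambda \) shrink the coefficients more strongly, trading variance for bias:

## Ridge-type (alpha = 0) multinomial regression over a grid of lambda values
library(glmnet)
fit_reg <- glmnet(x, y, family = "multinomial", alpha = 0)
## Choose lambda by cross-validated misclassification error
cv_fit <- cv.glmnet(x, y, family = "multinomial", alpha = 0, type.measure = "class")
cv_fit$lambda.min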


Appendix: design matrix and responses

\[ y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}, \: y_i \in \{0, 1\} \\ X = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}, \: \dim(X) = (m, n) \]

Appendix: Log-likelihood & L2-Norm

\[ w = \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}, \: b \in \mathbb{R}, \: \hat{y} = \sigma(w^{T}X^{T} + b) \\ \mathcal{L}(\hat{y}, y) = y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \\ \mathcal{J}(w, b) = -\frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}_i, y_i) + \frac{\lambda}{2m} ||w||^2_2 \\ ||w||^2_2 = w^{T} \cdot w, \: \lambda \in \mathbb{R} \]

Appendix: pseudo-code

## Gradient descent for L2-regularized logistic regression
## Assumes X (m x n matrix), y (0/1 vector of length m), lambda,
## learning_rate and max_iterations are already defined
sigmoid <- function(z) 1 / (1 + exp(-z))
## Set parameters w and b to 0
m <- nrow(X)
w <- matrix(0, nrow = 1, ncol = ncol(X))
b <- 0
## Update parameters
for (i in seq_len(max_iterations)) {
  ## Linear combination & sigmoid function: predicted probabilities
  yhat <- as.vector(sigmoid(w %*% t(X) + b))
  ## Compute cost (negative log-likelihood + L2 penalty)
  cost <- -(1 / m) * sum(y * log(yhat) + (1 - y) * log(1 - yhat)) +
    (lambda / (2 * m)) * sum(w^2)
  ## Compute gradients
  dw <- (1 / m) * (t(X) %*% matrix(yhat - y, ncol = 1)) + t((lambda / m) * w)
  db <- (1 / m) * sum(yhat - y)
  ## Update parameters
  w <- w - t(learning_rate * dw)
  b <- b - learning_rate * db
}

Appendix: Links

Appendix: Resources