\[ \log \left( \frac{p(X)}{1-p(X)} \right) = \beta_0 + \beta_1 X \quad \text{logit function} \] Maximum likelihood: we seek estimates for \(\beta_0\) and \(\beta_1\) such that the predicted probability \(\hat p (x_i)\) corresponds as closely as possible to the observed response \(y_i\). The coefficients are chosen to maximize the likelihood function. \[ \ell (\beta_0, \beta_1) = \prod_{i:y_i = 1} p(x_i) \prod_{i':y_{i'} = 0} (1 - p(x_{i'}) ) \quad \text{likelihood function} \] The mathematical details of maximum likelihood can be found in Elements of Statistical Learning pg. 120 (if you dare). A z-statistic \(\hat\beta_1 / \text{SE}(\hat\beta_1)\) and its associated p-value test the hypothesis \(H_0 : \beta_1 = 0\). The intercept is typically not of interest; its main purpose is to adjust the average fitted probability to the proportion of ones in the data.
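As a sanity check on the likelihood idea, here is a minimal sketch (assuming the Default data from the ISLR package): fit a simple logistic regression, compute the log-likelihood by hand from the fitted probabilities, and compare it to logLik().

library(ISLR)
fit <- glm(default ~ balance, data = Default, family = "binomial")
p.hat <- predict(fit, type = "response")          # fitted probabilities
y <- as.numeric(Default$default == "Yes")
sum(y * log(p.hat) + (1 - y) * log(1 - p.hat))    # hand-computed log-likelihood
logLik(fit)                                       # should agree up to rounding
summary(fit)$coef                                 # z value = Estimate / Std. Error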
Confounding: a predictor’s coefficient changes when other predictors are included in the model. Using the Default dataset, fit a model predicting default from credit card balance and student status. As a single regressor, student status is positively associated with defaulting. But once balance is accounted for in the model, student status is negatively associated with defaulting. This is because students tend to carry more debt; for a fixed balance and income, a student is less likely to default than a non-student. This is an important distinction for credit card companies deciding who deserves more credit: a student is riskier if no balance information is available, but less risky than a non-student with the same credit card balance!
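A quick sketch of that sign flip (with ISLR’s Default data loaded as above): the coefficient on studentYes is positive on its own but negative once balance and income enter the model.

# confounding sketch: the student coefficient flips sign when balance is included
coef(glm(default ~ student, data = Default, family = "binomial"))
coef(glm(default ~ balance + income + student, data = Default, family = "binomial"))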
Generally multiclass logistic regression models are not used very often… Instead we use discriminant analysis!
Binary logistic regression models the conditional distribution of the response \(Y\) given the predictors \(X\): \(\text{Pr}(Y = k | X = x)\) where \(k \in \{0, 1\}\). Discriminant analysis models the distribution of the predictors separately in each of the response classes, then applies Bayes’ theorem to flip these into estimates of \(\text{Pr}(Y = k | X = x)\). When the distributions are normal, the model is very similar in form to logistic regression.
Suppose we want to classify an observation into one of \(K\) classes where \(K>2\). Let \(\pi_k\) represent the overall or prior probability that a given observation is associated with the \(k\)th class of the response variable \(Y\). Let \(f_k(x) = \text{Pr}(X = x | Y = k)\) denote the density function of \(X\) for an observation that comes from the \(k\)th class. In other words, \(f_k (x)\) is relatively large if there is a high probability that an observation in the \(k\)th class has \(X \approx x\), and \(f_k(x)\) is small if it is very unlikely that an observation in the \(k\)th class has \(X\approx x\). Then:
\[ \text{Pr}(Y = k | X = x) = \frac{\pi_k f_k (x)}{\sum_{l=1}^K \pi_l f_l(x)} \quad \text{Bayes' Theorem} \] We use the abbreviation \(p_k (X) = \text{Pr} (Y = k|X)\). Bayes’ theorem suggests that instead of directly computing \(p_k (X)\), we simply plug in estimates for \(\pi_k\) and \(f_k (X)\). Estimating \(\pi_k\) is easy: just compute the proportion of training observations with \(Y=k\). Estimating \(f_k(X)\) is more challenging unless we assume simple forms for the density. We refer to \(p_k(x)\) as the *posterior probability* that an observation \(X = x\) belongs to the \(k\)th class. If \(f_k(X)\) is estimated accurately, this approach approximates the Bayes classifier, which has the lowest possible error rate.
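A toy numeric sketch of the formula, with made-up priors and normal densities (the parameter values below are hypothetical, purely for illustration):

# Bayes' theorem sketch: two classes with normal densities and priors pi_k
pi.k <- c(0.3, 0.7)            # hypothetical prior probabilities
mu.k <- c(-1, 2)               # hypothetical class means
sigma <- 1.5                   # shared standard deviation
x0 <- 0.5
f.k <- dnorm(x0, mean = mu.k, sd = sigma)
pi.k * f.k / sum(pi.k * f.k)   # posterior probabilities p_k(x0); they sum to 1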
Assume only one predictor, \(p = 1\). We want to obtain an estimate for \(f_k (x)\) that we can plug into Bayes’ theorem in order to estimate \(p_k(x)\), then classify an observation to the class for which \(p_k(x)\) is greatest. First we assume \(f_k(x)\) is normal (Gaussian), \(N(\mu_k, \sigma_k^2)\), where \(\mu_k\) and \(\sigma_k^2\) are the mean and variance parameters for the \(k\)th class. Further, assume \(\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2\), i.e. the variance is shared across all classes. Plugging into Bayes’ theorem:
\[ p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{1}{2\sigma^2}(x - \mu_k)^2 \right)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{1}{2\sigma^2}(x - \mu_l)^2 \right)} \] The Bayes classifier assigns an observation \(X=x\) to the class for which \(p_k(x)\) is largest. In the simple case where \(K = 2\) and \(\pi_1 = \pi_2\), the Bayes decision boundary corresponds to \(x = (\mu_1 + \mu_2) / 2\). In practice, even if we are quite certain of our assumption that \(X\) is drawn from a normal population in each class, we still have to estimate the parameters \(\mu_1, \ldots, \mu_K\), \(\pi_1, \ldots, \pi_K\), and \(\sigma^2\). The linear discriminant analysis (LDA) method approximates the Bayes classifier by estimating these parameters: \[ \begin{eqnarray} \hat\mu_k &=& \frac{1}{n_k} \sum_{i : y_i = k} x_i \\ \hat\sigma^2 &=& \frac{1}{n - K} \sum_{k = 1}^K \sum_{i : y_i = k} (x_i - \hat\mu_k)^2 \end{eqnarray} \]
Here \(n\) is the total number of training observations, and \(n_k\) is the number of training observations in the \(k\)th class. The estimate \(\hat\mu_k\) is the average of all training observations in the \(k\)th class, while \(\hat\sigma^2\) is a weighted average of the sample variances of the \(K\) classes. In the absence of prior information, we estimate \(\pi_k\) with the proportion of training observations in the \(k\)th class, \(\hat\pi_k = n_k / n\). The LDA classifier assigns the observation \(X = x\) to the class for which the discriminant function \(\hat\delta_k(x)\) is largest. These are linear functions of \(x\). \[ \hat\delta_k(x) = x \cdot \frac{\hat\mu_k}{\hat\sigma^2} - \frac{\hat\mu_k^2}{2\hat\sigma^2} + \log(\hat\pi_k) \] To reiterate, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\), and plugging estimates of these parameters into the Bayes classifier.
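A minimal hand-rolled sketch of these one-predictor LDA estimates, assuming the Smarket data from the ISLR package (used in the lab code below) and using Lag1 alone; the object names here are just for illustration.

# one-predictor LDA by hand on the pre-2005 Smarket observations (assumes library(ISLR))
train.idx <- Smarket$Year < 2005
x <- Smarket$Lag1[train.idx]
y <- Smarket$Direction[train.idx]
n <- length(x); K <- nlevels(y)
mu.hat <- tapply(x, y, mean)                  # class means
pi.hat <- as.numeric(table(y)) / n            # prior proportions n_k / n
sigma2.hat <- sum(tapply(x, y, function(v) sum((v - mean(v))^2))) / (n - K)
# discriminant scores: classify x0 to the class with the largest score
delta <- function(x0) x0 * mu.hat / sigma2.hat - mu.hat^2 / (2 * sigma2.hat) + log(pi.hat)
delta(0.5)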
We extend the LDA classifier to multiple predictors. To do this, we assume \(X = (X_1, \ldots, X_p)\) is drawn from a multivariate normal distribution, with a class-specific mean vector and a common covariance matrix.

- Multivariate normal distribution: each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors. A surface plot of two uncorrelated predictors takes the form of a bell shape; any correlation or unequal variance distorts that shape. To indicate that a \(p\)-dimensional variable \(X\) has a multivariate normal distribution, we write \(X \sim N(\mu, \Sigma)\). Here \(E(X) = \mu \in \Bbb R^p\) is the mean of \(X\) and \(\text{Cov}(X) = \Sigma \in \Bbb R^{p \times p}\) is the covariance matrix of \(X\). Plugging the estimated parameters and density function into the Bayes classifier gives the discriminant function for each class \(k\): \[ \hat\delta_k(x) = x^T \hat\Sigma^{-1} \hat\mu_k - \frac{1}{2} \hat\mu_k^T \hat\Sigma^{-1} \hat\mu_k + \log(\hat\pi_k) \] Each pair of classes is separated by a Bayes decision boundary: with three classes, one boundary separates class 1 from class 2, another separates class 1 from class 3, and the last separates class 2 from class 3.
- Null classifier: a classifier that always predicts the null (negative) class, regardless of the predictor values.
- Confusion matrix: tabulates predicted against true classes, from which we compute the sensitivity (ability to detect positive responses) and specificity (ability to detect negative responses). We can make it easier to flag a positive response by lowering the posterior probability cutoff (say from 50% to 20%). Lowering the cutoff increases sensitivity but decreases specificity (more true positives, but also more false positives). Domain knowledge can help determine the cost of lowering or raising the threshold.
- ROC curve: summarizes the overall performance of a classifier across all possible cutoffs, measured by the area under the curve (AUC). An AUC of 0.5 means the classifier is no better than chance. See the sketch after the tables below.
|  | Predicted: - or Null | Predicted: + or Positive | Total |
|---|---|---|---|
| True: - or Null | True Neg. (TN) | False Pos. (FP) | N |
| True: + or Positive | False Neg. (FN) | True Pos. (TP) | P |
| Total | N* | P* |  |
| Name | Definition | Synonyms |
|---|---|---|
| False Pos. rate | FP / N | Type I error, 1 - specificity |
| True Pos. rate | TP / P | 1 - Type II error, power, sensitivity, recall |
| Pos. Pred. value | TP / P* | Precision, 1 - false discovery proportion |
| Neg. Pred. value | TN / N* |  |
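As a rough sketch of how an ROC curve is traced out: sweep the posterior cutoff and record the true and false positive rates at each value. This assumes the ISLR and MASS packages are loaded (as in the lab code below); the object names (lda.roc, smarket.test, post.up) are just for illustration.

# ROC sketch: sweep the posterior cutoff for an LDA fit on Smarket (assumes ISLR + MASS)
lda.roc <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = Smarket$Year < 2005)
smarket.test <- Smarket[Smarket$Year == 2005, ]
post.up <- predict(lda.roc, smarket.test)$posterior[, "Up"]
cutoffs <- seq(0, 1, by = 0.05)
tpr <- sapply(cutoffs, function(co) mean(post.up[smarket.test$Direction == "Up"] > co))   # sensitivity
fpr <- sapply(cutoffs, function(co) mean(post.up[smarket.test$Direction == "Down"] > co)) # 1 - specificity
plot(fpr, tpr, type = "b", xlab = "False positive rate", ylab = "True positive rate")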
LDA assumes a common covariance across all classes. QDA does not, i.e. each observation from the \(k\)th class comes from \(X \sim N(\mu_k, \Sigma_k)\). Why does it matter? It’s all about the bias / variance trade-off. Estimating one covariance matrix requires estimating \(p(p + 1)/2\) parameters, so QDA, with a separate covariance matrix per class, estimates a total of \(Kp(p + 1)/2\) parameters. LDA is therefore a much less flexible classifier with lower variance, while QDA’s extra flexibility pays off when the training set is large or when the common-covariance assumption is clearly wrong.
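A quick illustration of those parameter counts (p = 50 and K = 2 chosen arbitrarily):

# covariance parameter counts for LDA vs QDA
p <- 50; K <- 2
p * (p + 1) / 2        # LDA: one shared covariance matrix, 1275 parameters
K * p * (p + 1) / 2    # QDA: one covariance matrix per class, 2550 parameters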
library(ISLR)
library(tidyverse)
## ── Attaching packages ──────────────
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
attach(Smarket)
cor(Smarket[, -9]) # the only substantial correlation is between Year and Volume: trading volume increased over the years
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
summary(glm.fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3 0.011085108 0.04993854 0.2219750 0.8243333
## Lag4 0.009358938 0.04997413 0.1872757 0.8514445
## Lag5 0.010313068 0.04951146 0.2082966 0.8349974
## Volume 0.135440659 0.15835970 0.8552723 0.3924004
# calling predict() without new data returns fitted probabilities for the training set that created the model
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- factor(glm.probs > 0.5, labels = c("Down", "Up"))
head(glm.probs); head(glm.pred)
## 1 2 3 4 5 6
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565
## 1 2 3 4 5 6
## Up Down Down Up Up Up
## Levels: Down Up
# produce a confusion matrix
table(glm.pred, Smarket$Direction)
##
## glm.pred Down Up
## Down 145 141
## Up 457 507
# we used the training data for testing, so this error rate is overly optimistic.
# let's hold out some of the data for testing
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ] # year 2005 is the test set
Direction.2005 <- Direction[!train]
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = "binomial", subset = train)
glm.probs <- predict(glm.fit, Smarket.2005, type = "response")
glm.pred <- factor(glm.probs > 0.5, labels = c("Down", "Up"))
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
mean(glm.pred == Direction.2005)
## [1] 0.4801587
# test error rate of 52% is worse than guessing!
library(MASS) # lda() is part of the MASS library
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit # outputs prior probabilities and group means
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
plot(lda.fit)
lda.pred <- predict(lda.fit, Smarket.2005)
lda.class <- lda.pred$class
table(lda.class, Direction.2005)
## Direction.2005
## lda.class Down Up
## Down 35 35
## Up 76 106
# changing the posterior prob threshold
sum(lda.pred$posterior > .6)
## [1] 0
min(lda.pred$posterior) # the minimum posterior is ~46%, so the maximum is only ~54%
## [1] 0.4577867
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
qda.pred <- predict(qda.fit, Smarket.2005)
qda.class <- qda.pred$class
table(qda.class, Direction.2005)
## Direction.2005
## qda.class Down Up
## Down 30 20
## Up 81 121
mean(qda.class == Direction.2005) # almost 60% accuracy! $$$
## [1] 0.5992063
library(class)
# previously we fit a model, then used it to predict outcomes;
# knn() forms predictions with a single command and needs four inputs:
# 1. a matrix containing the predictors associated with the training data
# 2. a matrix containing the predictors associated with the test data
# 3. a vector containing the training class labels
# 4. a value for K, the number of nearest neighbors used by the classifier
train.X <- Smarket %>%
filter(train) %>%
dplyr::select(Lag1, Lag2)
test.X <- Smarket %>%
filter(!train) %>%
dplyr::select(Lag1, Lag2)
train.Direction <- Smarket[train, "Direction"]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
## Direction.2005
## knn.pred Down Up
## Down 48 55
## Up 63 86
mean(knn.pred == Direction.2005)
## [1] 0.531746
# only 50% accurate for k = 1: no better than guessing!
# 53% accurate for k = 3. Not much better.
# QDA is the best model in this application
attach(Caravan)
dim(Caravan) # p = 86, n = 5822,
## [1] 5822 86
# response variable = Purchase; only about 6% of people purchased insurance
# KNN has no notion of scale: a distance measured in Japanese yen and minutes is very different from one in dollars and years
# standardize the data so every predictor has mean 0 and standard deviation 1
stand.X <- scale(Caravan[, -86])
var(Caravan[, 1])
## [1] 165.0378
var(stand.X[,1])
## [1] 1
# create training / test sets
test <- 1:1000
train.X <- stand.X[-test, ]
test.X <- stand.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Y, k = 3)
mean(test.Y != knn.pred)
## [1] 0.074
# the overall error rate is only ~7%, but always predicting "No" would do almost as well (~6% buy);
# we care about correctly identifying the people who actually want insurance
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 921 54
## Yes 20 5
# false positive rate (1 - specificity) = 20 / (921 + 20) = ~2%
sum(knn.pred == "Yes" & test.Y == "No") / sum(test.Y == "No")
## [1] 0.02125399
# true positive rate (sensitivity) = 5 / (54 + 5) = ~8.5%
sum(knn.pred == "Yes" & test.Y == "Yes") / sum(test.Y == "Yes")
## [1] 0.08474576
Ref pg. 168

1. Prove Eq 4.2 = Eq 4.3 (pg. 132): \[ \begin{eqnarray} p(X) &=& \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \\ 1 - p(X) &=& \frac{1}{1 + \exp(\beta_0 + \beta_1 X)} \\ \frac{p(X)}{1 - p(X)} &=& \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \cdot \left( 1 + \exp(\beta_0 + \beta_1 X) \right) \\ &=& \exp(\beta_0 + \beta_1 X) \end{eqnarray} \]
2. Prove Eq 4.12 = Eq 4.13 (pg. 140). Bah, algebra…
Prove the Bayes classifier is non-linear for a QDA model with \(p = 1\), where an observation in the \(k\)th class comes from \(X \sim N(\mu_k, \sigma_k^2)\) (pg. 149). Plug Eq 4.11 into Eq 4.10 (as in 4.12) and solve; the quadratic \(x^2\) terms no longer cancel because each class has its own \(\sigma_k^2\).
KNN doesn’t work well when \(p\) is large. Here is the relative side length \(x\) of a hypercube required to contain 10% of the training data for 1, 2, and 100 predictors: \[ x^p = 0.1 \Rightarrow x = 0.1^{1/p} = (10\%,\ 32\%,\ 98\%) \quad \text{for } p = 1, 2, 100 \]
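A quick R check of that arithmetic:

0.1^(1 / c(1, 2, 100))   # = 0.100, 0.316, 0.977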
\[ \pi_1 = 0.8, \quad (\mu_1, \mu_0) = (10, 0), \quad \sigma^2 = 36: \quad f_1(4) = 0.04032845,\ f_0(4) = 0.05324133 \ \Rightarrow \ p(Y = 1 | X = 4) = \frac{0.8 \, f_1(4)}{0.8 \, f_1(4) + 0.2 \, f_0(4)} \approx 0.75 \]
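Checking those numbers with dnorm (note sd = \(\sqrt{36} = 6\)):

f1 <- dnorm(4, mean = 10, sd = 6)
f0 <- dnorm(4, mean = 0, sd = 6)
c(f1, f0)                            # 0.04032845 0.05324133
0.8 * f1 / (0.8 * f1 + 0.2 * f0)     # posterior ~0.75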
Don’t know. KNN may have a large test error.