\[ \log \left( \frac{p(X)}{1-p(X)} \right) = \beta_0 + \beta_1 X \quad \text{logit function} \] Maximum likelihood: we seek estimates for \(\beta_0\) and \(\beta_1\) such that the predicted probability \(\hat p (x_i)\) corresponds as closely as possible to the observed response \(y_i\). The coefficients are chosen to maximize the likelihood function. \[ \ell (\beta_0, \beta_1) = \prod_{i:y_i = 1} p(x_i) \prod_{i':y_{i'} = 0} (1 - p(x_{i'}) ) \quad \text{likelihood function} \] The mathematical details of maximum likelihood can be found in Elements of Statistical Learning pg. 120 (if you dare). A z-statistic \(\hat\beta_1 / \text{SE}(\hat\beta_1)\) and its associated p-value test the hypothesis \(H_0 : \beta_1 = 0\). The intercept is typically not of interest; its main purpose is to adjust the average fitted probability to the proportion of ones in the data.
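As a sanity check on the likelihood idea, here is a minimal sketch (assuming the Default data from the ISLR package): fit a simple logistic regression, compute the log-likelihood by hand from the fitted probabilities, and compare it to logLik().

library(ISLR)
fit <- glm(default ~ balance, data = Default, family = "binomial")
p.hat <- predict(fit, type = "response")          # fitted probabilities
y <- as.numeric(Default$default == "Yes")
sum(y * log(p.hat) + (1 - y) * log(1 - p.hat))    # hand-computed log-likelihood
logLik(fit)                                       # should agree up to rounding
summary(fit)$coef                                 # z value = Estimate / Std. Error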
Confounding: a predictor’s coefficient changes when other predictors are included in the model. Using the Default dataset, fit a model predicting default from credit card balance and student status. As a single regressor, student status is positively associated with defaulting. But once balance is accounted for in the model, student status is negatively associated with defaulting. This is because students tend to carry more debt; for a fixed balance and income, a student is less likely to default than a non-student. This is an important distinction for credit card companies deciding who deserves more credit: a student is riskier if no balance information is available, but less risky than a non-student with the same credit card balance!
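A quick sketch of that sign flip (with ISLR’s Default data loaded as above): the coefficient on studentYes is positive on its own but negative once balance and income enter the model.

# confounding sketch: the student coefficient flips sign when balance is included
coef(glm(default ~ student, data = Default, family = "binomial"))
coef(glm(default ~ balance + income + student, data = Default, family = "binomial"))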
Generally multiclass logistic regression models are not used very often… Instead we use discriminant analysis!
Binary logistic regression models the conditional distribution of the response \(Y\) given the predictors \(X\): \(\text{Pr}(Y = k | X = x)\) where \(k \in \{0, 1\}\). Discriminant analysis models the distribution of the predictors separately in each of the response classes, then applies Bayes’ theorem to flip these into estimates of \(\text{Pr}(Y = k | X = x)\). When the distributions are normal, the model is very similar in form to logistic regression.
Suppose we want to classify an observation into one of \(K\) classes where \(K>2\). Let \(\pi_k\) represent the overall or prior probability that a given observation is associated with the \(k\)th class of the response variable \(Y\). Let \(f_k(x) = \text{Pr}(X = x | Y = k)\) denote the density function of \(X\) for an observation that comes from the \(k\)th class. In other words, \(f_k (x)\) is relatively large if there is a high probability that an observation in the \(k\)th class has \(X \approx x\), and \(f_k(x)\) is small if it is very unlikely that an observation in the \(k\)th class has \(X\approx x\). Then:
\[ \text{Pr}(Y = k | X = x) = \frac{\pi_k f_k (x)}{\sum_{l=1}^K \pi_l f_l(x)} \quad \text{Bayes' Theorem} \] We use the abbreviation \(p_k (X) = \text{Pr} (Y = k|X)\). Bayes’ theorem suggests that instead of directly computing \(p_k (X)\), we simply plug in estimates for \(\pi_k\) and \(f_k (X)\). Estimating \(\pi_k\) is easy: just compute the proportion of training observations with \(Y=k\). Estimating \(f_k(X)\) is more challenging unless we assume simple forms for the density. We refer to \(p_k(x)\) as the *posterior probability* that an observation \(X = x\) belongs to the \(k\)th class. If \(f_k(X)\) is estimated accurately, this approach approximates the Bayes classifier, which has the lowest possible error rate.
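A toy numeric sketch of the formula, with made-up priors and normal densities (the parameter values below are hypothetical, purely for illustration):

# Bayes' theorem sketch: two classes with normal densities and priors pi_k
pi.k <- c(0.3, 0.7)            # hypothetical prior probabilities
mu.k <- c(-1, 2)               # hypothetical class means
sigma <- 1.5                   # shared standard deviation
x0 <- 0.5
f.k <- dnorm(x0, mean = mu.k, sd = sigma)
pi.k * f.k / sum(pi.k * f.k)   # posterior probabilities p_k(x0); they sum to 1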
Assume only one predictor, \(p = 1\). We want to obtain an estimate for \(f_k (x)\) that we can plug into Bayes’ theorem in order to estimate \(p_k(x)\), then classify an observation to the class for which \(p_k(x)\) is greatest. First we assume \(f_k(x)\) is normal (Gaussian), \(N(\mu_k, \sigma_k^2)\), where \(\mu_k\) and \(\sigma_k^2\) are the mean and variance parameters for the \(k\)th class. Further, assume \(\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2\), i.e. the variance is shared across all classes. Plugging into Bayes’ theorem:
\[ p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{1}{2\sigma^2}(x - \mu_k)^2 \right)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{1}{2\sigma^2}(x - \mu_l)^2 \right)} \] The Bayes classifier assigns an observation \(X=x\) to the class for which \(p_k(x)\) is largest. In the simple case where \(K = 2\) and \(\pi_1 = \pi_2\), the Bayes decision boundary corresponds to \(x = (\mu_1 + \mu_2) / 2\). In practice, even if we are quite certain of our assumption that \(X\) is drawn from a normal population in each class, we still have to estimate the parameters \(\mu_1, \ldots, \mu_K\), \(\pi_1, \ldots, \pi_K\), and \(\sigma^2\). The linear discriminant analysis (LDA) method approximates the Bayes classifier by estimating these parameters: \[ \begin{eqnarray} \hat\mu_k &=& \frac{1}{n_k} \sum_{i : y_i = k} x_i \\ \hat\sigma^2 &=& \frac{1}{n - K} \sum_{k = 1}^K \sum_{i : y_i = k} (x_i - \hat\mu_k)^2 \end{eqnarray} \]
Here \(n\) is the total number of training observations, and \(n_k\) is the number of training observations in the \(k\)th class. The estimate \(\hat\mu_k\) is the average of all training observations in the \(k\)th class, while \(\hat\sigma^2\) is a weighted average of the sample variances of the \(K\) classes. In the absence of prior information, we estimate \(\pi_k\) with the proportion of training observations in the \(k\)th class, \(\hat\pi_k = n_k / n\). The LDA classifier assigns the observation \(X = x\) to the class for which the discriminant function \(\hat\delta_k(x)\) is largest. These are linear functions of \(x\). \[ \hat\delta_k(x) = x \cdot \frac{\hat\mu_k}{\hat\sigma^2} - \frac{\hat\mu_k^2}{2\hat\sigma^2} + \log(\hat\pi_k) \] To reiterate, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\), and plugging estimates of these parameters into the Bayes classifier.
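A minimal hand-rolled sketch of these one-predictor LDA estimates, assuming the Smarket data from the ISLR package (used in the lab code below) and using Lag1 alone; the object names here are just for illustration.

# one-predictor LDA by hand on the pre-2005 Smarket observations (assumes library(ISLR))
train.idx <- Smarket$Year < 2005
x <- Smarket$Lag1[train.idx]
y <- Smarket$Direction[train.idx]
n <- length(x); K <- nlevels(y)
mu.hat <- tapply(x, y, mean)                  # class means
pi.hat <- as.numeric(table(y)) / n            # prior proportions n_k / n
sigma2.hat <- sum(tapply(x, y, function(v) sum((v - mean(v))^2))) / (n - K)
# discriminant scores: classify x0 to the class with the largest score
delta <- function(x0) x0 * mu.hat / sigma2.hat - mu.hat^2 / (2 * sigma2.hat) + log(pi.hat)
delta(0.5)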
We extend the LDA classifier to multiple predictors. To do this, we assume \(X = (X_1, \ldots, X_p)\) is drawn from a multivariate normal distribution, with a class-specific mean vector and a common covariance matrix.

- Multivariate normal distribution: each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors. A surface plot of two uncorrelated predictors takes the form of a bell shape; any correlation or unequal variance distorts that shape. To indicate that a \(p\)-dimensional variable \(X\) has a multivariate normal distribution, we write \(X \sim N(\mu, \Sigma)\). Here \(E(X) = \mu \in \Bbb R^p\) is the mean of \(X\) and \(\text{Cov}(X) = \Sigma \in \Bbb R^{p \times p}\) is the covariance matrix of \(X\). Plugging the estimated parameters and density function into the Bayes classifier gives the discriminant function for each class \(k\): \[ \hat\delta_k(x) = x^T \hat\Sigma^{-1} \hat\mu_k - \frac{1}{2} \hat\mu_k^T \hat\Sigma^{-1} \hat\mu_k + \log(\hat\pi_k) \] Each pair of classes is separated by a Bayes decision boundary: with three classes, one boundary separates class 1 from class 2, another separates class 1 from class 3, and the last separates class 2 from class 3.
- Null classifier: a classifier that always predicts the null (negative) class, regardless of the predictor values.
- Confusion matrix: tabulates predicted against true classes, from which we compute the sensitivity (ability to detect positive responses) and specificity (ability to detect negative responses). We can make it easier to flag a positive response by lowering the posterior probability cutoff (say from 50% to 20%). Lowering the cutoff increases sensitivity but decreases specificity (more true positives, but also more false positives). Domain knowledge can help determine the cost of lowering or raising the threshold.
- ROC curve: summarizes the overall performance of a classifier across all possible cutoffs, measured by the area under the curve (AUC). An AUC of 0.5 means the classifier is no better than chance. See the sketch after the tables below.
|  | Predicted: - or Null | Predicted: + or Positive | Total |
|---|---|---|---|
| True: - or Null | True Neg. (TN) | False Pos. (FP) | N |
| True: + or Positive | False Neg. (FN) | True Pos. (TP) | P |
| Total | N* | P* |  |
| Name | Definition | Synonyms |
|---|---|---|
| False Pos. rate | FP / N | Type I error, 1 - specificity |
| True Pos. rate | TP / P | 1 - Type II error, power, sensitivity, recall |
| Pos. Pred. value | TP / P* | Precision, 1 - false discovery proportion |
| Neg. Pred. value | TN / N* |  |
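As a rough sketch of how an ROC curve is traced out: sweep the posterior cutoff and record the true and false positive rates at each value. This assumes the ISLR and MASS packages are loaded (as in the lab code below); the object names (lda.roc, smarket.test, post.up) are just for illustration.

# ROC sketch: sweep the posterior cutoff for an LDA fit on Smarket (assumes ISLR + MASS)
lda.roc <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = Smarket$Year < 2005)
smarket.test <- Smarket[Smarket$Year == 2005, ]
post.up <- predict(lda.roc, smarket.test)$posterior[, "Up"]
cutoffs <- seq(0, 1, by = 0.05)
tpr <- sapply(cutoffs, function(co) mean(post.up[smarket.test$Direction == "Up"] > co))   # sensitivity
fpr <- sapply(cutoffs, function(co) mean(post.up[smarket.test$Direction == "Down"] > co)) # 1 - specificity
plot(fpr, tpr, type = "b", xlab = "False positive rate", ylab = "True positive rate")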
LDA assumes a common covariance across all classes. QDA does not, i.e. each observation from the \(k\)th class comes from \(X \sim N(\mu_k, \Sigma_k)\). Why does it matter? It’s all about the bias / variance trade-off. Estimating one covariance matrix requires estimating \(p(p + 1)/2\) parameters, so QDA, with a separate covariance matrix per class, estimates a total of \(Kp(p + 1)/2\) parameters. LDA is therefore a much less flexible classifier with lower variance, while QDA’s extra flexibility pays off when the training set is large or when the common-covariance assumption is clearly wrong.
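A quick illustration of those parameter counts (p = 50 and K = 2 chosen arbitrarily):

# covariance parameter counts for LDA vs QDA
p <- 50; K <- 2
p * (p + 1) / 2        # LDA: one shared covariance matrix, 1275 parameters
K * p * (p + 1) / 2    # QDA: one covariance matrix per class, 2550 parameters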
library(ISLR)
library(tidyverse)
## ── Attaching packages ──────────────
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
attach(Smarket)
cor(Smarket[, -9]) # the only substantial correlation is between Year and Volume: trading volume increased over the years
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
summary(glm.fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3 0.011085108 0.04993854 0.2219750 0.8243333
## Lag4 0.009358938 0.04997413 0.1872757 0.8514445
## Lag5 0.010313068 0.04951146 0.2082966 0.8349974
## Volume 0.135440659 0.15835970 0.8552723 0.3924004
# calling predict() without new data returns fitted probabilities for the training set that created the model
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- factor(glm.probs > 0.5, labels = c("Down", "Up"))
head(glm.probs); head(glm.pred)
## 1 2 3 4 5 6
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565
## 1 2 3 4 5 6
## Up Down Down Up Up Up
## Levels: Down Up
# produce a confusion matrix
table(glm.pred, Smarket$Direction)
##
## glm.pred Down Up
## Down 145 141
## Up 457 507
# we used the training data for testing, so this error rate is overly optimistic.
# let's hold out some of the data for testing
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ] # year 2005 is the test set
Direction.2005 <- Direction[!train]
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = "binomial", subset = train)
glm.probs <- predict(glm.fit, Smarket.2005, type = "response")
glm.pred <- factor(glm.probs > 0.5, labels = c("Down", "Up"))
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
mean(glm.pred == Direction.2005)
## [1] 0.4801587
# test error rate of 52% is worse than guessing!
library(MASS) # lda() is part of the MASS library
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit # outputs prior probabilities and group means
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
plot(lda.fit)
lda.pred <- predict(lda.fit, Smarket.2005)
lda.class <- lda.pred$class
table(lda.class, Direction.2005)
## Direction.2005
## lda.class Down Up
## Down 35 35
## Up 76 106
# changing the posterior prob threshold
sum(lda.pred$posterior > .6)
## [1] 0
min(lda.pred$posterior) # the minimum posterior is ~46%, so the maximum is only ~54%
## [1] 0.4577867
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
qda.pred <- predict(qda.fit, Smarket.2005)
qda.class <- qda.pred$class
table(qda.class, Direction.2005)
## Direction.2005
## qda.class Down Up
## Down 30 20
## Up 81 121
mean(qda.class == Direction.2005) # almost 60% accuracy! $$$
## [1] 0.5992063
library(class)
# previously we fit a model, then used it to predict outcomes;
# knn() forms predictions with a single command and needs four inputs:
# 1. a matrix containing the predictors associated with the training data
# 2. a matrix containing the predictors associated with the test data
# 3. a vector containing the training class labels
# 4. a value for K, the number of nearest neighbors used by the classifier
train.X <- Smarket %>%
filter(train) %>%
dplyr::select(Lag1, Lag2)
test.X <- Smarket %>%
filter(!train) %>%
dplyr::select(Lag1, Lag2)
train.Direction <- Smarket[train, "Direction"]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
## Direction.2005
## knn.pred Down Up
## Down 48 55
## Up 63 86
mean(knn.pred == Direction.2005)
## [1] 0.531746
# only 50% accurate for k = 1: no better than guessing!
# 53% accurate for k = 3. Not much better.
# QDA is the best model in this application
attach(Caravan)
dim(Caravan) # p = 86, n = 5822,
## [1] 5822 86
# response variable = Purchase; only about 6% of people purchased insurance
# KNN has no notion of scale: a distance measured in Japanese yen and minutes is very different from one in dollars and years
# standardize the data so every predictor has mean 0 and standard deviation 1
stand.X <- scale(Caravan[, -86])
var(Caravan[, 1])
## [1] 165.0378
var(stand.X[,1])
## [1] 1
# create training / test sets
test <- 1:1000
train.X <- stand.X[-test, ]
test.X <- stand.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Y, k = 3)
mean(test.Y != knn.pred)
## [1] 0.074
# the overall error rate is only ~7%, but always predicting "No" would do almost as well (~6% buy);
# we care about correctly identifying the people who actually want insurance
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 921 54
## Yes 20 5
# false positive rate (1 - specificity) = 20 / (921 + 20) = ~2%
sum(knn.pred == "Yes" & test.Y == "No") / sum(test.Y == "No")
## [1] 0.02125399
# true positive rate (sensitivity) = 5 / (54 + 5) = ~8.5%
sum(knn.pred == "Yes" & test.Y == "Yes") / sum(test.Y == "Yes")
## [1] 0.08474576
Ref pg. 168

1. Prove Eq 4.2 = Eq 4.3 (pg. 132): \[ \begin{eqnarray} p(X) &=& \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \\ 1 - p(X) &=& \frac{1}{1 + \exp(\beta_0 + \beta_1 X)} \\ \frac{p(X)}{1 - p(X)} &=& \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \cdot \left( 1 + \exp(\beta_0 + \beta_1 X) \right) \\ &=& \exp(\beta_0 + \beta_1 X) \end{eqnarray} \]
2. Prove Eq 4.12 = Eq 4.13 (pg. 140). Bah, algebra…
Prove the Bayes classifier is non-linear for a QDA model with \(p = 1\), where an observation in the \(k\)th class comes from \(X \sim N(\mu_k, \sigma_k^2)\) (pg. 149). Plug Eq 4.11 into Eq 4.10 (as in 4.12) and solve; the quadratic \(x^2\) terms no longer cancel because each class has its own \(\sigma_k^2\).
KNN doesn’t work well when \(p\) is large. Here is the relative side length \(x\) of a hypercube required to contain 10% of the training data for 1, 2, and 100 predictors: \[ x^p = 0.1 \Rightarrow x = 0.1^{1/p} = (10\%,\ 32\%,\ 98\%) \quad \text{for } p = 1, 2, 100 \]
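A quick R check of that arithmetic:

0.1^(1 / c(1, 2, 100))   # = 0.100, 0.316, 0.977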
\[ \pi_1 = 0.8, \quad (\mu_1, \mu_0) = (10, 0), \quad \sigma^2 = 36: \quad f_1(4) = 0.04032845,\ f_0(4) = 0.05324133 \ \Rightarrow \ p(Y = 1 | X = 4) = \frac{0.8 \, f_1(4)}{0.8 \, f_1(4) + 0.2 \, f_0(4)} \approx 0.75 \]
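Checking those numbers with dnorm (note sd = \(\sqrt{36} = 6\)):

f1 <- dnorm(4, mean = 10, sd = 6)
f0 <- dnorm(4, mean = 0, sd = 6)
c(f1, f0)                            # 0.04032845 0.05324133
0.8 * f1 / (0.8 * f1 + 0.2 * f0)     # posterior ~0.75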
Don’t know. KNN may have a large test error.