Example: To illustrate, suppose a quiz has 10 multiple-choice questions, with 5 possible answers for each. A student randomly guesses the answer to each question. Let Y denote the number of correct responses. The probability of a correct response is 0.20 for a given question, so n = 10 and \(\pi = 0.2\). The probability of y correct responses is \(P(y)=\frac{n!}{y!(n-y)!}\pi^y(1-\pi)^{n-y}\); below we evaluate it at y = 0.
n <- 10   # number of questions
p <- 0.2  # probability of a correct guess on each question
y <- 0    # number of correct responses of interest
prob_0_correct <- dbinom(y, size = n, prob = p)
cat("P(Y = 0):", prob_0_correct, "\n")
## P(Y = 0): 0.1073742
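The complement gives the probability of guessing at least one question correctly, a quick sketch using the same quantities:

# Probability of at least one correct response: 1 - P(Y = 0)
1 - dbinom(0, size = n, prob = p)
## [1] 0.8926258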
Methods for analyzing binary response data are of fundamental importance. The most popular model for binary data is logistic regression.
For a probability of success \(\pi\), the odds of success are defined to be \(\frac{\pi}{1-\pi}\). For two groups with success probabilities \(\pi_1\) and \(\pi_2\), the odds ratio is defined to be \(\theta=\frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}\). When \(\theta>1\), subjects in group 1 are more likely to have a success than subjects in group 2.
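As a quick numeric illustration (the probabilities here are invented for the example):

pi1 <- 0.6                # hypothetical success probability for group 1
pi2 <- 0.3                # hypothetical success probability for group 2
odds1 <- pi1 / (1 - pi1)  # odds of success for group 1: 1.5
odds2 <- pi2 / (1 - pi2)  # odds of success for group 2: about 0.43
odds1 / odds2             # odds ratio theta; > 1 favors group 1
## [1] 3.5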
Logistic regression models the log odds of success as a linear function of the predictor:

\(\log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0+\beta_1x\)
Before fitting models, we set up the standard confusion-matrix terminology used to evaluate binary classifiers.

library(knitr)
pp <- c("True positive (TP)", "False positive (FP)", "Sensitivity (TPR) = TP/P")
pn <- c("False negative (FN)", "True negative (TN)", "Specificity (TNR) = TN/N")
cm <- data.frame("Predicted Positive (PP)" = pp, "Predicted Negative (PN)" = pn, check.names = FALSE)
rownames(cm) <- c("Actual Positive (P)", "Actual Negative (N)", "Metric")
kable(cm)
|                     | Predicted Positive (PP)  | Predicted Negative (PN)  |
|---------------------|--------------------------|--------------------------|
| Actual Positive (P) | True positive (TP)       | False negative (FN)      |
| Actual Negative (N) | False positive (FP)      | True negative (TN)       |
| Metric              | Sensitivity (TPR) = TP/P | Specificity (TNR) = TN/N |
We illustrate logistic regression with the Smarket stock-market data from the ISLR2 package.

library(ISLR2)
data(Smarket)
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
glm.fits <- glm(
  Direction ~ . - Year - Today,
  data = Smarket, family = binomial
)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial,
## data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.
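To pull out just the p-values, we can index the coefficient table stored in the summary object:

# Fourth column of the coefficient matrix holds the p-values
summary(glm.fits)$coef[, 4]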
The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type = "response" option tells R to output probabilities of the form P(Y = 1 | X) rather than the default, which is on the logit (linear predictor) scale.
glm.probs <- predict(glm.fits, type = "response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
## 9 10
## 0.5176135 0.4888378
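These numbers are probabilities of the market going up rather than down because R codes the second level of the Direction factor, Up, as 1 in the model's dummy variable; contrasts() confirms this:

# Check the dummy coding of the response factor
contrasts(Smarket$Direction)
##      Up
## Down  0
## Up    1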
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down.
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] <- "Up"
Let’s take a look at the confusion matrix.
table(glm.pred, Smarket$Direction)
##
## glm.pred Down Up
## Down 145 141
## Up 457 507
We correctly classified (145 + 507)/1250 = 0.5216 of the observations. However, this is not good practice, since we trained and tested on the same data.
A better approach is to split the data into training and test sets. We first create a vector identifying the observations from 2001 through 2004, and then use it to hold out the observations from 2005 as a test set.
train <- (Smarket$Year < 2005)     # TRUE for observations from 2001-2004
Smarket.2005 <- Smarket[!train, ]  # held-out 2005 observations
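A quick dimension check confirms the size of the held-out set, which is where the 252 used below comes from:

# The 2005 test set has 252 observations
dim(Smarket.2005)
## [1] 252   9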
We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005.
glm.fits <- glm(Direction ~ . - Year - Today, data = Smarket, family = binomial, subset = train)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial,
## data = Smarket, subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.191213 0.333690 0.573 0.567
## Lag1 -0.054178 0.051785 -1.046 0.295
## Lag2 -0.045805 0.051797 -0.884 0.377
## Lag3 0.007200 0.051644 0.139 0.889
## Lag4 0.006441 0.051706 0.125 0.901
## Lag5 -0.004223 0.051138 -0.083 0.934
## Volume -0.116257 0.239618 -0.485 0.628
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.1 on 991 degrees of freedom
## AIC: 1395.1
##
## Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fits, Smarket.2005, type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
Direction.2005 <- Smarket$Direction[!train]
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
mean(glm.pred == Direction.2005)
## [1] 0.4801587
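An accuracy of about 48% is worse than random guessing. Using the terminology from the confusion-matrix table above, with Up taken as the positive class (our choice for illustration), we can also compute sensitivity and specificity:

tab <- table(glm.pred, Direction.2005)
tab["Up", "Up"] / sum(tab[, "Up"])        # sensitivity: TP / P = 44/141
## [1] 0.3120567
tab["Down", "Down"] / sum(tab[, "Down"])  # specificity: TN / N = 77/111
## [1] 0.6936937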
Can a smaller model do better? We use step() to perform stepwise selection, dropping predictors one at a time to minimize AIC.
step(glm.fits)
## Start: AIC=1395.11
## Direction ~ (Year + Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume +
## Today) - Year - Today
##
## Df Deviance AIC
## - Lag5 1 1381.1 1393.1
## - Lag4 1 1381.1 1393.1
## - Lag3 1 1381.1 1393.1
## - Volume 1 1381.3 1393.3
## - Lag2 1 1381.9 1393.9
## - Lag1 1 1382.2 1394.2
## <none> 1381.1 1395.1
##
## Step: AIC=1393.11
## Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Volume
##
## Df Deviance AIC
## - Lag4 1 1381.1 1391.1
## - Lag3 1 1381.1 1391.1
## - Volume 1 1381.3 1391.3
## - Lag2 1 1381.9 1391.9
## - Lag1 1 1382.2 1392.2
## <none> 1381.1 1393.1
##
## Step: AIC=1391.13
## Direction ~ Lag1 + Lag2 + Lag3 + Volume
##
## Df Deviance AIC
## - Lag3 1 1381.2 1389.2
## - Volume 1 1381.4 1389.4
## - Lag2 1 1381.9 1389.9
## - Lag1 1 1382.2 1390.2
## <none> 1381.1 1391.1
##
## Step: AIC=1389.15
## Direction ~ Lag1 + Lag2 + Volume
##
## Df Deviance AIC
## - Volume 1 1381.4 1387.4
## - Lag2 1 1381.9 1387.9
## - Lag1 1 1382.2 1388.2
## <none> 1381.2 1389.2
##
## Step: AIC=1387.4
## Direction ~ Lag1 + Lag2
##
## Df Deviance AIC
## - Lag2 1 1382.1 1386.1
## - Lag1 1 1382.6 1386.6
## <none> 1381.4 1387.4
##
## Step: AIC=1386.14
## Direction ~ Lag1
##
## Df Deviance AIC
## - Lag1 1 1383.3 1385.3
## <none> 1382.1 1386.1
##
## Step: AIC=1385.27
## Direction ~ 1
##
## Call: glm(formula = Direction ~ 1, family = binomial, data = Smarket,
## subset = train)
##
## Coefficients:
## (Intercept)
## 0.03207
##
## Degrees of Freedom: 997 Total (i.e. Null); 997 Residual
## Null Deviance: 1383
## Residual Deviance: 1383 AIC: 1385
Stepwise selection removes every predictor and ends with the intercept-only model, consistent with the weak associations seen above. How do we interpret a coefficient? Consider the model with Lag1 alone.
glm.lag1 <- glm(Direction ~ Lag1, data = Smarket, family = binomial, subset = train)
summary(glm.lag1)
##
## Call:
## glm(formula = Direction ~ Lag1, family = binomial, data = Smarket,
## subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03215 0.06335 0.507 0.612
## Lag1 -0.05460 0.05165 -1.057 0.290
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1382.1 on 996 degrees of freedom
## AIC: 1386.1
##
## Number of Fisher Scoring iterations: 3
For every one-unit increase in Lag1, the estimated odds that the market goes up are multiplied by exp(-0.0546) = 0.95; that is, the odds decrease by about 5%.
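We can read the odds ratio off directly by exponentiating the fitted coefficient:

# Convert the Lag1 coefficient from the log-odds scale to an odds ratio
exp(coef(glm.lag1)["Lag1"])  # approximately 0.95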
crabs <- read.csv("./Crabs.csv")
Here, we let Y indicate whether a female crab has any satellites (other males who could mate with her). That is, Y = 1 if a female crab has at least one satellite, and Y = 0 if she has no satellite.
crabs$y <- ifelse(crabs$sat == 0, 0, 1)  # y = 1 if the crab has at least one satellite
glm.c <- glm(y ~ . - alone - sat, data = crabs, family = "binomial")
summary(glm.c)
##
## Call:
## glm(formula = y ~ . - alone - sat, family = "binomial", data = crabs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.6826250 3.8700294 -2.502 0.01235 *
## crab 0.0004875 0.0036804 0.132 0.89463
## weight 0.8288982 0.7052041 1.175 0.23983
## width 0.2618095 0.1958258 1.337 0.18124
## colormedium 1.5029140 0.5672439 2.650 0.00806 **
## colormedium dark 1.1206505 0.5936360 1.888 0.05906 .
## colormedium light 1.6215450 0.9406600 1.724 0.08474 .
## spineSome wear -0.1029865 0.7057308 -0.146 0.88398
## spineWorn 0.3927643 0.5055311 0.777 0.43720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 185.18 on 164 degrees of freedom
## AIC: 203.18
##
## Number of Fisher Scoring iterations: 4
Backward elimination by AIC again suggests a smaller model.
step(glm.c, direction = "backward")
## Start: AIC=203.18
## y ~ (crab + sat + alone + weight + width + color + spine) - alone -
## sat
##
## Df Deviance AIC
## - spine 2 186.17 200.17
## - crab 1 185.20 201.20
## - weight 1 186.60 202.60
## - width 1 186.95 202.95
## <none> 185.18 203.18
## - color 3 192.75 204.75
##
## Step: AIC=200.17
## y ~ crab + weight + width + color
##
## Df Deviance AIC
## - crab 1 186.21 198.21
## - weight 1 187.43 199.43
## <none> 186.17 200.17
## - width 1 188.44 200.44
## - color 3 192.83 200.83
##
## Step: AIC=198.21
## y ~ weight + width + color
##
## Df Deviance AIC
## - weight 1 187.46 197.46
## <none> 186.21 198.21
## - width 1 188.54 198.54
## - color 3 192.89 198.89
##
## Step: AIC=197.46
## y ~ width + color
##
## Df Deviance AIC
## <none> 187.46 197.46
## - color 3 194.45 198.45
## - width 1 212.06 220.06
##
## Call: glm(formula = y ~ width + color, family = "binomial", data = crabs)
##
## Coefficients:
## (Intercept) width colormedium colormedium dark
## -12.715 0.468 1.402 1.106
## colormedium light
## 1.330
##
## Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
## Null Deviance: 225.8
## Residual Deviance: 187.5 AIC: 197.5
We refit the selected model. Note that this call omits family = binomial, so R fits a Gaussian (linear probability) model rather than a logistic regression, as the dispersion line in the summary below shows.
glm.p <- glm(y ~ color + width, data = crabs)
summary(glm.p)
##
## Call:
## glm(formula = y ~ color + width, data = crabs)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.73250 0.42419 -4.084 6.83e-05 ***
## colormedium 0.29276 0.10567 2.770 0.00623 **
## colormedium dark 0.23475 0.11401 2.059 0.04104 *
## colormedium light 0.29516 0.15876 1.859 0.06475 .
## width 0.08111 0.01637 4.955 1.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1897883)
##
## Null deviance: 39.780 on 172 degrees of freedom
## Residual deviance: 31.884 on 168 degrees of freedom
## AIC: 210.38
##
## Number of Fisher Scoring iterations: 2
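For comparison, a genuinely logistic refit of the selected model would look like this (a sketch; its output is not reproduced here):

# Logistic-regression version of the selected model
glm.logit <- glm(y ~ color + width, family = binomial, data = crabs)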
predpr <- predict(glm.p, type = "response")
We have seen that a model with good discrimination ability has an ROC curve that hugs the top-left corner of the plot, whereas a model with no discrimination ability has an ROC curve close to the 45-degree line.
library(pROC)
roccurve <- roc(crabs$y ~ predpr)
plot(roccurve, legacy.axes = TRUE)
The area under the estimated ROC curve (AUC) summarizes this discrimination ability across all classification thresholds; auc() computes it.
auc(roccurve)
## Area under the curve: 0.769
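If we want the AUC displayed on the plot itself, the pROC plot method accepts a print.auc argument (assuming a reasonably recent pROC version):

# Redraw the ROC curve with the AUC printed on the plot
plot(roccurve, legacy.axes = TRUE, print.auc = TRUE)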
In a logistic regression model, we have a binary response Y and a predictor x.