An Introduction to Logistic Regression

Let’s first introduce the binomial experiment. A binomial experiment has the following properties:

  • Fixed number of trials (n)
  • Each trial is independent
  • Each trial results in one of two outcomes: success or failure
  • The probability of success remains the same for every trial

Example: Suppose a quiz has 10 multiple-choice questions, with 5 possible answers for each. A student randomly guesses the answer to each question. Let Y denote the number of correct responses. The probability of a correct response is 0.20 for a given question, so n = 10 and \(\pi = 0.2\). The number of correct responses follows a binomial distribution with \(P(y)=\frac{n!}{y!(n-y)!}\pi^y(1-\pi)^{n-y}\). For example, we can compute the probability of y = 0 correct responses:

n <- 10   # number of questions
p <- 0.2  # probability of a correct guess
y <- 0    # number of correct responses

prob_0_correct <- dbinom(y, size = n, prob = p)
cat("P(Y = 0):", prob_0_correct, "\n")
## P(Y = 0): 0.1073742
The mean and variance of a binomial random variable are \(\mu=n\pi\) and \(\sigma^2=n\pi(1-\pi)\).
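
As a quick check on the quiz example (a minimal sketch, reusing the n and p defined above):

n * p            # mean: mu = n * pi = 2 correct answers expected
n * p * (1 - p)  # variance: sigma^2 = n * pi * (1 - pi) = 1.6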

Let us now take a closer look at the statistical modeling of binary response variables. Binary data are the most common form of categorical data, and the methods of this chapter are of fundamental importance. The most popular model for binary data is logistic regression.

The logistic regression model has a linear form for the logit of the success probability.

For a probability of success \(\pi\), the odds of success are defined to be \(\frac{\pi}{1-\pi}\). For two groups with success probabilities \(\pi_1\) and \(\pi_2\), the odds ratio is defined to be \(\theta=\frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}\). When \(\theta\gt1\), subjects in group 1 are more likely to have a success than subjects in group 2. For a single predictor x, the model is

\(\log(\frac{\pi(x)}{1-\pi(x)}) = \beta_0+\beta_1x\)
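
To see how the model maps the linear predictor back to a probability, here is a minimal sketch in base R; the coefficients beta0 and beta1 are hypothetical values chosen purely for illustration:

beta0 <- -3
beta1 <- 0.5
x <- 4
logit <- beta0 + beta1 * x  # log-odds at x = 4: -1
exp(logit)                  # odds pi(x)/(1 - pi(x)), about 0.37
plogis(logit)               # inverse logit: pi(x) = exp(logit)/(1 + exp(logit)), about 0.27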

Confusion Matrix

pp <- c("True postive(TP)","False positive(FN)","Sensitivity(TPR) = TP/P")
pn <- c("False postive(FP)","True negative(TN)","Specifcity(TNR) = TN/N") 
cm <- data.frame(cbind("Predicted Positvie(PP)"= pp,"Predicted Negative(PN)"=pn))
rownames(cm) <- c("Actual Positive(P)", "Actual Negative(N)","Metric")
kable(cm)
                     Predicted Positive (PP)    Predicted Negative (PN)
Actual Positive (P)  True positive (TP)         False negative (FN)
Actual Negative (N)  False positive (FP)        True negative (TN)
Metric               Sensitivity (TPR) = TP/P   Specificity (TNR) = TN/N
library(ISLR2)
data(Smarket)
str(Smarket)
## 'data.frame':    1250 obs. of  9 variables:
##  $ Year     : num  2001 2001 2001 2001 2001 ...
##  $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
##  $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
##  $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
##  $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
##  $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
##  $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
##  $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...

We will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function can be used to fit many types of generalized linear models, including logistic regression. The syntax of glm() is similar to that of lm(), except that we must pass the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

glm.fits <- glm(
  Direction ~ . - Year - Today,
  data = Smarket, family = binomial
)
summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial, 
##     data = Smarket)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type = "response" option tells R to output probabilities of the form P(Y = 1 | X), as opposed to other information such as the logit.

glm.probs <- predict(glm.fits , type = "response")
glm.probs[1:10]
##         1         2         3         4         5         6         7         8 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 
##         9        10 
## 0.5176135 0.4888378

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down.

glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"

Let’s take a look at the confusion matrix.

table(glm.pred , Smarket$Direction)
##         
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507

We correctly classified (145 + 507)/1250 = 0.5216, i.e., 52.16% of the observations.
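
The same quantities can be computed programmatically; a short sketch (tab is just a local name introduced here):

tab <- table(glm.pred, Smarket$Direction)
sum(diag(tab)) / sum(tab)                 # overall accuracy: (145 + 507) / 1250
tab["Up", "Up"] / sum(tab[, "Up"])        # sensitivity (TPR), treating "Up" as positive
tab["Down", "Down"] / sum(tab[, "Down"])  # specificity (TNR)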

This is not good practice, however, since we trained and tested on the same data, so this accuracy is likely overly optimistic.

A better way is to split the data into training and test sets. We will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held-out data set of observations from 2005.

train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train,]

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005.

glm.fits <- glm(Direction ~ . - Year - Today, data = Smarket, family = binomial, subset = train)
summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial, 
##     data = Smarket, subset = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.191213   0.333690   0.573    0.567
## Lag1        -0.054178   0.051785  -1.046    0.295
## Lag2        -0.045805   0.051797  -0.884    0.377
## Lag3         0.007200   0.051644   0.139    0.889
## Lag4         0.006441   0.051706   0.125    0.901
## Lag5        -0.004223   0.051138  -0.083    0.934
## Volume      -0.116257   0.239618  -0.485    0.628
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1381.1  on 991  degrees of freedom
## AIC: 1395.1
## 
## Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fits, Smarket.2005, type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
Direction.2005 <- Smarket$Direction[!train]
table(glm.pred , Direction.2005)
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
mean(glm.pred == Direction.2005)
## [1] 0.4801587
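
The test accuracy is only 48%, i.e., a test error rate of about 52%, which is worse than random guessing. A quick sketch comparing against a naive benchmark:

mean(glm.pred != Direction.2005)  # test error rate, 1 - 0.4801587
mean(Direction.2005 == "Up")      # accuracy of a classifier that always predicts "Up"

Next, we use step() to perform backward selection by AIC and see whether a smaller model is preferred.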
step(glm.fits)
## Start:  AIC=1395.11
## Direction ~ (Year + Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume + 
##     Today) - Year - Today
## 
##          Df Deviance    AIC
## - Lag5    1   1381.1 1393.1
## - Lag4    1   1381.1 1393.1
## - Lag3    1   1381.1 1393.1
## - Volume  1   1381.3 1393.3
## - Lag2    1   1381.9 1393.9
## - Lag1    1   1382.2 1394.2
## <none>        1381.1 1395.1
## 
## Step:  AIC=1393.11
## Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Volume
## 
##          Df Deviance    AIC
## - Lag4    1   1381.1 1391.1
## - Lag3    1   1381.1 1391.1
## - Volume  1   1381.3 1391.3
## - Lag2    1   1381.9 1391.9
## - Lag1    1   1382.2 1392.2
## <none>        1381.1 1393.1
## 
## Step:  AIC=1391.13
## Direction ~ Lag1 + Lag2 + Lag3 + Volume
## 
##          Df Deviance    AIC
## - Lag3    1   1381.2 1389.2
## - Volume  1   1381.4 1389.4
## - Lag2    1   1381.9 1389.9
## - Lag1    1   1382.2 1390.2
## <none>        1381.1 1391.1
## 
## Step:  AIC=1389.15
## Direction ~ Lag1 + Lag2 + Volume
## 
##          Df Deviance    AIC
## - Volume  1   1381.4 1387.4
## - Lag2    1   1381.9 1387.9
## - Lag1    1   1382.2 1388.2
## <none>        1381.2 1389.2
## 
## Step:  AIC=1387.4
## Direction ~ Lag1 + Lag2
## 
##        Df Deviance    AIC
## - Lag2  1   1382.1 1386.1
## - Lag1  1   1382.6 1386.6
## <none>      1381.4 1387.4
## 
## Step:  AIC=1386.14
## Direction ~ Lag1
## 
##        Df Deviance    AIC
## - Lag1  1   1383.3 1385.3
## <none>      1382.1 1386.1
## 
## Step:  AIC=1385.27
## Direction ~ 1
## 
## Call:  glm(formula = Direction ~ 1, family = binomial, data = Smarket, 
##     subset = train)
## 
## Coefficients:
## (Intercept)  
##     0.03207  
## 
## Degrees of Freedom: 997 Total (i.e. Null);  997 Residual
## Null Deviance:       1383 
## Residual Deviance: 1383  AIC: 1385

Backward selection removes every predictor, ending at the intercept-only model, which is consistent with the weak evidence in the coefficient table. How do we interpret a logistic regression coefficient? Consider a model with Lag1 alone.

glm.lag1 <- glm(Direction ~ Lag1, data = Smarket, family = binomial, subset = train)
summary(glm.lag1)
## 
## Call:
## glm(formula = Direction ~ Lag1, family = binomial, data = Smarket, 
##     subset = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.03215    0.06335   0.507    0.612
## Lag1        -0.05460    0.05165  -1.057    0.290
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1382.1  on 996  degrees of freedom
## AIC: 1386.1
## 
## Number of Fisher Scoring iterations: 3

The odds of the market going up are multiplied by exp(-0.0546) = 0.95 for every one-unit increase in Lag1; that is, the odds decrease by about 5%.
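
These odds multipliers can be computed directly from the fitted model; a short sketch (confint.default() gives Wald intervals):

exp(coef(glm.lag1))             # multiplicative change in the odds per one-unit increase
exp(confint.default(glm.lag1))  # Wald 95% confidence intervals on the odds scale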

crabs <- read.csv("./Crabs.csv")

Here, we let Y indicate whether a female crab has any satellites (other males who could mate with her). That is, Y = 1 if a female crab has at least one satellite, and Y = 0 if she has no satellite.

crabs$y <- ifelse(crabs$sat == 0, 0, 1)
glm.c <- glm(y ~ . - alone - sat, data = crabs, family = "binomial")
summary(glm.c)
## 
## Call:
## glm(formula = y ~ . - alone - sat, family = "binomial", data = crabs)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       -9.6826250  3.8700294  -2.502  0.01235 * 
## crab               0.0004875  0.0036804   0.132  0.89463   
## weight             0.8288982  0.7052041   1.175  0.23983   
## width              0.2618095  0.1958258   1.337  0.18124   
## colormedium        1.5029140  0.5672439   2.650  0.00806 **
## colormedium dark   1.1206505  0.5936360   1.888  0.05906 . 
## colormedium light  1.6215450  0.9406600   1.724  0.08474 . 
## spineSome wear    -0.1029865  0.7057308  -0.146  0.88398   
## spineWorn          0.3927643  0.5055311   0.777  0.43720   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 185.18  on 164  degrees of freedom
## AIC: 203.18
## 
## Number of Fisher Scoring iterations: 4
step(glm.c, direction = "backward")
## Start:  AIC=203.18
## y ~ (crab + sat + alone + weight + width + color + spine) - alone - 
##     sat
## 
##          Df Deviance    AIC
## - spine   2   186.17 200.17
## - crab    1   185.20 201.20
## - weight  1   186.60 202.60
## - width   1   186.95 202.95
## <none>        185.18 203.18
## - color   3   192.75 204.75
## 
## Step:  AIC=200.17
## y ~ crab + weight + width + color
## 
##          Df Deviance    AIC
## - crab    1   186.21 198.21
## - weight  1   187.43 199.43
## <none>        186.17 200.17
## - width   1   188.44 200.44
## - color   3   192.83 200.83
## 
## Step:  AIC=198.21
## y ~ weight + width + color
## 
##          Df Deviance    AIC
## - weight  1   187.46 197.46
## <none>        186.21 198.21
## - width   1   188.54 198.54
## - color   3   192.89 198.89
## 
## Step:  AIC=197.46
## y ~ width + color
## 
##         Df Deviance    AIC
## <none>       187.46 197.46
## - color  3   194.45 198.45
## - width  1   212.06 220.06
## 
## Call:  glm(formula = y ~ width + color, family = "binomial", data = crabs)
## 
## Coefficients:
##       (Intercept)              width        colormedium   colormedium dark  
##           -12.715              0.468              1.402              1.106  
## colormedium light  
##             1.330  
## 
## Degrees of Freedom: 172 Total (i.e. Null);  168 Residual
## Null Deviance:       225.8 
## Residual Deviance: 187.5     AIC: 197.5
glm.p <- glm(y ~ color + width, data = crabs)
summary(glm.p)
## 
## Call:
## glm(formula = y ~ color + width, data = crabs)
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.73250    0.42419  -4.084 6.83e-05 ***
## colormedium        0.29276    0.10567   2.770  0.00623 ** 
## colormedium dark   0.23475    0.11401   2.059  0.04104 *  
## colormedium light  0.29516    0.15876   1.859  0.06475 .  
## width              0.08111    0.01637   4.955 1.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1897883)
## 
##     Null deviance: 39.780  on 172  degrees of freedom
## Residual deviance: 31.884  on 168  degrees of freedom
## AIC: 210.38
## 
## Number of Fisher Scoring iterations: 2
predpr <- predict(glm.p, type = "response")
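
Note that the glm.p call above omits family = binomial, so glm() falls back to its gaussian default (visible in the dispersion line of the summary), fitting a linear probability model rather than a logistic regression. A sketch of the logistic refit, with glm.logit as a name introduced here; predicted probabilities from either fit can be passed to the ROC code below:

glm.logit <- glm(y ~ color + width, data = crabs, family = binomial)
predpr.logit <- predict(glm.logit, type = "response")  # fitted P(Y = 1)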

The Area Under the ROC Curve (AUC)

We have seen that a model with good discrimination ability has an ROC curve that approaches the top-left corner of the plot, whereas a model with no discrimination ability has an ROC curve close to the 45-degree line.

library(pROC)
roccurve <- roc(crabs$y ~ predpr)
plot(roccurve,legacy.axes=TRUE)

The area under the estimated ROC curve (AUC) summarizes the curve in a single number and can be computed with auc(). It ranges from 0.5 (no discrimination) to 1 (perfect discrimination).

auc(roccurve)
## Area under the curve: 0.769

In summary, a logistic regression model relates a binary response Y to a predictor x through a linear model for the logit of the success probability.