Learning Objectives

In this lesson students will learn …

  • The generalized linear model (GLM) framework
  • How to implement simple and multiple logistic regression
  • How to interpret the coefficients of a logistic regression model
  • How to make predictions from a logistic regression model
  • The trade-offs involved in choosing a threshold
  • How to perform variable selection for GLMs using AIC
  • How to compare models based on their confusion matrices

Data Inspiration

Motivation/Background:

  • “The Pima Indians of Arizona have the highest reported prevalences of obesity and non-insulin-dependent diabetes mellitus (NIDDM). In parallel with abrupt changes in lifestyle, these prevalences in Arizona Pimas have increased to epidemic proportions during the past decades.” (Ravussin et al. 1994)
  • “This study provides compelling evidence that changes in lifestyle associated with Westernization play a major role in the global epidemic of type 2 diabetes.” (Schultz et al. 2006)

Step 1: Import the Data

pima<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/diabetes.csv", 
               header=TRUE)

str(pima)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Step 2: Visualize

Our goal in this step is to find feature variables that best separate the two groups (0 = diabetes negative; 1 = diabetes positive).

library(tidyverse)

## GLUCOSE DENSITY
ggplot(data=pima, aes(x=Glucose ,fill=factor(Outcome)))+
  geom_density(alpha=.5)
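
Glucose separates the two groups reasonably well. To scan every feature at once, here is a minimal sketch (assuming the pima data frame imported above) that reshapes the data with pivot_longer() and facets one density plot per feature:

## ALL FEATURES AT ONCE (a sketch)
pima %>%
  pivot_longer(-Outcome, names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value, fill = factor(Outcome))) +
  geom_density(alpha = .5) +
  facet_wrap(~ feature, scales = "free")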

Step 3: Training and Testing

We will use stratified splitting to create a 70/30 training/testing split to build and evaluate our models.

Note: We will use the same seed as we did for the KNN example so that we can compare.

library(caret)
# Split the data into training and test set
set.seed(314)
caretSamp <- createDataPartition(pima$Outcome, 
                                 p = 0.7, 
                                 list = FALSE)

## SPLIT TESTING AND TRAINING
trainCaret  <- pima[caretSamp, ]
testCaret <- pima[-caretSamp, ]
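
Since Outcome is coded 0/1, its mean is the proportion of positives; a quick sanity check that stratification preserved the class balance in both sets:

## CHECK CLASS BALANCE IN EACH SET
mean(trainCaret$Outcome)
mean(testCaret$Outcome)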

Step 4: Building a Model

“If the only tool you have is a hammer, you tend to see every problem as a nail.” - Abraham Maslow

Linear regression is the most popular model; however, its assumptions are not met here because the response is binary rather than continuous.

ggplot(data=trainCaret, aes(x=Glucose, y=Outcome))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

That didn’t work out so well.

So far we’ve covered SLR and MLR, but those only work when we have a continuous response variable. To extend the model to a binary response, we can turn to the generalized linear model framework.

Generalized Linear Model (GLM) Framework

A GLM consists of three things:

  • A linear predictor
  • A distribution of the response
  • A link function between the two

Note: For GLMs there is generally no closed-form solution for the coefficient estimates. Instead, they are found by iterative numerical optimization (R uses Fisher scoring; you can see the iteration count at the bottom of the summary() output below).

The SLR and MLR models fit within this umbrella framework.

  • Model: \(Y=\beta_0+\beta_1 X_1+ \cdots+\beta_k X_k+\varepsilon\)
    • A linear predictor: \(\beta_0+\beta_1 X_1+ \cdots+\beta_k X_k\)
    • A distribution of the response: Normal distribution
    • A link function between the two: Identity link

Rather than modeling the response \(Y\) directly, model the probability that \(Y\) belongs to a category.

  • Logistic Model: \(logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1 X_1+ \cdots+\beta_k X_k\)
    • A linear predictor: \(\beta_0+\beta_1 X_1+ \cdots+\beta_k X_k\)
    • A distribution of the response: Binomial distribution
    • A link function between the two: Logit link
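
Inverting the logit link recovers the probability from the linear predictor, which is how the fitted model turns feature values into predicted probabilities:

\[p=\frac{e^{\beta_0+\beta_1 X_1+\cdots+\beta_k X_k}}{1+e^{\beta_0+\beta_1 X_1+\cdots+\beta_k X_k}}=\frac{1}{1+e^{-(\beta_0+\beta_1 X_1+\cdots+\beta_k X_k)}}\]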

Simple Logistic Regression

As with simple linear regression, a simple logistic regression has only one feature.

\[logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1 X_1\]

Step 5: Simple Logistic Model

Let’s use the feature Glucose to predict Outcome.

modSimp <- glm(Outcome ~ Glucose, data = trainCaret, family = "binomial")
summary(modSimp)
## 
## Call:
## glm(formula = Outcome ~ Glucose, family = "binomial", data = trainCaret)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2393  -0.7661  -0.5031   0.7952   3.4302  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.880463   0.533361  -11.03   <2e-16 ***
## Glucose      0.042146   0.004133   10.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 696.28  on 537  degrees of freedom
## Residual deviance: 548.32  on 536  degrees of freedom
## AIC: 552.32
## 
## Number of Fisher Scoring iterations: 4

Definitions:

  • Null deviance: how well the intercept-only model fits the data (analogous to the total sum of squares in linear regression).
  • Residual deviance: how well the fitted model fits the data; for a binary response it equals \(-2\) times the maximized log-likelihood.
  • AIC: the Akaike Information Criterion, \(AIC = -2 log(L) + 2k\), where \(k\) is the number of estimated parameters; here \(548.32 + 2\times 2 = 552.32\). Smaller is better.

Plot the model:

# logistic
ggplot(data=trainCaret, aes(x=Glucose, y=Outcome))+
  geom_point()+
  geom_line(aes(x = Glucose, y = modSimp$fitted), color="blue")

Step 6: Interpretation

The estimate for the slope, \(\beta_1\), is

# slope
slope<-modSimp$coefficients[2]
slope
##    Glucose 
## 0.04214601

In SLR a one unit increase in the \(X_1\) direction is related to a \(\beta_1\) change in the response (\(Y\) direction), on average. However, this is more complex in logistic regression. In logistic regression, a one unit increase in the \(X_1\) direction is related to a \(\beta_1\) change in the log of the odds, on average. What does that even mean?

Definition: The odds of an event are the ratio of the probability that it occurs to the probability that it does not, \(\frac{p}{1-p}\). The log-odds (or logit) is the natural log of this ratio.

Since we are using the logit link, we will need to back-transform (by exponentiating both sides) to interpret the slope on the odds scale.

  • \(logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1 (X_1+1)\)
  • \(e^{log(\frac{p}{1-p})}=e^{\beta_0+\beta_1 (X_1+1)}\)
  • \(\frac{p}{1-p}=e^{\beta_0+\beta_1 X_1+\beta_1}\)
  • \(\frac{p}{1-p}=e^{\beta_0+\beta_1 X_1}\times e^{\beta_1}\)

The above derivation illustrates that a one unit increase in \(X_1\) has a multiplicative effect of \(e^{\beta_1}\) on the odds.

Thus, the effect of a one unit change in \(X_1\) is

# exponentiate
exp(slope)
##  Glucose 
## 1.043047
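
So each additional unit of Glucose multiplies the odds of diabetes by about 1.043, roughly a 4.3% increase. These effects compound, so for a larger change we exponentiate the scaled slope. A minimal sketch, reusing the slope object from above:

## ODDS MULTIPLIER FOR A 10-UNIT INCREASE IN GLUCOSE
exp(10*slope)   # e^(10 * 0.0421...) is roughly 1.52: about a 52% increase in the odds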

Step 7: Prediction

We can use the predict() function with the glm model object and the held-out testing data set.

# PREDICT (DEFAULT)
pred1<-predict(modSimp, newdata = testCaret)
head(pred1)
##           2           3          10          13          15          18 
## -2.29805250  1.83225649 -0.61221209 -0.02216795  1.11577432 -1.37084028
Question:
  • What are your observations of these values?

By default, the predict() function returns predictions on the log-odds (link) scale. We would rather work with the probabilities themselves, so let's try the code again with type = "response".

# PREDICT (Response)
pred1R<-predict(modSimp, newdata = testCaret, type="response")
head(pred1R)
##          2          3         10         13         15         18 
## 0.09128438 0.86203032 0.35155475 0.49445824 0.75320406 0.20248412
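
These now look like probabilities. As a sanity check, the response-scale predictions are just the inverse logit of the link-scale predictions from before:

## SANITY CHECK: inverse logit of the log-odds
head(1 / (1 + exp(-pred1)))   # should match head(pred1R)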

Step 8: Thresholding and Confusion Matrix

This method produces estimated probabilities of observing a success (1); therefore, we need to set a threshold to convert them into binary predictions. A natural first choice might be 0.50, which implies that the two kinds of error (false positives and false negatives) carry equal weight.

## CONFUSION MATRIX
conf_mat1<-data.frame(testOutcome=testCaret$Outcome, predOut=pred1R>.5)
table(conf_mat1$predOut, conf_mat1$testOutcome)
##        
##           0   1
##   FALSE 128  42
##   TRUE   22  38
## CORRECT RATE
mean(conf_mat1$predOut==conf_mat1$testOutcome)
## [1] 0.7217391

In reality, errors might have different weights. It might be worse to have a false negative, because that patient could have sought treatment with a proper diagnosis. If we wanted to decrease the rate of false negatives, we would want to lower the threshold so that more patients are flagged as positive. However, the cost of this would be an increase in false positives.

## CONFUSION MATRIX
conf_mat2<-data.frame(testOutcome=testCaret$Outcome, predOut=pred1R>.3)
table(conf_mat2$predOut, conf_mat2$testOutcome)
##        
##          0  1
##   FALSE 99 22
##   TRUE  51 58
## CORRECT RATE
mean(conf_mat2$predOut==conf_mat2$testOutcome)
## [1] 0.6826087
Question (the threshold sweep sketched below may help you explore):
  • What would you set the threshold to in order to have no false positives? What would be the cost of this (in terms of error)?

  • What would you set the threshold to in order to have no false negatives? What would be the cost of this (in terms of error)?
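
Here is a minimal sketch that sweeps a grid of thresholds and counts the false positives and false negatives on the test set; it reuses pred1R and testCaret from above:

## THRESHOLD SWEEP (a sketch)
thresholds <- seq(0.05, 0.95, by = 0.05)
errGrid <- sapply(thresholds, function(t){
  pred <- pred1R > t
  c(falsePos = sum(pred & testCaret$Outcome == 0),
    falseNeg = sum(!pred & testCaret$Outcome == 1),
    accuracy = mean(pred == testCaret$Outcome))
})
round(rbind(threshold = thresholds, errGrid), 3)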

Multiple Logistic Regression

In multiple logistic regression we consider more features!

\[logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_k X_k\]

Step 9: Visualize

library(GGally)
library(tidyverse)
pima$Outcome<-as.factor(pima$Outcome)  # for plotting; trainCaret/testCaret above still hold 0/1

ggpairs(pima, 
        ggplot2::aes(color=factor(Outcome)))

Step 10: Fit a Full Model

modMulti <- glm(Outcome ~ ., data = trainCaret, family = "binomial")

summary(modMulti)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = trainCaret)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7295  -0.6871  -0.3715   0.6500   2.8935  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.864678   0.955914 -10.320  < 2e-16 ***
## Pregnancies               0.089298   0.039861   2.240  0.02507 *  
## Glucose                   0.039068   0.004743   8.238  < 2e-16 ***
## BloodPressure            -0.010305   0.006199  -1.662  0.09645 .  
## SkinThickness            -0.001639   0.008613  -0.190  0.84904    
## Insulin                  -0.001380   0.001071  -1.288  0.19760    
## BMI                       0.101616   0.019038   5.338 9.42e-08 ***
## DiabetesPedigreeFunction  0.827195   0.374813   2.207  0.02732 *  
## Age                       0.033207   0.012437   2.670  0.00759 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 696.28  on 537  degrees of freedom
## Residual deviance: 479.07  on 529  degrees of freedom
## AIC: 497.07
## 
## Number of Fisher Scoring iterations: 5
Question (the sketch below may help you explore):
  • Is the effect of Pregnancies significant at an \(\alpha=0.05\) level?
  • What is the effect of one additional Pregnancy on the odds of having diabetes?
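
To explore these questions, one option is to exponentiate the coefficients, which turns them into multiplicative effects on the odds (odds ratios). A minimal sketch, using Wald intervals from confint.default() in base R:

## ODDS RATIOS WITH WALD 95% CONFIDENCE INTERVALS (a sketch)
exp(cbind(OR = coef(modMulti), confint.default(modMulti)))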

Step 11: Variable Selection

We can perform stepwise selection for a binary response with the built-in step() function, and best subset selection with the bestglm package. Both approaches select a model by minimizing the AIC.

#install.packages("bestglm")
library(bestglm)
## Loading required package: leaps
Backward Selection

When no scope is given, step() performs backward selection by default. Observe that the algorithm starts with all the variables, removes SkinThickness, and then stops because the AIC has been minimized.

## Backward
step(modMulti)
## Start:  AIC=497.07
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - SkinThickness             1   479.11 495.11
## - Insulin                   1   480.72 496.72
## <none>                          479.07 497.07
## - BloodPressure             1   481.87 497.87
## - DiabetesPedigreeFunction  1   484.14 500.14
## - Pregnancies               1   484.17 500.17
## - Age                       1   486.20 502.20
## - BMI                       1   512.94 528.94
## - Glucose                   1   567.47 583.47
## 
## Step:  AIC=495.11
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## <none>                          479.11 495.11
## - Insulin                   1   481.37 495.37
## - BloodPressure             1   482.11 496.11
## - DiabetesPedigreeFunction  1   484.15 498.15
## - Pregnancies               1   484.20 498.20
## - Age                       1   486.34 500.34
## - BMI                       1   516.57 530.57
## - Glucose                   1   570.95 584.95
## 
## Call:  glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
##     data = trainCaret)
## 
## Coefficients:
##              (Intercept)               Pregnancies                   Glucose  
##                -9.858443                  0.089295                  0.039208  
##            BloodPressure                   Insulin                       BMI  
##                -0.010512                 -0.001465                  0.100440  
## DiabetesPedigreeFunction                       Age  
##                 0.818040                  0.033384  
## 
## Degrees of Freedom: 537 Total (i.e. Null);  530 Residual
## Null Deviance:       696.3 
## Residual Deviance: 479.1     AIC: 495.1
Forward Selection

We can perform forward selection by starting from the null model and giving step() a scope that extends up to the full model. The algorithm adds one variable at a time and stops when the AIC can no longer be improved. (With a scope, step() defaults to direction = "both", so the trace below also lists "-" rows where it considers dropping variables.)

## NULL MODEL
mod0 <- glm(Outcome ~ 1, family = binomial, data = trainCaret)

## FORWARD Selection
## (note: the argument is `direction`, not `method`; the default "both"
##  also considers drops, which is why the trace shows "-" rows)
step(mod0, scope = list(upper = modMulti, lower = mod0), 
     direction = "both")
## Start:  AIC=698.28
## Outcome ~ 1
## 
##                            Df Deviance    AIC
## + Glucose                   1   548.32 552.32
## + BMI                       1   634.08 638.08
## + Age                       1   651.92 655.92
## + Pregnancies               1   667.54 671.54
## + DiabetesPedigreeFunction  1   680.46 684.46
## + Insulin                   1   684.13 688.13
## + SkinThickness             1   691.52 695.52
## + BloodPressure             1   693.41 697.41
## <none>                          696.28 698.28
## 
## Step:  AIC=552.32
## Outcome ~ Glucose
## 
##                            Df Deviance    AIC
## + BMI                       1   515.34 521.34
## + Pregnancies               1   529.41 535.41
## + Age                       1   529.85 535.85
## + DiabetesPedigreeFunction  1   542.29 548.29
## <none>                          548.32 552.32
## + SkinThickness             1   546.93 552.93
## + Insulin                   1   546.99 552.99
## + BloodPressure             1   548.30 554.30
## - Glucose                   1   696.28 698.28
## 
## Step:  AIC=521.34
## Outcome ~ Glucose + BMI
## 
##                            Df Deviance    AIC
## + Age                       1   493.66 501.66
## + Pregnancies               1   496.12 504.12
## + DiabetesPedigreeFunction  1   511.20 519.20
## + Insulin                   1   511.86 519.86
## <none>                          515.34 521.34
## + SkinThickness             1   513.92 521.92
## + BloodPressure             1   514.67 522.67
## - BMI                       1   548.32 552.32
## - Glucose                   1   634.08 638.08
## 
## Step:  AIC=501.66
## Outcome ~ Glucose + BMI + Age
## 
##                            Df Deviance    AIC
## + Pregnancies               1   489.25 499.25
## + DiabetesPedigreeFunction  1   489.98 499.98
## + BloodPressure             1   490.54 500.54
## + Insulin                   1   491.39 501.39
## <none>                          493.66 501.66
## + SkinThickness             1   493.00 503.00
## - Age                       1   515.34 521.34
## - BMI                       1   529.85 535.85
## - Glucose                   1   590.70 596.70
## 
## Step:  AIC=499.25
## Outcome ~ Glucose + BMI + Age + Pregnancies
## 
##                            Df Deviance    AIC
## + DiabetesPedigreeFunction  1   484.80 496.80
## + BloodPressure             1   485.65 497.65
## <none>                          489.25 499.25
## + Insulin                   1   487.40 499.40
## + SkinThickness             1   488.62 500.62
## - Pregnancies               1   493.66 501.66
## - Age                       1   496.12 504.12
## - BMI                       1   524.54 532.54
## - Glucose                   1   587.80 595.80
## 
## Step:  AIC=496.8
## Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## + BloodPressure             1   481.37 495.37
## + Insulin                   1   482.11 496.11
## <none>                          484.80 496.80
## + SkinThickness             1   483.47 497.47
## - DiabetesPedigreeFunction  1   489.25 499.25
## - Pregnancies               1   489.98 499.98
## - Age                       1   491.09 501.09
## - BMI                       1   517.85 527.85
## - Glucose                   1   578.06 588.06
## 
## Step:  AIC=495.37
## Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction + 
##     BloodPressure
## 
##                            Df Deviance    AIC
## + Insulin                   1   479.11 495.11
## <none>                          481.37 495.37
## + SkinThickness             1   480.72 496.72
## - BloodPressure             1   484.80 496.80
## - DiabetesPedigreeFunction  1   485.65 497.65
## - Pregnancies               1   487.06 499.06
## - Age                       1   488.78 500.78
## - BMI                       1   517.37 529.37
## - Glucose                   1   575.37 587.37
## 
## Step:  AIC=495.11
## Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction + 
##     BloodPressure + Insulin
## 
##                            Df Deviance    AIC
## <none>                          479.11 495.11
## - Insulin                   1   481.37 495.37
## - BloodPressure             1   482.11 496.11
## + SkinThickness             1   479.07 497.07
## - DiabetesPedigreeFunction  1   484.15 498.15
## - Pregnancies               1   484.20 498.20
## - Age                       1   486.34 500.34
## - BMI                       1   516.57 530.57
## - Glucose                   1   570.95 584.95
## 
## Call:  glm(formula = Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction + 
##     BloodPressure + Insulin, family = binomial, data = trainCaret)
## 
## Coefficients:
##              (Intercept)                   Glucose                       BMI  
##                -9.858443                  0.039208                  0.100440  
##                      Age               Pregnancies  DiabetesPedigreeFunction  
##                 0.033384                  0.089295                  0.818040  
##            BloodPressure                   Insulin  
##                -0.010512                 -0.001465  
## 
## Degrees of Freedom: 537 Total (i.e. Null);  530 Residual
## Null Deviance:       696.3 
## Residual Deviance: 479.1     AIC: 495.1
Best Subset Selection

Best subset selection finds the best model of each size, then compares those winners using an information criterion (here, AIC).

# bestglm expects an Xy data frame with the response as the last column;
# trainCaret already has Outcome in the last position
best.Xy <- trainCaret

# Run best subset selection, ranking models by AIC
bglm.AIC <- bestglm(Xy = best.Xy, family = binomial, IC = "AIC", 
                    TopModels = 10)
## Morgan-Tatar search since family is non-gaussian.
bglm.AIC$BestModels
##    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1         TRUE    TRUE          TRUE         FALSE    TRUE TRUE
## 2         TRUE    TRUE          TRUE         FALSE   FALSE TRUE
## 3         TRUE    TRUE         FALSE         FALSE    TRUE TRUE
## 4         TRUE    TRUE          TRUE          TRUE   FALSE TRUE
## 5         TRUE    TRUE         FALSE         FALSE   FALSE TRUE
## 6         TRUE    TRUE          TRUE          TRUE    TRUE TRUE
## 7         TRUE    TRUE         FALSE          TRUE   FALSE TRUE
## 8         TRUE    TRUE          TRUE         FALSE   FALSE TRUE
## 9         TRUE    TRUE         FALSE          TRUE    TRUE TRUE
## 10        TRUE    TRUE          TRUE         FALSE    TRUE TRUE
##    DiabetesPedigreeFunction  Age Criterion
## 1                      TRUE TRUE  493.1085
## 2                      TRUE TRUE  493.3689
## 3                      TRUE TRUE  494.1125
## 4                      TRUE TRUE  494.7150
## 5                      TRUE TRUE  494.8003
## 6                      TRUE TRUE  495.0723
## 7                      TRUE TRUE  495.4712
## 8                     FALSE TRUE  495.6468
## 9                      TRUE TRUE  495.8685
## 10                    FALSE TRUE  496.1452

Step 12: Best Model

Observe that the best model is agreed upon by the backward, forward, and best subset methods!

## BEST MODEL INCLUDES:
## Pregnancies+Glucose+BloodPressure+Insulin+BMI+DiabetesPedigreeFunction+Age
modBest <- glm(Outcome ~ Pregnancies + Glucose + BloodPressure +
                 Insulin + BMI + DiabetesPedigreeFunction + Age,
               data = trainCaret, family = "binomial")

summary(modBest)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
##     data = trainCaret)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7148  -0.6859  -0.3728   0.6465   2.8887  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.8584426  0.9550310 -10.323  < 2e-16 ***
## Pregnancies               0.0892952  0.0398994   2.238  0.02522 *  
## Glucose                   0.0392085  0.0046882   8.363  < 2e-16 ***
## BloodPressure            -0.0105119  0.0061027  -1.723  0.08498 .  
## Insulin                  -0.0014653  0.0009707  -1.510  0.13114    
## BMI                       0.1004396  0.0179787   5.587 2.32e-08 ***
## DiabetesPedigreeFunction  0.8180399  0.3714597   2.202  0.02765 *  
## Age                       0.0333838  0.0124085   2.690  0.00714 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 696.28  on 537  degrees of freedom
## Residual deviance: 479.11  on 530  degrees of freedom
## AIC: 495.11
## 
## Number of Fisher Scoring iterations: 5

Predict for the best model:

## Predict
predBest<-predict(modBest, newdata = testCaret, type="response")

Threshold and Confusion Matrix:

## Threshold
conf_matBest<-data.frame(testOutcome=testCaret$Outcome, predOut=predBest>.5)

## Confusion Matrix
table(conf_matBest$predOut, conf_matBest$testOutcome)
##        
##           0   1
##   FALSE 126  36
##   TRUE   24  44
## CORRECT RATE
mean(conf_matBest$predOut==conf_matBest$testOutcome)
## [1] 0.7391304

Step 13: Compare Models

Using the same training and testing sets, we can compare the error rates. Which model would you pick?

A. Simple Logistic

Only uses Glucose to predict Outcome.

## Confusion Matrix
table(conf_mat1$predOut, conf_mat1$testOutcome)
##        
##           0   1
##   FALSE 128  42
##   TRUE   22  38
## CORRECT RATE
mean(conf_mat1$predOut==conf_mat1$testOutcome)
## [1] 0.7217391
B. Best Multiple Logistic

Uses Pregnancies, Glucose, BloodPressure, Insulin, BMI, DiabetesPedigreeFunction, and Age to predict Outcome.

## Confusion Matrix
table(conf_matBest$predOut, conf_matBest$testOutcome)
##        
##           0   1
##   FALSE 126  36
##   TRUE   24  44
## CORRECT RATE
mean(conf_matBest$predOut==conf_matBest$testOutcome)
## [1] 0.7391304
C. KNN

Uses the 12 closest neighbors.

From the KNN lesson we had:

## Confusion Matrix
table(knn.pred12,testOut)
##           testOut
## knn.pred12   0   1
##          0 118  37
##          1  32  43
## CORRECT RATE
mean(knn.pred12==testOut)
## [1] 0.7
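
To put the comparison in one place, here is a minimal sketch that computes sensitivity, specificity, and accuracy from the confusion-matrix data frames built above (the KNN objects come from the earlier lesson, so only the two logistic models are shown):

## COMPARE MODELS (a sketch)
perf <- function(cm){
  c(sensitivity = mean(cm$predOut[cm$testOutcome == 1]),
    specificity = mean(!cm$predOut[cm$testOutcome == 0]),
    accuracy    = mean(cm$predOut == cm$testOutcome))
}
rbind(simpleLogistic = perf(conf_mat1),
      bestLogistic   = perf(conf_matBest))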