In this lesson students will learn …
Motivation/Background:
Sources:
pima<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/diabetes.csv",
header=TRUE)
str(pima)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Our goal in this step is to find feature variables that best separate the two groups (0 = diabetes negative; 1 = diabetes positive).
library(tidyverse)
## GLUCOSE DENSITY
ggplot(data=pima, aes(x=Glucose ,fill=factor(Outcome)))+
geom_density(alpha=.5)
We will use stratified splitting to create a 70/30 training/testing split to build our models.
Note: We will use the same seed as we did for the KNN example so that we can compare.
library(caret)
# Split the data into training and test set
set.seed(314)
caretSamp <- createDataPartition(pima$Outcome ,
p = 0.7,
list = FALSE)
## SPLIT TESTING AND TRAINING
trainCaret <- pima[caretSamp, ]
testCaret <- pima[-caretSamp, ]
“If the only tool you have is a hammer, you tend to see every problem as a nail.” - Abraham Maslow
Linear regression is the most popular model; however, its assumptions are often not met, particularly with a binary response like Outcome.
ggplot(data=trainCaret, aes(x=Glucose, y=Outcome))+
geom_point()+
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
That didn’t work out so well.
So far we’ve covered SLR and MLR, but those only work when we have a continuous response variable. Now let’s extend the model by considering the generalized linear model (GLM) framework.
A GLM consists of three components: (1) a random component, the distribution of the response \(Y\) (e.g., normal or binomial); (2) a systematic component, the linear predictor \(\beta_0+\beta_1 X_1+\cdots+\beta_k X_k\); and (3) a link function that connects the mean of the response to the linear predictor.
Note: In general, GLMs do not have closed-form estimates for the coefficients. We need to use an iterative optimization method (Fisher scoring) to find the estimates.
The SLR and MLR models fit within this umbrella framework.
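For instance, refitting the straight-line model from above as a GLM with a Gaussian family (a quick check, not part of the original lesson) gives the same coefficients as lm():
# SLR as a special case of the GLM: Gaussian family with the identity link
coef(lm(Outcome ~ Glucose, data = trainCaret))
coef(glm(Outcome ~ Glucose, data = trainCaret, family = gaussian))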
Rather than modeling the response \(Y\) directly, model the probability that \(Y\) belongs to a category.
As with simple linear regression, a simple logistic regression has only one feature.
\[logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1 X_1\]
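Solving the logit equation for \(p\) gives the equivalent probability form (the inverse logit, or logistic function):
\[p=\frac{e^{\beta_0+\beta_1 X_1}}{1+e^{\beta_0+\beta_1 X_1}}=\frac{1}{1+e^{-(\beta_0+\beta_1 X_1)}}\]
Unlike the straight-line fit above, this is guaranteed to stay between 0 and 1.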
Let’s use the feature Glucose to predict Outcome.
modSimp <- glm(Outcome ~ Glucose, data = trainCaret, family = "binomial")
summary(modSimp)
##
## Call:
## glm(formula = Outcome ~ Glucose, family = "binomial", data = trainCaret)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2393 -0.7661 -0.5031 0.7952 3.4302
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.880463 0.533361 -11.03 <2e-16 ***
## Glucose 0.042146 0.004133 10.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 696.28 on 537 degrees of freedom
## Residual deviance: 548.32 on 536 degrees of freedom
## AIC: 552.32
##
## Number of Fisher Scoring iterations: 4
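The last line of the summary reports the number of Fisher scoring iterations used to find the estimates. Since there is no closed form, the fit is computed iteratively; the following is a minimal sketch of that idea (iteratively reweighted least squares) for the Glucose-only model, shown only to illustrate what glm() is doing internally:
# MINIMAL IRLS SKETCH (illustration only)
X <- cbind(1, trainCaret$Glucose)   # design matrix: intercept and Glucose
y <- trainCaret$Outcome             # 0/1 response
beta <- c(0, 0)                     # starting values
for(i in 1:25){
  eta <- as.vector(X %*% beta)      # linear predictor
  p <- 1/(1 + exp(-eta))            # inverse logit gives fitted probabilities
  W <- p*(1 - p)                    # working weights
  z <- eta + (y - p)/W              # working response
  betaNew <- solve(t(X) %*% (W*X), t(X) %*% (W*z))
  if(max(abs(betaNew - beta)) < 1e-8) break
  beta <- as.vector(betaNew)
}
beta                                # compare with coef(modSimp)
The result should agree with coef(modSimp) to several decimal places.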
Definitions:
Plot the model:
# logistic
ggplot(data=trainCaret, aes(x=Glucose, y=Outcome))+
geom_point()+
geom_line(aes(x = Glucose, y = modSimp$fitted), color="blue")
The estimate for the slope, \(\beta_1\), is
# slope
slope<-modSimp$coefficients[2]
slope
## Glucose
## 0.04214601
In SLR a one unit increase in the \(X_1\) direction is related to a \(\beta_1\) change in the response (\(Y\) direction), on average. However, this is more complex in logistic regression. In logistic regression, a one unit increase in the \(X_1\) direction is related to a \(\beta_1\) change in the log of the odds, on average. What does that even mean?
Definition: The odds of an event with probability \(p\) are \(\frac{p}{1-p}\), so the log-odds are exactly the logit, \(log(\frac{p}{1-p})\).
Since we are using the logit link, we will need to back-transform (exponentiate) to interpret the slope.
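To see why, write the model on the odds scale and compare the odds at \(X_1=x+1\) to the odds at \(X_1=x\):
\[\frac{p}{1-p}=e^{\beta_0+\beta_1 X_1}\]
\[\frac{odds(x+1)}{odds(x)}=\frac{e^{\beta_0+\beta_1 (x+1)}}{e^{\beta_0+\beta_1 x}}=e^{\beta_1}\]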
The derivation above illustrates that a one unit increase in the \(X_1\) direction has a multiplicative effect of \(e^{\beta_1}\) on the odds.
Thus, the effect of a one unit change in \(X_1\) is
# exponentiate
exp(slope)
## Glucose
## 1.043047
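So a one unit increase in Glucose is associated with an estimated 4.3% increase in the odds of diabetes. If we also want a confidence interval for this odds ratio, one option (a quick sketch, not part of the original lesson) is to exponentiate Wald intervals for the coefficients:
# Wald 95% confidence intervals on the odds-ratio scale
exp(cbind(OR = coef(modSimp), confint.default(modSimp)))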
We can use the predict() function with the glm model object and the partitioned testing data set.
# PREDICT (DEFAULT)
pred1<-predict(modSimp, newdata = testCaret)
head(pred1)
## 2 3 10 13 15 18
## -2.29805250 1.83225649 -0.61221209 -0.02216795 1.11577432 -1.37084028
By default, the predict() function returns predictions on the log-odds scale. We would rather work with the probabilities themselves, so let's try this code again with type = "response".
# PREDICT (Response)
pred1R<-predict(modSimp, newdata = testCaret, type="response")
head(pred1R)
## 2 3 10 13 15 18
## 0.09128438 0.86203032 0.35155475 0.49445824 0.75320406 0.20248412
This method produces estimated probabilities of observing a success (1); therefore, we will need to set a threshold to create a binary prediction. A natural first choice might be to use 0.50 as a threshold. This implies that errors (false positives and false negatives) have equal weight.
## CONFUSION MATRIX
conf_mat1<-data.frame(testOutcome=testCaret$Outcome, predOut=pred1R>.5)
table(conf_mat1$predOut, conf_mat1$testOutcome)
##
## 0 1
## FALSE 128 42
## TRUE 22 38
## CORRECT RATE
mean(conf_mat1$predOut==conf_mat1$testOutcome)
## [1] 0.7217391
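Since caret is already loaded, its confusionMatrix() function can report the same counts along with sensitivity and specificity in one call; a quick sketch (the positive = "1" argument marks diabetes-positive as the class of interest):
# Sensitivity/specificity summary for the 0.50 threshold using caret
confusionMatrix(factor(as.numeric(pred1R > .5), levels = c(0, 1)),
                factor(testCaret$Outcome, levels = c(0, 1)),
                positive = "1")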
In reality, errors might have different weights. It might be worse to have a false negative, because that patient could have sought treatment if they had the proper diagnosis. If we wanted to decrease the rate of false negatives, we would want to lower the threshold. However, the cost of this would be an increase in false positives.
## CONFUSION MATRIX
conf_mat2<-data.frame(testOutcome=testCaret$Outcome, predOut=pred1R>.3)
table(conf_mat2$predOut, conf_mat2$testOutcome)
##
## 0 1
## FALSE 99 22
## TRUE 51 58
## CORRECT RATE
mean(conf_mat2$predOut==conf_mat2$testOutcome)
## [1] 0.6826087
What would you set the threshold to in order to have no false positives? What would be the cost of this (in terms of error)?
What would you set the threshold to in order to have no false negatives? What would be the cost of this (in terms of error)?
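One way to explore these questions is to count the false positives and false negatives over a grid of thresholds; a rough sketch:
# Count false positives and false negatives across a grid of thresholds
thresholds <- seq(0, 1, by = 0.05)
t(sapply(thresholds, function(t){
  pred <- pred1R > t
  c(threshold = t,
    falsePos  = sum(pred & testCaret$Outcome == 0),
    falseNeg  = sum(!pred & testCaret$Outcome == 1))
}))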
In multiple logistic regression we consider more features!
\[logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_k X_k\]
library(GGally)
library(tidyverse)
pima$Outcome<-as.factor(pima$Outcome)
ggpairs(pima,
ggplot2::aes(color=factor(Outcome)))
modMulti<-glm(Outcome ~.,
data = trainCaret, family = "binomial")
summary(modMulti)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = trainCaret)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7295 -0.6871 -0.3715 0.6500 2.8935
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.864678 0.955914 -10.320 < 2e-16 ***
## Pregnancies 0.089298 0.039861 2.240 0.02507 *
## Glucose 0.039068 0.004743 8.238 < 2e-16 ***
## BloodPressure -0.010305 0.006199 -1.662 0.09645 .
## SkinThickness -0.001639 0.008613 -0.190 0.84904
## Insulin -0.001380 0.001071 -1.288 0.19760
## BMI 0.101616 0.019038 5.338 9.42e-08 ***
## DiabetesPedigreeFunction 0.827195 0.374813 2.207 0.02732 *
## Age 0.033207 0.012437 2.670 0.00759 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 696.28 on 537 degrees of freedom
## Residual deviance: 479.07 on 529 degrees of freedom
## AIC: 497.07
##
## Number of Fisher Scoring iterations: 5
To perform variable selection for a binary response, we can use the step() function for stepwise selection and the bestglm package for best subset selection. Both approaches here select a model by minimizing the AIC.
#install.packages("bestglm")
library(bestglm)
## Loading required package: leaps
Backward selection is the default of the step() function. Observe that the algorithm starts with all the variables, then removes SkinThickness, and then stops because it has minimized the AIC.
## Backward
step(modMulti)
## Start: AIC=497.07
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - SkinThickness 1 479.11 495.11
## - Insulin 1 480.72 496.72
## <none> 479.07 497.07
## - BloodPressure 1 481.87 497.87
## - DiabetesPedigreeFunction 1 484.14 500.14
## - Pregnancies 1 484.17 500.17
## - Age 1 486.20 502.20
## - BMI 1 512.94 528.94
## - Glucose 1 567.47 583.47
##
## Step: AIC=495.11
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## <none> 479.11 495.11
## - Insulin 1 481.37 495.37
## - BloodPressure 1 482.11 496.11
## - DiabetesPedigreeFunction 1 484.15 498.15
## - Pregnancies 1 484.20 498.20
## - Age 1 486.34 500.34
## - BMI 1 516.57 530.57
## - Glucose 1 570.95 584.95
##
## Call: glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial",
## data = trainCaret)
##
## Coefficients:
## (Intercept) Pregnancies Glucose
## -9.858443 0.089295 0.039208
## BloodPressure Insulin BMI
## -0.010512 -0.001465 0.100440
## DiabetesPedigreeFunction Age
## 0.818040 0.033384
##
## Degrees of Freedom: 537 Total (i.e. Null); 530 Residual
## Null Deviance: 696.3
## Residual Deviance: 479.1 AIC: 495.1
We can perform forward selection by specifying the full model and the null model. This algorithm works by adding one variable in at a time and then stops when the AIC is minimized.
## NULL MODEL
mod0 <- glm(Outcome ~ 1, family = binomial, data = trainCaret)
## FORWARD Selection
## (note: step()'s argument is `direction`, not `method`; the unrecognized
##  `method` argument is ignored here, so the default direction "both" is used,
##  which is why the trace below also considers dropping terms)
step(mod0, scope = list(upper = modMulti, lower = mod0),
     method = "forward")
## Start: AIC=698.28
## Outcome ~ 1
##
## Df Deviance AIC
## + Glucose 1 548.32 552.32
## + BMI 1 634.08 638.08
## + Age 1 651.92 655.92
## + Pregnancies 1 667.54 671.54
## + DiabetesPedigreeFunction 1 680.46 684.46
## + Insulin 1 684.13 688.13
## + SkinThickness 1 691.52 695.52
## + BloodPressure 1 693.41 697.41
## <none> 696.28 698.28
##
## Step: AIC=552.32
## Outcome ~ Glucose
##
## Df Deviance AIC
## + BMI 1 515.34 521.34
## + Pregnancies 1 529.41 535.41
## + Age 1 529.85 535.85
## + DiabetesPedigreeFunction 1 542.29 548.29
## <none> 548.32 552.32
## + SkinThickness 1 546.93 552.93
## + Insulin 1 546.99 552.99
## + BloodPressure 1 548.30 554.30
## - Glucose 1 696.28 698.28
##
## Step: AIC=521.34
## Outcome ~ Glucose + BMI
##
## Df Deviance AIC
## + Age 1 493.66 501.66
## + Pregnancies 1 496.12 504.12
## + DiabetesPedigreeFunction 1 511.20 519.20
## + Insulin 1 511.86 519.86
## <none> 515.34 521.34
## + SkinThickness 1 513.92 521.92
## + BloodPressure 1 514.67 522.67
## - BMI 1 548.32 552.32
## - Glucose 1 634.08 638.08
##
## Step: AIC=501.66
## Outcome ~ Glucose + BMI + Age
##
## Df Deviance AIC
## + Pregnancies 1 489.25 499.25
## + DiabetesPedigreeFunction 1 489.98 499.98
## + BloodPressure 1 490.54 500.54
## + Insulin 1 491.39 501.39
## <none> 493.66 501.66
## + SkinThickness 1 493.00 503.00
## - Age 1 515.34 521.34
## - BMI 1 529.85 535.85
## - Glucose 1 590.70 596.70
##
## Step: AIC=499.25
## Outcome ~ Glucose + BMI + Age + Pregnancies
##
## Df Deviance AIC
## + DiabetesPedigreeFunction 1 484.80 496.80
## + BloodPressure 1 485.65 497.65
## <none> 489.25 499.25
## + Insulin 1 487.40 499.40
## + SkinThickness 1 488.62 500.62
## - Pregnancies 1 493.66 501.66
## - Age 1 496.12 504.12
## - BMI 1 524.54 532.54
## - Glucose 1 587.80 595.80
##
## Step: AIC=496.8
## Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction
##
## Df Deviance AIC
## + BloodPressure 1 481.37 495.37
## + Insulin 1 482.11 496.11
## <none> 484.80 496.80
## + SkinThickness 1 483.47 497.47
## - DiabetesPedigreeFunction 1 489.25 499.25
## - Pregnancies 1 489.98 499.98
## - Age 1 491.09 501.09
## - BMI 1 517.85 527.85
## - Glucose 1 578.06 588.06
##
## Step: AIC=495.37
## Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction +
## BloodPressure
##
## Df Deviance AIC
## + Insulin 1 479.11 495.11
## <none> 481.37 495.37
## + SkinThickness 1 480.72 496.72
## - BloodPressure 1 484.80 496.80
## - DiabetesPedigreeFunction 1 485.65 497.65
## - Pregnancies 1 487.06 499.06
## - Age 1 488.78 500.78
## - BMI 1 517.37 529.37
## - Glucose 1 575.37 587.37
##
## Step: AIC=495.11
## Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction +
## BloodPressure + Insulin
##
## Df Deviance AIC
## <none> 479.11 495.11
## - Insulin 1 481.37 495.37
## - BloodPressure 1 482.11 496.11
## + SkinThickness 1 479.07 497.07
## - DiabetesPedigreeFunction 1 484.15 498.15
## - Pregnancies 1 484.20 498.20
## - Age 1 486.34 500.34
## - BMI 1 516.57 530.57
## - Glucose 1 570.95 584.95
##
## Call: glm(formula = Outcome ~ Glucose + BMI + Age + Pregnancies + DiabetesPedigreeFunction +
## BloodPressure + Insulin, family = binomial, data = trainCaret)
##
## Coefficients:
## (Intercept) Glucose BMI
## -9.858443 0.039208 0.100440
## Age Pregnancies DiabetesPedigreeFunction
## 0.033384 0.089295 0.818040
## BloodPressure Insulin
## -0.010512 -0.001465
##
## Degrees of Freedom: 537 Total (i.e. Null); 530 Residual
## Null Deviance: 696.3
## Residual Deviance: 479.1 AIC: 495.1
Best subset selection chooses the best model of each given size and then compares them by AIC.
# the Xy matrix needs y as the right-most variable
# this one already has Outcome as the last column
best.Xy <- trainCaret
# Run best subset
bglm.AIC = bestglm(Xy = best.Xy, family = binomial, IC = "AIC",
TopModels = 10)
## Morgan-Tatar search since family is non-gaussian.
bglm.AIC$BestModels
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 TRUE TRUE TRUE FALSE TRUE TRUE
## 2 TRUE TRUE TRUE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE FALSE TRUE TRUE
## 4 TRUE TRUE TRUE TRUE FALSE TRUE
## 5 TRUE TRUE FALSE FALSE FALSE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE FALSE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE FALSE FALSE TRUE
## 9 TRUE TRUE FALSE TRUE TRUE TRUE
## 10 TRUE TRUE TRUE FALSE TRUE TRUE
## DiabetesPedigreeFunction Age Criterion
## 1 TRUE TRUE 493.1085
## 2 TRUE TRUE 493.3689
## 3 TRUE TRUE 494.1125
## 4 TRUE TRUE 494.7150
## 5 TRUE TRUE 494.8003
## 6 TRUE TRUE 495.0723
## 7 TRUE TRUE 495.4712
## 8 FALSE TRUE 495.6468
## 9 TRUE TRUE 495.8685
## 10 FALSE TRUE 496.1452
Observe that the best model is agreed upon by the backward, forward, and best subset methods!
## BEST MODEL INCLUDES:
## Pregnancies+Glucose+BloodPressure+Insulin+BMI+DiabetesPedigreeFunction+Age
modBest<-glm(Outcome ~Pregnancies+Glucose+BloodPressure+
Insulin+BMI+DiabetesPedigreeFunction+Age,
data = trainCaret, family = "binomial")
summary(modBest)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial",
## data = trainCaret)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7148 -0.6859 -0.3728 0.6465 2.8887
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.8584426 0.9550310 -10.323 < 2e-16 ***
## Pregnancies 0.0892952 0.0398994 2.238 0.02522 *
## Glucose 0.0392085 0.0046882 8.363 < 2e-16 ***
## BloodPressure -0.0105119 0.0061027 -1.723 0.08498 .
## Insulin -0.0014653 0.0009707 -1.510 0.13114
## BMI 0.1004396 0.0179787 5.587 2.32e-08 ***
## DiabetesPedigreeFunction 0.8180399 0.3714597 2.202 0.02765 *
## Age 0.0333838 0.0124085 2.690 0.00714 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 696.28 on 537 degrees of freedom
## Residual deviance: 479.11 on 530 degrees of freedom
## AIC: 495.11
##
## Number of Fisher Scoring iterations: 5
Predict for the best model:
## Predict
predBest<-predict(modBest, newdata = testCaret, type="response")
Threshold and Confusion Matrix:
## Threshold
conf_matBest<-data.frame(testOutcome=testCaret$Outcome, predOut=predBest>.5)
## Confusion Matrix
table(conf_matBest$predOut, conf_matBest$testOutcome)
##
## 0 1
## FALSE 126 36
## TRUE 24 44
## CORRECT RATE
mean(conf_matBest$predOut==conf_matBest$testOutcome)
## [1] 0.7391304
Using the same training and testing sets, we can compare the error rates. Which model would you pick?
Simple logistic regression: only uses Glucose to predict Outcome.
## Confusion Matrix
table(conf_mat1$predOut, conf_mat1$testOutcome)
##
## 0 1
## FALSE 128 42
## TRUE 22 38
## CORRECT RATE
mean(conf_mat1$predOut==conf_mat1$testOutcome)
## [1] 0.7217391
Best multiple logistic regression: uses Pregnancies, Glucose, BloodPressure, Insulin, BMI, DiabetesPedigreeFunction, and Age to predict Outcome.
## Confusion Matrix
table(conf_matBest$predOut, conf_matBest$testOutcome)
##
## 0 1
## FALSE 126 36
## TRUE 24 44
## CORRECT RATE
mean(conf_matBest$predOut==conf_matBest$testOutcome)
## [1] 0.7391304
KNN: uses the 12 closest neighbors. From the KNN lesson we had:
## Confusion Matrix
table(knn.pred12,testOut)
## testOut
## knn.pred12 0 1
## 0 118 37
## 1 32 43
## CORRECT RATE
mean(knn.pred12==testOut)
## [1] 0.7
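To line the three models up side by side, we can collect the test-set accuracies in a small data frame (this assumes knn.pred12 and testOut from the KNN lesson are still in the workspace):
# Side-by-side test-set accuracy for the three models
data.frame(model    = c("Simple logistic (Glucose only)",
                        "Multiple logistic (best subset)",
                        "KNN (k = 12)"),
           accuracy = c(mean(conf_mat1$predOut == conf_mat1$testOutcome),
                        mean(conf_matBest$predOut == conf_matBest$testOutcome),
                        mean(knn.pred12 == testOut)))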