WEEK 10
Loading Packages
library(tibble)
library(tidyverse)
library(vtable)
library(ggplot2)
library(gplots)
library(graphics)
library(vcd)
library(tidyr)
library(knitr)
library(gmodels)
library(ggpubr)
library(PairedData)
library(multcomp)
library(car)
Logistic Regression Essentials in R:
Logistic regression is used to predict the class (or category) of individuals based on one or more predictor variables (x). It models a binary outcome, that is, a variable that can take only two possible values: 0 or 1, yes or no, diseased or non-diseased.
Logistic function:
The standard logistic regression function, for predicting the outcome of an observation given a predictor variable (x), is an s-shaped curve defined as p = exp(y) / [1 + exp(y)] (James et al. 2014), where y = b0 + b1*x is the linear predictor. This can also be written more simply as p = 1 / [1 + exp(-y)].
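To make the shape concrete, the logistic function can be evaluated directly in R (a minimal sketch; the intercept b0 = -6 and slope b1 = 0.05 are made-up values, used only to illustrate how y is mapped to a probability):
# Logistic function: maps any real-valued y onto a probability in (0, 1)
logistic <- function(y) 1 / (1 + exp(-y))
# Hypothetical linear predictor with made-up coefficients
x <- c(50, 100, 150, 200)
y <- -6 + 0.05 * x
round(logistic(y), 3)
# Base R's plogis(y) returns the same values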
library(caret)
Warning: package 'caret' was built under R version 4.4.2
Attaching package: 'caret'
The following object is masked from 'package:survival':
cluster
The following object is masked from 'package:purrr':
lift
library(dplyr)
theme_set(theme_bw())
Preparing the data:
Performing the following steps might improve the accuracy of your model:
- Remove potential outliers.
- Make sure that the predictor variables are approximately normally distributed. If not, you can use a log, square-root, or Box-Cox transformation.
- Remove highly correlated predictors to minimize overfitting; the presence of highly correlated predictors might lead to an unstable model solution (see the sketch below).
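As an example of the last step, highly correlated predictors can be screened for with caret (a sketch assuming the numeric predictors are collected in a hypothetical data frame predictors.df):
# Pairwise correlations among the numeric predictors (hypothetical data frame)
cor.matrix <- cor(predictors.df, use = "pairwise.complete.obs")
# Names of predictors whose pairwise correlations exceed the cutoff
caret::findCorrelation(cor.matrix, cutoff = 0.8, names = TRUE)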
Here, we'll use the PimaIndiansDiabetes2 data set [in the mlbench package] for predicting the probability of being diabetes-positive based on multiple clinical variables.
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
pregnant glucose pressure triceps insulin mass pedigree age diabetes
724 5 117 86 30 105 39.1 0.251 42 neg
460 9 134 74 33 60 25.9 0.460 81 neg
253 2 90 80 14 55 24.4 0.249 24 neg
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
Computing logistic regression:
The R function glm(), for generalized linear model, can be used to compute logistic regression. You need to specify the option family = binomial, which tells R to fit a logistic regression model.
# Fit the model
model <- glm(diabetes ~ ., data = train.data, family = binomial)
# Summarize the model
summary(model)
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+01 1.440e+00 -7.317 2.54e-13 ***
pregnant 1.005e-01 6.127e-02 1.640 0.10092
glucose 3.710e-02 6.486e-03 5.719 1.07e-08 ***
pressure -3.876e-04 1.383e-02 -0.028 0.97764
triceps 1.418e-02 1.998e-02 0.710 0.47800
insulin 5.940e-04 1.508e-03 0.394 0.69371
mass 7.997e-02 3.180e-02 2.515 0.01190 *
pedigree 1.329e+00 4.823e-01 2.756 0.00585 **
age 2.718e-02 2.020e-02 1.346 0.17840
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 267.18 on 305 degrees of freedom
AIC: 285.18
Number of Fisher Scoring iterations: 5
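As a rough goodness-of-fit measure, McFadden's pseudo R-squared can be computed from the two deviances reported above (a quick sketch; with the values shown, 1 - 267.18/398.80 is about 0.33):
# McFadden's pseudo R-squared: 1 - residual deviance / null deviance
1 - model$deviance / model$null.deviance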
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103
Simple logistic regression:
Simple logistic regression is used to predict the probability of class membership based on a single predictor variable.
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.15882009 0.700096646 -8.797100 1.403974e-18
glucose 0.04327234 0.005341133 8.101716 5.418949e-16
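To see what these estimates imply, the predicted probability for a given glucose value can be reproduced by hand from the logistic function (a sketch; the value of roughly 0.84 follows from the coefficients shown above):
# p = 1 / (1 + exp(-(b0 + b1 * x))) with the fitted coefficients
b <- coef(model)
unname(1 / (1 + exp(-(b[1] + b[2] * 180))))  # about 0.84 for glucose = 180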
Predictions can easily be made using the function predict(). Use the option type = "response" to directly obtain the probabilities:
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
1 2
"neg" "pos"
The logistic function gives an s-shaped probability curve, illustrated as follows:
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabetes-pos"
  )
`geom_smooth()` using formula = 'y ~ x'
Multiple logistic regression:
model <- glm(diabetes ~ glucose + mass + pregnant,
             data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.32369818 1.125997285 -8.280391 1.227711e-16
glucose 0.03886154 0.005404219 7.190962 6.433636e-13
mass 0.09458458 0.023529905 4.019760 5.825738e-05
pregnant 0.14466661 0.045125729 3.205857 1.346611e-03
Here, we want to include all the predictor variables available in the data set. This is done using ~.:
model <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant 1.005031e-01 0.061266974 1.64041157 1.009196e-01
glucose 3.709621e-02 0.006486093 5.71934633 1.069346e-08
pressure -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps 1.417771e-02 0.019981885 0.70952823 4.779967e-01
insulin 5.939876e-04 0.001508231 0.39383055 6.937061e-01
mass 7.997447e-02 0.031798907 2.51500698 1.190300e-02
pedigree 1.329149e+00 0.482291020 2.75590704 5.852963e-03
age 2.718224e-02 0.020199295 1.34570257 1.783985e-01
Note that the functions coef() and summary() can be used to extract only the coefficients, as follows:
coef(model)
(Intercept) pregnant glucose pressure triceps
-1.053400e+01 1.005031e-01 3.709621e-02 -3.875933e-04 1.417771e-02
insulin mass pedigree age
5.939876e-04 7.997447e-02 1.329149e+00 2.718224e-02
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant 1.005031e-01 0.061266974 1.64041157 1.009196e-01
glucose 3.709621e-02 0.006486093 5.71934633 1.069346e-08
pressure -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps 1.417771e-02 0.019981885 0.70952823 4.779967e-01
insulin 5.939876e-04 0.001508231 0.39383055 6.937061e-01
mass 7.997447e-02 0.031798907 2.51500698 1.190300e-02
pedigree 1.329149e+00 0.482291020 2.75590704 5.852963e-03
age 2.718224e-02 0.020199295 1.34570257 1.783985e-01
Interpretation:
The coefficient estimate of the variable glucose is b = 0.037, which is positive. This means that an increase in glucose is associated with an increased probability of being diabetes-positive. However, the coefficient for the variable pressure is b = -0.0004, which is negative. This means that an increase in blood pressure is associated with a decreased probability of being diabetes-positive.
An important concept for interpreting the logistic beta coefficients is the odds ratio. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that the event will occur (event = 1) given the presence of the predictor (x = 1) to the odds of the event occurring in its absence (x = 0). For a fitted model, exp(b) gives the odds ratio associated with a one-unit increase in the predictor.
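In R, the odds ratios can be obtained by exponentiating the coefficients of the model fitted above (a quick sketch; confint.default() gives Wald-type intervals):
# Odds ratios: exp(b) is the multiplicative change in the odds per one-unit increase
round(exp(coef(model)), 3)
# Approximate 95% confidence intervals on the odds-ratio scale
round(exp(confint.default(model)), 3)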
Here, as we have a small number of predictors, we can manually select the most significant ones:
model <- glm(diabetes ~ pregnant + glucose + pressure + mass + pedigree,
             data = train.data, family = binomial)
Making predictions:
- Predict the class membership probabilities of observations based on predictor variables.
- Assign the observations to the class with the highest probability score (i.e., above 0.5).
probabilities <- model %>% predict(test.data, type = "response")
head(probabilities)
19 21 32 55 64 71
0.1352603 0.5127526 0.6795376 0.7517408 0.2734867 0.1648174
Check the dummy coding (predict() with type = "response" returns the probability of the class coded 1, here "pos"):
contrasts(test.data$diabetes)
pos
neg 0
pos 1
Predict the class of individuals:
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
19 21 32 55 64 71
"neg" "pos" "pos" "pos" "neg" "neg"
Assessing model accuracy:
The model accuracy is measured as the proportion of observations that have been correctly classified. Conversely, the classification error is the proportion of observations that have been misclassified.
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103
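Beyond overall accuracy, the errors can be broken down by class with a confusion matrix (a minimal sketch using base table(); caret::confusionMatrix() gives a more detailed summary including sensitivity and specificity):
# Cross-tabulate predicted vs. observed classes
table(Predicted = predicted.classes, Observed = test.data$diabetes)
# Classification error rate = 1 - accuracy
mean(predicted.classes != test.data$diabetes)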
Discussion:
As an extension, we can relax the linearity assumption for glucose by fitting a generalized additive model (GAM) with the mgcv package, using a smooth term s(glucose):
library("mgcv")
Loading required package: nlme
Attaching package: 'nlme'
The following object is masked from 'package:dplyr':
collapse
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Fit the model
gam.model <- gam(diabetes ~ s(glucose) + mass + pregnant,
                 data = train.data, family = "binomial")
# Summarize model
summary(gam.model)
Family: binomial
Link function: logit
Formula:
diabetes ~ s(glucose) + mass + pregnant
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.59794 0.86982 -5.286 1.25e-07 ***
mass 0.09458 0.02353 4.020 5.83e-05 ***
pregnant 0.14467 0.04513 3.206 0.00135 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(glucose) 1 1 51.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.339 Deviance explained = 29.8%
UBRE = -0.083171 Scale est. = 1 n = 314
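The smooth term's edf (effective degrees of freedom) of 1 indicates that the estimated glucose effect is essentially linear here. The fitted smooth can be inspected directly (a sketch using mgcv's plot method; shade = TRUE adds a confidence band):
# Visualize the fitted smooth for glucose
plot(gam.model, select = 1, shade = TRUE)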
# Make predictions
probabilities <- gam.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model Accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7820513
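The two fits can also be compared on the training data by AIC, which trades off fit against model complexity (a quick sketch; lower is better):
# Compare the logistic regression and GAM fits via AIC
AIC(model, gam.model)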