WEEK 10

Author

Kaenat Gul (N1325553)

Loading Packages

library(tibble)
library(tidyverse)
library(vtable)
library(ggplot2)
library(gplots)
library(graphics)
library(vcd)
library(tidyr)
library(knitr)
library(gmodels)
library(ggpubr)
library(PairedData)
library(multcomp)
library(car)

Logistic Regression Essentials in R:

Logistic regression is used to predict the class (or category) of individuals based on one or multiple predictor variables (x). It is used to model a binary outcome, that is a variable, which can have only two possible values: 0 or 1, yes or no, diseased or non-diseased.

Logistic function:

The standard logistic regression function, for predicting the outcome of an observation given a predictor variable (x), is an s-shaped curve defined as p = exp(y) / [1 + exp(y)] (James et al. 2014). This can be also simply written as p = 1/[1 + exp(-y)]

library(caret)
Warning: package 'caret' was built under R version 4.4.2

Attaching package: 'caret'
The following object is masked from 'package:survival':

    cluster
The following object is masked from 'package:purrr':

    lift
library(dplyr)
theme_set(theme_bw())

Preparing the data:

Performing the following steps might improve the accuracy of your model:

  1. Remove potential outliers
  2. Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.
  3. Remove highly correlated predictors to minimize overfitting. The presence of highly correlated predictors might lead to an unstable model solution.

Here, we’ll use the PimaIndiansDiabetes2 [in mlbench package]for predicting the probability of being diabetes positive based on multiple clinical variables.

# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
    pregnant glucose pressure triceps insulin mass pedigree age diabetes
724        5     117       86      30     105 39.1    0.251  42      neg
460        9     134       74      33      60 25.9    0.460  81      neg
253        2      90       80      14      55 24.4    0.249  24      neg
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing logistic regression:

The R function glm(), for generalized linear model, can be used to compute logistic regression. You need to specify the option family = binomial, which tells to R that we want to fit logistic regression.

# Fit the model
model <- glm( diabetes ~., data = train.data, family = binomial)
# Summarize the model
summary(model)

Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.053e+01  1.440e+00  -7.317 2.54e-13 ***
pregnant     1.005e-01  6.127e-02   1.640  0.10092    
glucose      3.710e-02  6.486e-03   5.719 1.07e-08 ***
pressure    -3.876e-04  1.383e-02  -0.028  0.97764    
triceps      1.418e-02  1.998e-02   0.710  0.47800    
insulin      5.940e-04  1.508e-03   0.394  0.69371    
mass         7.997e-02  3.180e-02   2.515  0.01190 *  
pedigree     1.329e+00  4.823e-01   2.756  0.00585 ** 
age          2.718e-02  2.020e-02   1.346  0.17840    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 398.80  on 313  degrees of freedom
Residual deviance: 267.18  on 305  degrees of freedom
AIC: 285.18

Number of Fisher Scoring iterations: 5
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103

Simple logistic regression:

The simple logistic regression is used to predict the probability of class membership based on one single predictor variable.

model <- glm( diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -6.15882009 0.700096646 -8.797100 1.403974e-18
glucose      0.04327234 0.005341133  8.101716 5.418949e-16

Predictions can be easily made using the function predict(). Use the option type = “response” to directly obtain the probabilities:

newdata <- data.frame(glucose = c(20,  180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
    1     2 
"neg" "pos" 

The logistic function gives an s-shaped probability curve illustrated as follow:

train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model", 
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabete-pos"
    )
`geom_smooth()` using formula = 'y ~ x'

Multiple logistic regression:

model <- glm( diabetes ~ glucose + mass + pregnant, 
                data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -9.32369818 1.125997285 -8.280391 1.227711e-16
glucose      0.03886154 0.005404219  7.190962 6.433636e-13
mass         0.09458458 0.023529905  4.019760 5.825738e-05
pregnant     0.14466661 0.045125729  3.205857 1.346611e-03

Here, we want to include all the predictor variables available in the data set. This is done using ~.:

model <- glm( diabetes ~., data = train.data, family = binomial)
summary(model)$coef
                 Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant     1.005031e-01 0.061266974  1.64041157 1.009196e-01
glucose      3.709621e-02 0.006486093  5.71934633 1.069346e-08
pressure    -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps      1.417771e-02 0.019981885  0.70952823 4.779967e-01
insulin      5.939876e-04 0.001508231  0.39383055 6.937061e-01
mass         7.997447e-02 0.031798907  2.51500698 1.190300e-02
pedigree     1.329149e+00 0.482291020  2.75590704 5.852963e-03
age          2.718224e-02 0.020199295  1.34570257 1.783985e-01

Note that, the functions coef() and summary() can be used to extract only the coefficients, as follow:

coef(model)
  (Intercept)      pregnant       glucose      pressure       triceps 
-1.053400e+01  1.005031e-01  3.709621e-02 -3.875933e-04  1.417771e-02 
      insulin          mass      pedigree           age 
 5.939876e-04  7.997447e-02  1.329149e+00  2.718224e-02 
summary(model )$coef
                 Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant     1.005031e-01 0.061266974  1.64041157 1.009196e-01
glucose      3.709621e-02 0.006486093  5.71934633 1.069346e-08
pressure    -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps      1.417771e-02 0.019981885  0.70952823 4.779967e-01
insulin      5.939876e-04 0.001508231  0.39383055 6.937061e-01
mass         7.997447e-02 0.031798907  2.51500698 1.190300e-02
pedigree     1.329149e+00 0.482291020  2.75590704 5.852963e-03
age          2.718224e-02 0.020199295  1.34570257 1.783985e-01

Interpretation:

The coefficient estimate of the variable glucose is b = 0.045, which is positive. This means that an increase in glucose is associated with increase in the probability of being diabetes-positive. However the coefficient for the variable pressure is b = -0.007, which is negative. This means that an increase in blood pressure will be associated with a decreased probability of being diabetes-positive.

An important concept to understand, for interpreting the logistic beta coefficients, is the odds ratio. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0).

Here, as we have a small number of predictors (n = 9), we can select manually the most significant:

model <- glm( diabetes ~ pregnant + glucose + pressure + mass + pedigree, 
                data = train.data, family = binomial)

Making predictions :

  1. Predict the class membership probabilities of observations based on predictor variables
  2. Assign the observations to the class with highest probability score (i.e above 0.5).
probabilities <- model %>% predict(test.data, type = "response")
head(probabilities)
       19        21        32        55        64        71 
0.1352603 0.5127526 0.6795376 0.7517408 0.2734867 0.1648174 

Check the dummy coding:

contrasts(test.data$diabetes)
    pos
neg   0
pos   1

Predict the class of individuals:

predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
   19    21    32    55    64    71 
"neg" "pos" "pos" "pos" "neg" "neg" 

Assessing model accuracy:

The model accuracy is measured as the proportion of observations that have been correctly classified. Inversely, the classification error is defined as the proportion of observations that have been misclassified.

mean(predicted.classes == test.data$diabetes)
[1] 0.7564103

Discussion:

library("mgcv")
Loading required package: nlme

Attaching package: 'nlme'
The following object is masked from 'package:dplyr':

    collapse
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Fit the model
gam.model <- gam(diabetes ~ s(glucose) + mass + pregnant,
                 data = train.data, family = "binomial")
# Summarize model
summary(gam.model )

Family: binomial 
Link function: logit 

Formula:
diabetes ~ s(glucose) + mass + pregnant

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.59794    0.86982  -5.286 1.25e-07 ***
mass         0.09458    0.02353   4.020 5.83e-05 ***
pregnant     0.14467    0.04513   3.206  0.00135 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
           edf Ref.df Chi.sq p-value    
s(glucose)   1      1  51.71  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.339   Deviance explained = 29.8%
UBRE = -0.083171  Scale est. = 1         n = 314
# Make predictions
probabilities <- gam.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities> 0.5, "pos", "neg")
# Model Accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7820513