Logistic regression was developed to address problems where we want to predict the probability of one of two events happening (binary logistic regression) or one of several events happening (multinomial logistic regression).

The prediction is a probability, which must, of course, lie between 0 and 1. For that to occur, a transformation must be applied to the linear combination of the independent variables that we used in linear regression; this is called the logistic transformation.
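
To make the transformation concrete, here is the standard form it takes. The symbols are notation chosen for this sketch and do not appear in the original text: \pi(x) is the probability of the event, x_1, ..., x_k are the independent variables, and \beta_0, ..., \beta_k are the coefficients.

```latex
% Linear predictor, as in linear regression
\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k

% Logistic transformation: maps any real \eta into the interval (0, 1)
\pi(x) = \frac{e^{\eta}}{1 + e^{\eta}}

% Equivalent statement on the log-odds (logit) scale
\ln\!\left( \frac{\pi(x)}{1 - \pi(x)} \right) = \eta
```

Because e^{\eta} is always positive, \pi(x) can never reach 0 or 1, which is what keeps the predictions valid as probabilities.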

In order to illustrate this method, we’ll use the famous coronary heart disease data (CHD) from the Hosmer and Lemeshow text [1].

First, let’s read in that data and see what it looks like.

setwd("C:/Users/Ken/Google Drive/Courses/eH705/Ch 13 logistic regression")
df <- read.csv("CHDdata.csv") # read in the data file; read.csv() already returns a data frame
attach(df)
head(df)
##   X ID CHD AGE AGRP
## 1 1  1   0  20    1
## 2 2  2   0  23    1
## 3 3  3   0  24    1
## 4 4  4   0  25    1
## 5 5  5   1  25    1
## 6 6  6   0  26    1

The logistic regression model

Here, we will fit a logistic model to explain coronary heart disease (CHD) by using one predictor, Age.

# fit the logistic model with AGE as a predictor
fit<- glm(CHD ~ AGE, family=binomial)
summary(fit) # note the null and residual deviances
## 
## Call:
## glm(formula = CHD ~ AGE, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9718  -0.8456  -0.4576   0.8253   2.2859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
## AGE          0.11092    0.02406   4.610 4.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 107.35  on 98  degrees of freedom
## AIC: 111.35
## 
## Number of Fisher Scoring iterations: 4
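
As a rough guide to reading these coefficients, the sketch below converts them into an odds ratio and a predicted probability. It uses only the estimates printed in the summary above; the age of 50 and the variable names (b0, b1, age_example, odds_ratio, p_chd) are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(fit) output above
b0 <- -5.30945  # intercept
b1 <- 0.11092   # slope for AGE

# Multiplicative change in the odds of CHD for each additional year of age
odds_ratio <- exp(b1)

# Predicted probability of CHD at a hypothetical age of 50:
# plogis() is the inverse logit, applied to the linear predictor
age_example <- 50
p_chd <- plogis(b0 + b1 * age_example)

print(round(odds_ratio, 3))
print(round(p_chd, 3))
```

With the fitted object available, predict(fit, newdata = data.frame(AGE = 50), type = "response") should return the same probability without copying coefficients by hand.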

The “NULL” model

We’d like to know whether the model above, in which Age was used to predict CHD, is any better than the “NULL” model, which has no predictor variables and contains only an intercept.

# fit the null (intercept-only) model
fit0 <- glm(CHD ~ 1, family = binomial)
summary(fit0)
## 
## Call:
## glm(formula = CHD ~ 1, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.060  -1.060  -1.060   1.299   1.299  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.2819     0.2020  -1.395    0.163
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 136.66  on 99  degrees of freedom
## AIC: 138.66
## 
## Number of Fisher Scoring iterations: 4
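
One way to read the single coefficient of this model: with no predictors, the intercept is the log-odds of CHD for the sample as a whole, so back-transforming it gives the overall fitted probability. The short sketch below does this with the estimate printed above; the names b0_null and p_null are illustrative.

```r
# Intercept copied from the summary(fit0) output above
b0_null <- -0.2819

# The inverse logit turns the log-odds back into a probability
p_null <- plogis(b0_null)

print(round(p_null, 2))
```

Every observation receives this same fitted probability under the “NULL” model, which is why it serves as the baseline against which the Age model is judged.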

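Of course, we need some way to test whether the model using Age is any better than the “NULL” model. A likelihood ratio test does this by comparing the deviances of the two nested models; lrtest() from the lmtest package and anova() from base R both perform it.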

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
lrtest(fit, fit0) 
## Likelihood ratio test
## 
## Model 1: CHD ~ AGE
## Model 2: CHD ~ 1
##   #Df  LogLik Df Chisq Pr(>Chisq)    
## 1   2 -53.677                        
## 2   1 -68.331 -1 29.31  6.168e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# or
anova(fit, fit0, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: CHD ~ AGE
## Model 2: CHD ~ 1
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        98     107.35                          
## 2        99     136.66 -1   -29.31 6.168e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
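
Both tests report the same statistic because it is simply the drop in deviance between the two models, referred to a chi-squared distribution with degrees of freedom equal to the number of added parameters. As a sketch, the calculation can be reproduced by hand from the deviances printed in the summaries above; the variable names are illustrative, and the result matches the printed p-value only up to rounding of those inputs.

```r
# Deviances and residual degrees of freedom copied from the output above
null_dev  <- 136.66  # "NULL" model, 99 df
resid_dev <- 107.35  # model with AGE, 98 df

# Likelihood ratio statistic: the reduction in deviance from adding AGE
lr_stat <- null_dev - resid_dev

# One extra parameter (the AGE slope) separates the two models
df_diff <- 99 - 98

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

print(lr_stat)
print(p_value)
```

The same statistic also follows from the log-likelihoods in the lrtest() output, since it equals twice the difference between them.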