Introduction

These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.

R code for previous lectures:

Warning about the data sets:

The first slides for this lecture (approx. up to slide 20) contain two examples for which the source data is not available. I have given some thought to the idea of simulating the data from the models, but unfortunately I don’t have the time for that. Thus in this document I will only deal with the last example in this weeks lecture, beginning from slide 26.

Example with the low birth weight data

Reading the data file and creating a data frame.

This example uses the LOWBWT data set. You can get the data set in txt format from the book web site, as described in my notes for Week1 of the course. We read the full data set into R:

LOWBWT = read.table("./data/LOWBWT.txt", header = TRUE)

But since we are only going to use three variables, I will redefine the data set. Besides, the LWD is defined as a factor from the LWT variable in the original data.

(Y = LOWBWT$LOW)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(LWD = factor(ifelse(LOWBWT$LWT >= 110, yes = 0, no = 1)))
##   [1] 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0
##  [36] 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 0 0 1
##  [71] 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
## [106] 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
(AGE = LOWBWT$AGE)
##   [1] 28 29 34 25 25 27 23 24 24 21 32 19 25 16 25 20 21 24 21 20 25 19 19
##  [24] 26 24 17 20 22 27 20 17 25 20 18 18 20 21 26 31 15 23 20 24 15 23 30
##  [47] 22 17 23 17 26 20 26 14 28 14 23 17 21 19 33 20 21 18 21 22 17 29 26
##  [70] 19 19 22 30 18 18 15 25 20 28 32 31 36 28 25 28 17 29 26 17 17 24 35
##  [93] 25 25 29 19 27 31 33 21 19 23 21 18 18 32 19 24 22 22 23 22 30 19 16
## [116] 21 30 20 17 17 23 24 28 26 20 24 28 20 22 22 31 23 16 16 18 25 32 20
## [139] 23 22 32 30 20 23 17 19 23 36 22 24 21 19 25 16 29 29 19 19 30 24 19
## [162] 24 23 20 25 30 22 18 16 32 18 29 33 20 28 14 28 25 16 20 26 21 22 25
## [185] 31 35 19 24 45
LOWBWT = data.frame(Y, LWD, AGE)

Logistic models

The construction of the logistic models follows our usual routine. To keep in sync with the results in the lecture, we begin with a null model:

model0 = glm(Y ~  1, family = binomial(link = "logit"), data = LOWBWT)
(summ0 = summary(model0))
## 
## Call:
## glm(formula = Y ~ 1, family = binomial(link = "logit"), data = LOWBWT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8651  -0.8651  -0.8651   1.5259   1.5259  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.790      0.157  -5.033 4.84e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 234.67  on 188  degrees of freedom
## AIC: 236.67
## 
## Number of Fisher Scoring iterations: 4
logLik(model0)
## 'log Lik.' -117.336 (df=1)

Next we consider a model with the dichotomous LWD as a predictor. We also compute the G statistic for this model and the corresponding p-value of a likelihood ratio test.

model1 = glm(Y ~  LWD, family = binomial(link = "logit"), data = LOWBWT)
(summ1 = summary(model1))
## 
## Call:
## glm(formula = Y ~ LWD, family = binomial(link = "logit"), data = LOWBWT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1774  -0.7734  -0.7734   1.1774   1.6449  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0538     0.1884  -5.594 2.22e-08 ***
## LWD1          1.0538     0.3616   2.914  0.00356 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 226.24  on 187  degrees of freedom
## AIC: 230.24
## 
## Number of Fisher Scoring iterations: 4
logLik(model1)
## 'log Lik.' -113.1206 (df=2)
(Gmodel1 = summ1$null.deviance - summ1$deviance) 
## [1] 8.430839
(pValue1 = pchisq(Gmodel1, df = 1, lower.tail = FALSE))
## [1] 0.003689101

Note that the value summ1$null.deviance equals the deviance of model0:

summ1$null.deviance
## [1] 234.672
summ0$deviance
## [1] 234.672

The next model adds AGE as a covariate:

model2 = glm(Y ~  LWD + AGE, family = binomial(link = "logit"), data = LOWBWT)
(summ2 = summary(model2))
## 
## Call:
## glm(formula = Y ~ LWD + AGE, family = binomial(link = "logit"), 
##     data = LOWBWT)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.316  -0.822  -0.733   1.211   1.858  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.02689    0.76215  -0.035  0.97185   
## LWD1         1.01012    0.36426   2.773  0.00555 **
## AGE         -0.04423    0.03222  -1.373  0.16987   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 224.29  on 186  degrees of freedom
## AIC: 230.29
## 
## Number of Fisher Scoring iterations: 4
logLik(model2)
## 'log Lik.' -112.1434 (df=3)
(Gmodel2 = summ1$deviance - summ2$deviance) 
## [1] 1.954395
(pValue2 = pchisq(Gmodel2, df = 1, lower.tail = FALSE))
## [1] 0.162114

And finally, we consider a model with an interaction term. This model is defined in R by using an asterisk:

model3 = glm(Y ~  LWD * AGE, family = binomial(link = "logit"), data = LOWBWT)
(summ3 = summary(model3))
## 
## Call:
## glm(formula = Y ~ LWD * AGE, family = binomial(link = "logit"), 
##     data = LOWBWT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4257  -0.8554  -0.6960   1.1602   2.0329  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.77450    0.91010   0.851   0.3948  
## LWD1        -1.94409    1.72481  -1.127   0.2597  
## AGE         -0.07957    0.03963  -2.008   0.0447 *
## LWD1:AGE     0.13220    0.07570   1.746   0.0807 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 221.14  on 185  degrees of freedom
## AIC: 229.14
## 
## Number of Fisher Scoring iterations: 4
logLik(model3)
## 'log Lik.' -110.57 (df=4)
(Gmodel3 = summ2$deviance - summ3$deviance) 
## [1] 3.146832
(pValue3 = pchisq(Gmodel3, df = 1, lower.tail = FALSE))
## [1] 0.07607455

Let me mention that an equivalent definition of the model formula is Y ~ LWD + AGE + LWD:AGE (the syntax for model definition in R may seem strange at first, but you will get used to it with practice; you might start by taking a look at the help for formula).

Log Odds Ratio estimate for a given age in the model with interaction.

To estimate the log odds ratio and its variance we are going to need the variance - covariance matrix for this model. That is:

(vcov3 = vcov(model3))
##             (Intercept)        LWD1          AGE     LWD1:AGE
## (Intercept)  0.82827928 -0.82827928 -0.035266546  0.035266546
## LWD1        -0.82827928  2.97495564  0.035266546 -0.127603824
## AGE         -0.03526655  0.03526655  0.001570894 -0.001570894
## LWD1:AGE     0.03526655 -0.12760382 -0.001570894  0.005730240

Now, to make our work easier, we define two functions to get the estimates for the log odds ratio and the variance:

LogOR = function(a){
  coefficients(model3)["LWD1"] +  coefficients(model3)["LWD1:AGE"] * a 
}

VarLogOR = function(a){
    vcov3[2, 2] + a^2 * vcov3[4, 4] + 2 * a * vcov3[2, 4]
}

For example, for a woman aged 30 we get:

LogOR(30)
##     LWD1 
## 2.021813
VarLogOR(30)
## [1] 0.4759425

These values are slighlty different from the values in the lecture pdf (2.016 and 0.452, respectively). A rounding problem, perhaps?

Computing log odd ratios for a set of age values

Using the functions we defined previously, we can easily generalize the above computations to any vector of age values:

(ageValues = 15:45)
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## [24] 38 39 40 41 42 43 44 45

The corresponding estimates for the log odds ratios and variances are

(LogOR.ageValues = LogOR(ageValues))
##  [1] 0.03886225 0.17105900 0.30325574 0.43545249 0.56764923 0.69984597
##  [7] 0.83204272 0.96423946 1.09643621 1.22863295 1.36082970 1.49302644
## [13] 1.62522318 1.75741993 1.88961667 2.02181342 2.15401016 2.28620691
## [19] 2.41840365 2.55060039 2.68279714 2.81499388 2.94719063 3.07938737
## [25] 3.21158411 3.34378086 3.47597760 3.60817435 3.74037109 3.87256784
## [31] 4.00476458
(VarLogOR.ageValues = VarLogOR(ageValues))
##  [1] 0.4361450 0.3585748 0.2924651 0.2378158 0.1946271 0.1628988 0.1426310
##  [8] 0.1338237 0.1364769 0.1505905 0.1761646 0.2131992 0.2616943 0.3216499
## [15] 0.3930659 0.4759425 0.5702795 0.6760770 0.7933349 0.9220534 1.0622323
## [22] 1.2138717 1.3769716 1.5515320 1.7375529 1.9350342 2.1439760 2.3643783
## [29] 2.5962411 2.8395643 3.0943481

You can check these values against the odds ratio that appear in the table on page 30 of the lecture pdf. That table contains only a subset of the values displayed here. I have made the vector into a named vector (with the age as name) to make it easier to check the values.

OR_ageValues = round(exp(LogOR(ageValues)), 1)
names(OR_ageValues) = ageValues
OR_ageValues
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##  1.0  1.2  1.4  1.5  1.8  2.0  2.3  2.6  3.0  3.4  3.9  4.5  5.1  5.8  6.6 
##   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44 
##  7.6  8.6  9.8 11.2 12.8 14.6 16.7 19.1 21.7 24.8 28.3 32.3 36.9 42.1 48.1 
##   45 
## 54.9

Graph of the estimated logit confidence intervals

We are going to plot the above values of the log odd ratios, using the variances to obtain the confidence intervals (note that it is a 90% ci).

par(lwd=4)
plot(ageValues, LogOR.ageValues, type = "l", col="orange", xlim = c(15,35), ylim=c(-2, 5),
     xlab="AGE", ylab="LogOdds")
points(ageValues, LogOR.ageValues, , pch = "·", cex=4, col="orange")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="brown")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="brown")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="blue")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="blue")

To add the linear (non-interaction) model to this graph we need to obtain the log odds ratio for this model, which does not depend on age (see also page 33 of the lecture pdf):

(logOR.mod2 = coefficients(model2)["LWD1"])
##     LWD1 
## 1.010122

Note: I have used a different function to access the coefficients for a model. You can use the function like this to get the coefficient by name, if you like.

The odds ratio is then

(OR = exp(logOR.mod2))
##     LWD1 
## 2.745937

The variance can be obtained from the variance-caovariance matrix for this model:

VarlogOR.mod2 = vcov(model2)[2, 2]

and the standard error for the log odds ratio is:

sqrt(VarlogOR.mod2)
## [1] 0.3642627

With these values we are ready to add the non-interaction model to the plot, to get a figure like the one on the last page of the lecture pdf:

par(lwd=4)
plot(ageValues, LogOR.ageValues, type = "l", col="orange", xlim = c(15,35), ylim=c(-2, 5),
     xlab="AGE", ylab="LogOdds")
points(ageValues, LogOR.ageValues, , pch = "·", cex=4, col="orange")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="brown")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="brown")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="blue")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="blue")
points(ageValues, rep(logOR.mod2, length(ageValues)), type = "l", col="black")
points(ageValues, rep(logOR.mod2, length(ageValues)), pch = "·", cex=4, col="black")
points(ageValues, rep(logOR.mod2 + qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), type = "l", col="green")
points(ageValues, rep(logOR.mod2 + qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), pch = "·", cex=4, col="green")
points(ageValues, rep(logOR.mod2 - qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), pch = "·", cex=4, col="magenta")
points(ageValues, rep(logOR.mod2 - qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), type = "l", col="magenta")
legend("bottom", legend= c("No interaction", "Interaction"),col = c("orange", "black"), bty=1, lwd=3,cex=1)


Thanks for your attention!