These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.
The first slides for this lecture (approx. up to slide 20) contain two examples for which the source data is not available. I have given some thought to the idea of simulating the data from the models, but unfortunately I don’t have the time for that. Thus in this document I will only deal with the last example in this weeks lecture, beginning from slide 26.
https://github.com/fernandosansegundo/LogisticRegressionCoursera
This example uses the LOWBWT data set. You can get the data set in txt format from the book web site, as described in my notes for Week1 of the course. We read the full data set into R:
LOWBWT = read.table("./data/LOWBWT.txt", header = TRUE)
But since we are only going to use three variables, I will redefine the data set. Besides, the LWD is defined as a factor from the LWT variable in the original data.
(Y = LOWBWT$LOW)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(LWD = factor(ifelse(LOWBWT$LWT >= 110, yes = 0, no = 1)))
## [1] 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0
## [36] 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 0 0 1
## [71] 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
## [106] 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
(AGE = LOWBWT$AGE)
## [1] 28 29 34 25 25 27 23 24 24 21 32 19 25 16 25 20 21 24 21 20 25 19 19
## [24] 26 24 17 20 22 27 20 17 25 20 18 18 20 21 26 31 15 23 20 24 15 23 30
## [47] 22 17 23 17 26 20 26 14 28 14 23 17 21 19 33 20 21 18 21 22 17 29 26
## [70] 19 19 22 30 18 18 15 25 20 28 32 31 36 28 25 28 17 29 26 17 17 24 35
## [93] 25 25 29 19 27 31 33 21 19 23 21 18 18 32 19 24 22 22 23 22 30 19 16
## [116] 21 30 20 17 17 23 24 28 26 20 24 28 20 22 22 31 23 16 16 18 25 32 20
## [139] 23 22 32 30 20 23 17 19 23 36 22 24 21 19 25 16 29 29 19 19 30 24 19
## [162] 24 23 20 25 30 22 18 16 32 18 29 33 20 28 14 28 25 16 20 26 21 22 25
## [185] 31 35 19 24 45
LOWBWT = data.frame(Y, LWD, AGE)
The construction of the logistic models follows our usual routine. To keep in sync with the results in the lecture, we begin with a null model:
model0 = glm(Y ~ 1, family = binomial(link = "logit"), data = LOWBWT)
(summ0 = summary(model0))
##
## Call:
## glm(formula = Y ~ 1, family = binomial(link = "logit"), data = LOWBWT)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8651 -0.8651 -0.8651 1.5259 1.5259
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.790 0.157 -5.033 4.84e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 234.67 on 188 degrees of freedom
## AIC: 236.67
##
## Number of Fisher Scoring iterations: 4
logLik(model0)
## 'log Lik.' -117.336 (df=1)
Next we consider a model with the dichotomous LWD as a predictor. We also compute the G statistic for this model and the corresponding p-value of a likelihood ratio test.
model1 = glm(Y ~ LWD, family = binomial(link = "logit"), data = LOWBWT)
(summ1 = summary(model1))
##
## Call:
## glm(formula = Y ~ LWD, family = binomial(link = "logit"), data = LOWBWT)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1774 -0.7734 -0.7734 1.1774 1.6449
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0538 0.1884 -5.594 2.22e-08 ***
## LWD1 1.0538 0.3616 2.914 0.00356 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 226.24 on 187 degrees of freedom
## AIC: 230.24
##
## Number of Fisher Scoring iterations: 4
logLik(model1)
## 'log Lik.' -113.1206 (df=2)
(Gmodel1 = summ1$null.deviance - summ1$deviance)
## [1] 8.430839
(pValue1 = pchisq(Gmodel1, df = 1, lower.tail = FALSE))
## [1] 0.003689101
Note that the value summ1$null.deviance equals the deviance of model0:
summ1$null.deviance
## [1] 234.672
summ0$deviance
## [1] 234.672
The next model adds AGE as a covariate:
model2 = glm(Y ~ LWD + AGE, family = binomial(link = "logit"), data = LOWBWT)
(summ2 = summary(model2))
##
## Call:
## glm(formula = Y ~ LWD + AGE, family = binomial(link = "logit"),
## data = LOWBWT)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.316 -0.822 -0.733 1.211 1.858
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.02689 0.76215 -0.035 0.97185
## LWD1 1.01012 0.36426 2.773 0.00555 **
## AGE -0.04423 0.03222 -1.373 0.16987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 224.29 on 186 degrees of freedom
## AIC: 230.29
##
## Number of Fisher Scoring iterations: 4
logLik(model2)
## 'log Lik.' -112.1434 (df=3)
(Gmodel2 = summ1$deviance - summ2$deviance)
## [1] 1.954395
(pValue2 = pchisq(Gmodel2, df = 1, lower.tail = FALSE))
## [1] 0.162114
And finally, we consider a model with an interaction term. This model is defined in R by using an asterisk:
model3 = glm(Y ~ LWD * AGE, family = binomial(link = "logit"), data = LOWBWT)
(summ3 = summary(model3))
##
## Call:
## glm(formula = Y ~ LWD * AGE, family = binomial(link = "logit"),
## data = LOWBWT)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4257 -0.8554 -0.6960 1.1602 2.0329
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.77450 0.91010 0.851 0.3948
## LWD1 -1.94409 1.72481 -1.127 0.2597
## AGE -0.07957 0.03963 -2.008 0.0447 *
## LWD1:AGE 0.13220 0.07570 1.746 0.0807 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 221.14 on 185 degrees of freedom
## AIC: 229.14
##
## Number of Fisher Scoring iterations: 4
logLik(model3)
## 'log Lik.' -110.57 (df=4)
(Gmodel3 = summ2$deviance - summ3$deviance)
## [1] 3.146832
(pValue3 = pchisq(Gmodel3, df = 1, lower.tail = FALSE))
## [1] 0.07607455
Let me mention that an equivalent definition of the model formula is Y ~ LWD + AGE + LWD:AGE (the syntax for model definition in R may seem strange at first, but you will get used to it with practice; you might start by taking a look at the help for formula).
To estimate the log odds ratio and its variance we are going to need the variance - covariance matrix for this model. That is:
(vcov3 = vcov(model3))
## (Intercept) LWD1 AGE LWD1:AGE
## (Intercept) 0.82827928 -0.82827928 -0.035266546 0.035266546
## LWD1 -0.82827928 2.97495564 0.035266546 -0.127603824
## AGE -0.03526655 0.03526655 0.001570894 -0.001570894
## LWD1:AGE 0.03526655 -0.12760382 -0.001570894 0.005730240
Now, to make our work easier, we define two functions to get the estimates for the log odds ratio and the variance:
LogOR = function(a){
coefficients(model3)["LWD1"] + coefficients(model3)["LWD1:AGE"] * a
}
VarLogOR = function(a){
vcov3[2, 2] + a^2 * vcov3[4, 4] + 2 * a * vcov3[2, 4]
}
For example, for a woman aged 30 we get:
LogOR(30)
## LWD1
## 2.021813
VarLogOR(30)
## [1] 0.4759425
These values are slighlty different from the values in the lecture pdf (2.016 and 0.452, respectively). A rounding problem, perhaps?
Using the functions we defined previously, we can easily generalize the above computations to any vector of age values:
(ageValues = 15:45)
## [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## [24] 38 39 40 41 42 43 44 45
The corresponding estimates for the log odds ratios and variances are
(LogOR.ageValues = LogOR(ageValues))
## [1] 0.03886225 0.17105900 0.30325574 0.43545249 0.56764923 0.69984597
## [7] 0.83204272 0.96423946 1.09643621 1.22863295 1.36082970 1.49302644
## [13] 1.62522318 1.75741993 1.88961667 2.02181342 2.15401016 2.28620691
## [19] 2.41840365 2.55060039 2.68279714 2.81499388 2.94719063 3.07938737
## [25] 3.21158411 3.34378086 3.47597760 3.60817435 3.74037109 3.87256784
## [31] 4.00476458
(VarLogOR.ageValues = VarLogOR(ageValues))
## [1] 0.4361450 0.3585748 0.2924651 0.2378158 0.1946271 0.1628988 0.1426310
## [8] 0.1338237 0.1364769 0.1505905 0.1761646 0.2131992 0.2616943 0.3216499
## [15] 0.3930659 0.4759425 0.5702795 0.6760770 0.7933349 0.9220534 1.0622323
## [22] 1.2138717 1.3769716 1.5515320 1.7375529 1.9350342 2.1439760 2.3643783
## [29] 2.5962411 2.8395643 3.0943481
You can check these values against the odds ratio that appear in the table on page 30 of the lecture pdf. That table contains only a subset of the values displayed here. I have made the vector into a named vector (with the age as name) to make it easier to check the values.
OR_ageValues = round(exp(LogOR(ageValues)), 1)
names(OR_ageValues) = ageValues
OR_ageValues
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 1.0 1.2 1.4 1.5 1.8 2.0 2.3 2.6 3.0 3.4 3.9 4.5 5.1 5.8 6.6
## 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 7.6 8.6 9.8 11.2 12.8 14.6 16.7 19.1 21.7 24.8 28.3 32.3 36.9 42.1 48.1
## 45
## 54.9
We are going to plot the above values of the log odd ratios, using the variances to obtain the confidence intervals (note that it is a 90% ci).
par(lwd=4)
plot(ageValues, LogOR.ageValues, type = "l", col="orange", xlim = c(15,35), ylim=c(-2, 5),
xlab="AGE", ylab="LogOdds")
points(ageValues, LogOR.ageValues, , pch = "·", cex=4, col="orange")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="brown")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="brown")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="blue")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="blue")
To add the linear (non-interaction) model to this graph we need to obtain the log odds ratio for this model, which does not depend on age (see also page 33 of the lecture pdf):
(logOR.mod2 = coefficients(model2)["LWD1"])
## LWD1
## 1.010122
Note: I have used a different function to access the coefficients for a model. You can use the function like this to get the coefficient by name, if you like.
The odds ratio is then
(OR = exp(logOR.mod2))
## LWD1
## 2.745937
The variance can be obtained from the variance-caovariance matrix for this model:
VarlogOR.mod2 = vcov(model2)[2, 2]
and the standard error for the log odds ratio is:
sqrt(VarlogOR.mod2)
## [1] 0.3642627
With these values we are ready to add the non-interaction model to the plot, to get a figure like the one on the last page of the lecture pdf:
par(lwd=4)
plot(ageValues, LogOR.ageValues, type = "l", col="orange", xlim = c(15,35), ylim=c(-2, 5),
xlab="AGE", ylab="LogOdds")
points(ageValues, LogOR.ageValues, , pch = "·", cex=4, col="orange")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="brown")
points(ageValues, LogOR.ageValues + qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="brown")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), type = "l", col="blue")
points(ageValues, LogOR.ageValues - qnorm(0.95) * sqrt(VarLogOR.ageValues), pch = "·", cex=4, col="blue")
points(ageValues, rep(logOR.mod2, length(ageValues)), type = "l", col="black")
points(ageValues, rep(logOR.mod2, length(ageValues)), pch = "·", cex=4, col="black")
points(ageValues, rep(logOR.mod2 + qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), type = "l", col="green")
points(ageValues, rep(logOR.mod2 + qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), pch = "·", cex=4, col="green")
points(ageValues, rep(logOR.mod2 - qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), pch = "·", cex=4, col="magenta")
points(ageValues, rep(logOR.mod2 - qnorm(0.95) * sqrt(VarlogOR.mod2), length(ageValues)), type = "l", col="magenta")
legend("bottom", legend= c("No interaction", "Interaction"),col = c("orange", "black"), bty=1, lwd=3,cex=1)
Thanks for your attention!