These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.
https://github.com/fernandosansegundo/LogisticRegressionCoursera
In the first part of the lecture we will be working with the Low Birth Weight data set, so we begin by loading it into a data.frame.
LOWBWTdata = read.table("./data/LOWBWT.txt", header = TRUE)
This first section will be more theoretical in nature, for those of us who like to know some details about the maths behind the computations. So, if you are satisfied with using vcov in R to get the variance-covariance matrix, you can skip to the next section. If you choose to stay… you have been warned :)
In the previous lecture we used RACE as a predictor variable, and in my R code I converted it to a factor to let R take care of the details for us. But for some of the work in this section it is actually better to manually create two dummy variables RACE2 and RACE3 with the same meaning as the ones we have seen in the Stata code in the lectures (where they are called _Irace_2 and _Irace_3, respectively). We will see the details below, but the reason is that I want direct access to the values of these dummy variables: when we let R create them automatically, they are not added to the original data frame, and that complicates the code.
Thus, let me start by creating the dummies:
LOWBWTdata$RACE2 = as.integer(LOWBWTdata$RACE == 2)
LOWBWTdata$RACE3 = as.integer(LOWBWTdata$RACE == 3)
To create RACE2 the == operator compares the value of RACE against 2 (or 3 for RACE3) and returns a logical value (TRUE or FALSE in R). These logical values are then converted to 0s and 1s by as.integer.
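A tiny illustration of that conversion, with some made-up values:
as.integer(c(1, 2, 3, 2) == 2)   # the comparison gives FALSE TRUE FALSE TRUE, which becomes 0 1 0 1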
Now we are ready to fit the logistic model with LWT, RACE2 and RACE3 as predictors.
glmLOWBWT = glm(LOW ~ LWT + RACE2 + RACE3, family = binomial(link = "logit"), data = LOWBWTdata)
summary(glmLOWBWT)
##
## Call:
## glm(formula = LOW ~ LWT + RACE2 + RACE3, family = binomial(link = "logit"),
## data = LOWBWTdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3491 -0.8919 -0.7196 1.2526 2.0993
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.805753 0.845167 0.953 0.3404
## LWT -0.015223 0.006439 -2.364 0.0181 *
## RACE2 1.081066 0.488052 2.215 0.0268 *
## RACE3 0.480603 0.356674 1.347 0.1778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 223.26 on 185 degrees of freedom
## AIC: 231.26
##
## Number of Fisher Scoring iterations: 4
You can check the coefficients of the resulting model against what we called the reduced model in Lecture 2 and you will see that this is the exact same model. As I said, the advantages of this direct coding of the dummies will soon become apparent.
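If you want to double-check this (just a quick sketch, not part of the lecture code), fitting the same model with RACE converted to a factor produces identical estimates; only the names of the dummies change:
glmFactor = glm(LOW ~ LWT + factor(RACE), family = binomial(link = "logit"), data = LOWBWTdata)
coef(glmFactor)    # same coefficients as glmLOWBWT, with RACE = 1 as the reference level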
We have already seen that the simplest way to get the variance-covariance matrix for the model is:
(vcovLOWBWT = vcov(glmLOWBWT))
## (Intercept) LWT RACE2 RACE3
## (Intercept) 0.714306595 -5.213709e-03 0.0226026264 -0.1034970752
## LWT -0.005213709 4.146535e-05 -0.0006470332 0.0003558553
## RACE2 0.022602626 -6.470332e-04 0.2381949141 0.0532002366
## RACE3 -0.103497075 3.558553e-04 0.0532002366 0.1272161305
Now I am going to show that this is the same matrix provided by the equation (see page 3 of the lecture pdf): \[ \widehat{\operatorname{Var}}(\hat\beta) = [X' \hat{V} X]^{-1} \] where (as stated in the lecture) \[ X = \left[\begin{array}{cccc} 1&x_{11}&\cdots&x_{1p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{n1}&\cdots&x_{np} \end{array}\right] \] and \[ \hat V = \left[ \begin{array}{cccc} \hat{\pi}(x_{1})\left(1 - \hat{\pi}(x_{1})\right)&0 & \cdots&0\\ 0 &\hat{\pi}(x_{2})\left(1 - \hat{\pi}(x_{2})\right)&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0 &\cdots&0&\hat{\pi}(x_{n})\left(1 - \hat{\pi}(x_{n})\right) \end{array} \right] \] Finally, \(X'\) denotes the transpose of \(X\).
Let’s begin by constructing the matrix \(X\). From:
head(LOWBWTdata)
## ID LOW AGE LWT RACE SMOKE PTL HT UI FTV BWT RACE2 RACE3
## 1 4 1 28 120 3 1 1 0 1 0 709 0 1
## 2 10 1 29 130 1 0 0 0 1 2 1021 0 0
## 3 11 1 34 187 2 1 0 1 0 0 1135 1 0
## 4 13 1 25 105 3 0 1 1 0 0 1330 0 1
## 5 15 1 25 85 3 0 0 0 1 0 1474 0 1
## 6 16 1 27 150 3 0 0 0 0 0 1588 0 1
and
ncol(LOWBWTdata)
## [1] 13
we see that the predictor variables are in columns 4, 12 and 13. Let’s take X0 to be that part of the data frame.
X0 = LOWBWTdata[ , c(4, 12, 13)]
head(X0)
## LWT RACE2 RACE3
## 1 120 0 1
## 2 130 0 0
## 3 187 1 0
## 4 105 0 1
## 5 85 0 1
## 6 150 0 1
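By the way, selecting those columns by name instead of by position gives the same X0 and is a bit more robust if the column order ever changes:
X0 = LOWBWTdata[ , c("LWT", "RACE2", "RACE3")]   # same data frame as above, selected by name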
This is the part that gets complicated if you let R handle the dummies by itself. The data frame in that case does not contain columns for RACE2 and RACE3 and defining X0 becomes more difficult.
Now we face one of those data-type subtleties that give us so much fun. This X0 is a data.frame. That’s just fine when we are interested in predicting probabilities with predict, as we did in the previous lecture:
piX0 = predict(glmLOWBWT, newdata = X0, type="response")
head(piX0)
## 1 2 3 4 5 6
## 0.3680909 0.2362681 0.2768981 0.4226098 0.4980983 0.2695103
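Since X0 contains exactly the rows the model was fitted on, these numbers also coincide with the fitted values stored in the model object (a quick check):
head(fitted(glmLOWBWT))    # should match head(piX0) above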
These are the values that we are going to use to define the \(\hat V\) diagonal matrix:
V = diag(piX0 * (1 - piX0))
V[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2326 0.0000000 0.0000000 0.0000000 0.0000000
## [2,] 0.0000 0.1804455 0.0000000 0.0000000 0.0000000
## [3,] 0.0000 0.0000000 0.2002256 0.0000000 0.0000000
## [4,] 0.0000 0.0000000 0.0000000 0.2440108 0.0000000
## [5,] 0.0000 0.0000000 0.0000000 0.0000000 0.2499964
But now we need the \(X\) matrix. And it has to be a matrix in R, because we are going to use it in the \(X' \hat{V} X\) matrix product. A data frame won’t do. Thus I begin by converting it to a matrix:
X = as.matrix(X0)
and now we add the initial column of 1s to get \(X\):
X = cbind(rep(1, nrow(X)), X)
The rep function creates a vector with as many 1s as there are rows in \(X\) (given by nrow), and cbind adds that vector as the first column of X.
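By the way (a sketch of an alternative, in case you prefer it), R can return this design matrix directly from the fitted model with model.matrix, even when the dummies are created automatically from a factor:
head(model.matrix(glmLOWBWT))   # the intercept column of 1s plus LWT, RACE2 and RACE3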
A quick dimensionality check to ensure that the matrix product is well defined:
dim(X)
## [1] 189 4
dim(V)
## [1] 189 189
And we can get the estimated information matrix \(\hat I(\hat\beta) = X' \hat{V} X\)
(infMatrix = t(X) %*% V %*% X)
## LWT RACE2 RACE3
## 38.21386 4765.8167 5.90509 15.28839
## LWT 4765.81667 621006.1207 834.62957 1791.10381
## RACE2 5.90509 834.6296 5.90509 0.00000
## RACE3 15.28839 1791.1038 0.00000 15.28839
With this we are ready to obtain the estimated variance-covariance matrix as the inverse of the information matrix. In R, the inverse of a non-singular matrix is provided by solve:
(vcov_by_hand = solve(infMatrix))
## LWT RACE2 RACE3
## 0.71430674 -5.213710e-03 0.0226026349 -0.1034970915
## LWT -0.00521371 4.146536e-05 -0.0006470335 0.0003558553
## RACE2 0.02260263 -6.470335e-04 0.2381949408 0.0532002545
## RACE3 -0.10349709 3.558553e-04 0.0532002545 0.1272161495
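As a quick sanity check (not in the lecture), multiplying the information matrix by the inverse we have just computed should return the identity matrix, up to rounding:
round(infMatrix %*% vcov_by_hand, 8)   # should be (approximately) the 4 x 4 identity matrix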
Compare this with the matrix we obtained with vcov
vcovLOWBWT
## (Intercept) LWT RACE2 RACE3
## (Intercept) 0.714306595 -5.213709e-03 0.0226026264 -0.1034970752
## LWT -0.005213709 4.146535e-05 -0.0006470332 0.0003558553
## RACE2 0.022602626 -6.470332e-04 0.2381949141 0.0532002366
## RACE3 -0.103497075 3.558553e-04 0.0532002366 0.1272161305
Up to some rounding errors, the two matrices are the same.
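In fact, the standard errors in the summary table above are just the square roots of the diagonal entries of this matrix; a one-line check:
sqrt(diag(vcovLOWBWT))   # compare with the Std. Error column of summary(glmLOWBWT)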
Page 2 of the lecture pdf contains yet another description of the information matrix, and therefore another way to get the variance-covariance matrix. More precisely, the information matrix is \(-Hessian(\log(\cal L))\), where \(\cal L\) is the likelihood function of the model and the \(Hessian\) is the matrix of second-order partial derivatives of that function (with respect to the \(\beta\) parameters). The Hessian is to be computed at the \(\hat\beta\) estimates of the model coefficients that we obtain when we fit the model.
The pracma package for R includes a hessian function that can be used to get a numerical approximation of the Hessian. If the package is not already installed in your local R, you can download it from the R repositories and install it simply by executing this command:
install.packages("pracma")
When the installation finishes we load the library with
library(pracma)
Occasionally you will see a warning about the package having been built under a different R version than the one on your system. That is almost never a problem.
To move forward we need to define the likelihood function. Recall that \[ \cal L = \prod_{i=1}^n \pi(x_i)^{y_i}\left(1 - \pi(x_i)\right)^{1- y_i}, \] where \[ \pi(x_i) = \dfrac{e^{\beta_0 + \beta_1 x_{i1}+ \cdots + \beta_p x_{ip} }}{1 + e^{\beta_0 + \beta_1 x_{i1}+ \cdots + \beta_p x_{ip}}}. \] (By the way, if any of these formulas look too small in your browser, right-click on it and look for the Zoom options of MathJax.)
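Since the R function below actually returns the logarithm of \(\cal L\), it may help to note that taking logs turns the product into a sum: \[ \log(\cal L) = \sum_{i=1}^n \left[ y_i \log \pi(x_i) + (1 - y_i)\log\left(1 - \pi(x_i)\right) \right]. \]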
To define the likelihood function \(\cal L\) in R we begin by using \(Y\) to denote the observed values of the response variable LOW:
Y = LOWBWTdata$LOW
Now the log likelihood function is simply:
logLikelihood = function(beta){
logit = X %*% beta
piX = apply(logit, MARGIN = 1, FUN = plogis)
log(prod(piX[Y==1]) * prod(1 - piX[Y==0]))
}
Some remarks:
- The matrix product X %*% beta returns a 1-column matrix whose \(i\)th row is: \[\beta_0 + \beta_1x_{i1}+\cdots+\beta_px_{ip}\]
- Applying plogis to each of those rows (that is what apply with MARGIN = 1 does) gives piX, again a 1-column matrix whose \(i\)th row is: \[
\dfrac{e^{\beta_0 + \beta_1x_{i1}+\cdots+\beta_px_{ip}}}{1 + e^{\beta_0 + \beta_1x_{i1}+\cdots+\beta_px_{ip}}}
\]
- The last line multiplies the \(\hat\pi\) values of the subjects with \(y_i = 1\) and the \(1 - \hat\pi\) values of the subjects with \(y_i = 0\), and returns the logarithm of that product.
We can check that the definition works by using our function to compute the log likelihood of the fitted model:
logLikelihood(glmLOWBWT$coefficients)
## [1] -111.6295
and compare this result to the one obtained using the built-in logLik function of R:
logLik(glmLOWBWT)
## 'log Lik.' -111.6295 (df=4)
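Both values agree. As an aside (just a sketch, not part of the lecture code), the same number can be computed by summing the log terms instead of taking the log of one big product, which is numerically safer for larger data sets:
logLikelihood2 = function(beta){
  piX = plogis(X %*% beta)                      # fitted probabilities for this beta vector
  sum(Y * log(piX) + (1 - Y) * log(1 - piX))    # sum of the log terms
}
logLikelihood2(glmLOWBWT$coefficients)          # should again be close to -111.6295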
You may be wondering why I have manually constructed logLikelihood when we already have a perfectly fine logLik function in R. The problem is that logLik is a function of the model (of a model object, in R parlance), whereas we need the likelihood as an explicit function of the \(\beta\) parameters if we are to compute the Hessian.
And in fact, we are ready to do just that, computing minus the hessian of the log likelihood for the \(\hat\beta\) of the fitted model:
- hessian(logLikelihood, x0 = glmLOWBWT$coefficients)
## [,1] [,2] [,3] [,4]
## [1,] 38.21386 4765.7857 5.905090 15.28839
## [2,] 4765.78565 621004.6973 834.617123 1791.08341
## [3,] 5.90509 834.6171 5.905091 0.00000
## [4,] 15.28839 1791.0834 0.000000 15.28839
And you can see that we have again arrived at the same estimated information matrix that we met before (as usual, up to some rounding):
infMatrix
## LWT RACE2 RACE3
## 38.21386 4765.8167 5.90509 15.28839
## LWT 4765.81667 621006.1207 834.62957 1791.10381
## RACE2 5.90509 834.6296 5.90509 0.00000
## RACE3 15.28839 1791.1038 0.00000 15.28839
Let’s deal now with the confidence interval computations appearing on page 4 of the lecture pdf. The computation of a confidence interval for the logit of a single subject is similar to what we did in Lecture 2 for the case of a single predictor variable. We will use predict, applied to a data frame containing the values of the predictor variables for the subject.
subjectData = data.frame(LWT=100, RACE2=1, RACE3=0)
Now we use predict with type="link" to get the answer on the logit scale, and with se.fit=TRUE to get standard errors as part of the answer.
(predictedLogit = predict(glmLOWBWT, newdata = subjectData, type="link", se.fit = TRUE))
## $fit
## 1
## 0.3645094
##
## $se.fit
## [1] 0.4901142
##
## $residual.scale
## [1] 1
Check these values against those appearing on page 6 of the lecture pdf. The confidence interval for the logit is now easily obtained with:
(intervalLogit = predictedLogit$fit + c(-1, 1) * (qnorm(0.975) * predictedLogit$se.fit))
## [1] -0.5960967 1.3251156
And to get the interval for the probability we use the plogis function (this is the logistic function, the inverse of the logit link, as we saw in the R code for the previous lecture):
(intervalProb = plogis(intervalLogit))
## [1] 0.3552372 0.7900315
If you want the predicted probability itself, use:
(predictedProb = predict(glmLOWBWT, newdata = subjectData, type="response"))
## 1
## 0.5901316
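Before moving on, a quick cross-check (a sketch that ties this back to the variance-covariance section): the standard error returned by predict is \(\sqrt{x'\,\widehat{\operatorname{Var}}(\hat\beta)\,x}\), where \(x\) is the covariate vector of the subject, with a leading 1 for the intercept.
xSubject = c(1, 100, 1, 0)                      # intercept, LWT = 100, RACE2 = 1, RACE3 = 0
sqrt(t(xSubject) %*% vcovLOWBWT %*% xSubject)   # should match the se.fit value 0.4901 above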
In this part of the lecture we will be working with the CHD data set:
CHDdata = read.table("./data/CHDAGE.txt", header = TRUE)
We are going to define a dichotomous variable called AGE55 which is equal to 1 if the subject’s age is 55 or greater, and 0 otherwise. We add the new variable to the data frame:
CHDdata$AGE55 = as.integer(CHDdata$AGE >= 55)
head(CHDdata)
## ID AGE CHD AGE55
## 1 1 20 0 0
## 2 2 23 0 0
## 3 3 24 0 0
## 4 5 25 1 0
## 5 4 25 0 0
## 6 7 26 0 0
tail(CHDdata)
## ID AGE CHD AGE55
## 95 95 62 1 1
## 96 96 63 1 1
## 97 98 64 1 1
## 98 97 64 0 1
## 99 99 65 1 1
## 100 100 69 1 1
Let’s reproduce the contingency table on page 19 of the lecture pdf.
(table1 = addmargins(table(CHDdata$CHD, CHDdata$AGE55)[2:1,2:1]))
##
## 1 0 Sum
## 1 21 22 43
## 0 6 51 57
## Sum 27 73 100
I have had to fiddle a bit with the result of table (the [2:1, 2:1] indexing reverses the order of the rows and columns) to get the categories in the same order as in the lecture. The addmargins function then adds the marginal totals to the table.
Our first estimate of the odds ratio can be obtained with the well-known cross-product formula: multiply the counts on the main diagonal of the contingency table and divide by the product of the counts on the other diagonal.
(oddsRatio1 = table1[1,1] * table1[2,2] / (table1[1,2] * table1[2,1]))
## [1] 8.113636
Let’s fit a logistic model to the data:
glmCHD = glm(CHD ~ AGE55, family = binomial(link = "logit"), CHDdata)
(summGlmCHD = summary(glmCHD))
##
## Call:
## glm(formula = CHD ~ AGE55, family = binomial(link = "logit"),
## data = CHDdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7344 -0.8469 -0.8469 0.7090 1.5488
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8408 0.2551 -3.296 0.00098 ***
## AGE55 2.0935 0.5285 3.961 7.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 117.96 on 98 degrees of freedom
## AIC: 121.96
##
## Number of Fisher Scoring iterations: 4
Now we can check that the estimated odds ratio equals the exponential of the model coefficient for the dichotomous predictor variable:
(oddsRatio2 = exp(glmCHD$coefficients[2]))
## AGE55
## 8.113636
Next we want a confidence interval for this odds ratio. We begin by computing \[ \widehat{\operatorname{Var}}(\hat\beta_1) = \left(\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d} \right), \] where \(a\), \(b\), \(c\) and \(d\) are the four cell counts of the contingency table:
(varBeta1 = sum(1/table1[1:2, 1:2]))
## [1] 0.2793481
and from this the standard error is:
(SEb1 = sqrt(varBeta1))
## [1] 0.5285339
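This agrees with the standard error reported for AGE55 in the model summary above (0.5285); a one-line check:
summGlmCHD$coefficients["AGE55", "Std. Error"]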
Therefore the confidence interval for \(\beta_1\) is:
(confint_beta1 = glmCHD$coefficients[2] + c(-1, 1) * qnorm(0.975) * SEb1)
## [1] 1.057639 3.129454
and exponentiating this we get the confidence interval for the odds ratio:
(confintOddsRatio = exp(confint_beta1))
## [1] 2.879563 22.861484
Alternatively we can turn to confint.default to get the confidence interval for \(\beta_1\) (and \(\beta_0\))
(confint_beta = confint.default(level = 0.95, glmCHD))
## 2.5 % 97.5 %
## (Intercept) -1.340717 -0.3408489
## AGE55 1.057639 3.1294530
confint_beta[2,]
## 2.5 % 97.5 %
## 1.057639 3.129453
The confidence interval for \(\beta_1\) is the second row of that matrix, which is what the [2, ] indexing above selects.
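Finally, a small aside (a sketch, not covered in the lecture): R also provides confint, without the .default, which computes profile-likelihood intervals rather than the Wald intervals above, so the numbers will be close but not identical:
confint(glmCHD)             # profile-likelihood confidence intervals for both coefficients
exp(confint(glmCHD)[2, ])   # the corresponding interval for the odds ratio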
Thanks for your attention!