Introduction

These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.

Variance-covariance matrix.

In the first part of the lecture we will be working with the Low Birth Weight data set, so we begin by loading it into a data.frame.

LOWBWTdata = read.table("./data/LOWBWT.txt", header = TRUE)

This first section will be more theoretical in nature, for those of us who like to know some details about the maths behind the computations. So, if you are satisfied with using vcov in R to get the variance-covariance matrix, you can skip to the next section. If you choose to stay… you have been warned :)

In the previous lecture we used RACE as a predictor variable, and in my R code I converted it to a factor to let R take care of the details for us. But for some of the work I am about to do in this section it is actually better to manually create two dummy variables RACE2 and RACE3 with the same meaning as the ones we have seen in the Stata code in the lectures (where they are called _Irace_2 and _Irace_3 respectively). We will see the details below, but the reason I am doing this is that I want direct access to the values of these dummy variables. When we let R create them automatically, they are not added to the original data frame, and that complicates the code.

Thus, let me start by creating the dummies:

LOWBWTdata$RACE2 = as.integer(LOWBWTdata$RACE == 2)
LOWBWTdata$RACE3 = as.integer(LOWBWTdata$RACE == 3)

To create RACE2 the == operator compares the value of RACE against 2 (or 3 for RACE3) and returns a boolean (values TRUE or FALSE in R). These booleans are converted to 0s and 1s by as.integer.
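
If you want to double-check this coding, a quick optional verification is to list the distinct combinations of RACE and the two dummies; there should be exactly three rows, one per race category:

unique(LOWBWTdata[ , c("RACE", "RACE2", "RACE3")])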

Now we are ready to fit the logistic model with LWT, RACE2 and RACE3 as predictors.

glmLOWBWT = glm(LOW ~ LWT + RACE2 + RACE3, family = binomial(link = "logit"), data = LOWBWTdata)
summary(glmLOWBWT)
## 
## Call:
## glm(formula = LOW ~ LWT + RACE2 + RACE3, family = binomial(link = "logit"), 
##     data = LOWBWTdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3491  -0.8919  -0.7196   1.2526   2.0993  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.805753   0.845167   0.953   0.3404  
## LWT         -0.015223   0.006439  -2.364   0.0181 *
## RACE2        1.081066   0.488052   2.215   0.0268 *
## RACE3        0.480603   0.356674   1.347   0.1778  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 223.26  on 185  degrees of freedom
## AIC: 231.26
## 
## Number of Fisher Scoring iterations: 4

You can check the coefficients of the resulting model against what we called the reduced model in Lecture 2 and you will see that this is the exact same model. As I said, the advantages of this direct coding of the dummies will soon become apparent.

We have already seen that the simplest way to get the variance-covariance matrix for the model is:

(vcovLOWBWT = vcov(glmLOWBWT))
##              (Intercept)           LWT         RACE2         RACE3
## (Intercept)  0.714306595 -5.213709e-03  0.0226026264 -0.1034970752
## LWT         -0.005213709  4.146535e-05 -0.0006470332  0.0003558553
## RACE2        0.022602626 -6.470332e-04  0.2381949141  0.0532002366
## RACE3       -0.103497075  3.558553e-04  0.0532002366  0.1272161305

Now I am going to show that this is the same matrix provided by the equation (see page 3 of the lecture pdf): \[ \widehat{\operatorname{Var}}(\hat\beta) = [X' \hat{V} X]^{-1} \] where (as stated in the lecture) \[ X = \left[\begin{array}{cccc} 1&x_{11}&\cdots&x_{1p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{n1}&\cdots&x_{np} \end{array}\right] \] and \[ \hat V = \left[ \begin{array}{cccc} \hat{\pi}(x_{1})\left(1 - \hat{\pi}(x_{1})\right)&0 & \cdots&0\\ 0 &\hat{\pi}(x_{2})\left(1 - \hat{\pi}(x_{2})\right)&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0 &\cdots&0&\hat{\pi}(x_{n})\left(1 - \hat{\pi}(x_{n})\right) \end{array} \right] \] Finally, \(X'\) denotes the transpose of \(X\).

Let’s begin by constructing the matrix \(X\). From:

head(LOWBWTdata)
##   ID LOW AGE LWT RACE SMOKE PTL HT UI FTV  BWT RACE2 RACE3
## 1  4   1  28 120    3     1   1  0  1   0  709     0     1
## 2 10   1  29 130    1     0   0  0  1   2 1021     0     0
## 3 11   1  34 187    2     1   0  1  0   0 1135     1     0
## 4 13   1  25 105    3     0   1  1  0   0 1330     0     1
## 5 15   1  25  85    3     0   0  0  1   0 1474     0     1
## 6 16   1  27 150    3     0   0  0  0   0 1588     0     1

and

ncol(LOWBWTdata)
## [1] 13

we see that the predictor variables are in columns 4, 12 and 13. Let’s take X0 to be that part of the data frame.

X0 = LOWBWTdata[ , c(4, 12, 13)]
head(X0)
##   LWT RACE2 RACE3
## 1 120     0     1
## 2 130     0     0
## 3 187     1     0
## 4 105     0     1
## 5  85     0     1
## 6 150     0     1

This is the part that gets complicated if you let R handle the dummies by itself. The data frame in that case does not contain columns for RACE2 and RACE3 and defining X0 becomes more difficult.
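
As an aside, if you do prefer to keep RACE as a factor, one way around that difficulty is model.matrix, which extracts the design matrix straight from the fitted model (column of 1s included), whether the dummies were coded by hand or generated from a factor. A minimal sketch (the name Xalt is just for illustration):

Xalt = model.matrix(glmLOWBWT)   # design matrix used in the fit, intercept column included
head(Xalt)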

Now we face one of those data-type subtleties that give us so much fun. This X0 is a data.frame. That’s just fine when we are interested in predicting probabilities with predict, as we did in the previous lecture:

piX0 = predict(glmLOWBWT, newdata = X0, type="response")
head(piX0)
##         1         2         3         4         5         6 
## 0.3680909 0.2362681 0.2768981 0.4226098 0.4980983 0.2695103

These are the values that we are going to use to define the \(\hat V\) diagonal matrix:

V = diag(piX0 * (1 - piX0))
V[1:5, 1:5]
##        [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.2326 0.0000000 0.0000000 0.0000000 0.0000000
## [2,] 0.0000 0.1804455 0.0000000 0.0000000 0.0000000
## [3,] 0.0000 0.0000000 0.2002256 0.0000000 0.0000000
## [4,] 0.0000 0.0000000 0.0000000 0.2440108 0.0000000
## [5,] 0.0000 0.0000000 0.0000000 0.0000000 0.2499964

But now we need the \(X\) matrix. And it has to be a matrix in R, because we are going to use it in the \(X' \hat{V} X\) matrix product. A data frame won’t do. Thus I begin by converting it to a matrix:

X = as.matrix(X0)

and now we add the initial column of 1s to get \(X\)

X = cbind(rep(1, nrow(X)), X)

The rep function creates a vector with as many 1s as the number of rows of \(X\) (provided by nrow). Then cbind adds that vector as the first column of X.

A quick dimensionality check to ensure that the matrix product is well defined:

dim(X)
## [1] 189   4
dim(V)
## [1] 189 189

And we can get the estimated information matrix \(\hat I(\hat\beta) = X' \hat{V} X\)

(infMatrix = t(X) %*% V %*% X )
##                          LWT     RACE2      RACE3
##         38.21386   4765.8167   5.90509   15.28839
## LWT   4765.81667 621006.1207 834.62957 1791.10381
## RACE2    5.90509    834.6296   5.90509    0.00000
## RACE3   15.28839   1791.1038   0.00000   15.28839

With this we are ready to obtain the estimated variance-covariance matrix as the inverse of the information matrix. In R, the inverse of a non-singular matrix is provided by solve:

(vcov_by_hand = solve(infMatrix))
##                             LWT         RACE2         RACE3
##        0.71430674 -5.213710e-03  0.0226026349 -0.1034970915
## LWT   -0.00521371  4.146536e-05 -0.0006470335  0.0003558553
## RACE2  0.02260263 -6.470335e-04  0.2381949408  0.0532002545
## RACE3 -0.10349709  3.558553e-04  0.0532002545  0.1272161495

Compare this with the matrix we obtained with vcov:

vcovLOWBWT
##              (Intercept)           LWT         RACE2         RACE3
## (Intercept)  0.714306595 -5.213709e-03  0.0226026264 -0.1034970752
## LWT         -0.005213709  4.146535e-05 -0.0006470332  0.0003558553
## RACE2        0.022602626 -6.470332e-04  0.2381949141  0.0532002366
## RACE3       -0.103497075  3.558553e-04  0.0532002366  0.1272161305

Up to some rounding errors, the two matrices are the same.
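
If you prefer a numerical confirmation over an eyeball comparison, you can look at the largest absolute difference between the two matrices; it should be a tiny number, of the order of the rounding error:

max(abs(vcov_by_hand - vcovLOWBWT))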

Yet another way, via likelihood.

Page 2 of the lecture pdf contains yet another description of the information matrix, and therefore another way to get the variance-covariance matrix. More precisely, the information matrix is \(-\operatorname{Hessian}(\log(\cal L))\), where \(\cal L\) is the likelihood function of the model and the Hessian is the matrix of second-order partial derivatives of that function (w.r.t. the \(\beta\) parameters). The hessian is to be computed at the \(\hat\beta\) estimates of the model coefficients that we obtain when we fit the model.

The pracma library of R includes a hessian function that can be used to get a numerical approximation of the hessian. If the library is not already installed in your local R you can download it from the R repositories and install it simply by executing this command:

install.packages("pracma")

When the installation finishes we load the library with

library(pracma)

Occasionally you will see a warning about the library having been built under a different R version than the one in your system. That’s almost always not a problem.

To move forward we need to define the likelihood function. Recall that \[ \cal L = \prod_{i=1}^n \pi(x_i)^{y_i}\left(1 - \pi(x_i)\right)^{1- y_i}, \] where \[ \pi(x_i) = \dfrac{e^{\beta_0 + \beta_1 x_{i1}+ \cdots + \beta_p x_{ip} }}{1 + e^{\beta_0 + \beta_1 x_{i1}+ \cdots + \beta_p x_{ip}}}. \] (By the way, if any of these formulas look too small in your browser, right-click on them and look for the Zoom options of MathJax.)

To define the likelihood function \(\cal L\) in R we begin by using \(Y\) to denote the observed values of the response variable (LOW):

Y = LOWBWTdata$LOW

Now the log likelihood function is simply:

logLikelihood = function(beta){
  logit = X %*% beta
  piX = apply(logit, MARGIN = 1, FUN = plogis)
  log(prod(piX[Y==1]) * prod(1 - piX[Y==0]))
}

Some remarks:

  • The first line of code in the function body uses the matrix product of X and beta to get a 1-column matrix whose \(i\)th row is: \[\beta_0 + \beta_1x_{i1}+\cdots+\beta_px_{ip}\]
  • The second line applies the logistic function (the inverse of the logit link) to each row of that column matrix. Thus, piX is again a 1-column matrix whose \(i\)th row is: \[ \dfrac{e^{\beta_0 + \beta_1x_{i1}+\cdots+\beta_px_{ip}}}{1 + e^{\beta_0 + \beta_1x_{i1}+\cdots+\beta_px_{ip}}} \]
  • The final line uses the values of \(Y\) (0 or 1) to select which of the above values are used to compute the products that give the log likelihood function.

We can check that the definition works by using our function to compute the log likelihood of the fitted model:

logLikelihood(glmLOWBWT$coefficients)
## [1] -111.6295

and compare this result to the one obtained using the built-in logLik function of R:

logLik(glmLOWBWT)
## 'log Lik.' -111.6295 (df=4)
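
A small side note: the last line of logLikelihood multiplies many probabilities together before taking the log, which can underflow for larger data sets. A numerically safer variant (just a sketch, not needed for what follows) sums the log terms instead and should return the same value here:

logLikelihood2 = function(beta){
  piX = plogis(as.vector(X %*% beta))         # plogis is vectorized, no apply needed
  sum(Y * log(piX) + (1 - Y) * log(1 - piX))  # sum of logs instead of log of a product
}
logLikelihood2(glmLOWBWT$coefficients)        # should again give -111.6295 (approximately)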

You may be wondering why I have manually constructed logLikelihood when we already have a perfectly fine logLik function in R. The problem is that logLik is a function of the fitted model (of a model object, in R parlance), whereas we need the likelihood as an explicit function of the \(\beta\) parameters if we are to compute the hessian.

And in fact, we are ready to do just that, computing minus the hessian of the log likelihood for the \(\hat\beta\) of the fitted model:

- hessian(logLikelihood, x0 = glmLOWBWT$coefficients)
##            [,1]        [,2]       [,3]       [,4]
## [1,]   38.21386   4765.7857   5.905090   15.28839
## [2,] 4765.78565 621004.6973 834.617123 1791.08341
## [3,]    5.90509    834.6171   5.905091    0.00000
## [4,]   15.28839   1791.0834   0.000000   15.28839

And you can see that we have again arrived at the same estimated information matrix that we met before (as usual, up to some rounding):

infMatrix
##                          LWT     RACE2      RACE3
##         38.21386   4765.8167   5.90509   15.28839
## LWT   4765.81667 621006.1207 834.62957 1791.10381
## RACE2    5.90509    834.6296   5.90509    0.00000
## RACE3   15.28839   1791.1038   0.00000   15.28839
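
And, closing the circle, inverting this numerically obtained information matrix should again give (approximately) the variance-covariance matrix of the model, just as solve(infMatrix) did above:

solve(-hessian(logLikelihood, x0 = glmLOWBWT$coefficients))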

Confidence interval for the logit for a single subject

Let’s deal now with the confidence interval computations appearing on page 4 of the lecture pdf. The computation of a confidence interval for the logit of a single subject is similar to what we did in Lecture 2 for the case of a single predictor variable. We will use predict, applied to a data frame containing the values of the predictor variables for the subject.

subjectData = data.frame(LWT=100, RACE2=1, RACE3=0)

Now we use predict with type="link" to get the answer in the logit scale and with se.fit=TRUE to get standard errors as part of the answer.

(predictedLogit = predict(glmLOWBWT, newdata = subjectData, type="link", se.fit = TRUE))
## $fit
##         1 
## 0.3645094 
## 
## $se.fit
## [1] 0.4901142
## 
## $residual.scale
## [1] 1

Check these values against those appearing on page 6 of the lecture pdf. The confidence interval for the logit is now easily obtained with:

(intervalLogit = predictedLogit$fit + c(-1, 1) *  (qnorm(0.975) * predictedLogit$se.fit))
## [1] -0.5960967  1.3251156

And to get the interval for the probability we use the plogis function (the logistic function, i.e. the inverse of the logit link, as we saw in the R code for the previous lecture). Since plogis is strictly increasing, applying it to the endpoints of the logit interval gives the corresponding interval for the probability:

(intervalProb = plogis(intervalLogit))
## [1] 0.3552372 0.7900315

If you want the predicted probability use:

(predictedProb = predict(glmLOWBWT, newdata = subjectData, type="response"))
##         1 
## 0.5901316
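
As a consistency check, this predicted probability is nothing but the logistic transform of the predicted logit we obtained above, so the following should return the same 0.5901316:

plogis(predictedLogit$fit)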

Interpretation of coefficients for a dichotomous independent variable in terms of the odds ratio

In this part of the lecture we will be working with the CHD data set:

CHDdata = read.table("./data/CHDAGE.txt", header = TRUE)

We are going to define a dichotomous variable called AGE55, which is equal to 1 if the subject’s age is greater than or equal to 55, and 0 otherwise. We add the new variable to the data frame:

CHDdata$AGE55 = as.integer(CHDdata$AGE >= 55) 
head(CHDdata)
##   ID AGE CHD AGE55
## 1  1  20   0     0
## 2  2  23   0     0
## 3  3  24   0     0
## 4  5  25   1     0
## 5  4  25   0     0
## 6  7  26   0     0
tail(CHDdata)
##      ID AGE CHD AGE55
## 95   95  62   1     1
## 96   96  63   1     1
## 97   98  64   1     1
## 98   97  64   0     1
## 99   99  65   1     1
## 100 100  69   1     1

Let’s reproduce the contingency table on page 19 of the lecture pdf.

(table1 = addmargins(table(CHDdata$CHD, CHDdata$AGE55)[2:1,2:1]))
##      
##        1  0 Sum
##   1   21 22  43
##   0    6 51  57
##   Sum 27 73 100

I have had to fiddle a bit with the result of table to get the categories in the same order as in the lecture: the [2:1, 2:1] indexing reverses the order of both rows and columns. The addmargins function does just what its name says: it adds the marginal totals to the table.

Our first estimate of the odds ratio can be obtained with the well-known cross-product formula (the product of the main-diagonal cells of the contingency table over the product of the off-diagonal cells):

(oddsRatio1 = table1[1,1] * table1[2,2] / (table1[1,2] * table1[2,1]))
## [1] 8.113636
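
The cross-product formula is just a compact way of writing the ratio of the odds of CHD in the two age groups. If you want to see the two odds explicitly (the values come straight from the table above; the variable names are just for illustration):

oddsOlder   = table1[1, 1] / table1[2, 1]   # 21 / 6, odds of CHD when AGE55 = 1
oddsYounger = table1[1, 2] / table1[2, 2]   # 22 / 51, odds of CHD when AGE55 = 0
oddsOlder / oddsYounger                     # same 8.113636 as above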

Let’s fit a logistic model to the data:

glmCHD = glm(CHD ~ AGE55, family = binomial(link = "logit"), CHDdata)
(summGlmCHD = summary(glmCHD))
## 
## Call:
## glm(formula = CHD ~ AGE55, family = binomial(link = "logit"), 
##     data = CHDdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7344  -0.8469  -0.8469   0.7090   1.5488  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8408     0.2551  -3.296  0.00098 ***
## AGE55         2.0935     0.5285   3.961 7.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 117.96  on 98  degrees of freedom
## AIC: 121.96
## 
## Number of Fisher Scoring iterations: 4

to see that the estimated odds ratio equals the exponential of the model coefficient for the dichotomous predictor variable:

(oddsRatio2 = exp(glmCHD$coefficients[2]))
##    AGE55 
## 8.113636

Confidence interval for the odds ratio

We begin by computing \[ \widehat{\operatorname{Var}}(\hat\beta_1) = \left(\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d} \right), \] where \(a\), \(b\), \(c\) and \(d\) are the four cell counts of the contingency table:

(varBeta1 = sum(1/table1[1:2, 1:2]))
## [1] 0.2793481

and from this the standard error is:

(SEb1 = sqrt(varBeta1))
## [1] 0.5285339

Therefore the confidence interval for \(\beta_1\) is:

(confint_beta1 = glmCHD$coefficients[2] + c(-1, 1) * qnorm(0.975) * SEb1)
## [1] 1.057639 3.129454

and exponentiating this we get the confidence interval for the odds ratio:

(confintOddsRatio = exp(confint_beta1))
## [1]  2.879563 22.861484

Alternatively we can turn to confint.default to get the confidence intervals for \(\beta_1\) (and \(\beta_0\)):

(confint_beta =  confint.default(level = 0.95, glmCHD))
##                 2.5 %     97.5 %
## (Intercept) -1.340717 -0.3408489
## AGE55        1.057639  3.1294530
confint_beta[2,]
##    2.5 %   97.5 % 
## 1.057639 3.129453

The confidence interval for \(\beta_1\) is the second row (use [2, ] to select it).
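
Finally, note that confint.default computes Wald-type intervals, which is why it matches our hand computation. R also has confint which, for glm models, computes profile-likelihood intervals and will in general give slightly different limits; if you are curious you can compare with:

confint(glmCHD)              # profile-likelihood intervals, not identical to the Wald ones
exp(confint(glmCHD)[2, ])    # the corresponding interval for the odds ratio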


Thanks for your attention!