Introduction

These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.

R code for previous lectures:

Warning:

I have not been able to locate the ICU data set used for the last part of this week's lecture (slides 19 to the end, video segments 4.5 and 4.6). That data set contains variables (e.g. MVENT) that do not appear in the data set of the same name that I have, so I am not able to reproduce the results in that part of the lecture. If anyone can help me locate the data, I'd appreciate it.

Polychotomous independent variable. Reference cell coding.

Approaching polychotomous data via contingency tables

The contingency table in slide 2 can be entered as a matrix in R:

CHDtable = matrix(c(5, 20, 15, 10, 20, 10, 10, 10), ncol = 2)
colnames(CHDtable) = c("PRESENT", "ABSENT")
rownames(CHDtable) = c("WHITE", "BLACK", "HISPANIC", "OTHER")
addmargins(CHDtable)
##          PRESENT ABSENT Sum
## WHITE          5     20  25
## BLACK         20     10  30
## HISPANIC      15     10  25
## OTHER         10     10  20
## Sum           50     50 100

To generate the raw data from this table I am going to use the expand.table function from the epitools library. Uncommenting the first line (remove the #) will download and install that library automatically.

#install.packages("epitools")
library(epitools)
CHDdf = expand.table(CHDtable)
names(CHDdf) = c("race", "chd")
head(CHDdf)
##    race     chd
## 1 WHITE PRESENT
## 2 WHITE PRESENT
## 3 WHITE PRESENT
## 4 WHITE PRESENT
## 5 WHITE PRESENT
## 6 WHITE  ABSENT
tail(CHDdf)
##      race    chd
## 95  OTHER ABSENT
## 96  OTHER ABSENT
## 97  OTHER ABSENT
## 98  OTHER ABSENT
## 99  OTHER ABSENT
## 100 OTHER ABSENT

And we can check the result by converting it back to a table:

table(CHDdf$race, CHDdf$chd)
##           
##            PRESENT ABSENT
##   WHITE          5     20
##   BLACK         20     10
##   HISPANIC      15     10
##   OTHER         10     10

This conversion to data.frame has the added benefit that race and chd have been converted to factors:

str(CHDdf)
## 'data.frame':    100 obs. of  2 variables:
##  $ race: Factor w/ 4 levels "WHITE","BLACK",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ chd : Factor w/ 2 levels "PRESENT","ABSENT": 1 1 1 1 1 2 2 2 2 2 ...

and we have already seen that, for factors, R automatically creates the dummy variables needed for logistic regression. In fact, we can use contrasts to ask R which dummy coding it is going to use for a factor:

contrasts(CHDdf$race)
##          BLACK HISPANIC OTHER
## WHITE        0        0     0
## BLACK        1        0     0
## HISPANIC     0        1     0
## OTHER        0        0     1

As you can see, reference cell coding is the default coding that R uses for a factor. Below we will see that contrasts can also be used to change the dummy coding as we want.

Logistic model. Defining the reference factor level.

And now we can fit the logistic model with glm:

glmCHD = glm(chd ~ race, family = binomial(link = "logit"), data = CHDdf)
summary(glmCHD)
## 
## Call:
## glm(formula = chd ~ race, family = binomial(link = "logit"), 
##     data = CHDdf)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7941  -1.0108  -0.1162   1.1774   1.4823  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    1.3863     0.5000   2.773  0.00556 **
## raceBLACK     -2.0794     0.6325  -3.288  0.00101 **
## raceHISPANIC  -1.7918     0.6455  -2.776  0.00551 **
## raceOTHER     -1.3863     0.6708  -2.067  0.03878 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.63  on 99  degrees of freedom
## Residual deviance: 124.59  on 96  degrees of freedom
## AIC: 132.59
## 
## Number of Fisher Scoring iterations: 4
logLik(glmCHD)
## 'log Lik.' -62.29372 (df=4)

You can compare the results here with those on slide 5 of the lecture pdf.

Have you noticed something strange? The signs of the model coefficients are the opposite of those in the lecture. Why did that happen? Because of the order of the levels in the chd factor:

levels(CHDdf$chd)
## [1] "PRESENT" "ABSENT"

The culprit here is probably the expand.table function. At first I did not notice this, because the default rule in R is to use alphabetical order for the levels of a factor. But expand.table does something different, apparently ordering the levels by their order of appearance in the data.
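
To see the default rule in action, here is a quick check (not from the lecture):

levels(factor(c("PRESENT", "ABSENT", "PRESENT")))
## [1] "ABSENT"  "PRESENT"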

Anyway, now that we know, we can easily fix this by telling R which is the reference level. We use the relevel function:

CHDdf$chd = relevel(CHDdf$chd, ref = "ABSENT")
levels(CHDdf$chd)
## [1] "ABSENT"  "PRESENT"

In a more complex situation, if you want an arbitrary reordering of all the levels, you will need to learn how to use the levels function. Beware, however, of the confusion between factor levels and the labels of those levels!
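
Here is a minimal illustration of that difference (not needed for our data): factor() with an explicit levels argument reorders the levels, whereas assigning to levels() merely relabels them in place.

f = factor(c("b", "a", "c"))
factor(f, levels = c("c", "b", "a"))  # reorders the levels, data unchanged
## [1] b a c
## Levels: c b a
levels(f) = c("A", "B", "C")          # relabels: a -> A, b -> B, c -> C
f
## [1] B A C
## Levels: A B C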

Alternatively, you can fix the contingency table itself at the beginning, by executing

#(CHDtable = CHDtable[ , 2:1])

To do that, uncomment this line and copy it under the line addmargins(CHDtable). Note, however, that this may have undesired side effects, because many people expect the exposed group to be in the first column of the contingency table. I will just leave the table as it was.

So, having fixed the ordering of the factor levels, the model fit now agrees with the Stata output:

glmCHD = glm(chd ~ race, family = binomial(link = "logit"), data = CHDdf)
summary(glmCHD)
## 
## Call:
## glm(formula = chd ~ race, family = binomial(link = "logit"), 
##     data = CHDdf)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4823  -1.1774   0.1162   1.0108   1.7941  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -1.3863     0.5000  -2.773  0.00556 **
## raceBLACK      2.0794     0.6325   3.288  0.00101 **
## raceHISPANIC   1.7918     0.6455   2.776  0.00551 **
## raceOTHER      1.3863     0.6708   2.067  0.03878 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.63  on 99  degrees of freedom
## Residual deviance: 124.59  on 96  degrees of freedom
## AIC: 132.59
## 
## Number of Fisher Scoring iterations: 4
logLik(glmCHD)
## 'log Lik.' -62.29372 (df=4)

Odds ratio comparing each level to the reference level

Recall that our contingency table is:

CHDtable
##          PRESENT ABSENT
## WHITE          5     20
## BLACK         20     10
## HISPANIC      15     10
## OTHER         10     10
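
For example, the odds of CHD being present are 20/10 for the BLACK group and 5/20 for the WHITE group, so the odds ratio for BLACK vs WHITE is:

(20 / 10) / (5 / 20)
## [1] 8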

It’s probably my fault, but I have not found any library function in R to compute the odds ratios from a contingency table like this one. A nice thing about systems like R, though, is that if you need something and cannot find it, you can make it yourself. Bonus point: we get to use one of the R functions in the apply family. So, here is my function:

oddsRatio = function(CT, refRow = 1, exposedCol = 1){
   # The table has two columns; the other one holds the non-exposed counts.
   otherCol = 3 - exposedCol
   # The rows vector has all the row indices minus the reference group row index.
   rows = (1:nrow(CT))[-refRow]
   # We define a function that computes the odds ratio of any row vs the
   # reference row, and also labels the result. And we use sapply
   # to apply that function to each of the indices in the rows vector.
   sapply(rows, FUN = function(r){
     OR = (CT[r, exposedCol] * CT[refRow, otherCol]) /
          (CT[r, otherCol] * CT[refRow, exposedCol])
     names(OR) = paste0(row.names(CT)[r], " vs ", row.names(CT)[refRow])
     return(OR)
   })
}

Note that, besides the contingency table CT, the function has two additional arguments for the row index of the reference group and the column index containing the exposed group. The default values are first row and first column, but they can be set by hand.

Using the function defaults we get:

oddsRatio(CHDtable)
##    BLACK vs WHITE HISPANIC vs WHITE    OTHER vs WHITE 
##                 8                 6                 4

but if we want to use the second row (race = "BLACK") as reference, we simply do this:

oddsRatio(CHDtable, refRow = 2)
##    WHITE vs BLACK HISPANIC vs BLACK    OTHER vs BLACK 
##             0.125             0.750             0.500

Of course, an alternative way to get the odds ratio is to exponentiate the coefficients of the logistic model we fitted before:

exp(glmCHD$coefficients)
##  (Intercept)    raceBLACK raceHISPANIC    raceOTHER 
##         0.25         8.00         6.00         4.00

Likelihood ratio test for the significance of the race variable

We have done this before (see the example of the low birth weight study in the week 2 lecture code), but for the sake of getting all the numerical results in the lecture I'll repeat the computation:

(LR = glmCHD$null.deviance - glmCHD$deviance)
## [1] 14.04199
(df = summary(glmCHD)$df.null - summary(glmCHD)$df.residual) # degrees of freedom for the chi-squared statistic
## [1] 3
(pValue = pchisq(LR, df, lower.tail = FALSE))
## [1] 0.002848545
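
By the way, if you prefer a one-liner, the same test can be obtained from R's anova method for glm objects; it reproduces the deviance difference, the degrees of freedom and the p-value computed above:

anova(glmCHD, test = "Chisq")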

Confidence intervals for the odds ratios.

We already know that we can use confint.default to get the confidence intervals for the coefficients of the model:

confint.default(glmCHD)
##                    2.5 %     97.5 %
## (Intercept)  -2.36627611 -0.4063126
## raceBLACK     0.83985175  3.3190313
## raceHISPANIC  0.52660835  3.0569106
## raceOTHER     0.07151073  2.7010780

and exponentiate them to get confidence intervals for the odds ratios:

exp(confint.default(glmCHD))
##                   2.5 %     97.5 %
## (Intercept)  0.09382949  0.6661019
## raceBLACK    2.31602361 27.6335698
## raceHISPANIC 1.69317988 21.2617693
## raceOTHER    1.07412968 14.8957806

By the way, if you need a different confidence level (other than 95%), you can use an optional argument, e.g. level=0.99 for 99% intervals.
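
For example, for 99% intervals on the odds ratio scale:

exp(confint.default(glmCHD, level = 0.99))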

Standard errors from the contingency table

Alternatively, we can get the standard errors for the log odds ratios using the formula on slide 6 of the lecture notes: \[ \hat{SE}(\ln(\hat{OR})) = \sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}} \] For this I have repurposed the oddsRatio function shown above. Now, instead of sapply, the function uses lapply to get a list containing, for each group, both the odds ratio (against the reference group) and the standard error of the log odds ratio.

oddsRatio2 = function(CT, refRow = 1, exposedCol = 1){
   # The table has two columns; the other one holds the non-exposed counts.
   otherCol = 3 - exposedCol
   # The rows vector has all the row indices minus the reference group row index.
   rows = (1:nrow(CT))[-refRow]
   # We define a function that computes, for any row vs the reference row, the
   # odds ratio and the standard error of its logarithm. This time we use
   # lapply to apply that function to each of the indices in the rows vector,
   # so that the result is a list.
   ORlist = lapply(rows, FUN = function(r){
     OR = (CT[r, exposedCol] * CT[refRow, otherCol]) /
          (CT[r, otherCol] * CT[refRow, exposedCol])
     names(OR) = paste0(row.names(CT)[r], " vs ", row.names(CT)[refRow])
     # The slide 6 formula: sqrt(1/a + 1/b + 1/c + 1/d) over the two rows.
     SE = sqrt(sum(1 / CT[c(r, refRow), ]))
     return(c(OR, SElog = SE))
   })
   return(ORlist)
}

In case you are wondering, the SElog = SE bit is just a trick to make R return a named value, so that the standard error comes back with the name "SElog" attached.

The result of the function is

oddsRatio2(CHDtable)
## [[1]]
## BLACK vs WHITE          SElog 
##      8.0000000      0.6324555 
## 
## [[2]]
## HISPANIC vs WHITE             SElog 
##         6.0000000         0.6454972 
## 
## [[3]]
## OTHER vs WHITE          SElog 
##      4.0000000      0.6708204

You can check these values against the Std. Error column in the summary of the logistic model.

summGlmCHD = summary(glmCHD)
summGlmCHD$coefficients[ ,2]
##  (Intercept)    raceBLACK raceHISPANIC    raceOTHER 
##    0.4999999    0.6324554    0.6454971    0.6708203

Either way, from here it’s a small step to get the confidence intervals for the betas (remember that we have already obtained them with confint.default) and the odds ratios.
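
For example, here is a minimal sketch using oddsRatio2: the 95% interval for each odds ratio is \(\exp\left(\ln(\hat{OR}) \pm z_{0.975}\,\hat{SE}\right)\), and the result reproduces the race rows of the exp(confint.default(glmCHD)) table above.

sapply(oddsRatio2(CHDtable), function(x) {
  # x holds the odds ratio and SElog for one group vs the reference.
  exp(log(x[1]) + c(lower = -1, upper = 1) * qnorm(0.975) * x[2])
})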

Using deviation from means coding.

As I told you before, we can use the contrasts function not only to see the dummy variable coding of a factor in R, but also to change it. In this case I begin by creating the matrix for the deviation from means coding:

(devMeans = rbind(rep(-1, 3), diag(rep(1, 3), nrow = 3)))
##      [,1] [,2] [,3]
## [1,]   -1   -1   -1
## [2,]    1    0    0
## [3,]    0    1    0
## [4,]    0    0    1

You can use matrix and directly enter the elements of the matrix, but that gets tedious with large matrices. I used rbind and diag to show you another way to do this.
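
For comparison, here is the same matrix entered directly (remember that matrix fills its elements column by column):

devMeans2 = matrix(c(-1, 1, 0, 0,
                     -1, 0, 1, 0,
                     -1, 0, 0, 1), ncol = 3)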

Now, simply use contrasts to inform R that this is the coding we would like to use:

contrasts(CHDdf$race) = devMeans

That's it! Now, when we fit the model, R uses that coding and returns the corresponding coefficients and the associated information:

glmCHD2 = glm(chd ~ race, family = binomial(link = "logit"), data = CHDdf)
summary(glmCHD2)
## 
## Call:
## glm(formula = chd ~ race, family = binomial(link = "logit"), 
##     data = CHDdf)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4823  -1.1774   0.1162   1.0108   1.7941  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.07192    0.21890  -0.329   0.7425  
## race1        0.76507    0.35059   2.182   0.0291 *
## race2        0.47739    0.36228   1.318   0.1876  
## race3        0.07192    0.38460   0.187   0.8517  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.63  on 99  degrees of freedom
## Residual deviance: 124.59  on 96  degrees of freedom
## AIC: 132.59
## 
## Number of Fisher Scoring iterations: 4
logLik(glmCHD2)
## 'log Lik.' -62.29372 (df=4)

Note that the log likelihood has not changed; it does not depend on the coding.

Variance-covariance matrix for this coding:

We get this with

vcov(glmCHD2)
##              (Intercept)       race1        race2        race3
## (Intercept)  0.047916655 -0.01041667 -0.006249989  0.002083345
## race1       -0.010416669  0.12291663 -0.031249997 -0.039583331
## race2       -0.006249989 -0.03125000  0.131249988 -0.043750011
## race3        0.002083345 -0.03958333 -0.043750011  0.147916655

Be careful with the row and column ordering when comparing this to the Stata output. To make that comparison easier:

vcov(glmCHD2)[c(2:4,1), c(2:4,1)]
##                   race1        race2        race3  (Intercept)
## race1        0.12291663 -0.031249997 -0.039583331 -0.010416669
## race2       -0.03125000  0.131249988 -0.043750011 -0.006249989
## race3       -0.03958333 -0.043750011  0.147916655  0.002083345
## (Intercept) -0.01041667 -0.006249989  0.002083345  0.047916655

Interpretation of the \(\beta\) coefficients in the deviation from means coding.

The first step is to get the odds for each of the groups. Going back to the contingency table

CHDtable
##          PRESENT ABSENT
## WHITE          5     20
## BLACK         20     10
## HISPANIC      15     10
## OTHER         10     10

the odds are obtained simply as the quotient of the columns. The log odds are immediate from that:

(groupOdds = CHDtable[ , 1] / CHDtable[ , 2])
##    WHITE    BLACK HISPANIC    OTHER 
##     0.25     2.00     1.50     1.00
(groupLogOdds = log(groupOdds))
##      WHITE      BLACK   HISPANIC      OTHER 
## -1.3862944  0.6931472  0.4054651  0.0000000

The average log odds is

(meanLogOdd = mean(groupLogOdds))
## [1] -0.07192052

and subtracting it from the log odds of each group we get:

groupLogOdds - meanLogOdd
##       WHITE       BLACK    HISPANIC       OTHER 
## -1.31437384  0.76506770  0.47738563  0.07192052

Compare this with the coefficients column (labelled Estimate) from the model fit:

summary(glmCHD2)$coefficients
##                Estimate Std. Error    z value   Pr(>|z|)
## (Intercept) -0.07192052  0.2188987 -0.3285561 0.74249122
## race1        0.76506770  0.3505947  2.1822000 0.02909478
## race2        0.47738563  0.3622844  1.3177096 0.18760089
## race3        0.07192052  0.3845993  0.1870011 0.85165973

The geometric mean of the group odds is:

prod(groupOdds)^(1/4)
## [1] 0.9306049

and the odds ratio of each group's odds against this average is:

groupOdds / prod(groupOdds)^(1/4)
##     WHITE     BLACK  HISPANIC     OTHER 
## 0.2686425 2.1491399 1.6118549 1.0745699

which is the same result that you get if you exponentiate the coefficients for the groups (recall that the first beta in R is the intercept):

betas = summary(glmCHD2)$coefficients[, 1]
exp(betas)
## (Intercept)       race1       race2       race3 
##   0.9306049   2.1491399   1.6118549   1.0745699

To get the odds ratio for blacks vs whites, as explained in slide 14 of the lecture, we have to do a rather non-intuitive computation. The logit for blacks is \(\alpha + \beta_1\), while the logit for whites (the row coded with all \(-1\)s) is \(\alpha - \beta_1 - \beta_2 - \beta_3\); their difference, the log odds ratio, is therefore \(2\beta_1 + \beta_2 + \beta_3\):

exp(2 * betas[2] + betas[3] + betas[4] )
## race1 
##     8

which is the same as:

groupOdds[2] / groupOdds[1]
## BLACK 
##     8

As for the confidence intervals, notice that confint.default uses the same coding as the model fit.

confint.default(glmCHD2)
##                   2.5 %    97.5 %
## (Intercept) -0.50095415 0.3571131
## race1        0.07791476 1.4522206
## race2       -0.23267875 1.1874500
## race3       -0.68188034 0.8257214
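
As before, exponentiating gives confidence intervals for the corresponding odds ratios. With this coding, the exponentiated race coefficients compare each group's odds to the geometric mean of the group odds, while the exponentiated intercept estimates that geometric mean itself:

exp(confint.default(glmCHD2))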

Thanks for your attention!