race variableThese are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.
I have not been able to locate the ICU data set used for the last part of this week lecture (slides 19 to the end, video segments 4.5 and 4.6). That data set contains variables (e.g. MVENT) that do not appear in the data set with the same name that I have. Thus, I am not able to reproduce the results in that part of the lecture. If any one can help me locate the data, I’d appreciate that.
https://github.com/fernandosansegundo/LogisticRegressionCoursera
The contingency table in slide 2 can be entered as a matrix in R:
CHDtable = matrix(c(5, 20, 15, 10, 20, 10, 10, 10), ncol = 2)
colnames(CHDtable) = c("PRESENT", "ABSENT")
rownames(CHDtable) = c("WHITE", "BLACK", "HISPANIC", "OTHER")
addmargins(CHDtable)
## PRESENT ABSENT Sum
## WHITE 5 20 25
## BLACK 20 10 30
## HISPANIC 15 10 25
## OTHER 10 10 20
## Sum 50 50 100
To generate the raw data from this table I am going to use the expand.table function in the epitools library. Uncommenting (remove #) the first line will automatically download and install that library.
#install.packages("epitools")
library(epitools)
CHDdf = expand.table(CHDtable)
names(CHDdf) = c("race", "chd")
head(CHDdf)
## race chd
## 1 WHITE PRESENT
## 2 WHITE PRESENT
## 3 WHITE PRESENT
## 4 WHITE PRESENT
## 5 WHITE PRESENT
## 6 WHITE ABSENT
tail(CHDdf)
## race chd
## 95 OTHER ABSENT
## 96 OTHER ABSENT
## 97 OTHER ABSENT
## 98 OTHER ABSENT
## 99 OTHER ABSENT
## 100 OTHER ABSENT
And we can check the result by converting it back to a table:
table(CHDdf$race, CHDdf$chd)
##
## PRESENT ABSENT
## WHITE 5 20
## BLACK 20 10
## HISPANIC 15 10
## OTHER 10 10
This conversion to data.frame has the added benefit that race and chd have been converted to factors:
str(CHDdf)
## 'data.frame': 100 obs. of 2 variables:
## $ race: Factor w/ 4 levels "WHITE","BLACK",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ chd : Factor w/ 2 levels "PRESENT","ABSENT": 1 1 1 1 1 2 2 2 2 2 ...
and we saw that in that case R automatically handles the dummy variables needed for logistic regression. In fact, we can use contrasts to ask R about the dummy coding it is going to use for a factor:
contrasts(CHDdf$race)
## BLACK HISPANIC OTHER
## WHITE 0 0 0
## BLACK 1 0 0
## HISPANIC 0 1 0
## OTHER 0 0 1
As you see the reference cell coding is the default coding that R uses for a factor. Below we will see that contrasts can be used to change the dummies coding as we want.
And now we can fit the logistic model with glm:
glmCHD = glm(chd ~ race, family = binomial(link = "logit"), data = CHDdf)
summary(glmCHD)
##
## Call:
## glm(formula = chd ~ race, family = binomial(link = "logit"),
## data = CHDdf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7941 -1.0108 -0.1162 1.1774 1.4823
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3863 0.5000 2.773 0.00556 **
## raceBLACK -2.0794 0.6325 -3.288 0.00101 **
## raceHISPANIC -1.7918 0.6455 -2.776 0.00551 **
## raceOTHER -1.3863 0.6708 -2.067 0.03878 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.63 on 99 degrees of freedom
## Residual deviance: 124.59 on 96 degrees of freedom
## AIC: 132.59
##
## Number of Fisher Scoring iterations: 4
logLik(glmCHD)
## 'log Lik.' -62.29372 (df=4)
You can compare the results here with those on slide 5 of the lecture pdf.
Have you noticed something strange? The signs of the model coefficients are the opposite of the ones we obtained. Why did that happen? That’s because the order of the levels in the chd factor is:
levels(CHDdf$chd)
## [1] "PRESENT" "ABSENT"
The culprit here is probably the expand.table function. At first I did not notice this because the default rule in R is to use alphabetical order for the levels of a factor. But the expand.table function is doing somethng different, possibly ordering the levels in order of appearance in the data.
Anyway, now that we know, we can easily fix this by telling R which is the reference level. We use the relevel function:
CHDdf$chd = relevel(CHDdf$chd, ref = "ABSENT")
levels(CHDdf$chd)
## [1] "ABSENT" "PRESENT"
In a more complex situation, if you want an arbitrary reordering of all the levels you will need to learn how to use the levels function. Beware however of the confussion between factor levels and the labels of those levels!
Alternatively you can fix the contingency table itslef at the beginning, by executing
#(CHDtable = CHDtable[ , 2:1])
To do that, uncomment this line and copy it under the line addmargins(CHDtable). Note, however, that this has may have undesired side effects, because many people expect the exposed group to be in the first column of the contingency table. I will just leave the table as it was.
So, having fixed the ordering of the factor levels, the model fit now agrees with the Stata output:
glmCHD = glm(chd ~ race, family = binomial(link = "logit"), data = CHDdf)
summary(glmCHD)
##
## Call:
## glm(formula = chd ~ race, family = binomial(link = "logit"),
## data = CHDdf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4823 -1.1774 0.1162 1.0108 1.7941
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3863 0.5000 -2.773 0.00556 **
## raceBLACK 2.0794 0.6325 3.288 0.00101 **
## raceHISPANIC 1.7918 0.6455 2.776 0.00551 **
## raceOTHER 1.3863 0.6708 2.067 0.03878 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.63 on 99 degrees of freedom
## Residual deviance: 124.59 on 96 degrees of freedom
## AIC: 132.59
##
## Number of Fisher Scoring iterations: 4
logLik(glmCHD)
## 'log Lik.' -62.29372 (df=4)
Recall that our contingency table is:
CHDtable
## PRESENT ABSENT
## WHITE 5 20
## BLACK 20 10
## HISPANIC 15 10
## OTHER 10 10
It’s probably my fault, but I have not found any library function in R to compute the odds ratios from a contingency table like this one. But a nice thing about systems like R is that if you need it and you cannot find it, you can make one yourself. Bonus point: we get to use one of the R functions in the apply family. So, here is my function:
oddsRatio = function(CT, refRow=1, exposedCol = 1){
# The row vector has all the indices minus the reference group row index.
rows = (1: nrow(CT))[-refRow]
# We define a function that computes the odds-ratio of any row vs the
# reference row, and also labels the result. And we use sapply
# to apply that function to each of the indices in the rows vector.
sapply(rows, FUN = function(r){
OR = (CT[r, 1] * CT[refRow, 2]) / (CT[r, 2] * CT[refRow, 1])
names(OR) = paste0(row.names(CT)[r], " vs ", row.names(CT)[refRow], collapse = "")
return(OR)
})
}
Note that, besides the contingency table CT, the function has two additional arguments for the row index of the reference group and the column index containing the exposed group. The default values are first row and first column, but they can be set by hand.
Using the function defaults we get:
oddsRatio(CHDtable)
## BLACK vs WHITE HISPANIC vs WHITE OTHER vs WHITE
## 8 6 4
but if we want to use the second row (race = "BLACK") as reference, we simply do this:
oddsRatio(CHDtable, refRow = 2)
## WHITE vs BLACK HISPANIC vs BLACK OTHER vs BLACK
## 0.125 0.750 0.500
Of course, an alternative way to get the odds ratio is to exponentiate the coefficients of the logistic model we fitted before:
exp(glmCHD$coefficients)
## (Intercept) raceBLACK raceHISPANIC raceOTHER
## 0.25 8.00 6.00 4.00
race variableWe have done this before (see the example of the low birth weight study in week 2 lecture code), but for the sake of getting all the numerical results in the lecture I’ll repeat the computation:
(LR = glmCHD$null.deviance - glmCHD$deviance)
## [1] 14.04199
(df = summary(glmCHD)$df.null - summary(glmCHD)$df.residual) #degress of freedom for the chi-sq statistic
## [1] 3
(pValue = pchisq(LR, df, lower.tail = FALSE))
## [1] 0.002848545
We already know that we can use confint.default to get the confidence intervals for the coefficients of the model:
confint.default(glmCHD)
## 2.5 % 97.5 %
## (Intercept) -2.36627611 -0.4063126
## raceBLACK 0.83985175 3.3190313
## raceHISPANIC 0.52660835 3.0569106
## raceOTHER 0.07151073 2.7010780
and exponentiate them to get confidence intervals for the odds ratios:
exp(confint.default(glmCHD))
## 2.5 % 97.5 %
## (Intercept) 0.09382949 0.6661019
## raceBLACK 2.31602361 27.6335698
## raceHISPANIC 1.69317988 21.2617693
## raceOTHER 1.07412968 14.8957806
By the way, if you need a different confidence level (other than 95%), you can use an optional argument, e.g. level=0.99 for 99% intervals.
Alternatively we can get the standard errors for the log odds ratios using the formula on slide 6 of the lecture notes: \[
\hat{SE}(\ln(\hat{OR})) = \sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d} }
\] I have repurposed for this the oddsRatio function that I have shown you before. Now, instead of sapply the function uses lapply to get a list containing, for each group, both the odds ratio (against the reference group) and the standard error for the odds ratio.
oddsRatio2 = function(CT, refRow=1, exposedCol = 1){
# The row vector has all the indices minus the reference group row index.
rows = (1: nrow(CT))[-refRow]
# We define a function that computes the odds-ratio of any row vs the
# reference row, and also labels the result. And we use sapply
# to apply that function to each of the indices in the rows vector.
ORlist = lapply(rows, FUN = function(r){
OR = (CT[r, 1] * CT[refRow, 2]) / (CT[r, 2] * CT[refRow, 1])
names(OR) = paste0(row.names(CT)[r], " vs ", row.names(CT)[refRow], collapse = "")
SE = sqrt(sum(1 / CT[c(r, refRow), ]))
return(c(OR, SElog =SE))
})
return(ORlist)
}
In case you are wondering, the SElog = SE bit is just a trick to force R to use a named value and return the standard error with the name “SElog” attached.
The result of the function is
oddsRatio2(CHDtable)
## [[1]]
## BLACK vs WHITE SElog
## 8.0000000 0.6324555
##
## [[2]]
## HISPANIC vs WHITE SElog
## 6.0000000 0.6454972
##
## [[3]]
## OTHER vs WHITE SElog
## 4.0000000 0.6708204
You can check these values against the Std. Error column in the summary of the logistic model.
summGlmCHD = summary(glmCHD)
summGlmCHD$coefficients[ ,2]
## (Intercept) raceBLACK raceHISPANIC raceOTHER
## 0.4999999 0.6324554 0.6454971 0.6708203
Either way, from here it’s a small step to get the confidence intervals for the betas (remember that we have already obtained them with confint.default) and the odds ratios.
As I told you before, we can use the contrasts function not only to see the dummy variables coding of a factor in R, but to change it. In this case I begin by creating the matrix for the deviation from means coding:
(devMeans = rbind(rep(-1, 3), diag(rep(1, 3), nrow = 3)))
## [,1] [,2] [,3]
## [1,] -1 -1 -1
## [2,] 1 0 0
## [3,] 0 1 0
## [4,] 0 0 1
You can use matrix and directly enter the elements of the matrix, but that gets tedious with large matrices. I used rbind and diag to show you another way to do this.
Now, simply use contrasts to inform R that this is the coding we would like to use:
contrasts(CHDdf$race) = devMeans
that’s it! Now when we fit the model R uses that coding and returns the corresponding coefficients and the associated info:
glmCHD2 = glm(chd ~ race, family = binomial(link = "logit"), data = CHDdf)
summary(glmCHD2)
##
## Call:
## glm(formula = chd ~ race, family = binomial(link = "logit"),
## data = CHDdf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4823 -1.1774 0.1162 1.0108 1.7941
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.07192 0.21890 -0.329 0.7425
## race1 0.76507 0.35059 2.182 0.0291 *
## race2 0.47739 0.36228 1.318 0.1876
## race3 0.07192 0.38460 0.187 0.8517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.63 on 99 degrees of freedom
## Residual deviance: 124.59 on 96 degrees of freedom
## AIC: 132.59
##
## Number of Fisher Scoring iterations: 4
logLik(glmCHD2)
## 'log Lik.' -62.29372 (df=4)
Note that the log likelihood has not changed, it is independent of the coding.
We get this with
vcov(glmCHD2)
## (Intercept) race1 race2 race3
## (Intercept) 0.047916655 -0.01041667 -0.006249989 0.002083345
## race1 -0.010416669 0.12291663 -0.031249997 -0.039583331
## race2 -0.006249989 -0.03125000 0.131249988 -0.043750011
## race3 0.002083345 -0.03958333 -0.043750011 0.147916655
Be careful of the row & column ordering when comparing this to the Stata output. To make that comparison easier
vcov(glmCHD2)[c(2:4,1), c(2:4,1)]
## race1 race2 race3 (Intercept)
## race1 0.12291663 -0.031249997 -0.039583331 -0.010416669
## race2 -0.03125000 0.131249988 -0.043750011 -0.006249989
## race3 -0.03958333 -0.043750011 0.147916655 0.002083345
## (Intercept) -0.01041667 -0.006249989 0.002083345 0.047916655
The first step is to get the odds for each of the groups. Going back to the contingency table
CHDtable
## PRESENT ABSENT
## WHITE 5 20
## BLACK 20 10
## HISPANIC 15 10
## OTHER 10 10
the odds are obtained simply as the quotient of the columns. The log odds are immediate from that:
(groupOdds = CHDtable[ , 1] / CHDtable[ , 2])
## WHITE BLACK HISPANIC OTHER
## 0.25 2.00 1.50 1.00
(groupLogOdds = log(groupOdds))
## WHITE BLACK HISPANIC OTHER
## -1.3862944 0.6931472 0.4054651 0.0000000
The average log odds is
(meanLogOdd = mean(groupLogOdds))
## [1] -0.07192052
and then subtracting from the log odds for each group
groupLogOdds - meanLogOdd
## WHITE BLACK HISPANIC OTHER
## -1.31437384 0.76506770 0.47738563 0.07192052
Compare this with the coefficients column (labelled Estimate) from the model fit:
summary(glmCHD2)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.07192052 0.2188987 -0.3285561 0.74249122
## race1 0.76506770 0.3505947 2.1822000 0.02909478
## race2 0.47738563 0.3622844 1.3177096 0.18760089
## race3 0.07192052 0.3845993 0.1870011 0.85165973
The geometric mean of the group odds is:
prod(groupOdds)^(1/4)
## [1] 0.9306049
and the odds ratio of this average odds with the odds for each of the groups is:
groupOdds / prod(groupOdds)^(1/4)
## WHITE BLACK HISPANIC OTHER
## 0.2686425 2.1491399 1.6118549 1.0745699
which is the same result that you get if you exponentiate the coefficients for the groups (recall that the first beta in R is the intercept):
betas = summary(glmCHD2)$coefficients[, 1]
exp(betas)
## (Intercept) race1 race2 race3
## 0.9306049 2.1491399 1.6118549 1.0745699
To get the odds ratio for blacks vs whites, as explained in slide 14 of the lecture, we have to do this rather non-intuitive computation:
exp(2 * betas[2] + betas[3] + betas[4] )
## race1
## 8
which is the same as:
groupOdds[2] / groupOdds[1]
## BLACK
## 8
As for the confidence intervals, notice that confint.default uses the same coding as the model fit.
confint.default(glmCHD2)
## 2.5 % 97.5 %
## (Intercept) -0.50095415 0.3571131
## race1 0.07791476 1.4522206
## race2 -0.23267875 1.1874500
## race3 -0.68188034 0.8257214
Thanks for your attention!