RACE levels.These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.
https://github.com/fernandosansegundo/LogisticRegressionCoursera
This example uses the LOWBWT data set. You can get the data set in txt format from the book web site, as described in my notes for Week1 of the course. We read the full data set into R:
LOWBWT = read.table("./data/LOWBWT.txt", header = TRUE)
We convert LOW, SMOKE and RACE to factors, to let R handle them automatically in the modelling process.
LOWBWT$SMOKE = factor(LOWBWT$SMOKE)
LOWBWT$RACE = factor(LOWBWT$RACE)
We are going to use strings to label the RACE factor levels. Some care is needed in the code below because of this choice, but in return the tables are much easier to read.
levels(LOWBWT$RACE) = c("White", "Black", "Other")
The global (non stratified) contingency table for LOW vs SMOKE is given by:
nonStraTable = with(LOWBWT, table(LOW, SMOKE))
addmargins(nonStraTable)
## SMOKE
## LOW 0 1 Sum
## 0 86 44 130
## 1 29 30 59
## Sum 115 74 189
Note that with can be used in an R sentence to avoid (ab)using the $ notation while accessing the variables in a data frame.
In what follows we will we computing the odds ratio for a number of contingency tables, so we may as well define a function to do it:
OddsRatio = function(T){
T[1, 1] * T[2, 2] / (T[1, 2] * T[2, 1])
}
Let’s check that it works by computing the crude (again, non stratified) odds ratio:
(crudeOR = OddsRatio(nonStraTable))
## [1] 2.021944
RACE levels.To get contingency tables stratified by levels of RACE we simply add the factor to the arguments of table.
(straTable = with(LOWBWT, table(LOW, SMOKE, RACE)))
## , , RACE = White
##
## SMOKE
## LOW 0 1
## 0 40 33
## 1 4 19
##
## , , RACE = Black
##
## SMOKE
## LOW 0 1
## 0 11 4
## 1 5 6
##
## , , RACE = Other
##
## SMOKE
## LOW 0 1
## 0 35 7
## 1 20 5
By the way, this is the first instance where we see the advantage of using strings to label the RACE levels.
The resulting R object straTable is an array, the multidimensional analogue of a matrix. You can access the individual tables using dimension numbers, but in this case it is probably better to use the labels as in:
straTable[, , "Black"]
## SMOKE
## LOW 0 1
## 0 11 4
## 1 5 6
We can now use our function to get the odds ratio for each of these tables:
sapply(levels(LOWBWT$RACE), FUN = function(x){OddsRatio(straTable[ , ,x])})
## White Black Other
## 5.757576 3.300000 1.250000
The function OR.MH whose code appears below outputs a list with the Mantel-Haenszel estimator of the common OR and the table used to compute that estimator (see the table in page 6 of the lecture pdf).
The arguments of the function are:
Y (LOW in the example),F (SMOKE in the example)S (RACE in the example)See below for some comments about the R code.
OR.MH = function(Y, F , S){
T = table(Y, F, S)
MHtable = mapply(seq(levels(S)), FUN = function(x){
Tx = straTable[ , ,x]
N = sum(Tx)
ADdivN = (Tx[1, 1] * Tx[2,2])/N
BCdivN = (Tx[1, 2] * Tx[2,1])/N
return(c( c(t(Tx)), N, ADdivN, BCdivN))
})
MHtable = t(MHtable)
MHtable = signif(MHtable, 4)
colnames(MHtable) = c(letters[1:4], "Ni", "a*d/N", "b*c/N")
row.names(MHtable) = levels(S)
numerOR = sum(MHtable[,"a*d/N"])
denomOR = sum(MHtable[,"b*c/N"])
OR = numerOR / denomOR
return(list(MHtable= MHtable, numerOR=numerOR, denomOR=denomOR, OR = OR))
}
## Applying the function:
with(LOWBWT, OR.MH(Y = LOW, F = SMOKE, S = RACE))
## $MHtable
## a b c d Ni a*d/N b*c/N
## White 40 33 4 19 96 7.917 1.3750
## Black 11 4 5 6 26 2.538 0.7692
## Other 35 7 20 5 67 2.612 2.0900
##
## $numerOR
## [1] 13.067
##
## $denomOR
## [1] 4.2342
##
## $OR
## [1] 3.086061
mapply function is used to apply the same set of operations to the contingency table for a fixed level x of the factor S. Note that we use seq to loop over the levels of S in the call to mapply.c(t(Tx)) part of the return value is a trick to get the values of a, b, c, d in order. t(Tx) is the transpose of the contingency table, and applying c converts any matrix (or table or array) to a vector, in column-first order. That is why we need to transpose, to get a, b, c, d instead of a, c, b, d which is R’s default behavior.Base R includes a function mantelhaen.test that returns the same common odds ratio estimator as part of its output:
(MH = mantelhaen.test(straTable, correct = FALSE))
##
## Mantel-Haenszel chi-squared test without continuity correction
##
## data: straTable
## Mantel-Haenszel X-squared = 9.4134, df = 1, p-value = 0.002154
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.490740 6.389949
## sample estimates:
## common odds ratio
## 3.086381
Be careful when interpreting the results of this function. The null hypothesis being tested here is not the homogeneity of the odds ratio. The null of this function (as the output remarks) is that the true common odds ratio is not equal to 1, assuming that homogeneity holds and there is such thing as a common odds ratio. Thus, this test should only be performed after homogeneity has been established. I’m including it here as a quick way to get to the value of the estimator.
As with the previous estimator, I have created a function to obtain the table on page 8 of the lecture pdf, that leads to the logit-based estimator for the common odds-ratio. The code is similar to the previous function, and the comments for that function apply here as well.
OR.logit = function(Y, F , S){
T = table(Y, F, S)
logitTable = mapply(seq(levels(S)), FUN = function(x){
Tx = straTable[ , ,x]
OR = OddsRatio(Tx)
logOR = log(OR)
VarLogOR = sum(1/Tx)
w = 1/VarLogOR
wLogOr = w * logOR
return(c( c(t(Tx)), OR, logOR, VarLogOR, w, wLogOr))
})
logitTable = t(logitTable)
logitTable = signif(logitTable, 4)
colnames(logitTable) = c(letters[1:4], "OR", "ln(OR)", "Var(ln(OR))", "w", "w ln(OR)")
row.names(logitTable) = levels(S)
ORestimate = exp(sum(logitTable[ , "w ln(OR)"]) / sum(logitTable[ ,"w"]))
return(list(logitTable= logitTable, ORestimate = ORestimate))
}
# Applying the function:
(OR.logit.LOWBWT = with(LOWBWT, OR.logit(LOW, SMOKE, RACE)))
## $logitTable
## a b c d OR ln(OR) Var(ln(OR)) w w ln(OR)
## White 40 33 4 19 5.758 1.7510 0.3579 2.794 4.8910
## Black 11 4 5 6 3.300 1.1940 0.7076 1.413 1.6870
## Other 35 7 20 5 1.250 0.2231 0.4214 2.373 0.5295
##
## $ORestimate
## [1] 2.945172
The table on page 9 of the lecture pdf is built upon the results of the logit-based estimator for the common OR. The following function takes the table that we get as a result of OR.logit and outputs a list with:
HomogTest = function(OR.logit){
K = cbind(OR.logit[[1]][, 6], rep(log(OR.logit[[2]]), nrow(OR.logit[[1]])))
K = cbind(K, (K[ ,1] - K[ , 2])^2)
K = cbind(K , OR.logit[[1]][, 8], K[ ,3] * OR.logit[[1]][, 8])
chiSq = sum(K[ , 5])
pValue = pchisq(chiSq, df= nrow(OR.logit[[1]]) - 1, lower.tail = FALSE)
return(list(values = K, chiSq = chiSq, pValue=pValue , df=nrow(OR.logit[[1]]) - 1) )
}
## Applying it to the data:
HomogTest(OR.logit.LOWBWT)
## $values
## [,1] [,2] [,3] [,4] [,5]
## White 1.7510 1.080167 0.45001668 2.794 1.25734661
## Black 1.1940 1.080167 0.01295791 1.413 0.01830953
## Other 0.2231 1.080167 0.73456414 2.373 1.74312070
##
## $chiSq
## [1] 3.018777
##
## $pValue
## [1] 0.2210451
##
## $df
## [1] 2
I have found the following page with R code for the Breslow Day homogeneity test.
http://www.math.montana.edu/~jimrc/classes/stat524/Rcode/breslowday.test.r
We can get that code into R directly from the URL using source. BUT BE CAREFUL!! You should always check it before running code from an external URL like this, to avoid exposing your system to potentially harmful code. Uncomment the first line below if you are sure that you want to run this code. I ave also included the output I got from that test.
# source("http://www.math.montana.edu/~jimrc/classes/stat524/Rcode/breslowday.test.r")
# breslowday.test(straTable)
## ### OUTPUT FOR OUR DATA:
## White Black Other
## log OR 1.7505165 1.1939225 0.2231436
## Weight 0.2653745 0.6401681 0.3989697
## OR Stat df pvalue
## 3.0863813 2.9227163 2.0000000 0.2319211
In order to use logistic regression for the homogeneity analysis we are going to build the three models described in the lecture. This modelling step is already well known to us, so I proceed directly to get the results shown in the lecture pdf:
The first (crude) model is the one without the variable RACE:
model1 = glm(LOW ~ SMOKE, family = binomial(link = "logit"), data = LOWBWT)
(summ1 = summary(model1))
##
## Call:
## glm(formula = LOW ~ SMOKE, family = binomial(link = "logit"),
## data = LOWBWT)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0197 -0.7623 -0.7623 1.3438 1.6599
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0871 0.2147 -5.062 4.14e-07 ***
## SMOKE1 0.7041 0.3196 2.203 0.0276 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 229.80 on 187 degrees of freedom
## AIC: 233.8
##
## Number of Fisher Scoring iterations: 4
(OR1 = exp(coefficients(model1)[2]))
## SMOKE1
## 2.021944
logLik(model1)
## 'log Lik.' -114.9023 (df=2)
The next model includes RACE without interaction term:
model2 = glm(LOW ~ SMOKE + RACE, family = binomial(link = "logit"), data = LOWBWT)
(summ2 = summary(model2))
##
## Call:
## glm(formula = LOW ~ SMOKE + RACE, family = binomial(link = "logit"),
## data = LOWBWT)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3442 -0.8862 -0.5428 1.4964 1.9939
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8405 0.3529 -5.216 1.83e-07 ***
## SMOKE1 1.1160 0.3692 3.023 0.00251 **
## RACEBlack 1.0841 0.4900 2.212 0.02693 *
## RACEOther 1.1086 0.4003 2.769 0.00562 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 219.97 on 185 degrees of freedom
## AIC: 227.97
##
## Number of Fisher Scoring iterations: 4
(OR2 = exp(coefficients(model2)["SMOKE1"]))
## SMOKE1
## 3.052631
logLik(model2)
## 'log Lik.' -109.9874 (df=4)
The odds ratio provided by this model (which equals 3.05) is an alternative estimate of the common odds ratio under the homogeneity hypothesis.
The likelihood ratio test comparing this model to the previous one is performed as follows:
(Gmodel2 = summ1$deviance - summ2$deviance)
## [1] 9.829889
(pValue2 = pchisq(Gmodel2, df = summ2$df[1] - summ1$df[1], lower.tail = FALSE))
## [1] 0.007336125
And the final model is the one with the interaction term:
model3 = glm(LOW ~ SMOKE * RACE, family = binomial(link = "logit"), data = LOWBWT)
(summ3 = summary(model3))
##
## Call:
## glm(formula = LOW ~ SMOKE * RACE, family = binomial(link = "logit"),
## data = LOWBWT)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3537 -0.9508 -0.4366 1.4190 2.1899
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3026 0.5243 -4.392 1.12e-05 ***
## SMOKE1 1.7505 0.5982 2.926 0.00343 **
## RACEBlack 1.5141 0.7522 2.013 0.04412 *
## RACEOther 1.7430 0.5945 2.932 0.00337 **
## SMOKE1:RACEBlack -0.5566 1.0322 -0.539 0.58972
## SMOKE1:RACEOther -1.5274 0.8828 -1.730 0.08359 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 216.82 on 183 degrees of freedom
## AIC: 228.82
##
## Number of Fisher Scoring iterations: 4
logLik(model3)
## 'log Lik.' -108.4089 (df=6)
In this case we are not really interested in the odds ratio estimated by this model.
The likelihood ratio test comparing this model to the previous one is:
(Gmodel3 = summ2$deviance - summ3$deviance)
## [1] 3.156937
(pValue3 = pchisq(Gmodel3, df = summ3$df[1] - summ2$df[1], lower.tail = FALSE))
## [1] 0.2062908
and the conclusion, as explained in the lecture, is that homogeneity holds (odds ratios across different RACE levels are within sampling variability).
Thanks for your attention!