Introduction

These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.

Reading the low birth weight data file and creating a data frame.

This example uses the LOWBWT data set. You can get the data set in txt format from the book's website, as described in my notes for Week 1 of the course. We read the full data set into R:

LOWBWT = read.table("./data/LOWBWT.txt", header = TRUE)

We convert LOW, SMOKE and RACE to factors, so that R handles them automatically in the modelling process.

LOWBWT$LOW = factor(LOWBWT$LOW)

LOWBWT$SMOKE = factor(LOWBWT$SMOKE)

LOWBWT$RACE = factor(LOWBWT$RACE)

We are going to use strings to label the RACE factor levels. Some care is needed in the code below because of this choice, but in return the tables are much easier to read.

levels(LOWBWT$RACE) = c("White", "Black", "Other")

The global (non-stratified) contingency table for LOW vs SMOKE is given by:

nonStraTable = with(LOWBWT, table(LOW, SMOKE))
addmargins(nonStraTable)
##      SMOKE
## LOW     0   1 Sum
##   0    86  44 130
##   1    29  30  59
##   Sum 115  74 189

Note that with can be used in an R expression to avoid (ab)using the $ notation when accessing the variables of a data frame.

In what follows we will be computing the odds ratio for a number of contingency tables, so we may as well define a function to do it:

OddsRatio = function(T){
  # Cross-product ratio (a * d) / (b * c) of a 2x2 contingency table
  T[1, 1] * T[2, 2] / (T[1, 2] * T[2, 1])
}

Let’s check that it works by computing the crude (again, non-stratified) odds ratio:

(crudeOR = OddsRatio(nonStraTable))
## [1] 2.021944

Contingency tables and odds ratio for each of the RACE levels.

To get contingency tables stratified by levels of RACE we simply add the factor to the arguments of table.

(straTable = with(LOWBWT, table(LOW, SMOKE, RACE)))
## , , RACE = White
## 
##    SMOKE
## LOW  0  1
##   0 40 33
##   1  4 19
## 
## , , RACE = Black
## 
##    SMOKE
## LOW  0  1
##   0 11  4
##   1  5  6
## 
## , , RACE = Other
## 
##    SMOKE
## LOW  0  1
##   0 35  7
##   1 20  5

By the way, this is the first instance where we see the advantage of using strings to label the RACE levels.

The resulting R object straTable is an array, the multidimensional analogue of a matrix. You can access the individual tables using dimension numbers, but in this case it is probably better to use the labels as in:

straTable[, , "Black"]
##    SMOKE
## LOW  0 1
##   0 11 4
##   1  5 6

We can now use our function to get the odds ratio for each of these tables:

sapply(levels(LOWBWT$RACE), FUN = function(x){OddsRatio(straTable[ , ,x])})
##    White    Black    Other 
## 5.757576 3.300000 1.250000

Mantel-Haenszel estimator

The function OR.MH whose code appears below outputs a list with the Mantel-Haenszel estimator of the common OR and the table used to compute that estimator (see the table on page 6 of the lecture pdf).
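
For reference, if the 2x2 table in stratum $i$ has cells $a_i, b_i, c_i, d_i$ and total $N_i$, the Mantel-Haenszel estimator of the common odds ratio is

$$\widehat{OR}_{MH} = \frac{\sum_{i=1}^{K} a_i d_i / N_i}{\sum_{i=1}^{K} b_i c_i / N_i}$$

and the function computes the numerator and denominator of that quotient separately.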

The arguments of the function are:

- Y, the response variable (LOW in our example),
- F, the exposure factor (SMOKE),
- S, the stratifying factor (RACE).

The R code below includes some inline comments.

OR.MH = function(Y, F, S){
  tbl = table(Y, F, S)
  MHtable = sapply(seq_along(levels(S)), FUN = function(x){
    Tx = tbl[ , , x]                        # 2x2 table for stratum x
    N = sum(Tx)                             # stratum total
    ADdivN = (Tx[1, 1] * Tx[2, 2]) / N      # a*d/N
    BCdivN = (Tx[1, 2] * Tx[2, 1]) / N      # b*c/N
    return(c(c(t(Tx)), N, ADdivN, BCdivN))  # cells a, b, c, d followed by the summaries
  })
  MHtable = t(MHtable)                      # one row per stratum
  MHtable = signif(MHtable, 4)
  colnames(MHtable) = c(letters[1:4], "Ni", "a*d/N", "b*c/N")
  row.names(MHtable) = levels(S)
  numerOR = sum(MHtable[ , "a*d/N"])
  denomOR = sum(MHtable[ , "b*c/N"])
  OR = numerOR / denomOR
  return(list(MHtable = MHtable, numerOR = numerOR, denomOR = denomOR, OR = OR))
}

## Applying the function:

with(LOWBWT, OR.MH(Y = LOW, F = SMOKE, S = RACE))
## $MHtable
##        a  b  c  d Ni a*d/N  b*c/N
## White 40 33  4 19 96 7.917 1.3750
## Black 11  4  5  6 26 2.538 0.7692
## Other 35  7 20  5 67 2.612 2.0900
## 
## $numerOR
## [1] 13.067
## 
## $denomOR
## [1] 4.2342
## 
## $OR
## [1] 3.086061

Base R includes a function mantelhaen.test that returns the same common odds ratio estimator as part of its output:

(MH = mantelhaen.test(straTable, correct = FALSE))
## 
##  Mantel-Haenszel chi-squared test without continuity correction
## 
## data:  straTable
## Mantel-Haenszel X-squared = 9.4134, df = 1, p-value = 0.002154
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.490740 6.389949
## sample estimates:
## common odds ratio 
##          3.086381

Be careful when interpreting the results of this function. The hypothesis being tested here is not the homogeneity of the odds ratio. The null of this function is that the true common odds ratio equals 1 (the alternative, as the output remarks, is that it is not equal to 1), assuming that homogeneity holds and there is such a thing as a common odds ratio. Thus, this test should only be performed after homogeneity has been established. I’m including it here as a quick way to get the value of the estimator.

Weighted logit-based estimator for the common (pooled) OR.

As with the previous estimator, I have created a function to obtain the table on page 8 of the lecture pdf that leads to the logit-based estimator of the common odds ratio. The code is similar to the previous function, and the comments for that function apply here as well.
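
For reference, this estimator combines the stratum-specific log odds ratios using inverse-variance weights:

$$\widehat{OR}_L = \exp\left(\frac{\sum_{i=1}^{K} w_i \ln \widehat{OR}_i}{\sum_{i=1}^{K} w_i}\right), \qquad w_i = \frac{1}{\widehat{\operatorname{Var}}(\ln \widehat{OR}_i)} = \left(\frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}\right)^{-1}$$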

OR.logit = function(Y, F, S){
  tbl = table(Y, F, S)
  logitTable = sapply(seq_along(levels(S)), FUN = function(x){
    Tx = tbl[ , , x]            # 2x2 table for stratum x
    OR = OddsRatio(Tx)          # stratum-specific odds ratio
    logOR = log(OR)
    VarLogOR = sum(1/Tx)        # estimated variance of ln(OR): 1/a + 1/b + 1/c + 1/d
    w = 1/VarLogOR              # inverse-variance weight
    wLogOr = w * logOR
    return(c(c(t(Tx)), OR, logOR, VarLogOR, w, wLogOr))
  })
  logitTable = t(logitTable)    # one row per stratum
  logitTable = signif(logitTable, 4)
  colnames(logitTable) = c(letters[1:4], "OR", "ln(OR)", "Var(ln(OR))", "w", "w ln(OR)")
  row.names(logitTable) = levels(S)
  ORestimate = exp(sum(logitTable[ , "w ln(OR)"]) / sum(logitTable[ , "w"]))
  return(list(logitTable = logitTable, ORestimate = ORestimate))
}

# Applying the function:
(OR.logit.LOWBWT = with(LOWBWT, OR.logit(LOW, SMOKE, RACE)))
## $logitTable
##        a  b  c  d    OR ln(OR) Var(ln(OR))     w w ln(OR)
## White 40 33  4 19 5.758 1.7510      0.3579 2.794   4.8910
## Black 11  4  5  6 3.300 1.1940      0.7076 1.413   1.6870
## Other 35  7 20  5 1.250 0.2231      0.4214 2.373   0.5295
## 
## $ORestimate
## [1] 2.945172

Homogeneity test of the OR across strata, based on weighted sum of squared deviations.

The table on page 9 of the lecture pdf is built upon the results of the logit-based estimator for the common OR. The following function takes the list that we get as a result of OR.logit and outputs a list with the table of weighted squared deviations, the chi-square statistic of the homogeneity test, its p-value and the corresponding degrees of freedom.
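
The statistic is the weighted sum of squared deviations of the stratum-specific log odds ratios from the pooled one,

$$X^2 = \sum_{i=1}^{K} w_i \left(\ln \widehat{OR}_i - \ln \widehat{OR}_L\right)^2,$$

which is approximately chi-square with $K - 1$ degrees of freedom under the homogeneity hypothesis.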

HomogTest = function(ORlogit){
  logitTable = ORlogit$logitTable
  # Stratum-specific ln(OR) next to the pooled ln(OR)
  K = cbind(logitTable[ , "ln(OR)"], rep(log(ORlogit$ORestimate), nrow(logitTable)))
  # Squared deviations of each ln(OR) from the pooled value
  K = cbind(K, (K[ , 1] - K[ , 2])^2)
  # Weights and weighted squared deviations
  K = cbind(K, logitTable[ , "w"], K[ , 3] * logitTable[ , "w"])
  chiSq = sum(K[ , 5])
  df = nrow(logitTable) - 1
  pValue = pchisq(chiSq, df = df, lower.tail = FALSE)
  return(list(values = K, chiSq = chiSq, pValue = pValue, df = df))
}

## Applying it to the data:

HomogTest(OR.logit.LOWBWT)
## $values
##         [,1]     [,2]       [,3]  [,4]       [,5]
## White 1.7510 1.080167 0.45001668 2.794 1.25734661
## Black 1.1940 1.080167 0.01295791 1.413 0.01830953
## Other 0.2231 1.080167 0.73456414 2.373 1.74312070
## 
## $chiSq
## [1] 3.018777
## 
## $pValue
## [1] 0.2210451
## 
## $df
## [1] 2

Breslow-Day

I have found the following page with R code for the Breslow-Day homogeneity test:

http://www.math.montana.edu/~jimrc/classes/stat524/Rcode/breslowday.test.r

We can get that code into R directly from the URL using source. BUT BE CAREFUL!! You should always check code from an external URL like this before running it, to avoid exposing your system to potentially harmful code. Uncomment the lines below if you are sure that you want to run this code. I have also included the output I got from that test.

# source("http://www.math.montana.edu/~jimrc/classes/stat524/Rcode/breslowday.test.r")
# breslowday.test(straTable)

## ### OUTPUT FOR OUR DATA:
##            White     Black     Other
## log OR 1.7505165 1.1939225 0.2231436
## Weight 0.2653745 0.6401681 0.3989697
##        OR      Stat        df    pvalue 
## 3.0863813 2.9227163 2.0000000 0.2319211 
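
If you prefer not to source code from the web, the DescTools package also implements this test; a minimal sketch, assuming the package is installed and that its BreslowDayTest function accepts a 2x2xK table such as straTable (uncomment to run):

# install.packages("DescTools")
# library(DescTools)
# BreslowDayTest(straTable)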

Homogeneity analysis through logistic regression

In order to use logistic regression for the homogeneity analysis we are going to build the three models described in the lecture. This modelling step is already well known to us, so I proceed directly to get the results shown in the lecture pdf:

The first (crude) model is the one without the variable RACE:

model1 = glm(LOW ~  SMOKE, family = binomial(link = "logit"), data = LOWBWT)
(summ1 = summary(model1))
## 
## Call:
## glm(formula = LOW ~ SMOKE, family = binomial(link = "logit"), 
##     data = LOWBWT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0197  -0.7623  -0.7623   1.3438   1.6599  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0871     0.2147  -5.062 4.14e-07 ***
## SMOKE1        0.7041     0.3196   2.203   0.0276 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 229.80  on 187  degrees of freedom
## AIC: 233.8
## 
## Number of Fisher Scoring iterations: 4
(OR1 = exp(coefficients(model1)[2]))
##   SMOKE1 
## 2.021944
logLik(model1)
## 'log Lik.' -114.9023 (df=2)

The next model includes RACE, without an interaction term:

model2 = glm(LOW ~  SMOKE + RACE, family = binomial(link = "logit"), data = LOWBWT)
(summ2 = summary(model2))
## 
## Call:
## glm(formula = LOW ~ SMOKE + RACE, family = binomial(link = "logit"), 
##     data = LOWBWT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3442  -0.8862  -0.5428   1.4964   1.9939  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8405     0.3529  -5.216 1.83e-07 ***
## SMOKE1        1.1160     0.3692   3.023  0.00251 ** 
## RACEBlack     1.0841     0.4900   2.212  0.02693 *  
## RACEOther     1.1086     0.4003   2.769  0.00562 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 219.97  on 185  degrees of freedom
## AIC: 227.97
## 
## Number of Fisher Scoring iterations: 4
(OR2 = exp(coefficients(model2)["SMOKE1"]))
##   SMOKE1 
## 3.052631
logLik(model2)
## 'log Lik.' -109.9874 (df=4)

The odds ratio provided by this model (which equals 3.05) is an alternative estimate of the common odds ratio under the homogeneity hypothesis.

The likelihood ratio test comparing this model to the previous one is performed as follows:

(Gmodel2 = summ1$deviance - summ2$deviance) 
## [1] 9.829889
(pValue2 = pchisq(Gmodel2, df = summ2$df[1] - summ1$df[1], lower.tail = FALSE))
## [1] 0.007336125
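
By the way, the same test can be obtained in a single call with base R's anova function; the Deviance column of its output contains the G statistic computed above (the analogous call compares model2 and model3 further below):

anova(model1, model2, test = "Chisq")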

And the final model is the one with the interaction term:

model3 = glm(LOW ~  SMOKE * RACE, family = binomial(link = "logit"), data = LOWBWT)
(summ3 = summary(model3))
## 
## Call:
## glm(formula = LOW ~ SMOKE * RACE, family = binomial(link = "logit"), 
##     data = LOWBWT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3537  -0.9508  -0.4366   1.4190   2.1899  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.3026     0.5243  -4.392 1.12e-05 ***
## SMOKE1             1.7505     0.5982   2.926  0.00343 ** 
## RACEBlack          1.5141     0.7522   2.013  0.04412 *  
## RACEOther          1.7430     0.5945   2.932  0.00337 ** 
## SMOKE1:RACEBlack  -0.5566     1.0322  -0.539  0.58972    
## SMOKE1:RACEOther  -1.5274     0.8828  -1.730  0.08359 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 216.82  on 183  degrees of freedom
## AIC: 228.82
## 
## Number of Fisher Scoring iterations: 4
logLik(model3)
## 'log Lik.' -108.4089 (df=6)

In this case we are not really interested in the odds ratio estimated by this model.

The likelihood ratio test comparing this model to the previous one is:

(Gmodel3 = summ2$deviance - summ3$deviance) 
## [1] 3.156937
(pValue3 = pchisq(Gmodel3, df = summ3$df[1] - summ2$df[1], lower.tail = FALSE))
## [1] 0.2062908

and the conclusion, as explained in the lecture, is that homogeneity holds (the differences among the odds ratios across RACE levels are within sampling variability).


Thanks for your attention!