Introduction

These are my notes for the lectures of the Coursera course “Introduction to Logistic Regression” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures. Please read the Preliminaries of the code for lecture 1 for some details.

Reading the low birth weight data file and creating a data frame

In this lecture we will again use the LOWBWT data set. You can get the data set in txt format from the book web site, as described in my notes for Week 1 of the course. We read the full data set into R:

LOWBWT = read.table("./data/LOWBWT.txt", header = TRUE)
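
As a quick sanity check (my own habit, not part of the lecture), you can confirm that the whole file has been read; the full data set should contain 189 observations:

dim(LOWBWT)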

We will consider a logistic model where the response variable is LOW and the covariates are LWT and RACE. In order to do that we convert RACE to a factor as usual, to let R handle it automatically in the modelling process.

LOWBWT$RACE = factor(LOWBWT$RACE)

As in the previous lecture we are going to use strings to label the RACE factor levels.

levels(LOWBWT$RACE) = c("White", "Black", "Other")

And now we update the data frame to contain only the variables we include in the model:

LOWBWT = with(LOWBWT, data.frame(LOW, LWT, RACE) )
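
Equivalently (just an alternative way of doing the same thing), you can select the columns by name:

LOWBWT = LOWBWT[ , c("LOW", "LWT", "RACE")]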

Covariate patterns

Let us begin with the notion of covariate patterns. To illustrate this, we consider a cross table of the two covariates LWT and RACE:

table(LOWBWT$RACE ,LOWBWT$LWT)
##        
##         80 85 89 90 91 92 94 95 96 97 98 100 101 102 103 105 107 108 109
##   White  0  0  0  3  1  1  0  3  0  0  0   3   0   2   0   2   1   1   1
##   Black  0  0  0  0  0  0  0  0  0  0  1   0   0   0   0   1   0   0   0
##   Other  1  2  1  0  0  0  1  3  1  1  0   2   1   0   3   4   1   0   1
##        
##         110 112 113 115 116 117 118 119 120 121 122 123 124 125 127 128
##   White   5   3   1   2   1   2   2   0   7   1   1   2   1   1   0   0
##   Black   1   1   2   0   0   0   0   0   2   2   1   0   0   1   0   1
##   Other   5   0   0   5   0   0   0   3   8   1   0   1   1   1   1   1
##        
##         129 130 131 132 133 134 135 137 138 140 141 142 147 148 150 153
##   White   1   8   1   1   1   1   4   1   2   3   1   1   2   0   2   0
##   Black   0   2   0   0   0   0   0   0   0   0   0   1   0   0   0   0
##   Other   0   3   0   2   1   2   0   0   0   0   0   0   0   1   3   1
##        
##         154 155 158 160 165 167 168 169 170 175 182 184 185 186 187 189
##   White   1   2   1   1   1   1   0   1   3   1   0   1   0   1   0   1
##   Black   0   0   1   0   0   0   1   0   1   0   1   0   1   0   2   0
##   Other   1   1   0   1   0   0   0   1   0   0   0   0   0   0   0   0
##        
##         190 200 202 215 229 235 241 250
##   White   2   0   1   1   0   1   0   0
##   Black   0   1   0   0   1   0   1   0
##   Other   0   0   0   0   0   0   0   1

Every non-zero number in this table corresponds to a covariate pattern (a bin), and the number itself tells us how many observations belong to that particular pattern. For example, there are 5 women in the covariate pattern defined by (LWT==110, RACE="Other"). The numbers in this table are the \(m_j\) on page 4 of the lecture notes.
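
If you want to verify one of these counts directly (a simple check, not in the lecture), you can count the observations in that covariate pattern yourself:

sum(LOWBWT$LWT == 110 & LOWBWT$RACE == "Other")   # should be 5, matching the table above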

We can count the number of different covariate patterns (that is, the number \(J\)) with a standard R trick using booleans (R converts TRUE values to 1s and FALSE values to 0s):

(numCovPatt = sum(table(LOWBWT$RACE ,LOWBWT$LWT) >= 1))
## [1] 109

This result is used on page 21 of the lecture pdf to compute the Pearson \(\chi^2\) statistic.
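
An equivalent way to obtain \(J\), without building the table first, is to count the distinct (LWT, RACE) combinations (just an alternative check):

nrow(unique(LOWBWT[ , c("LWT", "RACE")]))   # should also give 109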

Using epi.cp

Though the above manual solution works for our purposes, sometimes you may prefer a more complete and safer approach, using the function epi.cp provided by the epiR library (remember to install it if you haven’t already). This function returns a list with two components, where the first one (called cov.pattern) is a data frame with information about the covariate patterns. I’ll just show you the first and last rows of that data frame (check the help for the function if you want to know more details).

library(epiR)
## Loading required package: survival
## Package epiR 0.9-62 is loaded
## Type help(epi.about) for summary information
LOWBWT.cp = epi.cp(data.frame(LOWBWT$LWT, LOWBWT$RACE))
CovPatt = LOWBWT.cp$cov.pattern
names(CovPatt)[3:4] = c("LWT", "RACE")
head(CovPatt)
##   id n LWT  RACE
## 1  1 8 120 Other
## 2  2 8 130 White
## 3  3 2 187 Black
## 4  4 4 105 Other
## 5  5 2  85 Other
## 6  6 3 150 Other
tail(CovPatt)
##      id n LWT  RACE
## 169 104 1 170 Black
## 170 105 1 186 White
## 180 106 1 158 White
## 181 107 1 160 Other
## 183 108 1 129 White
## 188 109 1 116 White

The column labeled n contains the number that we denoted \(m_j\) for each covariate pattern. I’ll store them in a vector for later use.

mj = CovPatt$n
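
Since every observation belongs to exactly one covariate pattern, the \(m_j\) must add up to the total number of observations; a quick check (my addition):

sum(mj)   # should equal nrow(LOWBWT), that is 189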

The column id is an identifier for the covariate pattern corresponding to each observation, and we will return to it below. Again, you see that we have 109 covariate patterns. By the way, don’t be confused by the 188 appearing here: the component cov.pattern of the return value of epi.cp is a data frame, and the row names of that data frame are the rows of the original data frame (in this case LOWBWT) where a particular covariate pattern was first found. Thus, 188 means that the last (that is, the 109th) covariate pattern was first found in the 188th line of LOWBWT.
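
You can verify this interpretation of the row names directly (a check I added); row 188 of LOWBWT should show LWT 116 and RACE White, the covariate values of the last pattern listed above:

LOWBWT[188, c("LWT", "RACE")]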

The \(y_j\) values

The result provided by epi.cp has a second component (besides cov.pattern) called id, which is a vector with values from 1 to 109 that can be used to classify the observations into covariate patterns. If we add that vector to the data frame:

LOWBWT$cpId = LOWBWT.cp$id

we can then get the \(y_j\) numbers for each of the covariate patterns as follows:

with(LOWBWT, table(LOW, cpId))
##    cpId
## LOW 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
##   0 6 4 0 1 1 2 0 0 1  0  1  0  3  1  0  1  0  1  2  0  0  0  2  1  1  0
##   1 2 4 2 3 1 1 1 1 1  1  1  1  2  2  1  1  1  1  1  1  1  2  1  1  1  2
##    cpId
## LOW 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
##   0  1  0  2  0  0  0  0  0  0  0  0  0  6  0  1  0  2  0  0  1  1  1  1
##   1  1  1  3  1  1  1  1  2  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0
##    cpId
## LOW 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
##   0  1  2  2  1  3  1  3  1  1  1  1  1  1  2  3  2  1  3  1  1  1  1  1
##   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##    cpId
## LOW 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
##   0  1  1  2  1  5  1  3  1  1  1  1  1  1  4  1  1  1  3  1  1  1  1  1
##   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##    cpId
## LOW 96 97 98 99 100 101 102 103 104 105 106 107 108 109
##   0  2  1  1  2   1   1   1   1   1   1   1   1   1   1
##   1  0  0  0  0   0   0   0   0   0   0   0   0   0   0

The header for each column in this table is an identifier for a particular covariate pattern. For example, the 3 observations with identifier 43 are

LOWBWT[LOWBWT$cpId==43, ]
##     LOW LWT  RACE cpId
## 55    1  95 White   43
## 113   0  95 White   43
## 153   0  95 White   43

As the previous table shows, two of them have LOW==0 while the remaining one has LOW==1. Thus the \(y_j\) are the values that appear in the second row of the table.

CovPatt$yj = tapply(LOWBWT$LOW, LOWBWT$cpId, FUN = sum)
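
As a spot check (not in the lecture), the covariate pattern with identifier 43 that we just examined should have LWT 95, RACE White, \(m_j = 3\) and \(y_j = 1\):

CovPatt[CovPatt$id == 43, c("id", "n", "LWT", "RACE", "yj")]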

And we can check that their sum equals the number of subjects with \(y=1\) (recall \(y\) is LOW):

sum(CovPatt$yj)
## [1] 59
sum(LOWBWT$LOW)
## [1] 59

Logistic model and deciles of risk

The deciles of risk described on pages 6-8 of the lecture pdf are computed from the predicted probabilities of the logistic model. Thus, let us fit the model to our data:

glmLOWBWT = glm(LOW ~  LWT + RACE, family = binomial(link = "logit"), data = LOWBWT)
(summ1 = summary(glmLOWBWT))
## 
## Call:
## glm(formula = LOW ~ LWT + RACE, family = binomial(link = "logit"), 
##     data = LOWBWT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3491  -0.8919  -0.7196   1.2526   2.0993  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.805753   0.845167   0.953   0.3404  
## LWT         -0.015223   0.006439  -2.364   0.0181 *
## RACEBlack    1.081066   0.488052   2.215   0.0268 *
## RACEOther    0.480603   0.356674   1.347   0.1778  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 223.26  on 185  degrees of freedom
## AIC: 231.26
## 
## Number of Fisher Scoring iterations: 4

The predicted probabilities can be obtained with the fitted function (we have seen other ways in the R code for previous lectures). Let’s add them to the data frame and see what they look like:

LOWBWT$fit = fitted(glmLOWBWT)
head(LOWBWT$fit)
## [1] 0.3680909 0.2362681 0.2768981 0.4226098 0.4980983 0.2695103
tail(LOWBWT$fit)
## [1] 0.2362681 0.2648290 0.1440349 0.2648290 0.2768529 0.2560334

To define the deciles we are going to use the quantile function. This is always a delicate step, since there are many methods to estimate the quantiles of a data set. In fact, the quantile function has an argument called type which can be used to select between 9 different methods; see the help file for quantile. In order to reproduce the results of Stata I have used type = 6. A different method can lead to different values of the risk deciles (experiment with this yourself by changing the value of type).

(cutPoints = quantile(fitted(glmLOWBWT),probs = (0:10)/10, type = 6))
##        0%       10%       20%       30%       40%       50%       60% 
## 0.0588741 0.1680520 0.2228100 0.2531445 0.2707990 0.2955066 0.3334446 
##       70%       80%       90%      100% 
## 0.3680909 0.4078256 0.4769704 0.5974751
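
For comparison (my own experiment, not in the lecture), you can compute the cut points with the default method, type = 7, and see how much they differ:

quantile(fitted(glmLOWBWT), probs = (0:10)/10, type = 7)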

Now that the cut points for the deciles have been selected we use cut to define a factor riskDecile that classifies the fitted probabilities into decilic classes. Then we add the factor to the data frame. Here again we have to be careful: the cut function has two optional arguments that determine the resulting classification, include.lowest (whether the lowest break point is included in the first interval) and right (whether the intervals are closed on the right, as in (a, b], or on the left, as in [a, b)).

In the code below these arguments are set to reproduce the results obtained with Stata that appear in the lecture pdf. If you want to reproduce the Systat results change right=TRUE to right=FALSE.

LOWBWT$riskDecile = cut(fitted(glmLOWBWT), breaks = cutPoints, include.lowest = TRUE, right = TRUE)
table(LOWBWT$riskDecile)
## 
## [0.0589,0.168]  (0.168,0.223]  (0.223,0.253]  (0.253,0.271]  (0.271,0.296] 
##             19             21             17             19             19 
##  (0.296,0.333]  (0.333,0.368]  (0.368,0.408]  (0.408,0.477]  (0.477,0.597] 
##             19             23             15             20             17
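
If the behaviour of right and include.lowest is not completely clear, this toy example (with made-up values, unrelated to our data) may help:

cut(c(1, 2, 3), breaks = c(1, 2, 3), include.lowest = TRUE, right = TRUE)    # 2 falls in [1,2]
cut(c(1, 2, 3), breaks = c(1, 2, 3), include.lowest = TRUE, right = FALSE)   # 2 falls in [2,3]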

Next, the values of LOW for the observations in each decilic class are summed to obtain the observed counts of \(y=1\). The tapply function is the tool for this job, applying an operation (the sum in this case) across the levels of a factor.

(Obs1 = with(LOWBWT, tapply(LOW, INDEX = riskDecile, FUN = sum)) )
## [0.0589,0.168]  (0.168,0.223]  (0.223,0.253]  (0.253,0.271]  (0.271,0.296] 
##              2              4              5              4              8 
##  (0.296,0.333]  (0.333,0.368]  (0.368,0.408]  (0.408,0.477]  (0.477,0.597] 
##              6              6              3             12              9

The same operation, this time applied to the fitted probabilities, gives the expected values of \(y=1\).

(Exp1 = with(LOWBWT, tapply(fit, INDEX = riskDecile, FUN = sum)) )
## [0.0589,0.168]  (0.168,0.223]  (0.223,0.253]  (0.253,0.271]  (0.271,0.296] 
##       2.366518       4.248317       4.030204       5.009915       5.420323 
##  (0.296,0.333]  (0.333,0.368]  (0.368,0.408]  (0.408,0.477]  (0.477,0.597] 
##       6.105065       8.231358       5.846565       8.854765       8.886970

And since we know the number of observations in each decilic class, the observed and expected values for \(y = 0\) are obtained by subtracting:

(Obs0 = table(LOWBWT$riskDecile) - Obs1)
## 
## [0.0589,0.168]  (0.168,0.223]  (0.223,0.253]  (0.253,0.271]  (0.271,0.296] 
##             17             17             12             15             11 
##  (0.296,0.333]  (0.333,0.368]  (0.368,0.408]  (0.408,0.477]  (0.477,0.597] 
##             13             17             12              8              8
(Exp0 = table(LOWBWT$riskDecile) -Exp1)
## 
## [0.0589,0.168]  (0.168,0.223]  (0.223,0.253]  (0.253,0.271]  (0.271,0.296] 
##      16.633482      16.751683      12.969796      13.990085      13.579677 
##  (0.296,0.333]  (0.333,0.368]  (0.368,0.408]  (0.408,0.477]  (0.477,0.597] 
##      12.894935      14.768642       9.153435      11.145235       8.113030

We can put all of the above values in a table like the one on page 21 of the lecture pdf with these commands:

tablePg21 = data.frame(
  Prob = levels(LOWBWT$riskDecile),
  Obs1 = as.vector(Obs1), 
  Exp1 = as.vector(Exp1), 
  Obs0 = as.vector(Obs0), 
  Exp0 = as.vector(Exp0), 
  Total = as.vector(table(LOWBWT$riskDecile) )
  )
row.names(tablePg21) = NULL

I will use the xtable library to get the table in a fancier HTML format (you can use it to get LaTeX format as well).

library(xtable)
print(xtable(tablePg21, align = "ccccccc"), type="html", comment=FALSE)

And you can play with CSS to fiddle with the column width, etc. Here I will go with the defaults:

     Prob             Obs1   Exp1   Obs0   Exp0   Total
 1   [0.0589,0.168]      2   2.37     17  16.63      19
 2   (0.168,0.223]       4   4.25     17  16.75      21
 3   (0.223,0.253]       5   4.03     12  12.97      17
 4   (0.253,0.271]       4   5.01     15  13.99      19
 5   (0.271,0.296]       8   5.42     11  13.58      19
 6   (0.296,0.333]       6   6.11     13  12.89      19
 7   (0.333,0.368]       6   8.23     17  14.77      23
 8   (0.368,0.408]       3   5.85     12   9.15      15
 9   (0.408,0.477]      12   8.85      8  11.15      20
10   (0.477,0.597]       9   8.89      8   8.11      17

Plot of the observed vs expected pairs

We can use the above results to get the following plot. I have labeled the points with the number of the corresponding decilic class to make it easier to check which ones fall far from the diagonal line.

(maxX = max(c(range(Obs1), range(Exp1))))
## [1] 12
plot(c(0, maxX), c(0, maxX), asp=1, type="l", xlab="Observed y=1", ylab="Expected y=1")
points(Obs1, Exp1, asp=1)
text(Obs1, Exp1, 1:10, pos=4, col="firebrick")
abline(a = 0, b=1)

Pearson Chi-Square Statistic

To obtain the Pearson \(\chi^2\) statistic we begin by constructing the fitted values \(\hat{y}_j\) for each covariate pattern. In order to do this we can apply the predict function to the CovPatt data frame that we created before (R will choose the relevant variables by their names). As we have already seen in previous lectures, when we use the type = "response" option the predicted values are the probabilities \(\hat{\pi}_j\).

CovPatt$prob = predict(glmLOWBWT, newdata = CovPatt, type = "response")

The \(m_j\) are stored in the n variable of CovPatt, so the \(\hat{y}_j\) are obtained as follows:

CovPatt$hat_yj = with(CovPatt, n * prob) 

And now the Pearson residuals \(r_j\) are:

CovPatt$residual = with(CovPatt, (yj - hat_yj)/sqrt(hat_yj * (1 - prob)))
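
In the notation of the lecture (with \(\hat{y}_j = m_j\hat{\pi}_j\)), the residual computed above is

\[ r_j = \frac{y_j - m_j\hat{\pi}_j}{\sqrt{m_j\hat{\pi}_j(1-\hat{\pi}_j)}} = \frac{y_j - \hat{y}_j}{\sqrt{\hat{y}_j(1-\hat{\pi}_j)}}. \]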

Finally, the Pearson \(\chi^2\) statistic is:

(Pearson.statistic = sum((CovPatt$residual)^2))
## [1] 111.2166

You can check this result against the Stata output on page 21 of the lecture pdf. The degrees of freedom are:

(p = length(glmLOWBWT$coefficients) - 1) 
## [1] 3
(J = length(CovPatt$n))
## [1] 109
(Pearson.df = J - (p + 1))
## [1] 105

Here \(p\) is the number of covariates in the logistic model (and remember that each dummy variable is counted individually). The p-value is therefore (again, check with page 21 in the pdf):

(Pearson.pvalue = pchisq(Pearson.statistic, df = Pearson.df, lower.tail = FALSE))
## [1] 0.320391

However, as we have been warned in the lecture, it is not a good idea to use this test to decide about the goodness of fit of our model. So let’s turn to something better.

Hosmer-Lemeshow test

To define the Hosmer-Lemeshow statistic we will use the first grouping strategy described in the lecture pdf. This means that we can directly use the \(2\times 10\) table of observed and expected values for the decilic classes that we constructed before. The Hosmer-Lemeshow statistic is:

(HL.statistic = sum((Obs1 - Exp1)^2/Exp1) + sum((Obs0 - Exp0)^2/Exp0))
## [1] 7.606831
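
Written out (for the \(g = 10\) decilic classes used above), the sum computed here is

\[ \hat{C} = \sum_{k=1}^{g}\left[\frac{(o_{1k}-e_{1k})^2}{e_{1k}} + \frac{(o_{0k}-e_{0k})^2}{e_{0k}}\right], \]

where \(o_{1k}, e_{1k}\) and \(o_{0k}, e_{0k}\) are the observed and expected counts of \(y=1\) and \(y=0\) in class \(k\). Since \(o_{0k} = n_k - o_{1k}\) and \(e_{0k} = n_k - e_{1k}\), this is algebraically the same as the more usual form \(\sum_{k}(o_{1k} - n_k\bar{\pi}_k)^2 / \big(n_k\bar{\pi}_k(1-\bar{\pi}_k)\big)\).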

Check against the Stata result on page 21. The degrees of freedom are:

(HL.df = length(Obs1) - 2)
## [1] 8

And the p-value is

(HL.pValue =  pchisq(HL.statistic, df = HL.df, lower.tail = FALSE))
## [1] 0.4727863

Using the ResourceSelection library

If you google for “R Hosmer Lemeshow test” you will come across the ResourceSelection library. The library includes a hoslem.test function to perform the Hosmer - Lemeshow test. The result in our case is:

library(ResourceSelection)
## ResourceSelection 0.2-4   2014-05-19
(HL= hoslem.test(LOWBWT$LOW, fitted(glmLOWBWT)))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  LOWBWT$LOW, fitted(glmLOWBWT)
## X-squared = 9.4643, df = 8, p-value = 0.3047

And as you can see, the results are slightly different from what we got. The output of the function also includes a table of observed and expected values (I have rearranged the rows to make them look similar to the table in the pdf for this lecture):

t(cbind(HL$observed, HL$expected))[c(1, 3, 2, 4), ]
##       [0.0589,0.168] (0.168,0.223] (0.223,0.254] (0.254,0.271]
## y0         17.000000     17.000000     12.000000     15.000000
## yhat0      16.633482     16.751683     12.969796     13.990085
## y1          2.000000      4.000000      5.000000      4.000000
## yhat1       2.366518      4.248317      4.030204      5.009915
##       (0.271,0.296] (0.296,0.333] (0.333,0.368] (0.368,0.406]
## y0        11.000000     12.000000     18.000000     12.000000
## yhat0     13.579677     10.895268     16.768308      8.561261
## y1         8.000000      4.000000      8.000000      2.000000
## yhat1      5.420323      5.104732      9.231692      5.438739
##       (0.406,0.467] (0.467,0.597]
## y0          7.00000      9.000000
## yhat0      10.69135      9.159089
## y1         12.00000     10.000000
## yhat1       8.30865      9.840911

This makes me think that the difference lies in the method used to define the decilic classes, but I have not had the chance to find out which definition of decilic classes the hoslem.test function uses.

Update: I think that using this code to define the decilic risk classes will reproduce the results of the hoslem.test function.

g=10
(cutPoints = quantile(fitted(glmLOWBWT), probs=seq(0, 1, 1/g)))
LOWBWT$riskDecile = cut(fitted(glmLOWBWT), breaks = cutPoints, include.lowest = TRUE, right = TRUE)

ROC curve and discrimination

When dealing with the ROC curve in R the simplest way to proceed is to use the ROCR library:

library(ROCR)
## Warning: package 'ROCR' was built under R version 3.1.3
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess

Now, to use the library we begin by creating a prediction object, from the fitted values and the real values of the response variable LOW:

LOWBWT.pred = prediction(predictions = glmLOWBWT$fitted.values, labels = LOWBWT$LOW)

Using this prediction object we can ask ROCR to compute a series of performance measures for the logistic model classifier. In order to do that we use a function aptly called performance. It can be used, for example, to obtain vectors of values of sensitivity (the true positive rate) or specificity (the true negative rate) for different cutoff values.

sens = performance(LOWBWT.pred, measure = "sens")

Before going further, some remarks are in order. R, as an object-oriented language, has two main class systems, called S3 and S4. The result of calling performance is an S4 object, consisting of slots. We can access these slots using @ (instead of the familiar $). For example, to access the cutoff values we use:

cutoffs = sens@x.values[[1]]
head(cutoffs)
##                  76        43        35       133        90 
##       Inf 0.5974751 0.5716029 0.5528712 0.5453332 0.5415562
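
If you are curious about the other slots of this S4 object, you can list their names with slotNames (from the base methods package); this is just for exploration:

slotNames(sens)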

Similarly for the sensitivity values themselves:

sensitivity = sens@y.values[[1]]
head(sensitivity)
## [1] 0.00000000 0.00000000 0.01694915 0.03389831 0.03389831 0.03389831

And for the specificity (the cutoffs are the same):

spec = performance(LOWBWT.pred, measure = "spec")
specificity = spec@y.values[[1]]
head(specificity)
## [1] 1.0000000 0.9923077 0.9923077 0.9923077 0.9846154 0.9692308

I’m not going into many details here; you can check the help file for ROCR or the many blog posts dealing with this library. We are going to use it to obtain a joint plot of sensitivity and specificity versus probability cutoff. I have played with several graphics options to make the plot resemble the graph on page 26 of the lecture pdf.

plot(c(0, 1), c(0, 1), xlab="Probability cutoff", ylab="Sensitivity / Specificity", type="n", asp=1)
points(cutoffs, specificity, type="l", col="firebrick")
points(cutoffs, specificity, pch=".", cex=6, col="firebrick")
points(cutoffs, sensitivity, type="l", col="dodgerblue4")
points(cutoffs, sensitivity, pch=".", cex=6, col="dodgerblue4")
segments(x0 = max(cutoffs[cutoffs!=Inf]), y0 = 1, x1 = 1, y1 = 1, col="firebrick")
segments(x0 = max(cutoffs[cutoffs!=Inf]), y0 = 0, x1 = 1, y1 = 0, col="dodgerblue4")
points(1, 0, cex=6, pch=".", col="dodgerblue4")
points(1, 1, cex=6, pch=".", col="firebrick")
legend(x="right", legend=c("Specificity", "Sensitivity"),  
       col = c("red", "blue"), bty=1, lwd=3,cex=1.5)

Now, in order to plot the ROC curve we use performance in a slightly different way, with the two measures of performance (sensitivity and false positive rate) that appear along the axes of the ROC curve:

ROCcurve = performance( LOWBWT.pred, measure="sens", x.measure="fpr")

And now we can plot the curve:

plot(ROCcurve, lwd=3, lty=1, col="dodgerblue4", asp=1)
abline(a=0, b=1, lwd=3, lty="dashed")

And in order to compute the AUC (area under the curve) we only need to call performance again:

AUC = performance(LOWBWT.pred,"auc")
AUC@y.values
## [[1]]
## [1] 0.6473272

Note that the curve that appears on page 31 of the lecture pdf is not the ROC curve for the LOWBWT logistic model in this lecture.


Thanks for your attention!