A good reference on using R for logistic regression is http://www.ats.ucla.edu/stat/r/dae/logit.htm

Log Likelihood Ratio Test for the CHD Example

We are going to compute the log likelihood ratio for the logistic model vs the null (naïve) model.

We begin by loading the data in the CHDAGE.txt file. In my case the file is located in a subfolder called data, inside the R work folder. You may need to learn how to set the working directory in R (in RStudio there’s a menu option for this under Session) and adapt the following command to your setup.

CHDdata = read.table("./data/CHDAGE.txt", header = TRUE)

You can see the first few rows of the data table using the head function:

head(CHDdata)
##   ID AGE CHD
## 1  1  20   0
## 2  2  23   0
## 3  3  24   0
## 4  5  25   1
## 5  4  25   0
## 6  7  26   0

Let us do a basic plot of the data:

plot(CHDdata$AGE, CHDdata$CHD, xlab = "Age", ylab = "CHD")

The next step is fitting the logistic model, using the glm function. I have stored the resulting model object in the variable glmCHD, for easier access to the model components. The summary function provides an initial description of the model:

glmCHD = glm(CHD ~ AGE, family = binomial(link = "logit"), CHDdata)
(summGlmCHD = summary(glmCHD))
## 
## Call:
## glm(formula = CHD ~ AGE, family = binomial(link = "logit"), data = CHDdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9718  -0.8456  -0.4576   0.8253   2.2859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
## AGE          0.11092    0.02406   4.610 4.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 107.35  on 98  degrees of freedom
## AIC: 111.35
## 
## Number of Fisher Scoring iterations: 4

The outer parentheses are just a way of asking R to print the result of an assignment (otherwise R simply assigns the result to a variable silently). Next, in order to obtain the log likelihood ratio we are going to use two of the components of the model: the deviance of the null model and the deviance of the fitted model (the one including AGE as a predictor variable).

glmCHD$null.deviance - glmCHD$deviance
## [1] 29.30989
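
To turn this statistic into a test, we compare it with a chi-squared distribution whose degrees of freedom are the number of extra parameters in the fitted model (here 1, for AGE). A minimal sketch, using the degrees-of-freedom components stored in the model object:

# p-value of the likelihood ratio test: right tail of a chi-squared
# distribution with df.null - df.residual (= 1 here) degrees of freedom
G = glmCHD$null.deviance - glmCHD$deviance
pchisq(G, df = glmCHD$df.null - glmCHD$df.residual, lower.tail = FALSE)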

The model curve can be added to the plot with the lines function (this draws a proper curve here because, as the head output above suggests, the data file is already sorted by AGE; with unsorted data you would order the points first):

lines(CHDdata$AGE, glmCHD$fitted, type="l", col="red", lwd=3)

In the sequel we will need the coefficients \(b_0, b_1\) of the fitted logistic curve. We store them in two conveniently named variables as follows:

(b0 = glmCHD$coefficients[1])
## (Intercept) 
##   -5.309453
(b1 = glmCHD$coefficients[2])
##       AGE 
## 0.1109211

Finding a confidence interval for \(\beta\) and \(\pi\)

Wald statistic.

The numerator and denominator of the Wald test statistic \[ \dfrac{\hat\beta_1}{SE(\hat\beta_1)} \] can be obtained from the model summary. This model summary has a component called coefficients which is just a (named) R matrix:

summGlmCHD$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -5.3094534 1.13365365 -4.683488 2.820338e-06
## AGE          0.1109211 0.02405982  4.610224 4.022356e-06

Thus we get the numerator of the Wald statistic W as:

(numW = summGlmCHD$coefficients[2, 1])
## [1] 0.1109211

The bracket notation is standard R code for accessing elements of a matrix. Here [2, 1] means second row, first column. The denominator is similar:

(denomW = summGlmCHD$coefficients[2, 2])
## [1] 0.02405982

And the Wald statistic is

(W = numW / denomW)
## [1] 4.610224

(or you can get it as the [2, 3] element in the matrix). The corresponding p-value is provided in the summary as

(pValueW = summGlmCHD$coefficients[2, 4])
## [1] 4.022356e-06

Alternatively you can obtain it using the normal distribution directly:

2 * pnorm(W, lower.tail = FALSE)
## [1] 4.022356e-06

We multiplied by 2 because this is a two-tailed test for the null hypothesis \(H_0=\{\beta_1 = 0\}\).
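
Note that pnorm(W, lower.tail = FALSE) is the right tail, which works here because W is positive; if the estimated coefficient (and hence W) were negative, you would take the absolute value first. A form that covers both cases:

# two-tailed p-value, valid whatever the sign of W
2 * pnorm(abs(W), lower.tail = FALSE)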

Variance-covariance matrix.

The variance-covariance matrix of the logistic model can be obtained with:

(vcovCHD = vcov(glmCHD))
##             (Intercept)           AGE
## (Intercept)  1.28517059 -0.0266769747
## AGE         -0.02667697  0.0005788748

Note that the order of the variables in the matrix rows and columns is reversed with respect to the Stata output in the course lectures.

We can check that the square roots of the diagonal elements of the variance-covariance matrix are the standard errors that appear in the coefficients matrix of the model summary:

sqrt(diag(vcovCHD))
## (Intercept)         AGE 
##  1.13365365  0.02405982

The same numbers appeared as standard errors in the second column of the coefficient matrix for the model summary (use [ , 2] to select the second column of a matrix in R):

summGlmCHD$coefficients[ , 2]
## (Intercept)         AGE 
##  1.13365365  0.02405982

Confidence intervals

Confidence intervals for the \(\beta\)

These are provided by the confint.default function:

confint.default(glmCHD)
##                   2.5 %     97.5 %
## (Intercept) -7.53137370 -3.0875331
## AGE          0.06376477  0.1580775
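
These are Wald-type intervals, of the form estimate \(\pm\, z_{1-\alpha/2}\) standard errors. Base R also offers confint (without the .default), which for glm objects computes profile likelihood intervals instead; the results are usually close but not identical:

# profile likelihood intervals (R prints "Waiting for profiling to be done...")
confint(glmCHD)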

Confidence interval for the logit and probability with \(x=60\)

We begin by predicting the value of the logit for \(x=60\). We use the predict function for that. The value to be predicted is provided as a data.frame via the newdata argument of predict. I’ll explain the type = "link" argument below.

(logitX = predict(glmCHD, newdata = data.frame(AGE=60), type = "link"))
##        1 
## 1.345815

(suffice it to say for now that if you omit it, you will get the same result). Alternatively, we can also do the computation directly:

x = 60
glmCHD$coefficients[1] + glmCHD$coefficients[2] * x
## (Intercept) 
##    1.345815

Don’t worry about the (Intercept) label that appears here. R is just keeping it because we started from a named vector.
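
If the label bothers you, unname strips the names from a vector:

# same number, without the (Intercept) label
unname(b0 + b1 * x)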

Now we can get the predicted probability directly using the exponential:

exp(logitX) / (1 + exp(logitX))
##         1 
## 0.7934446

Alternatively, we can change the type argument of predict to response

(probX = predict(glmCHD, newdata = data.frame(AGE=60), type = "response"))
##         1 
## 0.7934446

As you see, the type argument controls whether we get the answer on the logit scale (type = "link", which is the default and can therefore be omitted) or on the probability scale (type = "response").

Now let’s move to the confidence intervals for the probability.

If you want to reproduce the direct computation of the variance of the logit when \(x= 60\) (see slide 11 in the pdf for the lecture notes), that is \[ \widehat{Var}(\hat\beta_0) + x^2\,\widehat{Var}(\hat\beta_1) + 2x\,\widehat{Cov}(\hat\beta_0, \hat\beta_1), \] then you can proceed as follows:

vcovCHD[1, 1] + x^2 * vcovCHD[2, 2] + 2 * x * vcovCHD[1, 2]
## [1] 0.1678829

Note that there is a small difference with the result 0.16784 that appears in the slides (presumably just rounding of the intermediate values). Besides, if you are like me and you like matrix algebra :) then the same result can be obtained as follows:

t(c(1, x)) %*% vcovCHD %*% c(1, x)
##           [,1]
## [1,] 0.1678829

t is for transpose, and %*% is for matrix product.

We can, however, do the same in a more natural way, using the predict function again with the argument se.fit = TRUE (and type = "link", to keep the answer on the logit scale).

(predX = predict(glmCHD, newdata = data.frame(AGE=60), type = "link", se.fit = TRUE))
## $fit
##        1 
## 1.345815 
## 
## $se.fit
## [1] 0.4097351
## 
## $residual.scale
## [1] 1

As you can see this tells R to include the standard error of the predicted value in the answer. The standard error can be accessed as:

predX$se.fit
## [1] 0.4097351

To recover the value of the variance for the logit (at \(x=60\)) simply square this standard error

predX$se.fit^2
## [1] 0.1678829

Thus we are ready to compute the confidence interval for the logit when \(x=60\).

(intervalLogitX = logitX + c(-1, 1) * predX$se.fit * qnorm(0.025, lower.tail = F))
## [1] 0.542749 2.148881

Some explanations about this formula:

  • qnorm provides the quantiles of the normal distribution (the \(z_{1 - \alpha/2}\approx 1.96\) in the lecture slides).

  • The part with c(-1, 1) is just a trick to use the vector arithmetic of R to get both endpoints of the confidence interval at the same time. The -1 gives the lower endpoint, the 1 gives the upper one.

The confidence interval for the probability can be obtained directly as follows:

exp(intervalLogitX) / (1 + exp(intervalLogitX) )
## [1] 0.6324517 0.8955642

But the function \[f(u) = \dfrac{e^u}{ 1 + e^u}\] can be easily obtained in R with plogis, and so we get the same result if we do:

plogis(intervalLogitX)
## [1] 0.6324517 0.8955642
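
plogis is the distribution function of the standard logistic distribution, and its inverse qlogis is precisely the logit function, so you can move back and forth between the two scales:

# back to the logit scale; this should equal logitX
qlogis(probX)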

The multiple logistic regression model

Fitting the multiple logistic model: low birth weight study I

Loading and exploring the data

For this part of the lecture we will be working with a new data set and a new model, so we begin by removing all the previous results from R memory. We can do this with this command:

rm(list = ls())

Now we are ready to read the data

LOWBWTdata = read.table("./data/LOWBWT.txt", header = TRUE)

head(LOWBWTdata)
##   ID LOW AGE LWT RACE SMOKE PTL HT UI FTV  BWT
## 1  4   1  28 120    3     1   1  0  1   0  709
## 2 10   1  29 130    1     0   0  0  1   2 1021
## 3 11   1  34 187    2     1   0  1  0   0 1135
## 4 13   1  25 105    3     0   1  1  0   0 1330
## 5 15   1  25  85    3     0   0  0  1   0 1474
## 6 16   1  27 150    3     0   0  0  0   0 1588

Now, in this case we have to be careful because the RACE categorical variable has been coded with integer numbers. You can ask R about the type of variable as follows:

typeof(LOWBWTdata$RACE)
## [1] "integer"

But it is much better to consider RACE as a factor. Fortunately, it is very easy to convert the variable into a factor:

LOWBWTdata$RACE = as.factor(LOWBWTdata$RACE)

The main advantage for us is that if R considers RACE to be a factor then it will automatically include the appropriate dummy variables in the logistic regression model (however, as usual in R, if you keep RACE as an integer you can code the dummy variables by hand if you want). Another way to achieve the same result is to use the colClasses argument of read.table to set the class of the variables while reading them from the file, as in the sketch below.
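
A minimal sketch of that second option (assuming the same file and layout as above; colClasses accepts a named vector, so the remaining columns keep their default classes):

# declare RACE as a factor while reading the file
LOWBWTdata = read.table("./data/LOWBWT.txt", header = TRUE,
                        colClasses = c(RACE = "factor"))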

To find the number of women for each value of low do:

table(LOWBWTdata$LOW)
## 
##   0   1 
## 130  59

By the way, if you want to get the summary tables that appear in the lecture notes you may use this code (I’m not describing the details):

For LOW = 0

t(apply(LOWBWTdata[LOWBWTdata$LOW==0 , c(3, 4, 10)], MARGIN = 2, FUN = function(x)c(LENGTH = length(x), MEAN = mean(x), SD = sd(x), MIN = min(x), MAX= max(x))))
##     LENGTH        MEAN        SD MIN MAX
## AGE    130  23.6615385  5.584522  14  45
## LWT    130 133.3000000 31.724016  85 250
## FTV    130   0.8384615  1.069694   0   6

For LOW = 1

t(apply(LOWBWTdata[LOWBWTdata$LOW==1 , c(3, 4, 10)], MARGIN = 2, FUN = function(x)c(LENGTH = length(x), MEAN = mean(x), SD = sd(x), MIN = min(x), MAX= max(x))))
##     LENGTH        MEAN        SD MIN MAX
## AGE     59  22.3050847  4.511496  14  34
## LWT     59 122.1355932 26.559275  80 200
## FTV     59   0.6949153  1.038139   0   4

It’s not the most elegant or efficient way to do this, but it is straightforward and it works in this case.
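
If you only need one statistic at a time, a more idiomatic alternative is aggregate; for example, the group means of the same three variables (a sketch, not a replica of the full table above):

aggregate(cbind(AGE, LWT, FTV) ~ LOW, data = LOWBWTdata, FUN = mean)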

A very basic table of LOW by RACE can be obtained simply with:

table(LOWBWTdata$LOW, LOWBWTdata$RACE)
##    
##      1  2  3
##   0 73 15 42
##   1 23 11 25

But if you want a fancier table you can install the gmodels library with

install.packages("gmodels")

In a normal R setup this will download and automatically install the library from the default R repository for your system.

Then we load the library and use the CrossTable function

library(gmodels)
CrossTable(LOWBWTdata$LOW, LOWBWTdata$RACE, format = "SPSS", prop.chisq = F)
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |-------------------------|
## 
## Total Observations in Table:  189 
## 
##                | LOWBWTdata$RACE 
## LOWBWTdata$LOW |        1  |        2  |        3  | Row Total | 
## ---------------|-----------|-----------|-----------|-----------|
##              0 |       73  |       15  |       42  |      130  | 
##                |   56.154% |   11.538% |   32.308% |   68.783% | 
##                |   76.042% |   57.692% |   62.687% |           | 
##                |   38.624% |    7.937% |   22.222% |           | 
## ---------------|-----------|-----------|-----------|-----------|
##              1 |       23  |       11  |       25  |       59  | 
##                |   38.983% |   18.644% |   42.373% |   31.217% | 
##                |   23.958% |   42.308% |   37.313% |           | 
##                |   12.169% |    5.820% |   13.228% |           | 
## ---------------|-----------|-----------|-----------|-----------|
##   Column Total |       96  |       26  |       67  |      189  | 
##                |   50.794% |   13.757% |   35.450% |           | 
## ---------------|-----------|-----------|-----------|-----------|
## 
## 

But if you prefer a solution within base R, you can try something like this:

addmargins(100 * prop.table(table(LOWBWTdata$LOW, as.numeric(LOWBWTdata$RACE)), margin = 1))
##      
##               1         2         3       Sum
##   0    56.15385  11.53846  32.30769 100.00000
##   1    38.98305  18.64407  42.37288 100.00000
##   Sum  95.13690  30.18253  74.68057 200.00000

Set margin = 2 in the above command to get the column percents (in general, in R margin 1 refers to rows, margin 2 refers to columns).

Fitting the model with glm

The syntax for fitting the model is similar to the case where we had only one predictor variable:

glmLOWBWT = glm(LOW ~ AGE + LWT + RACE + FTV, family = binomial(link = "logit"), data = LOWBWTdata)
(summGlmLOWBWT = summary(glmLOWBWT))
## 
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + FTV, family = binomial(link = "logit"), 
##     data = LOWBWTdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4163  -0.8931  -0.7113   1.2454   2.0755  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.295366   1.071443   1.209   0.2267  
## AGE         -0.023823   0.033730  -0.706   0.4800  
## LWT         -0.014245   0.006541  -2.178   0.0294 *
## RACE2        1.003898   0.497859   2.016   0.0438 *
## RACE3        0.433108   0.362240   1.196   0.2318  
## FTV         -0.049308   0.167239  -0.295   0.7681  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 222.57  on 183  degrees of freedom
## AIC: 234.57
## 
## Number of Fisher Scoring iterations: 4

Note the syntax for the model formula, where the explanatory variables included in the model appear as a sum, with one term for each variable. The summary output for the model shows the deviance for the fitted model and for the null (naïve or constant) model. The deviance is computed up to a constant, assuming that the deviance of a saturated model is 0. If we want to get the log likelihood of the fitted model we can use the logLik function:

logLik(glmLOWBWT)
## 'log Lik.' -111.2865 (df=6)
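
Since the deviances are computed taking the saturated model’s deviance to be 0, the residual deviance should be exactly \(-2\) times this log likelihood, which is easy to check:

# should match the residual deviance (222.57) in the summary above
-2 * as.numeric(logLik(glmLOWBWT))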

We can compute the likelihood ratio as the difference of the deviances:

(LikRatio = summGlmLOWBWT$null.deviance - summGlmLOWBWT$deviance)
## [1] 12.09909

To compute the p-value we first obtain the degrees of freedom:

(df = summGlmLOWBWT$df.null - summGlmLOWBWT$df.residual)
## [1] 5

and now use the distribution function of a chi-squared variable with those degrees of freedom to get the p-value of a one-tailed test (right tail).

(pValue = pchisq(LikRatio, lower.tail = FALSE, df))
## [1] 0.03345496

The confidence intervals for the coefficients of the model can be obtained with:

confint.default(glmLOWBWT)
##                   2.5 %       97.5 %
## (Intercept) -0.80462420  3.395355695
## AGE         -0.08993181  0.042285866
## LWT         -0.02706418 -0.001425038
## RACE2        0.02811220  1.979683451
## RACE3       -0.27686939  1.143086244
## FTV         -0.37709090  0.278474261
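
Since each coefficient is a log odds ratio, exponentiating the endpoints gives confidence intervals for the odds ratios themselves:

exp(confint.default(glmLOWBWT))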

The reduced model

The reduced model, without the AGE and FTV terms, is obtained with:

glmLOWBWTred = glm(LOW ~ LWT + RACE, family = binomial(link = "logit"), data = LOWBWTdata)
(summGlmLOWBWTred = summary(glmLOWBWTred))
## 
## Call:
## glm(formula = LOW ~ LWT + RACE, family = binomial(link = "logit"), 
##     data = LOWBWTdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3491  -0.8919  -0.7196   1.2526   2.0993  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.805753   0.845167   0.953   0.3404  
## LWT         -0.015223   0.006439  -2.364   0.0181 *
## RACE2        1.081066   0.488052   2.215   0.0268 *
## RACE3        0.480603   0.356674   1.347   0.1778  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 223.26  on 185  degrees of freedom
## AIC: 231.26
## 
## Number of Fisher Scoring iterations: 4

And you can compare them as follows:

(LikRatio = glmLOWBWTred$deviance - glmLOWBWT$deviance)
## [1] 0.6861841
(df = summGlmLOWBWTred$df.residual - summGlmLOWBWT$df.residual)
## [1] 2
(pValue = pchisq(LikRatio, lower.tail = FALSE, df))
## [1] 0.7095729
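
The same comparison of nested models (difference of deviances, degrees of freedom and chi-squared p-value) is packaged in base R’s anova function, so a one-line equivalent of the computation above is:

anova(glmLOWBWTred, glmLOWBWT, test = "Chisq")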