A good reference on using R for logistic regression is http://www.ats.ucla.edu/stat/r/dae/logit.htm

Preliminaries

Loading and exploring the data

We begin by loading the data in the CHDAGE.txt file. In my case the file is located in a subfolder called data, inside the R work folder. You may need to learn how to set the working directory in R (in RStudio there’s a menu option for this under Session) and adapt the read.table command below to your setup.
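For instance, you can check the current working directory with getwd and change it with setwd (the path in the second line is a hypothetical placeholder; replace it with your own folder):

getwd()               # prints the current working directory
setwd("~/R-work")     # hypothetical path; adapt it to your machine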

CHDdata = read.table("./data/CHDAGE.txt", header = TRUE)

Now CHDdata is a data.frame, which is the basic R data structure for holding data tables. You can see the first few rows of the data table (I chose 10 rows) using the head function (there’s also a tail function to see the last rows):

head(CHDdata, 10)
##    ID AGE CHD
## 1   1  20   0
## 2   2  23   0
## 3   3  24   0
## 4   5  25   1
## 5   4  25   0
## 6   7  26   0
## 7   6  26   0
## 8   9  28   0
## 9   8  28   0
## 10 10  29   0

If you want a spreadsheet-like view of the data you can execute:

View(CHDdata)

The summary and str functions are helpful when starting to explore a new data set:

summary(CHDdata)
##        ID              AGE             CHD      
##  Min.   :  1.00   Min.   :20.00   Min.   :0.00  
##  1st Qu.: 25.75   1st Qu.:34.75   1st Qu.:0.00  
##  Median : 50.50   Median :44.00   Median :0.00  
##  Mean   : 50.50   Mean   :44.38   Mean   :0.43  
##  3rd Qu.: 75.25   3rd Qu.:55.00   3rd Qu.:1.00  
##  Max.   :100.00   Max.   :69.00   Max.   :1.00
str(CHDdata)
## 'data.frame':    100 obs. of  3 variables:
##  $ ID : int  1 2 3 5 4 7 6 9 8 10 ...
##  $ AGE: int  20 23 24 25 25 26 26 28 28 29 ...
##  $ CHD: int  0 0 0 1 0 0 0 0 0 0 ...

You can see from the output that the data set has three variables:

  • ID, an integer identifier for each of the 100 subjects.
  • AGE, the age of the subject in years.
  • CHD, an indicator of the presence (1) or absence (0) of coronary heart disease.

In R you use the $ syntax to access any component variable of a data.frame. For example, if we type

CHDdata$CHD
##   [1] 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0
##  [36] 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 0 1 0
##  [71] 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1

then, as you can see, we get a vector containing the values of the variable CHD (you can think of it as one of the columns of the data.frame). We will make extensive use of this $ syntax in what follows.
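As a quick example of what this syntax lets us do, the table function counts how many times each value appears. Here it shows that 57 subjects do not present CHD and 43 do (the same totals will reappear in the frequency tables below):

table(CHDdata$CHD)
## 
##  0  1 
## 57 43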

Plotting the data

Let us do a basic plot of the data. With plot we can get a scatter plot of the CHD variable (vertical axis, thus second argument of plot) vs the AGE variable (horizontal axis, thus first argument of plot). The xlab and ylab options allow us to set labels for the plot axes:

plot(CHDdata$AGE, CHDdata$CHD, xlab="Age", ylab="CHD")

Grouping the data into age classes

The range of the AGE variable is

range(CHDdata$AGE)
## [1] 20 69

The cut function is used in R to group a quantitative variable into ordered classes. In order to use it we need to define the breaks that separate the classes. In R you can get a basic sequence of integers using syntax such as -5:5 (that gives -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5 as output). But for more sophisticated sequences you’ll want to learn to use seq, which is far more powerful. The first (\(20-29\)) and last (\(60-69\)) classes are wider than the others, so I begin by creating a sequence of breaks from 30 to 60 with a step size of 5:

(AGEbreaks = seq(from = 30, to = 60, by = 5))
## [1] 30 35 40 45 50 55 60

The outer parentheses that I used here are just a way of asking R to print the result of an assignment (otherwise R simply assigns the result to a variable silently). Now I manually add 20 at the beginning and 70 at the end:

(AGEbreaks = c(20,AGEbreaks, 70))
## [1] 20 30 35 40 45 50 55 60 70
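By the way, seq is even more flexible than this: instead of the step size you can tell it how many equally spaced values you want, using the length.out argument (just an illustration; we don’t need it for our breaks):

seq(from = 20, to = 70, length.out = 6)
## [1] 20 30 40 50 60 70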

Now we are ready to use cut. The table in slide 6 of the course lecture notes implies that the age intervals are left-closed/right-open intervals, such as \([35, 40)\) (or \(35 \leq AGE < 40\) if you prefer). To get that behavior in R you need the right = FALSE option in cut (the default is right-closed intervals).

CHDdata$AgeGroup = cut(CHDdata$AGE, breaks = AGEbreaks, right = FALSE)

You will notice that I used CHDdata$AgeGroup, with a $ in it. This has the benefit of directly including the new AgeGroup variable in the data.frame. You can check this with:

head(CHDdata)
##   ID AGE CHD AgeGroup
## 1  1  20   0  [20,30)
## 2  2  23   0  [20,30)
## 3  3  24   0  [20,30)
## 4  5  25   1  [20,30)
## 5  4  25   0  [20,30)
## 6  7  26   0  [20,30)
tail(CHDdata)
##      ID AGE CHD AgeGroup
## 95   95  62   1  [60,70)
## 96   96  63   1  [60,70)
## 97   98  64   1  [60,70)
## 98   97  64   0  [60,70)
## 99   99  65   1  [60,70)
## 100 100  69   1  [60,70)
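You can also ask directly for the class labels that cut has created, using the levels function:

levels(CHDdata$AgeGroup)
## [1] "[20,30)" "[30,35)" "[35,40)" "[40,45)" "[45,50)" "[50,55)" "[55,60)"
## [8] "[60,70)"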

Note that the rows are sorted by age in this data set.

Frequency Table for Age Groups

The basic frequency table is:

(table1 = table(CHDdata$AgeGroup))
## 
## [20,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,70) 
##      10      15      12      15      13       8      17      10

You don’t really need the assignment part table1 = here. But I will be putting all the tables together below, and this assignment is an easy way to do that.

The table of CHD absence/presence by age group can be obtained with:

(table2 = table(CHDdata$CHD, CHDdata$AgeGroup))
##    
##     [20,30) [30,35) [35,40) [40,45) [45,50) [50,55) [55,60) [60,70)
##   0       9      13       9      10       7       3       4       2
##   1       1       2       3       5       6       5      13       8

and to get the mean of CHD (that is, the proportion of CHD cases) in each age group we use tapply:

(table3 = tapply(CHDdata$CHD, INDEX = CHDdata$AgeGroup, FUN = mean))
##   [20,30)   [30,35)   [35,40)   [40,45)   [45,50)   [50,55)   [55,60) 
## 0.1000000 0.1333333 0.2500000 0.3333333 0.4615385 0.6250000 0.7647059 
##   [60,70) 
## 0.8000000
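An equivalent way to get these proportions is prop.table, which divides each entry of table2 by its column total (margin = 2 means column-wise); the second row of the result reproduces table3:

prop.table(table2, margin = 2)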

Finally, if you want to see them all in a single table, like the one in the lectures, you can use cbind to bind the tables as columns:

table4 = cbind(as.vector(table1), t(table2), table3)
colnames(table4) = c("n", "CHD absent", "CHD present", "mean % present")
addmargins(table4, margin = 1)
##           n CHD absent CHD present mean % present
## [20,30)  10          9           1      0.1000000
## [30,35)  15         13           2      0.1333333
## [35,40)  12          9           3      0.2500000
## [40,45)  15         10           5      0.3333333
## [45,50)  13          7           6      0.4615385
## [50,55)   8          3           5      0.6250000
## [55,60)  17          4          13      0.7647059
## [60,70)  10          2           8      0.8000000
## Sum     100         57          43      3.4679110

By the way, the total for the last column is meaningless, but I don’t want to make the code more complicated just to get rid of it.

We plot the last column of this table vs the midpoints of the age classes. The midpoints are obtained as follows:

midpoints = AGEbreaks[-length(AGEbreaks)] + c(5, rep(2.5, 6), 5)
plot(midpoints, table4[, 4], col="red", pch=19, xlab="", ylab="")

Explanation:

  • In R, rows and columns of a matrix (or a table, or a data.frame) are selected with brackets. For example, A[2, ] selects the second row and A[ , 3] selects the third column of a tabular object called A.
  • We take all the breaks excluding the last one. length(AGEbreaks) gives the position of the last break. Thus AGEbreaks[length(AGEbreaks)] selects the last element, while AGEbreaks[-length(AGEbreaks)] (note the minus sign) selects all the elements but the last one; see the sketch after this list.
  • We add a vector with the semi-lengths of the intervals. The central ones are all equal to 2.5, so we use rep to repeat that value, and the first and last ones are equal to 5.
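Here is a minimal sketch of those two indexing operations:

AGEbreaks[length(AGEbreaks)]    # the last break
## [1] 70
AGEbreaks[-length(AGEbreaks)]   # all the breaks except the last one
## [1] 20 30 35 40 45 50 55 60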

Fitting a Logistic Model to the Data

The next step is fitting the logistic model, using the glm function. I have stored the resulting model object in the variable glmCHD, for easier access to the model components. The summary function provides an initial description of the model:

glmCHD = glm(CHD ~ AGE, family = binomial(link = "logit"), CHDdata)
(summGlmCHD = summary(glmCHD))
## 
## Call:
## glm(formula = CHD ~ AGE, family = binomial(link = "logit"), data = CHDdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9718  -0.8456  -0.4576   0.8253   2.2859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
## AGE          0.11092    0.02406   4.610 4.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 107.35  on 98  degrees of freedom
## AIC: 111.35
## 
## Number of Fisher Scoring iterations: 4

The log likelihood of this model is:

logLik(glmCHD)
## 'log Lik.' -53.67655 (df=2)

The model curve can be added to the scatter plot of the data with the lines function (this draws the fitted values in row order, which works nicely here because, as we noted, the rows are sorted by age):

plot(CHDdata$AGE, CHDdata$CHD, xlab="Age", ylab="CHD")
lines(CHDdata$AGE, glmCHD$fitted, col="red", lwd=3)
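If you want to evaluate the fitted curve at ages that do not appear in the data, you can use the predict function with type = "response" (the ages 30 and 50 below are just arbitrary examples):

# returns the estimated probabilities pi(30) and pi(50)
predict(glmCHD, newdata = data.frame(AGE = c(30, 50)), type = "response")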

The variance-covariance matrix of the logistic model can be obtained with:

(vcovCHD = vcov(glmCHD))
##             (Intercept)           AGE
## (Intercept)  1.28517059 -0.0266769747
## AGE         -0.02667697  0.0005788748
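As a sanity check, the standard errors reported in the Coefficients table of the summary are the square roots of the diagonal entries of this matrix:

sqrt(diag(vcovCHD))   # reproduces the Std. Error column of the summary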

Note that the order of the variables in the matrix rows and columns is reversed with respect to the Stata output in the course lectures. In the next lecture we will see how to obtain confidence intervals for this model.

Extra: direct computation of the likelihood function

This computation is not included in the Stata (or Systat) output shown in the lectures, but I find it instructive to obtain the likelihood from first principles. Recall that the likelihood function is defined as (slide 14 in the lecture notes): \[ l(\beta) = \prod_{i=1}^n \pi(x_i)^{y_i}\left(1 - \pi(x_i)\right)^{1- y_i}, \] where \[ \pi(x_i) = \dfrac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}. \]

We can define a function in R to compute the likelihood. We will rename the variables to X and Y to simplify the expression (and to make it easier to apply the likelihood function to future examples):

X = CHDdata$AGE
Y = CHDdata$CHD

likelihood = function(b){
  # product of pi(x_i) over the cases (Y == 1) times
  # the product of 1 - pi(x_i) over the non-cases (Y == 0)
  prod(exp(b[1] + b[2] * X[Y == 1]) / (1 + exp(b[1] + b[2] * X[Y == 1]))) *
    prod(1 / (1 + exp(b[1] + b[2] * X[Y == 0])))
}

The likelihood function depends on the choice of the coefficients for the logit. We can use persp to obtain a three dimensional plot of the likelihood function in a region close to the values returned by glm:

b0 = seq(-8, -2, length.out = 60)
b1 = seq(0.06, 0.18, length.out = 60)

likelihood_vect = Vectorize(function(b0, b1)likelihood(c(b0, b1)))

par(mar = c(0.6, 0.1,0.6,0.1))
zp = outer(X = b0, Y = b1, likelihood_vect)
persp(b0, b1, zp, theta=75, phi=20, col="lightblue", xlab="b0", ylab="b1", zlab="Likelihood")
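If you prefer a flat, two dimensional picture of the same surface, you can reuse the zp grid with the contour function (just a sketch; the red point marks the glm estimate):

contour(b0, b1, zp, nlevels = 20, xlab = "b0", ylab = "b1")
points(glmCHD$coefficients[1], glmCHD$coefficients[2], col = "red", pch = 19)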

These plots confirm that the maximum value of the likelihood corresponds to the logit coefficients returned by glm. To evaluate the likelihood at the coefficients obtained with glm we store these coefficients in a vector b and run the likelihood function on b:

(b = glmCHD$coefficients)
## (Intercept)         AGE 
##  -5.3094534   0.1109211
likelihood(b)
## [1] 4.881712e-24

Then the log likelihood is:

log(likelihood(b))
## [1] -53.67655

This is the same result that appears in the lecture notes (and that we obtained before with logLik).

R includes (in its base distribution) a function optim that can be used to optimize functions of several variables; that is, to find the values of the variables that minimize the function (if you want to find a maximum, minimize -f instead; that’s why we apply optim to -log(likelihood) below). The method used by glm is much more efficient, because it is designed for this specific kind of optimization problem. But I still find it somehow nice to be able to get that optimal value in a different way.

optim(par = c(0, 0), fn = function(x) - log(likelihood(x)))
## $par
## [1] -5.309517  0.110920
## 
## $value
## [1] 53.67655
## 
## $counts
## function gradient 
##       85       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

The first component (labeled $par, for parameters) contains the values that minimize -log(likelihood). As you can see, these are (up to some rounding) the same values that we found using glm.


Thanks for your attention!