Applied Logistic Regression

Homework - Week 1

August/2016

Codes for knitr and ztable:

library(knitr)
knitr::opts_chunk$set(fig.align='center',
                      fig.width = 5,
                      fig.height = 3,
                      cache=TRUE, 
                      echo=T,
                      warning =F,
                      message= F,
                      results = 'asis')
library(ztable)
options(ztable.type="html")

Exercise One

For this exercise, we need the Myopia Study dataset. The data is available by Professor Lemeshow’s course on Canvas Network.

filename = "MYOPIA.csv"
if (!file.exists(filename)) {
        fileURL = "https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("MYOPIA.csv", header = T, sep = ",")

The six first observations:

ztable(head(data))
  ID STUDYYEAR MYOPIC AGE GENDER SPHEQ AL ACD LT VCD SPORTHR READHR COMPHR STUDYHR TVHR DIOPTERHR MOMMY DADMY
1 1 1992 1 6 1 -0.05 21.89 3.69 3.50 14.70 45 8 0 0 10 34 1 1
2 2 1995 0 6 1 0.61 22.38 3.70 3.39 15.29 4 0 1 1 7 12 1 1
3 3 1991 0 6 1 1.18 22.49 3.46 3.51 15.52 14 0 2 0 10 14 0 0
4 4 1990 1 6 1 0.52 22.20 3.86 3.61 14.73 18 11 0 0 4 37 0 1
5 5 1995 0 5 0 0.70 23.29 3.68 3.45 16.16 14 0 0 0 4 4 1 0
6 6 1995 0 6 0 1.74 22.14 3.22 3.56 15.36 10 6 2 1 19 44 0 1

Last six observations of data:

ztable(tail(data))
  ID STUDYYEAR MYOPIC AGE GENDER SPHEQ AL ACD LT VCD SPORTHR READHR COMPHR STUDYHR TVHR DIOPTERHR MOMMY DADMY
613 613 1990 0 9 0 -0.03 22.96 3.76 3.34 15.86 9 9 5 0 16 53 1 0
614 614 1995 1 6 0 0.68 22.40 3.66 3.80 14.93 2 0 7 3 14 37 1 0
615 615 1993 0 6 1 0.67 22.50 3.57 3.38 15.56 6 0 1 0 8 10 1 1
616 616 1995 0 6 0 1.83 22.94 3.62 3.42 15.89 8 0 0 0 4 4 1 1
617 617 1991 0 6 1 0.67 21.92 3.69 3.60 14.64 12 2 1 0 15 23 0 0
618 618 1994 0 6 0 0.80 22.26 3.53 3.48 15.25 25 0 2 0 10 14 1 1

One variable that is clearly important is the initial value of spherical equivalent refraction (SPHEQ).

Complete the following:

A. Write down the equation for the logistic regression model of MYOPIA on SPHEQ. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, MYOPIA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between MYOPIA and SPHEQ?

B. Form a scatter plot of MYOPIA vs. SPHEQ.

C. Write down an expression for the likelihood and log likelihood for the logistic regression model in part (A) using the ungrouped, n=618, data. Obtain expressions for the two likelihood equations.

D. Obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=618, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatter plots in parts (B) and (C).

Solving

A. Equation for the logistic regression model of MYOPIA on SPHEQ:

\[\pi(SPHEQ) = \dfrac{e^{\beta_0 + \beta_{1}\times SPHEQ}}{1+e^{\beta_0 + \beta_{1}\times SPHEQ}},\] where \(\pi(SPHEQ)\) is the conditional mean of MYOPIC given SPHEQ (\(E[MYOPIC ~|~ SPHEQ]\)) and can be estimate as below1:

library(dplyr)
estimate = data %>%
        select(MYOPIC, SPHEQ) %>%
        group_by(SPHEQ, MYOPIC) %>%
        summarise( m = mean(MYOPIC))

Head and tail of estimate for MYOPIC given SPHEQ:

pander::pander(cbind(head(estimate), tail(estimate)))
SPHEQ MYOPIC m SPHEQ MYOPIC m
-0.699 1 1 3.422 0 0
-0.600 1 1 3.491 0 0
-0.571 1 1 3.615 0 0
-0.524 1 1 3.731 0 0
-0.502 1 1 4.228 0 0
-0.491 1 1 4.372 0 0

Equation for the logit transformation of this logistic regression model: \[\log \left(\dfrac{\pi(SPHEQ)}{1-\pi(SPHEQ)}\right) = \beta_0 + \beta_{1}\times SPHEQ,\] where log is the logarithm neperian function.

Myopic is a binary outcome variable, that is, the conditional mean of MYOPIC in regression equation is bounded between zero and one. So it’s appropriate for logistic regression.

B. We can form a scatter plot for MYOPIA vs. SPHEQ with the values computed in (A) as estimate for \(E[MYOPIC ~|~ SPHEQ]\):

library(ggplot2)
ggplot(estimate, aes(SPHEQ, m)) +
        geom_point() +
        xlab("SPHEQ") + 
        ggtitle("SPHEQ (X) x MYOPIC (Y)") +
        ylab("Estimate for E[MYOPIC  | SPHEQ]") 

C. Likelihood function for pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\): \[ {\cal l}(\beta) = \prod_{i=1}^{618} \left\{ \pi(SPHEQ_{i})^{MYOPIC_{i}}\times [1-\pi(SPHEQ_{i})]^{1- MYOPIC_{i}} \right\}.\]

Log-likelihood function for pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\):

\[ {\cal L}(\beta) = log({\cal l}(\beta)) = \sum_{i=1}^{618} \left\{ MYOPIC_{i}\times log(\pi(SPHEQ_{i})) + [1- MYOPIC_{i}] \times log(1-\pi(SPHEQ_{i})) \right\}.\]

Likelihood equations for pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\):

\[\sum_{i=1}^{618} \left\{ MYOPIC_{i} - \pi(SPHEQ_{i}) \right\} = 0\] and \[\sum_{i=1}^{618} \left\{SPHEQ_{i} \times [ MYOPIC_{i} - \pi(SPHEQ_{i})] \right\} = 0.\]

D. In R2 we can obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A) with code below:

mod = glm(MYOPIC ~ SPHEQ,
                  data = data,
                  family = binomial(link = "logit"))
ztable(mod)
  Estimate Std. Error z value Pr(>|z|) OR lcl ucl
(Intercept) 0.0540 0.2067 0.26 0.7940 1.06 0.71 1.59
SPHEQ -3.8331 0.4184 -9.16 0.0000 0.02 0.01 0.05
Call: glm(formula = MYOPIC ~ SPHEQ, family = binomial(link = “logit”), data = data)

So \(\hat{\beta_{0}}= 0.05\) and \(\hat{\beta_{1}}= -3.83\) are maximum likelihood estimates of the parameters of the logistic regression model in part (A).

Another way to obtain these predictors in R2:

library(bbmle)
(bino = mle2(MYOPIC~dbinom(prob=(exp(beta0+beta1*SPHEQ)/(1+exp(beta0+beta1*SPHEQ))), size = 1), 
            start = list(beta0=coef(mod)[1], beta1=coef(mod)[2]),
            data = data))
## 
## Call:
## mle2(minuslogl = MYOPIC ~ dbinom(prob = (exp(beta0 + beta1 * 
##     SPHEQ)/(1 + exp(beta0 + beta1 * SPHEQ))), size = 1), start = list(beta0 = coef(mod)[1], 
##     beta1 = coef(mod)[2]), data = data)
## 
## Coefficients:
##       beta0       beta1 
##  0.05397313 -3.83309759 
## 
## Log-likelihood: -168.67

Likelihood profiles:

library(sads)
plot(profile(bino))

Equation for the estimated logistic probabilities:

\[\hat{\pi}(SPHEQ) = \dfrac{e^{0.05 + (-3.83)\times SPHEQ}}{1+e^{0.05 + (-3.83)\times SPHEQ}}.\]

Finally, the plot of equation for the fitted values on the axes used in the scatter plots in parts (B) and (C):

ggplot(data, aes(SPHEQ, MYOPIC)) +
        geom_point() + 
        geom_smooth(method = "glm", 
                    method.args = list(family = "binomial"), se = F) +
        ggtitle("SPHEQ (X) x MYOPIC (Y)") 

Exercise Two

For this exercise, we need the ICU dataset. The data is available by Professor Lemeshow’s course on Canvas Network.

filename = "icu.csv"
if (!file.exists(filename)) {
        fileURL = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("icu.csv", header = T, sep = ",")

The six first observations:

ztable(head(data))
  ID STA AGE SEX RACE SER CAN CRN INF CPR SYS HRA PRE TYP FRA PO2 PH PCO BIC CRE LOC
1 552 0 16 0 1 1 0 0 0 0 100 140 0 1 1 0 0 0 0 0 0
2 102 0 16 1 1 0 0 0 0 0 104 111 0 1 0 0 0 0 0 0 0
3 837 0 17 1 3 0 0 0 0 0 130 140 0 1 0 0 0 0 0 0 0
4 863 0 17 0 3 1 0 0 0 0 136 78 0 1 0 0 0 0 0 0 0
5 829 0 17 0 1 1 0 0 0 0 140 78 0 1 1 0 0 0 0 0 0
6 379 0 18 0 1 1 0 0 0 0 112 76 0 1 1 0 0 0 0 0 0

Last six observations of data:

ztable(tail(data))
  ID STA AGE SEX RACE SER CAN CRN INF CPR SYS HRA PRE TYP FRA PO2 PH PCO BIC CRE LOC
195 222 1 88 0 1 0 0 0 1 1 141 140 0 1 0 0 0 0 0 0 0
196 877 0 88 1 1 0 0 1 0 0 190 88 0 1 0 0 0 0 0 0 0
197 881 0 89 1 1 1 0 0 0 0 190 114 0 1 0 0 0 1 0 0 2
198 204 1 91 0 1 0 0 0 1 0 64 125 0 1 0 0 0 1 0 0 0
199 674 0 91 0 1 0 0 1 1 0 152 125 0 1 0 0 0 0 0 0 0
200 165 1 92 0 1 0 0 0 1 0 124 80 0 1 0 0 0 0 1 0 0

The ICU dataset consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. A number of publications have appeared which have focused on various facets of this problem.

Complete the following:

A. Write down the equation for the logistic regression model of STA on AGE. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, STA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between STA and AGE?

B. Form a scatter plot of STA versus AGE.

C. Write down an expression for the likelihood and log likelihood for the logistic regression model in part (A) using the ungrouped, n=200, data. Obtain expressions for the two likelihood equations.

D. Obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=200, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatter plots in part (B).

E. Summarize (describe in words) the results presented in the plot obtained from parts (B) and (D).

Solving

A. Equation for the logistic regression model of STA on AGE:

\[\pi(AGE) = \dfrac{e^{\beta_0 + \beta_{1}\times AGE}}{1+e^{\beta_0 + \beta_{1}\times AGE}},\] where \(\pi(STA)\) is the conditional mean of STA given AGE (\(E[STA ~|~ AGE]\)) and can be estimate as below1:

est = data %>%
        select(AGE, STA) %>%
        group_by(AGE, STA) %>%
        summarise( m = mean(STA))

Head and tail of estimate for STA given AGE:

pander::pander(cbind(head(est), tail(est)))
AGE STA m AGE STA m
16 0 0 88 0 0
17 0 0 88 1 1
18 0 0 89 0 0
19 0 0 91 0 0
19 1 1 91 1 1
20 0 0 92 1 1

Equation for the logit transformation of this logistic regression model: \[\log \left(\dfrac{\pi(AGE)}{1-\pi(AGE)}\right) = \beta_0 + \beta_{1}\times AGE,\] where log is the logarithm neperian function.

AGE is a binary outcome variable, that is, the conditional mean of AGE in regression equation is bounded between zero and one. So it’s appropriate for logistic regression.

B. We can form a scatter plot for AGE vs. STA with the values computed in (A) as estimate for \(E[STA ~|~ AGE]\):

ggplot(est, aes(AGE, m)) +
        geom_point() +
        xlab("AGE") + 
        ggtitle("AGE (X) x STA (Y)") +
        ylab("Estimate for E[STA  | AGE]")

C. Likelihood function for pair \((AGE_{i}, STA_{i}), \quad 1 \leq i \leq 200\): \[ {\cal l}(\beta) = \prod_{i=1}^{200} \left\{ \pi(AGE_{i})^{STA_{i}}\times [1-\pi(AGE_{i})]^{1- STA_{i}} \right\}.\]

Log-likelihood function for pair \((AGE_{i}, STA_{i}), \quad 1 \leq i \leq 200\):

\[ {\cal L}(\beta) = log({\cal l}(\beta)) = \sum_{i=1}^{200} \left\{ STA_{i}\times log(\pi(AGE_{i})) + [1- STA_{i}] \times log(1-\pi(AGE_{i})) \right\}.\]

Likelihood equations for pair \((AGE_{i},STA_{i}), \quad 1 \leq i \leq 200\):

\[\sum_{i=1}^{200} \left\{ STA_{i} - \pi(AGE_{i}) \right\} = 0\] and \[\sum_{i=1}^{200} \left\{STA_{i} \times [ STA_{i} - \pi(AGE_{i})] \right\} = 0.\]

D. In R2 we can obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A) with code below:

(mod = glm(STA ~ AGE, data = data, family = binomial(link = "logit")))
## 
## Call:  glm(formula = STA ~ AGE, family = binomial(link = "logit"), data = data)
## 
## Coefficients:
## (Intercept)          AGE  
##    -3.05851      0.02754  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
## Null Deviance:       200.2 
## Residual Deviance: 192.3     AIC: 196.3

Below, plot of equation for the fitted values on the axes used in the scatter plots in parts (B) and (C):

ggplot(data, aes(AGE, STA)) +
        geom_point() + 
        geom_smooth(method = "glm", 
                    method.args = list(family = "binomial"), se = F) +
        ggtitle("AGE (X) x STAR (Y)") 

E. In plots we can see that old patients are more susceptible to die if admitted to an adult intensive care unit (ICU).

References

1. Hosmer DW, Lemeshow S, Sturdivant R. Applied Logistic regression. 2013.(http://onlinelibrary.wiley.com/book/10.1002/9781118548387)

2. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016.(https://www.R-project.org/)