Applied Logistic Regression Week 1 HW

Exercise One

For this exercise, you will need the Myopia Study dataset. Download the MYOPIA.dta Stata file, or you can also access the data through this CSV file.

One variable that is clearly important is the initial value of spherical equivalent refraction (SPHEQ).

Complete the following:

A.Write down the equation for the logistic regression model of SPHEQ on MYOPIA. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, MYOPIA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between MYOPIA and SPHEQ?
\[\pi(MYOPIC == Yes| SPHEQ) = \frac{e^{\beta_0+\beta_1*SPHEQ}}{1+e^{\beta_0+\beta_1*SPHEQ}}\] \[logit(\pi(MYOPIC == Yes| SPHEQ)) = \ln \left( \frac{\pi(x)}{1 - \pi(x)} \right) = \beta_0+\beta_1*SPHEQ\] The outcome variable, MYOPIA, is dichotomous so logistic regression model is appropriate to describe its relationship with SPHEQ.

B.Form a scatterplot of MYOPIA vs. SPHEQ.

library(aplore3)

## Warning: package 'aplore3' was built under R version 3.2.5

data("myopia")
library(ggplot2)
ggplot(data = myopia, aes(spheq, myopic))  +
    geom_point()

C.Write down an expression for the likelihood and log likelihood for the logistic regression model in part (A) using the ungrouped, n=618, data. Obtain expressions for the two likelihood equations.
\[\ell(\beta)) = \prod_{i=1}^{n=618}[\pi(x_i)]^{Y_i}*[1-\pi(x_i)]^{1-Y_i}\] \[L(\beta) = \sum_{i=1}^{n=618}\{Y_i*\ln \pi(x_i) + (1-Y_i)*\ln (1-\pi(x_i))\}\]

The two likelihood equations: \[\beta_0 ..... \sum_{i=1}^{n}[Y_i - \pi(x_i)] = 0\] \[\beta_1 ..... \sum_{i=1}^{n}x_i*[Y_i-\pi(x_i)] = 0\] D.Using Stata, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=618, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in parts (B) and (C).

fit <- glm(myopic ~ spheq, data = myopia, family = binomial(link = "logit"))
summary(fit)$coef

##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.05397315  0.2067483  0.2610573 7.940483e-01
## spheq       -3.83309762  0.4183696 -9.1619878 5.094998e-20

(log.likelihood <- logLik(fit))

## 'log Lik.' -168.6724 (df=2)

\[\pi(x) = \frac{e^{0.054-3.833*SPHEQ}}{1 + e^{0.054-3.833*SPHEQ}}\]

ggplot(data = myopia, aes(spheq, myopic))  +
    geom_point()  +
    geom_point(aes(x = spheq, y = fit$fitted + 1), color = "green", cex = 0.75)  +
    geom_line(aes(x = spheq, y = fit$fitted + 1), color = "red")

Exercise Two

For this exercise, you will need the ICU dataset. Download the icu.dta Stata file, or you can also access the data through this CSV file.

The ICU dataset consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. A number of publications have appeared which have focused on various facets of this problem.

Complete the following:

A.Write down the equation for the logistic regression model of STA on AGE. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, STA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between STA and AGE?
\[\pi(STA == Yes|AGE) = \frac{e^{\beta_0+\beta_1*AGE}}{1 + e^{\beta_0+\beta_1*AGE}}\] \[logit(\pi(STA == Yes|AGE)) = \ln \left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0 + \beta_1*AGE\]

data(icu)
str(icu)

## 'data.frame':    200 obs. of  21 variables:
##  $ id    : int  4 8 12 14 27 28 32 38 40 41 ...
##  $ sta   : Factor w/ 2 levels "Died","Lived": 2 1 1 1 2 1 1 1 1 1 ...
##  $ age   : int  87 27 59 77 76 54 87 69 63 30 ...
##  $ gender: Factor w/ 2 levels "Male","Female": 2 2 1 1 2 1 2 1 1 2 ...
##  $ race  : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ser   : Factor w/ 2 levels "Medical","Surgical": 2 1 1 2 2 1 2 1 2 1 ...
##  $ can   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ crn   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inf   : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 2 2 1 1 ...
##  $ cpr   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sys   : int  80 142 112 100 128 142 110 110 104 144 ...
##  $ hra   : int  96 88 80 70 90 103 154 132 66 110 ...
##  $ pre   : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 2 1 1 1 ...
##  $ type  : Factor w/ 2 levels "Elective","Emergency": 2 2 2 1 2 2 2 2 1 2 ...
##  $ fra   : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 1 1 1 1 ...
##  $ po2   : Factor w/ 2 levels "> 60","<= 60": 2 1 1 1 1 1 1 2 1 1 ...
##  $ ph    : Factor w/ 2 levels ">= 7.25","< 7.25": 2 1 1 1 1 1 1 1 1 1 ...
##  $ pco   : Factor w/ 2 levels "<= 45","> 45": 2 1 1 1 1 1 1 1 1 1 ...
##  $ bic   : Factor w/ 2 levels ">= 18","< 18": 1 1 1 1 1 1 1 2 1 1 ...
##  $ cre   : Factor w/ 2 levels "<= 2.0","> 2.0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ loc   : Factor w/ 3 levels "Nothing","Stupor",..: 1 1 1 1 1 1 1 1 1 1 ...

contrasts(icu$sta)

##       Lived
## Died      0
## Lived     1

B.Form a scatterplot of STA versus AGE.

ggplot(data = icu, aes(x = age, y = sta))  +
    geom_point()

C.Write down an expression for the likelihood and log likelihood for the logistic regression model in part (A) using the ungrouped, n=200, data. Obtain expressions for the two likelihood equations.
\[\ell(\beta) = \prod_{i=1}^{n = 200} \xi_i= \prod_{i = 1}^{n}\pi(x_i)^{Y_i}*[1-\pi(x_i)]^{1-Y_i}\] \[L(\beta) = \sum_{i=1}^{n}Y_i*\pi(x_i) + (1-Y_i)*[1 - \pi(x_i)]\] The two likelihood equations:
\[\beta_0 ..... \sum_{i = 1}^{n} [Y_i -\pi(x_i)] = 0\] \[\beta_1 ..... \sum_{i = 1}^{n} x_i*[Y_i -\pi(x_i)] = 0\]
D.Using Stata, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=200, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in part (B).

fit2 <- glm(sta ~ age, data = icu, family = binomial(link = logit))
summary(fit2)$coef

##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -3.05851323 0.69608124 -4.393903 1.113337e-05
## age          0.02754261 0.01056416  2.607174 9.129303e-03

ggplot(data = icu, aes(x = age, y = sta))  +
    geom_point()  +
    geom_point(aes(x = age, y = fit2$fitted + 1), color = "green", cex = 0.75)  +
    # "y = fit2 + 1" because sta is coded 1 and 2, so we need to align the probs with that
    geom_line(aes(x = age, y = fit2$fitted + 1), color = "red")

\[\pi(STA == Lived | AGE) = \frac{e^{-3.059+0.028 * AGE}}{1 + e^{-3.059+0.028 * AGE}}\]

E.Summarize (describe in words) the results presented in the plot obtained from parts (B) and (D).

which(fit2$fitted >= 0.5)

## named integer(0)

In plot B we can see that the variable STA is dichotomous, while variable AGE is a continuous one. In plot D the red line is the line of the fitted probabilities that the outcome variable is STA == Lived. From the line we can see that the model predicts that none of the patients lived, and since clearly there are data points in the Lived category, we need a better model. Also, the output from the fitted probabilities says that none of the fitted probabilities is greater than or equal 0.5.