Exercise 1:

Use the Myopia Study (MYOPIA.dta)

One variable that is clearly important is the initial value of spherical equivalent refraction (SPHEQ).

  1. Write down the equation for the logistic regression model of SPHEQ on MYOPIA. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, MYOPIA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between MYOPIA and SPHEQ?

  2. Form a scatterplot of MYOPIA vs. SPHEQ.

  3. Write down an expression for the likelihood and log likelihood for the logistic regression model in part (a) using the ungrouped, , data. Obtain expressions for the two likelihood equations.

  4. Using Stata, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (a). These estimates should be based on the ungrouped, , data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in parts (b) and (c).


Firstly let’s read in the data for the assignment.

MYOPIA<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1")#read the file from adress in the assignment
knitr::kable(head(MYOPIA))#print first 6 rows with kable function from knitr
ID STUDYYEAR MYOPIC AGE GENDER SPHEQ AL ACD LT VCD SPORTHR READHR COMPHR STUDYHR TVHR DIOPTERHR MOMMY DADMY
1 1992 1 6 1 -0.052 21.89 3.690 3.498 14.70 45 8 0 0 10 34 1 1
2 1995 0 6 1 0.608 22.38 3.702 3.392 15.29 4 0 1 1 7 12 1 1
3 1991 0 6 1 1.179 22.49 3.462 3.514 15.52 14 0 2 0 10 14 0 0
4 1990 1 6 1 0.525 22.20 3.862 3.612 14.73 18 11 0 0 4 37 0 1
5 1995 0 5 0 0.697 23.29 3.676 3.454 16.16 14 0 0 0 4 4 1 0
6 1995 0 6 0 1.744 22.14 3.224 3.556 15.36 10 6 2 1 19 44 0 1

OK then, we see that the variables that we’ll been working with are MYOPIC and SPEQ.Let’s get started.


  1. MYOPIC is the variable that takes value 1 if the subject doesn’t have myopia (contrary to my intuition) and 0 if the subject has myopia.
    • the logistic regression model is:
      \[ \pi(x)=E(y|x)= \frac{exp(\beta_0 + \beta_1\cdot x) }{1+exp(\beta_0 + \beta_1 \cdot x) } \]
    • The logit transformation of this logistic regression model:
      \[ ln\left(\frac{\pi(x)}{1-\pi(x)}\right)=\beta_0+\beta_1 \cdot x \] MYOPIC is a binary response variable, using it in simple linear regression would break some of the assumption of linear regression (linearity, normal distribution, homoscedasticity..).
  2. Plot the scaterplot with ggplot2:
library(ggplot2)
ggplot(MYOPIA, aes(x = SPHEQ, y = MYOPIC)) + geom_jitter(shape = "O", position = position_jitter(height = 0)) +theme_bw()

  1. And now, lets write equations for likelihood functions:

    • Likelihood function:
      \[ l(\beta)=\prod_{i=1}^{n}\left[\pi(x_i)^{y_{i}}\right][1-\pi(1-x_i)]^{1-y_{i}} \]
    • Log - likelihood function:
      \[ l(\beta)=\ln(l(\beta))=\sum_{i=1}^{n}\Big\{y_{i}\ln[\pi(x_{i})]+(1-y_{i})\ln[1-\pi(x_{i})]\Big\} \]
    • likelihood equations:
      \[ \sum_{i=1}^{n}\left[y_{i}-\pi(x_{i})\right]=0 \] \[ \sum_{i=1}^{n}\left[y_{i}-\pi(x_{i})\right]=0 \]
  2. We are going to fit logistic regresion to MYOPIA dataset by regressing SPHEQ variable on MIOPIC by using glm function for generalized linear models and choosing binomial family.

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
#here we are using option result=asis in chunc because of the function stargazer which prints out a nice table summary
fit_hw1_ex1<-glm(MYOPIC~SPHEQ, data=MYOPIA, family = binomial())
stargazer(fit_hw1_ex1,type = "html")  
Dependent variable:
MYOPIC
SPHEQ -3.833***
(0.418)
Constant 0.054
(0.207)
Observations 618
Log Likelihood -168.672
Akaike Inf. Crit. 341.345
Note: p<0.1; p<0.05; p<0.01



…Continuing…

ggplot(MYOPIA, aes(x = SPHEQ, y = MYOPIC)) + geom_jitter(shape = "O", position = position_jitter(height = 0))+ stat_smooth(method="glm", se= FALSE,method.args = list(family = "binomial"))+xlim(-1,4.5)

Exercise 2:

Use the ICU study (icu.dta)

The ICU data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. A number of publications have appeared which have focused on various facets of this problem.

  1. Write down the equation for the logistic regression model of STA on AGE. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, STA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between STA and AGE?

  2. Form a scatterplot of STA versus AGE.
    n = 618 n = 618
  3. Write down an expression for the likelihood and log likelihood for the logistic regression model in part (a) using the ungrouped, data. Obtain expressions for the two likelihood equations.

  4. Using Stata, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (a). These estimates should be based on the ungrouped, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in part (b).

  5. Summarize (describe in words) the results presented in the plot obtained from parts (b) and (d)


Reading in the data from course site.

ICU<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1")#read the file from adress in the assignment
knitr::kable(head(ICU))#print first 6 rows with kable function from knitr
ID STA AGE SEX RACE SER CAN CRN INF CPR SYS HRA PRE TYP FRA PO2 PH PCO BIC CRE LOC
552 0 16 0 1 1 0 0 0 0 100 140 0 1 1 0 0 0 0 0 0
102 0 16 1 1 0 0 0 0 0 104 111 0 1 0 0 0 0 0 0 0
837 0 17 1 3 0 0 0 0 0 130 140 0 1 0 0 0 0 0 0 0
863 0 17 0 3 1 0 0 0 0 136 78 0 1 0 0 0 0 0 0 0
829 0 17 0 1 1 0 0 0 0 140 78 0 1 1 0 0 0 0 0 0
379 0 18 0 1 1 0 0 0 0 112 76 0 1 1 0 0 0 0 0 0

Since questions a - c and e are literraly the same as in exercise one we’re going to skip them and jump right to model estimation and ploting.

ggplot(aes(x=AGE,y=STA),data = ICU)+geom_point(alpha=0.4,colour="red")+theme_bw()

Fitting the data:

fit_hw1_ex2<-glm(STA~AGE, data=ICU, family = binomial())
stargazer(fit_hw1_ex1,type = "html")
Dependent variable:
MYOPIC
SPHEQ -3.833***
(0.418)
Constant 0.054
(0.207)
Observations 618
Log Likelihood -168.672
Akaike Inf. Crit. 341.345
Note: p<0.1; p<0.05; p<0.01


ggplot(data = ICU,aes(x=AGE,y=STA))+geom_point(alpha=0.4,colour="red")+theme_bw()+stat_smooth(method="glm", se= FALSE,method.args = list(family = "binomial"))

It doesn’t look really s shaped but as proffesor said, it doesn’t allways have to be.