Applied logistic regression - Week One Homework

Exercise 1:

Use the Myopia Study (MYOPIA.dta)

One variable that is clearly important is the initial value of spherical equivalent refraction (SPHEQ).

Write down the equation for the logistic regression model of SPHEQ on MYOPIA. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, MYOPIA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between MYOPIA and SPHEQ?
Form a scatterplot of MYOPIA vs. SPHEQ.
Write down an expression for the likelihood and log likelihood for the logistic regression model in part (a) using the ungrouped, , data. Obtain expressions for the two likelihood equations.
Using Stata, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (a). These estimates should be based on the ungrouped, , data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in parts (b) and (c).

Firstly let’s read in the data for the assignment.

MYOPIA<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1")#read the file from adress in the assignment
knitr::kable(head(MYOPIA))#print first 6 rows with kable function from knitr

ID	STUDYYEAR	MYOPIC	AGE	GENDER	SPHEQ	AL	ACD	LT	VCD	SPORTHR	READHR	COMPHR	STUDYHR	TVHR	DIOPTERHR	MOMMY	DADMY
1	1992	1	6	1	-0.052	21.89	3.690	3.498	14.70	45	8	0	0	10	34	1	1
2	1995	0	6	1	0.608	22.38	3.702	3.392	15.29	4	0	1	1	7	12	1	1
3	1991	0	6	1	1.179	22.49	3.462	3.514	15.52	14	0	2	0	10	14	0	0
4	1990	1	6	1	0.525	22.20	3.862	3.612	14.73	18	11	0	0	4	37	0	1
5	1995	0	5	0	0.697	23.29	3.676	3.454	16.16	14	0	0	0	4	4	1	0
6	1995	0	6	0	1.744	22.14	3.224	3.556	15.36	10	6	2	1	19	44	0	1

OK then, we see that the variables that we’ll been working with are MYOPIC and SPEQ.Let’s get started.

MYOPIC is the variable that takes value 1 if the subject doesn’t have myopia (contrary to my intuition) and 0 if the subject has myopia.
- the logistic regression model is:
  \[ \pi(x)=E(y|x)= \frac{exp(\beta_0 + \beta_1\cdot x) }{1+exp(\beta_0 + \beta_1 \cdot x) } \]
- The logit transformation of this logistic regression model:
  \[ ln\left(\frac{\pi(x)}{1-\pi(x)}\right)=\beta_0+\beta_1 \cdot x \] MYOPIC is a binary response variable, using it in simple linear regression would break some of the assumption of linear regression (linearity, normal distribution, homoscedasticity..).
Plot the scaterplot with ggplot2:

library(ggplot2)
ggplot(MYOPIA, aes(x = SPHEQ, y = MYOPIC)) + geom_jitter(shape = "O", position = position_jitter(height = 0)) +theme_bw()

And now, lets write equations for likelihood functions:
- Likelihood function:
  \[ l(\beta)=\prod_{i=1}^{n}\left[\pi(x_i)^{y_{i}}\right][1-\pi(1-x_i)]^{1-y_{i}} \]
- Log - likelihood function:
  \[ l(\beta)=\ln(l(\beta))=\sum_{i=1}^{n}\Big\{y_{i}\ln[\pi(x_{i})]+(1-y_{i})\ln[1-\pi(x_{i})]\Big\} \]
- likelihood equations:
  \[ \sum_{i=1}^{n}\left[y_{i}-\pi(x_{i})\right]=0 \] \[ \sum_{i=1}^{n}\left[y_{i}-\pi(x_{i})\right]=0 \]
We are going to fit logistic regresion to MYOPIA dataset by regressing SPHEQ variable on MIOPIC by using glm function for generalized linear models and choosing binomial family.

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2. http://CRAN.R-project.org/package=stargazer

#here we are using option result=asis in chunc because of the function stargazer which prints out a nice table summary
fit_hw1_ex1<-glm(MYOPIC~SPHEQ, data=MYOPIA, family = binomial())
stargazer(fit_hw1_ex1,type = "html")


	Dependent variable:

	MYOPIC

SPHEQ	-3.833^***
	(0.418)

Constant	0.054
	(0.207)


Observations	618
Log Likelihood	-168.672
Akaike Inf. Crit.	341.345

Note:	p<0.1; p<0.05; p<0.01

…Continuing…

Equation:
\[ \hat{\pi}(x)=\frac{ e^{\hspace{1mm} 0.054-3.833 \cdot X}}{1+e^{\hspace{1mm}0.054-3.833\cdot x}} \]
And finally a plot of fitted values:

ggplot(MYOPIA, aes(x = SPHEQ, y = MYOPIC)) + geom_jitter(shape = "O", position = position_jitter(height = 0))+ stat_smooth(method="glm", se= FALSE,method.args = list(family = "binomial"))+xlim(-1,4.5)

Exercise 2:

Use the ICU study (icu.dta)

The ICU data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. A number of publications have appeared which have focused on various facets of this problem.

Write down the equation for the logistic regression model of STA on AGE. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, STA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between STA and AGE?
Form a scatterplot of STA versus AGE.
n = 618 n = 618
Write down an expression for the likelihood and log likelihood for the logistic regression model in part (a) using the ungrouped, data. Obtain expressions for the two likelihood equations.
Using Stata, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (a). These estimates should be based on the ungrouped, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in part (b).
Summarize (describe in words) the results presented in the plot obtained from parts (b) and (d)

Reading in the data from course site.

ICU<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1")#read the file from adress in the assignment
knitr::kable(head(ICU))#print first 6 rows with kable function from knitr

ID	AGE	SEX	RACE	SER	SYS	HRA	TYP	FRA
552	16	0	1	1	100	140	1	1
102	16	1	1	0	104	111	1	0
837	17	1	3	0	130	140	1	0
863	17	0	3	1	136	78	1	0
829	17	0	1	1	140	78	1	1
379	18	0	1	1	112	76	1	1

Since questions a - c and e are literraly the same as in exercise one we’re going to skip them and jump right to model estimation and ploting.

ggplot(aes(x=AGE,y=STA),data = ICU)+geom_point(alpha=0.4,colour="red")+theme_bw()

Fitting the data:

fit_hw1_ex2<-glm(STA~AGE, data=ICU, family = binomial())
stargazer(fit_hw1_ex1,type = "html")


	Dependent variable:

	MYOPIC

SPHEQ	-3.833^***
	(0.418)

Constant	0.054
	(0.207)


Observations	618
Log Likelihood	-168.672
Akaike Inf. Crit.	341.345

Note:	p<0.1; p<0.05; p<0.01

Logistic regression model:
\[ \hat{\pi}(x)=\frac{ e^{\hspace{1mm} -3.06-0.027 \cdot X}}{1+e^{\hspace{1mm}0.054-3.833\cdot x}} \]
Plot of fitted values:

ggplot(data = ICU,aes(x=AGE,y=STA))+geom_point(alpha=0.4,colour="red")+theme_bw()+stat_smooth(method="glm", se= FALSE,method.args = list(family = "binomial"))

It doesn’t look really s shaped but as proffesor said, it doesn’t allways have to be.

Applied logistic regression - Week One Homework

Milos Cipovic

August 21, 2016

Exercise 1:

Exercise 2:

ID	AGE	SEX	RACE	SER	SYS	HRA	TYP	FRA
552	16	0	1	1	100	140	1	1
102	16	1	1	0	104	111	1	0
837	17	1	3	0	130	140	1	0
863	17	0	3	1	136	78	1	0
829	17	0	1	1	140	78	1	1
379	18	0	1	1	112	76	1	1

ID	AGE	SEX	RACE	SER	SYS	HRA	TYP	FRA
552	16	0	1	1	100	140	1	1
102	16	1	1	0	104	111	1	0
837	17	1	3	0	130	140	1	0
863	17	0	3	1	136	78	1	0
829	17	0	1	1	140	78	1	1
379	18	0	1	1	112	76	1	1

ID	AGE	SEX	RACE	SER	SYS	HRA	TYP	FRA
552	16	0	1	1	100	140	1	1
102	16	1	1	0	104	111	1	0
837	17	1	3	0	130	140	1	0
863	17	0	3	1	136	78	1	0
829	17	0	1	1	140	78	1	1
379	18	0	1	1	112	76	1	1