Applied logistic regression - Week three Homework

Exercise 1

For this exercise, you will use the Myopia Study dataset and you will need to refer to the results from Week Two, Exercise 1. If you need the file you can download the MYOPIA.dta Stata file, or you can also access the data through the CSV file.

library(ggplot2)

MYOPIA<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1")#read the file from adress in the assignment
#fit the LR
fit_hw3_ex1<-glm(MYOPIC~SPHEQ, data=MYOPIA, family = binomial())
#plot
ggplot(data=MYOPIA, aes(x = SPHEQ, y = MYOPIC)) + 
  geom_point(alpha=0.4,colour="red") +
  stat_smooth(method="glm", se= FALSE,method.args =
      list(family="binomial"),alpha=0.8) +
  xlim(-1,4.5) +
  theme_bw()

Complete the following:

Using the results from Week Two, Exercise 1, compute 95 percent confidence intervals for the slope coefficient SPHEQ. Write a sentence interpreting this confidence.
We are 95% confident that the slope coefficient for the SPEQ will be between -3.0131081 and -4.653087
Let’s elaborate:
- firstly we can use coefficient and standard error from regression and derive 97.5% quantile and 2.5% quantile from normal distribution. The code is as follows: UP=qnorm(p=0.975,mean=coef(fit_hw3_ex1)[2],sd=sqrt(vcov(fit_hw3_ex1)[2,2] )) LOW=qnorm(p=0.975,mean=coef(fit_hw3_ex1)[2],sd=sqrt(vcov(fit_hw3_ex1)[2,2] ))
- alternatively and more easily we could use confint.default() function:
```
    confint.default(fit_hw3_ex1)
```
```
                 2.5 %     97.5 %
(Intercept) -0.3512461  0.4591923
SPHEQ       -4.6530870 -3.0131081
```
Use Stata (or your preferred analysis software tool) to obtain the estimated covariance matrix. Then compute the logit and estimated logistic probability for a subject with SPHEQ = 2. Then evaluate the endpoints of the 95 percent confidence intervals for the logit and estimated logistic probability. Write a sentence interpreting the estimated probability and its confidence interval.

Let us first extract covariance matrix:
```
covariance_matrix<-vcov(fit_hw3_ex1)
knitr::kable(covariance_matrix,caption = "Covariance matrix of estimated parameters")
```
Covariance matrix of estimated parameters

(Intercept) SPHEQ

(Intercept) 0.0427449 -0.0633777

SPHEQ -0.0633777 0.1750332

Now lets predict logit for a subject with SPEQ = 2:
```
predicted_logit<-predict(fit_hw3_ex1,newdata = data.frame(SPHEQ=2),se.fit = T)
predicted_probability<-predict(fit_hw3_ex1,newdata = data.frame(SPHEQ=2),type = "response",se.fit = T)
```
Finally we calculate confidence intervalls for logit and then translate them back to probability:
```
logit<-c(
     Upper=qnorm(mean=predicted_logit$fit,sd=predicted_logit$se.fit,p=0.975),
     Lower=qnorm(mean=predicted_logit$fit,sd=predicted_logit$se.fit,p=0.025)
)#ci for logit
probability<-plogis(logit)#ci for probability
confidence_intervals<-cbind(rbind(predicted_logit[-3],predicted_probability[-3]),rbind(logit,probability))
knitr::kable(confidence_intervals)
```
fit se.fit Upper Lower

logit -7.612222 0.6995475 -6.241134 -8.98331

probability 0.0004941278 0.0003454951 0.00194386 0.0001254711

Covariance matrix of estimated parameters
	(Intercept)	SPHEQ
(Intercept)	0.0427449	-0.0633777
SPHEQ	-0.0633777	0.1750332

	fit	se.fit	Upper	Lower
logit	-7.612222	0.6995475	-6.241134	-8.98331
probability	0.0004941278	0.0003454951	0.00194386	0.0001254711

You can share your response and discuss this exercise in the discussion area.

Exercise Two

For this exercise, you will use the ICU Study dataset and you will need to refer to the results from Week One, Exercise 2, Part (d). If you need the file you can download the icu.dta Stata file, or you can also access the data through the CSV file.

ICU<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1")#read the file from adress in the assignment
fit_hw3_ex2<-glm(STA~AGE, data=ICU, family = binomial())
ggplot(data = ICU,aes(x=AGE,y=STA))+
  geom_point(alpha=0.4,colour="red")+
  theme_bw()+
  stat_smooth(method="glm", se= FALSE,method.args =         list(family = "binomial")
)

Complete the following

Using the results from Week One, Exercise 2, Part (d), compute 95 percent confidence intervals for the slope and constant term. Write a sentence interpreting the confidence interval for the slope.
We are 95% confident that the slope coefficient for the AGE $\beta_1=$0.0275426 will be between 0.048248 and 0.0068372.
Analogically to first exercise:
- firstly, we can use coefficient and standard error from regression and derive 97.5% quantile and 2.5% quantile from normal distribution. The code is as follows: UP=qnorm(p=0.975,mean=coef(fit_hw3_ex2)[2],sd=sqrt(vcov(fit_hw3_ex2)[2,2] )) LOW=qnorm(p=0.975,mean=coef(fit_hw3_ex2)[2],sd=sqrt(vcov(fit_hw3_ex2)[2,2] ))
- alternatively and more easily we could use confint.default() function:
```
    confint.default(fit_hw3_ex2)
```
```
                  2.5 %      97.5 %
(Intercept) -4.42280739 -1.69421908
AGE          0.00683723  0.04824799
```

Obtain the estimated covariance matrix for the model fit from Week One, Exercise 2, Part (d). Then compute the logit and estimated logistic probability for a 60-year old subject. Then compute a 95 percent confidence intervals for the logit and estimated logistic probability. Write a sentence or two interpreting the estimated probability and its confidence interval.

Once again the approach can be analogous to exercise one but we are going to change it a litle bit:
```
covariance_matrix<-vcov(fit_hw3_ex2)
knitr::kable(covariance_matrix,caption = "Covariance matrix of estimated parameters")
```
Covariance matrix of estimated parameters

(Intercept) AGE

(Intercept) 0.4845291 -0.0071030

AGE -0.0071030 0.0001116
```
predicted_logit<-predict(fit_hw3_ex2,newdata =data.frame(AGE=60),se.fit = T)
confidence_intervals<-predicted_logit$fit+c(1,-1)*qnorm(0.975)*predicted_logit$se.fit
confidence_table<-data.frame(cbind(predicted_logit$fit,rbind(confidence_intervals)))
predicted_probability<-sapply(confidence_table[1, ], plogis)

final_table <- data.frame(rbind(predicted_probability, confidence_table))
rownames(final_table) <- c("Probability", "LOGIT")
colnames(final_table) <- c("Estimate", "U bound", "L bound")
knitr::kable(final_table)
```
Estimate U bound L bound

Probability 0.1968726 0.2602054 0.1459143

LOGIT -1.4059567 -1.0449011 -1.7670123

The estimated logistic probability of dying in the ICU for a 60 year old subject, 0.1969, is an estimate of the proportion of 60 year old subjects in the population sampled that die in the ICU. The confidence interval suggests that thismeancould be as low as 0.1450 or as high as 0.2616 with 95% confidence.

Covariance matrix of estimated parameters
	(Intercept)	AGE
(Intercept)	0.4845291	-0.0071030
AGE	-0.0071030	0.0001116

	Estimate	U bound	L bound
Probability	0.1968726	0.2602054	0.1459143
LOGIT	-1.4059567	-1.0449011	-1.7670123

Exercise Three

For this exercise, you will use the ICU Study dataset. If you need the file you can download the icu.dta Stata file, or you can also access the data through the CSV file.

First, using the ICU data, consider the multiple logistic regression model of vital status, STA, on age (AGE), cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and race (RACE).

Then, Complete the following:

The variable RACE is coded at three levels. Prepare a table showing the coding of the two design variables necessary for including this variable in a logistic regression model.

library(magrittr)
library(knitr)
attach(ICU)
table(RACE)

## RACE
##   1   2   3 
## 175  15  10

table.coding<-data.frame(white=c(0,0),black=c(0,1),other=c(1,1))
table.coding<-t(table.coding)
colnames(table.coding)=c("race_2","race_3")
table.coding%>%kable

	race_2	race_3
white	0	0
black	0	1
other	1	1

Write down the equation for the logistic regression model of STA on AGE, CAN, CPR, INF, and RACE. Write down the equation for the logit transformation of this logistic regression model. How many parameters does this model contain?

The equation of the model looks like this:
\[ \pi(X)=\frac{e^{\beta_0+\beta_1 \cdot AGE+\beta_2 \cdot CAN+\beta_3 \cdot CPR+\beta_4 \cdot INF +\beta_5 \cdot RACE_2+\beta_6 \cdot RACE_3}}{1+e^{\beta_0+\beta_1 \cdot AGE+\beta_2 \cdot CAN+\beta_3 \cdot CPR+\beta_4 \cdot INF +\beta_5 \cdot RACE_2+\beta_6 \cdot RACE_3}} \]
Where X is the vector of covariates. And the logit transformation is:
\[ g(X)=\beta_0+\beta_1 \cdot AGE+\beta_2 \cdot CAN+\beta_3 \cdot CPR+\beta_4 \cdot INF +\beta_5 \cdot RACE_2+\beta_6 \cdot RACE_3 \] This model contains 7 parameters.
Write down an expression for the likelihood and log likelihood for the logistic regression model in part (b). How many likelihood equations are there? Write down an expression for a typical likelihood equation for this problem.
Likelihood function: \[ l(\beta)=\prod_{i=1}^{n}\zeta(X_i) \] where, $\zeta(x_i)=\pi(X_i)^{Y_i}(1-\pi(X_i))^{1-Y_i}$, X is set of covariates and $Y_{i}=\begin{cases} 0 & \text{if patient lived}\\ 1 & \text{if patient died}\\ \end{cases}$
log likelihood function: \[L(\beta)=\ln(l(\beta))=\sum_{i=1}^{n}\left\{Y_i\ln[\pi(X_i)]+(1-Y_i)\ln[1-\pi(X_i)]\right\} \]
There wil be 7 likelihood equations to this probllem.
Likelihood equations that result may be expressed as follows:
\[\sum_{i=1}^{n}[y_i-\pi(x_i)]=0\] \[\sum_{i=1}^{n}x_i[y_i-\pi(x_i)]=0\]

Using a logistic regression package, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (b). Using these estimates write down the equation for the fitted values, that is, the estimated logistic probabilities.

  RACE_2=as.factor(RACE==2)#create dummy variable1
  RACE_3=as.factor(RACE==3)  #create dummy variable 2
  ICU_1=cbind(ICU,RACE_2,RACE_3)
  fit_hw3_ex3=glm(STA~AGE+CAN+CPR+INF+RACE_2+RACE_3,data = ICU_1,family = binomial())
  stargazer::stargazer(fit_hw3_ex3,type = "html")


	Dependent variable:

	STA

AGE	0.027^**
	(0.012)

CAN	0.245
	(0.617)

CPR	1.646^***
	(0.623)

INF	0.681^*
	(0.380)

RACE_2	-0.957
	(1.084)

RACE_3	0.260
	(0.871)

Constant	-3.512^***
	(0.814)


Observations	200
Log Likelihood	-89.650
Akaike Inf. Crit.	193.301

Note:	p<0.1; p<0.05; p<0.01

The logit transformation:\[ g(x)=-3.512+0.027\cdot AGE+0.245 \cdot CAN+\cdot 1.646 \cdot CPR+0.681 \cdot INF-0.957\cdot RACE_2+0.260 \cdot RACE_2 \]
The logistic regression model:\[\pi(x)=\frac{e^{-3.512+0.027\cdot AGE+0.245 \cdot CAN+\cdot 1.646 \cdot CPR+0.681 \cdot INF-0.957\cdot RACE_2+0.260 \cdot RACE_2}}{1+e^{-3.512+0.027\cdot AGE+0.245 \cdot CAN+\cdot 1.646 \cdot CPR+0.681 \cdot INF-0.957\cdot RACE_2+0.260 \cdot RACE_2}} \]

Using the results of the output from the logistic regression package used in part (d), assess the significance of the slope coefficients for the variables in the model using the likelihood ratio test. What assumptions are needed for the p-values computed for this test to be valid? What is the value of the deviance for the fitted model?

Deviance:

$D=-2\ln{\Bigg[ \frac{likelihood \enspace of \enspace the \enspace current\enspace model}{likelihood\enspace of\enspace the\enspace saturated\enspace model}\Bigg] }=$ 179.3007274.
Since we have 6 variables then we will be using $\chi^2(6)$ distribution for the likelihood ratio test for which we construct next null hypothesis:\[H_o:\beta_0=\beta_1=\beta_2=\beta_3=\beta_4=\beta_5=\beta_6\]
```
statistic=fit_hw3_ex3$null.deviance-fit_hw3_ex3$deviance
pvalue=pchisq(statistic,df=6,lower.tail = F)
```
Since pvalue=0.0019437 is much smaller than 5% we reject null hypothesis and we conclude that together, AGE, CAN, CPR, INF, and RACE are significant predictors of STA. Assumption: the statistic G will follow a distribution with 6 degrees of freedom under the null hypothesis.