For this exercise, you will use the Myopia Study dataset and you will need to refer to the results from Week Two, Exercise 1. If you need the file you can download the MYOPIA.dta Stata file, or you can also access the data through the CSV file.
library(ggplot2)
MYOPIA<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1")#read the file from adress in the assignment
#fit the LR
fit_hw3_ex1<-glm(MYOPIC~SPHEQ, data=MYOPIA, family = binomial())
#plot
ggplot(data=MYOPIA, aes(x = SPHEQ, y = MYOPIC)) +
geom_point(alpha=0.4,colour="red") +
stat_smooth(method="glm", se= FALSE,method.args =
list(family="binomial"),alpha=0.8) +
xlim(-1,4.5) +
theme_bw()
Using the results from Week Two, Exercise 1, compute 95 percent confidence intervals for the slope coefficient SPHEQ. Write a sentence interpreting this confidence.
We are 95% confident that the slope coefficient for theSPEQ will be between -3.0131081 and -4.653087firstly we can use coefficient and standard error from regression and derive 97.5% quantile and 2.5% quantile from normal distribution. The code is as follows: UP=qnorm(p=0.975,mean=coef(fit_hw3_ex1)[2],sd=sqrt(vcov(fit_hw3_ex1)[2,2] )) LOW=qnorm(p=0.975,mean=coef(fit_hw3_ex1)[2],sd=sqrt(vcov(fit_hw3_ex1)[2,2] ))
alternatively and more easily we could use confint.default() function:
confint.default(fit_hw3_ex1)
2.5 % 97.5 %
(Intercept) -0.3512461 0.4591923
SPHEQ -4.6530870 -3.0131081
Use Stata (or your preferred analysis software tool) to obtain the estimated covariance matrix. Then compute the logit and estimated logistic probability for a subject with SPHEQ = 2. Then evaluate the endpoints of the 95 percent confidence intervals for the logit and estimated logistic probability. Write a sentence interpreting the estimated probability and its confidence interval.
Let us first extract covariance matrix:
covariance_matrix<-vcov(fit_hw3_ex1)
knitr::kable(covariance_matrix,caption = "Covariance matrix of estimated parameters")
| (Intercept) | SPHEQ | |
|---|---|---|
| (Intercept) | 0.0427449 | -0.0633777 |
| SPHEQ | -0.0633777 | 0.1750332 |
Now lets predict logit for a subject with SPEQ = 2:
predicted_logit<-predict(fit_hw3_ex1,newdata = data.frame(SPHEQ=2),se.fit = T)
predicted_probability<-predict(fit_hw3_ex1,newdata = data.frame(SPHEQ=2),type = "response",se.fit = T)
Finally we calculate confidence intervalls for logit and then translate them back to probability:
logit<-c(
Upper=qnorm(mean=predicted_logit$fit,sd=predicted_logit$se.fit,p=0.975),
Lower=qnorm(mean=predicted_logit$fit,sd=predicted_logit$se.fit,p=0.025)
)#ci for logit
probability<-plogis(logit)#ci for probability
confidence_intervals<-cbind(rbind(predicted_logit[-3],predicted_probability[-3]),rbind(logit,probability))
knitr::kable(confidence_intervals)
| fit | se.fit | Upper | Lower | |
|---|---|---|---|---|
| logit | -7.612222 | 0.6995475 | -6.241134 | -8.98331 |
| probability | 0.0004941278 | 0.0003454951 | 0.00194386 | 0.0001254711 |
You can share your response and discuss this exercise in the discussion area.
For this exercise, you will use the ICU Study dataset and you will need to refer to the results from Week One, Exercise 2, Part (d). If you need the file you can download the icu.dta Stata file, or you can also access the data through the CSV file.
ICU<-read.csv(file = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1")#read the file from adress in the assignment
fit_hw3_ex2<-glm(STA~AGE, data=ICU, family = binomial())
ggplot(data = ICU,aes(x=AGE,y=STA))+
geom_point(alpha=0.4,colour="red")+
theme_bw()+
stat_smooth(method="glm", se= FALSE,method.args = list(family = "binomial")
)
Using the results from Week One, Exercise 2, Part (d), compute 95 percent confidence intervals for the slope and constant term. Write a sentence interpreting the confidence interval for the slope.
We are 95% confident that the slope coefficient for theAGE \(\beta_1=\)0.0275426 will be between 0.048248 and 0.0068372.firstly, we can use coefficient and standard error from regression and derive 97.5% quantile and 2.5% quantile from normal distribution. The code is as follows: UP=qnorm(p=0.975,mean=coef(fit_hw3_ex2)[2],sd=sqrt(vcov(fit_hw3_ex2)[2,2] )) LOW=qnorm(p=0.975,mean=coef(fit_hw3_ex2)[2],sd=sqrt(vcov(fit_hw3_ex2)[2,2] ))
alternatively and more easily we could use confint.default() function:
confint.default(fit_hw3_ex2)
2.5 % 97.5 %
(Intercept) -4.42280739 -1.69421908
AGE 0.00683723 0.04824799Obtain the estimated covariance matrix for the model fit from Week One, Exercise 2, Part (d). Then compute the logit and estimated logistic probability for a 60-year old subject. Then compute a 95 percent confidence intervals for the logit and estimated logistic probability. Write a sentence or two interpreting the estimated probability and its confidence interval.
Once again the approach can be analogous to exercise one but we are going to change it a litle bit:
covariance_matrix<-vcov(fit_hw3_ex2)
knitr::kable(covariance_matrix,caption = "Covariance matrix of estimated parameters")
| (Intercept) | AGE | |
|---|---|---|
| (Intercept) | 0.4845291 | -0.0071030 |
| AGE | -0.0071030 | 0.0001116 |
predicted_logit<-predict(fit_hw3_ex2,newdata =data.frame(AGE=60),se.fit = T)
confidence_intervals<-predicted_logit$fit+c(1,-1)*qnorm(0.975)*predicted_logit$se.fit
confidence_table<-data.frame(cbind(predicted_logit$fit,rbind(confidence_intervals)))
predicted_probability<-sapply(confidence_table[1, ], plogis)
final_table <- data.frame(rbind(predicted_probability, confidence_table))
rownames(final_table) <- c("Probability", "LOGIT")
colnames(final_table) <- c("Estimate", "U bound", "L bound")
knitr::kable(final_table)
| Estimate | U bound | L bound | |
|---|---|---|---|
| Probability | 0.1968726 | 0.2602054 | 0.1459143 |
| LOGIT | -1.4059567 | -1.0449011 | -1.7670123 |
The estimated logistic probability of dying in the ICU for a 60 year old subject, 0.1969, is an estimate of the proportion of 60 year old subjects in the population sampled that die in the ICU. The confidence interval suggests that thismeancould be as low as 0.1450 or as high as 0.2616 with 95% confidence.
For this exercise, you will use the ICU Study dataset. If you need the file you can download the icu.dta Stata file, or you can also access the data through the CSV file.
First, using the ICU data, consider the multiple logistic regression model of vital status, STA, on age (AGE), cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and race (RACE).
The variable RACE is coded at three levels. Prepare a table showing the coding of the two design variables necessary for including this variable in a logistic regression model.
library(magrittr)
library(knitr)
attach(ICU)
table(RACE)
## RACE
## 1 2 3
## 175 15 10
table.coding<-data.frame(white=c(0,0),black=c(0,1),other=c(1,1))
table.coding<-t(table.coding)
colnames(table.coding)=c("race_2","race_3")
table.coding%>%kable
| race_2 | race_3 | |
|---|---|---|
| white | 0 | 0 |
| black | 0 | 1 |
| other | 1 | 1 |
Write down the equation for the logistic regression model of STA on AGE, CAN, CPR, INF, and RACE. Write down the equation for the logit transformation of this logistic regression model. How many parameters does this model contain?
The equation of the model looks like this:
\[ \pi(X)=\frac{e^{\beta_0+\beta_1 \cdot AGE+\beta_2 \cdot CAN+\beta_3 \cdot CPR+\beta_4 \cdot INF +\beta_5 \cdot RACE_2+\beta_6 \cdot RACE_3}}{1+e^{\beta_0+\beta_1 \cdot AGE+\beta_2 \cdot CAN+\beta_3 \cdot CPR+\beta_4 \cdot INF +\beta_5 \cdot RACE_2+\beta_6 \cdot RACE_3}} \]
Where X is the vector of covariates. And the logit transformation is:
\[ g(X)=\beta_0+\beta_1 \cdot AGE+\beta_2 \cdot CAN+\beta_3 \cdot CPR+\beta_4 \cdot INF +\beta_5 \cdot RACE_2+\beta_6 \cdot RACE_3 \] This model contains 7 parameters.
Write down an expression for the likelihood and log likelihood for the logistic regression model in part (b). How many likelihood equations are there? Write down an expression for a typical likelihood equation for this problem.
Likelihood function: \[ l(\beta)=\prod_{i=1}^{n}\zeta(X_i) \] where, \(\zeta(x_i)=\pi(X_i)^{Y_i}(1-\pi(X_i))^{1-Y_i}\), X is set of covariates and \(Y_{i}=\begin{cases} 0 & \text{if patient lived}\\ 1 & \text{if patient died}\\ \end{cases}\)
log likelihood function: \[L(\beta)=\ln(l(\beta))=\sum_{i=1}^{n}\left\{Y_i\ln[\pi(X_i)]+(1-Y_i)\ln[1-\pi(X_i)]\right\} \]
There wil be 7 likelihood equations to this probllem.
Likelihood equations that result may be expressed as follows:
\[\sum_{i=1}^{n}[y_i-\pi(x_i)]=0\] \[\sum_{i=1}^{n}x_i[y_i-\pi(x_i)]=0\]
Using a logistic regression package, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (b). Using these estimates write down the equation for the fitted values, that is, the estimated logistic probabilities.
RACE_2=as.factor(RACE==2)#create dummy variable1
RACE_3=as.factor(RACE==3) #create dummy variable 2
ICU_1=cbind(ICU,RACE_2,RACE_3)
fit_hw3_ex3=glm(STA~AGE+CAN+CPR+INF+RACE_2+RACE_3,data = ICU_1,family = binomial())
stargazer::stargazer(fit_hw3_ex3,type = "html")
| Dependent variable: | |
| STA | |
| AGE | 0.027** |
| (0.012) | |
| CAN | 0.245 |
| (0.617) | |
| CPR | 1.646*** |
| (0.623) | |
| INF | 0.681* |
| (0.380) | |
| RACE_2 | -0.957 |
| (1.084) | |
| RACE_3 | 0.260 |
| (0.871) | |
| Constant | -3.512*** |
| (0.814) | |
| Observations | 200 |
| Log Likelihood | -89.650 |
| Akaike Inf. Crit. | 193.301 |
| Note: | p<0.1; p<0.05; p<0.01 |
Using the results of the output from the logistic regression package used in part (d), assess the significance of the slope coefficients for the variables in the model using the likelihood ratio test. What assumptions are needed for the p-values computed for this test to be valid? What is the value of the deviance for the fitted model?
Deviance:
\(D=-2\ln{\Bigg[ \frac{likelihood \enspace of \enspace the \enspace current\enspace model}{likelihood\enspace of\enspace the\enspace saturated\enspace model}\Bigg] }=\) 179.3007274.
Since we have 6 variables then we will be using \(\chi^2(6)\) distribution for the likelihood ratio test for which we construct next null hypothesis:\[H_o:\beta_0=\beta_1=\beta_2=\beta_3=\beta_4=\beta_5=\beta_6\]
statistic=fit_hw3_ex3$null.deviance-fit_hw3_ex3$deviance
pvalue=pchisq(statistic,df=6,lower.tail = F)
Since pvalue=0.0019437 is much smaller than 5% we reject null hypothesis and we conclude that together, AGE, CAN, CPR, INF, and RACE are significant predictors of STA. Assumption: the statistic G will follow a distribution with 6 degrees of freedom under the null hypothesis.