Codes for knitr and ztable:
library(knitr)
knitr::opts_chunk$set(collapse = TRUE,
comment = "#>",
fig.align='center',
fig.width = 5,
fig.height = 3,
cache=TRUE,
echo=T,
warning =F,
message= F,
results = 'asis')
library(ztable)
options(ztable.type="html")For this exercise, we need the Myopia Study dataset. The data is available by Professor Lemeshow’s course on Canvas Network and it has been studied in Exercise One - week 1.
Using the results of the output from software, assess the significance of the slope coefficient for SPHEQ using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?
For access the data in R1:
filename = "MYOPIA.csv"
if (!file.exists(filename)) {
fileURL="https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1"
download.file(fileURL, filename)
}
data = read.csv("MYOPIA.csv", header = T, sep = ",")We can obtain the results required with codes:
mod = glm(MYOPIC ~ SPHEQ,
data = data,
family = binomial(link = "logit"))
ztable(mod)| Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0.0540 | 0.2067 | 0.26 | 0.7940 | 1.06 | 0.71 | 1.59 |
and
pander::pander(summary(mod))| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| SPHEQ | -3.833 | 0.4184 | -9.162 | 5.095e-20 |
| (Intercept) | 0.05397 | 0.2067 | 0.2611 | 0.794 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 480.1 on 617 degrees of freedom |
| Residual deviance: | 337.3 on 616 degrees of freedom |
The hypothesis for Likelihood Ratio Test (LRT) are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) and the statistic for LRT is given for:
\[G = D(\mbox{model without the variable})-D(\mbox{model with the variable}),\] where \(D\) (Deviance) is given for
\[ D = -2\times log\left(\dfrac{\mbox{likelihood of the fitted model}}{\mbox{likelihood of the saturated model}}\right),\] that is, for which pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\),
\[ D = -2\times \sum_{i=1}^{618} \left[ MYOPIC_{i} \times log\left( \dfrac{\hat{\pi}(SPHEQ_{i})}{MYOPIC_{i}}\right) + [1-MYOPIC_{i}] \times log\left( \dfrac{[1-\hat{\pi}(SPHEQ_{i})]}{[1-MYOPIC_{i}]}\right) \right],\] with
\[\hat{\pi}(SPHEQ_{i}) = \dfrac{e^{\beta_0 + \beta_{1}\times SPHEQ_{i}}}{1+e^{\beta_0 + \beta_{1}\times SPHEQ_{i}}}.\]
With results from code of the function glm we can do \[D = [\mbox{Null deviance} - \mbox{Residual deviance}]\] and, once \(G\) follows a chi-square distribution with \(1\) degree of freedom, we can find the significance of SPHEQ in the model:
test = summary(mod)
(G = test$null.deviance - test$deviance)[1] 142.7321
(p = pchisq(G, 1, lower.tail = F))[1] 6.72664e-33
As the probability of finding such G assuming true \(H_{0}\) is very small (\(p <0.05\)) then reject the null hypotesis.
Another way:
ztable(anova(mod, test = "LRT"))| Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | 617 | 480.08 | |||
For Wald Test given for \(W = \dfrac{\hat{\beta_{1}}}{\hat{SE}(\hat{\beta_{1}})},\) the hypothesis are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) .
We can see in summary table that z value and p value for SPHEQ are concording with TRV test.
summary(mod)$coefficients[2,3] # z value[1] -9.161988
summary(mod)$coefficients[2,4] # p value < 0.05[1] 5.094999e-20
Below, the predicted values by the model:
library(ggplot2)
predit = predict(mod, type = "response")
qplot(data$SPHEQ, predit,
ylab = "Probability of MYOPIC",
xlab = "SPHEQ", ylim = c(0,1)) Given an explanation model for MYOPIC, the LRT Test evaluates whether the inclusion of SPHEQ variable in this model contributes significantly to greater explanation of MYOPIC. Thus, the test presupposes that models differ only by the variable of interest, in this case, SPHEQ. Under the hypothesis that \(\beta_{1}\) is equal to zero, the statistic of LRT Test follows a chi-square distribution with 1 degree of freedom.2
The Wald test evaluates how much significance is the slope of the straight SPHEQ. Under the null hypothesis and the sample size assumptions, this ratio follows a standard normal distribution.2
The value of the deviance for the fitted model is 337.34.
For this exercise, we need the ICU dataset. The data is available by Professor Lemeshow’s course on Canvas Network and it has been studied in Exercise Two - week 1.
filename = "icu.csv"
if (!file.exists(filename)) {
fileURL = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1"
download.file(fileURL, filename)
}
data = read.csv("icu.csv", header = T, sep = ",")Using the results of the output from the logistic regression package used for Exercise Two - Part (D) of week one, assess the significance of the slope coefficient for AGE using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?
To recall from week one: exercise two-D prompted you to obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=200, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in part (B).
We can obtain the results required with codes:
mod = glm(STA ~ AGE,
data = data,
family = binomial(link = "logit"))
ztable(mod)| Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
|---|---|---|---|---|---|---|---|
| (Intercept) | -3.0585 | 0.6961 | -4.39 | 0.0000 | 0.05 | 0.01 | 0.17 |
and
pander::pander(summary(mod))| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| AGE | 0.02754 | 0.01056 | 2.607 | 0.009129 |
| (Intercept) | -3.059 | 0.6961 | -4.394 | 1.113e-05 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 200.2 on 199 degrees of freedom |
| Residual deviance: | 192.3 on 198 degrees of freedom |
The hypothesis for Likelihood Ratio Test (LRT) are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) and the statistic for LRT is given similarly as done in exercise one:
\[G = D(\mbox{model without the variable})-D(\mbox{model with the variable}),\] where \(D\) (Deviance) is given for
\[ D = -2\times log\left(\dfrac{\mbox{likelihood of the fitted model}}{\mbox{likelihood of the saturated model}}\right),\] that is, for which pair \((AGE_{i}, STA_{i}), \quad 1 \leq i \leq 200\),
\[ D = -2\times \sum_{i=1}^{200} \left[ STA_{i} \times log\left( \dfrac{\hat{\pi}(AGE_{i})}{STA_{i}}\right) + [1-STA_{i}] \times log\left( \dfrac{[1-\hat{\pi}(AGE_{i})]}{[1-STA_{i}]}\right) \right],\] with
\[\hat{\pi}(AGE_{i}) = \dfrac{e^{\beta_0 + \beta_{1}\times AGE_{i}}}{1+e^{\beta_0 + \beta_{1}\times AGE_{i}}}.\]
With results from code of the function glm we can do \[D = [\mbox{Null deviance} - \mbox{Residual deviance}]\] and, once \(G\) follows a chi-square distribution with \(1\) degree of freedom, we can find the significance of AGE in the model:
test = summary(mod)
(G = test$null.deviance - test$deviance)[1] 7.854589
(p = pchisq(G, 1, lower.tail = F))[1] 0.005069187
As the probability of finding such G assuming true \(H_{0}\) is very small (\(p <0.05\)) then reject the null hypotesis.
For Wald Test given for \(W = \dfrac{\hat{\beta_{1}}}{\hat{SE}(\hat{\beta_{1}})},\) the hypothesis are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) .
We can see in summary table that z value and p value for AGE are concording with TRV test.
summary(mod)$coefficients[2,3] # z value[1] 2.607174
summary(mod)$coefficients[2,4] # p value < 0.05[1] 0.009129303
Given an explanation model for STA, the LRT Test evaluates whether the inclusion of AGE variable in this model contributes significantly to greater explanation of STA. Thus, the test presupposes that models differ only by the variable of interest, in this case, AGE. Under the hypothesis that \(\beta_{1}\) is equal to zero, the statistic of LRT Test follows a chi-square distribution with 1 degree of freedom.2
The Wald test evaluates how much significance is the slope of the straight AGE. Under the null hypothesis and the sample size assumptions, this ratio follows a standard normal distribution.2
The value of the deviance for the fitted model is 192.31.
Equation for fitted values (estimated logistic probabilities) of STA on AGE:
\[\hat{\pi}(AGE) = \dfrac{e^{ -3.06 + 0.03\times AGE}}{1+e^{-3.06 + 0.03\times AGE}}.\]
predit = predict(mod, type = "response")
ggplot() +
geom_point(data = data, aes(x = AGE, y = STA, colour="blue")) +
geom_point(data = data, aes(x = AGE, y= predit, colour="red")) +
ylab("Probability of STA") +
xlab("AGE") +
ylim(0,1) +
scale_colour_discrete(name="Plots",
labels=c("Observed", "Predict"))1. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016.(https://www.R-project.org/)
2. Hosmer DW, Lemeshow S, Sturdivant R. Applied Logistic regression. 2013.(http://onlinelibrary.wiley.com/book/10.1002/9781118548387)