Applied Logistic Regression

Exercise One:

For this exercise, we need the Myopia Study dataset. The data is available by Professor Lemeshow’s course on Canvas Network and it has been studied in Exercise One - week 1.

Complete the following:

Using the results of the output from software, assess the significance of the slope coefficient for SPHEQ using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?

Solving

For access the data in R¹:

filename = "MYOPIA.csv"
if (!file.exists(filename)) {
        fileURL="https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("MYOPIA.csv", header = T, sep = ",")

We can obtain the results required with codes:

mod = glm(MYOPIC ~ SPHEQ,
                  data = data,
                  family = binomial(link = "logit"))
ztable(mod)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	0.0540	0.2067	0.26	0.7940	1.06	0.71	1.59
SPHEQ	-3.8331	0.4184	-9.16	0.0000	0.02	0.01	0.05
Call: glm(formula = MYOPIC ~ SPHEQ, family = binomial(link = “logit”), data = data)

and

pander::pander(summary(mod))

	Estimate	Std. Error	z value	Pr(>\|z\|)
SPHEQ	-3.833	0.4184	-9.162	5.095e-20
(Intercept)	0.05397	0.2067	0.2611	0.794

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	480.1 on 617 degrees of freedom
Residual deviance:	337.3 on 616 degrees of freedom

The hypothesis for Likelihood Ratio Test (LRT) are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) and the statistic for LRT is given for:

\[G = D(\mbox{model without the variable})-D(\mbox{model with the variable}),\] where \(D\) (Deviance) is given for

\[ D = -2\times log\left(\dfrac{\mbox{likelihood of the fitted model}}{\mbox{likelihood of the saturated model}}\right),\] that is, for which pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\),

\[ D = -2\times \sum_{i=1}^{618} \left[ MYOPIC_{i} \times log\left( \dfrac{\hat{\pi}(SPHEQ_{i})}{MYOPIC_{i}}\right) + [1-MYOPIC_{i}] \times log\left( \dfrac{[1-\hat{\pi}(SPHEQ_{i})]}{[1-MYOPIC_{i}]}\right) \right],\] with

\[\hat{\pi}(SPHEQ_{i}) = \dfrac{e^{\beta_0 + \beta_{1}\times SPHEQ_{i}}}{1+e^{\beta_0 + \beta_{1}\times SPHEQ_{i}}}.\]

With results from code of the function glm we can do \[D = [\mbox{Null deviance} - \mbox{Residual deviance}]\] and, once \(G\) follows a chi-square distribution with \(1\) degree of freedom, we can find the significance of SPHEQ in the model:

test = summary(mod)
(G = test$null.deviance - test$deviance)

[1] 142.7321

(p = pchisq(G, 1, lower.tail = F))

[1] 6.72664e-33

As the probability of finding such G assuming true \(H_{0}\) is very small (\(p <0.05\)) then reject the null hypotesis.

Another way:

ztable(anova(mod, test = "LRT"))


	Df	Deviance	Resid. Df	Resid. Dev	Pr(>Chi)
Analysis of Deviance Table
Model: binomial, link: logit
Response: MYOPIC
Terms added sequentially (first to last)
NULL			617	480.08
SPHEQ	1	142.73	616	337.34	0.0000

For Wald Test given for \(W = \dfrac{\hat{\beta_{1}}}{\hat{SE}(\hat{\beta_{1}})},\) the hypothesis are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) .

We can see in summary table that z value and p value for SPHEQ are concording with TRV test.

summary(mod)$coefficients[2,3] # z value

[1] -9.161988

summary(mod)$coefficients[2,4] # p value < 0.05

[1] 5.094999e-20

Below, the predicted values by the model:

library(ggplot2)
predit = predict(mod, type = "response")
qplot(data$SPHEQ, predit, 
         ylab = "Probability of MYOPIC",
         xlab = "SPHEQ", ylim = c(0,1))

Given an explanation model for MYOPIC, the LRT Test evaluates whether the inclusion of SPHEQ variable in this model contributes significantly to greater explanation of MYOPIC. Thus, the test presupposes that models differ only by the variable of interest, in this case, SPHEQ. Under the hypothesis that \(\beta_{1}\) is equal to zero, the statistic of LRT Test follows a chi-square distribution with 1 degree of freedom.²
The Wald test evaluates how much significance is the slope of the straight SPHEQ. Under the null hypothesis and the sample size assumptions, this ratio follows a standard normal distribution.²
The value of the deviance for the fitted model is 337.34.

Exercise Two:

For this exercise, we need the ICU dataset. The data is available by Professor Lemeshow’s course on Canvas Network and it has been studied in Exercise Two - week 1.

filename = "icu.csv"
if (!file.exists(filename)) {
        fileURL = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("icu.csv", header = T, sep = ",")

Complete the following:

Using the results of the output from the logistic regression package used for Exercise Two - Part (D) of week one, assess the significance of the slope coefficient for AGE using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?

To recall from week one: exercise two-D prompted you to obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=200, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in part (B).

Solving

We can obtain the results required with codes:

mod = glm(STA ~ AGE,
          data = data,
          family = binomial(link = "logit"))
ztable(mod)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-3.0585	0.6961	-4.39	0.0000	0.05	0.01	0.17
AGE	0.0275	0.0106	2.61	0.0091	1.03	1.01	1.05
Call: glm(formula = STA ~ AGE, family = binomial(link = “logit”), data = data)

and

pander::pander(summary(mod))

	Estimate	Std. Error	z value	Pr(>\|z\|)
AGE	0.02754	0.01056	2.607	0.009129
(Intercept)	-3.059	0.6961	-4.394	1.113e-05

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	200.2 on 199 degrees of freedom
Residual deviance:	192.3 on 198 degrees of freedom

The hypothesis for Likelihood Ratio Test (LRT) are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) and the statistic for LRT is given similarly as done in exercise one:

\[G = D(\mbox{model without the variable})-D(\mbox{model with the variable}),\] where \(D\) (Deviance) is given for

\[ D = -2\times log\left(\dfrac{\mbox{likelihood of the fitted model}}{\mbox{likelihood of the saturated model}}\right),\] that is, for which pair \((AGE_{i}, STA_{i}), \quad 1 \leq i \leq 200\),

\[ D = -2\times \sum_{i=1}^{200} \left[ STA_{i} \times log\left( \dfrac{\hat{\pi}(AGE_{i})}{STA_{i}}\right) + [1-STA_{i}] \times log\left( \dfrac{[1-\hat{\pi}(AGE_{i})]}{[1-STA_{i}]}\right) \right],\] with

\[\hat{\pi}(AGE_{i}) = \dfrac{e^{\beta_0 + \beta_{1}\times AGE_{i}}}{1+e^{\beta_0 + \beta_{1}\times AGE_{i}}}.\]

With results from code of the function glm we can do \[D = [\mbox{Null deviance} - \mbox{Residual deviance}]\] and, once \(G\) follows a chi-square distribution with \(1\) degree of freedom, we can find the significance of AGE in the model:

test = summary(mod)
(G = test$null.deviance - test$deviance)

[1] 7.854589

(p = pchisq(G, 1, lower.tail = F))

[1] 0.005069187

As the probability of finding such G assuming true \(H_{0}\) is very small (\(p <0.05\)) then reject the null hypotesis.

For Wald Test given for \(W = \dfrac{\hat{\beta_{1}}}{\hat{SE}(\hat{\beta_{1}})},\) the hypothesis are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) .

We can see in summary table that z value and p value for AGE are concording with TRV test.

summary(mod)$coefficients[2,3] # z value

[1] 2.607174

summary(mod)$coefficients[2,4] # p value < 0.05

[1] 0.009129303

Given an explanation model for STA, the LRT Test evaluates whether the inclusion of AGE variable in this model contributes significantly to greater explanation of STA. Thus, the test presupposes that models differ only by the variable of interest, in this case, AGE. Under the hypothesis that \(\beta_{1}\) is equal to zero, the statistic of LRT Test follows a chi-square distribution with 1 degree of freedom.²
The Wald test evaluates how much significance is the slope of the straight AGE. Under the null hypothesis and the sample size assumptions, this ratio follows a standard normal distribution.²
The value of the deviance for the fitted model is 192.31.
Equation for fitted values (estimated logistic probabilities) of STA on AGE:

\[\hat{\pi}(AGE) = \dfrac{e^{ -3.06 + 0.03\times AGE}}{1+e^{-3.06 + 0.03\times AGE}}.\]

Predicted values by the model:

predit = predict(mod, type = "response")
ggplot() +
        geom_point(data = data, aes(x = AGE, y = STA, colour="blue")) + 
        geom_point(data = data, aes(x = AGE, y= predit, colour="red")) + 
        ylab("Probability of STA") +
        xlab("AGE") +
        ylim(0,1) +
        scale_colour_discrete(name="Plots",
                              labels=c("Observed", "Predict"))

References

1. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016.(https://www.R-project.org/)

2. Hosmer DW, Lemeshow S, Sturdivant R. Applied Logistic regression. 2013.(http://onlinelibrary.wiley.com/book/10.1002/9781118548387)

Applied Logistic Regression

Homework - Week 2

September/2016

Exercise One:

Complete the following:

Solving

Exercise Two:

Complete the following:

Solving

References