Applied Logistic Regression

Homework - Week 2

September/2016

Codes for knitr and ztable:

library(knitr)
knitr::opts_chunk$set(collapse = TRUE, 
                      comment = "#>",
                      fig.align='center',
                      fig.width = 5,
                      fig.height = 3,
                      cache=TRUE,
                      echo=T,
                      warning =F,
                      message= F,
                      results = 'asis')
library(ztable)
options(ztable.type="html")

Exercise One:

For this exercise, we need the Myopia Study dataset. The data is available by Professor Lemeshow’s course on Canvas Network and it has been studied in Exercise One - week 1.

Complete the following:

Using the results of the output from software, assess the significance of the slope coefficient for SPHEQ using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?

Solving

For access the data in R1:

filename = "MYOPIA.csv"
if (!file.exists(filename)) {
        fileURL="https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("MYOPIA.csv", header = T, sep = ",")

We can obtain the results required with codes:

mod = glm(MYOPIC ~ SPHEQ,
                  data = data,
                  family = binomial(link = "logit"))
ztable(mod)
  Estimate Std. Error z value Pr(>|z|) OR lcl ucl
(Intercept) 0.0540 0.2067 0.26 0.7940 1.06 0.71 1.59
SPHEQ -3.8331 0.4184 -9.16 0.0000 0.02 0.01 0.05
Call: glm(formula = MYOPIC ~ SPHEQ, family = binomial(link = “logit”), data = data)

and

pander::pander(summary(mod))
  Estimate Std. Error z value Pr(>|z|)
SPHEQ -3.833 0.4184 -9.162 5.095e-20
(Intercept) 0.05397 0.2067 0.2611 0.794

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 480.1 on 617 degrees of freedom
Residual deviance: 337.3 on 616 degrees of freedom

The hypothesis for Likelihood Ratio Test (LRT) are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) and the statistic for LRT is given for:

\[G = D(\mbox{model without the variable})-D(\mbox{model with the variable}),\] where \(D\) (Deviance) is given for

\[ D = -2\times log\left(\dfrac{\mbox{likelihood of the fitted model}}{\mbox{likelihood of the saturated model}}\right),\] that is, for which pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\),

\[ D = -2\times \sum_{i=1}^{618} \left[ MYOPIC_{i} \times log\left( \dfrac{\hat{\pi}(SPHEQ_{i})}{MYOPIC_{i}}\right) + [1-MYOPIC_{i}] \times log\left( \dfrac{[1-\hat{\pi}(SPHEQ_{i})]}{[1-MYOPIC_{i}]}\right) \right],\] with

\[\hat{\pi}(SPHEQ_{i}) = \dfrac{e^{\beta_0 + \beta_{1}\times SPHEQ_{i}}}{1+e^{\beta_0 + \beta_{1}\times SPHEQ_{i}}}.\]

With results from code of the function glm we can do \[D = [\mbox{Null deviance} - \mbox{Residual deviance}]\] and, once \(G\) follows a chi-square distribution with \(1\) degree of freedom, we can find the significance of SPHEQ in the model:

test = summary(mod)
(G = test$null.deviance - test$deviance)

[1] 142.7321

(p = pchisq(G, 1, lower.tail = F))

[1] 6.72664e-33

As the probability of finding such G assuming true \(H_{0}\) is very small (\(p <0.05\)) then reject the null hypotesis.

Another way:

ztable(anova(mod, test = "LRT"))
Analysis of Deviance Table
Model: binomial, link: logit
Response: MYOPIC
Terms added sequentially (first to last)
  Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 617 480.08
SPHEQ 1 142.73 616 337.34 0.0000

For Wald Test given for \(W = \dfrac{\hat{\beta_{1}}}{\hat{SE}(\hat{\beta_{1}})},\) the hypothesis are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) .

We can see in summary table that z value and p value for SPHEQ are concording with TRV test.

summary(mod)$coefficients[2,3] # z value

[1] -9.161988

summary(mod)$coefficients[2,4] # p value < 0.05

[1] 5.094999e-20

Below, the predicted values by the model:

library(ggplot2)
predit = predict(mod, type = "response")
qplot(data$SPHEQ, predit, 
         ylab = "Probability of MYOPIC",
         xlab = "SPHEQ", ylim = c(0,1)) 

Exercise Two:

For this exercise, we need the ICU dataset. The data is available by Professor Lemeshow’s course on Canvas Network and it has been studied in Exercise Two - week 1.

filename = "icu.csv"
if (!file.exists(filename)) {
        fileURL = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("icu.csv", header = T, sep = ",")

Complete the following:

Using the results of the output from the logistic regression package used for Exercise Two - Part (D) of week one, assess the significance of the slope coefficient for AGE using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?

To recall from week one: exercise two-D prompted you to obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=200, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in part (B).

Solving

We can obtain the results required with codes:

mod = glm(STA ~ AGE,
          data = data,
          family = binomial(link = "logit"))
ztable(mod)
  Estimate Std. Error z value Pr(>|z|) OR lcl ucl
(Intercept) -3.0585 0.6961 -4.39 0.0000 0.05 0.01 0.17
AGE 0.0275 0.0106 2.61 0.0091 1.03 1.01 1.05
Call: glm(formula = STA ~ AGE, family = binomial(link = “logit”), data = data)

and

pander::pander(summary(mod))
  Estimate Std. Error z value Pr(>|z|)
AGE 0.02754 0.01056 2.607 0.009129
(Intercept) -3.059 0.6961 -4.394 1.113e-05

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 200.2 on 199 degrees of freedom
Residual deviance: 192.3 on 198 degrees of freedom

The hypothesis for Likelihood Ratio Test (LRT) are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) and the statistic for LRT is given similarly as done in exercise one:

\[G = D(\mbox{model without the variable})-D(\mbox{model with the variable}),\] where \(D\) (Deviance) is given for

\[ D = -2\times log\left(\dfrac{\mbox{likelihood of the fitted model}}{\mbox{likelihood of the saturated model}}\right),\] that is, for which pair \((AGE_{i}, STA_{i}), \quad 1 \leq i \leq 200\),

\[ D = -2\times \sum_{i=1}^{200} \left[ STA_{i} \times log\left( \dfrac{\hat{\pi}(AGE_{i})}{STA_{i}}\right) + [1-STA_{i}] \times log\left( \dfrac{[1-\hat{\pi}(AGE_{i})]}{[1-STA_{i}]}\right) \right],\] with

\[\hat{\pi}(AGE_{i}) = \dfrac{e^{\beta_0 + \beta_{1}\times AGE_{i}}}{1+e^{\beta_0 + \beta_{1}\times AGE_{i}}}.\]

With results from code of the function glm we can do \[D = [\mbox{Null deviance} - \mbox{Residual deviance}]\] and, once \(G\) follows a chi-square distribution with \(1\) degree of freedom, we can find the significance of AGE in the model:

test = summary(mod)
(G = test$null.deviance - test$deviance)

[1] 7.854589

(p = pchisq(G, 1, lower.tail = F))

[1] 0.005069187

As the probability of finding such G assuming true \(H_{0}\) is very small (\(p <0.05\)) then reject the null hypotesis.

For Wald Test given for \(W = \dfrac{\hat{\beta_{1}}}{\hat{SE}(\hat{\beta_{1}})},\) the hypothesis are \(H_{0} :\beta_{1} = 0\) or \(H_{1} :\beta_{1} \neq 0\) .

We can see in summary table that z value and p value for AGE are concording with TRV test.

summary(mod)$coefficients[2,3] # z value

[1] 2.607174

summary(mod)$coefficients[2,4] # p value < 0.05

[1] 0.009129303

\[\hat{\pi}(AGE) = \dfrac{e^{ -3.06 + 0.03\times AGE}}{1+e^{-3.06 + 0.03\times AGE}}.\]

predit = predict(mod, type = "response")
ggplot() +
        geom_point(data = data, aes(x = AGE, y = STA, colour="blue")) + 
        geom_point(data = data, aes(x = AGE, y= predit, colour="red")) + 
        ylab("Probability of STA") +
        xlab("AGE") +
        ylim(0,1) +
        scale_colour_discrete(name="Plots",
                              labels=c("Observed", "Predict"))

References

1. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016.(https://www.R-project.org/)

2. Hosmer DW, Lemeshow S, Sturdivant R. Applied Logistic regression. 2013.(http://onlinelibrary.wiley.com/book/10.1002/9781118548387)