Week Two Homework

Questions accessed at:
Week Two Homework. (n.d.). Retrieved August 26, 2016, from https://learn.canvas.net/courses/1179/pages/week-two-homework?module_item _id=159930

Exercise One

For this exercise, you will need the Myopia Study dataset that we also used in week one. Download the MYOPIA.dta Stata file, or you can also access the data through this CSV file.

library(aplore3)
## Warning: package 'aplore3' was built under R version 3.2.5
library(ggplot2)
data(myopia)
str(myopia)
## 'data.frame':    618 obs. of  18 variables:
##  $ id       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ studyyear: int  1992 1995 1991 1990 1995 1995 1993 1991 1991 1991 ...
##  $ myopic   : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 1 ...
##  $ age      : int  6 6 6 6 5 6 6 6 7 6 ...
##  $ gender   : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 2 2 1 2 ...
##  $ spheq    : num  -0.052 0.608 1.179 0.525 0.697 ...
##  $ al       : num  21.9 22.4 22.5 22.2 23.3 ...
##  $ acd      : num  3.69 3.7 3.46 3.86 3.68 ...
##  $ lt       : num  3.5 3.39 3.51 3.61 3.45 ...
##  $ vcd      : num  14.7 15.3 15.5 14.7 16.2 ...
##  $ sporthr  : int  45 4 14 18 14 10 12 12 4 30 ...
##  $ readhr   : int  8 0 0 11 0 6 7 0 0 5 ...
##  $ comphr   : int  0 1 2 0 0 2 2 0 3 1 ...
##  $ studyhr  : int  0 1 0 0 0 1 1 0 1 0 ...
##  $ tvhr     : int  10 7 10 4 4 19 8 8 3 10 ...
##  $ diopterhr: int  34 12 14 37 4 44 36 8 12 27 ...
##  $ mommy    : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 1 ...
##  $ dadmy    : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 1 1 ...
summary(myopia)
##        id          studyyear    myopic         age           gender   
##  Min.   :  1.0   Min.   :1990   No :537   Min.   :5.000   Male  :316  
##  1st Qu.:155.2   1st Qu.:1991   Yes: 81   1st Qu.:6.000   Female:302  
##  Median :309.5   Median :1992             Median :6.000               
##  Mean   :309.5   Mean   :1992             Mean   :6.299               
##  3rd Qu.:463.8   3rd Qu.:1994             3rd Qu.:6.000               
##  Max.   :618.0   Max.   :1995             Max.   :9.000               
##      spheq               al             acd              lt       
##  Min.   :-0.6990   Min.   :19.90   Min.   :2.772   Min.   :2.960  
##  1st Qu.: 0.4562   1st Qu.:22.04   1st Qu.:3.424   1st Qu.:3.436  
##  Median : 0.7290   Median :22.46   Median :3.585   Median :3.542  
##  Mean   : 0.8010   Mean   :22.50   Mean   :3.579   Mean   :3.541  
##  3rd Qu.: 1.0340   3rd Qu.:22.97   3rd Qu.:3.730   3rd Qu.:3.640  
##  Max.   : 4.3720   Max.   :24.56   Max.   :4.250   Max.   :4.112  
##       vcd           sporthr          readhr           comphr      
##  Min.   :13.38   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:14.93   1st Qu.: 6.00   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median :15.36   Median :10.00   Median : 2.000   Median : 1.000  
##  Mean   :15.38   Mean   :11.95   Mean   : 2.796   Mean   : 2.105  
##  3rd Qu.:15.84   3rd Qu.:16.00   3rd Qu.: 4.000   3rd Qu.: 3.000  
##  Max.   :17.30   Max.   :45.00   Max.   :20.000   Max.   :30.000  
##     studyhr           tvhr          diopterhr      mommy     dadmy    
##  Min.   : 0.00   Min.   : 0.000   Min.   :  2.00   No :305   No :310  
##  1st Qu.: 0.00   1st Qu.: 4.250   1st Qu.: 15.00   Yes:313   Yes:308  
##  Median : 1.00   Median : 8.000   Median : 23.00                      
##  Mean   : 1.49   Mean   : 8.948   Mean   : 26.02                      
##  3rd Qu.: 2.00   3rd Qu.:12.000   3rd Qu.: 34.00                      
##  Max.   :15.00   Max.   :31.000   Max.   :101.00

Complete the following:

Using the results of the output from Stata, assess the significance of the slope coefficient for SPHEQ using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?

fit <- glm(myopic ~ spheq, data = myopia, family = binomial(link = logit))
fit$coef
## (Intercept)       spheq 
##  0.05397315 -3.83309762
summary(fit)$coef
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.05397315  0.2067483  0.2610573 7.940483e-01
## spheq       -3.83309762  0.4183696 -9.1619878 5.094998e-20
(vcov.fit <- vcov(fit))
##             (Intercept)       spheq
## (Intercept)  0.04274486 -0.06337768
## spheq       -0.06337768  0.17503316
(wald <- (fit$coef[2]-0)/sqrt(vcov.fit[2,2]))
##     spheq 
## -9.161988
(p.wald <- pnorm(wald, lower.tail = TRUE))
##        spheq 
## 2.547499e-20
(p.wald <- p.wald * 2)
##        spheq 
## 5.094998e-20
summary(fit)$coef[2,4]
## [1] 5.094998e-20
null.model <- glm(myopic ~ 1, data = myopia, family = binomial(link = logit))
(LR <- -2*(logLik(null.model) -logLik(fit)))
## 'log Lik.' 142.7321 (df=1)
(p.val <- pchisq(LR, df = 1, lower.tail = FALSE))
## 'log Lik.' 6.726636e-33 (df=1)
hist(myopia$spheq, prob = TRUE, ylim = c(0, 1))
lines(density(myopia$spheq), col = "red")

fit$null.deviance
## [1] 480.077
fit$deviance
## [1] 337.3449

For Wald test we test the null hypothesis that the slope of the variable is equal to zero. Wald statistic follows normal distribution with mean zero and standard deviaton of 1.
For likelihood ratio test we test the null hypothesis that adding a variable of interest to the model doesn’t significantly contribute to prediction of the outcome, i.e. that the deviance of the model with the added variable isn’t significantly smaller than the deviance without the variable. This test indirectly test the null hypothesis that \(\beta_1 = 0\). Likelihood Ratio test statistic is distributed like a chi-square distribution with 1 degree of freedom if the added variable is a continuous variable. For likelihood ratio test we assume that the two models are nested, i.e. that the bigger model is the same as the smaller model plus only 1 variable added.

Exercise Two

For this exercise, you will need the ICU Study dataset that we also used in week one. Download the icu.dta Stata file, or you can also access the data through this CSV file.

Using the results of the output from the logistic regression package used for Exercise Two - Part (D) of week one, assess the significance of the slope coefficient for AGE using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?

To recall from week one: exercise two-D prompted you to obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=200, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatterplots in part (B).

data(icu)
fit2 <- glm(sta ~ age, data = icu, family = binomial(link = logit))
summary(fit2)
## 
## Call:
## glm(formula = sta ~ age, family = binomial(link = logit), data = icu)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9536  -0.7391  -0.6145  -0.3905   2.2854  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.05851    0.69608  -4.394 1.11e-05 ***
## age          0.02754    0.01056   2.607  0.00913 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 200.16  on 199  degrees of freedom
## Residual deviance: 192.31  on 198  degrees of freedom
## AIC: 196.31
## 
## Number of Fisher Scoring iterations: 4
fit2$coef
## (Intercept)         age 
## -3.05851323  0.02754261
(vcov.fit2 <- vcov(fit2))
##              (Intercept)           age
## (Intercept)  0.484529087 -0.0071029945
## age         -0.007102994  0.0001116015
(wald2 <- (fit2$coef[2]-0)/sqrt(vcov.fit2[2,2]))
##      age 
## 2.607174
(p.wald2 <- pnorm(wald2, lower.tail = FALSE))
##         age 
## 0.004564651
(p.wald2 <- p.wald2 * 2)
##         age 
## 0.009129303
summary(fit2)$coef[2,4]
## [1] 0.009129303
null.model2 <- glm(sta ~ 1, data = icu, family = binomial(link = logit))
(LR2 <- -2 * (logLik(null.model2) - logLik(fit2)))
## 'log Lik.' 7.854589 (df=1)
(p.val2 <- pchisq(LR2, df = 1, lower.tail = FALSE))
## 'log Lik.' 0.005069187 (df=1)
hist(icu$age, prob = TRUE)
lines(density(icu$age), col = "red")

fit2$deviance
## [1] 192.3064
fit2$null.deviance
## [1] 200.161
(deviance <- -2*logLik(fit2))
## 'log Lik.' 192.3064 (df=2)