Questions accessed at:
Week Two Homework. (n.d.). Retrieved August 26, 2016, from https://learn.canvas.net/courses/1179/pages/week-two-homework?module_item_id=159930
For this exercise, you will need the Myopia Study dataset that we also used in week one. Download the MYOPIA.dta Stata file, or access the data through the CSV file. (In this write-up the data are instead loaded from the aplore3 package.)
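If you prefer to read the downloaded file directly rather than use an R package, a minimal sketch is below; the haven package and the CSV file name are assumptions here, not something used in the rest of this write-up.

# read the Stata file with the haven package (assumed to be installed)
myopia_raw <- haven::read_dta("MYOPIA.dta")
# or read the CSV export (the file name is assumed)
myopia_raw <- read.csv("myopia.csv")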
library(aplore3)
## Warning: package 'aplore3' was built under R version 3.2.5
library(ggplot2)
data(myopia)
str(myopia)
## 'data.frame': 618 obs. of 18 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ studyyear: int 1992 1995 1991 1990 1995 1995 1993 1991 1991 1991 ...
## $ myopic : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 1 ...
## $ age : int 6 6 6 6 5 6 6 6 7 6 ...
## $ gender : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 2 2 1 2 ...
## $ spheq : num -0.052 0.608 1.179 0.525 0.697 ...
## $ al : num 21.9 22.4 22.5 22.2 23.3 ...
## $ acd : num 3.69 3.7 3.46 3.86 3.68 ...
## $ lt : num 3.5 3.39 3.51 3.61 3.45 ...
## $ vcd : num 14.7 15.3 15.5 14.7 16.2 ...
## $ sporthr : int 45 4 14 18 14 10 12 12 4 30 ...
## $ readhr : int 8 0 0 11 0 6 7 0 0 5 ...
## $ comphr : int 0 1 2 0 0 2 2 0 3 1 ...
## $ studyhr : int 0 1 0 0 0 1 1 0 1 0 ...
## $ tvhr : int 10 7 10 4 4 19 8 8 3 10 ...
## $ diopterhr: int 34 12 14 37 4 44 36 8 12 27 ...
## $ mommy : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 1 ...
## $ dadmy : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 1 1 ...
summary(myopia)
## id studyyear myopic age gender
## Min. : 1.0 Min. :1990 No :537 Min. :5.000 Male :316
## 1st Qu.:155.2 1st Qu.:1991 Yes: 81 1st Qu.:6.000 Female:302
## Median :309.5 Median :1992 Median :6.000
## Mean :309.5 Mean :1992 Mean :6.299
## 3rd Qu.:463.8 3rd Qu.:1994 3rd Qu.:6.000
## Max. :618.0 Max. :1995 Max. :9.000
## spheq al acd lt
## Min. :-0.6990 Min. :19.90 Min. :2.772 Min. :2.960
## 1st Qu.: 0.4562 1st Qu.:22.04 1st Qu.:3.424 1st Qu.:3.436
## Median : 0.7290 Median :22.46 Median :3.585 Median :3.542
## Mean : 0.8010 Mean :22.50 Mean :3.579 Mean :3.541
## 3rd Qu.: 1.0340 3rd Qu.:22.97 3rd Qu.:3.730 3rd Qu.:3.640
## Max. : 4.3720 Max. :24.56 Max. :4.250 Max. :4.112
## vcd sporthr readhr comphr
## Min. :13.38 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.:14.93 1st Qu.: 6.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median :15.36 Median :10.00 Median : 2.000 Median : 1.000
## Mean :15.38 Mean :11.95 Mean : 2.796 Mean : 2.105
## 3rd Qu.:15.84 3rd Qu.:16.00 3rd Qu.: 4.000 3rd Qu.: 3.000
## Max. :17.30 Max. :45.00 Max. :20.000 Max. :30.000
## studyhr tvhr diopterhr mommy dadmy
## Min. : 0.00 Min. : 0.000 Min. : 2.00 No :305 No :310
## 1st Qu.: 0.00 1st Qu.: 4.250 1st Qu.: 15.00 Yes:313 Yes:308
## Median : 1.00 Median : 8.000 Median : 23.00
## Mean : 1.49 Mean : 8.948 Mean : 26.02
## 3rd Qu.: 2.00 3rd Qu.:12.000 3rd Qu.: 34.00
## Max. :15.00 Max. :31.000 Max. :101.00
Using the results of the output from Stata, assess the significance of the slope coefficient for SPHEQ using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?
fit <- glm(myopic ~ spheq, data = myopia, family = binomial(link = logit))
fit$coef
## (Intercept) spheq
## 0.05397315 -3.83309762
summary(fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.05397315 0.2067483 0.2610573 7.940483e-01
## spheq -3.83309762 0.4183696 -9.1619878 5.094998e-20
(vcov.fit <- vcov(fit))
## (Intercept) spheq
## (Intercept) 0.04274486 -0.06337768
## spheq -0.06337768 0.17503316
# Wald statistic: the estimate divided by its estimated standard error
(wald <- (fit$coef[2]-0)/sqrt(vcov.fit[2,2]))
## spheq
## -9.161988
# left-tail probability, appropriate here because the Wald statistic is negative
(p.wald <- pnorm(wald, lower.tail = TRUE))
## spheq
## 2.547499e-20
# doubled for the two-sided p-value
(p.wald <- p.wald * 2)
## spheq
## 5.094998e-20
summary(fit)$coef[2,4]
## [1] 5.094998e-20
null.model <- glm(myopic ~ 1, data = myopia, family = binomial(link = logit))
# likelihood ratio statistic: -2 times the difference in log-likelihoods
(LR <- -2 * (logLik(null.model) - logLik(fit)))
## 'log Lik.' 142.7321 (df=1)
(p.val <- pchisq(LR, df = 1, lower.tail = FALSE))
## 'log Lik.' 6.726636e-33 (df=1)
# distribution of SPHEQ with a kernel density overlay
hist(myopia$spheq, prob = TRUE, ylim = c(0, 1))
lines(density(myopia$spheq), col = "red")
fit$null.deviance
## [1] 480.077
fit$deviance
## [1] 337.3449
For the Wald test we test the null hypothesis that the slope coefficient of the variable is equal to zero. Under this null hypothesis, and assuming the sample is large enough for the maximum likelihood estimator to be approximately normally distributed, the Wald statistic \(W = \hat{\beta}_1 / \widehat{SE}(\hat{\beta}_1)\) follows a standard normal distribution (mean zero, standard deviation one).
For the likelihood ratio test we test the null hypothesis that adding the variable of interest to the model does not significantly improve prediction of the outcome, i.e. that the deviance of the model with the added variable is not significantly smaller than the deviance of the model without it. This indirectly tests the null hypothesis that \(\beta_1 = 0\). Under the null hypothesis, and again in large samples, the likelihood ratio statistic \(G = D(\text{without the variable}) - D(\text{with the variable})\) is approximately chi-square distributed, with degrees of freedom equal to the number of parameters added (one here, since SPHEQ enters as a single continuous variable). The test also assumes the two models are nested, i.e. that the larger model is the smaller model plus the added variable.

Here the two tests are consistent with one another: both give extremely small p-values (Wald p \(\approx 5.1 \times 10^{-20}\), likelihood ratio p \(\approx 6.7 \times 10^{-33}\)), so SPHEQ is a highly significant predictor of myopia. The deviance of the fitted model is 337.34 (null deviance 480.08), and the likelihood ratio statistic is their difference, \(480.08 - 337.34 = 142.73\).
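As a quick cross-check on the computations above, both tests are also available directly from base R: summary() already reports the two-sided Wald p-value, and anova() on the two nested models reproduces the likelihood ratio (deviance) test. A minimal sketch, reusing the fit and null.model objects from above (output not shown here):

# two-sided Wald p-value; abs() makes the sign of the statistic irrelevant
2 * pnorm(-abs(wald))
# analysis-of-deviance comparison of the nested models; the "Deviance" column
# is the likelihood ratio statistic (null deviance minus residual deviance)
anova(null.model, fit, test = "Chisq")
# deviance of the fitted model, two equivalent ways
fit$deviance
deviance(fit)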
For this exercise, you will need the ICU Study dataset that we also used in week one. Download the icu.dta Stata file, or you can also access the data through this CSV file.
Using the results of the output from the logistic regression package used for Exercise Two - Part (D) of week one, assess the significance of the slope coefficient for AGE using the likelihood ratio test and the Wald test. What assumptions are needed for the p-values computed for each of these tests to be valid? Are the results of these tests consistent with one another? What is the value of the deviance for the fitted model?
To recall from week one: exercise two, part (D) prompted you to obtain the maximum likelihood estimates of the parameters of the logistic regression model from part (A), based on the ungrouped, n = 200, data. Using these estimates, write down the equation for the fitted values, i.e. the estimated logistic probabilities, and plot it on the axes used in the scatterplots of part (B). (A plotting sketch based on the fitted coefficients is given at the end of this section.)
data(icu)
fit2 <- glm(sta ~ age, data = icu, family = binomial(link = logit))
summary(fit2)
##
## Call:
## glm(formula = sta ~ age, family = binomial(link = logit), data = icu)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9536 -0.7391 -0.6145 -0.3905 2.2854
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.05851 0.69608 -4.394 1.11e-05 ***
## age 0.02754 0.01056 2.607 0.00913 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 200.16 on 199 degrees of freedom
## Residual deviance: 192.31 on 198 degrees of freedom
## AIC: 196.31
##
## Number of Fisher Scoring iterations: 4
fit2$coef
## (Intercept) age
## -3.05851323 0.02754261
(vcov.fit2 <- vcov(fit2))
## (Intercept) age
## (Intercept) 0.484529087 -0.0071029945
## age -0.007102994 0.0001116015
(wald2 <- (fit2$coef[2]-0)/sqrt(vcov.fit2[2,2]))
## age
## 2.607174
# right-tail probability here, because this Wald statistic is positive
(p.wald2 <- pnorm(wald2, lower.tail = FALSE))
## age
## 0.004564651
(p.wald2 <- p.wald2 * 2)
## age
## 0.009129303
summary(fit2)$coef[2,4]
## [1] 0.009129303
null.model2 <- glm(sta ~ 1, data = icu, family = binomial(link = logit))
(LR2 <- -2 * (logLik(null.model2) - logLik(fit2)))
## 'log Lik.' 7.854589 (df=1)
(p.val2 <- pchisq(LR2, df = 1, lower.tail = FALSE))
## 'log Lik.' 0.005069187 (df=1)
hist(icu$age, prob = TRUE)
lines(density(icu$age), col = "red")
fit2$deviance
## [1] 192.3064
fit2$null.deviance
## [1] 200.161
# for ungrouped binary data the saturated log-likelihood is 0, so -2*logLik equals the deviance
(deviance <- -2*logLik(fit2))
## 'log Lik.' 192.3064 (df=2)
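As with the myopia data, the two tests agree: the Wald test gives p \(\approx 0.0091\) and the likelihood ratio test gives p \(\approx 0.0051\), both significant at the 0.05 level, and the deviance of the fitted model is 192.31 (null deviance 200.16). To complete the week-one recall above, the fitted values are \(\hat{\pi}(age) = \frac{e^{-3.059 + 0.028\,age}}{1 + e^{-3.059 + 0.028\,age}}\). A minimal sketch of the plot, using the ggplot2 package loaded earlier; it assumes sta is a two-level factor whose second level is the event, so it is converted to 0/1 for the vertical axis (output not shown here):

# grid of ages spanning the observed range, with fitted probabilities
age_grid <- data.frame(age = seq(min(icu$age), max(icu$age), length.out = 200))
age_grid$phat <- predict(fit2, newdata = age_grid, type = "response")
# observed 0/1 outcomes with the fitted logistic curve overlaid
ggplot(icu, aes(x = age, y = as.numeric(sta) - 1)) +
  geom_point(alpha = 0.3) +
  geom_line(data = age_grid, aes(x = age, y = phat), colour = "blue") +
  labs(x = "Age", y = "Estimated probability of the outcome (STA)")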