Homework #8

Exercise 9.2: Consider the data set DaytonSurvey (from Example 2.6), giving the results of a survey of the use of alcohol (A), cigarettes (C), and marijuana (M) among high school seniors. For this exercise, ignore the variables sex and race by working with the marginal table Dayton.ACM, a 2 x 2 x 2 table in frequency data frame form.

library(stats)
library(ggplot2)
library(vcd)
## Loading required package: grid
library(vcdExtra)
## Loading required package: gnm
library(ca)
library(logmult)
## 
## Attaching package: 'logmult'
## The following object is masked from 'package:gnm':
## 
##     se
## The following object is masked from 'package:vcd':
## 
##     assoc
library(MASS)

Construct and look at the marginal table Dayton.ACM:

Dayton.ACM <- aggregate(Freq ~ cigarette + alcohol + marijuana, data = DaytonSurvey, FUN = sum)
Dayton.ACM
##   cigarette alcohol marijuana Freq
## 1       Yes     Yes       Yes  911
## 2        No     Yes       Yes   44
## 3       Yes      No       Yes    3
## 4        No      No       Yes    2
## 5       Yes     Yes        No  538
## 6        No     Yes        No  456
## 7       Yes      No        No   43
## 8        No      No        No  279
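
The same frequencies can also be put in 2 x 2 x 2 table form, which is convenient for mosaic displays or for passing a table directly to loglm(). A minimal sketch with xtabs() (Dayton.tab is just an illustrative name; output not shown):

# same frequencies as a 2 x 2 x 2 contingency table
Dayton.tab <- xtabs(Freq ~ cigarette + alcohol + marijuana, data = Dayton.ACM)
ftable(Dayton.tab)   # flat view of the three-way table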

Fit the following models:

(a) Joint independence, [AC][M]:

joint.ind <- aggregate(Freq ~ alcohol*cigarette + marijuana, data = DaytonSurvey, FUN = sum)
joint.ind
##   alcohol cigarette marijuana Freq
## 1     Yes       Yes       Yes  911
## 2      No       Yes       Yes    3
## 3     Yes        No       Yes   44
## 4      No        No       Yes    2
## 5     Yes       Yes        No  538
## 6      No       Yes        No   43
## 7     Yes        No        No  456
## 8      No        No        No  279
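
The aggregate() call above just collapses over sex and race again, giving the same eight cells as Dayton.ACM. To actually fit the joint-independence model [AC][M] to the marginal table, a loglm() call along these lines could be used (a sketch; output not shown):

# [AC][M]: marijuana jointly independent of the alcohol-cigarette combinations
loglm(Freq ~ alcohol * cigarette + marijuana, data = Dayton.ACM)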

(b) Conditional independence, [AM][CM]:

condition.ind <- aggregate(Freq ~ marijuana*(cigarette + alcohol), data = DaytonSurvey, FUN = sum)
condition.ind
##   marijuana cigarette alcohol Freq
## 1       Yes       Yes     Yes  911
## 2        No       Yes     Yes  538
## 3       Yes        No     Yes   44
## 4        No        No     Yes  456
## 5       Yes       Yes      No    3
## 6        No       Yes      No   43
## 7       Yes        No      No    2
## 8        No        No      No  279
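
Similarly, a sketch of fitting the conditional independence model [AM][CM] (alcohol and cigarette use independent, given marijuana use) to the marginal table (output not shown):

# [AM][CM]: A independent of C, conditional on M
loglm(Freq ~ marijuana * (alcohol + cigarette), data = Dayton.ACM)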

(c) Homogeneous association model, [AC][AM][CM]:

homogen.mod <- aggregate(Freq ~ (alcohol + cigarette + marijuana)^2, data = DaytonSurvey, FUN = sum)
homogen.mod
##   alcohol cigarette marijuana Freq
## 1     Yes       Yes       Yes  911
## 2      No       Yes       Yes    3
## 3     Yes        No       Yes   44
## 4      No        No       Yes    2
## 5     Yes       Yes        No  538
## 6      No       Yes        No   43
## 7     Yes        No        No  456
## 8      No        No        No  279
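
And a corresponding sketch for the homogeneous association model, which includes all three two-way terms but no three-way term (output not shown):

# [AC][AM][CM]: all two-way associations, no three-way association
loglm(Freq ~ (alcohol + cigarette + marijuana)^2, data = Dayton.ACM)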

(d) Prepare a table giving the goodness-of-fit tests for these models, as well as the model of mutual independence, [A][C][M], and the saturated model, [ACM]. Hint: anova() is useful here. Which model appears to give the most reasonable fit? Explain your interpretation.

Fit each model with loglm() and examine its goodness-of-fit tests. (Note that these models are fit to the full DaytonSurvey frequency data frame, which still includes sex and race, so the residual degrees of freedom are larger than they would be for the 2 x 2 x 2 marginal table Dayton.ACM.)

(1) Joint independence, [AC][M]:

Dayton.loglm0 <- loglm(Freq ~ alcohol*cigarette + marijuana, data = DaytonSurvey, param = TRUE, fitted = TRUE)
Dayton.loglm0
## Call:
## loglm(formula = Freq ~ alcohol * cigarette + marijuana, data = DaytonSurvey, 
##     param = TRUE, fitted = TRUE)
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 2810.417 27        0
## Pearson          2869.945 27        0

(2) Conditional independence, [AM][CM]:

Dayton.loglm1 <- loglm(Freq ~ marijuana*(cigarette + alcohol), data = DaytonSurvey, param = TRUE, fitted = TRUE)
Dayton.loglm1
## Call:
## loglm(formula = Freq ~ marijuana * (cigarette + alcohol), data = DaytonSurvey, 
##     param = TRUE, fitted = TRUE)
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 2154.344 26        0
## Pearson          1974.994 26        0

(3) Homogeneous association model, [AC][AM][CM]:

Dayton.loglm2 <- loglm(Freq ~ (alcohol + cigarette + marijuana)^2, data = DaytonSurvey, param = TRUE, fitted = TRUE)
Dayton.loglm2
## Call:
## loglm(formula = Freq ~ (alcohol + cigarette + marijuana)^2, data = DaytonSurvey, 
##     param = TRUE, fitted = TRUE)
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 1966.964 25        0
## Pearson          1667.735 25        0

(4) Mutual independence, [A][C][M]:

Dayton.loglm3 <- loglm(Freq ~ alcohol + cigarette + marijuana, data = DaytonSurvey, param = TRUE, fitted = TRUE) 
Dayton.loglm3
## Call:
## loglm(formula = Freq ~ alcohol + cigarette + marijuana, data = DaytonSurvey, 
##     param = TRUE, fitted = TRUE)
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 3252.610 28        0
## Pearson          4055.973 28        0

(5) Saturated model, [ACM]:

Dayton.loglm4 <- loglm(Freq ~ alcohol * cigarette * marijuana, data = DaytonSurvey, param = TRUE, fitted = TRUE)
Dayton.loglm4
## Call:
## loglm(formula = Freq ~ alcohol * cigarette * marijuana, data = DaytonSurvey, 
##     param = TRUE, fitted = TRUE)
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 1966.590 24        0
## Pearson          1666.879 24        0
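
Part (d) asks for a table of the goodness-of-fit tests. One convenient way to collect them is vcdExtra::LRstats() applied to the fitted loglm objects; a sketch (output not shown):

# LR chi-square, df, AIC, and BIC for all five models in one table
LRstats(Dayton.loglm3, Dayton.loglm0, Dayton.loglm1, Dayton.loglm2, Dayton.loglm4)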

Let's use anova() to compare the nested models and see which appears to give the most reasonable fit:

anova(Dayton.loglm0, Dayton.loglm1, Dayton.loglm2, Dayton.loglm3, Dayton.loglm4, test = "Chisq")
## LR tests for hierarchical log-linear models
## 
## Model 1:
##  Freq ~ alcohol + cigarette + marijuana 
## Model 2:
##  Freq ~ alcohol * cigarette + marijuana 
## Model 3:
##  Freq ~ marijuana * (cigarette + alcohol) 
## Model 4:
##  Freq ~ (alcohol + cigarette + marijuana)^2 
## Model 5:
##  Freq ~ alcohol * cigarette * marijuana 
## 
##           Deviance df   Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1   3252.610 28                                      
## Model 2   2810.417 27  442.1933108         1        0.00000
## Model 3   2154.344 26  656.0723408         1        0.00000
## Model 4   1966.964 25  187.3803170         1        0.00000
## Model 5   1966.590 24    0.3739859         1        0.54084
## Saturated    0.000  0 1966.5899596        24        0.00000

Results: The deviance (G^2) of a saturated model is always 0, so the Saturated row in the anova output fits perfectly by construction. Setting the saturated model aside, the most reasonable fit is the homogeneous association model [AC][AM][CM] (Model 4 in the anova output): adding the three-way term to give [ACM] (Model 5) reduces the deviance by only 0.374 on 1 df (p = 0.54), so the three-way association is not needed. The conditional independence model [AM][CM] (Model 3) is the next best, but adding the [AC] term to it reduces the deviance by a further 187.4 on 1 df (p < 0.001), so the homogeneous association model is preferred.