Question 1

A recent General Social Survey asked subjects whether they believed in heaven and whether they believed in hell. Table 8.9 shows the results.

a. Test the hypothesis that the population proportions answering yes were identical for heaven and hell.

heaven.xtab <- matrix(c(160, 2, 125, 833), ncol=2, byrow=T)
heaven.xtab
##      [,1] [,2]
## [1,]  160    2
## [2,]  125  833
(125+833)/1120
## [1] 0.8553571

Probability that subject answered “yes” when asked if they believe in Heaven. P(Y1=1)=0.8553571

(2+833)/1120
## [1] 0.7455357

Probability that subject answered “yes” when asked if they believe in Hell. P(Y2=1)=0.7455357

mcnemar.test(heaven.xtab, correct=F)
## 
##  McNemar's Chi-squared test
## 
## data:  heaven.xtab
## McNemar's chi-squared = 119.13, df = 1, p-value < 2.2e-16

P-value < 0.0001, the evidence against marginal homogeneity is strong. We reject the null hypothesis and have sufficient evidence to suggest that the population proportions that answered “Yes” were not identical for Heaven and Hell.

b. Find a 90% confidence interval for the difference between the population proportions. Interpret.

library(PropCIs)
diffpropci.Wald.mp(2, 125, 1120, 0.95)
## 
## 
## 
## data:  
## 
## 95 percent confidence interval:
##  0.09117856 0.12846429
## sample estimates:
## [1] 0.1098214

95% CI = (0.09117856, 0.12846429). The probability of a “Yes” response was at least 0.0912 and at most 0.1285 higher for belief in Heavan than belief in Hell

c. Estimate and interpret the odds ratio for a logistic model for the probability of a yes response as a function of the item (heaven or hell), using the (i) marginal model (8.2), (ii) subject-specific model (8.3). Why are the estimates different?

n<-1120

# Marginal Model
beta.mm<-log((958/(n-958))/(835/(n-835)))
beta.mm
## [1] 0.7023089
exp(beta.mm)
## [1] 2.018408
# Subject-Specific Model
beta.ss<-125/2; beta.ss
## [1] 62.5

Marginal Model: The population odds of belief is Heaven are approximately exp(0.7023089)=2.02 times higher than the population odds of belief in Hell.
Subject-Specific Model: A subject’s estimated odds of a “Yes” response are 62.5 times higher for belief in Heaven than for belief in Hell.
These estimates can differ because the marginal model is collapsing over the 3rd variable (subject) in the subject-specific model. This is similar to when we looked at Simpson’s Paradox and the differences in estimates depending on whether or not you control for a third variable z. The subject-specific model is estimating the relationship between x and y, after controlling for z (subject). The marginal model estimates the relationship between x and y based on a 2-way contingency table, not accounting for subject.

Question 2

Table 8.12, from the 2016 General Social Survey, reports region of residence now and at age 16 for residents of the US.

a. Test marginal homogeneity by comparing fits of two models.

# Trying out Symmetry and Quasi-Symmetry models
survey2 <- data.frame(ne=c(1,1,1,0,0,0),
                      mw=c(-1,0,0,1,1,0),
                      s=c(0,-1,0,-1,0,1),
                      w=c(0,0,-1,0,-1,-1),
                      n.ij=c(17,81,38,74,59,35),
                      n.ji=c(8,29,10,32,24,35))

# Symmetry Model
symm <- glm(n.ij/(n.ij+n.ji) ~ -1, family=binomial, weights=n.ij+n.ji,data=survey2)
summary(symm)
## 
## Call:
## glm(formula = n.ij/(n.ij + n.ji) ~ -1, family = binomial, data = survey2, 
##     weights = n.ij + n.ji)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  0.000   2.341   4.019   4.164   5.059  
## 
## No Coefficients
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78.657  on 6  degrees of freedom
## Residual deviance: 78.657  on 6  degrees of freedom
## AIC: 105.38
## 
## Number of Fisher Scoring iterations: 0
1-pchisq(78.657, 6)
## [1] 6.77236e-15
survey2$std.R <- survey2$n.ij-survey2$n.ji/sqrt(survey2$n.ij+survey2$n.ji)
survey2[1:6,7]
## [1] 15.40000 78.23496 36.55662 70.89189 56.36566 30.81670
# Quasi-Symmetry Model
QS <- glm(n.ij/(n.ij+n.ji) ~ -1 + ne + mw + s + w, family=binomial,weights=n.ij+n.ji, data=survey2)
summary(QS)
## 
## Call:
## glm(formula = n.ij/(n.ij + n.ji) ~ -1 + ne + mw + s + w, family = binomial, 
##     data = survey2, weights = n.ij + n.ji)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
##  0.8708  -0.6262   0.2526   0.2935   0.1705  -0.3423  
## 
## Coefficients: (1 not defined because of singularities)
##    Estimate Std. Error z value Pr(>|z|)    
## ne  1.24600    0.20952   5.947 2.73e-09 ***
## mw  0.85831    0.17989   4.771 1.83e-06 ***
## s   0.08184    0.17330   0.472    0.637    
## w        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78.6573  on 6  degrees of freedom
## Residual deviance:  1.4466  on 3  degrees of freedom
## AIC: 34.172
## 
## Number of Fisher Scoring iterations: 3
1-pchisq(1.4466, 3)
## [1] 0.6946487
1-pchisq(78.6573-1.4466, 6-3)
## [1] 1.110223e-16

The symmetry model has a residual deviance of 78.657 on 6 degrees of freedom which indicates that the model fits poorly. The standardized residuals are all extremely large (ranging from 15 to 78) and are all contributing to the lack of fit, although some more than others (i.e. n13 vs n31 has standardized residual of 78.2).
The quasi-symmetry model has a residual deviance of 1.4466 on 3 degrees of freedom, and therefore fits the data pretty well. The deviance difference between the symmetry model and quasi-symmetry model is (78.6573-1.4466)=77.2107 on 3 degrees of freedom, which provides very strong evidence of marginal heterogeneity (p<.0001).

c. In fitting the quasi-symmetry (QS) model to these data where the B4 parameter (for “West”) is set to zero (as in SAS), clearly interpret in context the value exp(B1) where B1 multiplies “Northeast” in the QS model

exp(1.24600)
## [1] 3.476409

The estimated probability for former Northeast residents (at age 16) moving and currently living in the West is 3.4764 times the estimated probability of former West residents currently living in the Northeast.

d. Clearly interpret in context the value exp(B2-B1) where B1 multiplies “Northeast” and B2 multiplies “Midwest” in the QS model

exp(1.24600-0.85831)
## [1] 1.473573

The estimated probability of moving from the Northeast to the Midwest is 1.4736 times the estimated probability of moving from the Midwest and currently living in the Northeast.

Question 3

Table 8.15 displays diagnoses of multiple sclerosis for two neurologists. The categories are (1) Certain multiple sclerosis, (2) Probable multiple sclerosis, (3) Possible multiple sclerosis, (4) Doubtful, unlikely, or definitely not multiple sclerosis. Analyze the agreement.

(a) Use the independence model and residuals to study the pattern of agreement and interpret; report the appropriate residuals into a 4 x 4 table as done e.g. on p.243

neuro.xtab <- matrix(c(38,5,0,1,33,11,3,0,10,14,5,6,3,7,3,10), ncol=4, byrow=T)
std.r <- chisq.test(neuro.xtab)$stdres
## Warning in chisq.test(neuro.xtab): Chi-squared approximation may be
## incorrect
round(std.r,2)
##       [,1]  [,2]  [,3]  [,4]
## [1,]  4.78 -2.46 -2.23 -2.27
## [2,]  2.31 -0.27 -0.32 -2.97
## [3,] -3.79  2.37  1.79  1.22
## [4,] -4.56  0.68  1.13  5.26

We know that positive standardized residuals indicate that there are higher frequencies than expected under independence, which exists when there is no association between the findings. We would expect to see large positive standardized risuals along the along the main diagonal. Interestingly, one of the cells (n22) on our main diagonal has a negative standardized residual, albeit a small negative number. The expected frequency was 11.7 and we observed 11 in our dataset. The other 3 cells on the main diagonal (n11, n33, n44) all showed a high degree of agreement with n11 and n44 having the two largest positive SD in the 4x4 table. The most common disagreements were due to Neurologist B determining a diagnosis of certain multiple sclerosis.

I was expecting the largest negative std. residuals to be located on the bottom left and upper right corners of the chart, where the neurologists findings conflicted to an extreme degree (certain MS vs doubtful, unlikely, or definitely not MS.). N41 had a negative SR=-4.56, which indicates a high level of disagreement and is to be expected. I was surprised to see that cells n31, n12, and n24 all had larger negative SR than cell n14. T

(c) Regardless of the quality of fit, fit the ordinal quasi-symmetry (OQS) model to these data and clearly interpret the estimated B parameter in context in terms of an odds ratio or probability

summary(OQS)
## 
## Call:
## glm(formula = nij/(nij + nji) ~ -1 + x, family = binomial, data = neuro2, 
##     weights = nij + nji)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -1.4175  -1.2483   1.7952  -0.4599  -1.0444   2.8512  
## 
## Coefficients:
##   Estimate Std. Error z value Pr(>|z|)    
## x  -1.2565     0.2458  -5.111  3.2e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 56.442  on 6  degrees of freedom
## Residual deviance: 16.222  on 5  degrees of freedom
## AIC: 28.684
## 
## Number of Fisher Scoring iterations: 5

The estimated probability that Neurosurgeon B’s diagnosis is x categories higher than Neurosurgeon A’s diagnosis equals exp(-1.2565x) times the reverse probability.

(d) Use kappa (whichever version makes the most sense here) to describe agreement and clearly interpret the estimated kappa in context

cohen.kappa(neuro.xtab)
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
## 
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
##                  lower estimate upper
## unweighted kappa  0.11     0.21  0.31
## weighted kappa    0.40     0.52  0.65
## 
##  Number of subjects = 149

We will use weighted Kappa given that the data is ordinal in it’s structure, going from certain MS to doubtful/unlikely/definitely not MS. The difference between the observed agreement and expected agreement under independence is roughly 52% of the maximum possible difference.

Quesion 4

[STAT-410 students only] The table below summarizes the results of tennis matches for several women professional players between 2003 and 2005.

(a) Fit the Bradley-Terry (BT) model, report parameter estimates, and rank the players

tennis <- data.frame( C=c(1,1,1,1,0,0,0,0,0,0),
                      D=c(-1,0,0,0,1,1,1,0,0,0),
                      P=c(0,-1,0,0,-1,0,0,1,1,0),
                      SW=c(0,0,-1,0,0,-1,0,-1,0,1),
                      VW=c(0,0,0,-1,0,0,-1,0,-1,-1),
                      nij=c(6,3,0,2,0,2,4,0,1,2),
                      nji=c(2,1,2,3,2,2,2,2,2,2))

tennis.fit <- glm(nij/(nij+nji) ~ -1 + C + D + P + VW + SW, family=binomial, weights=nij+nji, data=tennis)
summary(tennis.fit)
## 
## Call:
## glm(formula = nij/(nij + nji) ~ -1 + C + D + P + VW + SW, family = binomial, 
##     data = tennis, weights = nij + nji)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7732  -0.9924  -0.3752   0.7266   1.1643  
## 
## Coefficients: (1 not defined because of singularities)
##    Estimate Std. Error z value Pr(>|z|)
## C   -0.3918     0.7290  -0.537    0.591
## D   -0.8388     0.6986  -1.201    0.230
## P   -1.0167     0.8256  -1.232    0.218
## VW  -0.5592     0.6957  -0.804    0.422
## SW       NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12.678  on 10  degrees of freedom
## Residual deviance: 10.117  on  6  degrees of freedom
## AIC: 32.068
## 
## Number of Fisher Scoring iterations: 4
c<-coef(tennis.fit)
round(exp(c[1:5]),2)
##    C    D    P   VW   SW 
## 0.68 0.43 0.36 0.57   NA
1-pchisq(10.117,6)
## [1] 0.1198106

The model has a deviance of 10.117 on 6 degrees of freedom and is an adequate fit for the data. The players ranked from best to worst are as follows: (1) S. Williams (2) Clijsters (3) V. Williams (4) Davenport (5) Pierce

(b) Using the fitted BT model, estimate the probability that Serena Williams beats Venus Williams, and compare the model estimate with the actual sample proportion

tennis.fitb <- glm(nij/(nij+nji) ~ -1 + C + D + P + SW + VW, family=binomial, weights=nij+nji, data=tennis)
summary(tennis.fitb)
## 
## Call:
## glm(formula = nij/(nij + nji) ~ -1 + C + D + P + SW + VW, family = binomial, 
##     data = tennis, weights = nij + nji)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7732  -0.9924  -0.3752   0.7266   1.1643  
## 
## Coefficients: (1 not defined because of singularities)
##    Estimate Std. Error z value Pr(>|z|)
## C    0.1674     0.5960   0.281    0.779
## D   -0.2795     0.5796  -0.482    0.630
## P   -0.4575     0.7249  -0.631    0.528
## SW   0.5592     0.6957   0.804    0.422
## VW       NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12.678  on 10  degrees of freedom
## Residual deviance: 10.117  on  6  degrees of freedom
## AIC: 32.068
## 
## Number of Fisher Scoring iterations: 4
c<-coef(tennis.fitb)

# Probability
exp(c[4])/(1+exp(c[4]))
##        SW 
## 0.6362715

The estimated probability that Serena Williams will win a tennis match against Venus Williams is = 0.6362715.

(c) Showing relevant calculations and using the fitted BT model, construct the 90% confidence interval for the probability that Serena Williams beats Davenport, and interpret the results

tennis.fitc <- glm(nij/(nij+nji) ~ -1 + C + P + SW + VW + D, family=binomial, weights=nij+nji, data=tennis)
summary(tennis.fitc)
## 
## Call:
## glm(formula = nij/(nij + nji) ~ -1 + C + P + SW + VW + D, family = binomial, 
##     data = tennis, weights = nij + nji)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7732  -0.9924  -0.3752   0.7266   1.1643  
## 
## Coefficients: (1 not defined because of singularities)
##    Estimate Std. Error z value Pr(>|z|)
## C    0.4470     0.5560   0.804    0.421
## P   -0.1780     0.7337  -0.243    0.808
## SW   0.8388     0.6986   1.201    0.230
## VW   0.2795     0.5796   0.482    0.630
## D        NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12.678  on 10  degrees of freedom
## Residual deviance: 10.117  on  6  degrees of freedom
## AIC: 32.068
## 
## Number of Fisher Scoring iterations: 4
# Std. Error for Serena Williams vs. Davenport
SE<-0.6986
BiBj.diff=0.8387604

CI<-BiBj.diff+c(-1,1)*qnorm(.90)*SE; CI
## [1] -0.05653152  1.73405232
exp(CI)
## [1] 0.9450367 5.6635580

With 90% confidence, we infer the probability that Serena Williams will win the tennis match is between 0.9450 and 5.6636 times the probability that Davenport will win the match. 95% CI: (0.9450367, 5.6635580)

Question 5

[All Students] p. 223, exercise 7.4. Submit R or SAS code/output and interpretations/answers. Note that the counts here are recorded in the “n” variable. At the website www.stat.ufl.edu/~aa/intro-cda/data for the second edition of this book, the MBTI data file cross-classifies the MBTI Step II National Sample on four binary scales of the Myers–Briggs personality test: Extroversion/Introversion (E/I), Sensing/iNtuitive (S/N), Thinking/Feeling (T/F), and Judging/Perceiving (J/P).

(a) Fit the loglinear model of homogeneous association and conduct a goodness-of-fit test. Based on the fit, show that

  1. the estimated conditional association is strongest between the S/N and J/P scales
  2. there is not strong evidence of conditional association between the E/I and T/F scale or between the E/I and J/P scales.
MBTI<-read.table("http://users.stat.ufl.edu/~aa/cat/data/MBTI.dat", header=T)
EI<-MBTI$EI
SN<-MBTI$SN
TF<-MBTI$TF
JP<-MBTI$JP


fit5 <- glm(n ~ EI + SN + TF + JP + EI:SN + EI:TF + EI:JP + SN:TF + SN:JP + TF:JP, family=poisson, data=MBTI)
summary(fit5)
## 
## Call:
## glm(formula = n ~ EI + SN + TF + JP + EI:SN + EI:TF + EI:JP + 
##     SN:TF + SN:JP + TF:JP, family = poisson, data = MBTI)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
## -0.72826   1.00215   0.05168  -0.01429   1.49947  -1.29325  -0.07596  
##        8         9        10        11        12        13        14  
##  0.00231   0.56850  -0.82975  -0.04948   0.01728  -1.57051   1.09960  
##       15        16  
##  0.08587  -0.00804  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.44760    0.13793  24.994  < 2e-16 ***
## EIi         -0.02907    0.15266  -0.190 0.848952    
## SNs          1.21082    0.14552   8.320  < 2e-16 ***
## TFt         -0.64194    0.16768  -3.828 0.000129 ***
## JPp          0.93417    0.14594   6.401 1.54e-10 ***
## EIi:SNs      0.30212    0.14233   2.123 0.033780 *  
## EIi:TFt      0.19449    0.13121   1.482 0.138258    
## EIi:JPp      0.01766    0.13160   0.134 0.893261    
## SNs:TFt      0.40920    0.15243   2.684 0.007265 ** 
## SNs:JPp     -1.22153    0.14547  -8.397  < 2e-16 ***
## TFt:JPp     -0.55936    0.13512  -4.140 3.48e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.944  on 15  degrees of freedom
## Residual deviance:  10.162  on  5  degrees of freedom
## AIC: 125
## 
## Number of Fisher Scoring iterations: 4
1-pchisq(10.162, 5)
## [1] 0.07077304

(b) Drop the two non-significant two-factor interaction terms, refit the new model, and clearly and thoroughly interpret the estimated odds ratio associated with the E/I by S/N interaction

fit5b <- glm(n ~ EI + SN + TF + JP + EI:SN  + SN:TF + SN:JP + TF:JP, family=poisson, data=MBTI)
summary(fit5b)
## 
## Call:
## glm(formula = n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP + 
##     TF:JP, family = poisson, data = MBTI)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.65487  -0.46916   0.00529   0.54208   1.47431  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.41362    0.12930  26.402  < 2e-16 ***
## EIi          0.03871    0.11361   0.341 0.733287    
## SNs          1.19414    0.14548   8.208 2.24e-16 ***
## TFt         -0.54137    0.15282  -3.543 0.000396 ***
## JPp          0.94292    0.13064   7.218 5.28e-13 ***
## EIi:SNs      0.32190    0.13598   2.367 0.017922 *  
## SNs:TFt      0.42366    0.15200   2.787 0.005318 ** 
## SNs:JPp     -1.22021    0.14513  -8.408  < 2e-16 ***
## TFt:JPp     -0.55853    0.13497  -4.138 3.50e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.944  on 15  degrees of freedom
## Residual deviance:  12.369  on  7  degrees of freedom
## AIC: 123.2
## 
## Number of Fisher Scoring iterations: 4
exp(0.32190)
## [1] 1.379747

The odds of being introverted and and sensitive is 1.3797 times the odds of being extroverted and intuitive.

(c) Clearly interpret in terms of odds ratios that the interaction between E/I and T/F is nonsignificant

exp(0.19449)
## [1] 1.214691

Question 6

[p. 223 ex 7.6] Table 7.15, from a General Social Survey, relates responses on R = religious service attendance (1 = at most a few times a year, 2 = at least several times a year), P = political views (1 = liberal, 2 = moderate, 3 = conservative), B = birth control availability to teenagers between the ages of 14 and 16 (1 = agree, 2 = disagree), S = sex relations before marriage (1 = wrong only sometimes or not wrong at all, 2 = always or almost always wrong).

a. Investigate the complexity needed for loglinear modeling by fitting models having only single-factor terms, all two-factor terms, and all three-factor terms. Select a model and interpret it by estimating conditional odds ratios.

BPRS<-read.table("http://users.stat.ufl.edu/~aa/cat/data/BPRS.dat", header=T)
B<-BPRS$B; P<-BPRS$P; R<-BPRS$R; S<-BPRS$S

# Model 1: Single-factor terms
fit6.one <- glm(count ~ B + factor(P) + R + S, family=poisson, data=BPRS)
summary(fit6.one)
## 
## Call:
## glm(formula = count ~ B + factor(P) + R + S, family = poisson, 
##     data = BPRS)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.5320  -2.5480  -1.4182  -0.0247   9.5001  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.15602    0.18305  22.704  < 2e-16 ***
## B           -0.49358    0.06774  -7.287 3.17e-13 ***
## factor(P)2   0.28768    0.08051   3.573 0.000352 ***
## factor(P)3   0.09194    0.08416   1.092 0.274624    
## R            0.54437    0.06817   7.985 1.40e-15 ***
## S           -0.54437    0.06817  -7.985 1.40e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 477.78  on 23  degrees of freedom
## Residual deviance: 277.08  on 18  degrees of freedom
## AIC: 413.52
## 
## Number of Fisher Scoring iterations: 5
1-pchisq(277.08, 18)
## [1] 0
# Model 2: Two-factor terms
fit6.two <- glm(count ~ B + factor(P) + R + S + B:factor(P) + B:R + B:S + factor(P):R + factor(P):S + R:S, family=poisson, data=BPRS)
summary(fit6.two)
## 
## Call:
## glm(formula = count ~ B + factor(P) + R + S + B:factor(P) + B:R + 
##     B:S + factor(P):R + factor(P):S + R:S, family = poisson, 
##     data = BPRS)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.25008  -0.34687  -0.06744   0.37143   1.07662  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   11.8342     0.5136  23.040  < 2e-16 ***
## B             -3.5271     0.3313 -10.647  < 2e-16 ***
## factor(P)2    -1.4559     0.3604  -4.039 5.36e-05 ***
## factor(P)3    -2.8121     0.3941  -7.136 9.59e-13 ***
## R             -1.9728     0.2784  -7.085 1.39e-12 ***
## S             -4.6599     0.3794 -12.281  < 2e-16 ***
## B:factor(P)2   0.3048     0.1893   1.610 0.107377    
## B:factor(P)3   0.9288     0.1936   4.797 1.61e-06 ***
## B:R            0.5979     0.1625   3.678 0.000235 ***
## B:S            1.1468     0.1532   7.488 6.99e-14 ***
## factor(P)2:R   0.2583     0.1729   1.494 0.135261    
## factor(P)3:R   0.3441     0.1883   1.827 0.067658 .  
## factor(P)2:S   0.7200     0.1952   3.688 0.000226 ***
## factor(P)3:S   0.8018     0.2031   3.948 7.87e-05 ***
## R:S            1.1459     0.1698   6.749 1.49e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 477.7846  on 23  degrees of freedom
## Residual deviance:   6.9631  on  9  degrees of freedom
## AIC: 161.4
## 
## Number of Fisher Scoring iterations: 4
1-pchisq(6.9631, 9)
## [1] 0.6409614
# Model 3: Three-factor terms
fit6.three <- glm(count ~ B + factor(P) + R + S + 
                    B:factor(P) + B:R + B:S + factor(P):R + factor(P):S + R:S + 
                    B:factor(P):R + B:factor(P):S + factor(P):R:S + R:S:B, family=poisson, data=BPRS)
summary(fit6.three)
## 
## Call:
## glm(formula = count ~ B + factor(P) + R + S + B:factor(P) + B:R + 
##     B:S + factor(P):R + factor(P):S + R:S + B:factor(P):R + B:factor(P):S + 
##     factor(P):R:S + R:S:B, family = poisson, data = BPRS)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
##  0.02321  -0.05927  -0.02698   0.04629  -0.08080   0.11766   0.04725  
##        8         9        10        11        12        13        14  
## -0.04902  -0.10758   0.20965   0.09931  -0.15050   0.20965  -0.25018  
##       15        16        17        18        19        20        21  
## -0.12974   0.11978   0.09738  -0.15693  -0.09650   0.11613  -0.27243  
##       22        23        24  
##  0.20382   0.12136  -0.07361  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    12.55137    1.35499   9.263  < 2e-16 ***
## B              -3.81480    0.97060  -3.930 8.48e-05 ***
## factor(P)2     -3.21602    1.25846  -2.556  0.01060 *  
## factor(P)3     -2.06744    1.35516  -1.526  0.12711    
## R              -2.51571    0.77414  -3.250  0.00116 ** 
## S              -5.00447    1.10556  -4.527 5.99e-06 ***
## B:factor(P)2    0.81442    0.81079   1.004  0.31515    
## B:factor(P)3    0.78630    0.82827   0.949  0.34246    
## B:R             0.85685    0.53641   1.597  0.11018    
## B:S             1.15987    0.70770   1.639  0.10123    
## factor(P)2:R    1.23207    0.69670   1.768  0.07699 .  
## factor(P)3:R    0.36755    0.75159   0.489  0.62482    
## factor(P)2:S    1.70933    0.91491   1.868  0.06172 .  
## factor(P)3:S   -0.09378    1.03826  -0.090  0.92803    
## R:S             1.43108    0.58766   2.435  0.01488 *  
## B:factor(P)2:R -0.25060    0.41109  -0.610  0.54212    
## B:factor(P)3:R -0.22874    0.43119  -0.530  0.59578    
## B:factor(P)2:S -0.04888    0.40244  -0.121  0.90332    
## B:factor(P)3:S  0.37398    0.42158   0.887  0.37503    
## factor(P)2:R:S -0.53008    0.43718  -1.213  0.22532    
## factor(P)3:R:S  0.18751    0.48483   0.387  0.69894    
## B:R:S          -0.07139    0.35480  -0.201  0.84054    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 477.78464  on 23  degrees of freedom
## Residual deviance:   0.45051  on  2  degrees of freedom
## AIC: 168.89
## 
## Number of Fisher Scoring iterations: 3
1-pchisq(0.45051, 2)
## [1] 0.7983126
summary(fit6.two)
## 
## Call:
## glm(formula = count ~ B + factor(P) + R + S + B:factor(P) + B:R + 
##     B:S + factor(P):R + factor(P):S + R:S, family = poisson, 
##     data = BPRS)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.25008  -0.34687  -0.06744   0.37143   1.07662  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   11.8342     0.5136  23.040  < 2e-16 ***
## B             -3.5271     0.3313 -10.647  < 2e-16 ***
## factor(P)2    -1.4559     0.3604  -4.039 5.36e-05 ***
## factor(P)3    -2.8121     0.3941  -7.136 9.59e-13 ***
## R             -1.9728     0.2784  -7.085 1.39e-12 ***
## S             -4.6599     0.3794 -12.281  < 2e-16 ***
## B:factor(P)2   0.3048     0.1893   1.610 0.107377    
## B:factor(P)3   0.9288     0.1936   4.797 1.61e-06 ***
## B:R            0.5979     0.1625   3.678 0.000235 ***
## B:S            1.1468     0.1532   7.488 6.99e-14 ***
## factor(P)2:R   0.2583     0.1729   1.494 0.135261    
## factor(P)3:R   0.3441     0.1883   1.827 0.067658 .  
## factor(P)2:S   0.7200     0.1952   3.688 0.000226 ***
## factor(P)3:S   0.8018     0.2031   3.948 7.87e-05 ***
## R:S            1.1459     0.1698   6.749 1.49e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 477.7846  on 23  degrees of freedom
## Residual deviance:   6.9631  on  9  degrees of freedom
## AIC: 161.4
## 
## Number of Fisher Scoring iterations: 4
summary(fit6.three)
## 
## Call:
## glm(formula = count ~ B + factor(P) + R + S + B:factor(P) + B:R + 
##     B:S + factor(P):R + factor(P):S + R:S + B:factor(P):R + B:factor(P):S + 
##     factor(P):R:S + R:S:B, family = poisson, data = BPRS)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
##  0.02321  -0.05927  -0.02698   0.04629  -0.08080   0.11766   0.04725  
##        8         9        10        11        12        13        14  
## -0.04902  -0.10758   0.20965   0.09931  -0.15050   0.20965  -0.25018  
##       15        16        17        18        19        20        21  
## -0.12974   0.11978   0.09738  -0.15693  -0.09650   0.11613  -0.27243  
##       22        23        24  
##  0.20382   0.12136  -0.07361  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    12.55137    1.35499   9.263  < 2e-16 ***
## B              -3.81480    0.97060  -3.930 8.48e-05 ***
## factor(P)2     -3.21602    1.25846  -2.556  0.01060 *  
## factor(P)3     -2.06744    1.35516  -1.526  0.12711    
## R              -2.51571    0.77414  -3.250  0.00116 ** 
## S              -5.00447    1.10556  -4.527 5.99e-06 ***
## B:factor(P)2    0.81442    0.81079   1.004  0.31515    
## B:factor(P)3    0.78630    0.82827   0.949  0.34246    
## B:R             0.85685    0.53641   1.597  0.11018    
## B:S             1.15987    0.70770   1.639  0.10123    
## factor(P)2:R    1.23207    0.69670   1.768  0.07699 .  
## factor(P)3:R    0.36755    0.75159   0.489  0.62482    
## factor(P)2:S    1.70933    0.91491   1.868  0.06172 .  
## factor(P)3:S   -0.09378    1.03826  -0.090  0.92803    
## R:S             1.43108    0.58766   2.435  0.01488 *  
## B:factor(P)2:R -0.25060    0.41109  -0.610  0.54212    
## B:factor(P)3:R -0.22874    0.43119  -0.530  0.59578    
## B:factor(P)2:S -0.04888    0.40244  -0.121  0.90332    
## B:factor(P)3:S  0.37398    0.42158   0.887  0.37503    
## factor(P)2:R:S -0.53008    0.43718  -1.213  0.22532    
## factor(P)3:R:S  0.18751    0.48483   0.387  0.69894    
## B:R:S          -0.07139    0.35480  -0.201  0.84054    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 477.78464  on 23  degrees of freedom
## Residual deviance:   0.45051  on  2  degrees of freedom
## AIC: 168.89
## 
## Number of Fisher Scoring iterations: 3

The GoF test is favorable for the two-factor and three-factor models, which tells us that both of these models fit the data well. The three-factor model has a GoF test statistic=0.7983 vs 0.6409 for the two-factor model. Although the GoF value is slightly better for the three-factor model, the standard errors are the same as, or larger, than the beta estimates for many of the model effects. Additionally, a three-factor model adds another layer of complication as far as interpretability. For these reasons, I would choose the two-factor model.

b. Draw the independence graph for model (BP,BR,BS,PS,RS). Remark on conditional independence patterns. Are any fitted marginal and conditional associations identical?

fit6b <- glm(count ~ B + factor(P) + R + S + B:factor(P) + B:R + factor(P):S + R:S, family=poisson, data=BPRS)
summary(fit6b)
## 
## Call:
## glm(formula = count ~ B + factor(P) + R + S + B:factor(P) + B:R + 
##     factor(P):S + R:S, family = poisson, data = BPRS)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6329  -1.4454  -0.2666   1.3656   2.6186  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   10.8023     0.5276  20.473  < 2e-16 ***
## B             -2.6071     0.2894  -9.010  < 2e-16 ***
## factor(P)2    -1.4905     0.3318  -4.493 7.03e-06 ***
## factor(P)3    -2.9914     0.3550  -8.426  < 2e-16 ***
## R             -2.4409     0.2899  -8.419  < 2e-16 ***
## S             -3.5133     0.3172 -11.076  < 2e-16 ***
## B:factor(P)2   0.5145     0.1795   2.866  0.00416 ** 
## B:factor(P)3   1.1646     0.1838   6.335 2.37e-10 ***
## B:R            0.9107     0.1515   6.011 1.85e-09 ***
## factor(P)2:S   0.8527     0.1835   4.648 3.36e-06 ***
## factor(P)3:S   1.1097     0.1891   5.870 4.37e-09 ***
## R:S            1.3365     0.1625   8.223  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 477.785  on 23  degrees of freedom
## Residual deviance:  66.053  on 12  degrees of freedom
## AIC: 214.49
## 
## Number of Fisher Scoring iterations: 4

c. Fit the loglinear model that corresponds to the logistic model that predicts S using the other variables as main effects, without any interaction. Does it fit adequately?

bprs2<-data.frame(b=c(1,1,1,1,1,1,2,2,2,2,2,2),
                  p=c(1,1,2,2,3,3,1,1,2,2,3,3),
                  r=c(1,2,1,2,1,2,1,2,1,2,1,2),
                  notwrong=c(99,73,73,87,51,51,15,25,20,37,19,36),
                  wrong=c(8,24,20,20,6,33,4,22,13,60,12,88))

lrm <- glm(notwrong/(notwrong+wrong) ~ b + factor(p) + r, family=binomial, weights=wrong+notwrong,data=bprs2)
summary(lrm)
## 
## Call:
## glm(formula = notwrong/(notwrong + wrong) ~ b + factor(p) + r, 
##     family = binomial, data = bprs2, weights = wrong + notwrong)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2827  -0.6034   0.1032   0.6923   2.2381  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6695     0.3842  12.154  < 2e-16 ***
## b            -1.3743     0.1579  -8.706  < 2e-16 ***
## factor(p)2   -0.4246     0.2030  -2.091 0.036510 *  
## factor(p)3   -0.7755     0.2049  -3.785 0.000154 ***
## r            -0.9635     0.1741  -5.535 3.11e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.922  on 11  degrees of freedom
## Residual deviance:  15.944  on  7  degrees of freedom
## AIC: 77.334
## 
## Number of Fisher Scoring iterations: 4
1-pchisq(15.944,7)
## [1] 0.02563294

With a deviance=15.944 on 7 degrees of freedom, the logistic model is an okay fit for the data (GoF=0.0256) but still indicates some lack of fit.

Question 7

[p. 224, ex 7.8] For a three-way contingency table, consider the independence graph, X-Z Y. Write the corresponding loglinear model. Which pairs of variables are conditionally independent? Which pairs of variables have the same marginal association as their conditional association?

See attachment for the loglinear model Conditionally independent: Y and X, Y and Z. The marginal association is equal to the conditional association for Y and X, Y and Z.

Question 8

[STAT-410 students only p. 224, ex 7.12] For the substance use data in Table 7.8, consider loglinear model (AC, AM, CM, AG, AR, GM, GR).

a. Explain why the AM conditional odds ratio is unchanged by collapsing over race, but it is not unchanged by collapsing over gender.

substance<-read.table("http://users.stat.ufl.edu/~aa/cat/data/Substance2.dat", header=T)

fit8 <- glm(count ~ A + C + M + R + G + A:C + A:M + C:M + A:G + A:R + G:M + G:R, family=poisson, data=substance)
summary(fit8)
## 
## Call:
## glm(formula = count ~ A + C + M + R + G + A:C + A:M + C:M + A:G + 
##     A:R + G:M + G:R, family = poisson, data = substance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5357  -0.5772  -0.0178   0.5134   2.4676  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  26.31264    1.04852  25.095  < 2e-16 ***
## A           -11.69693    0.95958 -12.190  < 2e-16 ***
## C            -7.91818    0.34762 -22.778  < 2e-16 ***
## M            -5.97406    0.51168 -11.675  < 2e-16 ***
## R            -3.38269    0.34182  -9.896  < 2e-16 ***
## G            -0.01334    0.23569  -0.057  0.95486    
## A:C           2.05453    0.17406  11.803  < 2e-16 ***
## A:M           3.00592    0.46484   6.467    1e-10 ***
## C:M           2.84789    0.16384  17.382  < 2e-16 ***
## A:G           0.29229    0.12768   2.289  0.02207 *  
## A:R           0.59346    0.19268   3.080  0.00207 ** 
## M:G          -0.26929    0.09039  -2.979  0.00289 ** 
## R:G           0.12619    0.15866   0.795  0.42644    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4818.051  on 31  degrees of freedom
## Residual deviance:   19.909  on 19  degrees of freedom
## AIC: 182.15
## 
## Number of Fisher Scoring iterations: 5

The odds ratio does not change because alcohol or marijuana are conditionally independent on race. Alcohol and marijuana are conditionally dependent on gender, which is why the odds ratio changes when collapsing over gender.

b. Suppose we remove the GM term from the model. Construct the independence graph and show that {G, R} are separated from {C, M} by A. Explain why all conditional associations among A, C, and M are then identical to those in model (AC, AM, CM), collapsing over G and R.

fit8b <- glm(count ~ A + C + M + R + G + A:C + A:M + C:M + A:G + A:R + G:R, family=poisson, data=substance)
summary(fit8b)
## 
## Call:
## glm(formula = count ~ A + C + M + R + G + A:C + A:M + C:M + A:G + 
##     A:R + G:R, family = poisson, data = substance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8342  -0.6160  -0.2801   0.4567   2.4638  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  26.6964     1.0373  25.737  < 2e-16 ***
## A           -11.4674     0.9530 -12.033  < 2e-16 ***
## C            -7.9182     0.3476 -22.778  < 2e-16 ***
## M            -6.3588     0.4957 -12.827  < 2e-16 ***
## R            -3.3827     0.3418  -9.896  < 2e-16 ***
## G            -0.2922     0.2162  -1.351  0.17663    
## A:C           2.0545     0.1741  11.803  < 2e-16 ***
## A:M           2.9860     0.4647   6.426 1.31e-10 ***
## C:M           2.8479     0.1638  17.382  < 2e-16 ***
## A:G           0.1644     0.1202   1.368  0.17133    
## A:R           0.5935     0.1927   3.080  0.00207 ** 
## R:G           0.1262     0.1587   0.795  0.42644    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4818.051  on 31  degrees of freedom
## Residual deviance:   28.805  on 20  degrees of freedom
## AIC: 189.04
## 
## Number of Fisher Scoring iterations: 5