Categorical Data Analysis

(1) Agresti pg 249, Exercise 8.2

A recent General Social Survey asked subjects whether they believed in heaven and whether they believed in hell.

rm(list = ls())
df <- as.table(x = matrix(data = c(833, 125, 2, 160), nrow = 2, byrow = TRUE))

# [2x2 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Yes", "No")
attributes(df)$dimnames[[2]] <- list("Yes", "No")
# Columns: Believe in Hell & Rows: Believe in Heaven
df

    Yes  No
Yes 833 125
No    2 160

Test the hypothesis that the population proportions answering yes were identical for heaven and hell.
Find a 90% confidence interval for the difference between the population proportions. Interpret.
Estimate and interpret the odds ratio for a logistic model for the probability of a yes response as a function of the item (heaven or hell), using the (i) marginal model (8.2), (ii) subject specific model (8.3). Give a detailed explanation on why the estimates are different.

a. McNemar’s Test

\(H_{0}:\) The population proportions for “yes” are equal for heaven and hell
\(H_{A}:\) The population proportions for “yes” are unequal for heaven and hell

mtest <- mcnemar.test(df, correct = FALSE)
mtest


    McNemar's Chi-squared test

data:  df
McNemar's chi-squared = 119.13, df = 1, p-value <
0.00000000000000022

Test Stat: \(\chi^{2} = 119.1259843\)
Conclusion: With a p-value of 0 being less than \(\alpha = 0.05\), we reject the null with evidence in the direction of the alternative. That is, we have statistically significant evidence suggesting that the population proportions of people answering “yes” are unequal for heaven and hell.

b. 90% CI for Difference Proportions

ci.9 <- diffpropci.Wald.mp(df[2, 1], df[1, 2], sum(df), 0.9)
ci.9$conf.int

[1] 0.09417584 0.12546701
attr(,"conf.level")
[1] 0.9

90% Wald CI: [0.0941758, 0.125467]
We can be 90% confident that the difference between population proportions for those believing in hell and those believing in heaven is contained in the above interval.

c. OR Estimates

## Marginal df[1,1] + df[1,2] df[2,1] + df[2,2] df[1,1] + df[2,1] df[1,2] +
## df[2,2]

marg <- log(((df[1, 1] + df[1, 2])/(df[2, 1] + df[2, 2]))/((df[1, 1] + df[2, 
    1])/(df[1, 2] + df[2, 2])))
exp(marg)

[1] 2.018408

## Subject Specific
spec <- df[1, 2]/df[2, 1]
spec

[1] 62.5

Marginal Model OR Estimate: The estimated odds of answering yes to believing in heaven are 2.0184076 times the estimated odds of answering yes to believing in hell.

Case Specific Model OR Estimate: The estimated odds of answering yes to believing in heaven are 62.5 times the estimated odds of answering yes to believing in hell.

These estimates may be different because the subject specific model controls for a variable, subject, that the marginal model ignores because it uses only the cells in the contingency table rather than a more intricate data structure.

(2) Agresti pg 250, Exercise 8.8

The 2016 General Social Survey reported region of residence now and at age 16 for residents of the US in the following table

rm(list = ls())
df <- as.table(x = matrix(data = c(394, 17, 81, 38, 8, 596, 74, 59, 29, 32, 
    769, 35, 10, 24, 35, 417), nrow = 4, byrow = TRUE))
# [4x4 Contingency Table]
attributes(df)$dimnames[[1]] <- list("NE", "MW", "S", "W")
attributes(df)$dimnames[[2]] <- list("NE", "MW", "S", "W")
# Columns: Residence Now & Rows: Residence at Age 16
df

    NE  MW   S   W
NE 394  17  81  38
MW   8 596  74  59
S   29  32 769  35
W   10  24  35 417

df.2 <- data.frame(NE = c(1, 1, 1, 0, 0, 0), MW = c(-1, 0, 0, 1, 1, 0), S = c(0, 
    -1, 0, -1, 0, 1), W = c(0, 0, -1, 0, -1, -1), n.ij = c(17, 81, 38, 74, 59, 
    35), n.ji = c(8, 29, 10, 32, 24, 35))
df.2 <- df.2 %>% mutate(stdres = (n.ij - n.ji/sqrt(n.ij + n.ji)))

Test marginal homogeneity by comparing fits of two models.
Fit the quasi-symmetry (QS) model to these data where the \(\beta_{4}\) parameter (for “West”) is set to zero (as in SAS).
Clearly interpret in context the value \(e^{\hat{\beta_{1}}}\) where \(\beta_{1}\) multiplies “Northeast” in the QS model.
Clearly interpret in context the value \(e^{\hat{\beta_{2}} - \hat{\beta_{1}}}\) where \(\beta_{1}\) multiplies “Northeast” and \(\beta_{2}\) multiplies “Midwest” in the QS model.

a. Testing Marginal Homogeneity

Comparing the fits of the Symmetry & Quasi-Symmetry Model

## Symmetry Model
fit.sym <- glm((n.ij/(n.ij + n.ji)) ~ -1, family = binomial, weights = n.ij + 
    n.ji, data = df.2)
## Quasi-Symmetry Model
fit.qsym <- glm((n.ij/(n.ij + n.ji)) ~ -1 + NE + MW + S + W, family = binomial, 
    weights = n.ij + n.ji, data = df.2)

sym.sum <- summary(fit.sym)
qsym.sum <- summary(fit.qsym)

sym.sum


Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1, family = binomial, data = df.2, 
    weights = n.ij + n.ji)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
 0.000   2.341   4.019   4.164   5.059  

No Coefficients

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78.657  on 6  degrees of freedom
Residual deviance: 78.657  on 6  degrees of freedom
AIC: 105.38

Number of Fisher Scoring iterations: 0

qsym.sum


Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + NE + MW + S + W, family = binomial, 
    data = df.2, weights = n.ij + n.ji)

Deviance Residuals: 
      1        2        3        4        5        6  
 0.8708  -0.6262   0.2526   0.2935   0.1705  -0.3423  

Coefficients: (1 not defined because of singularities)
   Estimate Std. Error z value      Pr(>|z|)    
NE  1.24600    0.20952   5.947 0.00000000273 ***
MW  0.85831    0.17989   4.771 0.00000182949 ***
S   0.08184    0.17330   0.472         0.637    
W        NA         NA      NA            NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78.6573  on 6  degrees of freedom
Residual deviance:  1.4466  on 3  degrees of freedom
AIC: 34.172

Number of Fisher Scoring iterations: 3

## Test Statistics Symmetry Model
sym.sum$deviance

[1] 78.65733

sym.sum$df.residual

[1] 6

# Quasi-Symmetry Model
qsym.sum$deviance

[1] 1.446581

qsym.sum$df.residual

[1] 3

## Calculating p-values Symmetry Model
1 - pchisq(sym.sum$deviance, sym.sum$df.residual)

[1] 0.00000000000000677236

# Quasi-Symmetry Model
1 - pchisq(qsym.sum$deviance, qsym.sum$df.residual)

[1] 0.6946532

## Model Comparison: Difference in Deviances Test Statistic
sym.sum$deviance - qsym.sum$deviance

[1] 77.21075

sym.sum$df.residual - qsym.sum$df.residual

[1] 3

# P-value
1 - pchisq(sym.sum$deviance - qsym.sum$deviance, sym.sum$df.residual - qsym.sum$df.residual)

[1] 0.0000000000000001110223

Based on the test statistics & p-values printed above, the quasi-symmetry model seems to fit the data better than the symmetry model indicating marginal heterogeneity in the data; this is supported by the multiple, statistically significant, non-zero \(\hat{\beta}\)’s in the Quasi-Symmetry Model.

b. Quasi-Symmetry Model

summary(fit.qsym)


Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + NE + MW + S + W, family = binomial, 
    data = df.2, weights = n.ij + n.ji)

Deviance Residuals: 
      1        2        3        4        5        6  
 0.8708  -0.6262   0.2526   0.2935   0.1705  -0.3423  

Coefficients: (1 not defined because of singularities)
   Estimate Std. Error z value      Pr(>|z|)    
NE  1.24600    0.20952   5.947 0.00000000273 ***
MW  0.85831    0.17989   4.771 0.00000182949 ***
S   0.08184    0.17330   0.472         0.637    
W        NA         NA      NA            NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78.6573  on 6  degrees of freedom
Residual deviance:  1.4466  on 3  degrees of freedom
AIC: 34.172

Number of Fisher Scoring iterations: 3

c. Interpretation of \(e^{\hat{\beta_{1}}}\)

exp(fit.qsym$coefficients[1])

      NE 
3.476408

The estimated probability of residents of the Northeast at age 16 living in the West currently is 347.6407551% of the estimated probability of residents of the West at age 16 living in the Northeast currently.

d. Interpretation of \(e^{\hat{\beta_{2}} - \hat{\beta_{1}}}\)

exp(fit.qsym$coefficients[1] - fit.qsym$coefficients[2])

      NE 
1.473568

The estimated probability of relocating from the Northeast to the Midwest is 147.3567805% of the estimated probability of the reverse.

(3) Agresti pg 251-252, Exercise 8.12

The following table displays diagnoses of multiple sclerosis for two neurologists. The categories are (1) Certain multiple sclerosis, (2) Probable multiple sclerosis, (3) Possible multiple sclerosis, (4) Doubtful, unlikely, or definitely not multiple sclerosis. Analyze the agreement.

rm(list = ls())
df <- as.table(x = matrix(data = c(38, 5, 0, 1, 33, 11, 3, 0, 10, 14, 5, 6, 
    3, 7, 3, 10), nrow = 4, byrow = TRUE))
# [4x4 Contingency Table]
attributes(df)$dimnames[[1]] <- list("1", "2", "3", "4")
attributes(df)$dimnames[[2]] <- list("1", "2", "3", "4")
# Columns: Neurologist B & Rows: Neurologist A
df

   1  2  3  4
1 38  5  0  1
2 33 11  3  0
3 10 14  5  6
4  3  7  3 10

df.2 <- data.frame(a = rep(1:4, each = 4), b = rep(1:4, times = 4), n = c(38, 
    5, 0, 1, 33, 11, 3, 0, 10, 14, 5, 6, 3, 7, 3, 10), diag = c(1, 0, 0, 0, 
    0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1))

df.3 <- data.frame(MS1 = c(1, 1, 1, 0, 0, 0), MS2 = c(-1, 0, 0, 1, 1, 0), MS3 = c(0, 
    -1, 0, -1, 0, 1), MS4 = c(0, 0, -1, 0, -1, -1), n.ij = c(5, 0, 1, 3, 0, 
    6), n.ji = c(33, 10, 3, 14, 7, 3), x = c(1, 2, 3, 1, 2, 1))

Use the independence model and residuals to study the pattern of agreement and interpret; report the appropriate residuals into a 4x4 table as done on pg 243.
Use several more complex models to study the pattern and strength of agreement between the neurologists; then, choosing your recommended final model, interpret the results – be sure to give a listing of each model fit, its deviance, and your comments on goodness of fit for these data.
Regardless of the quality of fit, fit the ordinal quasi-symmetry (OQS) model to these data and clearly interpret the estimated \(\beta\) parameter in context in terms of an odds ratio or probability.
Use kappa (whichever version makes the most sense here) to describe agreement and clearly interpret the estimated kappa in context.

a. Independence Model Residuals

stdres <- chisq.test(df)$stdres
kable(stdres)

1	2	3	4
4.778145	-2.4633322	-2.2309078	-2.270845
2.311914	-0.2738636	-0.3167425	-2.973579
-3.792088	2.3745061	1.7855553	1.219738
-4.556948	0.6762936	1.1290412	5.260539

If there is an association between Neurologist A & B’s level of diagnoses, it would be expected that the residuals along the diagonal are larger positive values and, as the difference between i and j increases (residual cells further from the diagonal), the magnitude of residuals decreases. The largest magnitude residuals are on the diagonal at i,j = 1 and i,j = 4 which intuitively makes sense because these cells represents the most extreme levels of diagnoses certainty and trained neurologists should probably agree on the most obvious cases. There seems to be a negative residual trend as the distance from the diagonal increases which may indicate higher rates of agreement since the observed frequencies of more extreme differences is fewer than the expected frequencies assuming independence between the diagnoses.

b. Quasi-Symmetry, Ordinal Quasi-Symmetry & Quasi Independence Models

# Quasi-Symmetry Model
fit.qsym <- glm((n.ij/(n.ij + n.ji)) ~ -1 + MS1 + MS2 + MS3 + MS4, family = binomial, 
    weights = n.ij + n.ji, data = df.3)

# Ordinal Quasi-Symmetry Model
fit.oqsym <- glm((n.ij/(n.ij + n.ji)) ~ -1 + x, family = binomial, weights = n.ij + 
    n.ji, data = df.3)

# Quasi-Independence Model
fit.qi <- glm(n ~ factor(a) + factor(b) + factor(diag), family = poisson, data = df.2)

qsym.sum <- summary(fit.qsym)
oqsym.sum <- summary(fit.oqsym)
qi.sum <- summary(fit.qi)

# 1 - pchisq(qsym.sum$deviance, qsym.sum$df.residual) 1 -
# pchisq(oqsym.sum$deviance, oqsym.sum$df.residual) 1 -
# pchisq(qi.sum$deviance, qi.sum$df.residual)

Quasi-Symmetry Model


Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + MS1 + MS2 + MS3 + MS4, 
    family = binomial, data = df.3, weights = n.ij + n.ji)

Deviance Residuals: 
      1        2        3        4        5        6  
-0.2731  -0.7188   1.4863   0.5215  -1.7293   0.3481  

Coefficients: (1 not defined because of singularities)
    Estimate Std. Error z value  Pr(>|z|)    
MS1  -3.1930     0.7633  -4.183 0.0000287 ***
MS2  -1.4349     0.6583  -2.180    0.0293 *  
MS3   0.4501     0.5840   0.771    0.4409    
MS4       NA         NA      NA        NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 56.442  on 6  degrees of freedom
Residual deviance:  6.184  on 3  degrees of freedom
AIC: 22.646

Number of Fisher Scoring iterations: 5

Deviance: 6.1840017 with 3 degrees of freedom
Comments: With a p-value of 0.1029934, the QS model seems to fit the data relatively well compared to the other two models.

Ordinal Quasi-Symmetry Model


Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + x, family = binomial, 
    data = df.3, weights = n.ij + n.ji)

Deviance Residuals: 
      1        2        3        4        5        6  
-1.4175  -1.2483   1.7952  -0.4599  -1.0444   2.8512  

Coefficients:
  Estimate Std. Error z value   Pr(>|z|)    
x  -1.2565     0.2458  -5.111 0.00000032 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 56.442  on 6  degrees of freedom
Residual deviance: 16.222  on 5  degrees of freedom
AIC: 28.684

Number of Fisher Scoring iterations: 5

Deviance: 16.221858 with 5 degrees of freedom
Comments: With a p-value of 0.0062384, the OQS model does not the data well.

Quasi Independence Model


Call:
glm(formula = n ~ factor(a) + factor(b) + factor(diag), family = poisson, 
    data = df.2)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.96051  -1.92253   0.07114   1.09564   2.49717  

Coefficients:
              Estimate Std. Error z value             Pr(>|z|)    
(Intercept)     2.6556     0.2094  12.683 < 0.0000000000000002 ***
factor(a)2      0.3723     0.2330   1.598                0.110    
factor(a)3      0.2577     0.2642   0.976                0.329    
factor(a)4     -0.2241     0.2836  -0.790                0.429    
factor(b)2     -0.9186     0.2100  -4.375       0.000012147716 ***
factor(b)3     -2.0983     0.3295  -6.368       0.000000000192 ***
factor(b)4     -1.5503     0.2717  -5.705       0.000000011604 ***
factor(diag)1   0.8576     0.1936   4.429       0.000009472832 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 161.740  on 15  degrees of freedom
Residual deviance:  49.678  on  8  degrees of freedom
AIC: 118.95

Number of Fisher Scoring iterations: 5

Deviance: 49.6779157 with 8 degrees of freedom
Comments: With a p-value of 0, the OQS model does not the data well.

c. Ordinal Quasi-Symmetry Model

oqsym.sum


Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + x, family = binomial, 
    data = df.3, weights = n.ij + n.ji)

Deviance Residuals: 
      1        2        3        4        5        6  
-1.4175  -1.2483   1.7952  -0.4599  -1.0444   2.8512  

Coefficients:
  Estimate Std. Error z value   Pr(>|z|)    
x  -1.2565     0.2458  -5.111 0.00000032 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 56.442  on 6  degrees of freedom
Residual deviance: 16.222  on 5  degrees of freedom
AIC: 28.684

Number of Fisher Scoring iterations: 5

\(\hat{\beta_{x}}\) Estimate Interpretation: the estimated probability of Neurologist A’s diagnoses are x levels higher than Neurologist B’s diagnoses is 0.2846518 times the estimated probability of Neurologist B’s diagnoses are x levels higher than Neurologist A’s diagnoses.

d. Kappa

kap <- cohen.kappa(df)
kap$weighted.kappa

[1] 0.5245765

Because the data are ordinal factors, the weighted kappa estimate of 0.5245765 makes most sense in this context and can be interpreted to mean that the difference between observed and expected agreements between the two neurologists is 52.4576464% the maximum possible difference between agreements, assuming independence between their diagnoses.

(4) Pro Tennis records

The table below summarizes the results of tennis matches for several women professional players between 2003 and 2005.

rm(list = ls())
df <- as.table(x = matrix(data = c(NA, 6, 3, 0, 2, 2, NA, 0, 2, 4, 1, 2, NA, 
    0, 1, 2, 2, 2, NA, 2, 3, 2, 2, 2, NA), nrow = 5, byrow = TRUE))
# [5x5 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams", 
    "V.Williams")
attributes(df)$dimnames[[2]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams", 
    "V.Williams")
# Columns: Loser & Rows: Winner
df

           Clijsters Davenport Pierce S.Williams V.Williams
Clijsters                    6      3          0          2
Davenport          2                0          2          4
Pierce             1         2                 0          1
S.Williams         2         2      2                     2
V.Williams         3         2      2          2

df.2 <- data.frame(Clijsters = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), Davenport = c(-1, 
    0, 0, 0, 1, 1, 1, 0, 0, 0), Pierce = c(0, -1, 0, 0, -1, 0, 0, 1, 1, 0), 
    S.Williams = c(0, 0, -1, 0, 0, -1, 0, -1, 0, 1), V.Williams = c(0, 0, 0, 
        -1, 0, 0, -1, 0, -1, -1), n.ij = c(6, 3, 0, 2, 0, 2, 4, 0, 1, 2), n.ji = c(2, 
        1, 2, 3, 2, 2, 2, 2, 2, 2))

Fit the Bradley-Terry (BT) model, report parameter estimates, and rank the players
Using the fitted BT model, estimate the probability that Serena Williams beats Venus Williams, and compare the model estimate with the actual sample proportion.
Showing relevant calculations and using the fitted BT model, construct the 90% confidence interval for the probability that Serena Williams beats Davenport, and interpret the results.
### a. Bradley-Terry Model

fit.bt <- glm(n.ij/(n.ij + n.ji) ~ -1 + Clijsters + Davenport + Pierce + S.Williams + 
    V.Williams, family = binomial, weights = n.ij + n.ji, data = df.2)
bt.sum <- summary(fit.bt)
bt.sum


Call:
glm(formula = n.ij/(n.ij + n.ji) ~ -1 + Clijsters + Davenport + 
    Pierce + S.Williams + V.Williams, family = binomial, data = df.2, 
    weights = n.ij + n.ji)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7732  -0.9924  -0.3752   0.7266   1.1643  

Coefficients: (1 not defined because of singularities)
           Estimate Std. Error z value Pr(>|z|)
Clijsters    0.1674     0.5960   0.281    0.779
Davenport   -0.2795     0.5796  -0.482    0.630
Pierce      -0.4575     0.7249  -0.631    0.528
S.Williams   0.5592     0.6957   0.804    0.422
V.Williams       NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12.678  on 10  degrees of freedom
Residual deviance: 10.117  on  6  degrees of freedom
AIC: 32.068

Number of Fisher Scoring iterations: 4

Player rankings based on the Bradley-Terry model coefficient estimates above:

S. Williams
Clijsters
V. Williams
Davenport
Pierce

b. Probability of Serena Williams beating Venus Williams

exp(bt.sum$coefficients[4, 1])/(1 + exp(bt.sum$coefficients[4, 1]))

[1] 0.6362715

The estimated probability of Serena winning against her twin from the BT model is 0.6362715.

c. 90% CI for Probability of Serena beating Davenport

fit.bt <- glm(n.ij/(n.ij + n.ji) ~ -1 + Clijsters + V.Williams + Pierce + S.Williams + 
    Davenport, family = binomial, weights = n.ij + n.ji, data = df.2)
bt.sum <- summary(fit.bt)
LB <- (bt.sum$coefficients[4, 1] - bt.sum$coefficients[2, 1]) - 1.65 * bt.sum$coefficients[4, 
    2]
UB <- (bt.sum$coefficients[4, 1] - bt.sum$coefficients[2, 1]) + 1.65 * bt.sum$coefficients[4, 
    2]

90% CI: [-0.5935502, 1.7119853]
We can be 90% confident that the probability of Serena Williams defeating Davenport is contained in the above interval.

(5) Agresti pg 223, Exercise 7.4

The MBTI data file cross-classifies the MBTI Step II National Sample on four binary scales of theMyers–Briggs personality test: Extroversion/Introversion (E/I), Sensing/iNtuitive (S/N), Thinking/Feeling (T/F), and Judging/Perceiving (J/P). Note that the counts here are recorded in the “n” variable. After fitting the HA model and answering the questions in the text (called part (a))

rm(list = ls())
# df <- read.table('/Users/Patrick/Dropbox/data/agresti/MBTI.dat', header =
# TRUE)
df <- read.table("http://users.stat.ufl.edu/~aa/cat/data/MBTI.dat", header = TRUE)
df <- as_tibble(df)
df

# A tibble: 16 x 7
   EI    SN    TF    JP    smoke drink     n
   <fct> <fct> <fct> <fct> <int> <int> <int>
 1 e     s     t     j        13    10    77
 2 e     s     t     p        11     8    42
 3 e     s     f     j        16     5   106
 4 e     s     f     p        19     7    79
 5 e     n     t     j         6     3    23
 6 e     n     t     p         4     2    18
 7 e     n     f     j         6     4    31
 8 e     n     f     p        23    15    80
 9 i     s     t     j        32    17   140
10 i     s     t     p         9     3    52
11 i     s     f     j        34     6   138
12 i     s     f     p        29     4   106
13 i     n     t     j         4     1    13
14 i     n     t     p         9     5    35
15 i     n     f     j         4     1    31
16 i     n     f     p        22     6    79

Fit the loglinear model of homogeneous association and conduct a goodness-of-fit test. Based on the fit, show that (i) the estimated conditional association is strongest between the S/N and J/P scales, (ii) there is not strong evidence of conditional association between the E/I and T/F scale or between the E/I and J/P scales.
Drop the two non-significant two-factor interaction terms, refit the new model, and clearly and thoroughly interpret the estimated odds ratio associated with the E/I by S/N interaction.
Clearly interpret in terms of odds ratios that the interaction between E/I and T/F is nonsignificant.

a. Loglinear Model of Homogenous Association

# Saturated loglinear model
fit.1 <- glm(n ~ EI + SN + TF + JP + EI:SN + EI:TF + EI:JP + SN:TF + SN:JP + 
    TF:JP, family = poisson, data = df)
fit.1.sum <- summary(fit.1)
fit.1.sum


Call:
glm(formula = n ~ EI + SN + TF + JP + EI:SN + EI:TF + EI:JP + 
    SN:TF + SN:JP + TF:JP, family = poisson, data = df)

Deviance Residuals: 
       1         2         3         4         5         6         7  
-0.72826   1.00215   0.05168  -0.01429   1.49947  -1.29325  -0.07596  
       8         9        10        11        12        13        14  
 0.00231   0.56850  -0.82975  -0.04948   0.01728  -1.57051   1.09960  
      15        16  
 0.08587  -0.00804  

Coefficients:
            Estimate Std. Error z value             Pr(>|z|)    
(Intercept)  3.44760    0.13793  24.994 < 0.0000000000000002 ***
EIi         -0.02907    0.15266  -0.190             0.848952    
SNs          1.21082    0.14552   8.320 < 0.0000000000000002 ***
TFt         -0.64194    0.16768  -3.828             0.000129 ***
JPp          0.93417    0.14594   6.401       0.000000000154 ***
EIi:SNs      0.30212    0.14233   2.123             0.033780 *  
EIi:TFt      0.19449    0.13121   1.482             0.138258    
EIi:JPp      0.01766    0.13160   0.134             0.893261    
SNs:TFt      0.40920    0.15243   2.684             0.007265 ** 
SNs:JPp     -1.22153    0.14547  -8.397 < 0.0000000000000002 ***
TFt:JPp     -0.55936    0.13512  -4.140       0.000034776300 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 399.944  on 15  degrees of freedom
Residual deviance:  10.162  on  5  degrees of freedom
AIC: 125

Number of Fisher Scoring iterations: 4

# GoF Test
chisq.1 <- fit.1.sum$deviance
chisq.df1 <- fit.1.sum$df[2]
pval <- 1 - pchisq(q = chisq.1, df = chisq.df1)

Test Stat: \(\chi^{2} = 10.1617063\) with 5 degrees of freedom
Conclusion: With a p-value of 0.0707809 being greater than \(\alpha = 0.05\), we fail to reject the null lacking statistically significant evidence in the direction of the alternative hypothesis. That is, the saturated loglinear model does not fit the data well.

The extremely low and insignificant p-values for the coefficient estimates shown above indicate respectively that the estimated conditional association is strongest between the S/N and J/P scales AND there is not strong evidence of conditional association between the E/I and T/F scale or between the E/I and J/P scales.

b. Refit without insignificant interaction terms

fit.2 <- glm(n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP + TF:JP, family = poisson, 
    data = df)
fit.2.sum <- summary(fit.2)
fit.2.sum


Call:
glm(formula = n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP + 
    TF:JP, family = poisson, data = df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.65487  -0.46916   0.00529   0.54208   1.47431  

Coefficients:
            Estimate Std. Error z value             Pr(>|z|)    
(Intercept)  3.41362    0.12930  26.402 < 0.0000000000000002 ***
EIi          0.03871    0.11361   0.341             0.733287    
SNs          1.19414    0.14548   8.208 0.000000000000000224 ***
TFt         -0.54137    0.15282  -3.543             0.000396 ***
JPp          0.94292    0.13064   7.218 0.000000000000528411 ***
EIi:SNs      0.32190    0.13598   2.367             0.017922 *  
SNs:TFt      0.42366    0.15200   2.787             0.005318 ** 
SNs:JPp     -1.22021    0.14513  -8.408 < 0.0000000000000002 ***
TFt:JPp     -0.55853    0.13497  -4.138 0.000035027315884391 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 399.944  on 15  degrees of freedom
Residual deviance:  12.369  on  7  degrees of freedom
AIC: 123.2

Number of Fisher Scoring iterations: 4

exp(fit.2.sum$coefficients[6, 1])

[1] 1.379747

The estimated odds ratio associated with the E/I by S/N interaction can be interpreted to mean that the odds of being I given the odds of being S are 1.3797468

c. Interpreting the E/I:T/F OR interaction

(6) Agresti pg 223, Exercise 7.6

Table 7.15, from a General Social Survey, relates responses on R = religious service attendance (1 = at most a few times a year, 2 = at least several times a year), P = political views (1 = liberal, 2 = moderate, 3 = conservative), B = birth control availability to teenagers between the ages of 14 and 16 (1 = agree, 2 = disagree), S = sex relations before marriage (1 = wrong only sometimes or not wrong at all, 2 = always or almost always wrong).

rm(list = ls())
dat <- read.table("/Users/Patrick/Dropbox/data/agresti/BPRS.dat", header = TRUE)
df <- dat %>% as_tibble(df) %>% transmute_at(vars(-count), list(as.factor)) %>% 
    add_column(counts = as.numeric(dat$count))
df

# A tibble: 24 x 5
   B     P     R     S     counts
   <fct> <fct> <fct> <fct>  <dbl>
 1 1     1     1     1         99
 2 2     1     1     1         15
 3 1     1     2     1         73
 4 2     1     2     1         25
 5 1     1     1     2          8
 6 2     1     1     2          4
 7 1     1     2     2         24
 8 2     1     2     2         22
 9 1     2     1     1         73
10 2     2     1     1         20
# … with 14 more rows

Investigate the complexity needed for loglinear modeling by fitting models having only single-factor terms, all two-factor terms, and all three-factor terms. Select a model and interpret it by estimating conditional odds ratios.
Draw the independence graph for model (BP,BR,BS,PS,RS). Remark on conditional independence patterns. Are any fitted marginal and conditional associations identical?
Fit the loglinear model that corresponds to the logistic model that predicts S using the other variables as main effects, without any interaction. Does it fit adequately?

a. Model Fitting of Various Complexities

fit.1 <- glm(counts ~ B + P + R + S, family = poisson, data = df)
fit.2 <- glm(counts ~ B + P + R + S + B:P + B:R + B:S + P:R + P:S + R:S, family = poisson, 
    data = df)
fit.3 <- glm(counts ~ B + P + R + S + B:P + B:R + B:S + P:R + P:S + R:S + B:P:R + 
    B:R:S + P:R:S, family = poisson, data = df)
fitsum1 <- summary(fit.1)
fitsum2 <- summary(fit.2)
fitsum3 <- summary(fit.3)
OR.cond <- fitsum1$coefficients[, 1]

OR Estimates: 3.6624347, -0.4935838, 0.2876821, 0.0919375, 0.5443742, -0.5443742

b. Independence Graph for 2 Factor Interaction Model

(BP,BR,BS,PS,RS)

indygraph <- as.undirected(make_graph(edges = c("B", "P", "B", "R", "B", "S", 
    "P", "S", "R", "S")))
plot(indygraph)

c.

df <- data.frame(B = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), P = c(1, 1, 2, 2, 
    3, 3, 1, 1, 2, 2, 3, 3), R = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), right = c(99, 
    73, 73, 87, 51, 51, 15, 25, 20, 37, 19, 36), wrong = c(8, 24, 20, 20, 6, 
    33, 4, 22, 13, 60, 12, 88))

fit.lr <- glm((right/(wrong + right)) ~ B + P + R, family = binomial, weights = wrong + 
    right, data = df)
lr.sum <- summary(fit.lr)


1 - pchisq(lr.sum$deviance, lr.sum$df.residual)

[1] 0.0424363

(7) Agresti pg 224, Exercise 7.8

For a three-way contingency table, consider the independence graph,

Write the corresponding loglinear model. Which pairs of variables are conditionally independent? Which pairs of variables have the same marginal association as their conditional association?

Corresponding LogLinear Model: \(log(y_{ij}) = \lambda + \lambda_i^x + \lambda_j^z + \lambda_k^y +\lambda_{ij}^{xz}\)
Conditionally Independent: X & Y, Y & Z
Marginal Association same as Conditional Association: X & Y

(8) Agresti pg 224-225, Exercise 7.12

For the substance use data, consider loglinear model (AC,AM,CM, AG,AR,GM,GR).

rm(list = ls())
df <- read.table("/Users/Patrick/Dropbox/data/agresti/substance2.dat", header = TRUE)
head(df)

  A C M R G count
1 1 1 1 1 1   405
2 1 1 2 1 1   268
3 1 1 1 1 2   453
4 1 1 2 1 2   228
5 1 1 1 2 1    23
6 1 1 2 2 1    23

Explain why the AM conditional odds ratio is unchanged by collapsing over race, but it is not unchanged by collapsing over gender.
Suppose we remove the GM term from the model. Construct the independence graph and show that {G,R} are separated from {C,M} by A. Explain why all conditional associations among A, C, and M are then identical to those in model (AC, AM, CM), collapsing over G and R.

a. Collapsibility

fit <- glm(count ~ A + C + M + R + G + A:C + A:M + C:M + A:G + A:R + G:M + G:R, 
    family = poisson, data = df)
fit.sum <- summary(fit)
fit.sum


Call:
glm(formula = count ~ A + C + M + R + G + A:C + A:M + C:M + A:G + 
    A:R + G:M + G:R, family = poisson, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5357  -0.5772  -0.0178   0.5134   2.4676  

Coefficients:
             Estimate Std. Error z value             Pr(>|z|)    
(Intercept)  26.31264    1.04852  25.095 < 0.0000000000000002 ***
A           -11.69693    0.95958 -12.190 < 0.0000000000000002 ***
C            -7.91818    0.34762 -22.778 < 0.0000000000000002 ***
M            -5.97406    0.51168 -11.675 < 0.0000000000000002 ***
R            -3.38269    0.34182  -9.896 < 0.0000000000000002 ***
G            -0.01334    0.23569  -0.057              0.95486    
A:C           2.05453    0.17406  11.803 < 0.0000000000000002 ***
A:M           3.00592    0.46484   6.467         0.0000000001 ***
C:M           2.84789    0.16384  17.382 < 0.0000000000000002 ***
A:G           0.29229    0.12768   2.289              0.02207 *  
A:R           0.59346    0.19268   3.080              0.00207 ** 
M:G          -0.26929    0.09039  -2.979              0.00289 ** 
R:G           0.12619    0.15866   0.795              0.42644    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4818.051  on 31  degrees of freedom
Residual deviance:   19.909  on 19  degrees of freedom
AIC: 182.15

Number of Fisher Scoring iterations: 5

The AM conditional odds ratio is unchanged by collapsing over race but it is not unchanged by collapsing over gender because the collapsibility conditions are not met for collapsing over race since race is conditionally dependent on alcohol.

b.

SCRAP CODE

Problem 1

df <- as.table(x = matrix(data = c(833, 125, 2, 160), nrow = 2, byrow = TRUE))
# [2x2 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Yes", "No")
attributes(df)$dimnames[[2]] <- list("Yes", "No")

df

    Yes  No
Yes 833 125
No    2 160

df.c <- data.frame(belief = c(rep("Heaven", 2), rep("Hell", 2)), answer = c("Yes", 
    "No", "Yes", "No"), counts = c(df[1, 1], df[2, 1], df[2, 1], df[2, 2]))

### Is agresti's case specific data format available or does he provide a way
### to transform it?????
case.df <- read.table("/Users/Patrick/Dropbox/data/agresti/Afterlife.dat", header = TRUE)
sum(case.df$yes)

[1] 710

## Marginal df[1,1] + df[1,2] df[2,1] + df[2,2] df[1,1] + df[2,1] df[1,2] +
## df[2,2]

Problem 2

Problem 3

Problem 4

rm(list = ls())
df <- as.table(x = matrix(data = c(NA, 6, 3, 0, 2, 2, NA, 0, 2, 4, 1, 2, NA, 
    0, 1, 2, 2, 2, NA, 2, 3, 2, 2, 2, NA), nrow = 5, byrow = TRUE))
# [5x5 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams", 
    "V.Williams")
attributes(df)$dimnames[[2]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams", 
    "V.Williams")
# Columns: Loser & Rows: Winner
df

           Clijsters Davenport Pierce S.Williams V.Williams
Clijsters                    6      3          0          2
Davenport          2                0          2          4
Pierce             1         2                 0          1
S.Williams         2         2      2                     2
V.Williams         3         2      2          2

df.2 <- data.frame(Clijsters = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), Davenport = c(-1, 
    0, 0, 0, 1, 1, 1, 0, 0, 0), Pierce = c(0, -1, 0, 0, -1, 0, 0, 1, 1, 0), 
    S.Williams = c(0, 0, -1, 0, 0, -1, 0, -1, 0, 1), V.Williams = c(0, 0, 0, 
        -1, 0, 0, -1, 0, -1, -1), n.ij = c(6, 3, 0, 2, 0, 2, 4, 0, 1, 2), n.ji = c(2, 
        1, 2, 3, 2, 2, 2, 2, 2, 2))

df.sf <- countsToBinomial(df)
names(df.sf) <- c("p1", "p2", "wins", "losses")
mod <- BTm(cbind(wins, losses), p1, p2, ~player, id = "player", data = df.sf)
summary(mod)


Call:
BTm(outcome = cbind(wins, losses), player1 = p1, player2 = p2, 
    formula = ~player, id = "player", data = df.sf)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7732  -0.9924  -0.3752   0.7266   1.1643  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
playerDavenport   -0.4470     0.5560  -0.804    0.421
playerPierce      -0.6249     0.7069  -0.884    0.377
playerS.Williams   0.3918     0.7290   0.537    0.591
playerV.Williams  -0.1674     0.5960  -0.281    0.779

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12.678  on 10  degrees of freedom
Residual deviance: 10.117  on  6  degrees of freedom
AIC: 32.068

Number of Fisher Scoring iterations: 4