A recent General Social Survey asked subjects whether they believed in heaven and whether they believed in hell.
rm(list = ls())
df <- as.table(x = matrix(data = c(833, 125, 2, 160), nrow = 2, byrow = TRUE))
# [2x2 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Yes", "No")
attributes(df)$dimnames[[2]] <- list("Yes", "No")
# Columns: Believe in Hell & Rows: Believe in Heaven
df Yes No
Yes 833 125
No 2 160
\(H_{0}:\) The population proportions for “yes” are equal for heaven and hell
\(H_{A}:\) The population proportions for “yes” are unequal for heaven and hell
McNemar's Chi-squared test
data: df
McNemar's chi-squared = 119.13, df = 1, p-value <
0.00000000000000022
Test Stat: \(\chi^{2} = 119.1259843\)
Conclusion: With a p-value of 0 being less than \(\alpha = 0.05\), we reject the null with evidence in the direction of the alternative. That is, we have statistically significant evidence suggesting that the population proportions of people answering “yes” are unequal for heaven and hell.
[1] 0.09417584 0.12546701
attr(,"conf.level")
[1] 0.9
90% Wald CI: [0.0941758, 0.125467]
We can be 90% confident that the difference between population proportions for those believing in hell and those believing in heaven is contained in the above interval.
## Marginal df[1,1] + df[1,2] df[2,1] + df[2,2] df[1,1] + df[2,1] df[1,2] +
## df[2,2]
marg <- log(((df[1, 1] + df[1, 2])/(df[2, 1] + df[2, 2]))/((df[1, 1] + df[2,
1])/(df[1, 2] + df[2, 2])))
exp(marg)[1] 2.018408
[1] 62.5
Marginal Model OR Estimate: The estimated odds of answering yes to believing in heaven are 2.0184076 times the estimated odds of answering yes to believing in hell.
Case Specific Model OR Estimate: The estimated odds of answering yes to believing in heaven are 62.5 times the estimated odds of answering yes to believing in hell.
These estimates may be different because the subject specific model controls for a variable, subject, that the marginal model ignores because it uses only the cells in the contingency table rather than a more intricate data structure.
The 2016 General Social Survey reported region of residence now and at age 16 for residents of the US in the following table
rm(list = ls())
df <- as.table(x = matrix(data = c(394, 17, 81, 38, 8, 596, 74, 59, 29, 32,
769, 35, 10, 24, 35, 417), nrow = 4, byrow = TRUE))
# [4x4 Contingency Table]
attributes(df)$dimnames[[1]] <- list("NE", "MW", "S", "W")
attributes(df)$dimnames[[2]] <- list("NE", "MW", "S", "W")
# Columns: Residence Now & Rows: Residence at Age 16
df NE MW S W
NE 394 17 81 38
MW 8 596 74 59
S 29 32 769 35
W 10 24 35 417
df.2 <- data.frame(NE = c(1, 1, 1, 0, 0, 0), MW = c(-1, 0, 0, 1, 1, 0), S = c(0,
-1, 0, -1, 0, 1), W = c(0, 0, -1, 0, -1, -1), n.ij = c(17, 81, 38, 74, 59,
35), n.ji = c(8, 29, 10, 32, 24, 35))
df.2 <- df.2 %>% mutate(stdres = (n.ij - n.ji/sqrt(n.ij + n.ji)))Comparing the fits of the Symmetry & Quasi-Symmetry Model
## Symmetry Model
fit.sym <- glm((n.ij/(n.ij + n.ji)) ~ -1, family = binomial, weights = n.ij +
n.ji, data = df.2)
## Quasi-Symmetry Model
fit.qsym <- glm((n.ij/(n.ij + n.ji)) ~ -1 + NE + MW + S + W, family = binomial,
weights = n.ij + n.ji, data = df.2)
sym.sum <- summary(fit.sym)
qsym.sum <- summary(fit.qsym)
sym.sum
Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1, family = binomial, data = df.2,
weights = n.ij + n.ji)
Deviance Residuals:
Min 1Q Median 3Q Max
0.000 2.341 4.019 4.164 5.059
No Coefficients
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 78.657 on 6 degrees of freedom
Residual deviance: 78.657 on 6 degrees of freedom
AIC: 105.38
Number of Fisher Scoring iterations: 0
Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + NE + MW + S + W, family = binomial,
data = df.2, weights = n.ij + n.ji)
Deviance Residuals:
1 2 3 4 5 6
0.8708 -0.6262 0.2526 0.2935 0.1705 -0.3423
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
NE 1.24600 0.20952 5.947 0.00000000273 ***
MW 0.85831 0.17989 4.771 0.00000182949 ***
S 0.08184 0.17330 0.472 0.637
W NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 78.6573 on 6 degrees of freedom
Residual deviance: 1.4466 on 3 degrees of freedom
AIC: 34.172
Number of Fisher Scoring iterations: 3
[1] 78.65733
[1] 6
[1] 1.446581
[1] 3
[1] 0.00000000000000677236
[1] 0.6946532
[1] 77.21075
[1] 3
# P-value
1 - pchisq(sym.sum$deviance - qsym.sum$deviance, sym.sum$df.residual - qsym.sum$df.residual)[1] 0.0000000000000001110223
Based on the test statistics & p-values printed above, the quasi-symmetry model seems to fit the data better than the symmetry model indicating marginal heterogeneity in the data; this is supported by the multiple, statistically significant, non-zero \(\hat{\beta}\)’s in the Quasi-Symmetry Model.
Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + NE + MW + S + W, family = binomial,
data = df.2, weights = n.ij + n.ji)
Deviance Residuals:
1 2 3 4 5 6
0.8708 -0.6262 0.2526 0.2935 0.1705 -0.3423
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
NE 1.24600 0.20952 5.947 0.00000000273 ***
MW 0.85831 0.17989 4.771 0.00000182949 ***
S 0.08184 0.17330 0.472 0.637
W NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 78.6573 on 6 degrees of freedom
Residual deviance: 1.4466 on 3 degrees of freedom
AIC: 34.172
Number of Fisher Scoring iterations: 3
NE
3.476408
The estimated probability of residents of the Northeast at age 16 living in the West currently is 347.6407551% of the estimated probability of residents of the West at age 16 living in the Northeast currently.
NE
1.473568
The estimated probability of relocating from the Northeast to the Midwest is 147.3567805% of the estimated probability of the reverse.
The following table displays diagnoses of multiple sclerosis for two neurologists. The categories are (1) Certain multiple sclerosis, (2) Probable multiple sclerosis, (3) Possible multiple sclerosis, (4) Doubtful, unlikely, or definitely not multiple sclerosis. Analyze the agreement.
rm(list = ls())
df <- as.table(x = matrix(data = c(38, 5, 0, 1, 33, 11, 3, 0, 10, 14, 5, 6,
3, 7, 3, 10), nrow = 4, byrow = TRUE))
# [4x4 Contingency Table]
attributes(df)$dimnames[[1]] <- list("1", "2", "3", "4")
attributes(df)$dimnames[[2]] <- list("1", "2", "3", "4")
# Columns: Neurologist B & Rows: Neurologist A
df 1 2 3 4
1 38 5 0 1
2 33 11 3 0
3 10 14 5 6
4 3 7 3 10
df.2 <- data.frame(a = rep(1:4, each = 4), b = rep(1:4, times = 4), n = c(38,
5, 0, 1, 33, 11, 3, 0, 10, 14, 5, 6, 3, 7, 3, 10), diag = c(1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1))
df.3 <- data.frame(MS1 = c(1, 1, 1, 0, 0, 0), MS2 = c(-1, 0, 0, 1, 1, 0), MS3 = c(0,
-1, 0, -1, 0, 1), MS4 = c(0, 0, -1, 0, -1, -1), n.ij = c(5, 0, 1, 3, 0,
6), n.ji = c(33, 10, 3, 14, 7, 3), x = c(1, 2, 3, 1, 2, 1))| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 4.778145 | -2.4633322 | -2.2309078 | -2.270845 |
| 2.311914 | -0.2738636 | -0.3167425 | -2.973579 |
| -3.792088 | 2.3745061 | 1.7855553 | 1.219738 |
| -4.556948 | 0.6762936 | 1.1290412 | 5.260539 |
If there is an association between Neurologist A & B’s level of diagnoses, it would be expected that the residuals along the diagonal are larger positive values and, as the difference between i and j increases (residual cells further from the diagonal), the magnitude of residuals decreases. The largest magnitude residuals are on the diagonal at i,j = 1 and i,j = 4 which intuitively makes sense because these cells represents the most extreme levels of diagnoses certainty and trained neurologists should probably agree on the most obvious cases. There seems to be a negative residual trend as the distance from the diagonal increases which may indicate higher rates of agreement since the observed frequencies of more extreme differences is fewer than the expected frequencies assuming independence between the diagnoses.
# Quasi-Symmetry Model
fit.qsym <- glm((n.ij/(n.ij + n.ji)) ~ -1 + MS1 + MS2 + MS3 + MS4, family = binomial,
weights = n.ij + n.ji, data = df.3)
# Ordinal Quasi-Symmetry Model
fit.oqsym <- glm((n.ij/(n.ij + n.ji)) ~ -1 + x, family = binomial, weights = n.ij +
n.ji, data = df.3)
# Quasi-Independence Model
fit.qi <- glm(n ~ factor(a) + factor(b) + factor(diag), family = poisson, data = df.2)
qsym.sum <- summary(fit.qsym)
oqsym.sum <- summary(fit.oqsym)
qi.sum <- summary(fit.qi)
# 1 - pchisq(qsym.sum$deviance, qsym.sum$df.residual) 1 -
# pchisq(oqsym.sum$deviance, oqsym.sum$df.residual) 1 -
# pchisq(qi.sum$deviance, qi.sum$df.residual)Quasi-Symmetry Model
Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + MS1 + MS2 + MS3 + MS4,
family = binomial, data = df.3, weights = n.ij + n.ji)
Deviance Residuals:
1 2 3 4 5 6
-0.2731 -0.7188 1.4863 0.5215 -1.7293 0.3481
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
MS1 -3.1930 0.7633 -4.183 0.0000287 ***
MS2 -1.4349 0.6583 -2.180 0.0293 *
MS3 0.4501 0.5840 0.771 0.4409
MS4 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 56.442 on 6 degrees of freedom
Residual deviance: 6.184 on 3 degrees of freedom
AIC: 22.646
Number of Fisher Scoring iterations: 5
Deviance: 6.1840017 with 3 degrees of freedom
Comments: With a p-value of 0.1029934, the QS model seems to fit the data relatively well compared to the other two models.
Ordinal Quasi-Symmetry Model
Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + x, family = binomial,
data = df.3, weights = n.ij + n.ji)
Deviance Residuals:
1 2 3 4 5 6
-1.4175 -1.2483 1.7952 -0.4599 -1.0444 2.8512
Coefficients:
Estimate Std. Error z value Pr(>|z|)
x -1.2565 0.2458 -5.111 0.00000032 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 56.442 on 6 degrees of freedom
Residual deviance: 16.222 on 5 degrees of freedom
AIC: 28.684
Number of Fisher Scoring iterations: 5
Deviance: 16.221858 with 5 degrees of freedom
Comments: With a p-value of 0.0062384, the OQS model does not the data well.
Quasi Independence Model
Call:
glm(formula = n ~ factor(a) + factor(b) + factor(diag), family = poisson,
data = df.2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.96051 -1.92253 0.07114 1.09564 2.49717
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.6556 0.2094 12.683 < 0.0000000000000002 ***
factor(a)2 0.3723 0.2330 1.598 0.110
factor(a)3 0.2577 0.2642 0.976 0.329
factor(a)4 -0.2241 0.2836 -0.790 0.429
factor(b)2 -0.9186 0.2100 -4.375 0.000012147716 ***
factor(b)3 -2.0983 0.3295 -6.368 0.000000000192 ***
factor(b)4 -1.5503 0.2717 -5.705 0.000000011604 ***
factor(diag)1 0.8576 0.1936 4.429 0.000009472832 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 161.740 on 15 degrees of freedom
Residual deviance: 49.678 on 8 degrees of freedom
AIC: 118.95
Number of Fisher Scoring iterations: 5
Deviance: 49.6779157 with 8 degrees of freedom
Comments: With a p-value of 0, the OQS model does not the data well.
Call:
glm(formula = (n.ij/(n.ij + n.ji)) ~ -1 + x, family = binomial,
data = df.3, weights = n.ij + n.ji)
Deviance Residuals:
1 2 3 4 5 6
-1.4175 -1.2483 1.7952 -0.4599 -1.0444 2.8512
Coefficients:
Estimate Std. Error z value Pr(>|z|)
x -1.2565 0.2458 -5.111 0.00000032 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 56.442 on 6 degrees of freedom
Residual deviance: 16.222 on 5 degrees of freedom
AIC: 28.684
Number of Fisher Scoring iterations: 5
\(\hat{\beta_{x}}\) Estimate Interpretation: the estimated probability of Neurologist A’s diagnoses are x levels higher than Neurologist B’s diagnoses is 0.2846518 times the estimated probability of Neurologist B’s diagnoses are x levels higher than Neurologist A’s diagnoses.
[1] 0.5245765
Because the data are ordinal factors, the weighted kappa estimate of 0.5245765 makes most sense in this context and can be interpreted to mean that the difference between observed and expected agreements between the two neurologists is 52.4576464% the maximum possible difference between agreements, assuming independence between their diagnoses.
The table below summarizes the results of tennis matches for several women professional players between 2003 and 2005.
rm(list = ls())
df <- as.table(x = matrix(data = c(NA, 6, 3, 0, 2, 2, NA, 0, 2, 4, 1, 2, NA,
0, 1, 2, 2, 2, NA, 2, 3, 2, 2, 2, NA), nrow = 5, byrow = TRUE))
# [5x5 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams",
"V.Williams")
attributes(df)$dimnames[[2]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams",
"V.Williams")
# Columns: Loser & Rows: Winner
df Clijsters Davenport Pierce S.Williams V.Williams
Clijsters 6 3 0 2
Davenport 2 0 2 4
Pierce 1 2 0 1
S.Williams 2 2 2 2
V.Williams 3 2 2 2
df.2 <- data.frame(Clijsters = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), Davenport = c(-1,
0, 0, 0, 1, 1, 1, 0, 0, 0), Pierce = c(0, -1, 0, 0, -1, 0, 0, 1, 1, 0),
S.Williams = c(0, 0, -1, 0, 0, -1, 0, -1, 0, 1), V.Williams = c(0, 0, 0,
-1, 0, 0, -1, 0, -1, -1), n.ij = c(6, 3, 0, 2, 0, 2, 4, 0, 1, 2), n.ji = c(2,
1, 2, 3, 2, 2, 2, 2, 2, 2))fit.bt <- glm(n.ij/(n.ij + n.ji) ~ -1 + Clijsters + Davenport + Pierce + S.Williams +
V.Williams, family = binomial, weights = n.ij + n.ji, data = df.2)
bt.sum <- summary(fit.bt)
bt.sum
Call:
glm(formula = n.ij/(n.ij + n.ji) ~ -1 + Clijsters + Davenport +
Pierce + S.Williams + V.Williams, family = binomial, data = df.2,
weights = n.ij + n.ji)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7732 -0.9924 -0.3752 0.7266 1.1643
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
Clijsters 0.1674 0.5960 0.281 0.779
Davenport -0.2795 0.5796 -0.482 0.630
Pierce -0.4575 0.7249 -0.631 0.528
S.Williams 0.5592 0.6957 0.804 0.422
V.Williams NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 12.678 on 10 degrees of freedom
Residual deviance: 10.117 on 6 degrees of freedom
AIC: 32.068
Number of Fisher Scoring iterations: 4
Player rankings based on the Bradley-Terry model coefficient estimates above:
[1] 0.6362715
The estimated probability of Serena winning against her twin from the BT model is 0.6362715.
fit.bt <- glm(n.ij/(n.ij + n.ji) ~ -1 + Clijsters + V.Williams + Pierce + S.Williams +
Davenport, family = binomial, weights = n.ij + n.ji, data = df.2)
bt.sum <- summary(fit.bt)
LB <- (bt.sum$coefficients[4, 1] - bt.sum$coefficients[2, 1]) - 1.65 * bt.sum$coefficients[4,
2]
UB <- (bt.sum$coefficients[4, 1] - bt.sum$coefficients[2, 1]) + 1.65 * bt.sum$coefficients[4,
2]90% CI: [-0.5935502, 1.7119853]
We can be 90% confident that the probability of Serena Williams defeating Davenport is contained in the above interval.
The MBTI data file cross-classifies the MBTI Step II National Sample on four binary scales of theMyers–Briggs personality test: Extroversion/Introversion (E/I), Sensing/iNtuitive (S/N), Thinking/Feeling (T/F), and Judging/Perceiving (J/P). Note that the counts here are recorded in the “n” variable. After fitting the HA model and answering the questions in the text (called part (a))
rm(list = ls())
# df <- read.table('/Users/Patrick/Dropbox/data/agresti/MBTI.dat', header =
# TRUE)
df <- read.table("http://users.stat.ufl.edu/~aa/cat/data/MBTI.dat", header = TRUE)
df <- as_tibble(df)
df# A tibble: 16 x 7
EI SN TF JP smoke drink n
<fct> <fct> <fct> <fct> <int> <int> <int>
1 e s t j 13 10 77
2 e s t p 11 8 42
3 e s f j 16 5 106
4 e s f p 19 7 79
5 e n t j 6 3 23
6 e n t p 4 2 18
7 e n f j 6 4 31
8 e n f p 23 15 80
9 i s t j 32 17 140
10 i s t p 9 3 52
11 i s f j 34 6 138
12 i s f p 29 4 106
13 i n t j 4 1 13
14 i n t p 9 5 35
15 i n f j 4 1 31
16 i n f p 22 6 79
# Saturated loglinear model
fit.1 <- glm(n ~ EI + SN + TF + JP + EI:SN + EI:TF + EI:JP + SN:TF + SN:JP +
TF:JP, family = poisson, data = df)
fit.1.sum <- summary(fit.1)
fit.1.sum
Call:
glm(formula = n ~ EI + SN + TF + JP + EI:SN + EI:TF + EI:JP +
SN:TF + SN:JP + TF:JP, family = poisson, data = df)
Deviance Residuals:
1 2 3 4 5 6 7
-0.72826 1.00215 0.05168 -0.01429 1.49947 -1.29325 -0.07596
8 9 10 11 12 13 14
0.00231 0.56850 -0.82975 -0.04948 0.01728 -1.57051 1.09960
15 16
0.08587 -0.00804
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.44760 0.13793 24.994 < 0.0000000000000002 ***
EIi -0.02907 0.15266 -0.190 0.848952
SNs 1.21082 0.14552 8.320 < 0.0000000000000002 ***
TFt -0.64194 0.16768 -3.828 0.000129 ***
JPp 0.93417 0.14594 6.401 0.000000000154 ***
EIi:SNs 0.30212 0.14233 2.123 0.033780 *
EIi:TFt 0.19449 0.13121 1.482 0.138258
EIi:JPp 0.01766 0.13160 0.134 0.893261
SNs:TFt 0.40920 0.15243 2.684 0.007265 **
SNs:JPp -1.22153 0.14547 -8.397 < 0.0000000000000002 ***
TFt:JPp -0.55936 0.13512 -4.140 0.000034776300 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 399.944 on 15 degrees of freedom
Residual deviance: 10.162 on 5 degrees of freedom
AIC: 125
Number of Fisher Scoring iterations: 4
# GoF Test
chisq.1 <- fit.1.sum$deviance
chisq.df1 <- fit.1.sum$df[2]
pval <- 1 - pchisq(q = chisq.1, df = chisq.df1)Test Stat: \(\chi^{2} = 10.1617063\) with 5 degrees of freedom
Conclusion: With a p-value of 0.0707809 being greater than \(\alpha = 0.05\), we fail to reject the null lacking statistically significant evidence in the direction of the alternative hypothesis. That is, the saturated loglinear model does not fit the data well.
The extremely low and insignificant p-values for the coefficient estimates shown above indicate respectively that the estimated conditional association is strongest between the S/N and J/P scales AND there is not strong evidence of conditional association between the E/I and T/F scale or between the E/I and J/P scales.
fit.2 <- glm(n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP + TF:JP, family = poisson,
data = df)
fit.2.sum <- summary(fit.2)
fit.2.sum
Call:
glm(formula = n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP +
TF:JP, family = poisson, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.65487 -0.46916 0.00529 0.54208 1.47431
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.41362 0.12930 26.402 < 0.0000000000000002 ***
EIi 0.03871 0.11361 0.341 0.733287
SNs 1.19414 0.14548 8.208 0.000000000000000224 ***
TFt -0.54137 0.15282 -3.543 0.000396 ***
JPp 0.94292 0.13064 7.218 0.000000000000528411 ***
EIi:SNs 0.32190 0.13598 2.367 0.017922 *
SNs:TFt 0.42366 0.15200 2.787 0.005318 **
SNs:JPp -1.22021 0.14513 -8.408 < 0.0000000000000002 ***
TFt:JPp -0.55853 0.13497 -4.138 0.000035027315884391 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 399.944 on 15 degrees of freedom
Residual deviance: 12.369 on 7 degrees of freedom
AIC: 123.2
Number of Fisher Scoring iterations: 4
[1] 1.379747
The estimated odds ratio associated with the E/I by S/N interaction can be interpreted to mean that the odds of being I given the odds of being S are 1.3797468
Table 7.15, from a General Social Survey, relates responses on R = religious service attendance (1 = at most a few times a year, 2 = at least several times a year), P = political views (1 = liberal, 2 = moderate, 3 = conservative), B = birth control availability to teenagers between the ages of 14 and 16 (1 = agree, 2 = disagree), S = sex relations before marriage (1 = wrong only sometimes or not wrong at all, 2 = always or almost always wrong).
rm(list = ls())
dat <- read.table("/Users/Patrick/Dropbox/data/agresti/BPRS.dat", header = TRUE)
df <- dat %>% as_tibble(df) %>% transmute_at(vars(-count), list(as.factor)) %>%
add_column(counts = as.numeric(dat$count))
df# A tibble: 24 x 5
B P R S counts
<fct> <fct> <fct> <fct> <dbl>
1 1 1 1 1 99
2 2 1 1 1 15
3 1 1 2 1 73
4 2 1 2 1 25
5 1 1 1 2 8
6 2 1 1 2 4
7 1 1 2 2 24
8 2 1 2 2 22
9 1 2 1 1 73
10 2 2 1 1 20
# … with 14 more rows
fit.1 <- glm(counts ~ B + P + R + S, family = poisson, data = df)
fit.2 <- glm(counts ~ B + P + R + S + B:P + B:R + B:S + P:R + P:S + R:S, family = poisson,
data = df)
fit.3 <- glm(counts ~ B + P + R + S + B:P + B:R + B:S + P:R + P:S + R:S + B:P:R +
B:R:S + P:R:S, family = poisson, data = df)
fitsum1 <- summary(fit.1)
fitsum2 <- summary(fit.2)
fitsum3 <- summary(fit.3)
OR.cond <- fitsum1$coefficients[, 1]OR Estimates: 3.6624347, -0.4935838, 0.2876821, 0.0919375, 0.5443742, -0.5443742
(BP,BR,BS,PS,RS)
indygraph <- as.undirected(make_graph(edges = c("B", "P", "B", "R", "B", "S",
"P", "S", "R", "S")))
plot(indygraph)df <- data.frame(B = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), P = c(1, 1, 2, 2,
3, 3, 1, 1, 2, 2, 3, 3), R = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), right = c(99,
73, 73, 87, 51, 51, 15, 25, 20, 37, 19, 36), wrong = c(8, 24, 20, 20, 6,
33, 4, 22, 13, 60, 12, 88))
fit.lr <- glm((right/(wrong + right)) ~ B + P + R, family = binomial, weights = wrong +
right, data = df)
lr.sum <- summary(fit.lr)
1 - pchisq(lr.sum$deviance, lr.sum$df.residual)[1] 0.0424363
For a three-way contingency table, consider the independence graph,
Write the corresponding loglinear model. Which pairs of variables are conditionally independent? Which pairs of variables have the same marginal association as their conditional association?
Corresponding LogLinear Model: \(log(y_{ij}) = \lambda + \lambda_i^x + \lambda_j^z + \lambda_k^y +\lambda_{ij}^{xz}\)
Conditionally Independent: X & Y, Y & Z
Marginal Association same as Conditional Association: X & Y
For the substance use data, consider loglinear model (AC,AM,CM, AG,AR,GM,GR).
rm(list = ls())
df <- read.table("/Users/Patrick/Dropbox/data/agresti/substance2.dat", header = TRUE)
head(df) A C M R G count
1 1 1 1 1 1 405
2 1 1 2 1 1 268
3 1 1 1 1 2 453
4 1 1 2 1 2 228
5 1 1 1 2 1 23
6 1 1 2 2 1 23
fit <- glm(count ~ A + C + M + R + G + A:C + A:M + C:M + A:G + A:R + G:M + G:R,
family = poisson, data = df)
fit.sum <- summary(fit)
fit.sum
Call:
glm(formula = count ~ A + C + M + R + G + A:C + A:M + C:M + A:G +
A:R + G:M + G:R, family = poisson, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5357 -0.5772 -0.0178 0.5134 2.4676
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 26.31264 1.04852 25.095 < 0.0000000000000002 ***
A -11.69693 0.95958 -12.190 < 0.0000000000000002 ***
C -7.91818 0.34762 -22.778 < 0.0000000000000002 ***
M -5.97406 0.51168 -11.675 < 0.0000000000000002 ***
R -3.38269 0.34182 -9.896 < 0.0000000000000002 ***
G -0.01334 0.23569 -0.057 0.95486
A:C 2.05453 0.17406 11.803 < 0.0000000000000002 ***
A:M 3.00592 0.46484 6.467 0.0000000001 ***
C:M 2.84789 0.16384 17.382 < 0.0000000000000002 ***
A:G 0.29229 0.12768 2.289 0.02207 *
A:R 0.59346 0.19268 3.080 0.00207 **
M:G -0.26929 0.09039 -2.979 0.00289 **
R:G 0.12619 0.15866 0.795 0.42644
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4818.051 on 31 degrees of freedom
Residual deviance: 19.909 on 19 degrees of freedom
AIC: 182.15
Number of Fisher Scoring iterations: 5
The AM conditional odds ratio is unchanged by collapsing over race but it is not unchanged by collapsing over gender because the collapsibility conditions are not met for collapsing over race since race is conditionally dependent on alcohol.
df <- as.table(x = matrix(data = c(833, 125, 2, 160), nrow = 2, byrow = TRUE))
# [2x2 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Yes", "No")
attributes(df)$dimnames[[2]] <- list("Yes", "No")
df Yes No
Yes 833 125
No 2 160
df.c <- data.frame(belief = c(rep("Heaven", 2), rep("Hell", 2)), answer = c("Yes",
"No", "Yes", "No"), counts = c(df[1, 1], df[2, 1], df[2, 1], df[2, 2]))
### Is agresti's case specific data format available or does he provide a way
### to transform it?????
case.df <- read.table("/Users/Patrick/Dropbox/data/agresti/Afterlife.dat", header = TRUE)
sum(case.df$yes)[1] 710
rm(list = ls())
df <- as.table(x = matrix(data = c(NA, 6, 3, 0, 2, 2, NA, 0, 2, 4, 1, 2, NA,
0, 1, 2, 2, 2, NA, 2, 3, 2, 2, 2, NA), nrow = 5, byrow = TRUE))
# [5x5 Contingency Table]
attributes(df)$dimnames[[1]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams",
"V.Williams")
attributes(df)$dimnames[[2]] <- list("Clijsters", "Davenport", "Pierce", "S.Williams",
"V.Williams")
# Columns: Loser & Rows: Winner
df Clijsters Davenport Pierce S.Williams V.Williams
Clijsters 6 3 0 2
Davenport 2 0 2 4
Pierce 1 2 0 1
S.Williams 2 2 2 2
V.Williams 3 2 2 2
df.2 <- data.frame(Clijsters = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), Davenport = c(-1,
0, 0, 0, 1, 1, 1, 0, 0, 0), Pierce = c(0, -1, 0, 0, -1, 0, 0, 1, 1, 0),
S.Williams = c(0, 0, -1, 0, 0, -1, 0, -1, 0, 1), V.Williams = c(0, 0, 0,
-1, 0, 0, -1, 0, -1, -1), n.ij = c(6, 3, 0, 2, 0, 2, 4, 0, 1, 2), n.ji = c(2,
1, 2, 3, 2, 2, 2, 2, 2, 2))
df.sf <- countsToBinomial(df)
names(df.sf) <- c("p1", "p2", "wins", "losses")
mod <- BTm(cbind(wins, losses), p1, p2, ~player, id = "player", data = df.sf)
summary(mod)
Call:
BTm(outcome = cbind(wins, losses), player1 = p1, player2 = p2,
formula = ~player, id = "player", data = df.sf)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7732 -0.9924 -0.3752 0.7266 1.1643
Coefficients:
Estimate Std. Error z value Pr(>|z|)
playerDavenport -0.4470 0.5560 -0.804 0.421
playerPierce -0.6249 0.7069 -0.884 0.377
playerS.Williams 0.3918 0.7290 0.537 0.591
playerV.Williams -0.1674 0.5960 -0.281 0.779
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 12.678 on 10 degrees of freedom
Residual deviance: 10.117 on 6 degrees of freedom
AIC: 32.068
Number of Fisher Scoring iterations: 4