Categorical Data Analysis, P-Set 6

Question 1. (a) Fit the loglinear model (DV, DP, PV), and test its goodness of fit.

filename <- "http://users.stat.ufl.edu/~aa/cat/data/DeathPenalty.dat"
library(readr)
Data <- read.table(file=filename, header=T)
DV.DP.PV <- glm(count ~ (D+V+P)^2, data=Data, family=poisson)
DV.DP.PV

## 
## Call:  glm(formula = count ~ (D + V + P)^2, family = poisson, data = Data)
## 
## Coefficients:
##   (Intercept)         Dwhite         Vwhite           Pyes  Dwhite:Vwhite  
##        4.9358        -2.1746        -1.3298        -3.5961         4.5950  
##   Dwhite:Pyes    Vwhite:Pyes  
##       -0.8678         2.4044  
## 
## Degrees of Freedom: 7 Total (i.e. Null);  1 Residual
## Null Deviance:       1225 
## Residual Deviance: 0.3798    AIC: 52.42

DV.DP.PV1 <- glm(count ~ (D+V+P)^3, data=Data, family=poisson)
anova(DV.DP.PV, DV.DP.PV1, "Chisq")

## Analysis of Deviance Table
## 
## Model 1: count ~ (D + V + P)^2
## Model 2: count ~ (D + V + P)^3
##   Resid. Df Resid. Dev Df Deviance
## 1         1    0.37984            
## 2         0    0.00000  1  0.37984

1-(pchisq(0.3798, 7))

## [1] 0.9997785

P-Value: 0.9997785. Fail to Reject the null that a more complex model is needed.

Using the model in part (a), report the estimated conditional odds ratio between D and P at each category of V. Interpret.

exp(-0.8678)

## [1] 0.4198743

The conditional odds for DP at each level of V is 0.4198743. Controlling for the victims race, the estimated odds for a white defendant receiving the death penalty is 0.42 times the estimated odds of the black defendant.

Find the marginal odds ratio between D and P, contrast this value with the odds ratios obtained in part (b), and remark on how Simpson’s paradox occurs for these data.

(53/430)/(15/176)

## [1] 1.446202

The estimated odds of a white person receiving the death penalty is 1.45 times the odds of a black person person receiving the death penalty.

Simpson’s paradox is made evident as when we control for the victims race, black defendants have a greater odss of the death penalty by prportions of victims’ race, as opposed to when we look at the data marginally.

Referring to the log linear model in part (a), specify the corresponding logistic regres-sion model with P as the response. Fit this model and show how the estimated effects of D and V relate to loglinear model estimates.

V <- c("white", "white", "black", "black")
D <- c("white", "black", "white", "black")
Yes <- c(53, 11, 0, 4)
No <- c(414, 37, 16, 139)
Data2 <- data.frame(D, V, Yes, No)
Data2

##       D     V Yes  No
## 1 white white  53 414
## 2 black white  11  37
## 3 white black   0  16
## 4 black black   4 139

D.V <- glm(cbind(Yes,No)~ D+V, data=Data2, family=binomial)
D.V

## 
## Call:  glm(formula = cbind(Yes, No) ~ D + V, family = binomial, data = Data2)
## 
## Coefficients:
## (Intercept)       Dwhite       Vwhite  
##     -3.5961      -0.8678       2.4044  
## 
## Degrees of Freedom: 3 Total (i.e. Null);  1 Residual
## Null Deviance:       22.27 
## Residual Deviance: 0.3798    AIC: 19.3

Dwhite & DWhite:Pyes are the same coefficents: -0.8678

Vwhite & Vwhite:Pyes: are the same coefficients: 2.4044

Question 2. At the website www.stat.ufl.edu/~aa/intro-cda/data/for Agresti’s second edition,the MBTI data file cross-classifies the MBTI Step II National Sample on four binary scalesof the Myers-Briggs personality test: Extroversion/Introversion (E/I), Sensing/iNtuitive(S/N), Thinking/Feeling (T/F), and Judging/Perceiving (J/P).

Fit the loglinear model by which the variables are mutually independent. Report the results of the goodness-of-fit test.

filename2 <- "http://users.stat.ufl.edu/~aa/intro-cda/data/MBTI.dat"
library(readr)
PersonalityData <- read.table(file=filename2, header=T)
fit <- glm(n ~ EI + SN + TF + JP, family = poisson, data = PersonalityData)
fit1 <- glm(n ~ 1, family = poisson, data = PersonalityData)
summary(fit)

## 
## Call:
## glm(formula = n ~ EI + SN + TF + JP, family = poisson, data = PersonalityData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3550  -2.1182  -1.0628   0.8506   5.7457  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.79255    0.07674  49.422  < 2e-16 ***
## EIi          0.26439    0.06226   4.246 2.17e-05 ***
## SNs          0.87008    0.06765  12.861  < 2e-16 ***
## TFt         -0.48551    0.06355  -7.640 2.17e-14 ***
## JPp         -0.12971    0.06185  -2.097    0.036 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.94  on 15  degrees of freedom
## Residual deviance: 135.87  on 11  degrees of freedom
## AIC: 238.7
## 
## Number of Fisher Scoring iterations: 4

anova(fit1, fit, "Chisq")

## Analysis of Deviance Table
## 
## Model 1: n ~ 1
## Model 2: n ~ EI + SN + TF + JP
##   Resid. Df Resid. Dev Df Deviance
## 1        15     399.94            
## 2        11     135.87  4   264.08

1-pchisq(135.87, 11)

## [1] 0

The P-value for the goodness of fit test is approximately 0, therfore the model is a good fit.

Fit the loglinear model of homogeneous association, and conduct a goodness-of-fittest. Based on the fit, show that

the estimated conditional association is strongest between the S/N and J/P scales; and
There is not strong evidence of conditional association between the E/I and T/Fscales or between the E/I and J/P scales.

fit2 <- glm(n ~ (EI + SN + TF + JP)^2, family = poisson, data = PersonalityData)
fit3 <- glm(n ~ 1, family = poisson, data = PersonalityData)
summary(fit2)

## 
## Call:
## glm(formula = n ~ (EI + SN + TF + JP)^2, family = poisson, data = PersonalityData)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7         8  
## -0.72826   1.00215   0.05168  -0.01429   1.49947  -1.29325  -0.07596   0.00231  
##        9        10        11        12        13        14        15        16  
##  0.56850  -0.82975  -0.04948   0.01728  -1.57051   1.09960   0.08587  -0.00804  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.44760    0.13793  24.994  < 2e-16 ***
## EIi         -0.02907    0.15266  -0.190 0.848952    
## SNs          1.21082    0.14552   8.320  < 2e-16 ***
## TFt         -0.64194    0.16768  -3.828 0.000129 ***
## JPp          0.93417    0.14594   6.401 1.54e-10 ***
## EIi:SNs      0.30212    0.14233   2.123 0.033780 *  
## EIi:TFt      0.19449    0.13121   1.482 0.138258    
## EIi:JPp      0.01766    0.13160   0.134 0.893261    
## SNs:TFt      0.40920    0.15243   2.684 0.007265 ** 
## SNs:JPp     -1.22153    0.14547  -8.397  < 2e-16 ***
## TFt:JPp     -0.55936    0.13512  -4.140 3.48e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.944  on 15  degrees of freedom
## Residual deviance:  10.162  on  5  degrees of freedom
## AIC: 125
## 
## Number of Fisher Scoring iterations: 4

anova(fit3, fit2, "Chisq")

## Analysis of Deviance Table
## 
## Model 1: n ~ 1
## Model 2: n ~ (EI + SN + TF + JP)^2
##   Resid. Df Resid. Dev Df Deviance
## 1        15     399.94            
## 2         5      10.16 10   389.78

1-pchisq(10.162, 5)

## [1] 0.07077304

The P-value is 0.07077304. This model (saturated) is not a better fit than the simpler model.

Show that the estimated conditional association is strongest between the S/N and J/P scales:

The conditional association between S/N and J/P are very strong, evident by their extremely low P-values.

Show that there is not strong evidence of conditional association between the E/I and T/F scales or between the E/I and J/P scales:

E/I and T/F scales have large P-values, which accordingly shows that the conditional association between the scales are insignificant. E/I and J/P also do not have a strong conditional association.

Fit the model that assumes conditional independence between E/I and T/F, and between E/I and J/P, but has the other pairwise associations.

fit4 <- glm(n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP + TF:JP, family = poisson, data = PersonalityData)
summary(fit4)

## 
## Call:
## glm(formula = n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP + 
##     TF:JP, family = poisson, data = PersonalityData)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.65487  -0.46916   0.00529   0.54208   1.47431  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.41362    0.12930  26.402  < 2e-16 ***
## EIi          0.03871    0.11361   0.341 0.733287    
## SNs          1.19414    0.14548   8.208 2.24e-16 ***
## TFt         -0.54137    0.15282  -3.543 0.000396 ***
## JPp          0.94292    0.13064   7.218 5.28e-13 ***
## EIi:SNs      0.32190    0.13598   2.367 0.017922 *  
## SNs:TFt      0.42366    0.15200   2.787 0.005318 ** 
## SNs:JPp     -1.22021    0.14513  -8.408  < 2e-16 ***
## TFt:JPp     -0.55853    0.13497  -4.138 3.50e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.944  on 15  degrees of freedom
## Residual deviance:  12.369  on  7  degrees of freedom
## AIC: 123.2
## 
## Number of Fisher Scoring iterations: 4

Compare this fit to the fit of the model containing all pairwise associations. What do you conclude?

anova(fit4,fit2, test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: n ~ EI + SN + TF + JP + EI:SN + SN:TF + SN:JP + TF:JP
## Model 2: n ~ (EI + SN + TF + JP)^2
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1         7     12.369                     
## 2         5     10.162  2    2.207   0.3317

P-value is 0.3317, so the model that assumes conditional independence between E/I and T/F, and between E/I and J/P gives adequate fit.

Use this model fit to construct a 95% likelihood-ratio confidence interval for the conditional odds ratio between the S/N and J/P scales. Interpret.

exp(confint(fit4, method="profile"))

## Waiting for profiling to be done...

##                  2.5 %     97.5 %
## (Intercept) 23.4067710 38.8667659
## EIi          0.8319067  1.2992469
## SNs          2.4923799  4.4102180
## TFt          0.4299779  0.7832723
## JPp          1.9947113  3.3306215
## EIi:SNs      1.0568719  1.8015497
## SNs:TFt      1.1363105  2.0630723
## SNs:JPp      0.2214587  0.3913279
## TFt:JPp      0.4385878  0.7446507

The confidence interval for the conditional odds ratio between the S/N and J/P scales: is between 0.2214587 and 0.3913279.

The estimated odds ratio associated with the S/N by J/P interaction can be interpreted to mean that the odds of being N given the odds of being J are between 0.2214587 and 0.3913279.

Write the loglinear model of mutual independence, the model of homogeneous asso-ciation, and the model containing all three-factor interaction terms, and show thatthe numbers of parameters are 5, 11, and 15, respectively.

fit <- glm(n ~ EI + SN + TF + JP, family = poisson, data = PersonalityData)
summary(fit)

## 
## Call:
## glm(formula = n ~ EI + SN + TF + JP, family = poisson, data = PersonalityData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3550  -2.1182  -1.0628   0.8506   5.7457  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.79255    0.07674  49.422  < 2e-16 ***
## EIi          0.26439    0.06226   4.246 2.17e-05 ***
## SNs          0.87008    0.06765  12.861  < 2e-16 ***
## TFt         -0.48551    0.06355  -7.640 2.17e-14 ***
## JPp         -0.12971    0.06185  -2.097    0.036 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.94  on 15  degrees of freedom
## Residual deviance: 135.87  on 11  degrees of freedom
## AIC: 238.7
## 
## Number of Fisher Scoring iterations: 4

fit2 <- glm(n ~ (EI + SN + TF + JP)^2, family = poisson, data = PersonalityData)
summary(fit2)

## 
## Call:
## glm(formula = n ~ (EI + SN + TF + JP)^2, family = poisson, data = PersonalityData)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7         8  
## -0.72826   1.00215   0.05168  -0.01429   1.49947  -1.29325  -0.07596   0.00231  
##        9        10        11        12        13        14        15        16  
##  0.56850  -0.82975  -0.04948   0.01728  -1.57051   1.09960   0.08587  -0.00804  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.44760    0.13793  24.994  < 2e-16 ***
## EIi         -0.02907    0.15266  -0.190 0.848952    
## SNs          1.21082    0.14552   8.320  < 2e-16 ***
## TFt         -0.64194    0.16768  -3.828 0.000129 ***
## JPp          0.93417    0.14594   6.401 1.54e-10 ***
## EIi:SNs      0.30212    0.14233   2.123 0.033780 *  
## EIi:TFt      0.19449    0.13121   1.482 0.138258    
## EIi:JPp      0.01766    0.13160   0.134 0.893261    
## SNs:TFt      0.40920    0.15243   2.684 0.007265 ** 
## SNs:JPp     -1.22153    0.14547  -8.397  < 2e-16 ***
## TFt:JPp     -0.55936    0.13512  -4.140 3.48e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.944  on 15  degrees of freedom
## Residual deviance:  10.162  on  5  degrees of freedom
## AIC: 125
## 
## Number of Fisher Scoring iterations: 4

fit5 <- glm(n ~ (EI + SN + TF + JP)^3, family = poisson, data = PersonalityData)
summary(fit5)

## 
## Call:
## glm(formula = n ~ (EI + SN + TF + JP)^3, family = poisson, data = PersonalityData)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
## -0.4805   0.6865   0.4228  -0.4746   0.9580  -0.9412  -0.7382   0.4889  
##       9       10       11       12       13       14       15       16  
##  0.3666  -0.5798  -0.3618   0.4228  -1.0803   0.7577   0.8099  -0.4746  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.56370    0.16197  22.002  < 2e-16 ***
## EIi         -0.27880    0.23337  -1.195   0.2322    
## SNs          1.05839    0.18535   5.710 1.13e-08 ***
## TFt         -0.63483    0.25356  -2.504   0.0123 *  
## JPp          0.76316    0.19243   3.966 7.31e-05 ***
## EIi:SNs      0.61460    0.25451   2.415   0.0157 *  
## EIi:TFt      0.20026    0.30833   0.650   0.5160    
## EIi:JPp      0.37430    0.26332   1.421   0.1552    
## SNs:TFt      0.41081    0.27510   1.493   0.1353    
## SNs:JPp     -0.96288    0.22994  -4.187 2.82e-05 ***
## TFt:JPp     -0.58773    0.29782  -1.973   0.0484 *  
## EIi:SNs:TFt -0.02364    0.30704  -0.077   0.9386    
## EIi:SNs:JPp -0.51039    0.29275  -1.743   0.0813 .  
## EIi:TFt:JPp  0.02440    0.27403   0.089   0.9290    
## SNs:TFt:JPp  0.01922    0.30880   0.062   0.9504    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 399.9439  on 15  degrees of freedom
## Residual deviance:   7.0963  on  1  degrees of freedom
## AIC: 129.93
## 
## Number of Fisher Scoring iterations: 4

Question 3. True or false? (a) With a single categorical response variable, logistic regression models are more ap-propriate than loglinear models.

Answer: True

To model the association and interaction structure among several categorical response variables, logistic regression models are more appropriate than loglinear models.

Answer: False

The logistic model is a GLM assuming a binomial random component, whereas the loglinear model is a GLM assuming a Poisson random component. When both are fitted to a contingency table having 50 cells with a binary response, the logistic model treats the cell counts as 25 binomial observations, whereas the loglinear model treats the cell counts as 50 Poisson observations.

Answer: True

Question 4. A recent General Social Survey asked subjects whether they believed in heaven and whether they believed in hell; the following table shows the results.

Heaven_Hell <- matrix(c(833,125,2,160),ncol=2,byrow=TRUE)
colnames(Heaven_Hell) <- c("Hell (Yes)","Hell (No)")
rownames(Heaven_Hell) <- c("Heaven (Yes)","Heaven (No)")
Heaven_Hell <- as.table(Heaven_Hell)
Heaven_Hell

##              Hell (Yes) Hell (No)
## Heaven (Yes)        833       125
## Heaven (No)           2       160

Test the hypothesis that the population proportions answering yes were the same for heaven and hell.

Null Hypothesis: the population proportions answering yes were the same for heaven and hell (P_heaven=P_hell) Alternative Hypothesis: the population proportions answering yes were different for heaven and hell

mcnemar.test(Heaven_Hell, correct=F)

## 
##  McNemar's Chi-squared test
## 
## data:  Heaven_Hell
## McNemar's chi-squared = 119.13, df = 1, p-value < 2.2e-16

Since p-value < 2.2e-16, which is approximately 0, there is strong evidence against the null hypothesis, therefore the population proportions answering yes were not the same for heaven and hell.

Find a 90% confidence interval for the difference between the population proportions.Interpret.

SE <- 1/1120 * sqrt( 125 + 2 - (125-2)^2 / 1120)
SE

## [1] 0.00951184

CI <- ((125 - 2) / 1120) + c(-1,1) * 1.64 * SE
CI

## [1] 0.09422201 0.12542085

The probability of a “Yes” response is between 0.0912 less and 0.1285 more for belief in Heavan than belief in Hell.

Categorical Data Analysis, P-Set 6

Barovier Allybose

4/29/2020