(Exercise 1) Educational researchers compared three different approaches to mathematics instruction for third graders. During the year, students were rotated through three different styles: working alone at their own pace (“self”), working in teams (“team”), and a traditional classroom setting (“class”).

Researchers were interested in how other school programs influenced the effectiveness of the styles, as well as the students’ perceptions of the different styles. The data are cross-classified by school day program (“Regular”, in which a regular school schedule is used, and “After”, which supplements the regular school day with an afternoon program involving the same classmates) and by school (three different schools), and are given below. The question of interest is whether and how students’ learning style preference (Y) is associated with their school day program, after adjusting for any effects of individual school. Using the attached Minitab/SAS output, answer the following:




a. Comment on any such associations using the CMH test and using the Baseline-Category Logit (BCL) model; this includes testing for school and program effects if possible (give calculated test statistics, degrees of freedom, p-values, and clear interpretations)

CMH general association test statistic: 10.9577 with 2 degrees of freedom. Since the variables in this example are nominal rather than ordinal, the general association test is the only CMH statistic appropriate for evaluating this association. At a significance level of \(\alpha = 0.05\), we reject the null hypothesis (p-value = 0.0042) for the CMH test of general association between type of program and learning style after adjusting for a school effect. That is, the data suggest there is an association between program and learning style preference, controlling for school.

The Baseline-Category Logit (BCL) model showed each variable (aside from the school2 dummy variable) to be statistically significant, or very nearly so in the case of the school1 dummy in logit 1, which has a p-value of 0.053.

b. Clearly and thoroughly identify which group is the baseline (for Y) in the BCL model

The traditional classroom setting is the baseline group for Y in the BCL model; this can be clearly identified in the SAS variable information output, where the learning_style_pref value “c_class” is listed as the “reference event”. Additionally, c_class appears as the denominator of both fitted logits, indicating that it is the baseline level against which the newer learning styles are compared. Fitting the model this way makes intuitive sense for evaluating the less traditional approaches.

c. Clearly interpret the program odds ratio estimates in the BCL model

The program coefficient estimates were 0.742632 for logit 1 and 0.747440 for logit 2. The corresponding odds ratio estimates from the BCL model are \(e^{0.742632} = 2.1015\) for logit 1 and \(e^{0.747440} = 2.1116\) for logit 2.

Conditional on the learning preference being either teamwork or the traditional classroom, the estimated odds that a student in the regular program prefers teamwork over the traditional classroom are 2.1015 times the corresponding estimated odds for a student in the supplementary after-school program, holding school fixed.

Conditional on the learning preference being either working alone or the traditional classroom, the estimated odds that a student in the regular program prefers working alone over the traditional classroom are 2.1116 times the corresponding estimated odds for a student in the supplementary after-school program, holding school fixed.
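As a quick check, these odds ratio estimates follow directly from exponentiating the quoted coefficients (a one-line R computation using only the estimates reported above):

```r
# Odds ratio estimates for the program effect (regular vs. after-school)
# in the BCL model, from the coefficient estimates quoted above
exp(0.742632)  # logit 1 (team vs. class): approx. 2.1015
exp(0.747440)  # logit 2 (self vs. class): approx. 2.1116
```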

d. Showing all hand calculations, use the prediction equations to predict the probabilities for (1) a student from school 2 in the regular program and (2) a student from school 3 in a supplementary after-school program.

##        self      team     class
## 1 0.3409392 0.2666951 0.3923657
## [1] 1
##        self      team    class
## 1 0.3435648 0.3419169 0.313858
## [1] 0.9993396

Predicted Probabilities of preference for a student from school 2 without a supplementary program

\(\hat{\pi}_{self} = \frac{e^{-0.656042 - (0.231888 \cdot school2) + (0.747440 \cdot regular)}}{1 + e^{-0.653231 - (0.475489 \cdot school2) + (0.742632 \cdot regular)} + e^{-0.656042 - (0.231888 \cdot school2) + (0.747440 \cdot regular)}} = 0.3409392\)

\(\hat{\pi}_{team} = \frac{e^{-0.653231 - (0.475489 \cdot school2) + (0.742632 \cdot regular)}}{1 + e^{-0.653231 - (0.475489 \cdot school2) + (0.742632 \cdot regular)} + e^{-0.656042 - (0.231888 \cdot school2) + (0.747440 \cdot regular)}} = 0.2666951\)

\(\hat{\pi}_{class} = \frac{1}{1 + e^{-0.653231 - (0.475489 \cdot school2) + (0.742632 \cdot regular)} + e^{-0.656042 - (0.231888 \cdot school2) + (0.747440 \cdot regular)}} = 0.3923657\)
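The same hand calculation can be verified with a short R sketch, plugging school2 = 1 and regular = 1 into the coefficient estimates quoted above (the object names are illustrative):

```r
# BCL predicted probabilities for a school-2 student in the regular program,
# computed directly from the coefficient estimates quoted above
num_self <- exp(-0.656042 - 0.231888 * 1 + 0.747440 * 1)  # self vs. class logit
num_team <- exp(-0.653231 - 0.475489 * 1 + 0.742632 * 1)  # team vs. class logit
denom    <- 1 + num_self + num_team
c(self = num_self / denom, team = num_team / denom, class = 1 / denom)
# self 0.3409, team 0.2667, class 0.3924 -- matching the output shown above
```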


Predicted Probabilities of preference for a student from school 3 in a supplementary after-school program

\(\hat{\pi}_{self} = \frac{e^{-0.656042 + (0.747440 \cdot regular)}}{1 + e^{-0.653231 + (0.742632 \cdot regular)} + e^{-0.656042 + (0.747440 \cdot regular)}} = 0.3435648\)

\(\hat{\pi}_{team} = \frac{e^{-0.656042 + (0.742632 \cdot regular)}}{1 + e^{-0.653231 + (0.742632 \cdot regular)} + e^{-0.656042 + (0.747440 \cdot regular)}} = 0.3419169\)

\(\hat{\pi}_{class} = \frac{1}{1 + e^{-0.656042 + (0.747440 \cdot regular)} + e^{-0.656042 + (0.742632 \cdot regular)}} = 0.313858\)


(Exercise 2) R. T. Senie et al. (American Journal of Public Health, 1981) report the data given below relating Frequency of Breast Self-Examination and Age in a sample of women. In the SAS output below, the PO model fit to these data uses Y = 1 (Never), Y = 2 (Occasionally), and Y = 3 (Monthly). First, it was fit with a dummy variable for the “Age < 45” age group and another dummy variable for the “Age: 45 - 59” age group, and then it was fit using the derived “ageb” variable.

Data

Models



a. Using the 1st dummy variable approach, clearly test for the Age effect, reporting your hypotheses, derived test statistic, degrees of freedom, p-value and clear conclusion.

LR test for the Age effect (dummy-variable model)
\(H_{0}: \beta_{1} = \beta_{2} = 0\), where \(\beta_{1}\) and \(\beta_{2}\) are the coefficients of the two age dummy variables (age has no effect on the frequency of self-examination)
\(H_{A}:\) at least one of \(\beta_{1}, \beta_{2}\) is nonzero (age has some effect on the frequency of self-examination)
Test statistic: \(\chi^{2}_{df = 2} = 24.4799\)
With a p-value less than 0.0001, we reject the null hypothesis and conclude that age does have an effect on the frequency of breast self-examination.

b. Clearly interpret the odds ratio for the last run of SAS/LOGISTIC (the one using ageb).

Derivation of ageb: ageb = 40·(age = ‘< 45’) + 52.5·(age = ‘45-59’) + 65·(age = ‘60+’);
that is, each age group is recoded to a single representative age: 40 for the youngest group, 52.5 for the middle group, and 65 for the oldest group.

The estimated odds ratio from the last model fit (using the derived ageb variable) was 1.029 per one-unit (one-year) increase in ageb. Because the recoded age groups sit 12.5 units apart on the ageb scale, this implies that the estimated odds of more frequent self-examination for women with ageb = 40 are about \(1.029^{12.5} \approx 1.43\) times those for women with ageb = 52.5, and the odds for women with ageb = 52.5 are likewise about 1.43 times those for women with ageb = 65.
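A quick check of the group-to-group comparisons implied by the per-year odds ratio, using only the 1.029 estimate quoted above:

```r
# Group comparisons implied by the per-year odds ratio estimate of 1.029
1.029^12.5  # ageb = 40 vs. 52.5 (12.5-year gap): approx. 1.43
1.029^25    # ageb = 40 vs. 65   (25-year gap):   approx. 2.04
```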

c. Clearly determine which approach (dummy variable or ageb) best fits these data.

Based on the significance of the MLEs, the model using the dummy variables appears to fit these data best. The model using the derived ageb variable had an estimate for the second intercept that deviated from the other estimates in direction and was substantially less significant than the corresponding estimates from the other models.


(Exercise 3) Using R and attaching your R program/output, complete exercise 6.8 on p. 189, completing these parts. The data and an intercept-only (null) cumulative logit fit are shown first:

alt.trt fem y1 y2 y3 y4
0 0 28 45 29 26
0 1 4 12 5 2
1 0 41 44 20 20
1 1 12 7 3 1
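For reference, a minimal sketch of how these data and the intercept-only (null) cumulative logit fit shown below could be produced in R, assuming the VGAM package (the object names df and fit0 are illustrative):

```r
library(VGAM)

# Counts by treatment (alt.trt: 1 = alternating, 0 = sequential),
# gender (fem: 1 = female), and the four ordered response categories y1-y4
df <- data.frame(alt.trt = c(0, 0, 1, 1),
                 fem     = c(0, 1, 0, 1),
                 y1 = c(28, 4, 41, 12),
                 y2 = c(45, 12, 44, 7),
                 y3 = c(29, 5, 20, 3),
                 y4 = c(26, 2, 20, 1))

# Intercept-only proportional-odds (cumulative logit) model
fit0 <- vglm(as.matrix(df[, 3:6]) ~ 1,
             family = cumulative(parallel = TRUE), data = df)
summary(fit0)
```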
## 
## Call:
## vglm(formula = as.matrix(df[, 3:6]) ~ 1, family = cumulative(parallel = TRUE), 
##     data = df)
## 
## Pearson residuals:
##   logitlink(P[Y<=1]) logitlink(P[Y<=2]) logitlink(P[Y<=3])
## 1            -1.2988            -1.3437            -0.7374
## 2            -1.4692             0.5292             1.0511
## 3             0.9841             0.7042            -0.1800
## 4             2.2391             0.9990             1.1082
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  -0.9233     0.1282  -7.202 5.95e-13 ***
## (Intercept):2   0.5993     0.1209   4.957 7.16e-07 ***
## (Intercept):3   1.6296     0.1562  10.431  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3])
## 
## Residual deviance: 16.479 on 9 degrees of freedom
## 
## Log-likelihood: -30.9973 on 9 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## No Hauck-Donner effect found in any of the estimates

a. Include only the main effects in the model, using a dummy variable for the alternating therapy group and another dummy variable for females. Clearly interpret the respective odds ratios.

## 
## Call:
## vglm(formula = as.matrix(df[, 3:6]) ~ alt.trt + fem, family = cumulative(parallel = TRUE), 
##     data = df)
## 
## Pearson residuals:
##   logitlink(P[Y<=1]) logitlink(P[Y<=2]) logitlink(P[Y<=3])
## 1             0.1720            0.06056             0.2809
## 2            -1.6543            0.16312             0.8425
## 3             0.2655           -0.13909            -0.9386
## 4             0.6174           -0.01519             0.6444
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  -1.3180     0.1801  -7.319  2.5e-13 ***
## (Intercept):2   0.2492     0.1621   1.538  0.12412    
## (Intercept):3   1.3001     0.1852   7.021  2.2e-12 ***
## alt.trt         0.5807     0.2119   2.741  0.00613 ** 
## fem             0.5414     0.2953   1.834  0.06671 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3])
## 
## Residual deviance: 5.5677 on 7 degrees of freedom
## 
## Log-likelihood: -25.5417 on 7 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
##  alt.trt      fem 
## 1.787262 1.718403

The estimated odds ratio for the alternating therapy effect indicates that, at every cutpoint, the estimated odds of a poorer (lower) response category for a patient in the alternating therapy group are \(e^{0.5807} = 1.787\) times the odds of a poorer response for a patient in the sequential therapy group, adjusting for gender.

The estimated odds ratio for gender indicates that the estimated odds of a poorer response for a female are \(e^{0.5414} = 1.718\) times the odds of a poorer response for a male, adjusting for therapy group.

The alternating therapy effect is significant (Wald p-value = 0.0061), while the gender effect is not significant at the 0.05 level (p-value = 0.067).
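For additional context, approximate Wald 95% confidence intervals for these odds ratios can be computed directly from the estimates and standard errors in the output above (a quick R check; 1.96 is the usual normal quantile):

```r
# Approximate Wald 95% CIs for the cumulative odds ratios,
# from the estimates and standard errors reported above
exp(0.5807 + c(-1, 1) * 1.96 * 0.2119)  # alt.trt: roughly (1.18, 2.71)
exp(0.5414 + c(-1, 1) * 1.96 * 0.2953)  # fem:     roughly (0.96, 3.07)
```

The interval for fem contains 1, consistent with the non-significant Wald test above.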

b. Include a therapy-by-gender interaction term in the model by also including the product of the above dummy variables. Test whether the interaction term is statistically significant using \(\alpha = 0.05\). Use the best test and report all hypotheses, test statistics, p-values, and a clear conclusion.

## 
## Call:
## vglm(formula = as.matrix(df[, 3:6]) ~ alt.trt + fem + alt.trt * 
##     fem, family = cumulative(parallel = T), data = df)
## 
## Pearson residuals:
##   logitlink(P[Y<=1]) logitlink(P[Y<=2]) logitlink(P[Y<=3])
## 1            0.02471           -0.13434             0.1387
## 2           -1.30855            0.56305             1.0374
## 3            0.48236            0.01789            -0.7993
## 4            0.05105           -0.39051             0.4948
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  -1.2757     0.1843  -6.920 4.52e-12 ***
## (Intercept):2   0.2957     0.1681   1.760   0.0785 .  
## (Intercept):3   1.3452     0.1909   7.045 1.85e-12 ***
## alt.trt         0.4881     0.2288   2.133   0.0329 *  
## fem             0.2742     0.4094   0.670   0.5030    
## alt.trt:fem     0.5904     0.5935   0.995   0.3199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3])
## 
## Residual deviance: 4.5209 on 6 degrees of freedom
## 
## Log-likelihood: -25.0183 on 6 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
##     alt.trt         fem alt.trt:fem 
##    1.629170    1.315475    1.804727
## Likelihood ratio test
## 
## Model 1: as.matrix(df[, 3:6]) ~ alt.trt + fem
## Model 2: as.matrix(df[, 3:6]) ~ alt.trt + fem + alt.trt * fem
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   7 -25.542                     
## 2   6 -25.018 -1 1.0468     0.3062

\(H_{0}: \beta_{alt.trt \times fem} = 0\) (no therapy-by-gender interaction) versus \(H_{A}: \beta_{alt.trt \times fem} \neq 0\)
Test statistic: \(\chi^{2}_{df = 1} = 1.0468\)
At a significance level of \(\alpha = 0.05\), the likelihood ratio test (p-value = 0.3062) indicates that the interaction between alternating therapy and gender is not statistically significant, so the interaction term can be dropped from the model.
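For reference, a sketch of how this comparison could be run in R with VGAM’s lrtest(), reusing the data frame df built earlier (the fit names are illustrative):

```r
# Fit the main-effects and interaction models, then compare by LR test
fit.main <- vglm(as.matrix(df[, 3:6]) ~ alt.trt + fem,
                 family = cumulative(parallel = TRUE), data = df)
fit.int  <- vglm(as.matrix(df[, 3:6]) ~ alt.trt + fem + alt.trt * fem,
                 family = cumulative(parallel = TRUE), data = df)
lrtest(fit.main, fit.int)  # LR test of the interaction term (1 df)
```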

c. Returning to the main-effects model in part (a), test whether the therapy and gender effects are significant, using \(\alpha = 0.05\). Use the best test and report all hypotheses, test statistics, p-values, and a clear conclusion.

## Likelihood ratio test
## 
## Model 1: as.matrix(df[, 3:6]) ~ alt.trt + fem
## Model 2: as.matrix(df[, 3:6]) ~ fem
##   #Df  LogLik Df  Chisq Pr(>Chisq)   
## 1   7 -25.542                        
## 2   8 -29.327  1 7.5702   0.005934 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Likelihood ratio test
## 
## Model 1: as.matrix(df[, 3:6]) ~ alt.trt + fem
## Model 2: as.matrix(df[, 3:6]) ~ alt.trt
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   7 -25.542                       
## 2   8 -27.340  1 3.5965     0.0579 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(H_{0}: \beta_{alt.trt} = 0\) (no therapy effect) versus \(H_{A}: \beta_{alt.trt} \neq 0\)
Test statistic: \(\chi^{2}_{df = 1} = 7.5702\)
At a significance level of \(\alpha = 0.05\), the likelihood ratio test (p-value = 0.0059) indicates that the alternating therapy effect is statistically significant and the term should be kept in the model.

\(H_{0}: \beta_{fem} = 0\) (no gender effect) versus \(H_{A}: \beta_{fem} \neq 0\)
Test statistic: \(\chi^{2}_{df = 1} = 3.5965\)
At a significance level of \(\alpha = 0.05\), the likelihood ratio test (p-value = 0.0579) indicates that the gender effect is not statistically significant, though only marginally so.
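Similarly, a sketch of how these two likelihood ratio tests could be produced, again assuming VGAM’s lrtest() and the data frame df from above (fit names illustrative):

```r
# Drop each main effect in turn and compare to the full main-effects model
fit.main <- vglm(as.matrix(df[, 3:6]) ~ alt.trt + fem,
                 family = cumulative(parallel = TRUE), data = df)
fit.fem  <- vglm(as.matrix(df[, 3:6]) ~ fem,
                 family = cumulative(parallel = TRUE), data = df)
fit.trt  <- vglm(as.matrix(df[, 3:6]) ~ alt.trt,
                 family = cumulative(parallel = TRUE), data = df)
lrtest(fit.main, fit.fem)  # tests the alt.trt (therapy) effect
lrtest(fit.main, fit.trt)  # tests the fem (gender) effect
```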